Building a data infrastructure for AI-centered biomedical research at health-system level
- Pablo Guerrero - D4L data4life gGmbH, Potsdam, Germany
- Alexa Straus - D4L data4life gGmbH, Potsdam, Germany
- Rasheed Aadil - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Esther-Maria Antao - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Lothar Wieler - Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Thomas Fuchs - Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, United States
Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 1021
Introduction: State-of-the-art biomedical research requires substantial preparatory work because data are siloed and non-standardized and tools for sharing machine-learning (ML) models are lacking [1]. We describe the AI-ready Mount Sinai (AIR·MS) project, established to simplify data access, exploration, and selection, and to facilitate AI-driven research. It is a collaboration of HPI·MS (USA), Data4Life, and HPI (Germany).
State of the art: Existing healthcare data ecosystems typically extract data from clinical systems and store it in a common data model (e.g., OMOP CDM, i2b2). Researchers are provided with self-service access and support services, as well as a mix of proprietary and open-source tools (e.g., i2b2 tranSMART, JupyterHub, Atlas) for data exploration and analysis. Both public cloud and on-premises infrastructure are used [2].
Concept: The AIR·MS platform integrates different clinical datasets from the Mount Sinai Health System and provides a unified, secure environment to access, combine, and analyze the data.
Implementation: Data is extracted from clinical sources within the Mount Sinai Health System using Apache NiFi and linked through patients’ medical record numbers. Electronic Health Record (EHR) data is extracted from Epic, including both structured data (e.g., diagnoses, procedures, laboratory results, medications) and unstructured data (e.g., clinical notes). Subsequently, the data is structured following the OMOP Common Data Model. Metadata from the pathology laboratory systems and phenotype data from the BioMe BioBank program are available, while integration of radiology and EEG data is in progress. Data quality is assessed using the OHDSI Data Quality Dashboard.
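The linkage step can be illustrated with a minimal Python sketch. It is not the AIR·MS pipeline (which uses Apache NiFi); the function name `link_by_mrn`, the field name `mrn`, and all record values are hypothetical and serve only to show how records from separate source systems can be grouped under a shared medical record number.

```python
from collections import defaultdict


def link_by_mrn(*sources):
    """Group records from several source systems under their shared MRN.

    Each source is a (source_name, records) pair; every record is assumed
    to carry an "mrn" field (a hypothetical field name).
    """
    linked = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            linked[record["mrn"]].setdefault(source_name, []).append(record)
    return dict(linked)


# Hypothetical toy records standing in for EHR and pathology extracts.
ehr = [
    {"mrn": "A001", "diagnosis": "K50.0"},
    {"mrn": "A002", "diagnosis": "N94.6"},
]
pathology = [{"mrn": "A001", "slide_count": 4}]

linked = link_by_mrn(("ehr", ehr), ("pathology", pathology))
print(sorted(linked["A001"]))  # source systems that hold data for this MRN
```

A patient present in only one source simply has a single entry, so downstream steps can see which modalities exist per patient.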
Currently, AIR·MS holds identifiable and de-identified EHR data from ~10M patients, ~180M clinical notes, pathology data from ~1M patients and phenotype data from ~55K donors.
Researchers can access all data modalities through one regulatory workflow integrated into Mount Sinai’s identity management solution (SailPoint). Cohort creation is supported by storing the data in SAP HANA, a relational in-memory database, together with corresponding data access and query tools. With a cohort at hand, researchers build, train, and manage ML models, e.g., on a cloud-based platform (Microsoft Azure). In individually provisioned workspaces, researchers can run their workloads or submit larger jobs to a cluster queue.
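As a rough illustration of cohort creation on OMOP-structured data, the following Python sketch runs a cohort query against an in-memory SQLite database, used here only as a stand-in for SAP HANA. The table and column names follow the OMOP CDM (`person`, `condition_occurrence`); the concept ID, the birth-year criterion, and the toy rows are assumptions for the example and are not taken from AIR·MS.

```python
import sqlite3

# Illustrative concept ID only; verify against the OMOP vocabulary in use.
CONDITION_CONCEPT_ID = 201606


def build_cohort(conn, concept_id, born_before):
    """Return sorted person_ids with at least one matching condition record."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.person_id
        FROM person AS p
        JOIN condition_occurrence AS c ON c.person_id = p.person_id
        WHERE c.condition_concept_id = ?
          AND p.year_of_birth < ?
        ORDER BY p.person_id
        """,
        (concept_id, born_before),
    ).fetchall()
    return [row[0] for row in rows]


def make_demo_db():
    """Create a tiny in-memory database with a subset of OMOP columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, 1980), (2, 1995), (3, 1970)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [
            (10, 1, 201606, "2020-01-05"),
            (11, 1, 201606, "2021-03-01"),  # second record, same person
            (12, 2, 201606, "2022-07-09"),  # excluded by birth year
            (13, 3, 999, "2019-02-02"),     # different condition
        ],
    )
    return conn


cohort = build_cohort(make_demo_db(), CONDITION_CONCEPT_ID, born_before=1990)
print(cohort)
```

The resulting list of person identifiers is what a researcher would carry forward into model training; in practice the same query shape would be issued through the platform's own query tools.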
Several measures ensure data privacy and security, enabling researchers from the US and Germany to work collaboratively on the data. Projects include dysmenorrhea [3] and Crohn’s disease studies [4], pan-cancer prediction of treatment outcomes [5], and federated learning to improve risk prediction for cardiovascular diseases.
Lessons learned: An unsystematic ingestion of all health data is cost-prohibitive, especially for raw data such as whole-genome sequences or radiology exams. In AIR·MS, a frequent design tradeoff has been to begin with metadata (typically phenotype information) that is sufficient to enable cohort building. Further levels of detail are added as use cases justify them.
Researchers often require access to data on recent practices or emerging diseases. In AIR·MS, an automated monthly “delta update” provides fresh health data.
Adoption by researchers hinges on their ability to quickly understand the infrastructure and onboard onto it. In AIR·MS, a first step in this direction is a new documentation portal tailored to entry-level users.
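The abstract does not describe how the delta update is implemented. One common pattern for such incremental loads is a watermark-based upsert, sketched below in Python under that assumption; the function `apply_delta`, the fields `record_id` and `updated_on`, and the sample rows are all hypothetical.

```python
from datetime import date


def apply_delta(target, source_rows, last_sync):
    """Upsert rows changed after last_sync into target.

    target maps record_id -> row. Returns the new watermark, i.e. the
    latest updated_on date that was applied (or last_sync if none).
    """
    watermark = last_sync
    for row in source_rows:
        if row["updated_on"] > last_sync:
            target[row["record_id"]] = row  # insert new or overwrite stale row
            watermark = max(watermark, row["updated_on"])
    return watermark


# Hypothetical state of the research store after the previous monthly run.
store = {
    1: {"record_id": 1, "value": "old", "updated_on": date(2024, 1, 10)},
    3: {"record_id": 3, "value": "kept", "updated_on": date(2024, 1, 5)},
}

# Hypothetical current state of the clinical source.
source = [
    {"record_id": 1, "value": "revised", "updated_on": date(2024, 2, 3)},
    {"record_id": 2, "value": "new", "updated_on": date(2024, 2, 15)},
    {"record_id": 3, "value": "kept", "updated_on": date(2024, 1, 5)},
]

new_watermark = apply_delta(store, source, last_sync=date(2024, 1, 31))
print(sorted(store), new_watermark)
```

Only rows modified since the previous run are touched, which keeps each monthly run proportional to the volume of changes rather than to the full dataset.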
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
- 1. Athanasopoulou K, Daneva GN, Adamopoulos PG, Scorilas A. Artificial intelligence: the milestone in modern biomedical research. BioMedInformatics. 2022 Dec 17;2(4):727-44.
- 2. Callahan A, Ashley E, Datta S, Desai P, Ferris TA, Fries JA, Halaas M, Langlotz CP, Mackey S, Posada JD, Pfeffer MA. The Stanford Medicine data science ecosystem for clinical and translational research. JAMIA Open. 2023 Oct 1;6(3):ooad054.
- 3. Alleva E, Shaw LJ, Bottinger EP, Kahlil S, Rodrigues J, Ensari I. Risk of Ischemic Heart Disease in Young Women With Dysmenorrhea: A Cross-Sectional Study on Electronic Health Records. Circulation. 2023 Nov 7;148(Suppl_1):A12874.
- 4. Schmidt L, Ibing S, Borchert F, Hugo J, Marshall A, Peraza J, Cho JH, Bottinger EP, Ungaro RC. Extraction of Crohn's Disease Clinical Phenotypes from Clinical Text Using Natural Language Processing [Preprint]. medRxiv. 2023 Oct 16. DOI: 10.1101/2023.10.16.23297099
- 5. PreCareML: Predicting Cardiovascular Events Using Machine Learning [Internet]. GitHub; 2022 [cited 2024 Apr 29]. Available from: https://precareml.github.io/