Solution | Real-World Data: Cohort Discovery#

Overview#

Business Case#

Real-world data (RWD) refers to health-related information collected outside of controlled clinical trials, often gathered from electronic health records, insurance claims, and patient registries. Using RWD to assess real-world evidence (RWE) is one of the top priority applications for AI initiatives in healthcare (payer/provider/public or federal health systems) and life science companies.

However, even within a single healthcare organization, a complex ETL process is needed to collect, normalize, and harmonize patient data from heterogeneous sources (e.g., patient registries, electronic health record systems, insurance claims). A few global common data models provide a framework to standardize these complex ETL processes.

This solution provides a centralized repository to store and manage cohorts (clinical electronic phenotyping) from real-world data (e.g., electronic health records, medical claims data), adopting the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). It turns complex cohort SQL scripts into generalizable and reusable clinical electronic phenotyping for future advanced analytics. It also offers a quick dashboard to review the descriptive statistics and some of the clinical characterizations of a given cohort.

Key beneficiaries include:

  • Biomedical informaticists who ingest and manage cohorts (clinical electronic phenotyping).

  • Clinical researchers who review and validate cohorts.

  • Epidemiologists and health outcomes researchers who derive insights from the cohorts and extend their use for further advanced statistical or machine learning outcomes analysis (for example, in real-world evidence studies).

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Model.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 13.2+ instance.

  • Databricks or Snowflake connection.

Data Model#

This solution requires that the database conform to OMOP CDM v5.3 or a later version.

We recommend that users conduct data preparation and feature engineering steps in a separate project to match the expected format of the input datasets. In most situations, date and datetime values require parsing via a Prepare recipe on the Dataiku instance.

The solution translates the OMOP CDM schema into a Dataiku instance-compatible schema, which can be found in the solution library repository (python/solution/utils/cdm_schema.py).
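For illustration, such a translation can be sketched as a table-to-columns dictionary. The structure below is a hypothetical, abridged stand-in for the actual contents of cdm_schema.py, which may differ:

```python
# Hypothetical sketch of an OMOP-to-Dataiku schema mapping, in the spirit of
# the solution's cdm_schema.py (abridged; the real file may differ).
CDM_SCHEMA = {
    "person": [
        {"name": "person_id", "type": "bigint"},
        {"name": "gender_concept_id", "type": "bigint"},
        {"name": "year_of_birth", "type": "int"},
    ],
    "death": [
        {"name": "person_id", "type": "bigint"},
        {"name": "death_date", "type": "date"},
    ],
}

def dataiku_schema(table_name):
    """Return a Dataiku-style column list for an OMOP table."""
    return {"columns": CDM_SCHEMA[table_name]}

print(dataiku_schema("death"))
```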

This solution requires a minimum subset of OMOP CDM tables:

  • person

  • observation_period

  • visit_occurrence

  • condition_occurrence

  • drug_exposure

  • death

  • location

  • condition_era

  • concept

  • concept_ancestor
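As a pre-flight sketch, you can check a source project's table list against this required subset before running the setup. The helper below is illustrative, not part of the solution:

```python
# Minimal sketch: verify that a source project exposes all OMOP tables the
# solution needs (table list taken from this document).
REQUIRED_OMOP_TABLES = {
    "person", "observation_period", "visit_occurrence", "condition_occurrence",
    "drug_exposure", "death", "location", "condition_era",
    "concept", "concept_ancestor",
}

def missing_tables(available):
    """Return required OMOP tables absent from the available table names."""
    return sorted(REQUIRED_OMOP_TABLES - {t.lower() for t in available})

# Example: a source project missing two of the required tables.
print(missing_tables(["person", "observation_period", "visit_occurrence",
                      "condition_occurrence", "drug_exposure", "death",
                      "location", "condition_era"]))
# → ['concept', 'concept_ancestor']
```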

See also

Please review the solution wiki [Data Model] for more information regarding the data model and required files.

Required Files#

Please read the solution wiki [Required Files] for more details about the OMOP CDM table name mapping file.

OMOP CDM Table Name Mapping (Optional)#

This solution requires an OMOP table name mapping JSON file if the cohort script(s) use custom table names other than the standard OMOP table names. This solution pre-packages a mapping JSON for cohort scripts exported from the Atlas tool.

Cohort SQL Scripts#

The solution also requires one or more SQL scripts to construct “cohorts”. A cohort SQL script is valid as long as it follows the OMOP CDM naming conventions. In OMOP CDM, cohort is sometimes used interchangeably with clinical phenotype. Therefore, a cohort SQL script can produce a simple cohort representing all anti-hypertensive prescriptions.

A more complicated cohort SQL script can represent eligibility criteria for a clinical research question like a cohort with hypertension on first-time monotherapy of angiotensin-converting enzyme inhibitors (ACEIs).
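As an illustration only, a simple cohort script following OMOP naming conventions might select first ACEI exposures via concept_ancestor. The logic and the placeholder concept ID below are hypothetical examples, not part of the solution; the SQL is held in a Python string for convenience:

```python
# Illustrative cohort SQL following OMOP CDM naming conventions. The logic and
# the placeholder concept ID are hypothetical, not taken from the solution.
ACEI_CLASS_CONCEPT_ID = 1234567  # placeholder: substitute the real ACEI class concept_id

COHORT_SQL = f"""
SELECT
    de.person_id                     AS subject_id,
    MIN(de.drug_exposure_start_date) AS cohort_start_date,
    MIN(de.drug_exposure_end_date)   AS cohort_end_date
FROM drug_exposure de
JOIN concept_ancestor ca
  ON ca.descendant_concept_id = de.drug_concept_id
WHERE ca.ancestor_concept_id = {ACEI_CLASS_CONCEPT_ID}
GROUP BY de.person_id
"""
```

Because the script references only standard OMOP table names (drug_exposure, concept_ancestor), no custom table name mapping file would be needed for it.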

Users can also facilitate cohort construction with the accessible tool Atlas, which provides a no-code user interface for building and exporting cohort SQL scripts in the language of choice.

Cohort Metadata#

The cohort metadata file provides a list of cohorts and metadata to indicate which cohorts should be uploaded or revised to the cohort table and cohort_definition OMOP table.

Workflow Overview#

You can follow the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Establish an OMOP data ETL pipeline.

  2. Upload cohort SQL scripts and cohort metadata for ingestion.

  3. Select a cohort from the OMOP tables cohort and cohort_definition for visualization.

Walkthrough#

This solution has two main components: project setup and dashboard. The project setup configures the cohort ingestion pipeline. The dashboard provides a quick visualization of cohort statistics and characteristics.

Note

In addition to reading this document, we recommend reading the project wiki before beginning, to gain a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Project Setup#

The project setup consists of three components: pipeline configuration, cohort ingestion, and dashboard creation.

Pipeline configuration#

This part consists of a one-time configuration to establish the data ETL pipeline. Users should select the connections for the project, connect OMOP CDM tables from the source project, and provide an OMOP custom table name mapping file if needed.

Connection Configuration#

The solution defaults to the filesystem connection once installed on the Dataiku instance. During project setup, users must switch to one of the connections this solution supports.

Important

The solution requires an SQL connection for the datasets and another for the project folders. The SQL connection must be one of the two supported connections (Snowflake or Databricks).

Dataiku screenshot of the connection configuration.
Connect OMOP Common Data Model Standard Tables#

The solution requires a minimal subset of OMOP CDM source datasets as inputs. First, select the source project where users have all the required OMOP CDM datasets. Second, select all the OMOP tables required for the cohort SQL scripts for the solution to run correctly.

Important

The source input datasets must use the same SQL connection as selected in the project setup and must respect the OMOP CDM v5.3 schema. The following OMOP tables are mandatory for the solution: person, observation_period, visit_occurrence, condition_occurrence, drug_exposure, death, location, condition_era, concept, concept_ancestor.

Screenshot of importing OMOP tables.
OMOP CDM Custom Table Name Mapping (Optional)#

This solution requires an OMOP custom table name mapping JSON file if users’ cohort scripts use custom table names other than the standard OMOP table names. This solution pre-packages a mapping JSON for cohort scripts exported from the Atlas tool. Skip this step if cohort SQL scripts follow OMOP naming conventions.

  • First, upload a JSON file of key-value pairs mapping table names from OMOP CDM v5.3. Skip this step if no custom mapping is required.

  • Then, indicate the filename to be used. The field is empty by default. Skip this step if cohort SQL scripts use the standard OMOP naming conventions. Fill in omop_cdm_atlas if users use the cohort scripts exported from the Atlas tool.
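As an illustration, assuming the mapping file holds standard-to-custom name pairs (an assumption; see the solution wiki [Required Files] for the actual format), applying it to a cohort script might look like:

```python
import json
import re

# Illustrative sketch: apply a custom table-name mapping (assumed here to be
# standard OMOP name -> custom name) to a cohort SQL script. The file format
# and key direction are assumptions, not the solution's actual implementation.
mapping_json = '{"person": "my_person_tbl", "drug_exposure": "rx_claims"}'
mapping = json.loads(mapping_json)

def remap_tables(sql, table_map):
    """Replace standard OMOP table names in a SQL script with custom names."""
    for standard, custom in table_map.items():
        # \b keeps person_id intact while remapping the bare table name.
        sql = re.sub(rf"\b{standard}\b", custom, sql, flags=re.IGNORECASE)
    return sql

sql = "SELECT person_id FROM person JOIN drug_exposure USING (person_id)"
print(remap_tables(sql, mapping))
# → SELECT person_id FROM my_person_tbl JOIN rx_claims USING (person_id)
```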

Screenshot of OMOP custom mappings

Cohort Ingestion#

Once the pipeline configuration is completed, users can start ingesting cohorts with their cohort SQL scripts. This process is repeatable. However, the two steps Upload Cohort SQL Scripts & Cohort Metadata and Write Cohort must be executed in sequence for each ingestion.

Upload Cohort SQL Scripts & Cohort Metadata#

The solution requires cohort SQL scripts and cohort metadata to write data into the OMOP tables cohort and cohort_definition. The cohort SQL scripts define the SQL recipe for writing cohorts into the two OMOP tables, whereas the cohort metadata indicates which cohort(s) are to be batch-processed.

First, upload and store one or more SQL scripts. Second, upload a cohort metadata file that lists the cohort(s) to be batch-processed. It must contain four columns: cohort_definition_id, cohort_definition_name, cohort_definition_description, cohort_sql_script_filename.

Please read solution wiki [Required Files] for information on the required schema.
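Assuming the metadata file is CSV-like (an assumption; see the solution wiki [Required Files] for the actual format), a minimal header check for the four required columns might look like:

```python
import csv
import io

# Minimal sketch: check that a cohort metadata file carries the four columns
# this document requires before uploading it. CSV format is an assumption.
REQUIRED_COLUMNS = [
    "cohort_definition_id",
    "cohort_definition_name",
    "cohort_definition_description",
    "cohort_sql_script_filename",
]

def validate_metadata(csv_text):
    """Return the required columns missing from the metadata header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [c for c in REQUIRED_COLUMNS if c not in header]

sample = ("cohort_definition_id,cohort_definition_name,"
          "cohort_definition_description,cohort_sql_script_filename\n"
          "1,htn_acei_monotherapy,First-time ACEI monotherapy,htn_acei.sql\n")
print(validate_metadata(sample))  # → [] (all required columns present)
```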

Screenshot of cohort scripts.
Write Cohort#

Once the cohort scripts and the cohort metadata are in place, users can write the cohort(s) into the OMOP tables cohort and cohort_definition.

Dataiku screenshot of writing cohorts.

Create Cohort Dashboard#

The cohort dashboard gives a quick review of the results from a cohort query to facilitate cohort validation. Users can regenerate the dashboard in this step.

Dataiku screenshot of creating the dashboard.

Dashboard: Cohort Discovery Insights#

In OMOP, a cohort can represent an electronic clinical phenotype. Therefore, a patient can meet the cohort criteria several times in a given observation period and thus be counted multiple times in a cohort.

Cohort Descriptive Statistics#

The first part of the dashboard provides general statistics of a selected cohort: incidence and prevalence, demographics, and disease burden.

Dataiku screenshot of cohort stats 1. Dataiku screenshot of cohort stats 2.

The first part of the slide displays the descriptive statistics on the patients who met the cohort eligibility criteria.

  • Occurrence: The number of times patients meet the specified criteria to enter a cohort.

  • Distinct patient count: The number of unique patients who have ever entered a cohort.

  • Prevalence: The proportion of unique patients in a cohort relative to the observed population during a specific period (%).

  • Incidence Rate: The ratio of new cases in an at-risk population over the observation period (new cases per 1,000 person-years).
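For concreteness, the two rate metrics above can be sketched with made-up numbers. The helper names below are illustrative; the solution computes these figures in the dashboard:

```python
# Worked example of the two rate metrics defined above, with made-up numbers.
def prevalence_pct(distinct_patients, observed_population):
    """Proportion of unique cohort patients in the observed population (%)."""
    return 100.0 * distinct_patients / observed_population

def incidence_rate_per_1000py(new_cases, person_years_at_risk):
    """New cases per 1,000 person-years in the at-risk population."""
    return 1000.0 * new_cases / person_years_at_risk

print(prevalence_pct(250, 10_000))           # → 2.5 (%)
print(incidence_rate_per_1000py(40, 8_000))  # → 5.0 per 1,000 person-years
```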

Dataiku screenshot of cohort demography.

The second part compares statistics on demographic variables (age, sex, race) between the cohort and a control group (eligible patients in the rest of the population).

Dataiku screenshot of cohort disease burden and observations.

The last part includes the disease burden index and cohort observations.

  • Charlson Comorbidity Index: Predicts mortality for a patient who may have a range of concurrent conditions, such as heart disease, AIDS, or cancer. The higher the score, the higher the predicted mortality rate.

  • Cohort Duration: The number of days between the cohort start and end dates. It represents the duration when a patient meets the eligibility criteria and can also be described as “time-at-risk”.

  • Prior Observation Time: The number of days between the patient observation start date and the cohort start date. It represents the time before a patient entered the cohort.

  • Follow-up Time: The number of days between the cohort start date and the patient observation end date. It represents the duration from when the patient enters the cohort until the end of the observation.
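The three duration metrics above reduce to simple date arithmetic. A minimal sketch for a single patient, with made-up dates:

```python
from datetime import date

# Sketch of the three time metrics for one patient, using made-up cohort and
# observation-period dates.
obs_start    = date(2018, 1, 1)
cohort_start = date(2019, 6, 1)
cohort_end   = date(2019, 9, 1)
obs_end      = date(2020, 1, 1)

cohort_duration   = (cohort_end - cohort_start).days  # time-at-risk
prior_observation = (cohort_start - obs_start).days   # time before cohort entry
follow_up_time    = (obs_end - cohort_start).days     # cohort entry to obs end

print(cohort_duration, prior_observation, follow_up_time)  # → 92 516 214
```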

Cohort Covariates#

The second slide describes the distribution of three predefined clinical covariates (clinical condition groups, drug groups, and clinical visits) from the Atlas tool.

  • Prevalence is the percentage of patients in the cohort who have at least one prescription of a given drug group within one year before the cohort start date.

Dataiku screenshot of dashboard on conditions groups.
  • Condition group covariates include the condition concept groups represented by SNOMED.

Dataiku screenshot of dashboard on drug groups.
  • Drug group covariates include drug concept groups represented by WHO ATC.

Dataiku screenshot of dashboard on clinical visits.
  • Clinical visit covariates: The OMOP concepts define the clinical visit type. The pivot table describes the temporal relationships between clinical utilization and the cohort.

Project Output Datasets#

This solution creates several output datasets that other projects can share for further analysis:

  • OMOP results schema table cohort and cohort_definition

  • Cohort feature tables cohort_demographics, cohort_visit_events, cohort_condition_group_events, cohort_drug_group_events

Conclusion#

The project setup provides a no-code user interface to configure complex cohort ingesting pipelines. This pipeline creates centralized storage for cohort scripts, cohorts, and metadata, which is sharable and reusable across different projects.

Once the pipeline is configured, users can grow their cohort repository over time by ingesting cohorts. The cohort dashboard gives a quick review of the results from a cohort query to facilitate cohort validation.

Responsible AI Statement#

In developing and deploying solutions like RWD Cohort Discovery in healthcare, several concerns related to responsible AI should be addressed to ensure fairness, transparency, and accountability. Below are some key ethical considerations and potential biases to be mindful of concerning the creation and use of patient cohorts based on observational health data coming from medical systems or longitudinal patient insurance claims.

Bias in Input Data#

Demographic Bias#

If the input patient data or the created cohort based on clinical criteria over-represent certain demographics (e.g., age, gender, race, or location), it can lead to biased cohort insights. For example, if data skews towards urban areas, the solution may not accurately capture the observational health outcomes in rural regions.

Important

Ensure data represents diverse patient social factors, health systems, and demographics. Regularly review and audit datasets created in cohorts to detect any demographic imbalances that could lead to biased or inaccurate insights derived from real-world patient data.

Socioeconomic Bias#

Data on patient populations may inadvertently favor wealthier areas or practices due to inequity in healthcare access or social imbalances around seeking care or reimbursements for care, leading to bias against those serving lower-income communities.

Important

Balance datasets and evaluate patients in a cohort by including data from various economic and social factor strata and regions to ensure equitable representation.

Data Quality and Source Bias#

Input data may come from various sources (e.g., multiple claims, EMR systems, or syndicated data providers), each with its own biases and quality. Potentially duplicated patient records from multiple sources could also bias estimates of cohort incidence or prevalence.

Important

Consider the limitations of each data source. Use techniques like data augmentation, bias correction, quality metrics, and checks to ensure that the quality of possibly disparate data sources does not lead to biased patient population cohorts for further analysis.

Moreover, patient cohorts created from this solution should be used to promote and prioritize unbiased and accurate insights from observational patient health signals. They should promote research to develop programs that improve patient outcomes and therapeutic or health access and journey, instead of reinforcing disparities or biases in healthcare.

Further models built in real-world evidence (RWE) studies should be evaluated with a rigorous, responsible AI ethics process to ensure that biases are not propagated, that all subpopulations are considered, that the observational nature of the data is accounted for (through methods like propensity matching and causal analysis) to avoid confounding, and that model interpretability and explainability are in place.

Caution and Consideration of Sample Solution Data#

As a reminder, the synthetic data sources used in the example application of this solution do not reflect, in any way, real distributions of patient or disease characterizations. No insights or assumptions about observational health outcome patterns should be made from the example insights derived from the patient cohorts. They should not be used in any further downstream business decision processes.

Please refer to the Centers for Medicare and Medicaid Services (CMS) Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) for further details.

Tip

We encourage users to check out Dataiku’s Responsible AI course to learn more.

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this project is to enable biomedical informaticists or clinical researchers to understand how Dataiku can be used to create a centralized repository to store and manage cohorts (clinical electronic phenotyping) from real-world data.

We have provided several suggestions on how to use this solution to turn complex cohort SQL scripts into generalizable and reusable clinical electronic phenotyping for future patient data analytics. However, the best approach will ultimately depend on your specific needs and the data of interest. If you are interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.