Social Determinants of Health


Business Case

Social determinants of health (SDoH) are conditions in the places where people live, learn, work, and play that affect a wide range of health and quality-of-life risks and outcomes. Research has shown that SDoH can account for up to 90% of health outcomes, whereas medical care accounts for only 10%-15%. Understanding the social factors associated with chronic disease prevalence not only aligns with social responsibility programs, but can also deliver ROI with improved patient outcomes by:

  • Identifying resources, therapeutics, and interventions for populations incorporating both social and disease risk vulnerabilities

  • Developing responsible patient-centric risk-adjusted payment or care models to ensure health equity

  • Impacting operational/spending/quality metrics for both precision preventative care and therapeutic access equity.

Hospitals, public and private health services systems, health insurers, government agencies, and pharmaceutical and medical device companies are all increasingly tasked with using population and community health insights about the social vulnerabilities tied to disease prevalence to inform business practices that address disparities in health, disease, and therapeutic access. With this solution, healthcare and life science professionals can accelerate the discovery of how SDoH disparities affect at-risk populations, enabling refined market access strategies for drug manufacturers, new coverage policies from payers, and improved facility outreach and care programs from health services.

Technical Requirements

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 11+ instance

  • Generate an API key to access the Census data through its API service

  • A Python 3.6 code environment named solution_sdoh with the following required packages:


The downloadable version uses filesystem-managed datasets and the built-in Dataiku engine as the only processing engine. Performance could be greatly improved by changing all the connections to Snowflake connections.


If the technical requirements are met, this solution can be installed in one of two ways:

  • On your Dataiku instance, click + New Project > Industry solutions, then search for Social Determinants of Health.

  • Download the .zip project file and upload it directly to your Dataiku instance as a new project.

Data Requirements


This solution uses data pulled via the Census Bureau Data API, CDC Data API, and Socrata Open Data API, but is not endorsed or certified by these organizations. By using this solution, you agree to abide by the terms set forth by these data sources.

In this solution, data is called from the relevant API interface through live endpoints. There are two sets of data:

Workflow Overview

You can follow the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow Zones.

The project has the following high-level steps:

  1. Ingest publicly available data.

  2. Prepare and clean data for analysis.

  3. Apply regression analysis to better understand how social factors are associated with rates of chronic diseases.

  4. Use clustering analysis for insights about areas with undetected/prevalent diseases.

  5. Build and explore solution outputs via easy-to-use Dashboards.

  6. Apply rigorous responsible AI ethics for future modeling approaches.



In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this solution was created and longer explanations of solution-specific vocabulary.

Building the Full Flow Made Simple

To make this Solution easier to use, it comes with three pre-built Dashboards. The first, User Manual - Solution Build, provides instructions and two scenarios that recursively build all the solution components, from data ingestion through processing, regression, segmentation, and visualization. Before using this Dashboard, you must enter your generated API key in the project variables and, optionally, change the dataset connections and engine to your desired infrastructure. Once the two Run scenarios are completed, all webapps and graphs in the other two Dashboards are updated. Because the input data for this Solution is only updated yearly, these scenarios need to be run only once a year.

Screenshot of the User Manual Dashboard, which makes running and updating the full Solution simple.
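To make the role of the API key more concrete, here is a minimal sketch of how a key stored in a variable could end up in a Census Data API request. It only builds the request URL and makes no network call. The function name `build_census_url`, the `project_variables` dictionary, and the variable name `census_api_key` are hypothetical and are not taken from the solution; the dataset (`acs/acs5`) and the variable `B01001_001E` (total population) are illustrative choices, not necessarily the ones the solution requests.

```python
from urllib.parse import urlencode

CENSUS_BASE_URL = "https://api.census.gov/data"


def build_census_url(api_key, year, dataset, variables, state_fips):
    """Return a Census Data API URL requesting `variables` for every tract in one state."""
    if not api_key:
        raise ValueError("A Census API key is required")
    params = {
        "get": ",".join(variables),
        "for": "tract:*",
        "in": f"state:{state_fips}",
        "key": api_key,
    }
    # Leave ':', ',' and '*' unescaped so the query stays readable.
    return f"{CENSUS_BASE_URL}/{year}/{dataset}?{urlencode(params, safe=':,*')}"


# Hypothetical stand-in for the project variables; the real variable name may differ.
project_variables = {"census_api_key": "YOUR_KEY_HERE"}

url = build_census_url(
    project_variables["census_api_key"],
    2020,
    "acs/acs5",
    ["NAME", "B01001_001E"],
    "06",  # state FIPS code, here California
)
print(url)
```

Keeping the key in a variable rather than in the code means it can be changed in one place before the build scenarios are run.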

Discover US Community Patterns of Chronic Disease Prevalence and Social Vulnerability

Running the first scenario in the User Manual will build all of the necessary flow zones to support the SDoH Analytics and Tract Segmentation Dashboard. Specifically, this scenario will build the Census Data - SVI factors, CDC Data, and Metadata flow zones to call our input data. With our input data successfully ingested, the data preprocessing and feature generation flow zone will be run next to prepare our data and score it using the pretrained clustering model in the Segmentation Analysis flow zone. Pre-trained clustering and regression models are provided with this Solution but can always be retrained with adjusted training parameters if desired. Individual component scenarios are provided to retrain models, activate new model versions, and update the dashboards individually. Upon completion of the scoring, the relevant charts are rebuilt, and the SDoH Webapp is restarted.

Screenshot of the SDoH Analytics and Tract Segmentation Dashboard Webapp

Three tabs are built into the SDoH Analytics and Tract Segmentation Dashboard. The first tab, Chronic Disease and Social Vulnerability Regional Exploration, offers an interactive webapp with three components: a US map outlining Census counties (or tracts, based on individual county regional selection) colored by the selected chronic disease prevalence; a scatter plot of county- or tract-level social vulnerability theme rankings vs. disease prevalence rankings; and a table of individual tracts. Selecting counties or tracts (depending on filter selection) within the scatter plot via a box or lasso select dynamically displays the tracts belonging to that selection in the table below.
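The scatter plot compares rankings rather than raw values. As a rough illustration of what such a ranking can look like, the sketch below converts raw values into percentile ranks between 0 and 1 using one common convention, (rank - 1) / (N - 1), with tied values sharing the lowest rank. This is an assumption for illustration; the solution's own ranking logic is not described here and may differ. The function name and the sample values are hypothetical.

```python
def percentile_ranks(values):
    """Map each value to (rank - 1) / (N - 1); ties share the lowest rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    # Position of the first occurrence of each value in sorted order equals rank - 1.
    first_position = {}
    for position, value in enumerate(sorted(values)):
        first_position.setdefault(value, position)
    return [first_position[value] / (n - 1) for value in values]


# Made-up prevalence values (percent) for four tracts.
prevalence = [10.0, 30.0, 20.0, 30.0]
ranks = percentile_ranks(prevalence)
print(ranks)
```

Ranking both axes this way puts social vulnerability and disease prevalence on the same 0-to-1 scale, which is what makes a rank-vs-rank scatter plot comparable across diseases.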

Screenshot of the Tract Segments and Disease tab

The second tab, Census Tract Segmentation, uses various model explainability visualizations to help us better understand the ML-driven tract segmentation, which is based solely on Social Vulnerability percentile values. The final tab, Tract Segments and Disease, shows how the distribution of tracts by segment corresponds to the prevalence of each disease.
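To give a feel for segmentation based only on percentile values, here is a small k-means sketch in plain Python. k-means is swapped in purely as an illustration; this document does not state which clustering algorithm the solution uses. The function name, the two-dimensional points, and the choice of two segments are all hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Made-up tracts described by two vulnerability percentiles each.
tracts = [(0.10, 0.20), (0.90, 0.80), (0.15, 0.10), (0.85, 0.95)]
labels, centroids = kmeans(tracts, k=2)
print(labels)
```

In this toy example the low-vulnerability and high-vulnerability tracts end up in separate segments; real segmentations would use more themes, more tracts, and a more careful initialization.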

Visualize Associations with Social Vulnerability Factors Across Areas and Populations

After the first dashboard has been built (that first Scenario MUST be run before the second), we can trigger the second scenario, which will build all supporting components for the Chronic Disease Prevalence Modeling Dashboard. This scenario builds the Data Analytics/Modeling Flow Zone to score data for each disease by the pretrained regression model, builds remaining visualizations, and restarts the Disease Selection Webapp.

Screenshot of the Chronic Disease Prevalence Modeling Dashboard

Two tabs are provided with this Dashboard. The Disease Prevalence Model Summary tab includes a standard WebApp where we can select and save a disease for analysis before pressing the Run button to the right. This button will trigger a scenario that activates the Regression Model corresponding to the selected disease and updates the tab with that model’s Summary and Individual Explanations through interactive explainability charts. The Census Tract SHapley Additive exPlanations (SHAP) tab contains two charts that provide insights on how community social factors at a tract level impact that tract’s disease prevalence prediction. Filters can be used to refine the scope of the visualizations.
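For intuition on what a SHAP chart shows, consider the simplest case: a linear model with independent features, where each feature's SHAP value is its weight times the feature's distance from the average. The sketch below demonstrates the additive property, namely that the contributions sum to the gap between a tract's prediction and the average prediction. The model type used by the solution is not stated here, so this linear form is an assumption; the feature names, weights, and values are made up.

```python
def predict(intercept, weights, x):
    """Linear model prediction: intercept + sum of weight * feature value."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(weights, x, feature_means):
    """Per-feature contributions for a linear model, assuming independent features."""
    return [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]


# Illustrative numbers only; not taken from the solution's models or data.
feature_names = ["socioeconomic", "household", "minority_status", "housing_transport"]
intercept = 5.0
weights = [4.0, 1.5, -0.5, 2.0]
feature_means = [0.5, 0.5, 0.5, 0.5]  # average percentile across all tracts
tract = [0.9, 0.4, 0.5, 0.7]          # one tract's percentile values

baseline = predict(intercept, weights, feature_means)
prediction = predict(intercept, weights, tract)
contributions = linear_shap(weights, tract, feature_means)

for name, value in zip(feature_names, contributions):
    print(f"{name:>18}: {value:+.2f}")
print(f"baseline {baseline:.2f} + contributions = {baseline + sum(contributions):.2f}")
```

A positive contribution pushes the tract's predicted prevalence above the average and a negative one pulls it below. These values describe the model's prediction for a community, not a causal effect on any individual.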

Responsible AI Statement

This solution uses both analytics and ML-driven insights to help drive an understanding of how patterns of social factors that characterize potentially vulnerable populations associate with chronic disease prevalence at regional population levels. Care should always be taken to ensure data considerations are taken into account in any interpretations.

This is community-level survey data and should not be used to support misleading attributions about how an individual person’s socioeconomic status, minority/ethnic background, and household situation predict or inform potential disease occurrence or outcomes. Self-reported survey data is particularly subject to recall, social desirability, and non-response bias. Any decisions or actions driven by this analysis must consider these limitations, which may influence the distribution of the data.

Moreover, the disease associations relating to regional community-level characteristics should be used to promote and prioritize health equity and therapeutic access, as opposed to reinforcing or deepening disparities or biases in the health and life sciences systems where the solution is deployed. This solution can (and should) be extended to include additional data such as HCP or pharmacy geolocation information as well as individual-level (de-identified) personal patient behavioral and clinical data in regions identified as areas of potential disparity. Further models built for designing personalized patient-care journeys, health outreach programs, pricing considerations, or therapeutic delivery should be evaluated with a rigorous responsible AI ethics process to ensure no biases are propagated, all subpopulations are considered, and model interpretability and explainability are in place.

Reproduce these Processes with Minimal Effort

The intent of this project is to enable healthcare and life science professionals to understand how Dataiku can be used to accelerate the discovery of how SDoH disparities affect at-risk populations. By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization or across multiple organizations, immediate insights can be used to refine market access strategies for drug manufacturers, create new coverage policies from payers and improve facility outreach and care programs from health services.

We’ve provided several suggestions on how to use publicly available data and extract actionable insights, but ultimately the “best” approach will depend on your specific needs. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.