Solution | Clinical Site Intelligence#


Business Case#

Clinical trial research and operations are among the most expensive and time-consuming components of the therapeutic lifecycle from discovery to regulatory submission. A compound entering phase I of clinical trials historically has a 10% success rate. Pivotal efficacy studies (phase III) have a median cost of nearly USD$50 million and a per-patient enrolled cost of over USD$40K [1]. These estimates vary by therapeutic area, complexity of the drug, or mechanism of action. Total costs of clinical trials can easily be upwards of USD$200 million.

The main driver of the cost burden is the number of patients who must be enrolled and the number of sites and site visits required to prove treatment efficacy (per protocol). Studies show that more than 80% of trials require study timeline extensions or additional study sites due to low enrollment rates, which is also the leading cause of trial termination [2]. A single-month delay could easily translate to USD$1 million in trial costs and lost revenue from a later time to market. Therefore, proper intelligence on, and selection of, clinical sites capable of meeting enrollment goals is essential for a new study protocol.

Dataiku’s Clinical Site Intelligence Solution leverages ClinicalTrials.gov’s database of nearly 500K global studies to predict study enrollment rates, discover similar studies based on a novel study synopsis and patient criteria, and provide analytics on the clinical sites used in those studies. Sponsor dashboards also provide study overviews, intelligence on competing sponsors and sites, and (US-only) site locations’ augmented social factors and disease prevalence to facilitate site review and selection that encourages participant diversity.


The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:

  1. On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for Clinical Site Intelligence.

  2. Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Additional note for 12.1+ users

If using a Dataiku 12.1+ instance, and you are missing the technical requirements for this Solution, the popup below will appear to allow admin users to easily install the requirements, or for non-admin users to request installation of code environments and/or plugins on their instance for the Solution.

Admins can process these requests in the admin request center, after which non-admin users can re-trigger the installation of the Solution.

Screenshot of the Request Install of Requirements menu available for Solutions.

Technical Requirements#

To leverage this Solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+* instance

  • Have access to the ClinicalTrials.gov API. An API key is currently not required.

  • A Python 3.9 code environment named solution_cs-intel with the following required packages:


The code environment also requires an initialization script. Users should put the following script in the Resources tab.

import logging
import os
import shutil
import nltk

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
from dataiku.code_env_resources import update_models_meta

# Download pretrained NLTK tokenizer models
nltk.download('punkt')

# Set-up logging
logging.basicConfig()
logger = logging.getLogger("code_env_resources")
logger.setLevel(logging.INFO)

# Clear all environment variables defined by a previously run script
clear_all_env_vars()

######################## Sentence Transformers #################################
# Set sentence_transformers cache directory
set_env_path("SENTENCE_TRANSFORMERS_HOME", "sentence_transformers")

import sentence_transformers

# Download pretrained models
MODELS_REPO_AND_REVISION = [
    ("DataikuNLP/paraphrase-multilingual-MiniLM-L12-v2", "4f806dbc260d6ce3d6aed0cbf875f668cc1b5480"),
    ("xlm-roberta-base", "e9c793ac25de997022baf24ba4b022a602c0c050"),
    ("emilyalsentzer/Bio_ClinicalBERT", "9b5e0380b37eac696b3ff68b5f319c554523971f")
]

sentence_transformers_cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME")
for (model_repo, revision) in MODELS_REPO_AND_REVISION:
    logger.info("Loading pretrained SentenceTransformer model: {}".format(model_repo))
    model_path = os.path.join(sentence_transformers_cache_dir, model_repo.replace("/", "_"))

    # Uncomment below to overwrite (force re-download of) all existing models
    # if os.path.exists(model_path):
    #     logger.warning("Removing model: {}".format(model_path))
    #     shutil.rmtree(model_path)

    # This also skips same models with a different revision
    if not os.path.exists(model_path):
        model_path_tmp = sentence_transformers.util.snapshot_download(
            repo_id=model_repo,
            revision=revision,
            cache_dir=sentence_transformers_cache_dir,
            library_name="sentence-transformers",
            library_version=sentence_transformers.__version__,
            ignore_files=["flax_model.msgpack", "rust_model.ot", "tf_model.h5",],
        )
        os.rename(model_path_tmp, model_path)
    else:
        logger.info("Model already downloaded, skipping")

# Add sentence embedding models to the code-envs models meta-data
# (ensure that they are properly displayed in the feature handling)
update_models_meta()

# Grant everyone read access to pretrained models in sentence_transformers/ folder
# (by default, sentence transformers makes them only readable by the owner)
grant_permissions(sentence_transformers_cache_dir)

Data Requirements#

This Solution requires a connection to the ClinicalTrials.gov API.

Connecting to output data from the Social Determinants of Health (SDOH) Solution is optional.

  1. The Solution directly queries the ClinicalTrials.gov API, and loads the results into clinicaltrialgov_dataset. The current API version doesn’t require API keys for connection. The Solution inherits the data schema directly from the API.

  2. The Solution can optionally include US county-level census data and geography information from two data frames of the SDOH Solution: SOL_new_measure_final_dataset_county and SOL_tl_2020_us_county. The current version bundles the two data frames as part of the Solution, so installing the SDOH Solution in advance is unnecessary.
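As a rough illustration of the first requirement, the sketch below pages through registry results without any API key. It assumes the public ClinicalTrials.gov v2 REST endpoint and its `query.cond`, `pageSize`, `pageToken`, and `nextPageToken` fields; the function names `iter_studies` and `http_fetch` are hypothetical and are not part of the Solution, whose actual query code may differ.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, Optional

# Assumed endpoint of the ClinicalTrials.gov v2 REST API (not stated in this document).
STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def iter_studies(
    fetch_page: Callable[[Dict[str, str]], dict],
    condition: str,
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield study records page by page until no nextPageToken is returned."""
    params: Dict[str, str] = {"query.cond": condition, "pageSize": str(page_size)}
    while True:
        payload = fetch_page(params)
        yield from payload.get("studies", [])
        token: Optional[str] = payload.get("nextPageToken")
        if not token:
            break
        # Carry the same query forward and ask for the next page.
        params = {**params, "pageToken": token}


def http_fetch(params: Dict[str, str]) -> dict:
    """Hypothetical real fetcher: one unauthenticated GET request."""
    url = STUDIES_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


# Offline demonstration with a stubbed two-page response (no network access).
_pages = {
    None: {"studies": [{"id": "A"}, {"id": "B"}], "nextPageToken": "t1"},
    "t1": {"studies": [{"id": "C"}]},
}
demo_ids = [
    study["id"]
    for study in iter_studies(lambda p: _pages[p.get("pageToken")], "asthma")
]
```

Passing `http_fetch` instead of the stub would issue live requests. Keeping the fetcher injectable makes the paging logic easy to check without network access.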

Workflow Overview#

You can follow along with the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Query the ClinicalTrials.gov API, and extract the latest clinical trials registry data.

  2. Run automatic pipelines of data cleaning, harmonization, and feature engineering.

  3. Create an enrollment rate model and study similarity index for intelligence.

  4. Pre-compute data frames for the Clinical Site Intelligence webapp. The webapp provides interactive intelligence for study similarity analysis and site scorecards.

  5. Create a sponsor dashboard to understand study characteristics and SDOH information.



In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Solution Set-up via Dataiku Application#

A Dataiku Application must be used in order to set up the Solution and give users access to its pre-built dashboards and webapp. The application Clinical Site Intelligence can be found by navigating to the Applications section of the waffle menu in the top right of the Dataiku instance. You may also find it in the Applications section of the Dataiku Design home page.

It will help you connect to your data regardless of the connection type and seamlessly configure the whole Flow according to your specific parameters.

Create a new Dataiku application by clicking the Create App Instance button. It will create a new instance of the parent project, which you can configure according to your specific needs. You can make as many instances as you need (for example, if you want to apply this Solution to other data).

Connection Configuration#

Within the Dataiku application, select the preferred connection for the data frames and folders, respectively, where you want to build. The optimal engine will apply to the recipes within the Flow.

Dataiku screenshot of the app connection configuration.


We suggest selecting Snowflake or filesystem for data frames connection and filesystem for folder connection.

We recommend an SQL database (Snowflake) connection because it makes the Solution more scalable. This Solution saves the text embedding vectors and similarity index as pickle files in folders.
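To make the storage choice concrete, here is a minimal sketch of persisting embedding vectors as a pickle file and reading them back. It uses a local temporary directory in place of a Dataiku managed folder, and the helper names, file name, and vector values are illustrative assumptions rather than the Solution's actual code.

```python
import os
import pickle
import tempfile
from typing import Dict, List

Embeddings = Dict[str, List[float]]


def save_embeddings(folder: str, name: str, embeddings: Embeddings) -> str:
    """Serialize a {study_id: vector} mapping to <folder>/<name>.pkl."""
    path = os.path.join(folder, name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(embeddings, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_embeddings(path: str) -> Embeddings:
    """Read the mapping back. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip in a throwaway directory standing in for the folder connection.
with tempfile.TemporaryDirectory() as folder:
    original = {
        "NCT00000001": [0.1, 0.2, 0.3],
        "NCT00000002": [0.0, 0.5, 0.5],
    }
    restored = load_embeddings(save_embeddings(folder, "study_embeddings", original))
```

Pickle files are opaque binary blobs, which is one reason a filesystem folder connection suits them better than an SQL table.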

Define Study Scope#

This section defines the customized query sent to the ClinicalTrials.gov API. In other words, the customized query establishes the scope of the clinical trials that feed into the intelligence of this Solution. The query convention follows the API documentation.
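For illustration only, a study scope could be turned into a request URL along the following lines. The sketch assumes the ClinicalTrials.gov v2 parameter names `query.cond`, `query.spons`, `filter.overallStatus`, `pageSize`, and `format`; the helper `build_scope_url` is hypothetical, and the query field in the Dataiku application may expect a different syntax.

```python
from typing import List, Optional
from urllib.parse import urlencode

# Assumed v2 studies endpoint (not stated in this document).
STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_scope_url(
    condition: Optional[str] = None,
    sponsor: Optional[str] = None,
    statuses: Optional[List[str]] = None,
    page_size: int = 100,
) -> str:
    """Translate a study scope into a query string; unset filters are omitted."""
    params = {"format": "json", "pageSize": page_size}
    if condition:
        params["query.cond"] = condition
    if sponsor:
        params["query.spons"] = sponsor
    if statuses:
        # Multiple statuses are sent as one comma-separated value.
        params["filter.overallStatus"] = ",".join(statuses)
    return STUDIES_URL + "?" + urlencode(params)


scope_url = build_scope_url(
    condition="breast cancer",
    statuses=["COMPLETED", "TERMINATED"],
)
```

A narrower scope reduces the number of studies pulled into the Flow, which shortens every downstream build step.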

Screenshot of the app query.

Include Demographic & Social Determinants of Health Dataset#

If included, this optional dataset augments the prediction model for study enrollment rate and clinical site intelligence. The current release is limited to the SDOH data of US counties.

Screenshot of app SDOH option


Read the SDOH user manual for more information.

Build the Flow#

The Dataiku application will run all pipelines, and create all models in the Solution with the above configurations. The pipelines include direct queries to the ClinicalTrials.gov API, data cleaning, harmonization, feature engineering, model building, and dataset pre-computation for the webapp.

The Flow creates multiple machine learning algorithms to power the study similarity analysis and clinical site intelligence. The enrollment rate model and study similarity index provide study and clinical site insights through interactive visualization via a webapp and dashboard.
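This document does not define how the enrollment rate target is computed. One common convention, shown here purely as an assumption, is participants enrolled per site per month. The function name, the 30.44-day average month, and the example figures are all illustrative.

```python
from datetime import date
from typing import Optional

AVERAGE_DAYS_PER_MONTH = 30.44  # 365.25 / 12, an approximation


def enrollment_rate(
    enrolled: int, n_sites: int, start: date, end: date
) -> Optional[float]:
    """Participants per site per month, or None when inputs are unusable."""
    months = (end - start).days / AVERAGE_DAYS_PER_MONTH
    if enrolled < 0 or n_sites <= 0 or months <= 0:
        return None
    return enrolled / (n_sites * months)


# 120 participants across 10 sites over one year: about 1 per site per month.
example_rate = enrollment_rate(120, 10, date(2020, 1, 1), date(2021, 1, 1))
# Registry rows with no recorded sites cannot yield a rate.
missing_rate = enrollment_rate(120, 0, date(2020, 1, 1), date(2021, 1, 1))
```

Returning `None` for unusable rows keeps them out of model training instead of introducing zero or infinite targets.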

Screenshot of app build flow.

Launch the webapp#

Launch the Clinical Sites Intelligence webapp to review insights from study similarity analysis and clinical site intelligence.

Dataiku screenshot of app launch webapp.

Create Sponsor Dashboard#

Create a sponsor dashboard to overview studies and sites sponsored by a selected lead sponsor. The dashboard includes three components: Studies overview, Clinical sites overview, and Social determinants of health on sites.

  • The first slide provides study summaries from the ClinicalTrials.gov API and augments the insights with study enrollment rate prediction.

  • The second slide summarizes broader study activity and history across sponsors at clinical sites used by the selected sponsor of interest.

  • The last slide reveals the locations of facilities (sites) used for studies by the chosen sponsor with census county populations and social vulnerability information. It is only available if users include the SDOH dataset during the build in the Dataiku application.
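The second slide's idea, activity by other sponsors at the sites a selected sponsor has used, can be sketched as a small aggregation. The record layout `(sponsor, site, study_id)`, the function name, and the sample rows are assumptions made for illustration; the dashboard's actual datasets are not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (sponsor, site, study_id), a hypothetical layout


def competing_sponsors(records: Iterable[Record], sponsor: str) -> Dict[str, List[str]]:
    """For each site used by `sponsor`, list the other sponsors seen there."""
    rows = list(records)
    own_sites = {site for name, site, _ in rows if name == sponsor}
    others = defaultdict(set)
    for name, site, _ in rows:
        if site in own_sites and name != sponsor:
            others[site].add(name)
    # Sites with no competitor still appear, with an empty list.
    return {site: sorted(others.get(site, set())) for site in sorted(own_sites)}


sample_records = [
    ("Sponsor X", "Site A", "NCT00000001"),
    ("Sponsor Y", "Site A", "NCT00000002"),
    ("Sponsor Z", "Site A", "NCT00000003"),
    ("Sponsor X", "Site B", "NCT00000001"),
    ("Sponsor Y", "Site C", "NCT00000002"),
]
overlap = competing_sponsors(sample_records, "Sponsor X")
```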


The dashboard differs depending on whether SDOH data is included. Sponsor Insights:, Enrollment Rate Prediction & SDOH is available only if the SDOH data was included during the build. Otherwise, users should use Sponsor Insights:, Enrollment Rate Prediction instead.

Dataiku screenshot of app launch dashboard.

Webapp: Study Similarity & Clinical Site Intelligence#

The Study Similarity & Clinical Site Intelligence webapp is an interactive interface for users to query clinical site intelligence with study protocols. It distills the operational history of similar studies and associated clinical sites, representing the intelligence with easy-to-read charts.

Users initiate the webapp by providing a study protocol and can interact with each step/component. Finally, users can export the list of selected clinical sites in the last step of the webapp for further analysis.

Step One: Study Summary#

Users always start a query from step one. There are two ways to initiate new queries: input an existing study or a novel study protocol.

When choosing the existing study option, users must input a valid National Clinical Trial (NCT) identification number. For the novel study option, users fill in a self-defined study protocol as the query. The novel study input fields include study title, study summary, cohort age, sex, inclusion and exclusion criteria, healthy volunteers, and MeSH conditions.
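A format check for the existing study option might look like the sketch below. NCT identifiers consist of the prefix NCT followed by eight digits. The function name is hypothetical, and a well-formed identifier may still be absent from the registry or outside the configured study scope.

```python
import re
from typing import Optional

# "NCT" followed by exactly eight digits, for example NCT01234567.
NCT_PATTERN = re.compile(r"NCT\d{8}")


def normalize_nct_id(raw: str) -> Optional[str]:
    """Return the identifier in canonical upper case, or None if malformed."""
    candidate = raw.strip().upper()
    return candidate if NCT_PATTERN.fullmatch(candidate) else None


valid_id = normalize_nct_id("  nct01234567 ")
invalid_id = normalize_nct_id("NCT123")
```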

After submitting the query, the right main panel will display the result of the study summary. The webapp splits the summary into different tabs at the panel’s top for easy reading. There are four tabs: Summary, Patient Eligibility, Study Arms, and Study Sites. The webapp returns all four tabs for existing studies and the first two for novel studies.

Dataiku screenshot of webapp study summary.

Step Two: Studies and Sites#

The webapp queries the study similarity index prebuilt by the Dataiku application for a given study protocol. It returns the top 20 similar study protocols. Then, it identifies clinical sites recruited by these top similar studies. It shows the results in two tabs: Similar Studies and Candidate Sites.
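The retrieval step can be pictured as a nearest-neighbor lookup over study embeddings. The sketch below ranks studies by cosine similarity in plain Python. The choice of cosine similarity, the brute-force scan, and the toy two-dimensional vectors are assumptions, since the structure of the prebuilt similarity index is not described here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_similar(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 20
) -> List[Tuple[str, float]]:
    """Return the k most similar (study_id, score) pairs, best first."""
    scored = [(study_id, cosine(query, vector)) for study_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = {
    "NCT00000001": [1.0, 0.0],
    "NCT00000002": [0.7, 0.7],
    "NCT00000003": [0.0, 1.0],
}
ranking = top_similar([1.0, 0.1], toy_index, k=2)
ranked_ids = [study_id for study_id, _ in ranking]
```

With the default `k=20`, the function mirrors the top 20 cutoff described above. A brute-force scan like this is simple but grows linearly with the number of studies, which is why a prebuilt index is useful at registry scale.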

The left panel of both tabs serves as filters for users to drop the studies or sites. The filter for the Similar Studies tab will regenerate the list of locations in the Candidate Sites tab. Meanwhile, the filter for the Candidate sites tab will pass on to generate the site scorecards in step three.
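The cascade between the two filters can be sketched as follows: dropping a study removes its contribution, and the candidate site list is rebuilt from the studies that remain. The mapping layout, the function name, and the ranking by number of supporting studies are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def candidate_sites(
    study_sites: Dict[str, List[str]], dropped_studies: Set[str]
) -> List[Tuple[str, int]]:
    """Count, per site, how many retained similar studies used it."""
    counts: Counter = Counter()
    for study_id, sites in study_sites.items():
        if study_id in dropped_studies:
            continue
        counts.update(set(sites))  # count each site once per study
    # Most-used sites first, ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


similar_study_sites = {
    "NCT00000001": ["Site A", "Site B"],
    "NCT00000002": ["Site B", "Site C"],
    "NCT00000003": ["Site C"],
}
all_candidates = candidate_sites(similar_study_sites, set())
after_drop = candidate_sites(similar_study_sites, {"NCT00000002"})
```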

Dataiku screenshot of webapp studies and sites.

Step Three: Site Score Cards#

The Site Score Card provides visualized insights on individual clinical research sites, including geolocation, SDOH, studies involved, and competing sponsors. The left panel logs the user's review history on the list of candidate sites and allows users to drop locations. Finally, the user can export the finalized list for further analysis.
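The final export could resemble the sketch below, which writes the retained sites to CSV text. The column names, sample rows, and function name are hypothetical, and the webapp's actual export format is not specified in this document.

```python
import csv
import io
from typing import Dict, Iterable, Set

EXPORT_COLUMNS = ["facility", "city", "country"]  # illustrative column set


def export_sites(sites: Iterable[Dict[str, str]], dropped: Set[str]) -> str:
    """Return CSV text for every site whose facility name was not dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=EXPORT_COLUMNS,
        extrasaction="ignore",  # tolerate extra scorecard fields
        lineterminator="\n",
    )
    writer.writeheader()
    for site in sites:
        if site["facility"] not in dropped:
            writer.writerow(site)
    return buffer.getvalue()


reviewed_sites = [
    {"facility": "Site A", "city": "Boston", "country": "United States", "score": "0.9"},
    {"facility": "Site B", "city": "Lyon", "country": "France", "score": "0.4"},
]
csv_text = export_sites(reviewed_sites, dropped={"Site B"})
```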

Dataiku screenshot of webapp geolocation and SDOH.

Responsible AI Statement#

This Solution uses analytics and ML-driven insights to inform clinical site recruitment by study protocol design. However, it is crucial to be mindful of the historical inequality in the recruitment of human subjects in clinical research. The patient enrollment process often under-represents communities defined by sex, gender, minority or ethnic background, and health conditions. A data-driven approach will inevitably inherit these biases from the clinical trial registry. It is essential to account for these data limitations in any interpretation of the results.

The Solution also augments clinical site intelligence with US SDOH insights to encourage recruitment diversity. It is derived from community-level survey data, and it should not be used to support misleading attribution on how a person’s socioeconomic status, minority/ethnic background, and household situation predict or inform potential disease occurrence or outcomes. Self-reported survey data is particularly subject to recall, social desirability, and non-response bias. Any decisions or actions driven by this analysis must consider these limitations that may influence the distribution of the data.

While leveraging disease associations with regional community-level characteristics, it is essential to utilize this information to advance health equity and enhance therapeutic accessibility, actively avoiding any reinforcement or exacerbation of disparities or biases within the health and life sciences systems where this solution is deployed.

This approach is extendable to incorporate supplementary data, including Health Care Professional (HCP) or pharmacy geolocation information, as well as individual-level (de-identified) personal patient behavioral and clinical data in regions identified as potential areas of disparity.

Furthermore, any models developed for crafting personalized patient-care journeys, health outreach programs, pricing considerations, or therapeutic delivery must undergo a thorough evaluation guided by a robust and responsible AI ethics process. This process ensures the prevention of biases, consideration of all subpopulations, and the establishment of model interpretability and explainability.


We encourage users to check out Dataiku’s Responsible AI course to learn more.

Reproduce these Processes with Minimal Effort#

This project intends to enable healthcare and life science professionals to understand how Dataiku can accelerate a data-driven approach to facilitate clinical operation by leveraging public datasets.

By creating a singular Solution that can benefit and influence the decisions of various teams in a single organization or across multiple organizations, immediate insights can be used to refine clinical site recruitment strategies for drug manufacturers.

We have provided several suggestions on how to use publicly available data and extract actionable insights. However, ultimately, the best approach will depend on your specific needs. In the event you would like support adapting this project to your organization’s goals and needs, Dataiku provides roll-out and customization services on demand.