Business Case

The efficiency of post-market drug safety surveillance functions plays a critical role in reinforcing patient safety and securing successful drug launches. Compliance in safety reporting and surveillance is a must-meet regulatory requirement (Good Pharmacovigilance Practice), and failure to appropriately report, detect, and address adverse drug reactions can lead to patient harm, drug recall, and significant costs.

As the volume, velocity, and variety of safety reporting data grow, it is becoming essential for global safety teams at drug manufacturers, health outcomes research institutions, and regulatory bodies alike to adopt analytics-driven approaches that can be automated at scale to improve early signal detection and reliability in the pharmacovigilance process.

This plug-and-play solution provides a ready-to-use interface to accelerate the discovery of potential Adverse Drug Reaction (ADR) signals by computing disproportionality metrics for drug and adverse event pairs across various populations.

Technical Requirements

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 10.0+ instance with a built-in Python3 environment (or create a Python3 code env).

  • All code scripts use Python 3.6. Depending on the specifications of the user’s instance, package versions may need to track future Python updates.

  • To benefit natively from all the Dataiku application automation, you will need to reconfigure one of the following connections:

    • PostgreSQL

    • Snowflake


Please be aware that while the Solution is compatible with Dataiku V10 instances, the final tab of the Dashboard (Drug Monitoring Signals) will be blank on V10.0 instances as it depends on a pivot table only available in V11.0+. All other components of the Solution are compatible with V10.0+.


If the technical requirements are met, this solution can be installed in one of two ways:

  • On your Dataiku instance, click + New Project > Industry solutions > Search for Pharmacovigilance.

  • Download the .zip project file and upload it directly to your Dataiku instance as a new project.

Data Requirements

The inputs of this Solution are contained in two managed folders, which are helpful for storing data structures that are unsupported by Dataiku standard flow objects.

  • The Product Drug Names managed folder contains the Product.txt file, which is imported by default from the Orange Book. It contains a list of drugs and pharmaceuticals that the U.S. Food and Drug Administration (FDA) has approved as both safe and effective.

  • The Input Files managed folder requires (at least) five datasets as .txt files to be imported from the FDA Adverse Event Reporting System (FAERS). These five datasets contain adverse event reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to the FDA.

Workflow Overview

You can follow along with the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow Zones.

The project has the following high-level steps:

  1. Ingest data files.

  2. Process the data, detect duplicate reports, and filter on demographic, drug, reaction, and report characteristics.

  3. Identify and visualize patterns in safety data.

  4. Calculate metrics for statistical inference and signal detection.

  5. Analyze new insights with a Dataiku Application.

  6. Increase regulatory compliance with early detection of potential ADR signals.



In addition to reading this document, it is recommended to read the wiki of the project before beginning in order to get a deeper technical understanding of how this solution was created and longer explanations of solution-specific vocabulary.

Plug and Play FAERS quarterly data files

The aforementioned input files can be uploaded to the Solution either directly into the managed folders or via the Dataiku Application interface. Following upload, the connections of the flow can be reconfigured, and users can select to anonymize the manufacturer and drug names for confidentiality reasons. Within the Input Files flow zone, two Python recipes are applied to our initial managed folders:

  • compute_FDA_products parses the FDA standard drug name .txt file into a data frame and exports a dataset object.

  • Faers_data_ingestion accesses and parses the ASCII (.txt) files and converts them to data frames. The process includes checking file names against a prespecified regex condition and mapping column codes to standard terms.
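As a rough illustration of the file name check and parsing described above, here is a minimal sketch using only the Python standard library. It is not the Solution's actual recipe code: the regex, the list of table prefixes, the "$" delimiter, and the function names classify_file and parse_faers_text are all assumptions based on how FAERS quarterly ASCII files are commonly named and laid out (for example, DEMO21Q1.txt).

```python
import csv
import io
import re

# Assumed naming pattern for FAERS quarterly ASCII files, e.g. DEMO21Q1.txt.
# The prefixes listed here are an assumption, not taken from the Solution.
FAERS_NAME = re.compile(
    r"^(DEMO|DRUG|REAC|OUTC|INDI|RPSR|THER)(\d{2})Q([1-4])\.txt$",
    re.IGNORECASE,
)


def classify_file(name):
    """Return (table, year, quarter) if the file name matches, else None."""
    match = FAERS_NAME.match(name)
    if match is None:
        return None
    table, year, quarter = match.groups()
    return table.upper(), 2000 + int(year), int(quarter)


def parse_faers_text(text):
    """Parse '$'-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="$", quoting=csv.QUOTE_NONE
    )
    return [dict(row) for row in reader]


# Tiny made-up sample standing in for the content of a demographics file.
sample = "primaryid$caseid$age\n1001$1$54\n1002$2$61\n"
rows = parse_faers_text(sample)
```

Files whose names do not match the pattern return None from classify_file, so they can be skipped or reported before any parsing is attempted.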

Screenshot of the input and data prep flow zones.

In this flow zone, visual Distinct recipes are also used to keep the values necessary for running the rest of the flow and generating statistical analysis.

The Data Preparation (Drug) flow zone extracts information from drug interactions and joins it with the indication dataset using visual split, group, and join recipes.

Once our initial data is imported and cleaned, we are ready to begin aggregating our data.

Setting the Data for Visual Insights

The Data Standardization (Drug) flow zone takes, as an input, the previously prepared dataset of drug interactions joined with indications and a dataset of FDA product names. These two datasets are joined together, and the FDA drug name is used to standardize any misspellings in the FAERS data.
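The Solution performs this standardization with a join against the FDA product names. To show the general idea of mapping a misspelled name onto a reference list, the sketch below swaps in a different technique, approximate string matching with Python's difflib module. The function name standardize_name, the 0.85 cutoff, and the sample drug names are hypothetical.

```python
import difflib


def standardize_name(raw_name, reference_names, cutoff=0.85):
    """Return the closest reference name, or None if nothing is close enough.

    The cutoff is a hypothetical value; it would need tuning on real data.
    """
    cleaned = raw_name.strip().upper()
    # Compare in upper case but return the reference spelling unchanged.
    lookup = {name.upper(): name for name in reference_names}
    if cleaned in lookup:
        return lookup[cleaned]
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


# Made-up reference list standing in for the FDA product names.
reference = ["ATORVASTATIN", "METFORMIN"]
fixed = standardize_name("atorvastatn", reference)
unknown = standardize_name("zzzz", reference)
```

Returning None for names with no close match keeps unmatched records visible, rather than forcing them onto an unrelated product.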

Screenshot of the additional data prep flow zones to make analysis possible

Moving along to the Data Preparation (Demographics) flow zone, we take the FAERS reaction, outcome, and demographics datasets and the joined drug name dataset from our previous flow zone as inputs. A visual Prepare recipe is used on our demographic data to clean age, country, and date features. We then join this to our other three datasets and compute a feature to represent the seriousness of an event based on outcome codes. Additional recipes are used to calculate metrics on the number of adverse events and to anonymize the manufacturer and drug name (if selected in the Dataiku App).
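The exact rule used to derive seriousness is not spelled out here, so the following is only a sketch of one possible approach: flag a report as serious if any of its outcome codes belongs to a chosen set. The code set in SERIOUS_CODES, the choice to leave "OT" out of it, and the function name seriousness_by_report are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical set of outcome codes treated as serious, e.g. death (DE),
# life-threatening (LT), hospitalization (HO). This is an assumption, not
# the rule implemented in the Solution.
SERIOUS_CODES = frozenset({"DE", "LT", "HO", "DS", "CA", "RI"})


def seriousness_by_report(outcome_rows):
    """Map each report id to True if any of its outcome codes is serious."""
    flags = defaultdict(bool)
    for report_id, code in outcome_rows:
        is_serious = code.strip().upper() in SERIOUS_CODES
        flags[report_id] = flags[report_id] or is_serious
    return dict(flags)


# Made-up (report id, outcome code) pairs; one report can have several codes.
outcomes = [("1001", "HO"), ("1001", "OT"), ("1002", "OT")]
serious = seriousness_by_report(outcomes)
```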

In this long and tedious process of mining through the data, duplicates are a common problem because the database contains information voluntarily submitted by healthcare professionals, consumers, lawyers, and manufacturers. Hence, the same adverse event may be reported by multiple parties, and a report may be more likely to contain incorrect information if it is submitted by a non-medical professional. Report Deduplication updates column names and removes any duplicate records.
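One simple way to remove duplicates of this kind is to keep a single, most recent record per case. The sketch below illustrates that idea only; it is not the logic of the Solution's own deduplication step. The field names caseid, primaryid, and fda_dt, and the rule of keeping the latest record, are assumptions.

```python
def deduplicate_reports(reports):
    """Keep one report per caseid: the one with the latest (fda_dt, primaryid).

    Dates are assumed to be 'YYYYMMDD' strings, so they sort chronologically.
    """
    latest = {}
    for report in reports:
        key = report["caseid"]
        rank = (report["fda_dt"], report["primaryid"])
        current = latest.get(key)
        if current is None or rank > (current["fda_dt"], current["primaryid"]):
            latest[key] = report
    return list(latest.values())


# Made-up reports: case "1" was submitted twice, case "2" once.
reports = [
    {"caseid": "1", "primaryid": 1001, "fda_dt": "20210105"},
    {"caseid": "1", "primaryid": 1002, "fda_dt": "20210220"},
    {"caseid": "2", "primaryid": 2001, "fda_dt": "20210110"},
]
unique = deduplicate_reports(reports)
```

This only catches duplicates that share a case identifier. Reports of the same event filed by different parties under different identifiers would need matching on other attributes.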

Screenshot of the flow zones dedicated to Statistical Analysis and pre-processing for visualizations.

The Data Analytics/Statistics flow zone filters data using user-specified variables and splits the entire dataset into cohort subpopulations. Final statistics to be used in the dashboard are generated via a Python script that applies prefiltering on adverse event frequency, computes disproportionality statistics for each drug and adverse event pair, and outputs individual dataset objects to be used for further comparison and signal detection.
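This section does not list which disproportionality statistics the script computes. As an illustration, the sketch below computes two commonly used ones, the proportional reporting ratio (PRR) and the reporting odds ratio (ROR), from a 2x2 contingency table, together with approximate 95% confidence intervals on the log scale. The function name disproportionality and the sample counts are hypothetical.

```python
import math


def disproportionality(a, b, c, d, z=1.96):
    """Compute PRR and ROR with approximate confidence intervals.

    a: reports with the drug and the event
    b: reports with the drug and other events
    c: reports with other drugs and the event
    d: reports with other drugs and other events
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")

    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)

    # Standard errors of log(PRR) and log(ROR).
    se_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    def interval(estimate, se):
        log_est = math.log(estimate)
        return math.exp(log_est - z * se), math.exp(log_est + z * se)

    return {
        "prr": prr,
        "prr_ci": interval(prr, se_prr),
        "ror": ror,
        "ror_ci": interval(ror, se_ror),
    }


# Made-up counts for a single drug and adverse event pair.
stats = disproportionality(a=20, b=80, c=100, d=9800)
```

The check for positive counts is one reason to prefilter on adverse event frequency first: pairs with empty cells make these ratios undefined, and very small counts make them unstable.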

The three output datasets from our previous flow zone are brought into the Visualizations flow zone, along with the output from Report Deduplication. This flow zone processes the output datasets to generate warnings on potential drug adverse event signals and visual insights. Final datasets generate a number of graphs published in the Pharmacovigilance analytics dashboard.
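The rule used to raise warnings is not detailed here. A simple way to turn per-pair statistics into warnings is to require a minimum number of reports and a confidence interval lower bound above 1, as in the sketch below. The thresholds, the field names (drug, event, n_reports, ci_lower), and the function name flag_signals are hypothetical and would need to be set by the safety team.

```python
def flag_signals(pair_stats, min_reports=3, lower_bound_threshold=1.0):
    """Return (drug, event) pairs that meet both hypothetical criteria."""
    flagged = []
    for stats in pair_stats:
        enough_reports = stats["n_reports"] >= min_reports
        above_threshold = stats["ci_lower"] > lower_bound_threshold
        if enough_reports and above_threshold:
            flagged.append((stats["drug"], stats["event"]))
    return flagged


# Made-up statistics for three drug and adverse event pairs.
pairs = [
    {"drug": "DRUG_A", "event": "NAUSEA", "n_reports": 20, "ci_lower": 2.4},
    {"drug": "DRUG_A", "event": "RASH", "n_reports": 2, "ci_lower": 3.1},
    {"drug": "DRUG_B", "event": "NAUSEA", "n_reports": 15, "ci_lower": 0.8},
]
warnings_list = flag_signals(pairs)
```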

Reproduce these Processes with Minimal Effort

The intent of this project is to enable Drug Safety and Surveillance stakeholders to understand how Dataiku can be used to easily integrate large amounts of data from spontaneous reporting systems, and just as easily push the resulting datasets into case management systems for investigation. By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization, or across multiple organizations, immediate insights can be used to detect drug risks early, prevent patient harm, ensure safety in diverse populations, detect dangerous drug interactions, and anticipate the lengthy regulatory process of drug recalls with early action.

We’ve provided several suggestions on how to integrate data from spontaneous reporting systems and extract actionable insights, but ultimately the “best” approach will depend on your specific needs. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.