Solution | Pharmacovigilance#


Business Case#

The efficiency of post-market drug safety surveillance functions plays a critical role in reinforcing patient safety and securing successful drug launches. Compliance in safety reporting and surveillance is a must-meet regulatory requirement (Good Pharmacovigilance Practice), and failure to appropriately report, detect, and address adverse drug reactions can lead to patient harm, drug recall, and significant costs.

As the volume/velocity/variety of safety reporting data grows, it is becoming essential for global safety teams at drug manufacturers, health outcomes research institutions, and regulatory bodies alike to adopt new analytics-driven approaches, that can be automated at scale, to improve early signal detection and reliability in the pharmacovigilance process.

This plug-and-play solution aims at providing a ready-to-use interface to accelerate the discovery of potential Adverse Drug Reaction (ADR) signals by using statistical metrics to generate disproportionality metrics on drugs and adverse events paired across various populations.


The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:

  1. On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for Pharmacovigilance.

  2. Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+* instance with a built-in Python3 environment (or create a Python3 code env).

  • All code scripts use Python 3.6. Depending on the specifications of the user’s instance, these packages may be tracked with future python updates.

  • To benefit natively from all the Dataiku application automation, you will need to reconfigure one of the following connections:

    • PostgreSQL

    • Snowflake

Data Requirements#

The inputs of this solution are contained in two managed folders which are helpful for storing data structures that are unsupported by Dataiku standard Flow objects.

Managed folder


Product Drug Names

Contains the Product.txt file which is imported by default from Orange Book . It contains a list of drugs and pharmaceuticals that the U.S. Food and Drug Administration (FDA) has approved as both safe and effective.

Input Files

Requires (at least) 5 datasets as .txt files to be imported from FDA Adverse Event Reporting System (FAERS) . These five datasets contain adverse vent reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to the FDA.

Workflow Overview#

You can follow along with the solution in the Dataiku gallery .

Dataiku screenshot of the final project Flow showing all Flow Zones.

The project has the following high-level steps:

  1. Ingest data files.

  2. Process the data, detect duplicate reports, and filter on demographic, drug, reaction, and report characteristics.

  3. Identify and visualize patterns in safety data.

  4. Calculate metrics for statistical inference and signal detection.

  5. Analyze new insights with a Dataiku Application.

  6. Increase regulatory compliance with early detection of potential ADR signals.



In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Plug and Play FAERS quarterly data files#

The aforementioned input files can be uploaded to the solution either directly into the managed folders or via the Dataiku Application interface. Following upload, the connections of the Flow can be reconfigured, and users can select to anonymize the manufacturer and drug names for confidentiality reasons. Within the Input Files Flow zone, two Python recipes are applied to our initial managed folders:

Python recipe



Parses FDA standard drug name txt file to a data frame and export a dataset object.


Parses and accesses ASCII (.txt) files and convert them to data frames. The process includes file name checks to a prespecified regex condition and further mapping of column codes to standard terms.

Screenshot of the input and data prep Flow zones.

In this Flow zone, visual Distinct recipes are also used to keep the values necessary for running the rest of the Flow and generating statistical analysis.

The Data Preparation (Drug) Flow zone extracts information from drug interactions and joins it with the indication dataset using visual split, group, and join recipes.

Once our initial data is imported and cleaned, we are ready to begin aggregating our data.

Setting the Data for Visual Insights#

The Data Standardization (Drug) Flow zone takes, as an input, the previously prepared dataset of drug interactions joined with indications and a dataset of FDA product names. These two datasets are joined together, and the FDA drug name is used to standardize any misspellings in the FAERS data.

Screenshot of the additional data prep Flow zones to make analysis possible

Moving along to the Data Preparation (Demographics) dataset, we take the FAERS’ reaction, outcome, and demographics datasets and the joined drug name dataset from our previous Flow zone as inputs. A visual prepare recipe is used on our demographic data to clean age, country, and date features. We then join this to our other 3 datasets and compute a feature to represent the seriousness of an event based on outcome codes. Additional recipes are used to calculate metrics on the number of adverse events and to anonymize the manufacturer and drug name (if selected in the Dataiku App).

In this long and tedious process of mining through the data, duplicates are a common mistake as the database contains information voluntarily submitted by healthcare professionals, consumers, lawyers, and manufacturers. Hence, adverse event reports may be duplicated by multiple parties per event and may be more likely to contain incorrect information if that is submitted by a non-medical professional. Report Deduplication updates column names and removes any record duplication.

Screenshot of the Flow zones dedicated to Statistical Analysis and pre-processing for visualizations.

The Data Analytics/Statistics Flow zone filters data using user-specific variables and splits the entire dataset into cohort subpopulations. Final statistics to be used in the dashboard are generated via a python script that applies prefiltering on adverse event frequency, computes the measure of disproportionality statistics for each drug and adverse event pair, and outputs individual dataset objects to be used for further comparison and signal detection.

The three output datasets from our previous Flow zone are brought into the Visualizations Flow zone, along with the output from Report Deduplication. This Flow zone processes the output datasets to generate warnings on potential drug adverse event signals and visual insights. Final datasets generate a number of graphs published in the Pharmacovigilance analytics dashboard.

Reproduce these Processes with Minimal Effort#

The intent of this project is to enable Drug Safety and Surveillance stakeholders to understand how Dataiku can be used to easily integrate large amounts of data from spontaneous reporting systems, and just as easily push the resulting datasets into case management systems for investigation. By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization, or across multiple organizations, immediate insights can be used to detect drug risks early, prevent patient harm, ensure safety in diverse populations, detect dangerous drug interactions, and anticipate the lengthy regulatory process of drug recalls with early action.

We’ve provided several suggestions on how to integrate data from spontaneous reporting systems and extract actionable insights, but ultimately the “best” approach will depend on your specific needs. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.