Solution | Molecular Property Prediction#


Business Case#

Exploring new drugs to improve human health is a significant effort that requires collaboration among scientists, clinicians, and regulatory authorities to bring new treatments to patients. The process involves identifying a target protein related to a specific disease, finding or designing molecules that interact with that protein in a desired way, testing these molecules rigorously, and, if successful, developing them into safe and effective drugs for treating the disease. It is a complex and time-consuming process that can cost around $2.6 billion and take about 12 years on average.

In recent years, this field has undergone a substantial transformation through the integration of artificial intelligence (AI). Notably, the investment in AI-enabled drug development has experienced a remarkable surge, reaching $59.3 billion as of 2023, a nearly 27-fold increase since 2015 (Source: Deep Pharma Intelligence).

Analysts project that a 20-40% reduction in preclinical development costs could provide the financial resources required to advance four to eight novel molecules successfully. Biotech companies embracing an AI-driven approach have built an impressive pipeline of potential drugs. With over 150 small-molecule drugs in the discovery phase and 15 undergoing clinical trials, this pipeline makes the transformative role of AI in drug discovery hard to ignore.

This Dataiku Molecular Property Prediction Solution aims to optimize molecular screening against selected target proteins by querying molecules with known bioactivity from the ChEMBL and PubChem databases. Machine learning models are deployed to predict molecular properties from chemical structures and guide the digital drug discovery process by identifying the most promising drug candidates before experimental work is conducted.

Installation#


The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:

  1. On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for Molecular Property Prediction.

  2. Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Additional note for 12.1+ users

If you are using a Dataiku 12.1+ instance and are missing the technical requirements for this Solution, the popup below appears. It lets admin users install the requirements directly, and lets non-admin users request installation of the code environments and/or plugins on their instance.

Admins can process these requests in the admin request center, after which non-admin users can re-trigger the installation of the Solution.

Screenshot of the Request Install of Requirements menu available for Solutions.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 12.5+ instance.

  • Accessing ChEMBL and PubChem through their API services does not currently require an API key.

  • A Python 3.9 code environment named solution_molecular-prop-prediction with the following packages:


The code environment also requires an initialization script that loads a tokenizer and model from the Hugging Face library and manages permissions for cache directory access. Users should put the following script in the Resources tab of the code environment.

## Base imports
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import grant_permissions
from transformers import AutoTokenizer, AutoModel

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set HuggingFace cache directory
set_env_path("TRANSFORMERS_CACHE", "huggingface")
tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM", cache_dir="huggingface")
model = AutoModel.from_pretrained("DeepChem/ChemBERTa-77M-MLM", cache_dir="huggingface")


The downloadable version uses filesystem-managed datasets and the built-in Dataiku engine as the only processing engine. Performance could be significantly improved by changing all the connections to Snowflake connections.

Data Requirements#


This Solution uses data pulled via the ChEMBL and PubChem APIs but is not endorsed or certified by these organizations. By using this Solution, you agree to abide by the terms set forth by these data sources.

  1. Bioactivity data from a public database. The Solution directly queries the ChEMBL or PubChem API based on user specifications. It loads the results into metadata, which captures an overview of the user query, and stores all previously studied molecules reported in the database. The current API version doesn’t require API keys for the connection. The Solution automatically applies all the required preprocessing to store the data in the output schema.

  2. Input data from the user for scoring novel molecules. The user must upload their own test_data with the structure below.

Dataiku screenshot of the format of test data for molecules.
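
To make the first data source more concrete, here is a minimal sketch of how an IC50 activity query against the public ChEMBL web services could be assembled. It is not the Solution's own code. The helper name build_activity_url is hypothetical, and the sketch assumes you already know the ChEMBL target ID (CHEMBL203 is used as an example), whereas the Solution starts from a protein accession code. The code only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Root of the public ChEMBL web services (no API key required at the time of writing).
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"


def build_activity_url(target_chembl_id: str, limit: int = 100) -> str:
    """Return a URL listing IC50 activity records for one ChEMBL target.

    Hypothetical helper for illustration; not part of the Solution.
    """
    params = {
        "target_chembl_id": target_chembl_id,  # e.g. "CHEMBL203"
        "standard_type": "IC50",               # keep only IC50 measurements
        "limit": limit,                        # page size of the response
    }
    return f"{CHEMBL_BASE}/activity.json?{urlencode(params)}"


url = build_activity_url("CHEMBL203")
print(url)
```

Fetching this URL with any HTTP client returns one page of JSON activity records. Larger result sets are paginated, so a real query would loop over pages.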

Workflow Overview#

You can follow along with the sample project in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Query molecules (SMILES) with known bioactivity for a specific protein target via ChEMBL or PubChem API to predict bioactivity.

  2. Generate molecular descriptors and fingerprints to perform quantitative structure-activity analysis and understand how molecular properties influence bioactivity.

  3. Train and benchmark regression models to predict molecular bioactivity (pIC50) as a measure of molecule potency and speed up experimental work on large datasets.

  4. Score novel molecules and prioritize the ones that qualify for the next discovery stage under the required properties.

  5. Assess further compound similarity using t-SNE and statistics that help to identify structurally related studied compounds to validate potential drug targets.

  6. Publish the results to a template dashboard that showcases the analysis, modeling, and novel molecule output.



In addition to reading this document, we recommend reading the project wiki before beginning. It gives a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Solution Set-up via Dataiku Application#

The Molecular Property Prediction Dataiku application helps configure the project’s critical parameters and subsequently builds the elements in the Flow zones. It also enables multiple users to work on individual instances of the Solution without directly modifying the original project.

The first part of the application validates that the protein accession code exists in the chosen public database. The second part provides an interface where the user sets the project variables for the required analysis. The final step is to replace test_data with your own data and run the BUILD ALL scenario to build the Flow and update the dashboard.

Below are explanations of the different variables that need to be set by the user manually in the project. You can modify them in the Variables section of the project.

Dataiku screenshot of the Dataiku App associated with this Solution.

Database Selection#

Select a public chemical database to connect automatically through API and specify the target protein accession code. The first scenario validates the presence of the accession code in the selected database.

Data Preparation#

Specify the parameters required for data preparation.

Molecular bioactivity is a molecule’s ability to bind to a biological target, measured here by the standard IC50 value.

  1. A molecule is defined as Active if the IC50 value < Threshold for Active label.

  2. A molecule is defined as Inactive if the IC50 value > Threshold for Inactive label.

  3. Otherwise, the molecule is treated as intermediate.
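
The labeling rule above, together with the conversion from IC50 to the pIC50 value used for modeling, can be sketched as follows. This is an illustration, not the Solution's implementation. The threshold values and the assumption that IC50 is expressed in nanomolar (nM) are hypothetical; in the Solution the thresholds come from the application fields. The conversion itself follows the standard definition pIC50 = -log10(IC50 in mol/L).

```python
import math

# Hypothetical thresholds in nanomolar (nM), chosen only for illustration.
ACTIVE_THRESHOLD_NM = 1_000.0
INACTIVE_THRESHOLD_NM = 10_000.0


def bioactivity_label(ic50_nm: float,
                      active: float = ACTIVE_THRESHOLD_NM,
                      inactive: float = INACTIVE_THRESHOLD_NM) -> str:
    """Label a molecule from its IC50 value (assumed to be in nM)."""
    if ic50_nm < active:
        return "active"
    if ic50_nm > inactive:
        return "inactive"
    return "intermediate"


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


for value in (50.0, 5_000.0, 50_000.0):
    print(value, bioactivity_label(value), round(pic50_from_nm(value), 2))
```

Because pIC50 is a negative logarithm, a lower IC50 (a more potent molecule) gives a higher pIC50.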

Machine Learning#

Descriptors are quantitative metrics that characterize the chemical and physical properties of molecules. These descriptors comprise the input features to the regression model for predicting the molecular bioactivity value pIC50.

Molecular descriptors

Capture physicochemical properties as continuous numerical values. Examples include molecular weight and number of atoms.

Fingerprint descriptors

Represent the presence or absence of specific chemical features as binary or numerical codes generated from the canonical SMILES notation. Often used for similarity searches and clustering of compounds in chemical databases. Examples include the Morgan fingerprint (ECFP4 analog, 1024 bits long), MACCS keys, the PubChem fingerprint, and embeddings from the large-scale pre-trained ChemBERTa model from Hugging Face.

By default, the machine learning model uses only the fingerprint descriptors as input features. The option below lets the user include both descriptor types.

Molecular Similarity#

The project analyzes the degree of structural resemblance or likeness between novel scored molecules and studied molecules used for training. The final field of the Dataiku application allows you to specify a novel molecule ID to initiate the analysis. You can dynamically interact with all the novel molecules within the dashboard results.

Explore the Chemical Space of Studied Molecules and their Influence on the Target Protein#

The target protein analysis presents the metadata of the selected target protein and several molecules. The 1D molecular descriptors provide information about a molecule’s size, shape, polarity, and functional groups.

The chemical space analysis allows the users to visualize, explore, and analyze the relationships between molecular structures and their properties.

Dataiku screenshot of the Dashboard for Target Protein analysis.

Discover Novel Molecules#

Regression models help prioritize compounds for experimental testing by identifying those more likely to be effective against a specific target. By combining the pIC50 prediction with the molecular descriptors from RDKit, you can narrow the search space to your region of interest and prioritize the compounds to pursue further.
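
As a rough sketch of this kind of filtering, the example below keeps molecules whose predicted pIC50 is above a minimum and whose molecular weight is below a maximum, then ranks them by predicted potency. The records, field names, and cutoffs are invented for illustration and are not taken from the Solution; in practice you would apply the same logic to the scored dataset or through the dashboard filters.

```python
# Invented example records: each scored molecule carries a predicted pIC50
# and one descriptor (molecular weight). All values are made up.
candidates = [
    {"id": "mol_1", "predicted_pic50": 7.2, "mol_weight": 412.5},
    {"id": "mol_2", "predicted_pic50": 5.1, "mol_weight": 298.3},
    {"id": "mol_3", "predicted_pic50": 8.0, "mol_weight": 640.9},
    {"id": "mol_4", "predicted_pic50": 6.6, "mol_weight": 355.0},
]


def prioritize(mols, min_pic50=6.0, max_mol_weight=500.0):
    """Keep molecules within both cutoffs, most potent first.

    The default cutoffs are hypothetical and should be set per project.
    """
    kept = [
        m for m in mols
        if m["predicted_pic50"] >= min_pic50 and m["mol_weight"] <= max_mol_weight
    ]
    return sorted(kept, key=lambda m: m["predicted_pic50"], reverse=True)


shortlist = prioritize(candidates)
print([m["id"] for m in shortlist])  # ['mol_1', 'mol_4']
```

Here mol_3 has the highest predicted potency but is dropped by the weight cutoff, which shows why the prediction and the descriptors are used together.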

This process is especially valuable in the early stages of drug discovery when resources are limited and researchers need to make informed decisions. This exploration can uncover novel compounds with unique structural features with therapeutic potential.

Dataiku screenshot of the Dashboard for Discovering Novel Molecules.

Identify Molecules with Similar Structures#

Molecular similarity is the degree of likeness between two molecules based on their structural properties. The molecular similarity score is computed with the Tanimoto Coefficient, which compares the presence or absence of structural features (e.g., molecular fingerprints) between pairs of new and studied molecules.
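
For reference, the Tanimoto coefficient of two fingerprints A and B is the number of features present in both divided by the number present in either, |A ∩ B| / |A ∪ B|. The sketch below computes it for fingerprints represented as sets of "on" bit positions. The bit positions are invented for illustration, and the Solution's own computation may rely on a cheminformatics library rather than plain Python sets.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient for fingerprints given as sets of 'on' bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(bits_a & bits_b) / union


# Invented fingerprints: indices of the bits set to 1.
novel_molecule = {1, 5, 9, 42}
studied_molecule = {1, 5, 7, 42, 100}

score = tanimoto(novel_molecule, studied_molecule)
print(score)  # 3 shared bits / 6 distinct bits = 0.5
```

The score ranges from 0 (no shared features) to 1 (identical fingerprints), so ranking studied molecules by this score against a novel molecule surfaces its closest structural neighbors.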

Assessing molecular similarity helps researchers identify structurally or functionally related molecules, which can have significant implications in applications such as virtual screening for drug discovery, chemical toxicity prediction, and chemical library design.

Dataiku screenshot of the Dashboard for identifying molecules with similar structures.

Responsible AI Statement#

We value responsible development and deployment of AI in drug discovery. While our pIC50 prediction model offers valuable insights into potential bioactivity, it’s crucial to remember that it’s just one piece of the puzzle.

Decisions affecting lab experiments should not solely rely on pIC50. Additional critical factors, such as ADMET properties (absorption, distribution, metabolism, excretion, and toxicity), selectivity (targeting the intended protein without harming others), and pharmacokinetic properties (drug movement and action in the body) must be rigorously assessed.

This model is designed to augment — not replace — the invaluable expertise of subject matter experts and the essential role of lab work. Combining in silico predictions with thorough biological validation and expert judgment can ensure responsible and ethical prioritization of experiments, ultimately accelerating the development of safe and effective drugs.

Reproduce these Studies with Minimal Effort for your Own Data#

This template Solution is intended to help computational labs speed up their work in the early stages of drug discovery by automating data analytics and the quantitative structure-activity relationship (QSAR) process. A deeper technical walkthrough of the project can be found in the wiki to aid in reproducing this project. Roll-out and customization services can be offered on demand.