Compute a subpopulation analysis for white-box ML¶

Fairness, transparency, and explainability are serious concerns faced by those building AI systems today.

Dataiku DSS empowers data scientists with the tools to build white-box machine learning models. One of these features designed to enable white-box machine learning is subpopulation analysis.

The Context for Subpopulation Analysis¶

For many use cases, the ability to explain or interpret the results of a machine learning model may be just as important as the accuracy of the predictions themselves.

A machine learning model is often labelled a black-box if its inner workings are largely obscured. A white-box model, on the other hand, allows for a more understandable relationship between input features and output predictions.

Subpopulation analysis is an example of a white-box ML technique.

Subpopulation Analysis with Dataiku DSS¶

The nature of Dataiku DSS lends itself to white-box machine learning in a number of ways. The inherently collaborative nature of DSS and the ability to trace the lineage of data through a Flow are two fundamental properties that can increase understanding and trust in a model.

More specifically, Dataiku DSS has features, such as partial dependence plots and subpopulation analysis, designed to make it easier to interpret model predictions.

You can compute a subpopulation analysis after building a regression or binary classification model trained in Python, such as scikit-learn or keras. The purpose of this analysis is to assess whether a model behaves identically across subpopulations.

This assessment is particularly important in a fairness study where you want to examine if a model reproduces biases in the training data. This tool can help users more easily weed out unintended model biases and create a more transparent and fair AI deployment.

This video below gives a brief introduction to the why and how of subpopulation analysis with Dataiku DSS.

Interpreting Subpopulation Analysis¶

Subpopulation analysis computes a table of various statistics that you can compare across subpopulations that you define. For example, in a binary classification model, computing a subpopulation analysis on a variable such as gender allows you to examine the extent to which model prediction probabilities differ between male and female observations. In all cases, deciding whether a difference constitutes a significant finding with respect to fairness is left to you.

Does the model behave similarly for male and female subpopulations?¶

What’s next?¶

Consult the reference documentation for more information about how to conduct subpopulation analysis with Dataiku DSS.
The reference documentation also details how to create a partial dependence plot, which is another DSS tool for white-box machine learning.