Tutorial | Interactive statistics#

Get started#

Exploratory data analysis in Dataiku often begins with the Analyze tool and the Charts tab. When you’re ready to dive deeper, you’ll often want to advance to the Statistics tab.

The Statistics tab of a dataset allows you to generate statistical reports on your data by creating cards within a worksheet.

Objectives#

In this tutorial, you will create a variety of statistical reports, including:

  • Auto-suggested analyses

  • Descriptive univariate and bivariate analyses

  • Fit curves and distributions

  • Multivariate analyses, such as principal component analysis and correlation matrices

  • Inferential hypothesis testing

Prerequisites#

  • A Dataiku instance.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > ML Practitioner > Interactive Statistics.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

This tutorial performs EDA tasks on the wine quality dataset from the UCI Machine Learning Repository [1].

The original dataset consists of 12 features (or variables). In this tutorial, we have created an additional column through the Stack recipe for a variable type. It indicates whether an observation belongs to the red or white wine category.

The type and quality variables in the dataset are treated as categorical variables, while all other variables are numerical.

Dataiku screenshot of the Flow of the wine quality project.

Tip

Once you have created the project above, feel free to complete the following walkthroughs of various analyses and tests in any order. They can be completed independent of one another.

Use automatically suggested analyses#

A statistics worksheet allows us to create many kinds of statistical reports. One option is to allow Dataiku to automatically suggest analyses for you.

This runs a smart assistant to discover patterns in the data by suggesting analyses on variables of interest. This is particularly useful when there are many columns in the dataset or when you need some notion of where to begin your analysis.

Let’s try it now!

  1. From the Flow, open the winequality dataset.

  2. Navigate to the Statistics tab of the dataset.

  3. Click + Create Your First Worksheet.

  4. Select Automatically suggest analyses from the window of possible card types.

  5. Select card(s) interesting to you, and click Create Selected Cards.

Dataiku screenshot of the dialog for creating suggested statistics cards.

Note

See the reference documentation to learn more about Assisted Data Exploration.

Run univariate and bivariate analyses#

Of course, you can also manually select the statistical report you wish to generate. A common place to begin is exploring distributions of individual or pairs of variables with descriptive statistics.

Univariate analysis#

Univariate analysis is used to compare the data distribution of individual variables.

Let’s use it to see a side-by-side comparison of the variables density, alcohol, and type.

Important

As covered in Concept | Variable types for interactive statistics, remember that the \(\boldsymbol{\#}\) symbol denotes a numerical variable and the \(\mathrm{\mathbf{A}}\) denotes a categorical variable.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Univariate analysis.

  3. From the list of available variables on the left, drag and drop density, alcohol, and type to the variables to describe section. Alternatively, you can select a variable on the left and click the plus icon.

  4. Click Create Card to let Dataiku create the analyses selected by default.

Dataiku screenshot of the dialog to create a univariate analysis card.

Important

Dataiku automatically selects the statistical Options to the right that are appropriate for the numerical variables (density and alcohol) and the categorical variable (type). You can deselect any of these options if needed.

In the worksheet, Dataiku creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical.

In this case, the categorical variable type displays a categorical histogram, while density and alcohol each display a numerical histogram and box plot insert. Also, a quantile table is applied to the numerical variables, while a frequency table is applied to the categorical variable.

Dataiku screenshot of the resulting univariate analysis on three variables.

Important

By default, Dataiku computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the dropdown arrow next to Sampling and filtering.

Note

For more information, see Univariate Analysis in the reference documentation.

Bivariate analysis#

Bivariate analysis lets us examine the data distribution for pairs of variables simultaneously.

In this section, let’s examine the response variable (type) for each factor variable (density and alcohol).

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Bivariate analysis.

  3. From the list of available variables on the left, drag and drop density and alcohol to the selected factors box.

  4. Drag the type variable to the Response section.

  5. Click Create Card.

Dataiku screenshot of the dialog to create a bivariate analysis card.

Refine card visualizations#

Dataiku creates a card with one section for each factor-response pair.

Notice that each descriptive statistical visualization(e.g. histogram) in the card has a pencil icon that appears when you hover over it that lets you choose additional configurations. For example, clicking the pencil for a histogram plot enables you to select a binning mode and maximum number of bins.

  1. In the type by density histogram, click the pencil icon to adjust settings.

  2. Set the density binning mode to Fixed nb. of bins.

  3. Set the Nb. of bins to 100.

  4. Click Apply.

  5. Repeat the same steps for the type by alcohol histogram.

Dataiku screenshot of two side by side histograms with a fixed number of one hundred bins.

Note

For more information, see Bivariate Analysis in the reference documentation.

Fit curves and distributions#

Another aspect of descriptive statistics involves modeling the probability distribution of your dataset. Three cards support these kinds of analysis for numerical variables.

Tip

Review the associated concept article if this is unfamiliar to you.

Fit distributions#

Dataiku allows you to estimate the parameters of univariate probability distributions using the Fit Distribution card.

Let’s attempt to fit the normal and beta distributions to the dataset, considering only the alcohol variable.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Fit curves & distributions and then the Fit Distribution card.

  3. Select alcohol as the variable.

  4. Select Normal as the distribution.

  5. Click +Add a Distribution to add the Beta distribution.

  6. Click Create Card.

Dataiku screenshot of the dialog to create a fit distribution card.

Dataiku creates a card that shows the normal and beta probability density functions fit to the data.

There is also a Q-Q plot that compares the quantiles of the data to the quantiles of the fitted distributions. Observing points that are far from the identity line suggests that the data could not have been drawn from either distribution.

Additionally, the card includes goodness of fit metrics and the estimated parameters for the normal and beta distributions.

Dataiku screenshot of the fit distribution on alcohol.

2D fit distributions#

Similarly, the 2D Fit Distributions card is available for visualizing and estimating bivariate probability distributions on your dataset.

Let’s attempt to fit a 2D kernel density estimate (KDE) to the dataset, considering only the density and alcohol variables.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Fit curves & distributions and then the 2D Fit Distribution card.

  3. Select density as the X variable.

  4. Select alcohol as the Y variable.

  5. Click Create Card, keeping the 2D KDE and relative bandwidth defaults.

Dataiku screenshot of the dialog to create a 2d fit distribution card.

Tip

Instead of the defaults, you can increase the relative bandwidth values to make the KDE plot smoother or decrease them to make the plot less smooth.

Dataiku screenshot of a 2D fit distribution

Note

For more information, see Fit curves and distributions in the reference documentation.

Fit curves#

Finally, let’s use the Fit Curve card to find the best line or curve that models the relationship between the free sulfur dioxide and total sulfur dioxide variables.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Fit curves & distributions and then the Fit Curve card.

  3. Select free sulfur dioxide as the X variable.

  4. Select total sulfur dioxide as the Y variable.

  5. Fit a polynomial curve of degree 1.

  6. Click Create Card.

Dataiku screenshot of the dialog to create a fit curve card.

It appears that an increase in the value of the free sulfur dioxide variable results in an increase in the value of the total sulfur dioxide variable and vice-versa. This indicates that both variables are positively correlated. We can confirm this by finding the correlation coefficient between these variables.

Dataiku screenshot of a fit curve card.

Note

For more information, see Fit curves and distributions in the reference documentation.

Create a correlation matrix#

The Correlation matrix card allows you to examine the degree to which pairwise relationships may exist for variables in the dataset.

Tip

Review the associated concept article if this is unfamiliar to you.

Let’s create the card.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Multivariate analysis and then the Correlation matrix card.

  3. Move all 11 numerical variables from the left to the selected variables column.

  4. Switch to the Pearson correlation coefficient option.

  5. Click Create Card.

Dataiku screenshot of the dialog to create a correlation matrix card.

The correlation matrix card displays a heatmap with the pairwise correlation values in the matrix cells. Of all the variables in the dataset, free sulfur dioxide and total sulfur dioxide have the largest positive correlation (0.721). This confirms the observation that we made from finding the fit curve.

Also, notice that the variables density and alcohol have the largest negative correlation (-0.687) in the dataset. This negative correlation implies that wines having higher density values tend to have lower alcohol content.

Dataiku screenshot of the resulting correlation matrix.

Note

For more information, see Correlation Matrix in the reference documentation.

Analyze effects of dimensionality reduction with the PCA card#

When working with a dataset having many variables, we may be interested in analyzing the effects of using a reduced number of variables (or dimensions) of the data. For example, we may choose to explore the structure of the winequality dataset in two dimensions.

Dataiku enables you to analyze the effects of dimensionality reduction using a feature extraction method called Principal Component Analysis, or PCA.

Tip

Review the associated concept article if this is unfamiliar to you.

Let’s use the Principal Component Analysis card to represent the winequality dataset in two dimensions.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Multivariate analysis and then the Principal Component Analysis card.

  3. Move all 11 numerical variables from the left to the selected variables column.

  4. Click Create Card.

Dataiku screenshot of the dialog to create a PCA card.

This produces a few outputs:

Visualization

Description

Scree plot

It shows that using only the first two principal components retains about 50.2% of the variance in the dataset. To retain a variance of at least 90% (the red vertical line), you must use a minimum of 7 principal components to represent the data.

Scatter plot

It shows the data projected onto the first two principal components.

Loading plot

It shows how strongly each of the 11 numerical variables influences the first two principal components. Vectors forming a small angle, such as volatile acidity and fixed acidity, are likely to be positively correlated. Vectors meeting in an orthogonal angle (or nearly orthogonal angle), such as density and total sulfur dioxide, are not likely to be correlated or have very little correlation. When two vectors form a large angle, such as residual sugar and pH, they are likely to be negatively correlated.

Principal components heatmap

A matrix of the principal component loading vectors. For instance, the first column of the matrix corresponds to the loading vector of PC1, or a vector of the coefficients used in the linear transformation of the data set to the first principal component dimension.

Principal component analysis with plots of 11 numerical variables.

Note

For more information about the PCA card, see Concept | Principal Component Analysis (PCA).

Perform statistical tests#

We can make data-driven conclusions from our winequality dataset using Dataiku’s built-in statistical tests. These statistical tests are a form of inferential statistics that use a sample to make predictions about a population. In other words, these tests allow you to test hypotheses about a population using a sample.

Tip

Review the associated concept article if this is unfamiliar to you.

One-sample Student t-test#

One-sample tests compare the location parameters or distribution of a population to a hypothesis using one sample. Other statistical tests for numerical variables may use two or more samples to test equality or similarity between populations.

Let’s determine whether the mean of the underlying population for the density variable is equal to a specified value. To do this, we will use the one-sample Student t-test card.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Statistical tests and then the Student t-test card from the One-sample test panel.

  3. Select density as the variable.

  4. Type 0.995 as the value for the Hypothesized mean.

  5. Click Create Card.

Dataiku screenshot of the dialog to create a PCA card.

The card displays a summary of the density variable, including the:

  • Mean

  • Tested hypothesis

  • Results of the test

  • Plot of the distribution for the test statistic

Dataiku screenshot of the student t test on density card.

The card also displays a conclusion from the test. In this case, it concludes: “The population mean of density is different from 0.995.”

Similarly, you can test whether the median of the population for the density variable is equal to a specified value using the Sign test (one-sample).

Categorical Chi-square independence test#

All of Dataiku’s statistical tests are performed on numerical variables except the Chi-square Independence Test.

Let’s try it to see if two categorical variables in the winequality dataset are independent.

  1. From the existing statistics worksheet, click + New Card at the top right.

  2. Select Statistical tests and then the Chi-square Independence Test card from the Categorical test panel.

  3. Select quality as Variable 1.

  4. Select type as Variable 2.

  5. Click Create Card.

Dataiku screenshot of the dialog to create a chi-square independence test card.

The resulting card displays the tested hypothesis and the results of the test.

Similar to all of our statistical test results, the Chi-square independence test card also provides a conclusion. In this case, the result is that “Variables quality and type are not independent.”

Dataiku screenshot of the resulting chi square independence test.

Note

For more information, see Statistical Tests in the reference documentation.

What’s next?#

Congratulations! You’ve tried out a range of statistical analyses and tests using the native Statistics tab. Take the next steps by trying out further tests and customizing the outputs on your own.

Note

Also note that when more flexibility is required, you can create Code notebooks to explore a dataset with your own code.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Interactive Statistics. When ready, challenge yourself to earn a certification!