# Tutorial | Interactive statistics

## Get started

Exploratory data analysis in Dataiku often begins with the Analyze tool and the Charts tab. When you’re ready to dive deeper, you can move on to the Statistics tab.

The Statistics tab of a dataset allows you to generate statistical reports on your data by creating cards within a worksheet.

### Objectives

In this tutorial, you will create a variety of statistical reports, including:

- Auto-suggested analyses
- Descriptive univariate and bivariate analyses
- Fit curves and distributions
- Multivariate analyses, such as principal component analysis and correlation matrices
- Inferential hypothesis testing
- Recipe-generated statistics

### Prerequisites

A Dataiku instance.

### Create the project

1. From the Dataiku Design homepage, click **+ New Project > DSS tutorials > ML Practitioner > Interactive Statistics**.
2. From the project homepage, click **Go to Flow** (or `g` + `f`).

Note

You can also download the starter project from this website and import it as a zip file.

### Use case summary

This tutorial performs EDA tasks on the wine quality dataset from the UCI Machine Learning Repository [1].

The original dataset consists of 12 features (or variables). In this tutorial, we have created an additional column through the Stack recipe for a variable *type*. It indicates whether an observation belongs to the red or white wine category.

The *type* and *quality* variables in the dataset are treated as categorical variables, while all other variables are numerical.

Tip

Once you have created the project above, feel free to complete the following walkthroughs of various analyses and tests in any order. They can be completed independently of one another.

## Use automatically suggested analyses

A statistics worksheet allows us to create many kinds of statistical reports. One option is to allow Dataiku to automatically suggest analyses for you.

This runs a smart assistant to discover patterns in the data by suggesting analyses on variables of interest. This is particularly useful when there are many columns in the dataset or when you need some notion of where to begin your analysis.

Let’s try it now!

1. From the Flow, open the **winequality** dataset.
2. Navigate to the **Statistics** tab of the dataset.
3. Click **+ Create Your First Worksheet**.
4. Select **Automatically suggest analyses** from the window of possible card types.
5. Select the card(s) that interest you, and click **Create Selected Cards**.

Note

See the reference documentation to learn more about Assisted Data Exploration.

## Run univariate and bivariate analyses

Of course, you can also manually select the statistical report you wish to generate. A common place to begin is exploring distributions of individual or pairs of variables with descriptive statistics.

### Univariate analysis

Univariate analysis is used to compare the data distribution of individual variables.

Let’s use it to see a side-by-side comparison of the variables *density*, *alcohol*, and *type*.

Important

As covered in Concept | Variable types for interactive statistics, remember that the \(\boldsymbol{\#}\) symbol denotes a numerical variable and the \(\mathrm{\mathbf{A}}\) denotes a categorical variable.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Univariate analysis**.
3. From the list of available variables on the left, drag and drop **density**, **alcohol**, and **type** to the **variables to describe** section. Alternatively, you can select a variable on the left and click the plus icon.
4. Click **Create Card** to let Dataiku create the analyses selected by default.

Important

Dataiku automatically selects the statistical **Options** to the right that are appropriate for the numerical variables (*density* and *alcohol*) and the categorical variable (*type*). You can deselect any of these options if needed.

In the worksheet, Dataiku creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical.

In this case, the categorical variable *type* displays a categorical histogram, while *density* and *alcohol* each display a numerical histogram and box plot insert. Also, a quantile table is applied to the numerical variables, while a frequency table is applied to the categorical variable.
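If you would like to see what these tables contain outside the worksheet, the following is a minimal Python sketch using pandas. It assumes synthetic stand-in data with the same column names (it does not load the real winequality dataset), and it is not a description of how Dataiku computes the card.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: NOT the real winequality dataset).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "density": rng.normal(0.995, 0.003, n),
    "alcohol": rng.normal(10.5, 1.2, n),
    "type": rng.choice(["red", "white"], size=n, p=[0.25, 0.75]),
})

# Quantile table for the numerical variables.
quantiles = df[["density", "alcohol"]].quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Frequency table (counts and shares) for the categorical variable.
frequencies = df["type"].value_counts().to_frame("count")
frequencies["share"] = frequencies["count"] / len(df)

print(quantiles)
print(frequencies)
```

The quantile levels chosen here (minimum, quartiles, maximum) are an illustrative choice; the card may report a different set.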

Important

By default, Dataiku computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the dropdown arrow next to **Sampling and filtering**.

Note

For more information, see Univariate Analysis in the reference documentation.

### Bivariate analysis

Bivariate analysis lets us examine the data distribution for pairs of variables simultaneously.

In this section, let’s examine the response variable (*type*) for each factor variable (*density* and *alcohol*).

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Bivariate analysis**.
3. From the list of available variables on the left, drag and drop **density** and **alcohol** to the selected factors box.
4. Drag the **type** variable to the Response section.
5. Click **Create Card**.

### Refine card visualizations

Dataiku creates a card with one section for each factor-response pair.

Notice that each descriptive statistical visualization (e.g., histogram) in the card shows a pencil icon when you hover over it. Clicking the icon opens additional configuration options. For example, the pencil for a histogram lets you select a binning mode and a maximum number of bins.

1. In the **type by density** histogram, click the **pencil** icon to adjust settings.
2. Set the density binning mode to **Fixed nb. of bins**.
3. Set the **Nb. of bins** to `100`.
4. Click **Apply**.
5. Repeat the same steps for the **type by alcohol** histogram.

Note

For more information, see Bivariate Analysis in the reference documentation.

## Fit curves and distributions

Another aspect of descriptive statistics involves modeling the probability distribution of your dataset. Three cards support these kinds of analysis for numerical variables.

Tip

Review the associated concept article if this is unfamiliar to you.

### Fit distributions

Dataiku allows you to estimate the parameters of univariate probability distributions using the **Fit Distribution** card.

Let’s attempt to fit the normal and beta distributions to the dataset, considering only the *alcohol* variable.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Fit curves & distributions** and then the **Fit Distribution** card.
3. Select **alcohol** as the variable.
4. Select **Normal** as the distribution.
5. Click **+Add a Distribution** to add the Beta distribution.
6. Click **Create Card**.

Dataiku creates a card that shows the normal and beta probability density functions fit to the data.

There is also a Q-Q plot that compares the quantiles of the data to the quantiles of the fitted distributions. Points that fall far from the identity line suggest that the data is unlikely to have been drawn from the corresponding distribution.

Additionally, the card includes goodness of fit metrics and the estimated parameters for the normal and beta distributions.
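For readers who want to experiment with the same idea in code, here is a rough sketch using SciPy. It rests on several assumptions: the data is synthetic, the beta distribution's location and scale are fixed to a slightly padded data range (a simplification chosen here for a stable fit), and the Kolmogorov-Smirnov statistic stands in for a goodness of fit metric. The card's own estimation method and metrics may differ.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the alcohol column (assumption, not the real data).
rng = np.random.default_rng(1)
alcohol = rng.normal(10.5, 1.2, 1000)

# Normal distribution: maximum-likelihood estimates of mean and std.
mu, sigma = stats.norm.fit(alcohol)

# Beta distribution: fix loc/scale to a padded data range (assumption),
# then estimate the two shape parameters a and b.
margin = 0.01 * (alcohol.max() - alcohol.min())
lo = alcohol.min() - margin
width = (alcohol.max() - alcohol.min()) + 2 * margin
a, b, _, _ = stats.beta.fit(alcohol, floc=lo, fscale=width)

# One possible goodness of fit measure: the Kolmogorov-Smirnov test.
ks_norm = stats.kstest(alcohol, "norm", args=(mu, sigma))
ks_beta = stats.kstest(alcohol, "beta", args=(a, b, lo, width))

# Q-Q data for the normal fit: sorted sample vs. theoretical quantiles.
probs = (np.arange(1, len(alcohol) + 1) - 0.5) / len(alcohol)
theoretical = stats.norm.ppf(probs, loc=mu, scale=sigma)
empirical = np.sort(alcohol)

print(f"normal: mu={mu:.3f}, sigma={sigma:.3f}, KS={ks_norm.statistic:.4f}")
print(f"beta: a={a:.3f}, b={b:.3f}, KS={ks_beta.statistic:.4f}")
```

Plotting `empirical` against `theoretical` gives a Q-Q plot for the normal fit; points close to the identity line indicate a reasonable fit.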

### 2D fit distributions

Similarly, the **2D Fit Distributions** card is available for visualizing and estimating bivariate probability distributions on your dataset.

Let’s attempt to fit a 2D kernel density estimate (KDE) to the dataset, considering only the *density* and *alcohol* variables.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Fit curves & distributions** and then the **2D Fit Distribution** card.
3. Select **density** as the X variable.
4. Select **alcohol** as the Y variable.
5. Click **Create Card**, keeping the 2D KDE and relative bandwidth defaults.

Tip

Instead of the defaults, you can increase the relative bandwidth values to make the KDE plot smoother or decrease them to make the plot less smooth.
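To get a feel for how bandwidth affects a 2D KDE, you could try a sketch like the one below with SciPy's `gaussian_kde`. The data is synthetic, and the `bw_method` scalar is SciPy's own bandwidth factor; treating it as analogous to the card's relative bandwidth setting is an assumption made here for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption, not the real winequality dataset).
rng = np.random.default_rng(2)
n = 400
alcohol = rng.normal(10.5, 1.2, n)
density = 0.995 - 0.0015 * (alcohol - 10.5) + rng.normal(0, 0.001, n)

# gaussian_kde expects an array of shape (n_dimensions, n_points).
points = np.vstack([density, alcohol])

# A larger bandwidth factor gives a smoother estimate; a smaller one
# follows the individual observations more closely.
kde_smooth = stats.gaussian_kde(points, bw_method=0.6)
kde_rough = stats.gaussian_kde(points, bw_method=0.15)

# Evaluate both estimates at the sample mean.
center = np.array([[density.mean()], [alcohol.mean()]])
print("smooth estimate at center:", kde_smooth(center)[0])
print("rough estimate at center:", kde_rough(center)[0])
```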

Note

For more information, see Fit curves and distributions in the reference documentation.

### Fit curves

Finally, let’s use the **Fit Curve** card to find the best line or curve that models the relationship between the *free sulfur dioxide* and *total sulfur dioxide* variables.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Fit curves & distributions** and then the **Fit Curve** card.
3. Select **free sulfur dioxide** as the X variable.
4. Select **total sulfur dioxide** as the Y variable.
5. Fit a polynomial curve of degree `1`.
6. Click **Create Card**.

It appears that higher values of the *free sulfur dioxide* variable are associated with higher values of the *total sulfur dioxide* variable, and vice versa. This suggests that the two variables are positively correlated. We can confirm this by finding the correlation coefficient between these variables.
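A degree-1 polynomial fit is an ordinary least-squares line. As a hedged illustration, the sketch below fits one with NumPy on synthetic data (the variable names and the true slope of 2.0 are made up for the example) and computes R² as a simple measure of fit.

```python
import numpy as np

# Synthetic stand-in data (assumption, not the real winequality dataset).
rng = np.random.default_rng(3)
free_so2 = rng.uniform(5, 80, 300)
total_so2 = 30 + 2.0 * free_so2 + rng.normal(0, 15, 300)

# Fit a polynomial of degree 1: total_so2 ~ slope * free_so2 + intercept.
slope, intercept = np.polyfit(free_so2, total_so2, deg=1)

# Coefficient of determination (R^2) for the fitted line.
predicted = slope * free_so2 + intercept
ss_res = np.sum((total_so2 - predicted) ** 2)
ss_tot = np.sum((total_so2 - total_so2.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

A positive slope corresponds to the upward trend described above; raising `deg` would fit higher-degree polynomial curves.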

Note

For more information, see Fit curves and distributions in the reference documentation.

## Create a correlation matrix

The **Correlation matrix** card allows you to examine the degree to which pairwise relationships may exist for variables in the dataset.

Tip

Review the associated concept article if this is unfamiliar to you.

Let’s create the card.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Multivariate analysis** and then the **Correlation matrix** card.
3. Move all 11 numerical variables from the left to the selected variables column.
4. Switch to the **Pearson** correlation coefficient option.
5. Click **Create Card**.

The correlation matrix card displays a heatmap with the pairwise correlation values in the matrix cells. Of all the variables in the dataset, *free sulfur dioxide* and *total sulfur dioxide* have the largest positive correlation (0.721). This confirms the observation that we made from finding the fit curve.

Also, notice that the variables *density* and *alcohol* have the largest negative correlation (-0.687) in the dataset. This negative correlation implies that wines having higher density values tend to have lower alcohol content.
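The same kind of matrix can be sketched in pandas. The example below uses three synthetic columns (an assumption; the values are generated so that *alcohol* and *density* are negatively related) and then looks up the most negative off-diagonal entry, much as you would when reading the heatmap.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption, not the real winequality dataset).
rng = np.random.default_rng(4)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
df = pd.DataFrame({
    "alcohol": alcohol,
    "density": 0.995 - 0.0015 * (alcohol - 10.5) + rng.normal(0, 0.002, n),
    "pH": rng.normal(3.2, 0.15, n),
})

# Pairwise Pearson correlation coefficients.
corr = df.corr(method="pearson")

# Ignore the diagonal (always 1) and find the most negative pair.
values = corr.to_numpy().copy()
np.fill_diagonal(values, np.nan)
i, j = np.unravel_index(np.nanargmin(values), values.shape)
most_negative = (corr.index[i], corr.columns[j], values[i, j])

print(corr.round(3))
print("most negative pair:", most_negative)
```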

Note

For more information, see Correlation Matrix in the reference documentation.

## Analyze effects of dimensionality reduction with the PCA card

When working with a dataset having many variables, we may be interested in analyzing the effects of using a reduced number of variables (or dimensions) of the data. For example, we may choose to explore the structure of the *winequality* dataset in two dimensions.

Dataiku enables you to analyze the effects of dimensionality reduction using a feature extraction method called Principal Component Analysis, or PCA.

Tip

Review the associated concept article if this is unfamiliar to you.

Let’s use the **Principal Component Analysis** card to represent the *winequality* dataset in two dimensions.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Multivariate analysis** and then the **Principal Component Analysis** card.
3. Move all 11 numerical variables from the left to the selected variables column.
4. Click **Create Card**.

This produces a few outputs:

| Visualization | Description |
|---|---|
| Scree plot | It shows that using only the first two principal components retains about 50.2% of the variance in the dataset. To retain a variance of at least 90% (the red vertical line), you must use a minimum of 7 principal components to represent the data. |
| Scatter plot | It shows the data projected onto the first two principal components. |
| Loading plot | It shows how strongly each of the 11 numerical variables influences the first two principal components. Vectors forming a small angle, such as volatile acidity and fixed acidity, are likely to be positively correlated. Vectors meeting at a right (or nearly right) angle, such as density and total sulfur dioxide, are likely to have little or no correlation. When two vectors form a large angle, such as residual sugar and pH, they are likely to be negatively correlated. |
| Principal components heatmap | A matrix of the principal component loading vectors. For instance, the first column of the matrix corresponds to the loading vector of |

Note

For more information about the PCA card, see Concept | Principal Component Analysis (PCA).
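To make the scree plot's logic concrete, here is a small NumPy sketch of PCA via the singular value decomposition. It assumes synthetic data with six columns and standardizes each column before the decomposition; whether the card standardizes variables in the same way is not stated here, so treat that step as an assumption.

```python
import numpy as np

# Synthetic stand-in data (assumption): 6 variables driven by 2 latent factors.
rng = np.random.default_rng(5)
n, p = 500, 6
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X = latent @ mixing + 0.5 * rng.normal(size=(n, p))

# Standardize each column, then decompose with the SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of variance explained by each principal component (scree plot data).
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 90%.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# Projection onto the first two components (scatter plot data).
scores_2d = Z @ Vt[:2].T

# Loading vectors: column k holds the loadings of component k.
loadings = Vt.T

print("explained variance ratio:", explained.round(3))
print("components needed for 90%:", n_for_90)
```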

## Perform statistical tests

We can make data-driven conclusions from our *winequality* dataset using Dataiku’s built-in statistical tests. These statistical tests are a form of inferential statistics that use a sample to make predictions about a population. In other words, these tests allow you to test hypotheses about a population using a sample.

Tip

Review the associated concept article if this is unfamiliar to you.

### One-sample Student t-test

One-sample tests compare the location parameters or distribution of a population to a hypothesis using one sample. Other statistical tests for numerical variables may use two or more samples to test equality or similarity between populations.

Let’s determine whether the mean of the underlying population for the *density* variable is equal to a specified value. To do this, we will use the one-sample **Student t-test** card.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Statistical tests** and then the **Student t-test** card from the **One-sample test** panel.
3. Select **density** as the variable.
4. Type `0.995` as the value for the Hypothesized mean.
5. Click **Create Card**.

The card displays a summary of the *density* variable, including the:

- Mean
- Tested hypothesis
- Results of the test
- Plot of the distribution for the test statistic

The card also displays a conclusion from the test. In this case, it concludes: “The population mean of density is different from 0.995.”

Similarly, you can test whether the median of the population for the *density* variable is equal to a specified value using the Sign test (one-sample).
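As a point of comparison, a one-sample t-test can be sketched with SciPy as follows. The sample is synthetic and the 0.05 significance level is an assumption chosen for the example, so the outcome here says nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the density column (assumption, not the real data).
rng = np.random.default_rng(6)
density = rng.normal(0.9947, 0.003, 1000)

# H0: the population mean of density equals 0.995 (two-sided test).
hypothesized_mean = 0.995
result = stats.ttest_1samp(density, popmean=hypothesized_mean)

# The same statistic computed by hand, for transparency.
standard_error = density.std(ddof=1) / np.sqrt(len(density))
manual_t = (density.mean() - hypothesized_mean) / standard_error

alpha = 0.05  # significance level (assumption)
reject_null = result.pvalue < alpha

print(f"t={result.statistic:.3f}, p-value={result.pvalue:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```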

### Categorical Chi-square independence test

All of Dataiku’s statistical tests are performed on numerical variables except the **Chi-square Independence Test**.

Let’s try it to see if two categorical variables in the *winequality* dataset are independent.

1. From the existing statistics worksheet, click **+ New Card** at the top right.
2. Select **Statistical tests** and then the **Chi-square Independence Test** card from the **Categorical test** panel.
3. Select **quality** as Variable 1.
4. Select **type** as Variable 2.
5. Click **Create Card**.

The resulting card displays the tested hypothesis and the results of the test.

As with the other statistical test cards, the Chi-square independence test card also provides a conclusion. In this case, the result is that “Variables quality and type are not independent.”
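The test works on a contingency table of the two categorical variables. The sketch below builds one with pandas and runs the test with SciPy. The data is synthetic, with quality scores limited to three made-up levels whose distribution depends on the wine type, purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumption, not the real winequality dataset).
rng = np.random.default_rng(7)
n = 1000
wine_type = rng.choice(["red", "white"], size=n, p=[0.25, 0.75])
quality = np.where(
    wine_type == "red",
    rng.choice([5, 6, 7], size=n, p=[0.50, 0.35, 0.15]),
    rng.choice([5, 6, 7], size=n, p=[0.30, 0.45, 0.25]),
)

# Contingency table: rows are quality levels, columns are wine types.
table = pd.crosstab(quality, wine_type)

# H0: quality and type are independent.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p-value={p_value:.4g}")
```

The `expected` array holds the counts you would expect in each cell if the two variables were independent; the test statistic measures how far the observed counts depart from them.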

Note

For more information, see Statistical Tests in the reference documentation.

## Leverage the Generate statistics recipe

Another way to analyze statistics in Dataiku is through the Generate statistics recipe, which can be used to embed statistical tests in your Flow.

Note

This section of the tutorial requires Dataiku 12.6+.

### Export as recipe

From the statistics worksheet, we can export the Chi-square independence test card as a recipe.

1. Click on the **More options** menu of the Chi-square independence test card.
2. Select **Export as recipe…**.
3. Select **Create Recipe**.
4. Review the pre-filled fields, and then **Run** the recipe.

The recipe and output dataset should be visible in the Flow.

### Create the recipe from the Flow

You can also create a Generate statistics recipe from the Actions panel in the Flow.

1. Select the **winequality** dataset.
2. Click **Generate statistics** from the **Actions** panel.
3. Navigate to **Statistical test: One-sample > Shapiro-Wilk Test**.
4. Select **Create Recipe**.

### Configure the Generate statistics recipe

One benefit of the Generate statistics recipe is that you can perform multiple statistical tests at once. Let’s configure three normality tests in this recipe.

1. Since this test works best with fewer than 5000 records, set the **Sampling method** to **Random (approx. ratio)**.
2. Set the **% to use** to `10` and set the **Random seed** to `1111`.
3. For the **Split** column, choose **Type**.
4. For the **Test variable**, choose **alcohol**.
5. Select **+ Add a Statistical Test**, and choose **pH**.
6. Select **+ Add a Statistical Test**, and choose **sulphates**.
7. **Run** the recipe and open the output dataset.

You’ll be able to see six tests and their conclusions. It should appear that all tests but one reject the null hypothesis that the data is normally distributed.
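The six results come from crossing three test variables with the two values of the split column. A rough equivalent in Python, using SciPy's Shapiro-Wilk test on synthetic data, might look like the sketch below. The column distributions, the generator seed, and the 0.05 threshold are all assumptions for illustration, so the conclusions will not match the recipe output.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumption, not the real winequality dataset).
rng = np.random.default_rng(1111)
n = 600
df = pd.DataFrame({
    "type": rng.choice(["red", "white"], size=n),
    "alcohol": rng.lognormal(2.3, 0.12, n),   # right-skewed
    "pH": rng.normal(3.2, 0.15, n),           # normal by construction
    "sulphates": rng.gamma(4.0, 0.13, n),     # right-skewed
})

alpha = 0.05  # significance level (assumption)
rows = []
# One Shapiro-Wilk test per (split value, test variable) combination.
for wine_type, group in df.groupby("type"):
    for column in ["alcohol", "pH", "sulphates"]:
        statistic, p_value = stats.shapiro(group[column])
        rows.append({
            "type": wine_type,
            "variable": column,
            "statistic": statistic,
            "p_value": p_value,
            "reject_normality": p_value < alpha,
        })

results = pd.DataFrame(rows)
print(results)
```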

Note

Another benefit of the Generate statistics recipe is that you can periodically run the recipe using scenarios.

## What’s next?

Congratulations! You’ve tried out a range of statistical analyses and tests using the native **Statistics** tab. Take the next steps by trying out further tests and customizing the outputs on your own.

Note

When more flexibility is required, you can create Code notebooks to explore a dataset with your own code.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Interactive Statistics. When ready, challenge yourself to earn a certification!