Tutorial | Compute statistics#

In the tutorial on preparing data, we saw the beginnings of how to use the Prepare recipe. We also saw how tools like the Analyze window can reveal the distribution of a column.

We can perform exploratory data analysis (EDA) in much greater depth in the Statistics tab. The Statistics tab allows you to generate statistical reports on your data by creating worksheets and cards within those worksheets.

Get started#

Objectives#

In this tutorial, you will:

  • Create a statistics worksheet.

  • Conduct a univariate analysis.

  • Perform a fit distribution test.

Prerequisites#

To complete this tutorial, you’ll need the following:

  • A Dataiku instance (version 9.0 or above).

Create the project#

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Compute Statistics.

  2. From the project homepage, click Go to Flow (or G + F).

Note

You can also download the starter project from this website and import it as a zip file.

Create a statistics worksheet#

Tip

A screencast at the end of the lesson recaps the actions described here.

Let’s create a worksheet with cards that perform common EDA tasks. For example, to see a side-by-side summary of the variables pages_visited, tshirt_category, and total in the orders_prepared dataset:

  1. Open the orders_prepared dataset.

  2. Navigate to the Statistics tab, and click + Create Your First Worksheet.

  3. Select the Univariate analysis box.

  4. Click on pages_visited, tshirt_category, and total to select them from the list of available variables in the left panel of the window.

  5. With these variables selected, click the plus button in the Variables to describe panel.

    Note

    After making a selection, Dataiku automatically selects the statistical options (in the right panel of the window) that are appropriate for the numerical variables (pages_visited and total) and the categorical variable (tshirt_category).

  6. Click Create Card.

Figure: Creating a Univariate analysis statistics card.

Inspect a statistics card#

Dataiku creates a card with one section for each variable. The chart type and descriptive statistics shown in each section depend on whether the variable is categorical or numerical.

For example, tshirt_category, a categorical variable, has a bar chart (or categorical histogram), while pages_visited and total each have a numerical histogram with a box plot. Likewise, a quantile table appears for the numerical variables, while a frequency table appears for the categorical variable.

Figure: Univariate analysis on three variables.
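To make the numerical versus categorical distinction concrete, here is a minimal sketch in plain Python of the two kinds of summary: a quantile table for a numerical variable and a frequency table for a categorical one. It is an illustration only, not how Dataiku computes the card. The function names and the sample values are hypothetical.

```python
import statistics
from collections import Counter


def describe_numerical(values):
    """Return a small quantile table (min, quartiles, max) plus the mean."""
    # "inclusive" treats the data as the whole population when interpolating.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.fmean(values),
    }


def describe_categorical(values):
    """Return a frequency table: category -> (count, share of all records)."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in counts.most_common()}


# Hypothetical values standing in for two columns of orders_prepared.
pages_visited = [1, 2, 3, 4, 5]
tshirt_category = ["White T-Shirt M", "Black T-Shirt F", "White T-Shirt M"]

numerical_summary = describe_numerical(pages_visited)
categorical_summary = describe_categorical(tshirt_category)
```

Quantiles only make sense for ordered numerical values, and counts per category only make sense for labels, which is why each section of the card shows a different table.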

By default, Dataiku computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the dropdown arrow next to Sampling and filtering.

Figure: Sampling and filtering tab in a statistics worksheet.
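The sampling method can change the statistics you see. The sketch below, in plain Python, compares a sample of the first records with a random sample on a hypothetical column that happens to be stored in sorted order. The helper names and the data are assumptions for illustration and do not reflect Dataiku's internal sampling code.

```python
import random
from itertools import islice


def head_sample(records, n):
    """Take the first n records, in storage order."""
    return list(islice(records, n))


def random_sample(records, n, seed=0):
    """Draw up to n records uniformly at random, without replacement."""
    records = list(records)
    return random.Random(seed).sample(records, min(n, len(records)))


# Hypothetical order totals, stored sorted from smallest to largest.
totals = list(range(1, 1001))

# The head sample only sees the smallest totals.
head_mean = sum(head_sample(totals, 100)) / 100
# The random sample draws from the whole range.
rand_mean = sum(random_sample(totals, 100)) / 100
```

When records are stored in a meaningful order (by date or by amount, for example), a sample of the first records may not represent the full dataset, so it is worth checking the sampling settings before drawing conclusions.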

Add a new card#

We may also be interested in checking whether the total variable follows an exponential distribution. The interactive statistics feature allows you to estimate the parameters of univariate probability distributions using the Fit Distribution card.

  1. Click the + New Card button at the top right of the worksheet window.

  2. Select the Fit curves & distributions option, and then the Fit Distribution card.

  3. Select total as the Variable and Exponential as the Distribution.

  4. Click Create Card.

Figure: Fit distribution analysis.

Dataiku creates a card that shows the exponential distribution fitted to the data. The card also includes a Q-Q plot that compares the quantiles of the data against the quantiles of the fitted distribution. If the data followed the fitted distribution, the points would lie close to the identity line. Points far from that line suggest that the data is unlikely to have been drawn from an exponential distribution.
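The idea behind the fit and the Q-Q plot can be sketched in a few lines of plain Python. The sketch assumes an exponential distribution with location fixed at zero and estimates its rate by maximum likelihood (one over the sample mean). The source does not say how Dataiku estimates the parameters, so treat this as an illustration rather than the card's actual method. The function names and the simulated data are hypothetical.

```python
import math
import random


def fit_exponential_rate(values):
    """Maximum-likelihood rate for an exponential with location 0: 1 / mean."""
    return len(values) / sum(values)


def qq_points(values, rate):
    """Pair each sorted observation with its theoretical exponential quantile."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position, strictly inside (0, 1)
        theoretical = -math.log(1.0 - p) / rate  # inverse exponential CDF
        points.append((theoretical, observed))
    return points


# Simulated data drawn from an exponential distribution (hypothetical values).
rng = random.Random(0)
sample = [rng.expovariate(0.02) for _ in range(1000)]

rate = fit_exponential_rate(sample)
points = qq_points(sample, rate)

# Distance of each point from the identity line (theoretical == observed).
gaps = [abs(theoretical - observed) for theoretical, observed in points]
```

Plotting `points` with the theoretical quantile on one axis and the observed value on the other gives a Q-Q plot. For data that really is exponential, the gaps stay small relative to the values, except possibly in the sparse upper tail.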

Note

To learn about the full capabilities in the Statistics tab, see the Interactive statistics section of the reference documentation.

See a screencast covering these steps

What’s next?#

This was just a brief introduction to the kinds of statistical analyses we can easily perform in Dataiku.

To dive deeper into statistics, see Tutorial | Interactive statistics.