Tutorial | Compute statistics (Core Designer part 5)
In the tutorial on preparing data, we saw the beginnings of how to use the Prepare recipe. We also saw how tools like the Analyze window can reveal the distribution of a column.
We can perform exploratory data analysis (EDA) in much greater depth in the Statistics tab. The Statistics tab allows you to generate statistical reports on your data by creating Worksheets, and Cards within those worksheets.
In this tutorial, you will:
Create a statistics worksheet.
Conduct a univariate analysis.
Fit a probability distribution to a variable.
If you skipped the previous sections, you need to complete the tutorial on preparing data so you have the orders_prepared dataset.
Create a statistics worksheet
A screencast at the end of the lesson recaps the actions described here.
Let’s create a worksheet with cards that perform common EDA tasks. For example, if we are interested in seeing a side-by-side summary of the orders_prepared dataset for each of the variables pages_visited, tshirt_category, and total, then:
Open the orders_prepared dataset.
Navigate to the Statistics tab, and click +Create Your First Worksheet.
Select the Univariate analysis box.
Select pages_visited, tshirt_category, and total from the list of available variables in the left panel of the window.
With these variables selected, click the plus button in the “Variables to describe” panel.
After you make a selection, Dataiku automatically chooses the statistical Options (in the right panel of the window) that are appropriate for the numerical variables (pages_visited and total) and the categorical variable (tshirt_category).
Click Create Card.
Inspect a statistics card
Dataiku creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical.
For example, tshirt_category, a categorical variable, has a bar chart (or categorical histogram), while pages_visited and total each have a numerical histogram with a box plot inset. Likewise, the quantile table applies to the numerical variables, while the frequency table applies to the categorical variable.
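To make the per-variable summaries concrete, here is a minimal sketch in plain Python of what a quantile table and a frequency table compute. It is not Dataiku's implementation or API: the helper names describe_numerical and describe_categorical and the sample rows are hypothetical, and the quartile method used here is only one common convention, so values may differ slightly from those shown on the card.

```python
import statistics
from collections import Counter

# Hypothetical rows standing in for orders_prepared; the values are made up.
rows = [
    {"pages_visited": 3, "tshirt_category": "White T-Shirt M", "total": 21.0},
    {"pages_visited": 7, "tshirt_category": "Black T-Shirt F", "total": 48.5},
    {"pages_visited": 5, "tshirt_category": "White T-Shirt M", "total": 30.0},
    {"pages_visited": 9, "tshirt_category": "Hoodie", "total": 65.0},
    {"pages_visited": 4, "tshirt_category": "White T-Shirt M", "total": 19.0},
]


def describe_numerical(values):
    """Quantile table for a numerical variable: min, quartiles, and max."""
    # "inclusive" treats the data as the full population when interpolating.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def describe_categorical(values):
    """Frequency table for a categorical variable: (count, share) per value."""
    counts = Counter(values)
    n = len(values)
    # most_common() orders the values from most to least frequent.
    return {value: (count, count / n) for value, count in counts.most_common()}


for name in ("pages_visited", "total"):
    print(name, describe_numerical([row[name] for row in rows]))
print("tshirt_category", describe_categorical([row["tshirt_category"] for row in rows]))
```

The split into two helpers mirrors the card's behavior of choosing statistics by variable type: quantiles only make sense for ordered numerical values, while counts and shares apply to categories.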
By default, Dataiku computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the drop-down arrow next to Sampling and filtering.
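The choice of sample matters because the first records of a dataset are not always representative of the whole. The following sketch, in plain Python, contrasts a first-records sample with a random sample on a made-up column that happens to be sorted. The function names first_records and random_records and the data are hypothetical and only illustrate the idea; they do not reproduce Dataiku's sampling engine.

```python
import random
from itertools import islice


def first_records(records, n):
    """Return the first n records, in dataset order."""
    return list(islice(records, n))


def random_records(records, n, seed=0):
    """Return n records drawn uniformly at random, without replacement."""
    return random.Random(seed).sample(list(records), n)


# Hypothetical order totals sorted in ascending order, so the head of the
# dataset contains only the smallest values.
totals = [float(t) for t in range(1, 1001)]

head = first_records(totals, 100)
rand = random_records(totals, 100)

print(sum(head) / len(head))      # mean of the first 100 records
print(sum(rand) / len(rand))      # mean of a random sample of 100 records
print(sum(totals) / len(totals))  # mean of the full column
```

When the data has no meaningful order, a first-records sample is cheap and usually adequate; when it is sorted or grouped, a random sample is a safer basis for statistics.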
Add a new card
We may also be interested in checking whether the total variable follows an exponential distribution. The interactive statistics feature allows you to estimate the parameters of univariate probability distributions using the Fit Distribution card.
Click the +New Card button from the Worksheet window.
Select the Fit curves & distributions option and the Fit Distribution card.
Select total as the Variable and Exponential as the Distribution.
Click Create Card.
Dataiku creates a card that shows the exponential distribution fit to the data. There is also a Q-Q plot that compares the quantiles of the data against the quantiles of the fitted distribution. Points that fall far from the identity line suggest that the data are unlikely to have been drawn from an exponential distribution.
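The idea behind the fit and the Q-Q plot can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Dataiku's procedure: it assumes an exponential distribution with its location fixed at zero (so the maximum likelihood estimate of the rate is one over the sample mean), uses (i - 0.5) / n as the plotting position, which is only one common convention, and runs on synthetic data. The function names fit_exponential and qq_points are hypothetical.

```python
import math
import random


def fit_exponential(values):
    """Maximum likelihood rate for an exponential with location fixed at 0."""
    mean = sum(values) / len(values)
    return 1.0 / mean


def qq_points(values, rate):
    """Pair each sorted observation with the fitted quantile at the same probability."""
    data = sorted(values)
    n = len(data)
    points = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n  # plotting position (an assumed convention)
        # Inverse CDF of the exponential distribution: -ln(1 - p) / rate.
        theoretical = -math.log(1.0 - p) / rate
        points.append((theoretical, observed))
    return points


# Synthetic stand-in for the total column, drawn from a true exponential.
random.seed(0)
sample = [random.expovariate(0.05) for _ in range(1000)]

rate = fit_exponential(sample)
points = qq_points(sample, rate)

# For a good fit, each (theoretical, observed) pair lies near the line y = x.
for theoretical, observed in points[::200]:
    print(round(theoretical, 2), round(observed, 2))
```

Because the synthetic sample really is exponential, the printed pairs should track each other closely; running the same functions on a column with a different shape would show the pairs drifting apart, which is what points far from the identity line look like numerically.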
To learn about the full capabilities in the Statistics tab, see the Interactive statistics section of the reference documentation.
This was just a brief introduction to the kinds of statistical analyses we can easily perform in Dataiku.
Now let’s continue building our Flow with more visual recipes.