Explore the data

Once you’ve imported a dataset, you’ll want to start exploring it through the Explore and Charts tabs!

The Explore tab of a dataset provides a tabular view of your data where you can start to examine it, while the Charts tab has a drag-and-drop interface for data visualizations.

Objectives

In this section, you will:

  • Learn about the sampling method for a dataset.

  • Compute the row count of a dataset.

  • Adjust the meaning of a column.

  • Create a chart.

Dataset sampling

Because Dataiku is designed to handle large datasets, it shows only a sample of a dataset when you work interactively.

You can see the sampling method in the top left of the Explore tab. By default, the sample in this tab includes the first 10,000 records of the dataset.

  1. To view the total row count of your dataset, select Compute row count (the cyclic arrows icon).

  2. To change the sample settings of a dataset, select the Sample badge, which opens a panel on the left.

A Dataiku screenshot showing how to open the Sample settings panel of a dataset.
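To make the difference between the sample and the full row count concrete, here is a minimal sketch in plain Python. It is not Dataiku's implementation: the helper names `head_sample` and `count_rows` and the generated data are hypothetical. It only illustrates that a "first records" sample reads a fixed number of rows, whereas a row count has to pass over the whole dataset.

```python
import csv
import io
from itertools import islice


def head_sample(rows, limit=10_000):
    """Return the first `limit` records from an iterable of rows."""
    return list(islice(rows, limit))


def count_rows(rows):
    """Count every record by streaming through the whole iterable."""
    return sum(1 for _ in rows)


# Hypothetical data: 25,000 orders generated in memory as CSV text.
csv_text = "order_id,tshirt_category\n" + "".join(
    f"{i},Black\n" for i in range(25_000)
)

# The sample stops after 10,000 records; the count reads all of them.
sample = head_sample(csv.DictReader(io.StringIO(csv_text)))
total = count_rows(csv.DictReader(io.StringIO(csv_text)))

print(len(sample), total)  # 10000 25000
```

This is also why the row count is computed on demand rather than shown by default: it requires a full pass over the data, while the sample does not.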

See also

For more information, see the Sampling article in the reference documentation.

Storage type and meaning of dataset columns

In the dataset, beneath each column name, Dataiku indicates:

  • The storage type (in black)

  • The meaning (in blue)

Here, Dataiku detects a meaning of integer for the customer_id column because most values in the sample for customer_id are integers.

The data quality bar shows red for the few values that do not match this meaning. This lets us judge whether those values are truly invalid customer IDs or, as is the case here, whether integer is too restrictive a meaning for customer_id.

A Dataiku screenshot showing Integer as the detected meaning for the *customer_id* column and a gauge showing red for a few values in the column.

  1. Click on the meaning to display the contextual menu.

  2. Select Text to update it.

Now the data quality bar for customer_id is entirely green.

A Dataiku screenshot showing that the meaning for the *customer_id* column has been updated to Text and the gauge is now completely green.
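The idea behind the data quality bar can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dataiku's actual detection logic: the helpers `is_integer` and `valid_ratio` and the sample values are hypothetical. It shows why a column that is mostly integers is only partly valid under an integer meaning, and fully valid under a text meaning.

```python
def is_integer(value):
    """True if the string is a whole number with an optional sign."""
    text = value.strip()
    if text[:1] in "+-":
        text = text[1:]
    return text.isascii() and text.isdigit()


def valid_ratio(values, matches):
    """Share of non-empty values accepted by the `matches` predicate."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return 0.0
    return sum(1 for v in non_empty if matches(v)) / len(non_empty)


# Hypothetical customer_id sample: mostly integers, a few alphanumeric IDs.
customer_ids = [
    "1001", "1002", "1003", "1004", "1005",
    "1006", "1007", "1008", "A1009", "B1010",
]

# Under an integer meaning, the alphanumeric IDs count as invalid.
as_integer = valid_ratio(customer_ids, is_integer)

# Under a text meaning, any non-empty value is accepted.
as_text = valid_ratio(customer_ids, lambda v: True)

print(as_integer, as_text)  # 0.8 1.0
```

In this toy sample, the 20% of values rejected by the integer check correspond to the red portion of the bar, which disappears once the meaning is Text.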

Note

This dataset has no missing values. If it did, they would appear in gray in the data quality bar.

See also

For more information, see the Schemas, storage types and meanings article in the reference documentation.

Charts

You can use charts to explore a dataset. For example, we might want to know how often each type of t-shirt is ordered.

  1. Click on the Charts tab (or use the keyboard shortcut G+V).

  2. From the panel on the left, drag and drop Count of records as the Y variable.

  3. Drag and drop tshirt_category as the X variable.

Dataiku shows a column chart of Count of records by tshirt_category for the current sample.

A Dataiku screenshot showing a chart of *Count of records* by *tshirt_category* for the current data sample.

The chart reveals that the values of tshirt_category are not consistently recorded. Black shirts are sometimes recorded as “Black” and sometimes as “Bl”. Similarly, white shirts are sometimes recorded as “White” and sometimes as “Wh”.
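The aggregation behind this chart, a count of records grouped by category, can be approximated outside Dataiku with a short Python sketch. The list of values below is hypothetical and much smaller than the real sample. It is only meant to show how a simple count per category exposes inconsistent labels.

```python
from collections import Counter

# Hypothetical sample of tshirt_category values with inconsistent labels.
categories = ["Black", "White", "Bl", "Black", "Wh", "White", "Black", "Bl"]

# Count of records per distinct category value.
counts = Counter(categories)

# Print a simple text bar per category, most frequent first.
for category, n in counts.most_common():
    print(f"{category:<6} {'#' * n} ({n})")
```

Because “Black” and “Bl” are counted as separate categories, each gets its own bar, which is the same symptom the column chart makes visible.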

What’s next?

Congratulations! You’ve created your first project, imported your first dataset, and built your first chart. In the next tutorial on preparing data, we’ll handle issues with the dataset by using a Prepare recipe.