Tutorial | Getting started with datasets#
Create the dataset#
When you start with an empty project, you’ll want to upload your first dataset!
In this section, you will:
Upload a local file to Dataiku.
To complete this tutorial, you’ll need to:
Have access to a Dataiku instance (version 9.0 or above).
Download the orders CSV file.
Create the project#
To create the project:
From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Basics 101.
From the project homepage, click Go to Flow (or use the keyboard shortcut G+F).
You can also download the starter project from this website and import it as a zip file.
Dataiku lets you connect to a wide variety of data sources, but for this tutorial, let’s start by uploading a local file.
From the project homepage or the Flow (G+F), click the blue button + Import Your First Dataset.
Click on Upload your files.
The New Uploaded Files Dataset page opens.
Click on Select Files, and choose the orders.csv file.
Review the dataset preview to make sure Dataiku detected the CSV format correctly. As you can see, the data is in a tabular format, with columns (features) and rows (records or observations). In this case, Dataiku has correctly parsed the file and set a default dataset name, orders, based on the file name.
Since that’s OK for us, finish importing the dataset by either hitting the Create button or using the shortcut Cmd/Ctrl+S.
This procedure creates the new dataset and lands you on the Explore tab of the orders dataset.
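To get a feel for what "detecting the CSV format" involves, here is a minimal sketch using only Python's standard library. It is not how Dataiku performs detection; it simply guesses the delimiter of a small text sample and previews the header and first rows. The sample rows are invented for illustration (only the column names customer_id and tshirt_category come from this tutorial), and preview_csv is a hypothetical helper name.

```python
import csv
import io

# Made-up stand-in for the first lines of a CSV file. Only the two column
# names come from the tutorial; the values are invented.
SAMPLE_TEXT = (
    "customer_id,tshirt_category\n"
    "1001,Black T-Shirt\n"
    "1002,White T-Shirt\n"
    "a1b2c3,Bl T-Shirt\n"
)


def preview_csv(text, max_rows=5):
    """Guess the delimiter, then return (delimiter, header, first rows)."""
    # Restrict the candidates to common delimiters to keep the guess stable.
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t")
    reader = csv.reader(io.StringIO(text), dialect)
    header = next(reader)
    rows = [row for _, row in zip(range(max_rows), reader)]
    return dialect.delimiter, header, rows


delimiter, header, rows = preview_csv(SAMPLE_TEXT)
print("delimiter:", repr(delimiter))
print("columns:", header)
for row in rows:
    print(row)
```

If the guessed delimiter or the header looks wrong in a preview like this, that is the same signal you would look for in the dataset preview before clicking Create.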
Explore the data#
Once you’ve imported a dataset, you’ll want to start exploring it through the Explore and Charts tabs!
The Explore tab of a dataset provides a tabular view of your data where you can start to examine it, while the Charts tab has a drag-and-drop interface for data visualizations.
In this section, you will:
Learn about the sampling method for a dataset.
Compute the row count of a dataset.
Adjust the meaning of a column.
Create a chart.
As an environment capable of handling large datasets, Dataiku shows only a sample of a dataset when you are working interactively.
You can see the sampling method in the top left of the Explore tab. By default, the sample in this tab includes the first 10,000 records of the dataset.
To view the total row count of your dataset, select Compute row count (the cyclic arrows icon).
To change the sample settings of a dataset, select the Sample badge, which opens a panel on the left.
For more information, see the Sampling article in the reference documentation.
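The difference between a "first records" sample and the full row count can be sketched in plain Python. This is an illustration of the idea only, not Dataiku's implementation: head_sample and count_rows are hypothetical helpers, and the 25-row input is synthetic.

```python
import csv
import io
from itertools import islice


def head_sample(lines, sample_size):
    """Return the first `sample_size` data rows (a 'first records' sample)."""
    reader = csv.DictReader(lines)
    return list(islice(reader, sample_size))


def count_rows(lines):
    """Count all data rows by scanning the whole input (header excluded)."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# Synthetic data: one header line plus 25 rows.
text = "customer_id\n" + "".join(f"{i}\n" for i in range(25))

sample = head_sample(io.StringIO(text), sample_size=10)
total = count_rows(io.StringIO(text))
print(f"sample rows: {len(sample)}, total rows: {total}")
```

The sample stops reading as soon as it has enough rows, while the count has to scan everything. That is why the row count is something you compute on request rather than something shown by default.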
Storage type and meaning of dataset columns#
In the dataset, beneath each column name, Dataiku indicates:
The storage type (in black)
The meaning (in blue)
Here, Dataiku detects a meaning of integer for the customer_id column, based upon the fact that most values in the sample for customer_id are integers.
The data quality bar shows red for the few values that do not match this meaning. This lets us decide whether those values are truly invalid customer IDs or, as is the case here, whether integer is too restrictive a meaning for customer_id.
Click on the meaning to display the contextual menu.
Select Text to update it.
Now the data quality bar for customer_id is entirely green.
In this dataset, we do not have any missing values. But if we did, they would be represented by the gray color in the data quality bar.
For more information, see the Schemas, storage types and meanings article in the reference documentation.
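The idea of guessing a meaning from the majority of sampled values, and then flagging the values that do not fit, can be sketched as follows. This is an assumption-laden illustration rather than Dataiku's actual detection logic: the 0.5 threshold, the helper names, and the sample IDs are all invented.

```python
import re

INTEGER_PATTERN = re.compile(r"[+-]?\d+")


def is_valid(value, meaning):
    """Return True if a non-empty value fits the given meaning."""
    if meaning == "Integer":
        return INTEGER_PATTERN.fullmatch(value.strip()) is not None
    return True  # in this sketch, any non-empty value is valid Text


def quality_counts(values, meaning):
    """Count valid, invalid, and empty values (the three bar colors)."""
    counts = {"valid": 0, "invalid": 0, "empty": 0}
    for value in values:
        if not value.strip():
            counts["empty"] += 1
        elif is_valid(value, meaning):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    return counts


def guess_meaning(values, threshold=0.5):
    """Guess 'Integer' when more than `threshold` of non-empty values are integers."""
    counts = quality_counts(values, "Integer")
    non_empty = counts["valid"] + counts["invalid"]
    if non_empty and counts["valid"] / non_empty > threshold:
        return "Integer"
    return "Text"


# Invented sample: mostly numeric IDs, plus one alphanumeric ID.
customer_ids = ["1001", "1002", "1003", "a1b2c3", "1005"]

guessed = guess_meaning(customer_ids)
print(guessed, quality_counts(customer_ids, guessed))
print("Text", quality_counts(customer_ids, "Text"))
```

With the guessed meaning, the alphanumeric ID counts as invalid. Switching the meaning to Text makes every non-empty value valid, which mirrors the quality bar turning entirely green.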
You can use charts to explore a dataset. For example, we might want to know how often each type of t-shirt is ordered.
Click on the Charts tab (or use its keyboard shortcut).
From the panel on the left, drag and drop Count of records as the Y variable.
Drag and drop tshirt_category as the X variable.
Dataiku shows a column chart of Count of records by tshirt_category for the current sample.
The chart reveals that the values of tshirt_category are not consistently recorded. Black shirts are sometimes recorded as “Black” and sometimes as “Bl”. Similarly, white shirts are sometimes recorded as “White” and sometimes as “Wh”.
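A "Count of records by category" chart boils down to a simple tally. The sketch below does that tally in plain Python and prints a rough text bar chart. The category strings are invented for illustration; they only borrow the "Black"/"Bl" and "White"/"Wh" spellings mentioned above.

```python
from collections import Counter

# Invented values for a tshirt_category column, including the inconsistent
# spellings the chart reveals.
categories = [
    "Black T-Shirt",
    "Bl T-Shirt",
    "White T-Shirt",
    "Wh T-Shirt",
    "Black T-Shirt",
    "White T-Shirt",
    "Black T-Shirt",
]

# Count of records per distinct category value.
counts = Counter(categories)

# Print one bar per category, most frequent first.
for category, n in counts.most_common():
    print(f"{category:<15} {'#' * n} ({n})")
```

Because "Bl T-Shirt" and "Black T-Shirt" are different strings, they land in separate bars, which is exactly the kind of inconsistency a chart like this makes visible.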
Congratulations! You’ve created your first project, imported your first dataset, and built your first chart. In the next tutorial on preparing data, we’ll handle issues with the dataset by using a Prepare recipe.