Tutorial | Getting started with datasets#
Create the dataset#
When you start with a project, you’ll want to upload a dataset.
Objectives#
In this section, you will:
Upload a local CSV file to Dataiku.
Prerequisites#
To complete this tutorial, you’ll need:
Dataiku 12.0 or later.
The cards CSV file, downloaded to your computer.
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select Create Your First Project.
If needed, change the folder into which the project will be installed, and click Install.
From the project homepage, click Go to Flow (or type g + f).
From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by Core Designer.
Select Create Your First Project.
From the project homepage, click Go to Flow (or type g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Use case summary#
Let’s say you work for a financial company that uses some credit card data to detect fraudulent transactions.
The project comes with two data sources (tx and merchants) and you’ll add a third one (cards).
The table below describes each of these three datasets.
Dataset | Description |
---|---|
tx | Each row is a unique credit card transaction with information such as the card and merchant involved in the purchase. It also indicates whether the transaction has either been: |
merchants | Each row is a unique merchant with information such as the merchant’s location and category. |
cards (to be uploaded) | Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US). |
Import data#
Dataiku lets you connect to a wide variety of data sources, but for this tutorial, let’s start by uploading a local file. Each row is the latitude and longitude coordinates of a unique credit card holder.
From the Flow, click + Dataset > Upload your files.
The New Uploaded Files Dataset page opens.
Note
If the project was empty, you would have seen a blue button + Import Your First Dataset on the project homepage.
Click Select Files, and choose the cards.csv file.
Dataiku displays a preview of the cards dataset. As you can see, the data is in a tabular format, with columns (features) and rows (records or observations). Dataiku has correctly set a default dataset name cards based on the file name.
Leave the name cards and click the Configure Format button above the preview.
In the Schema tab, click Infer Types From Data, and then Confirm so that Dataiku tries to guess the correct storage types based on the current sample.
Note
Once you have uploaded the dataset to the Flow, you can check its schema any time from the Settings > Schema tab, and then infer the types.
Since the result is OK, finish importing the dataset by either clicking the Create button or using the shortcut Cmd/Ctrl + s.
This procedure creates the new dataset and lands you on the Explore tab of the cards dataset.
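To build intuition for what inferring types from a sample means, here is a minimal Python sketch using only the standard library. It is not Dataiku's implementation: the sample rows are made up, and the inference rule (try integer, then decimal, else string) is an assumption chosen for illustration.

```python
import csv
import io

# Hypothetical excerpt of cards.csv; the values are made up for illustration.
SAMPLE_CSV = """id,cardholder_fico_range,reward_program,cardholder_age
C_ID_0001,700-749,cash_back,34
C_ID_0002,650-699,points,58
C_ID_0003,750-799,cash_back,41
"""


def infer_type(values):
    """Guess a type label from string values: int first, then double, else string."""
    for name, cast in (("int", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "string"


def infer_schema(text):
    """Infer one type label per column from the rows of a CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {column: infer_type([row[column] for row in rows]) for column in rows[0]}


schema = infer_schema(SAMPLE_CSV)
print(schema)
```

Because the guess relies only on the rows it has seen, a value further down the file that does not fit the inferred type would not be detected. This is why the inference is described as based on the current sample.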

Explore the data#
Once you’ve imported a dataset, you’ll want to start exploring it through the Explore and Charts tabs!
The Explore tab of a dataset provides a tabular view of your data where you can start to examine it.
The Charts tab has a drag-and-drop interface for data visualizations.
Objectives#
In this section, you will:
Learn about the sampling method for a dataset.
Compute the row count of a dataset.
Analyze a column.
Visualize your data using charts and a pivot table.
Dataset sampling#
As an environment capable of handling large datasets, Dataiku shows only a sample of a dataset when you are working interactively.
You can see the sampling method in the top left of the Explore tab. By default, the sample in this tab includes the first 10,000 records of the dataset.
To view the total row count of your dataset, select Compute row count (the cyclic arrows icon).
To change the sample settings of a dataset, select the Sample badge, which opens a panel on the left.
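The difference between a "first records" sample and a full row count can be sketched in plain Python. This is an illustration only, not how Dataiku works internally; the function names and the 25-record dataset are hypothetical.

```python
import csv
import io
from itertools import islice


def first_records(lines, n):
    """Return the header and the first n records without reading the rest."""
    reader = csv.reader(lines)
    header = next(reader)
    return header, list(islice(reader, n))


def count_records(lines):
    """Count every record; unlike sampling, this scans the whole input."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


# Hypothetical data: 25 records stand in for a much larger file.
data = "id,reward_program\n" + "".join(f"C_{i},cash_back\n" for i in range(25))

header, sample = first_records(io.StringIO(data), n=10)
total = count_records(io.StringIO(data))
print(len(sample), total)
```

Taking the first records is cheap because reading stops early, whereas the row count requires a full pass over the data. That is one reason a row count is computed on demand rather than shown by default.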

See also
For more information, see the Sampling article in the reference documentation.
Storage type and meaning of dataset columns#
In the dataset, beneath each column name, Dataiku indicates:
The storage type (in gray)
The meaning (in blue)
Here, Dataiku detects a meaning of text for the id column, based upon the fact that most values in the sample for this column are strings.

The data quality bar shows green for all columns, which means that every value in the sample is valid for its column's detected meaning.
Note
This dataset doesn’t have any NOK (Not OK) or missing values. But if it did:
NOK values appear red in both the data quality bar and column cells.
Missing values appear gray in both the data quality bar and column cells.
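The idea behind the data quality bar can be sketched as a simple classification of each cell. The code below is an assumption-laden illustration: the `is_integer` check and the sample values are hypothetical, and Dataiku's own validity rules for each meaning are not reproduced here.

```python
from collections import Counter


def is_integer(value):
    """Hypothetical validity check for an integer-like meaning."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def quality(values, is_valid):
    """Classify each cell as ok, nok (invalid for the meaning), or missing."""
    counts = Counter()
    for value in values:
        if value is None or value.strip() == "":
            counts["missing"] += 1
        elif is_valid(value):
            counts["ok"] += 1
        else:
            counts["nok"] += 1
    return counts


# Made-up column with one invalid cell and one empty cell.
ages = ["34", "58", "n/a", "", "41"]
summary = quality(ages, is_integer)
print(dict(summary))
```

In this sketch, the three counts correspond to the green, red, and gray portions of the bar.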
See also
For more information, see the Schemas, storage types and meanings article in the reference documentation.
Analyze window#
You can analyze the content of each column using the Analyze window. Let’s analyze the content of the cardholder_fico_range column. To do so:
Click on the column name cardholder_fico_range, and select Analyze from the dropdown menu.
The Analyze window opens, showing the analysis of the data sample.
To extend the analysis to the whole data, select Whole data instead of sample in the dropdown next to the column name, then click Save and Compute.
Click the arrows next to the column name to switch from one column to another.
Important
When analyzing the whole data, switching to a new column doesn’t refresh the results automatically: click Compute again at the top of the window to display the analysis for that column. When analyzing only the sample, the results for the new column are computed automatically.
Close the window when you’re done with the analysis.
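To see why an analysis of the sample and an analysis of the whole data can differ, consider this small Python sketch. The values for cardholder_fico_range are made up, and `analyze` is a hypothetical helper, not a Dataiku function.

```python
from collections import Counter


def analyze(values):
    """Return (value, count, share) tuples, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, count / total) for value, count in counts.most_common()]


# Made-up values for cardholder_fico_range.
whole = ["700-749"] * 6 + ["650-699"] * 3 + ["750-799"] * 1
sample = whole[:4]  # a "first records" sample sees only part of the data

print(analyze(sample))
print(analyze(whole))
```

Here the sample contains a single distinct value, while the whole data contains three. With sorted or clustered data, a first-records sample can be unrepresentative in exactly this way.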
Charts#
You can use charts to explore a dataset. For example, we might want to know which reward program is the most frequent for specific age ranges.
Here, let’s use two types of charts and a pivot table to show you different approaches.
Add a vertical bars chart#
Click on the Charts tab (or use the keyboard shortcut g + v).
From the Data panel, drag:
Count of records to the Y variable.
reward_program to the X variable.
cardholder_age to the color droplet field.
The chart reveals that the cash_back reward program is the most popular, regardless of the cardholder’s age.
Important
The Sample badge at the top of the chart indicates that the chart, like the dataset, is also a sample. Clicking on the badge allows you to change the sampling method if needed.
Open the dropdown menu of the cardholder_age variable and check the Adjust bin size for nicer bounds option to have five nicer bins for the age of card holders.
On the chart, hover over any of the orange bars to filter on card holders aged between 40 and 60. Once the menu appears, click on the bar to keep the menu open, and select the drill down icon (the down arrow) to the right of the cardholder_age variable.
This action adds a filter to the Filters section of the Setup tab of the left panel.
To go back to the default display, extend the age range from 18 to 100 in the Filters section of the Setup tab.
In the left panel, go to the Format tab to customize the chart:
Under X Axis, open the Title dropdown, and enter Reward program in the Axis title field.
Under Color, select the built-in Pastel 2 palette.
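The aggregation behind this bar chart, and the effect of the age filter, can be sketched in a few lines of Python. Everything here is illustrative: the records are made up, the program names other than cash_back are hypothetical, and `count_by_program` is not part of Dataiku.

```python
from collections import Counter

# Made-up (reward_program, cardholder_age) records.
records = [
    ("cash_back", 25), ("cash_back", 45), ("cash_back", 52),
    ("points", 47), ("points", 67), ("miles", 33),
]


def count_by_program(rows, age_min=18, age_max=100):
    """Count records per reward program, keeping ages in [age_min, age_max]."""
    return Counter(program for program, age in rows if age_min <= age <= age_max)


all_ages = count_by_program(records)          # default range, no effective filter
aged_40_60 = count_by_program(records, 40, 60)  # equivalent of the drill-down filter
print(all_ages.most_common())
print(aged_40_60.most_common())
```

Widening the range back to 18 to 100 restores the unfiltered counts, which mirrors resetting the filter in the Setup tab.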
Add a pie chart#
Let’s add a second chart.
At the bottom of the screen, click + Chart.
In the chart type dropdown, select Pie.
From the panel on the left, drag:
Count of records to the Show field.
reward_program to the By field.

The chart confirms that the cash_back reward program is the most popular.
Add a pivot chart#
Lastly, let’s add a pivot table to show the same distribution of card holders by age and reward program in a different way.
At the bottom of the screen, click + Chart.
In the chart type dropdown, select Pivot table.
From the panel on the left, drag:
cardholder_age to the Rows field.
reward_program to the Columns field.
Count of records to the Value field.
In the dropdown menu of the cardholder_age variable:
Enter 5 as the Number of bins.
Check the Adjust bin size for nicer bounds option to have five nicer bins for the age of card holders.
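A pivot table of this kind is a two-way count: one key for the row, one for the column. The following Python sketch shows the idea with fixed-width age bins. The bin edges (20-year bins starting at 20) are an assumption for illustration and are not the bounds Dataiku computes; the records are made up.

```python
from collections import defaultdict

# Made-up (reward_program, cardholder_age) records.
records = [
    ("cash_back", 25), ("cash_back", 45), ("cash_back", 52),
    ("points", 47), ("points", 67), ("miles", 33),
]


def bin_label(age, low=20, width=20):
    """Map an age to a fixed-width bin label such as '40-60'."""
    start = low + ((age - low) // width) * width
    return f"{start}-{start + width}"


def pivot(rows):
    """Rows are age bins, columns are reward programs, values are record counts."""
    table = defaultdict(lambda: defaultdict(int))
    for program, age in rows:
        table[bin_label(age)][program] += 1
    return {age_bin: dict(columns) for age_bin, columns in table.items()}


result = pivot(records)
for age_bin in sorted(result):
    print(age_bin, result[age_bin])
```

Round bin boundaries like these are easier to read than boundaries derived directly from the minimum and maximum of the data, which is the motivation for the nicer bounds option.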

See also
For more information, see the Tutorial | Charts and pivot tables article.
Next steps#
Congratulations! You’ve created your first project, imported your first dataset, and built your first charts.
We recommend continuing with the Tutorial | Join recipe, where we join the three datasets from this project to enrich each unique transaction with information about the credit card holder and merchant involved in that transaction.