Tutorial | Getting started with datasets#
Create the dataset#
When you start with a project, you’ll want to upload a dataset.
Objectives#
In this section, you will:
Upload a local CSV file to Dataiku.
Prerequisites#
To complete this tutorial, you’ll need:
Dataiku 12.0 or later.
Download the cards CSV file.
Create the project#
To create the project:
From the Dataiku Design homepage, click + New project > DSS tutorials > Core Designer > Create Your First Project.
From the project homepage, click Go to Flow (or use the keyboard shortcut g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Use case summary#
Let’s say we’re a financial company that uses some credit card data to detect fraudulent transactions.
The project comes with two data sources (tx and merchants) and you’ll add a third one (cards).
The table below describes each of these three datasets.
| Dataset | Description |
| --- | --- |
| tx | Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. It also indicates whether the transaction has been authorized or flagged as fraudulent. |
| merchants | Each row is a unique merchant with information such as the merchant’s location and category. |
| cards (to be uploaded) | Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US). |
Import data#
Dataiku lets you connect to a wide variety of data sources, but for this tutorial, let’s start by uploading a local file: the cards CSV file that you downloaded as a prerequisite.
From the Flow, click + Dataset > Upload your files.
The New Uploaded Files Dataset page opens.
Note
If the project was empty, you would have seen a blue button + Import Your First Dataset on the project homepage.
Click Select Files, and choose the cards.csv file.
Dataiku displays a preview of the cards dataset. As you can see, the data is in a tabular format, with columns (features) and rows (records or observations). Dataiku has correctly set the default dataset name to cards, based on the file name.
Leave the name cards and click the Configure Format button above the preview.
In the Schema tab, click on Infer Types From Data, and then Confirm so that Dataiku tries to guess the correct storage types based on the current sample.
Note
Once the dataset is uploaded to your Flow, you can check its schema any time from the Settings > Schema tab, and then infer the types.
Since the result looks correct, finish importing the dataset by either clicking the Create button or using the shortcut Cmd/Ctrl+S.
This procedure creates the new dataset and lands you on the Explore tab of the cards dataset.
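If you prefer to script this step, the same upload can be done with Dataiku’s public Python API (the dataikuapi package). The sketch below is a minimal illustration rather than a tutorial step: the host URL, API key, and project key are placeholders to replace with your own values.

```python
# Minimal sketch: upload a local CSV as an "uploaded files" dataset
# via the public dataikuapi client. Host, key, and project key below
# are placeholders, not real values.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_KEY")

# Create an empty uploaded-files dataset named "cards", then attach the file.
dataset = project.create_upload_dataset("cards")
with open("cards.csv", "rb") as f:
    dataset.uploaded_add_file(f, "cards.csv")
```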
Explore the data#
Once you’ve imported a dataset, you’ll want to start exploring it through the Explore and Charts tabs!
The Explore tab of a dataset provides a tabular view of your data where you can start to examine it, while the Charts tab has a drag-and-drop interface for data visualizations.
Objectives#
In this section, you will:
Learn about the sampling method for a dataset.
Compute the row count of a dataset.
Analyze a column.
Visualize your data using charts and a pivot table.
Dataset sampling#
As an environment capable of handling large datasets, Dataiku shows only a sample of a dataset when you are working interactively.
You can see the sampling method in the top left of the Explore tab. By default, the sample in this tab includes the first 10,000 records of the dataset.
To view the total row count of your dataset, select Compute row count (the cyclic arrows icon).
To change the sample settings of a dataset, select the Sample badge, which opens a panel on the left.
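To make the sampling behavior concrete, here is a minimal sketch using the dataiku package, assuming it runs inside the project (for example, in a Dataiku Python notebook). Reading with head sampling mirrors the Explore tab’s default sample, while reading without a limit gives the total row count.

```python
# Minimal sketch, assuming the dataiku package and the cards dataset
# are available (e.g. in a Python notebook inside the project).
import dataiku

cards = dataiku.Dataset("cards")

# Same as the Explore tab's default sample: the first 10,000 records.
sample_df = cards.get_dataframe(sampling="head", limit=10000)

# Without a limit, the whole dataset is read, so len() gives the total
# row count (fine for small data; avoid on very large datasets).
full_df = cards.get_dataframe()
print(f"{len(sample_df)} sampled rows, {len(full_df)} total rows")
```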
See also
For more information, see the Sampling article in the reference documentation.
Storage type and meaning of dataset columns#
In the dataset, beneath each column name, Dataiku indicates:
The storage type (in gray)
The meaning (in blue)
Here, Dataiku detects a meaning of text for the customer_id column, based on the fact that most values in the sample are strings.
The data quality bar shows green for all columns, which means that all rows are valid.
Note
In this dataset, we do not have any NOK (Not OK) or missing values. But if we did:
NOK values would be represented in red in both the data quality bar and column cells.
Missing values would be in gray in both the data quality bar and column cells.
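If you’d rather check storage types from code, a minimal sketch with the dataiku package follows. Meanings are inferred in the interface, but read_schema() returns each column’s name and storage type.

```python
# Minimal sketch: list column names and storage types for the cards
# dataset, assuming this runs inside the project.
import dataiku

for column in dataiku.Dataset("cards").read_schema():
    print(column["name"], "->", column["type"])
```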
See also
For more information, see the Schemas, storage types and meanings article in the reference documentation.
Analyze window#
You can analyze the content of each column using the Analyze window. Let’s analyze the content of the cardholder_fico_range column. To do so:
Click on the column name cardholder_fico_range, and select Analyze from the dropdown menu.
The Analyze window opens, showing the analysis of the data sample.
To extend the analysis to the whole data, select Whole data instead of Sample in the dropdown next to the column name, then click Save and Compute.
Click the arrows next to the column name to switch from one column to another.
Important
As we are analyzing the whole dataset, we have to click Compute again at the top of the window to display the analysis for the whole data of the new column. It would be automatically computed if we were just looking at the sample.
Close the window when you’re done with the analysis.
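Outside the Analyze window, you can get a similar column summary with pandas. A minimal sketch, assuming the cards dataset and its cardholder_fico_range column:

```python
# Minimal sketch: summarize one column of the sample with pandas,
# similar in spirit to the Analyze window's statistics and top values.
import dataiku

df = dataiku.Dataset("cards").get_dataframe(sampling="head", limit=10000)
print(df["cardholder_fico_range"].describe())             # basic statistics
print(df["cardholder_fico_range"].value_counts().head())  # most frequent values
```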
Charts#
You can use charts to explore a dataset. For example, we might want to know which reward program is the most frequent for specific age ranges.
Here, let’s use two types of charts and a pivot table to show you different approaches.
Add a vertical bars chart#
Click on the Charts tab (or use the keyboard shortcut g + v).
From the Data panel, drag and drop:
Count of records as the Y variable.
reward_program as the X variable.
cardholder_age to the color droplet field.
The chart reveals that the cash_back reward program is the most popular, regardless of the cardholders’ age.
Important
The Sample badge at the top of the chart reminds you that the chart is sampled like the dataset. Clicking on the badge allows you to change the sampling method if needed.
Open the dropdown menu of the cardholder_age variable and check the Adjust bin size for nicer bounds option to get five evenly bounded bins for the cardholders’ age.
On the chart, hover over any of the orange bars to filter on cardholders aged between 40 and 60. Once the menu is displayed, click on the bar to keep the menu open, then select the drill down icon (the down arrow) at the right of the cardholder_age variable.
This action adds a filter to the Filters section of the Setup tab of the left panel.
To go back to the default display, extend the age range from 18 to 100 in the Filters section of the Setup tab.
In the left panel, go to the Format tab to customize the chart:
Under X Axis, open the Title dropdown, and enter Reward program in the Axis title field.
Under Color, select the built-in Pastel 2 palette.
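For reference, a rough pandas/matplotlib equivalent of this chart is sketched below, assuming the reward_program and cardholder_age columns from this tutorial: bin the ages, count records per program and bin, and plot grouped bars.

```python
# Minimal sketch: count of records by reward program, split into five
# age bins, mirroring the vertical bars chart built above.
import dataiku
import pandas as pd
import matplotlib.pyplot as plt

df = dataiku.Dataset("cards").get_dataframe()
df["age_bin"] = pd.cut(df["cardholder_age"], bins=5)

counts = (df.groupby(["reward_program", "age_bin"], observed=False)
            .size()
            .unstack("age_bin"))
counts.plot.bar()
plt.xlabel("Reward program")
plt.ylabel("Count of records")
plt.tight_layout()
plt.show()
```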
Add a pie chart#
Let’s add a second chart.
At the bottom of the screen, click + Chart.
In the chart type dropdown, select Pie.
From the panel on the left, drag and drop:
Count of records in the Show field.
reward_program in the By field.
The chart confirms that the cash_back reward program is the most popular.
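The same breakdown can be sketched as a pie chart with pandas, again assuming the reward_program column:

```python
# Minimal sketch: share of records per reward program, mirroring the pie chart.
import dataiku
import matplotlib.pyplot as plt

df = dataiku.Dataset("cards").get_dataframe()
df["reward_program"].value_counts().plot.pie(autopct="%1.0f%%")
plt.ylabel("")  # drop the default axis label for a cleaner pie
plt.show()
```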
Add a pivot chart#
Lastly, let’s add a pivot table to represent in a different way the distribution of card holders per age across the different reward programs.
At the bottom of the screen, click + Chart.
In the chart type dropdown, select Pivot table.
From the panel on the left, drag and drop:
cardholder_age in the Rows field.
reward_program in the Columns field.
Count of records in the Value field.
In the dropdown menu of the cardholder_age variable:
Enter 5 as the Number of bins.
Check the Adjust bin size for nicer bounds option to get five evenly bounded bins for the cardholders’ age.
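As a point of comparison, pandas builds the same count pivot with crosstab. A minimal sketch, assuming the columns used above:

```python
# Minimal sketch: age bins as rows, reward programs as columns,
# count of records as values -- the same layout as the pivot table above.
import dataiku
import pandas as pd

df = dataiku.Dataset("cards").get_dataframe()
df["age_bin"] = pd.cut(df["cardholder_age"], bins=5)
print(pd.crosstab(df["age_bin"], df["reward_program"]))
```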
See also
For more information, see the Tutorial | Charts and pivot tables article.
What’s next?#
Congratulations! You’ve created your first project, imported your first dataset, and built your first charts.
We recommend continuing with Tutorial | Join recipe, where we join the three datasets from this project to enrich each unique transaction with information about its credit card holder and merchant.