Create the dataset

When you start a project, one of the first things you’ll want to do is upload a dataset.

Objectives

In this section, you will:

  • Upload a local CSV file to Dataiku.

Prerequisites

To complete this tutorial, you’ll need:

Create the project

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Create Your First Project.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or use the keyboard shortcut g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary

Let’s say we’re a financial company that uses credit card data to detect fraudulent transactions.

The project comes with two data sources (tx and merchants), and you’ll add a third one (cards).

The table below describes each of these three datasets.

Dataset | Description

tx | Each row is a unique credit card transaction, with information such as the card that was used and the merchant where the transaction was made. The authorized_flag column indicates whether the transaction was authorized (a value of 1) or flagged for potential fraud (a value of 0).

merchants | Each row is a unique merchant, with information such as the merchant’s location and category.

cards (to be uploaded) | Each row is a unique credit card ID, with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).
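To make the meaning of the authorized_flag column concrete, here is a minimal Python sketch that counts authorized and flagged transactions in a few made-up rows shaped like the tx dataset. Only authorized_flag comes from the tutorial; the card_id and merchant_id column names and all values are assumptions for illustration, not the real schema.

```python
from collections import Counter

# Hypothetical sample rows. Only authorized_flag is named in the tutorial;
# card_id and merchant_id are assumed column names for illustration.
tx_rows = [
    {"card_id": "C1", "merchant_id": "M1", "authorized_flag": 1},
    {"card_id": "C2", "merchant_id": "M1", "authorized_flag": 0},
    {"card_id": "C1", "merchant_id": "M2", "authorized_flag": 1},
]

# Meaning of each flag value, as described in the table above.
LABELS = {1: "authorized", 0: "flagged for potential fraud"}


def summarize_flags(rows):
    """Count transactions per authorized_flag label."""
    return Counter(LABELS[row["authorized_flag"]] for row in rows)


summary = summarize_flags(tx_rows)
for label, count in summary.items():
    print(f"{label}: {count}")
```

In a real project, this kind of summary would more likely come from a Dataiku chart or recipe than from hand-written code.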

Import data

Dataiku lets you connect to a wide variety of data sources, but for this tutorial, let’s start by uploading a local file, cards.csv. Each row includes the latitude and longitude coordinates of a unique credit card holder.

  1. From the Flow, click + Dataset > Upload your files.

    The New Uploaded Files Dataset page opens.

    Note

    If the project were empty, you would instead see a blue + Import Your First Dataset button on the project homepage.

  2. Click Select Files, and choose the cards.csv file.

    Dataiku displays a preview of the cards dataset. As you can see, the data is in a tabular format, with columns (features) and rows (records or observations). Dataiku has correctly set a default dataset name cards based on the file name.

  3. Leave the name cards and click the Configure Format button above the preview.

  4. In the Schema tab, click Infer Types From Data, and then Confirm, so that Dataiku guesses the storage types based on the current sample.

    [Screenshot: the Preview on the upload dataset page.]

    Note

    Once the dataset is uploaded to your Flow, you can review its schema at any time from the Settings > Schema tab and infer the types again from there.

  5. Since the inferred types look correct, finish importing the dataset by either clicking the Create button or using the shortcut Cmd/Ctrl+S.

This procedure creates the new dataset and lands you on the Explore tab of the cards dataset.

[Screenshot: the Explore tab of a dataset.]
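The type inference step above guesses a storage type for each column from a sample of rows. The following is a rough Python sketch of that general idea, not Dataiku's actual algorithm: it reads a sample from CSV text and picks the narrowest of three types that fits every non-empty value. The column names, the sample values, and the three type labels are assumptions for illustration.

```python
import csv
import io


def guess_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "string"
    # Try the narrowest type first, then widen.
    for name, cast in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text, sample_size=100):
    """Guess a type for each column from the first sample_size rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sample = [row for _, row in zip(range(sample_size), reader)]
    return {
        col: guess_type([row[col] for row in sample])
        for col in reader.fieldnames
    }


# Hypothetical stand-in for cards.csv; the real file's columns may differ.
sample_csv = """card_id,first_active_month,fico_score,latitude
C1,2017-06,712,40.71
C2,2016-11,,34.05
C3,2018-01,655,41.88
"""

schema = infer_schema(sample_csv)
for column, storage_type in schema.items():
    print(f"{column}: {storage_type}")
```

Because the guess depends only on the sample, a value further down the file can still contradict it, which is one reason to review the schema after the dataset is created.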