Import data

If you look at the Flow, you’ll notice different icons on the blue squares that represent datasets. Each icon indicates the underlying storage connection for that dataset. Dataiku supports connections to many data sources, and an instance administrator manages which ones are available.

For example, the initial job_postings dataset is an uploaded file (as indicated by its icon), but the output of the Prepare recipe is stored in the default location on your instance. That location might be cloud storage, an SQL database, or even a local filesystem.

Upload a file

For now, let’s demonstrate importing a new dataset into Dataiku.

  1. Download the earnings_by_education.csv file.

  2. From the Flow, click + Dataset.

  3. Select Upload your files.

    Dataiku screenshot of the Flow highlighting data connections.

  4. Click Select Files, and choose the earnings_by_education.csv file.

  5. Before creating it, click Configure Format.

Dataiku screenshot of the data import process.
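Before uploading, it can help to check the file's format locally. The sketch below is not part of Dataiku. It is a minimal standalone illustration, using only Python's standard library, of the kind of check that configuring a format involves: guessing the separator and previewing the header and first rows. The sample text, the `education_level` column name, and the helper names `guess_delimiter` and `preview` are all hypothetical. Only the `median_weekly_earnings_usd` column name comes from this tutorial.

```python
import csv
import io

# Hypothetical stand-in for earnings_by_education.csv; the values and the
# education_level column are made up for illustration.
SAMPLE = (
    "education_level,median_weekly_earnings_usd\n"
    "High school diploma,800\n"
    "Bachelor's degree,1400\n"
    "Doctoral degree,2000\n"
)


def guess_delimiter(text, candidates=(",", ";", "\t", "|")):
    """Return the candidate that splits every line into the same number
    of fields (more than one), preferring the one yielding the most fields."""
    lines = [line for line in text.splitlines() if line]
    best, best_fields = None, 1
    for delim in candidates:
        counts = {len(next(csv.reader([line], delimiter=delim))) for line in lines}
        if len(counts) == 1:
            n_fields = counts.pop()
            if n_fields > best_fields:
                best, best_fields = delim, n_fields
    return best


def preview(text, n_rows=2):
    """Return the guessed delimiter, the header row, and the first n_rows rows."""
    delim = guess_delimiter(text)
    if delim is None:
        raise ValueError("could not find a consistent delimiter")
    reader = csv.reader(io.StringIO(text), delimiter=delim)
    header = next(reader)
    rows = [row for _, row in zip(range(n_rows), reader)]
    return delim, header, rows


delim, header, rows = preview(SAMPLE)
print("delimiter:", repr(delim))
print("header:", header)
for row in rows:
    print(row)
```

To try this on the real file, read its text with `open(path, newline="").read()` and pass that to `preview` in place of `SAMPLE`.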

Infer the dataset’s schema

Configuring the format lets you adjust the dataset’s schema, which in this context means the name and storage type of each column.

  1. Navigate to the Schema subtab.

  2. Click Infer Types from Data so Dataiku can try to guess the correct storage types based on the current sample.

  3. Click Confirm, and notice how the median_weekly_earnings_usd column changed from a string to an integer.

  4. Click Create to finish importing the dataset.

  5. After looking at the new dataset, navigate back to the Flow (g + f).

Dataiku screenshot of the data import process.
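To make the idea of inferring storage types from a sample more concrete, here is a minimal sketch in plain Python. It does not reproduce Dataiku's actual inference logic, which this section does not describe. It only illustrates the general approach of choosing the narrowest type that every sampled value fits. The sample rows, the `education_level` column, the type labels, and the function names are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample; only the median_weekly_earnings_usd column name
# comes from the tutorial. The values are made up for illustration.
SAMPLE_CSV = """education_level,median_weekly_earnings_usd
High school diploma,800
Bachelor's degree,1400
Doctoral degree,2000
"""


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "string"  # nothing to learn from an all-empty column
    # Try the strictest parser first, then fall back to looser ones.
    for type_name, parser in (("int", int), ("double", float)):
        try:
            for value in non_empty:
                parser(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Map each column name to a type label inferred from the sampled rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


schema = infer_schema(SAMPLE_CSV)
for name, storage_type in schema.items():
    print(f"{name}: {storage_type}")
```

Because the guess depends only on the sampled rows, a value further down the file that does not fit the inferred type would not be caught by this kind of check. That is one reason to review the proposed types before confirming them.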