Explore data

Before actually preparing the data, let’s explore it briefly.

  1. If you are not already there, open the left-most menu in the top navigation bar and click Flow (or use the keyboard shortcut g + f).

  2. Double click on the job_postings dataset to open it.

Dataiku screenshot of the Flow showing the job postings dataset.

Tip

There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu or see the Accessibility page in the reference documentation.

Compute the row count

One of the first things to recognize when exploring a dataset in Dataiku is that you are viewing only a sample of the data. This enables you to work interactively even with very large datasets.

  1. From within the Explore tab of the job_postings dataset, click the Sync icon to compute the row count of the entire dataset.

  2. Click the Sample button to open the Sample settings panel.

  3. Click the Sampling method dropdown to see options other than the default first records. No changes are required.

  4. When ready, click the Sample button again to collapse the panel.

Dataiku screenshot of the sampling tab for a dataset.
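
To make the difference between the sample and the entire dataset concrete, here is a minimal Python sketch. It does not use Dataiku's API: the `generate_rows`, `first_records`, and `count_rows` helpers are hypothetical names invented for this illustration, and the data and both sizes are made up. It only shows why a "first records" sample is cheap to build, while the full row count requires scanning everything.

```python
from itertools import islice


def generate_rows(total):
    """Yield hypothetical job-posting rows one at a time (synthetic data)."""
    for job_id in range(1, total + 1):
        yield {"job_id": job_id, "title": f"Posting {job_id}"}


def first_records(rows, sample_size):
    """Return only the first `sample_size` rows, like a 'first records' sample."""
    return list(islice(rows, sample_size))


def count_rows(rows):
    """Count every row by scanning the full stream once."""
    return sum(1 for _ in rows)


TOTAL_ROWS = 25_000   # assumed size of the full dataset (illustrative only)
SAMPLE_SIZE = 10_000  # assumed sample size (illustrative only)

# Building the sample stops reading after SAMPLE_SIZE rows.
sample = first_records(generate_rows(TOTAL_ROWS), SAMPLE_SIZE)

# Counting all rows has to read the stream to the end.
full_count = count_rows(generate_rows(TOTAL_ROWS))

print(f"Rows in sample: {len(sample)}")
print(f"Rows in full dataset: {full_count}")
```

Because the sample stops at the first rows, anything computed from it describes only those rows. A first-records sample may therefore be unrepresentative if the data is ordered in some meaningful way.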

Analyze column distributions

When exploring a dataset, you’ll also want a quick overview of the columns’ distributions and summary statistics.

  1. Click on the header of the first column job_id to open a menu of options.

  2. Select Analyze.

  3. Use the arrows to cycle through the distribution of each column, including the target variable, fraudulent.

Dataiku screenshot of the Analyze tool.

Tip

Applying the Analyze tool on the fraudulent column shows that about 5% of records in the sample are labeled as fake (a value of 1), whereas the remaining 95% are real job postings (a value of 0).

You can switch the dropdown from Sample to Whole data, then click Save and Compute to check whether this statistic differs for the whole dataset.
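
The comparison between the sample and the whole dataset can be sketched in plain Python. This is an illustration under stated assumptions, not what the Analyze tool does internally: the `value_shares` helper is a hypothetical name, and the labels are synthetic values chosen so that the sample shows 5% fakes while the whole dataset shows a different share.

```python
from collections import Counter


def value_shares(values):
    """Return each distinct value's share of the total, as a fraction."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Synthetic labels (not the real job_postings data): 1 = fake, 0 = real.
# The first 1,000 rows contain 50 fakes; the remaining 4,000 contain 120.
whole_data = [1] * 50 + [0] * 950 + [1] * 120 + [0] * 3880
sample = whole_data[:1000]  # a "first records" sample of the synthetic data

sample_shares = value_shares(sample)
whole_shares = value_shares(whole_data)

print(f"Fake share in sample: {sample_shares[1]:.1%}")      # 5.0%
print(f"Fake share in whole data: {whole_shares[1]:.1%}")   # 3.4%
```

In this made-up case the two shares differ, which is the situation that recomputing on the whole data is meant to reveal. With real data they may well be close.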