Dataiku Datasets

Find resources for understanding properties of Dataiku datasets and keys to exploring them efficiently.

Tip | Good dataset naming schemes

Properly naming your datasets and your recipes is arguably the most important element for collaboration. Good naming helps you recover your previous work, share your work with others, and understand quickly what your colleagues are working on.

Aim for names that are readable and self-explanatory. Keep your names as short as possible, and think about what the element does in your Flow. Default names are created by appending the name of the operation to the input’s name. This ordered naming scheme has the benefit of being simple, but it quickly becomes unreadable. Try to replace default names with something more self-explanatory.

A good method is to focus on what the created dataset will be used for, and to find differentiating names, e.g. foo_raw for the raw input and foo_clean for the cleaned output.

Suggested naming scheme

The following rules keep names compatible with all storage connections (SQL dialects, HDFS, Python dataframe columns, etc.):

  • only alphanumeric characters and underscores (“_”),

  • all lowercase,

  • no spaces,

  • no leading digits.
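
These rules amount to a simple pattern check. Below is a minimal Python sketch (the function and pattern are our own illustration, not a Dataiku feature) that validates a candidate name:

    import re

    # A valid name: lowercase alphanumerics and underscores, not starting with a digit
    NAME_PATTERN = re.compile(r"^[a-z_][a-z0-9_]*$")

    def is_valid_name(name: str) -> bool:
        """Return True if the name follows the suggested naming scheme."""
        return bool(NAME_PATTERN.match(name))

    assert is_valid_name("foo_clean")
    assert not is_valid_name("Foo Clean")  # uppercase letters and a space
    assert not is_valid_name("1foo")       # begins with a number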

Optionally, you can adopt prefixes and suffixes for your datasets (e.g., foo_t for a dataset in a SQL database, foo_hdfs for an HDFS dataset, etc.).

Keep the same tips in mind when naming the columns of your datasets, your notebooks, and your projects.

Tip

For projects, an informative naming pattern can be a good solution: topic, author, version (date-based).

Remember to use fully explicit project names (e.g., “Data Ingestion” rather than “p001_data_ingestion”).

How-to | Rename a dataset

Renaming a dataset is often needed for clarity or normalization. If you do not customize the name when downloading a dataset or creating a recipe that generates an output dataset, you may end up with a name that does not suit you.

Dataiku allows you to rename any dataset. To do so:

  1. From the Flow, right-click the dataset and select Rename (or select the dataset and click Rename at the top of the Actions tab in the right panel).

    Rename menu of a dataset.

  2. Enter the new name in the dialog and click Rename to confirm.

How-to | Reorder or hide dataset columns

In many cases, you might want to reorder the columns of a dataset. In other situations, you might just want to temporarily hide columns from view. Both actions can easily be achieved with Dataiku.

Reorder columns

In a Prepare recipe (or a Visual Analysis in the Lab), use the Move columns processor to alter the order of columns. You can add this processor as a new step from the processor library or, more easily, by clicking on a column name and dragging it to the desired position. With a large number of columns, switching from the default table view to the columns view often makes dragging columns even easier.
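
If you prefer to reorder columns in code, a Python recipe can do so with pandas before writing its output. A minimal sketch, assuming input and output datasets named foo_clean and foo_ordered, with placeholder column names:

    import dataiku

    # Read the input dataset into a pandas dataframe
    df = dataiku.Dataset("foo_clean").get_dataframe()

    # Move selected columns to the front, keeping the rest in their current order
    first_cols = ["customer_id", "order_date"]  # placeholder column names
    df = df[first_cols + [c for c in df.columns if c not in first_cols]]

    # Write the reordered dataframe to the output dataset
    dataiku.Dataset("foo_ordered").write_with_schema(df)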

Hide columns

Instead of changing the order of columns, you may just want to temporarily hide certain columns or select which ones to view. When viewing any dataset, use the Display dropdown menu towards the top right corner and choose Select displayed columns. Then you can choose to display all columns, or a specific subset of columns, without affecting the actual dataset.

Display menu showing the option to select displayed columns.

Another way to quickly find a particular column is to press the “C” key while viewing a dataset. Doing so opens a column spotlight search. Enter a column name to ensure that it is included among the currently displayed columns.

Column spotlight search opened with the “C” key.

FAQ | Why can’t I drag and drop a folder into Dataiku?

To import a collection of images or other file types/formats not supported natively by Dataiku, you must first create a Managed Folder in your Flow to serve as a repository for these files.

A managed folder in the Flow.

At the top of your Flow or from the Datasets page, click on the +Dataset menu and select Folder.

The +Dataset menu with the Folder option.

Name your folder, and select a filesystem-like location in which to store it. For example, you could select filesystem folders, Amazon S3, or HDFS.

The new folder creation dialog.

Tip

If a connection allows managed folders, it is strongly recommended to set up naming rules for new datasets and folders (and a default path or bucket, if relevant) to prevent the managed folders and datasets of different projects from overlapping and creating conflicts.

Once created, you can drag and drop or upload files into this folder, or create additional subfolders for organizational purposes. Note that, unlike Windows, you cannot drag and drop files directly onto the folder icon in your Flow; you must first open the folder before dragging and dropping files into it.

Files stored inside a managed folder.

Managed folders are primarily intended to be used as input or output for code recipes (Python, R, Scala), though some visual recipes dealing with unstructured data also use managed folders as output (Export, Download). Furthermore, you can upload files to and download files from a managed folder using the Public REST API, as sketched after the note below.
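
For example, inside a Python recipe, the dataiku package lets you list, read, and write the contents of a managed folder declared as the recipe’s input or output. A minimal sketch, assuming a folder named images and placeholder file paths:

    import dataiku

    # Access a managed folder declared as an input/output of the recipe
    folder = dataiku.Folder("images")

    # List the files currently stored in the folder
    print(folder.list_paths_in_partition())

    # Read one file as bytes (the path is a placeholder)
    with folder.get_download_stream("/photo_001.png") as stream:
        data = stream.read()

    # Write a file back into the folder
    folder.upload_data("/copies/photo_001_copy.png", data)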

Note

You can find more information about creating and using managed folders in our reference documentation.
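
The Public REST API mentioned above is wrapped by the dataikuapi Python client, which can upload files into a managed folder from outside DSS. A minimal sketch, assuming a placeholder host URL, API key, project key, and folder id:

    import dataikuapi

    # Connect to the DSS instance (host and API key are placeholders)
    client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
    folder = client.get_project("MY_PROJECT").get_managed_folder("FOLDER_ID")

    # Upload a local file into the managed folder
    with open("photo_001.png", "rb") as f:
        folder.put_file("/photo_001.png", f)

    # List the folder's contents
    for item in folder.list_contents()["items"]:
        print(item["path"])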

FAQ | Where can I see how many records are in my entire dataset?

The default sample previewed in the Explore tab of a dataset is the first 10,000 records, but your whole dataset may contain many more. To check the full record count using Dataiku’s built-in methods, there are a few options (details on each follow):

  • From the Flow, select the dataset and directly compute dataset metrics from the Info tab on the far right panel, under the Status header.

  • With the dataset open in the Explore tab, select the Compute row count icon at the top of the dataset.

  • With the dataset open, visit the Status tab to compute or review dataset metrics.

  • If record count is part of a recurring quality check (after a scenario run, for example), you can embed this metric into a Dataiku dashboard and set it to automatically update each time the table is rebuilt.

Method 1: From the Flow

With the dataset selected in the Flow, navigate to the Info tab in the far right panel and click Compute under the Status header.

Computing record counts in the Info window under the Status tab.

Configured metrics appear in place in this panel, and can be refreshed as needed from this point forward.

Window showing the total size and record count of the dataset.

Method 2: Compute row count

In the Explore dataset view, Dataiku displays the number of sampled rows in the top left. For datasets larger than 10,000 rows, Dataiku shows the total record count as “not computed” by default.

To view the record count, select the Compute row count icon, or the arrow icon, next to the Sample badge.

Computing record counts with the Compute Row Count icon in the top left of the dataset.

Methods 3 and 4: Status tab and metrics

From the Explore dataset view, navigate to the Status tab and click Compute.

Status tab showing the number of columns and records in a dataset.

The default metrics are column count and record count, but you can add more dataset metrics in the Edit subtab if desired. Metrics are often used in conjunction with scenarios, but do not strictly depend on them. For example, tracking the number of records might show you how many new customer records are added to the database each day.

Metrics can be published to a Dataiku dashboard, and if you would like them to automatically update each time the dataset is rebuilt (as might be the case in a recurring automation scenario), simply toggle the option for Auto compute after build to Yes.

Updating metrics to automatically compute after build.

Note that metric probes are automatically historized, which is very useful for tracking the evolution of a dataset’s status. To review the history of a dataset metric, select History instead of Last value in the Display dropdown menu of the main Metrics page.

View the history of record counts in the dataset.

You can find more information about metrics in our reference documentation.
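
Finally, if you need the record count programmatically (for example, from an external script or a scenario step), the dataikuapi Python client can compute and read the same metric. A minimal sketch, with a placeholder host, API key, project key, and dataset name:

    import dataikuapi

    # Connect to the DSS instance (host and API key are placeholders)
    client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
    dataset = client.get_project("MY_PROJECT").get_dataset("foo_clean")

    # Compute the dataset's metrics, then read the record count metric
    dataset.compute_metrics()
    metrics = dataset.get_last_metric_values()
    print("Records:", metrics.get_global_value("records:COUNT_RECORDS"))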