Tip | Good dataset naming schemes

A good dataset naming convention helps you and your colleagues quickly understand what a Flow achieves. Ideally, dataset names should be readable, self-explanatory, and short.

When creating a recipe, Dataiku creates a default output name by appending the name of the operation to the input’s name. This ordered naming scheme has the benefit of being simple, but it can become unreadable in long data pipelines.
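To see why the default scheme gets hard to read, here is a minimal sketch of the append-the-operation pattern. The function `default_output_name` and the operation suffixes are hypothetical illustrations, not Dataiku's actual implementation or exact suffixes.

```python
def default_output_name(input_name: str, operation: str) -> str:
    """Mimic the 'input name + operation name' scheme (illustrative only)."""
    return f"{input_name}_{operation}"


# Hypothetical pipeline: each step names its output after its input.
name = "orders"
for operation in ["prepared", "joined", "filtered", "sorted"]:
    name = default_output_name(name, operation)

print(name)  # orders_prepared_joined_filtered_sorted
```

After only four steps the name records the history of the dataset rather than its purpose, which is the readability problem described above.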

Try to replace this default name with something more self-explanatory. A good approach is to focus on what the created dataset will be used for and to choose names that distinguish each stage, such as foo_raw and foo_clean: the input is raw data, and the output is clean.

Compatible naming conventions

The following rules maintain names compatible with all storage connections (SQL dialects, HDFS, Python DataFrame columns, etc.):

Use only alphanumeric characters and underscores (_).
Use only lowercase letters.
Do not use spaces.
Do not begin the name with a number.
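The rules above can be checked mechanically. The following is a minimal sketch, assuming names are restricted to ASCII characters; the helpers `is_compatible_name` and `to_compatible_name`, and the `ds_` prefix used when a name would start with a digit, are hypothetical and not part of any Dataiku API.

```python
import re

# First character: lowercase letter or underscore (never a digit).
# Remaining characters: lowercase letters, digits, or underscores.
_NAME_PATTERN = re.compile(r"[a-z_][a-z0-9_]*")


def is_compatible_name(name: str) -> bool:
    """Return True if `name` satisfies all four rules listed above."""
    return _NAME_PATTERN.fullmatch(name) is not None


def to_compatible_name(label: str) -> str:
    """Best-effort conversion of a free-form label into a compatible name."""
    # Lowercase, then collapse every run of disallowed characters into "_".
    cleaned = re.sub(r"[^a-z0-9_]+", "_", label.strip().lower()).strip("_")
    if not cleaned:
        raise ValueError(f"cannot derive a name from {label!r}")
    if cleaned[0].isdigit():
        # Hypothetical prefix so the name does not begin with a number.
        cleaned = "ds_" + cleaned
    return cleaned


print(is_compatible_name("foo_clean"))        # True
print(is_compatible_name("Foo Clean"))        # False
print(to_compatible_name("2021 Sales (EU)"))  # ds_2021_sales_eu
```

A check like this could be run over a list of proposed dataset or column names before they are created, so that incompatible names are caught early.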

Optionally, you can adopt prefixes and suffixes for your datasets (e.g., foo_t for a dataset in an SQL database, or foo_hdfs for an HDFS dataset).

Keep the same tips in mind when naming dataset columns, notebooks, and projects.

Tip

You can rename a dataset in the Flow by right-clicking it to open the context menu, or by using the same function in the right panel.