Review the Flow

See a screencast covering this section’s steps

One of the first concepts a user needs to understand about Dataiku is the Flow. The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.

See the Flow’s visual grammar

Dataiku has its own visual grammar to organize AI, machine learning, and analytics projects in a collaborative way.

Shape | Item | Icon
Square | Dataset | The icon on the square represents the dataset’s storage location, such as Amazon S3, Snowflake, PostgreSQL, etc.
Circle | Recipe | The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe.
Diamond | Model | The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc.

Tip

In addition to shape, color has meaning too.

  • Datasets are blue, but those shared from other projects are black.

  • Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.

  • Machine learning elements are green.
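As a quick way to memorize these conventions, the shape and color rules above can be written as a small lookup. This is only an illustrative sketch in plain Python: the function name `describe` and the returned labels are made up for this example and are not part of any Dataiku API.

```python
# Shape and color conventions taken from the table and tip above.
# All names here are hypothetical; nothing below calls Dataiku.
SHAPE_TO_ITEM = {"square": "dataset", "circle": "recipe", "diamond": "model"}

RECIPE_COLORS = {
    "yellow": "visual recipe",
    "orange": "code recipe",
    "pink": "LLM recipe",
    "red": "plugin recipe",
}


def describe(shape: str, color: str) -> str:
    """Translate a Flow item's shape and color into a plain-text label."""
    item = SHAPE_TO_ITEM.get(shape)
    if item is None:
        return "unknown item"
    # Machine learning elements are green.
    if color == "green" and item in ("recipe", "model"):
        return f"machine learning {item}"
    if item == "dataset":
        if color == "blue":
            return "dataset"
        if color == "black":
            return "dataset shared from another project"
    if item == "recipe" and color in RECIPE_COLORS:
        return RECIPE_COLORS[color]
    return f"{item} (unrecognized color)"


print(describe("circle", "orange"))  # a Python recipe would fall in this group
print(describe("square", "black"))
```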

Take a look now!

  1. If not already there, from the left-most menu in the top navigation bar, click on the Flow (or use the keyboard shortcut g + f).

    Dataiku screenshot of the MLOps starting Flow.

    Note

    This project begins in the Data Preparation Flow zone from a labeled dataset named job_postings composed of 95% real and 5% fake job postings. The pipeline builds a prediction model capable of classifying a job posting as real or fake. Your job will be to deploy the model found in the Machine Learning Flow zone as a real-time API endpoint.

  2. Take a moment to review the objects in the Flow. Gain a high-level understanding of how the recipes first prepare, join, and split the data, then train a model, and finally use it to score new data.

Tip

There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu or see the complete list in the reference documentation.

Build the Flow

Unlike the initially uploaded datasets, the downstream datasets appear as outlines. This is because they have not been built, meaning that the relevant recipes have not yet been run to populate them. This is not a problem, however, because the Flow contains the recipes required to create these outputs at any time.
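To make the idea of "building" more concrete, here is a minimal sketch of how a set of recipes can be resolved into a run order for an unbuilt downstream dataset. It is a toy model in plain Python, not Dataiku's job engine or API. Apart from job_postings, every dataset and recipe name below is hypothetical.

```python
def recipes_to_run(recipes, built, target):
    """Return the recipes to run, in order, so that `target` gets built.

    recipes: dict mapping recipe name -> (input datasets, output datasets)
    built:   set of datasets that already contain data
    target:  the dataset we want to build
    """
    # Map each dataset to the recipe that produces it.
    producer = {out: name for name, (_, outs) in recipes.items() for out in outs}
    order = []
    seen = set()

    def visit(dataset):
        if dataset in built or dataset in seen:
            return
        seen.add(dataset)
        recipe = producer[dataset]  # KeyError if no recipe produces the dataset
        for upstream in recipes[recipe][0]:
            visit(upstream)  # build inputs first
        if recipe not in order:
            order.append(recipe)

    visit(target)
    return order


# Hypothetical toy Flow: only the uploaded dataset is built at the start.
toy_recipes = {
    "prepare": (["job_postings"], ["job_postings_prepared"]),
    "split": (["job_postings_prepared"], ["train", "test"]),
    "score": (["test"], ["test_scored"]),
}
print(recipes_to_run(toy_recipes, {"job_postings"}, "test_scored"))
```

Because every recipe records its inputs and outputs, the outputs can be regenerated whenever needed, which is why the outlined datasets are not a cause for concern.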

  1. Click to open the Flow Actions menu in the bottom right.

  2. Click Build all.

  3. Leaving the default options, click Build to run the recipes necessary to create the items furthest downstream.

  4. When the job completes, refresh the page to see the built Flow.

    Dataiku screenshot of the dialog for building the Flow.

    Let’s also take a closer look at the model itself.

  5. Double-click the diamond-shaped Predict fraudulent (binary) model to open it, and then return to the Flow when you have finished inspecting the model.

    • Note that it has only one version. As you retrain the model, Dataiku tracks the history of model versions, so you can easily roll back from the active version to an older one.

    • Click on the model version name Random forest (s2) - v1 at the top left of the tile to see the full report.

Dataiku screenshot of the saved model object.
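The version tracking mentioned above can be pictured with a small toy class: each retraining appends a version, one version is active at a time, and rolling back only changes which version is active. This is an illustrative sketch under those assumptions. The class `ToySavedModel` and its methods are hypothetical and do not reflect how Dataiku stores saved models.

```python
class ToySavedModel:
    """Hypothetical stand-in for a saved model with a version history."""

    def __init__(self):
        self.versions = []   # full history, oldest first
        self.active = None   # version currently used for scoring

    def add_version(self, name, activate=True):
        """Record a newly trained version and optionally make it active."""
        self.versions.append(name)
        if activate:
            self.active = name

    def rollback(self, name):
        """Reactivate an older version without deleting anything."""
        if name not in self.versions:
            raise ValueError(f"unknown version: {name}")
        self.active = name


model = ToySavedModel()
model.add_version("v1")
model.add_version("v2")   # retraining adds a version and activates it
model.rollback("v1")      # history is kept, only the active version changes
print(model.active, model.versions)
```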

Note

To learn more about creating the model, see the Quick Start | Dataiku for machine learning.