Tutorial | Recipe engines#

When certain conditions are met, Dataiku can offload computation to potentially more powerful recipe engines (such as Spark or SQL), which can result in faster build times for your data pipelines.

Knowing the basics of how Dataiku selects the optimal recipe engine can help guide your optimization strategy as you move from an experimental to a production workflow.

Get started#

Objectives#

In this tutorial, you will:

  • Learn how recipe engines fit into a Flow optimization strategy.

  • Understand how Dataiku selects the optimal recipe engine.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Dataiku 12.6 or later.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Recipe Engines.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

The project has three data sources:

  • tx: Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. It also indicates whether the transaction has been either:

    • Authorized (a score of 1 in the authorized_flag column)

    • Flagged for potential fraud (a score of 0)

  • merchants: Each row is a unique merchant with information such as the merchant’s location and category.

  • cards: Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Additional prerequisites#

This tutorial demonstrates the principles behind recipe engine selection on a Dataiku instance with both Spark and in-database (SQL) engines enabled.

Fully reproducing the steps shown here requires having one or both of these engines configured, along with compatible storage connections.

Tip

If these are not available to you, you can still grasp the important principles by reading along.

Recipe engines in Dataiku#

All recipes in Dataiku are executed with an engine. Depending on the infrastructure available to your instance, this may mean that in addition to the DSS engine, other engines — such as Spark or in-database (SQL) — may be available in certain cases.

In most situations though, from the perspective of a user, the choice of recipe engine is not a daily concern. Detailed knowledge of Spark or in-database computation is not required. This all happens “under the hood.” Your instance administrator will have already configured the default storage connections that are used as you build out your Flow. Based on these settings, Dataiku (not you) makes the optimal recipe engine selections.

However, as you work with larger datasets, or begin to build Flows frequently (such as with an automation scenario), you may encounter computational bottlenecks. In this case, reviewing the recipe engines used in the Flow can be one part of a successful optimization strategy.

Rather than manually changing recipe engines, you can learn a few basic principles that help ensure Dataiku selects the best possible engine. From a user’s perspective, two key considerations impact this selection:

  • What is the storage connection of the input and output datasets?

  • What data transformation does the recipe perform?

Let’s examine how these questions impact Dataiku’s selection of a recipe engine.

Check storage connections#

To ensure we are on the same page, let’s first review the storage connection of all downstream datasets in the Flow. In this case, downstream means all datasets other than the raw data included with this project.

  1. From the Flow menu in the top navigation bar, navigate to the Datasets page (g + d).

  2. Select all datasets by checking the box in the top left corner.

  3. Deselect the cards, tx, and merchants datasets.

    Note

    For uploaded files or files in folder datasets, we would need to use a Sync recipe to change their connection, but that won’t be required here.

  4. In the Actions tab on the right, click Change connection.

  5. Choose the Spark or SQL-compatible connection that you will use for this tutorial.

  6. Click Save.

Dataiku screenshot of the datasets page.

Note

If choosing a Spark-compatible connection, you’ll also need to select a compatible file format, such as parquet.
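If you’d like to double-check the result programmatically, the sketch below uses the Dataiku public Python API client (dataikuapi) to list each dataset in the project along with its storage connection. The host URL, API key, and project key are placeholders, and the exact return types of the listing methods can vary slightly across client versions, so treat this as a starting point rather than a definitive script.

```python
import dataikuapi

# Placeholders: replace with your instance URL, an API key, and your project key.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("RECIPE_ENGINES")

for item in project.list_datasets():
    name = item["name"]
    raw = project.get_dataset(name).get_settings().get_raw()
    # Managed SQL and cloud storage datasets usually expose their connection
    # under params.connection; uploaded-files datasets may not have one.
    connection = raw.get("params", {}).get("connection", "(no connection)")
    print(f"{name:30} {raw.get('type', ''):25} {connection}")
```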

Find the engine of a visual recipe#

As a first example, the inputs and the output of the Join recipe are stored in different connections. Let’s look at what engine Dataiku has selected for this recipe.

  1. Return to the Flow, and double-click to open the Join recipe in the Data ingestion Flow zone.

  2. Click the gear icon underneath the Run button at the bottom left.

  3. Click Show to view all engines, including those not selectable.

  4. Click Close without making any changes.

Dataiku screenshot of the dialog for selecting a recipe engine.

Dataiku selected the DSS engine. Do you know why yet? Before answering, let’s take a wider look at the other connections and engines used in the Flow.
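You can also read a recipe’s configured engine through the public API. The sketch below continues from the previous one (same client and project objects), and it assumes the Join recipe is named compute_tx_joined and that the selected engine appears under an engineType key in the recipe’s raw definition; both are assumptions to verify on your own instance.

```python
# Continuing with the client and project objects from the previous sketch.
# "compute_tx_joined" is a placeholder; use your Join recipe's actual name.
recipe = project.get_recipe("compute_tx_joined")
raw = recipe.get_settings().get_recipe_raw_definition()

# Assumption: the selected engine is stored under params.engineType.
# If the key is absent, inspect the raw definition to see where your
# version stores it.
engine = raw.get("params", {}).get("engineType", "(not explicitly set)")
print(f"Join recipe engine: {engine}")
```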

Explore the connections and recipe engines Flow views#

Two Flow views can be particularly helpful: connections and recipe engines.

  1. From the Flow, click to open the View menu at the bottom left.

  2. Select the Connections view.

Dataiku screenshot of the connections Flow view.

Note

The same Spark-compatible connection (in this case, an S3 bucket named dataiku-managed-storage) is used throughout the Data preparation Flow zone.

We can also use the recipe engines Flow view to see the engine of all recipes in the Flow.

  1. From the Flow, click to open the View menu at the bottom left.

  2. Select the Recipe engines view.

Dataiku screenshot of the recipe engine Flow view.
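For a rough programmatic counterpart to this Flow view, the sketch below loops over every recipe in the project and prints its type and configured engine, reusing the client, project, and engineType assumption from the earlier sketches.

```python
# A rough, programmatic equivalent of the recipe engines Flow view.
for item in project.list_recipes():
    name = item["name"]
    raw = project.get_recipe(name).get_settings().get_recipe_raw_definition()
    engine = raw.get("params", {}).get("engineType", "(not set)")
    print(f"{name:35} {raw.get('type', ''):20} {engine}")
```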

Let’s look closer at some of these selections.

Input and output datasets share the same connection#

For a recipe to use the Spark or SQL engine, its input and output datasets must share the same Spark or SQL-compatible connection. Let’s start in the Data ingestion Flow zone.

  • The Join recipe has files-in-folder input datasets and an output stored on a different connection. Accordingly, it falls back to the DSS engine.

  • Similarly, the Prepare recipe has an uploaded-files input dataset and an output stored on a different connection. It also must use the DSS engine.

Dataiku screenshot of the recipe engine Flow view highlighting different inputs and outputs.

Tip

If you wanted to optimize the Join recipe, you could use a Sync recipe to transfer the tx and merchants datasets to the same connection as the output. Feel free to try this on your own!

Conversely, all recipe inputs and outputs in the Data preparation Flow zone share the same connection. Those recipes (aside from the exceptions we’ll discuss next) can use the Spark or SQL engine, depending on your chosen storage location.

Dataiku screenshot of the recipe engine Flow view highlighting shared inputs and outputs.
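If you’d like to check this condition programmatically, the sketch below (still reusing the client and project from the earlier sketches) gathers the connection of every input and output dataset of a recipe and reports whether they all match. The nested inputs/outputs structure shown is typical of visual recipe definitions, but treat it as an assumption and adapt it to what you see in your own raw definitions; the recipe name is a placeholder.

```python
def dataset_connection(project, dataset_name):
    """Return the storage connection of a dataset, if it has one."""
    raw = project.get_dataset(dataset_name).get_settings().get_raw()
    return raw.get("params", {}).get("connection")

def shares_one_connection(project, recipe_name):
    """Check whether all of a recipe's input and output datasets share a connection."""
    raw = project.get_recipe(recipe_name).get_settings().get_recipe_raw_definition()
    refs = []
    for side in ("inputs", "outputs"):
        # Assumption: each side is a dict of roles, each holding an "items"
        # list whose entries reference datasets by name under "ref".
        for role in raw.get(side, {}).values():
            refs += [item["ref"] for item in role.get("items", [])]
    connections = {dataset_connection(project, ref) for ref in refs}
    return len(connections) == 1, connections

# Placeholder recipe name: replace with one from your own Flow.
eligible, connections = shares_one_connection(project, "compute_tx_prepared")
print(f"Single shared connection: {eligible} ({connections})")
```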

Supported data transformations#

The Window recipe in this Flow uses the DSS engine even though its input and output datasets share the same connection. Let’s take a closer look.

  1. Click to open the Window recipe.

  2. Click the gear icon at the bottom left to open the recipe engine dialog.

  3. Note why this recipe is forced to use the DSS engine.

  4. Close the dialog without making any changes.

Dataiku screenshot of the recipe engine setting of a Window recipe.

Although the Window recipe, like other visual recipes, is typically compatible with the Spark or SQL engines, a particular aspect of the data transformation in this recipe is not supported — in other words, an edge case!

Tip

Switch the window frame setting to Limit the number of preceding/following rows, and observe the change in engine selection. If we needed to limit the window based on a value range without the DSS engine, we would need to find another way to express this transformation.
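To make the distinction in the tip above concrete, here is a small pandas analogy (not Dataiku code) contrasting a frame limited to a number of preceding rows with a frame limited by a value range; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical transaction-like data.
df = pd.DataFrame(
    {"purchase_date": pd.date_range("2024-01-01", periods=6, freq="D"),
     "amount": [10, 20, 30, 40, 50, 60]}
)

# Row-based frame: the current row plus up to 2 preceding rows.
df["rows_frame_sum"] = df["amount"].rolling(window=3, min_periods=1).sum()

# Value-range frame: all rows whose purchase_date falls within the last 3 days.
df["range_frame_sum"] = df.rolling("3D", on="purchase_date")["amount"].sum()

print(df)
```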

Recipe engines with the Prepare recipe#

All processors in the Prepare recipe are compatible with the Spark engine. However, the ability of a Prepare recipe to use the in-database SQL engine depends on the specific processors in the script, as detailed in the reference documentation.

If you are using an SQL database, take a moment to see which processors are SQL-compatible in the Flow’s second Prepare recipe.

  1. Click to open the Prepare recipe in the Data preparation zone.

  2. To see which steps are SQL-compatible, observe how the script contains certain steps with green SQL icons (such as Round values) and other steps with red SQL icons (such as Convert currencies).

  3. Click the gear icon underneath the Run button for further details about unsupported processors that prevent the SQL engine from being chosen.

Dataiku screenshot of a Prepare recipe with SQL input and output.

Tip

Aside from expressing the incompatible transformations in other ways, one optimization strategy may be to divide the steps into back-to-back Prepare recipes — one that is SQL-compatible and one that is not.

What’s next?#

In this tutorial, we learned how Dataiku selects the optimal recipe engine, and how that knowledge can guide optimization strategies.

The next step to further optimizing the Flow may be to enable Spark or SQL pipelines depending on your organization’s infrastructure.

For now though, move ahead to the Data Quality & Automation course to learn about metrics, data quality rules, and scenarios.

See also

See the reference documentation to learn more about Execution engines.