Tutorial | Batch deployment basics

Get started

When you finish designing a project, it’s time to push it into production! Rather than real-time scoring, this tutorial implements a batch processing framework for production.

Objectives

In this tutorial, you will:

  • Create a project bundle on the Design node.

  • Deploy the bundle on the Automation node using the Project Deployer.

  • Manage bundle versions between Design and Automation nodes.

  • Monitor the project’s status on the Deployer.

Prerequisites

  • A business or enterprise license of Dataiku 12+. Discover licenses and the free edition are not compatible.

  • An Automation node connected to the Design node.

  • Users need to be able to create a project bundle. On Dataiku 12.1 and later, this requires the Write project content permission on the project used in this tutorial; on instances prior to 12.1, it requires the project admin permission.

  • Intermediate knowledge of Dataiku (recommended courses in the Advanced Designer learning path or equivalent).

Create the project

  1. From the Dataiku Design homepage, click +New Project > DSS tutorials > MLOps Practitioner > Batch Deployment.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Use case summary

The project has three data sources:

  • tx: Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. It also indicates whether the transaction has been authorized (a score of 1 in the authorized_flag column) or flagged for potential fraud (a score of 0).

  • merchants: Each row is a unique merchant with information such as the merchant’s location and category.

  • cards: Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Run the failing scenario on the Design node

The existing Data Refresh scenario found in the project attempts to rebuild a downstream dataset (tx_windows) only after verifying the data quality rules pass on an upstream dataset (tx_prepared). However, it’s currently failing.

  1. From the Jobs menu in the top navigation bar, go to the Scenarios page.

  2. Open the Data Refresh scenario, and click Run to manually trigger it.

  3. Go to the Last Runs tab, and see that it fails because it cannot verify a data quality rule.

  4. Click the link in the failed step to view the Data Quality tab of the tx_prepared dataset.

Dataiku screenshot of the last runs tab of a failing scenario.

Tip

Normally we’d want to fix a failing scenario before deploying the project into production. In this case though, the error will be instructive. Let’s move the project into production!

Create a bundle and publish it to the Deployer

See a screencast covering similar steps to those required here.

The first step is to create a bundle from the project found in the Design node (the development environment).

  1. From the More Options (…) menu in the top navigation bar, choose Bundles.

  2. Click + New Bundle.

Dataiku screenshot of the page to create a new bundle.

Add additional content to the bundle

A bundle acts as a consistent packaging of a complete Flow. By default, it includes only the project metadata. As a result, all datasets will come empty, and models will come untrained. However, depending on the use case, we can choose to include additional content such as datasets, managed folders, and saved models.

Unlike in most real-life projects that would be connected to some kind of database or cloud storage, our initial datasets are uploaded files and managed folders. Therefore, they won’t be re-computed from production sources. To access these files in the production environment, we’ll also need to include them in the bundle.

  1. Provide the bundle ID v1.

  2. In the Additional Content section, to the right of Datasets, click Add. Choose cards.

  3. To the right of Managed folders, click Add. Choose tx.

  4. Click Create.

Dataiku screenshot of the bundle creation page showing input data included.

Important

Unlike tx and cards, the starting merchants dataset originates from a Download recipe, and hence Dataiku is able to recompute it in the production environment.

If the project included a saved model that we wanted to use for batch scoring (or real-time inference), we’d also need to include it in the bundle.

Publish the bundle to the Deployer

The project on the Design node now includes a bundle. Although we could download this file and manually upload it to the Automation node, the strongly preferred method is to use the Project Deployer because it centralizes the history of all deployments.

  1. From the Bundles page of the project on the Design node, select the v1 bundle.

  2. Click Publish on Deployer, and then confirm your choice in the dialog that appears.

Dataiku screenshot of the dialog for publishing a bundle to the Deployer.
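For teams that later script this handoff, the same two actions can be sketched with the public dataikuapi Python client. This is an illustrative sketch, not a definitive recipe: the project handle would come from dataikuapi.DSSClient(host, api_key).get_project(...), and the method names should be verified against your version’s API documentation.

```python
# Illustrative sketch only: bundle creation and publication scripted
# against a dataikuapi DSSProject handle, e.g. obtained with
#   dataikuapi.DSSClient(host, api_key).get_project("YOUR_PROJECT_KEY")
# Verify method names against your version's API documentation.

def create_and_publish_bundle(project, bundle_id):
    project.export_bundle(bundle_id)   # create the bundle on the Design node
    project.publish_bundle(bundle_id)  # push it to the Project Deployer
    return bundle_id
```

Passing the project handle in as an argument keeps the sketch independent of any one instance; the additional bundle content (datasets, managed folders) remains configured on the project itself, as in the steps above.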

Create and manage deployments

Until now, your experience with Dataiku may have been limited to the Design node. However, as an end-to-end platform, Dataiku includes other nodes (Automation, API, Govern) for production-oriented tasks.

Explore the Deployer

Before actually deploying the bundle to the Automation node, let’s take a look at how the Deployer (for this tutorial, specifically the Project Deployer) fits into this process.

First, there are two modes for installing the Deployer:

  • One is a local Deployer that sits on top of either the Design or Automation node and requires no further setup. Dataiku Cloud users employ this option.

  • The other is a standalone or remote Deployer for infrastructures with multiple Design and/or Automation nodes.

Regardless of which setup is found on your instance, the process for using the Project Deployer is the same.

  1. If you are using a remote Deployer, make sure you are connected to this instance. You’ll need credentials from your instance administrator.

  2. For either setup, from the bundle details page on the Design node, click Open in Deployer. If you’ve closed this dialog, just click Deployer where the publishing date is recorded.

Dataiku screenshot of the bundles page showing a bundle published to the Deployer.

Tip

You can also always navigate to the Deployer by choosing Local/Remote Deployer in the Applications menu from the top navigation bar.

Although we now have pushed a bundle to the Deployer, we don’t yet have an actual deployment. Before creating a new deployment, take a moment to explore the Deployer. If you are using a remote Deployer, note the change in the instance URL.

  1. From the Project Deployer, click on Deployer at the top left to see how this node has separate components for deploying projects, deploying API services, and monitoring.

  2. Click on Deploying Projects to view current deployments, projects that have published bundles, and available infrastructures.

Dataiku screenshot of the Deployer home.

Tip

The Deployer for your instance might already have projects and/or API services belonging to colleagues and other teams!

Create a new deployment

Thus far, we’ve published a bundle from the Design node to the Project Deployer. To create an active deployment, we still need to push the bundle from the Project Deployer to an Automation node.

  1. If not already there, click Deployments in the top navigation bar of the Deployer to view all project deployments on the instance.

  2. In the Bundles to deploy panel on the left of the Project Deployer, find the v1 bundle for this project, and click Deploy.

    Caution

    If the Deploy button is not clickable, it means there is no infrastructure ready for deployment. Dataiku Cloud users need to add the Automation node extension. Self-managed users (or their instance admins) should consult the reference documentation on Deployment infrastructures.

  3. Choose a Target infrastructure. This will vary depending on the infrastructure available to your organization.

  4. Leave the default Deployment ID, which takes the form of <PROJECTKEY>-on-<infrastructure>.

  5. Click Create, and then Deploy and Activate.

Dataiku screenshot of the dialog for creating a new deployment.
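Steps 3 through 5 can likewise be sketched with the dataikuapi client. Everything here is hypothetical scaffolding: the deployer handle would come from client.get_projectdeployer(), the infrastructure and project names are placeholders, and the calls should be checked against your version’s API docs.

```python
# Hypothetical sketch of steps 3-5 using a dataikuapi Project Deployer
# handle (client.get_projectdeployer()); infrastructure and project names
# are placeholders. Verify the calls against your version's API docs.

DEFAULT_ID_PATTERN = "{project_key}-on-{infra}"  # mirrors the dialog's default ID

def deploy_bundle(deployer, project_key, infra_id, bundle_id):
    deployment_id = DEFAULT_ID_PATTERN.format(project_key=project_key, infra=infra_id)
    deployment = deployer.create_deployment(
        deployment_id, project_key, infra_id, bundle_id
    )
    deployment.start_update().wait_for_result()  # "Deploy and Activate"
    return deployment_id
```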

Important

Just as when importing a project zip file to a Dataiku instance, you may see warnings about missing plugins or plugin version mismatches. If any of these plugins are used in the project at hand, you’ll want to closely review them. The same can be said for missing connections. See the article Concept | Automation node preparation for more details.

Manage deployment settings

Your project is now running on the Automation node! You can click to open it from the panel on the left. Before doing so though, it is helpful to understand what deployment settings can be controlled from the Project Deployer itself.

Within the Project Deployer, we can monitor the status of deployments, such as when it was created (and by whom), when it was updated (and by whom), and recent scenario runs.

Dataiku screenshot of the Status tab of a deployment on the Project Deployer.

Note

The reference documentation also covers how to modify deployment settings from the Project Deployer.

Remap connections

In the Settings tab of a deployment, you can configure criteria around variables, connections, code environments, and scenarios.

Connection remapping, for example, is one setting that will commonly need to be configured. Organizations often maintain different databases for development and production environments. If this is the case, you’ll need to remap the source connections used on the Design node to the target connections that should be used on the Automation node.

  1. Within the new deployment, navigate to the Settings tab.

  2. Navigate to the Connections panel on the left.

Dataiku screenshot of the Connections page of a deployment's settings.

Tip

Assuming you are using the same data sources for development and production environments for this tutorial, no action is required here.

Manage scenario auto-triggers

To ensure scenarios never run unexpectedly, all scenarios in a new deployment are deactivated by default — regardless of their settings on the Design node.

  1. Remaining within the Settings tab of the deployment, navigate to the Scenarios panel.

    Here you can enable, disable, or even override the behavior defined on the Automation node, giving you control over how scenarios are managed for a deployed project.

  2. Leave the default setting in place (the leftmost option that does not override the behavior defined on the Automation node).

Dataiku screenshot of the Scenarios page of a deployment.

Activate the scenario on the Automation node project

Finally, let’s check out the project on the Automation node.

  1. Ensure you are connected to a running Automation node.

  2. Navigate back to the Status tab of the deployment.

  3. Click to open the project on the Automation node.

Dataiku screenshot showing where to find the Automation node project.

Tip

Keep your Design, Deployer, and Automation nodes open in separate browser tabs!

Once on the Automation node, the project should look quite familiar. Confirm a few points:

  • The project homepage reports what bundle is running and when it was activated.

  • The scenario is not yet active.

Let’s activate the scenario in the Automation node project!

  1. While in the Automation node version of the project, open the Data Refresh scenario.

  2. In the Settings tab, turn on the time-based trigger (every 1 minute) and the auto-trigger setting for the scenario.

  3. Click Save to activate it.

Dataiku screenshot of activated triggers for a scenario on the Automation node.
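For reference, the same toggle can be sketched through the dataikuapi client. The attribute names below reflect the public API as we understand it; verify them against your version’s documentation, and note that the scenario handle (e.g. from project.get_scenario(...)) is assumed.

```python
# Sketch only: toggling a scenario's auto-triggers through an assumed
# dataikuapi DSSScenario handle. Check the attribute names against your
# version's documentation before relying on this.

def set_auto_triggers(scenario, active):
    settings = scenario.get_settings()
    settings.active = active  # the auto-trigger switch, as in the UI
    settings.save()
    return settings.active
```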

Monitor a batch deployed project

Now that we have an active scenario running in production, we need to monitor it. In addition to the scenario’s Last Runs tab, let’s examine additional monitoring tools at our disposal.

At the project level

The most granular view is within the project itself.

  1. From the Jobs menu of the Automation node project, select Automation Monitoring.

  2. Click Load Scenario Runs.

  3. Explore the Daily Summary and Timeline tabs using the filters to view the progress of any scenarios within this project.

Dataiku screenshot of the Automation monitoring page.

At the Project Deployer level

Instead of checking the progress on the Automation node, we can also monitor the progress of deployments from the Project Deployer.

  1. Return to the Project Deployer.

  2. From the Deployments tab, we can view information about the status of any deployment, including scenario runs.

  3. Select the current deployment to see an overview in the Status tab.

Dataiku screenshot of the Status tab of a deployment showing a scenario run.

Unified Monitoring

It can also be helpful to have an instance-wide view of the health status of all deployed projects (as well as API services).

  1. From the Project Deployer, click Deployer at the top left to go to the Deployer home.

  2. Select Monitoring.

  3. Navigate to the Dataiku Projects tab, since projects are our focus at the moment.

  4. See that one project has an OK deployment status, but an error for its execution status. This registers one project in the error category in terms of global status.

Dataiku screenshot of the projects tab of the unified monitoring page.

Tip

Depending on how quickly you’ve made it through these steps, your Unified Monitoring screen may not yet show an error. The default synchronization interval is five minutes. Learn more in the reference documentation on Unified Monitoring.

Since we can see the scenario failing, let’s deactivate the automated trigger from the Project Deployer.

  1. From the Dataiku Projects tab of the Unified Monitoring page, click on the row for the failing deployment to see more information.

  2. Click on the Deployment.

    Dataiku screenshot of a deployment on the unified monitoring page.
  3. Navigate to the Settings tab.

  4. Go to the Scenarios panel.

  5. Click to Disable automatic triggers.

  6. Click Save and Update, and then Confirm.

    Dataiku screenshot of the scenarios panel of a project deployment.

Tip

Check back on the homepage of the Automation project to confirm the scenario is no longer active.

Version a deployed project

As you monitor the health of deployments, you’ll need to deploy updated versions of projects over time — especially since this one is already failing!

Where to make changes to a project

When it is necessary to make a change to a deployed project, it’s critical to make all such changes in the development environment (the Design node), and then push a new bundle to the production environment (the Automation node).

It may be tempting just to make a quick change to the project on the Automation node, but you should avoid this temptation, as the project in the production environment would no longer be synced with its counterpart in the development environment.

Consider a situation where you want to revert to an earlier version of the project. If you’ve made changes on the Automation node, those changes will be lost. Accordingly, actual development should always happen on the Design node, and new versions of bundles should be pushed from there.

Fix the failing scenario

Taking this advice, let’s return to the Design node project. There are actually three changes that should be made.

Edit the data quality rule

We first need to fix the failing data quality rule. Instead of an error, let’s reduce it to a warning.

  1. On the Design project, navigate to the Data Quality tab of the tx_prepared dataset.

  2. Click the pencil to edit the “Record count is above 50000” rule.

  3. Check the box to Auto-compute metric.

  4. Turn off the Min setting, and turn on a Soft min of 50000 to produce a warning instead of an error.

  5. Click Run Test to confirm the warning.

  6. Click Save.

Dataiku screenshot of a data quality rule.
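The behavioral difference between a hard Min and a Soft min can be illustrated with a small sketch. This is our own simplification of the rule's outcomes, not Dataiku’s implementation:

```python
# Illustrative only (not Dataiku internals): how a record-count rule's
# outcome differs between a hard minimum (error) and a soft minimum
# (warning).

def check_record_count(count, soft_min=None, hard_min=None):
    if hard_min is not None and count < hard_min:
        return "ERROR"    # fails the scenario step
    if soft_min is not None and count < soft_min:
        return "WARNING"  # scenario continues, with a warning
    return "OK"

# With the tutorial's edit (Min off, Soft min = 50000), a short dataset
# now yields a warning instead of an error:
check_record_count(42000, soft_min=50000)  # returns "WARNING"
```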

Add a Compute metrics step to the scenario

We can also add a step in the scenario to explicitly compute metrics.

  1. Navigate to the Steps tab of the Data Refresh scenario.

  2. Click Add Step.

  3. Select Compute metrics from the list of steps.

  4. Drag the metrics step to the first position.

  5. Click Add Dataset to Compute > tx_prepared > Add.

  6. Click Save.

  7. As a matter of good practice, click Run to make sure it returns the expected warning result.

Dataiku screenshot of a compute metrics step in a scenario.
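If you later script the good-practice check in step 7, a sketch with the dataikuapi client might look like the following. Method and attribute names here should be checked against your version’s documentation, and the scenario handle is assumed to come from project.get_scenario(...).

```python
# Sketch only (verify names against your version of dataikuapi): run the
# scenario and report its outcome, which after the fix should be a
# warning rather than a failure.

def run_scenario_and_get_outcome(scenario):
    run = scenario.run_and_wait()  # blocks until the scenario run finishes
    return run.outcome             # e.g. "SUCCESS", "WARNING", "FAILED"
```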

Enable the scenario trigger

Finally, let’s turn on the time-based trigger, but not the auto-trigger for the scenario itself. This way, once we activate the scenario’s auto-triggers in the production environment, it will begin running.

  1. Navigate to the Settings tab of the Data Refresh scenario.

  2. Turn On the Time-based trigger.

  3. Verify the Auto-triggers remain Off.

  4. Click Save.

Dataiku screenshot of scenario trigger settings.

Create a second bundle

Now let’s demonstrate the process for updating an existing deployment with a new bundle.

  1. From the Bundles page on the Design node project, click + New Bundle.

  2. Name it v2.

  3. In the release notes, add fixed scenario.

  4. Click Create.

Dataiku screenshot of the second version of the project bundle.

Note

Note how when creating the second bundle, the configuration of the previous one is inherited. In this case, the uploaded dataset and the managed folder are already included.

Deploy the new bundle

The process for deploying the new bundle is the same as for the first one.

  1. Click on the newly-created v2 bundle, and click Publish on Deployer.

  2. Confirm that you indeed want to Publish on Deployer.

  3. Click to Open in Deployer to view the bundle details on the Deployer.

  4. Once on the Deployer, click Deploy on the v2 bundle.

    Dataiku gives the option to create a new deployment or update the existing one.

  5. Since this is a new version of an existing deployment, verify Update is selected, and click OK.

  6. Click OK again to confirm the deployment you want to edit.

Dataiku screenshot for updating a deployed bundle.

We’re not done yet!

  1. Navigate to the Status tab of the deployment, and note how Dataiku warns that the active bundle on the Automation node does not match the configured bundle.

  2. Click the green Update button to deploy the new bundle. Then Confirm.

  3. Navigate to the Deployments tab of the Project Deployer to see the new bundle as the currently deployed version of this project.

Dataiku screenshot of the Deployer showing a second version of the deployment.
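The warning in step 1 reflects the distinction between the configured bundle and the active bundle. A toy model of that state, our own illustration rather than Dataiku internals:

```python
# Toy model (not Dataiku internals) of the configured-vs-active bundle
# state that the deployment's Status tab warns about.

def configure_bundle(deployment, bundle_id):
    # Choosing "Update" in the dialog only changes the configuration.
    deployment["configured_bundle"] = bundle_id

def update_deployment(deployment):
    # Clicking Update is what activates the configured bundle.
    deployment["active_bundle"] = deployment["configured_bundle"]

deployment = {"configured_bundle": "v1", "active_bundle": "v1"}
configure_bundle(deployment, "v2")  # Status tab now warns: active != configured
update_deployment(deployment)       # v2 becomes the active bundle
```

The same two-step model explains reverting later in this tutorial: configuring an older bundle and updating makes it active again.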

Activate a scenario from the Deployer

Previously we activated the scenario directly from the Automation node project. Now let’s control it from the Project Deployer.

  1. Navigate to the Settings tab of the deployment.

  2. Uncheck the box for Disable automatic triggers.

  3. Click Activate All to enable the auto-triggers for all scenarios.

  4. Click Save and Update and then Confirm.

Dataiku screenshot of the scenarios panel of a deployment.

Tip

Once you’ve done this, verify the v2 scenario runs and produces a warning. You can check this on the Automation project, the Project Deployer, or on the Unified Monitoring page (depending on the synchronization interval).

Revert to a previous bundle

It’s also important to be able to revert to an earlier version, should a newer bundle not work as expected. Let’s demonstrate that now.

  1. From the Deployments tab of the Deployer, find the project in the left-hand panel.

  2. Click Deploy next to the v1 bundle.

  3. With Update selected, click OK, and confirm this is correct.

  4. Now on the Settings tab with v1 as the source bundle, click the green Update button, and Confirm the change.

Dataiku screenshot of the dialog to revert a bundle.

Important

If you return to the Status tab of this deployment, or open the project homepage on the Automation node, you’ll see that v1 is once again the active bundle running in production.

Before signing off, be sure to disable automatic triggers for this deployment either from the Project Deployer or the Automation project!

See also

See the reference documentation to learn more about reverting bundles.

What’s next?

Congratulations! To recap, in this tutorial, you:

  • Created a project bundle on the Design node.

  • Published a bundle to the Automation node via the Deployer.

  • Activated (and disabled) a scenario to run on the Automation node.

  • Saw where to monitor the health of deployments.

  • Switched bundle versions within a deployment.

Now that you have seen the batch processing framework for production, your next step may be to examine the real-time API scoring method of production presented in the API Deployment course.

See also

For more information on batch processing, please refer to the reference documentation on Production deployments and bundles.