Tutorial | Batch deployment basics (MLOps part 2)

This tutorial demonstrates the batch processing framework; later tutorials will cover the real-time scoring approach. For our use case, think of a monthly credit card fraud report.

Objectives

In this tutorial, you will:

  • Create a bundle from a project on the Design node.

  • Push the bundle to the Project Deployer.

  • Deploy and activate the bundle on the Automation node.

  • Modify the original project on the Design node and push a new bundle version to the Automation node.

Starting here?

If you skipped the previous sections and just want to focus on batch deployment, you need to:

  • Satisfy the technical prerequisites.

  • Create the project (+New Project > DSS Tutorials > MLOps > MLOps (Tutorial)).

  • Build the Flow.

Production concepts recap

Before pushing our project into production, let’s consider the goal of deploying to a dedicated environment. In essence, we need an environment that is:

  • repeatable and reliable;

  • safe (alteration is highly limited);

  • connected to production sources.

Note

Recall from the Production Concepts course that a development environment, as opposed to a production environment, is a sandbox for experimental analyses where failure is expected.

Batch deployment steps in action

This tutorial covers all of the steps for batch deployment in detail. In addition, you might want to watch a screencast (recorded on version 11.0) that achieves the same objective on a similar project from beginning to end.

Create the bundle

The first step is to create the bundle from the project found in the development environment (the Design node).

  • To package the Flow into a bundle, from the More Options (…) menu in the top navigation bar, choose Bundles.

  • Click + New Bundle.

Note

You can learn more about project bundles in this batch deployment concept article.

Add additional content to the bundle

A bundle acts as a consistent packaging of a complete Flow. By default, it includes only the project metadata, so all datasets will arrive empty and models untrained. However, depending on the use case, we can choose to include additional datasets, managed folders, saved models, or model evaluation stores.

Let’s add the saved model trained on the Design node to the bundle so it can be used for scoring new production data on the Automation node.

  • Name the bundle v1.

  • In the Additional Content section, to the right of Saved models, click +Add.

  • Choose Predict authorized_flag (binary).

Unlike most real-life projects, which would be connected to some kind of database, our initial datasets are uploaded files. Therefore, they won’t be re-computed from production sources. To access these files in the production environment, we’ll also need to include them in the bundle.

  • In the Additional Content section, to the right of Datasets, click Add. Choose cardholder_info_csv and merchant_info_csv.

  • To the right of Managed folders, click Add. Choose transaction_data and new_transaction_data.

  • Click Create.

Dataiku screenshot of the bundle creation page showing a saved model and input data included.

Note

You can learn more about creating a bundle in the reference documentation.
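If you prefer to script this step, the same bundle can be created with Dataiku’s public Python client (the dataikuapi package). The sketch below is illustrative only: the host URL, API key, and project key are placeholders to adapt to your instance, and it assumes the additional content has already been configured from the UI as described above.

    import dataikuapi

    # Connect to the Design node (the URL and API key are placeholders)
    design_client = dataikuapi.DSSClient("https://dss-design.example.com", "DESIGN_NODE_API_KEY")
    project = design_client.get_project("MLOPS")  # replace with your project key

    # Create (export) the v1 bundle on the Design node; the additional content
    # configured in the UI (saved model, uploaded datasets, managed folders) is reused
    project.export_bundle("v1")

    # List the bundles that now exist on the Design node
    print(project.list_exported_bundles())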

Publish the bundle to the Deployer

We now have a project bundle. We could download this file ourselves and upload it to the Automation node. Instead of this manual process, however, we are going to use the Project Deployer, which centralizes the history of all deployments and so is the strongly preferred method.

  • From the Bundles page of the project on the Design node, select the v1 bundle.

  • Click Publish on Deployer, and confirm your choice in the dialog that appears.

Dataiku screenshot of the dialog for publishing a bundle to the Deployer.
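Publishing can likewise be scripted with the same Python client, reusing the project handle from the previous sketch (again, a minimal illustration rather than a definitive workflow):

    # Publish the v1 bundle from the Design node to the Project Deployer
    # (requires that the Design node is connected to a Deployer)
    project.publish_bundle("v1")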

Explore the Deployer

Before actually deploying the bundle to the Automation node, let’s take a look at what the Deployer, and in particular the Project Deployer, offers.

There are actually two modes for installing the Deployer:

  • One is a local Deployer that sits on top of either the Design or Automation node and requires no further setup.

  • The other is a standalone or remote Deployer for infrastructures with multiple Design and/or Automation nodes.

Regardless of which setup your instance administrators have chosen, the process for using the Project Deployer is the same.

  • If you are using a remote Deployer, make sure you are connected to this instance. (You’ll need credentials from your instance administrator.)

  • Then for either setup, from the bundle details page on the Design node, click Open in Deployer. If you’ve closed this dialog, just click Deployer where the publishing date is recorded.

Dataiku screenshot of the bundles page showing a bundle published to the Deployer.

Note

You can also always navigate to the Deployer by choosing Local/Remote Deployer in the Applications menu from the top navigation bar.

Before actually creating a new deployment, take a moment to explore the Deployer. If you are using the remote Deployer, note the change in the instance URL.

  • From the Project Deployer, click Deployer at the top left to see how this node has separate components for deploying projects and API services.

  • Click on Projects to view current deployments, projects that have published bundles, and available infrastructures.

Dataiku screenshot of a bundle on the Project Deployer.

Create a new deployment

Thus far, we’ve published a bundle from the Design node to the Project Deployer. Creating an active deployment, however, requires a second step: we still need to push the bundle from the Project Deployer to an Automation node.

  • First click Deployments in the top navigation bar to view all deployments on the instance.

  • In the Bundles to deploy panel on the left of the Project Deployer, find the v1 bundle for this project, and click Deploy.

    Warning

    If the Deploy button is not clickable, it means there is no infrastructure ready for deployment. Please contact your instance administrator to create one.

  • Choose a Target infrastructure. This will vary depending on the infrastructure available to your organization.

  • Leave the default Deployment ID, which takes the form of <PROJECTKEY>-on-<infrastructure>.

  • Click Create, and then Deploy and Activate.

Dataiku screenshot of the dialog for creating a new deployment.

Note

Just as when importing a project to an instance (such as from a project export), you may see warnings about missing plugins or plugin version mismatches. If any of these plugins are used in the project at hand, you’ll want to review them closely.

The same can be said for missing connections. See the article on preparing the Automation node for more details.
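For reference, the deployment itself can also be created programmatically through the Project Deployer’s Python API. In the hedged sketch below, the Deployer URL, API key, and infrastructure id (prod-auto) are placeholders to adapt to your instance, and the argument order follows the dataikuapi documentation for recent versions; verify it against your own installation.

    import dataikuapi

    # Connect to the node hosting the Project Deployer (URL and key are placeholders)
    deployer_client = dataikuapi.DSSClient("https://dss-deployer.example.com", "DEPLOYER_API_KEY")
    deployer = deployer_client.get_projectdeployer()

    # Create the deployment: deployment id, published project key,
    # target infrastructure id, bundle id
    deployment = deployer.create_deployment("MLOPS-on-prod-auto", "MLOPS", "prod-auto", "v1")

    # "Deploy and Activate": push the bundle to the Automation node and wait
    update = deployment.start_update()
    update.wait_for_result()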

Manage deployment settings

Your project is now running on the Automation node! You can click to open it from the panel on the left. Before doing so, though, it’s helpful to understand which deployment settings can be controlled from the Project Deployer itself.

Within the Project Deployer, we can monitor the status of each deployment, such as when it was created (and by whom), when it was last updated (and by whom), and its recent scenario runs.

Dataiku screenshot of the Status tab of a deployment on the Project Deployer.

Note

The reference documentation also covers how to modify deployment settings from the Project Deployer.

Remap connections

In the Settings tab of a deployment, you can further configure options for variables, connections, code environments, and scenarios.

Connection remapping, for example, is one setting that will commonly need to be configured. In many cases, organizations maintain different databases for development and production environments. If this is the case, you’ll need to remap the source connections used on the Design node to the target connections that should be used on the Automation node.

  • Within the new deployment, navigate to the Settings tab, and then the Connections panel within that.

Dataiku screenshot of the Connections page of a deployment's settings.

Note

If, for the purposes of this tutorial, you are using the same database for development and production environments, there is nothing you need to do here.

See this article to learn more about remapping connections in a Dataiku instance.

Manage scenario auto-triggers

To ensure scenarios never run unexpectedly, all scenarios in a new deployment are deactivated by default—regardless of their settings on the Design node.

  • Still within the Settings tab of the deployment, navigate to the Scenarios panel.

Here you can enable, disable, or even override the behavior defined on the Automation node, giving you control over how to manage the scenarios for a deployed project.

  • Leave the default setting in place (the leftmost option that does not override the behavior defined on the Automation node).

Dataiku screenshot of the Scenarios page of a deployment.

View the Automation node project

Finally, let’s check out the project on the Automation node.

  • Ensure you are connected to the Automation node.

  • Navigate back to the Status tab of the deployment, and click to open the project on the Automation node.

Dataiku screenshot showing where to find the Automation node project.

Once on the Automation node, the project should look quite familiar. Confirm a few points:

  • The project homepage reports what bundle is running and when it was activated.

  • The scenario auto-triggers are turned off.

Rather than rely on a trigger, let’s manually run a scenario to confirm it is working.

  • While in the Automation node version of the project, open the Build Flow scenario (a dummy scenario that just builds the test_scored dataset), and click Run to manually start it.
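If you ever need to script this kind of manual run, the Python client can trigger it against the Automation node. In the sketch below, the Automation node URL, API key, and scenario id are assumptions to adapt to your setup.

    import dataikuapi

    # Connect to the Automation node (URL and API key are placeholders)
    auto_client = dataikuapi.DSSClient("https://dss-automation.example.com", "AUTOMATION_NODE_API_KEY")
    auto_project = auto_client.get_project("MLOPS")  # replace with your project key

    # Run the Build Flow scenario and block until it finishes
    # (the scenario id is an assumption; check the scenario's settings for the real id)
    auto_project.get_scenario("BUILD_FLOW").run_and_wait()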

Instead of checking on this run from the Automation node, let’s monitor its progress from the Project Deployer.

  • Return to the Deployer, and open the same deployment. Note how the progress of the most recent scenario run is reported.

Dataiku screenshot of the Status tab of a deployment showing a scenario run.

Versioning a deployed project

You have successfully created a project bundle in the Design node and, via the Deployer, published it to the Automation node. Of course, this is not the end of the story! Data science is an iterative process. You’ll need to deploy updated versions of this project over time.

When it is necessary to make a change to a deployed project, it’s critical to make all such changes in the development environment (the Design node), and then push an updated bundle to the production environment (the Automation node).

It may be tempting to make a quick change directly to the project on the Automation node, but you should resist: the project in the production environment would no longer be synced with its counterpart in the development environment.

Consider a situation where something went wrong, and you want to revert to an earlier version of the project. If you’ve made changes on the Automation node, those changes will be lost. Accordingly, actual development should always happen on the Design node, and new versions of bundles should be pushed from there.

Create a second bundle

Let’s demonstrate the process for updating a deployment with a new bundle.

  • Return to the original project on the Design node.

  • Any change will do; for example, change the trigger of the Build Flow scenario to run at 3 AM instead of 2 AM.

  • If a trigger doesn’t exist, add any time-based trigger. Save your changes.

  • From the Bundles page, click + New Bundle.

  • Name it v2.

  • In the release notes, add “adjusted scenario trigger timing”.

  • Click Create.

Dataiku screenshot of the second version of the project bundle.

Note

Note how, when creating the second bundle, the configuration of the previous one is inherited. In this case, the saved model, uploaded datasets, and managed folders are already included.

Deploy the new bundle

The process for deploying the new bundle is the same as for the first one.

  • Click on the newly-created v2 bundle, and click Publish on Deployer.

  • Confirm that you indeed want to Publish on Deployer.

  • Click to Open in Deployer to view the bundle details on the Deployer.

  • Once on the Deployer, click Deploy on the v2 bundle.

Dataiku gives the option to create a new deployment or update the existing one.

  • Since this is a new version of an existing deployment, make sure Update is selected, and click OK.

  • Click OK again to confirm the deployment you want to edit.

Dataiku screenshot for updating a deployed bundle.

We’re not done yet!

  • Navigate to the Status tab of the deployment, and note how Dataiku warns that the active bundle on the Automation node does not match the configured bundle.

  • Click the green Update button to deploy the new bundle. Then Confirm.

  • Navigate to the Deployments tab of the Project Deployer to see the new bundle as the currently deployed version of this project.

Dataiku screenshot of the Deployer showing a second version of the deployment.
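Scripted with the same Python client as before, the whole update boils down to publishing the new bundle and pointing the existing deployment at it. Treat this as a sketch under the same assumptions as the earlier snippets; in particular, the raw settings key (bundleId) is an assumption to verify against your own Deployer.

    # On the Design node: create and publish the new bundle
    project.export_bundle("v2")
    project.publish_bundle("v2")

    # On the Project Deployer: point the existing deployment at v2 and push it out
    deployment = deployer.get_deployment("MLOPS-on-prod-auto")
    settings = deployment.get_settings()
    settings.get_raw()["bundleId"] = "v2"   # the raw key name is an assumption
    settings.save()
    deployment.start_update().wait_for_result()

Reverting, covered next, is the identical call with the bundle id set back to v1.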

Revert to a previous bundle

It’s also important to be able to revert to an earlier version, should a newer bundle not work as expected. Let’s demonstrate that now.

  • From the Deployments tab of the Deployer, find the project in the left-hand panel.

  • Click Deploy next to the v1 bundle.

  • With Update selected, click OK, and confirm this is correct.

  • Now on the Settings tab with v1 as the source bundle, click the green Update button, and Confirm the change.

If you return to the Status tab of this deployment, or open the project homepage on the Automation node, you’ll see that v1 is once again the active bundle running in production.

Note

See the reference documentation to learn more about reverting bundles.

Next steps

Congratulations on putting your first project bundles into production! Under this batch scoring framework, our project will be able to run in a dedicated production environment.

Once we have an active batch deployment, we might want to automate bundle updates. Let’s do that next!

Note

For more information, consult the reference documentation on project deployments and bundles.