Tutorial | Batch deployment basics#

Get started#

When you finish designing a project, it’s time to push it into production!

Note

This tutorial demonstrates the batch processing framework; see Tutorial | Real-time API basics to learn the real-time scoring approach.

Objectives#

In this tutorial, you will:

  • Create a project bundle on the Design node.

  • Deploy the bundle on the Automation node.

  • Manage bundle versions between Design and Automation nodes.

  • Design a scenario that creates a new bundle and updates an existing project deployment when a certain condition is met.

Prerequisites#

  • A business or enterprise license of Dataiku 12+. Discover licenses and the free edition are not compatible.

  • An Automation node connected to the Design node. Dataiku Cloud users can follow instructions for adding the Automation node extension. Administrators of self-managed Dataiku instances should follow the reference documentation.

  • To create a project bundle on Dataiku 12.1+, you need the Write project content permission on the project used in this tutorial. On instances prior to 12.1, you need the project admin permission.

  • The Reverse Geocoding plugin (version 2.1 or above) installed on your Dataiku instance. This plugin is installed by default on Dataiku Cloud.

  • Broad knowledge of Dataiku (ML Practitioner + Advanced Designer level or equivalent).

  • You may also want to review this tutorial’s associated concept article.

Create the project#

We’ll start from a project that includes a basic classification model and a zone for scoring new, incoming data.

  1. From the Dataiku Design homepage, click +New Project > DSS tutorials > MLOps Practitioner > Batch Deployment.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.
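
If you’d rather script this initial build than click through the UI, a minimal sketch with the dataikuapi Python client could look like the following. The host URL, API key, and project key are placeholders you’d replace with your own values.

```python
import dataikuapi

# Placeholders: your Design node URL, a personal API key, and this project's key
client = dataikuapi.DSSClient("https://design-node.example.com", "YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_KEY")

# Force-build the scored dataset and everything upstream of it,
# roughly what Flow Actions > Build all does for this Flow.
# start_job() returns a handle; the build runs asynchronously.
job = project.start_job({
    "type": "RECURSIVE_FORCED_BUILD",
    "outputs": [{"id": "test_scored"}],
})
```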

Use case summary#

You’ll work with a simple credit card fraud use case. Using data about transactions, merchants, and cardholders, we have a Flow including a model that predicts which transactions should be authorized and which are potentially fraudulent.

  • A score of 1 for the target variable, authorized_flag, represents an authorized transaction.

  • A score of 0, on the other hand, represents a transaction that failed authorization.

Putting this model into production can enable two different styles of use cases commonly found in machine learning workflows:

  • Batch: A bank employee creates a monthly fraud report.

  • Real-time: A bank’s internal systems authorize each transaction as it happens.

Tip

This use case is just an example to practice monitoring and deploying MLOps projects into production. Rather than thinking about the data here, consider how you’d apply the same techniques and Dataiku features to solve problems that matter to you!

Production concepts recap#

Before pushing our project into production, let’s consider the goal of deploying to a dedicated environment. In essence, we need an environment that is:

  • Repeatable and reliable

  • Safe (alteration is highly limited)

  • Connected to production sources

Note

Recall from the Production Concepts course that a development environment, as opposed to a production environment, is a sandbox for experimental analyses where failure is expected.

Create a bundle and publish it to the Deployer#

The first step is to create a bundle from the project found in the development environment (the Design node).

  1. From the More Options (…) menu in the top navigation bar, choose Bundles.

  2. Click + New Bundle.

Add additional content to the bundle#

A bundle acts as a consistent packaging of a complete Flow. By default, it includes only the project metadata. As a result, all datasets will come empty, and models will come untrained. However, depending on the use case, we can choose to include additional datasets, managed folders, saved models, or model evaluation stores.

Unlike most real-life projects, which would be connected to some kind of database, this project’s initial datasets are uploaded files. Therefore, they won’t be re-computed from production sources. To access these files in the production environment, we’ll also need to include them in the bundle.

Similarly, let’s add the saved model trained on the Design node to the bundle so it can be used for scoring new production data on the Automation node.

  1. Name the bundle v1.

  2. In the Additional Content section, to the right of Datasets, click Add. Choose cardholder_info_csv and merchant_info_csv.

  3. To the right of Managed folders, click Add. Choose transaction_data and new_transaction_data.

  4. To the right of Saved models, click +Add. Choose Predict authorized_flag (binary).

  5. Click Create.

Dataiku screenshot of the bundle creation page showing a saved model and input data included.

Publish the bundle to the Deployer#

The project on the Design node now includes a bundle. Although we could download this file and manually upload it to the Automation node, the strongly preferred method is to use the Project Deployer because it centralizes the history of all deployments.

  1. From the Bundles page of the project on the Design node, select the v1 bundle.

  2. Click Publish on Deployer, and then confirm your choice in the dialog that appears.

Dataiku screenshot of the dialog for publishing a bundle to the Deployer.
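
These two steps (creating a bundle and publishing it to the Deployer) can also be scripted. Here is a minimal sketch with the dataikuapi client, again with placeholder host, API key, and project key; the additional content configured on the Bundles page is reused automatically.

```python
import dataikuapi

client = dataikuapi.DSSClient("https://design-node.example.com", "YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_KEY")

# Create the v1 bundle on the Design node, including the additional
# content (datasets, folders, saved model) configured for this project
project.export_bundle("v1")

# Push the bundle to the Project Deployer
project.publish_bundle("v1")
```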

Create and manage deployments#

Until now, your experience with Dataiku may have been limited to the Design node. However, it’s good to know that as an end-to-end platform, Dataiku includes other nodes (Automation, API, Govern) for production-oriented tasks.

Explore the Deployer#

Before actually deploying the bundle to the Automation node, let’s take a look at what the Deployer offers (in particular for this tutorial, the Project Deployer).

There are two modes for installing the Deployer:

  • One is a local Deployer that sits on top of either the Design or Automation node and requires no further setup. Dataiku Cloud users employ this option.

  • The other is a standalone or remote Deployer for infrastructures with multiple Design and/or Automation nodes.

Regardless of which setup you have, the process for using the Project Deployer is the same.

  1. If you are using a remote Deployer, make sure you are connected to this instance. You’ll need credentials from your instance administrator.

  2. For either setup, from the bundle details page on the Design node, click Open in Deployer. If you’ve closed this dialog, just click Deployer where the publishing date is recorded.

Dataiku screenshot of the bundles page showing a bundle published to the Deployer.

Tip

You can also always navigate to the Deployer by choosing Local/Remote Deployer in the Applications menu from the top navigation bar.

Although we’ve now pushed a bundle to the Deployer, we don’t yet have an actual deployment. Before creating a new deployment, take a moment to explore the Deployer. If you are using the remote Deployer, note the change in the instance URL.

  1. From the Project Deployer, click on Deployer at the top left to see how this node has separate components for deploying projects, deploying API services, and monitoring.

  2. Click on Deploying Projects to view current deployments, projects that have published bundles, and available infrastructures.

Dataiku screenshot of a bundle on the Project Deployer.

Create a new deployment#

Thus far, we’ve published a bundle from the Design node to the Project Deployer. To create an active deployment, there’s a second step. We still need to push the bundle from the Project Deployer to an Automation node.

  1. If not already there, click Deployments in the top navigation bar to view all deployments on the instance.

  2. In the Bundles to deploy panel on the left of the Project Deployer, find the v1 bundle for this project, and click Deploy.

    Caution

    If the Deploy button is not clickable, it means there is no infrastructure ready for deployment. Please contact your instance administrator to create one.

  3. Choose a Target infrastructure. This will vary depending on the infrastructure available to your organization.

  4. Leave the default Deployment ID, which takes the form of <PROJECTKEY>-on-<infrastructure>.

  5. Click Create, and then Deploy and Activate.

Dataiku screenshot of the dialog for creating a new deployment.
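
For reference, the same deployment can be created programmatically through the Project Deployer API. A sketch with placeholder ids throughout:

```python
import dataikuapi

# Connect to the node hosting the Deployer (local or remote)
client = dataikuapi.DSSClient("https://deployer-node.example.com", "YOUR_API_KEY")
deployer = client.get_projectdeployer()

# Deployment id, project key, infrastructure id, and bundle id are placeholders
deployment = deployer.create_deployment(
    "PROJECTKEY-on-infrastructure", "PROJECTKEY", "your-infra-id", "v1"
)

# The equivalent of Deploy and Activate: push the bundle to the
# Automation node and wait for activation to finish
deployment.start_update().wait_for_result()
```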

Note

Just as when importing a project zip file, you may see warnings about missing plugins or plugin version mismatches. If any of these plugins are used in the project at hand, you’ll want to closely review them.

The same can be said for missing connections. See the article on preparing the Automation node for more details.

Manage deployment settings#

Your project is now running on the Automation node! You can click to open it from the panel on the left. Before doing so though, it is helpful to understand what deployment settings can be controlled from the Project Deployer itself.

Within the Project Deployer, we can monitor the status of each deployment, such as when it was created (and by whom), when it was last updated (and by whom), and its recent scenario runs.

Dataiku screenshot of the Status tab of a deployment on the Project Deployer.

Note

The reference documentation also covers how to modify deployment settings from the Project Deployer.

Remap connections#

In the Settings tab of a deployment, you can control how variables, connections, code environments, and scenarios are handled in the production environment.

Connection remapping, for example, is one setting that will commonly need to be configured. In many cases, organizations maintain different databases for development and production environments. If this is the case, you’ll need to remap the source connections used on the Design node to the target connections that should be used on the Automation node.

  1. Within the new deployment, navigate to the Settings tab.

  2. Navigate to the Connections panel on the left.

Dataiku screenshot of the Connections page of a deployment's settings.

Note

Assuming you are using the same database for development and production environments for this tutorial, there is nothing you need to do here.

Manage scenario auto-triggers#

To ensure scenarios never run unexpectedly, all scenarios in a new deployment are deactivated by default — regardless of their settings on the Design node.

  1. Remaining within the Settings tab of the deployment, navigate to the Scenarios panel.

    Here you can enable, disable, or even override the behavior defined on the Automation node — giving you the option of how you want to manage the scenarios for a deployed project.

  2. Leave the default setting in place (the leftmost option that does not override the behavior defined on the Automation node).

Dataiku screenshot of the Scenarios page of a deployment.
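
The same override can be set through the API. A tentative sketch, assuming the deployment’s raw settings expose a scenariosToActivate mapping (inspect get_raw() on your version to confirm) and that Build_Flow is the scenario’s id:

```python
import dataikuapi

client = dataikuapi.DSSClient("https://deployer-node.example.com", "YOUR_API_KEY")
deployment = client.get_projectdeployer().get_deployment("PROJECTKEY-on-infrastructure")

settings = deployment.get_settings()
# Assumption: scenario id -> boolean auto-trigger override; when a scenario
# is absent from this mapping, the behavior defined in the bundle is kept
settings.get_raw()["scenariosToActivate"]["Build_Flow"] = False
settings.save()
```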

View the Automation node project#

Finally, let’s check out the project on the Automation node.

  1. Ensure you are connected to a running Automation node.

  2. Navigate back to the Status tab of the deployment.

  3. Click to open the project on the Automation node.

Dataiku screenshot showing where to find the Automation node project.

Once on the Automation node, the project should look quite familiar. Confirm a few points:

  • The project homepage reports what bundle is running and when it was activated.

  • The scenario auto-triggers are turned off.

Rather than use some kind of trigger, let’s manually run a scenario to confirm it is working.

  1. While in the Automation node version of the project, open the Build Flow scenario (a dummy scenario that just builds the test_scored dataset).

  2. Click Run to manually trigger it.

Instead of checking on the Automation node, let’s monitor the run’s progress from the Project Deployer.

  1. Return to the Deployer.

  2. Open the same deployment. Note how the progress of the most recent scenario run is reported.

Dataiku screenshot of the Status tab of a deployment showing a scenario run.
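
Manual runs can also be triggered against the Automation node via the API, which is handy for smoke-testing a fresh deployment. A minimal sketch, again assuming Build_Flow is the scenario’s id:

```python
import dataikuapi

# Note: this client points at the Automation node, not the Design node
client = dataikuapi.DSSClient("https://automation-node.example.com", "YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_KEY")

scenario = project.get_scenario("Build_Flow")
# Blocks until the run finishes; raises an exception if the scenario fails
run = scenario.run_and_wait()
```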

Versioning a deployed project#

You have successfully created a project bundle in the Design node and, via the Deployer, published it to the Automation node!

Of course, this is not the end of the story. Data science is an iterative process. You’ll need to deploy updated versions of this project over time.

Important

When it is necessary to make a change to a deployed project, it’s critical to make all such changes in the development environment (the Design node), and then push an updated bundle to the production environment (the Automation node).

It may be tempting just to make a quick change to the project on the Automation node, but you should avoid this temptation, as the project in the production environment would no longer be synced with its counterpart in the development environment.

Consider a situation where something went wrong, and you want to revert back to an earlier version of the project. If you’ve made changes in the Automation node, these changes will be lost. Accordingly, actual development should always happen in the Design node, and new versions of bundles should be pushed from there.

Create a second bundle#

Let’s demonstrate the process for updating a deployment with a new bundle.

  1. Return to the original project on the Design node.

  2. Any change would do, but for this example, change the trigger of the Build Flow scenario to run at 3 AM instead of 2 AM.

  3. Save any changes.

  4. From the Bundles page, click + New Bundle.

  5. Name it v2.

  6. In the release notes, add adjusted scenario trigger timing.

  7. Click Create.

Dataiku screenshot of the second version of the project bundle.

Note

Note how, when creating the second bundle, the configuration of the previous one is inherited. In this case, the saved model, uploaded datasets, and managed folders are already included.

Deploy the new bundle#

The process for deploying the new bundle is the same as for the first one.

  1. Click on the newly created v2 bundle, and click Publish on Deployer.

  2. Confirm that you indeed want to Publish on Deployer.

  3. Click to Open in Deployer to view the bundle details on the Deployer.

  4. Once on the Deployer, click Deploy on the v2 bundle.

    Dataiku gives the option to create a new deployment or update the existing one.

  5. Since this is a new version of an existing deployment, make sure Update is selected, and click OK.

  6. Click OK again to confirm the deployment you want to edit.

Dataiku screenshot for updating a deployed bundle.

We’re not done yet!

  1. Navigate to the Status tab of the deployment, and note how Dataiku warns that the active bundle on the Automation node does not match the configured bundle.

  2. Click the green Update button to deploy the new bundle. Then Confirm.

  3. Navigate to the Deployments tab of the Project Deployer to see the new bundle as the currently deployed version of this project.

Dataiku screenshot of the Deployer showing a second version of the deployment.
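
This update pattern (point the deployment at a new bundle, then push the update) is also the core of programmatic redeployment. A sketch under the same placeholder assumptions as before:

```python
import dataikuapi

client = dataikuapi.DSSClient("https://deployer-node.example.com", "YOUR_API_KEY")
deployment = client.get_projectdeployer().get_deployment("PROJECTKEY-on-infrastructure")

# Point the deployment at the v2 bundle...
settings = deployment.get_settings()
settings.bundle_id = "v2"
settings.save()

# ...then push and activate it on the Automation node
deployment.start_update().wait_for_result()
```

Reverting, covered next, is the same call with settings.bundle_id = "v1".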

Revert to a previous bundle#

It’s also important to be able to revert to an earlier version, should a newer bundle not work as expected. Let’s demonstrate that now.

  1. From the Deployments tab of the Deployer, find the project in the left-hand panel.

  2. Click Deploy next to the v1 bundle.

  3. With Update selected, click OK, and confirm this is correct.

  4. Now on the Settings tab with v1 as the source bundle, click the green Update button, and Confirm the change.

If you return to the Status tab of this deployment, or open the project homepage on the Automation node, you’ll see that v1 is once again the active bundle running in production.

See also

See the reference documentation to learn more about reverting bundles.

Optional: Update a project deployment automatically#

Congratulations on putting your first project bundles into production! Under this batch scoring framework, our project will be able to run in a dedicated production environment.

As your MLOps setup becomes more sophisticated, you can rely on automation to do more. You can run scenarios that not only monitor model performance or data drift, but also retrain models based on this information.

It’s also possible to go one step further once a deployment is created. You can automatically create new bundles and update project deployments when certain conditions are met. Let’s try that next!

Start with a retrain model scenario#

Let’s start by duplicating a scenario that retrains the model if the data drift check fails. In other words, this scenario retrains the model when our chosen metric in the model evaluation store exceeds its specified threshold.

  1. Navigate to the Scenarios page from the top navigation bar.

  2. Check the box to the left of the Retrain Model scenario to open the Actions tab.

  3. Click Duplicate.

  4. Name it Retrain Model & Deploy.

  5. Click Duplicate.

Dataiku screenshot for duplicating a scenario.

Note

To learn more about the Retrain Model scenario, see Tutorial | Model monitoring basics.

Add a create bundle step#

In the current scenario, the step that retrains the model runs only if a previous step (in our case the MES check) fails. However, the step’s option to reset the failure state is ticked, so the scenario can continue with the steps that follow.

Let’s add a step that creates a bundle whenever the model is retrained.

  1. In the Retrain Model & Deploy scenario, navigate to the Steps tab.

  2. Click Add Step.

  3. Select Create bundle from the Deployer section.

  4. Name the step Create auto_deploy bundle.

  5. Provide the bundle id auto_deploy.

  6. Check the box to Make bundle id unique. Instead of v1, v2, etc., as we previously chose manually, our bundle ids will be “auto_deploy”, “auto_deploy1”, and so on.

  7. Provide the target variable bundleid.

  8. Check the box to Publish on Deployer.

  9. As the Target project, keep the current project, which is selected by default from the existing deployments.

Dataiku screenshot of the create bundle step.

Note

The help note at the top of this step indicates that the new bundle will include any additional data defined on the Bundles page. If you navigate to the Bundles page, you can click Configure Content to see what data will be included in the automatically created bundles.

Add an update project deployment step#

As we have seen in the process for batch deployment, once we have a bundle, we need to deploy it. There’s a scenario step for this too!

  1. In the Retrain Model & Deploy scenario, click Add Step.

  2. Select Update project deployment from the Deployer section.

  3. Name the step Update auto_deploy.

  4. Provide the Deployment id, which takes the form of <PROJECTKEY>-on-<infrastructure>. Click on the field or start typing to see available options.

  5. Provide the new bundle id as ${bundleid}. Be sure to use the variable syntax here since this references the target variable in the previous step.

  6. Click Save.

Dataiku screenshot of the update deployment step.
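
For the curious, here is a rough Python equivalent of these two visual steps, as it might appear in an Execute Python code scenario step on the Design node. The deployment id is a placeholder, and this sketch assumes a local Deployer reachable from the Design node:

```python
import time
import dataiku

# Inside a scenario, dataiku.api_client() authenticates as the running user
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

# Mimic the "Make bundle id unique" checkbox with a timestamp suffix
bundle_id = "auto_deploy_%d" % int(time.time())
project.export_bundle(bundle_id)
project.publish_bundle(bundle_id)

# Point the existing deployment at the new bundle and push the update
deployer = client.get_projectdeployer()
deployment = deployer.get_deployment("PROJECTKEY-on-infrastructure")  # placeholder id
settings = deployment.get_settings()
settings.bundle_id = bundle_id
settings.save()
deployment.start_update().wait_for_result()
```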

Run the scenario & observe the outcome#

Let’s imagine that some specified unit of time has passed, triggering the scenario to run.

  1. Click Run to manually trigger the Retrain Model & Deploy scenario on the Design node project.

  2. Switch to the Last Runs tab to observe its progress, including the two new steps.

    Dataiku screenshot of the last run tab having automatically deployed a new bundle.

    With no new data in this situation, we already know the check on data drift in the model evaluation store will fail, and so we can anticipate the outcome.

  3. Return to the Deployments page of the Project Deployer to confirm that auto_deploy is the new active bundle.

Dataiku screenshot of the Project Deployer showing the new bundle deployed.

You can also confirm that the project on both the Design and Automation nodes has a new active version of the saved model found in the Flow.

Tip

Run the scenario again to see how the bundle ID increments to auto_deploy1, and so on.

Plan for a more robust setup#

To be sure, this scenario is not ready for a live MLOps setup. It’s intended only to demonstrate how you can use Dataiku to achieve your MLOps goals.

In fact, this level of automation may only become necessary when deploying very large numbers of models in many projects. To do this successfully though, you need to have mastered the fundamentals — i.e. robust metrics and checks to know with certainty that the model you are redeploying is truly better than the existing one.

That being said, let’s discuss a few ways you could make this setup more robust to handle the challenges of live production.

Add more metrics & checks#

This scenario triggered the model rebuild based on the failure of one check based on a model evaluation store metric.

Depending on our Flow, it’s likely that we also want to create metrics and checks on other upstream objects, such as datasets or managed folders. If upstream checks fail, we can circumvent the model retraining cycle.

We might also want to implement metrics and checks on the saved model itself to determine whether it is better than a previous version.
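
As one illustration, this kind of upstream gating can be scripted: compute a dataset’s metrics and run its checks before letting retraining proceed. A sketch with a hypothetical dataset name; the exact shape of the returned results can vary by version:

```python
import dataikuapi

client = dataikuapi.DSSClient("https://design-node.example.com", "YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_KEY")

# Hypothetical upstream dataset to gate on
dataset = project.get_dataset("transactions_prepared")

dataset.compute_metrics()
results = dataset.run_checks()

# Inspect the check outcomes; abort retraining if any check reports an error
print(results)
```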

Keep a human in the loop#

Even after adding a sufficient level of metrics and checks, we might never want to deploy a bundle fully automatically. Our scenario might stop at creating the new bundle and alerting a team member with a reporter, leaving the job of updating the deployment to a human.

Add more stages of deployment infrastructure#

In this example, we had only one lifecycle stage of deployment infrastructure. However, in a real setup, it would be common to have multiple stages, such as the default “Dev”, “Test”, and “Prod”.

Our scenario might automatically update a deployment in the “Dev” stage, but require a human to push the deployment to the “Test” or “Prod” stages.

What’s next?#

Congratulations! In this tutorial, you:

  • Created a project bundle.

  • Deployed it to an Automation node via the Deployer.

  • Published and redeployed new bundle versions.

  • Created a scenario that can automatically update a batch deployment.

While this level of automation may not always be desirable (or advisable), it hints at what’s possible using only very simple building blocks.

Now that you have seen the batch deployment framework, move on to the methods for real-time API scoring.

See also

For more information, please refer to the reference documentation on MLOps or production deployments.