Concept | Batch deployment#

Watch the video

In this article, we’ll discuss the process of pushing a batch processing project to production: that is, going from a project on a Design node to a project bundle deployed on an Automation node with the help of the Project Deployer.

Slide depicting an overview of batch deployment in Dataiku.

Ready for production#

Let’s assume we have a batch processing project ready to be deployed into production.

First, recall what a ready project is. A project ready to be deployed into production has robust scenarios with metrics, checks, and reporters; optimized pipelines; and well-documented workflows, among other qualities.

Slide depicting what constitutes a project ready for production.

Note

For review, return to the Preparing for Production course.

Development vs. production environments#

In order to batch deploy a project into production, we need to transfer it from its development environment (a Design node) to a production environment (an Automation node).

Recall that a Design node is an experimental sandbox for developing data projects. Since you are creating new data pipelines and models there, you can expect jobs to fail occasionally.

Slide depicting a Design node as a development environment.

An Automation node, on the other hand, is a production environment. It is for operational jobs serving external consumers.

Recall that the Automation node also needs to be ready for production. Among other requirements, it should have the correct connections, such as the production version of databases, and the same plugins and code environments found in the project on the Design node.

Slide depicting an Automation node as a production environment.

Note

You can review this material in an article on Dataiku Architecture in the Production Concepts course.

The Project Deployer#

How do we transfer a project from a Design node to an Automation node?

We could manually download the project bundle from the Design node and upload it to the Automation node. However, this is not the preferred approach. Instead, we’ll use the Project Deployer, which makes the process easier and more centralized.

Some of the following details are interesting only to administrators, so we won’t cover them in great detail.

The Project Deployer is one component of the Deployer:

  • The Project Deployer is used to batch deploy project bundles to an Automation node (which is our focus here).

  • The API Deployer is used to deploy real-time API services to an API node (as will be done in the Real-time APIs course).

Slide depicting relationships between the Deployer and Design, Automation, and API nodes.

Note

If your architecture has a single Design or Automation node, the Project Deployer can be part of that Dataiku node itself, known as a local Deployer. In that case, no additional setup is required; it comes pre-configured.

If your architecture includes multiple Design and/or Automation nodes, a separate node can act as the centralized Deployer for all Design and Automation nodes. This is a standalone or remote Deployer.

You’ll find more in the reference documentation on Setting up the Deployer for self-managed instances.

Project bundles#

The component that we do need to concern ourselves with is the project bundle being deployed.

You’re probably familiar with creating an export of a Dataiku project from the homepage of a Design node. That kind of export can be imported into other Design nodes for further development.

Unlike an exported project, a project bundle is a versioned snapshot of the project’s configuration. This snapshot lets you replay, on an Automation node, the tasks that were performed on the Design node.

Slide depicting the contents of a project bundle.

Important

By project configuration, we mean things like project settings, notebooks, visual analyses, recipes, scenarios, shared project code, and the metadata from objects like datasets, saved models, and managed folders.

Once a project is ready to be deployed, creating a project bundle is simple. Just navigate to the Bundles page from the “More Options” menu, and click to create a new bundle.

Dataiku screenshot of where to create a new bundle.
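If you prefer to script this step, the same action is available through Dataiku’s public Python API client (dataikuapi). The sketch below is a minimal example, not a reference: the instance URL, API key, project key, and bundle ID are placeholder values you would replace with your own.

```python
import dataikuapi

# Connect to the Design node with a personal API key (placeholder values).
design_client = dataikuapi.DSSClient(
    "https://dss-design.example.com:11200", "DESIGN_NODE_API_KEY"
)
project = design_client.get_project("MY_PROJECT_KEY")

# Create (export) a new bundle capturing the project's current configuration.
project.export_bundle("v1")

# List the bundles that now exist on the Design node for this project.
print(project.list_exported_bundles())
```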

Additional bundle content#

You then have the option of adding additional content to the project bundle. By default, a project bundle does NOT include the actual data or the saved models deployed to the Flow. This is because, when the project is running on the Automation node, new production data will be running through the Flow.

Slide depicting what's included and not included in a project bundle.

Depending on your use case, however, you may want to add additional content from certain datasets, managed folders, or saved models. For example, you may need to include datasets used for enrichment or reference datasets that are not recomputed in production. Or, if you plan to score data with a model that was trained on the Design node, you need to add that model to the bundle.

Dataiku screenshot showing the bundle configuration page.
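These choices are normally made on the bundle configuration page shown above. If you want to inspect them programmatically, they are stored with the project’s settings; the key name used in the sketch below is an assumption and may differ between Dataiku versions, so treat this purely as an exploratory snippet.

```python
# Continuing from the previous sketch: read the project's raw settings.
settings = project.get_settings()
raw = settings.get_raw()

# The bundle export options (extra datasets, managed folders, saved models)
# live somewhere in the project settings; the exact key name is an assumption.
print(raw.get("bundleExporterSettings"))
```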

Deploying a bundle#

Once you’ve created the bundle, you can publish it to the Project Deployer. From the Deployer, you can manage all of the bundles from all of your projects in various stages of production. If your infrastructure is in place, deploying the bundle takes just a few more clicks.
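The same flow can be scripted with the Python API. The sketch below assumes a Deployer is already configured and that an Automation infrastructure named "prod-automation" exists; all identifiers are placeholders.

```python
# Publish the bundle from the Design node to the Project Deployer.
project.publish_bundle("v1")

# Connect to the Deployer (here assumed to be the local Deployer reachable
# through the Design node client) and create a deployment of that bundle.
deployer = design_client.get_projectdeployer()
deployment = deployer.create_deployment(
    "my-project-on-prod",  # deployment id
    "MY_PROJECT_KEY",      # published project key
    "prod-automation",     # target infrastructure (Automation node)
    "v1",                  # bundle id
)

# Push the deployment to the Automation node and wait for it to finish.
update = deployment.start_update()
update.wait_for_result()
```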

For any particular deployment, you can manage settings, such as remapping connections between the development and production environments.

Dataiku screenshot of where to manage deployment settings from the Deployer.
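Connection remapping can also be scripted. The nested key names below are an assumption about how the deployment’s raw settings are organized, so this is a sketch rather than a reference.

```python
# Continuing from the sketch above: fetch the deployment's settings.
deployment_settings = deployment.get_settings()

# Remap a development connection to its production equivalent.
# The exact location of the remapping list in the raw settings is an assumption.
remapping = (
    deployment_settings.get_raw()
    .setdefault("bundleContainerSettings", {})
    .setdefault("remapping", {})
    .setdefault("connections", [])
)
remapping.append({"source": "dev_postgres", "target": "prod_postgres"})

# Save the settings back to the Deployer.
deployment_settings.save()
```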

To verify your deployment is working, open your Automation node to see the project now running in a production environment!
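You can also confirm this from code by connecting directly to the Automation node (again with a placeholder URL and API key) and checking that the project key is now present there:

```python
# Connect to the Automation node and confirm the deployed project exists.
automation_client = dataikuapi.DSSClient(
    "https://dss-automation.example.com:12200", "AUTOMATION_NODE_API_KEY"
)
print("MY_PROJECT_KEY" in automation_client.list_project_keys())
```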

What’s next?#

Those are the basics of creating a project bundle on a Design node and transferring it to an Automation node. Follow the tutorials to gain experience doing this yourself!

Note

You can learn more about production deployments and bundles in the reference documentation.