Concept | Batch deployment#

In this article, we’ll discuss the process for pushing a batch processing project from a development to a production environment. In other words, how do we go from a project on a Design node to a project bundle deployed on an Automation node? We’ll achieve this with the help of the Project Deployer.

Slide depicting an overview of batch deployment in Dataiku.

Ready for production#

Let’s assume we have a batch processing project ready to be deployed into production. Among other qualities, a project ready to be deployed into production most often includes:

  • Robust scenarios with metrics, data quality rules and/or checks, and reporters

  • Optimized data pipelines

  • Well-documented workflows

Slide depicting what constitutes a project ready for production.

Development vs. production environments#

In order to batch deploy a project into production, we need to transfer it from its development environment (a Design node) to a production environment (in the case of a batch processing workflow, an Automation node).

Recall that a Design node is an experimental sandbox for developing data projects. Since you are creating new data pipelines there, you can expect jobs to fail occasionally.

Slide depicting a Design node as a development environment.

An Automation node, on the other hand, is a production environment. It is for operational jobs serving external consumers.

The Automation node also needs to be ready for production. Among other requirements, it should have the correct connections (such as the production versions of databases), as well as the same plugins and code environments used by the project on the Design node.

Slide depicting an Automation node as a production environment.
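One way to sanity-check the plugin and code environment requirement before deploying is to compare what the project uses against what the Automation node provides. The sketch below is plain Python rather than a Dataiku API call. The function name and the example plugin and code environment names are hypothetical, and in practice you would gather both inventories from your own instances.

```python
def find_missing(required, available):
    """Return, per category, the required names absent from the target node."""
    missing = {}
    for category, names in required.items():
        gap = sorted(set(names) - set(available.get(category, [])))
        if gap:
            missing[category] = gap
    return missing


# Hypothetical inventory of what the project uses on the Design node ...
required = {
    "plugins": ["geo-tools"],
    "code_envs": ["py39_forecast"],
}
# ... and of what the Automation node currently provides.
available = {
    "plugins": [],
    "code_envs": ["py39_forecast"],
}

missing = find_missing(required, available)
print(missing)
```

An empty result would suggest the node has everything the project needs in these two categories. Connections are left out on purpose, since they are typically remapped rather than matched by name.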

The Project Deployer#

How do we transfer a project from a Design node to an Automation node?

We could manually download the project bundle from the Design node and upload it to the Automation node. However, this is not the preferred way. We’ll use the Project Deployer to make this process easier and more centralized.
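For reference, the manual route described above can also be scripted. The following is a minimal sketch that assumes the `dataikuapi` Python client and its project methods `export_bundle`, `download_exported_bundle_archive_to_file`, `import_bundle_from_stream`, and `activate_bundle`. Check these names against the API reference for your Dataiku version. The helper function itself, the URLs, the API keys, and the project key are hypothetical.

```python
def transfer_bundle_manually(design_project, automation_project, bundle_id, archive_path):
    """Export a bundle on the Design node, then import and activate it on the Automation node."""
    # Design node: snapshot the project as a new bundle and save the archive locally.
    design_project.export_bundle(bundle_id)
    design_project.download_exported_bundle_archive_to_file(bundle_id, archive_path)

    # Automation node: upload the archive, then make this bundle the active version.
    with open(archive_path, "rb") as archive:
        automation_project.import_bundle_from_stream(archive)
    automation_project.activate_bundle(bundle_id)


# Example wiring (hypothetical URLs, API keys, and project key):
# import dataikuapi
# design = dataikuapi.DSSClient("https://design.example.com", "DESIGN_API_KEY")
# automation = dataikuapi.DSSClient("https://automation.example.com", "AUTOMATION_API_KEY")
# transfer_bundle_manually(
#     design.get_project("MYPROJECT"),
#     automation.get_project("MYPROJECT"),
#     "v1",
#     "/tmp/MYPROJECT-v1.zip",
# )
```

This sketch assumes the project already exists on the Automation node. Every step here has to be repeated and tracked by hand for each node, which is the overhead the Project Deployer removes.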

Some of the following details are of interest only to administrators, so we won’t cover them in great detail.

The Project Deployer is one of two components of the Deployer:

  • The Project Deployer is used to deploy project bundles to an Automation node for batch processing workloads (which is our focus here).

  • The API Deployer is used to deploy real-time API services to an API node (as will be done in the API Deployment course).

Slide depicting relationships between the Deployer and Design, Automation, and API nodes.

Note

If your architecture has a single Design or Automation node, the Project Deployer can be part of that Dataiku node itself. This is a local Deployer. It comes pre-configured, so no additional setup is required.

If your architecture includes multiple Design and/or Automation nodes, a separate node can act as the centralized Deployer for all Design and Automation nodes. This is a standalone or remote Deployer.

You’ll find more in the reference documentation on Setting up the Deployer for self-managed instances.

Project bundles#

You’re probably familiar with creating an export of a Dataiku project from the homepage of a Design node. This kind of export can be imported to other Design nodes for further development.

Unlike an exported project, a project bundle is a versioned snapshot of the project’s configuration. Once deployed to an Automation node, this snapshot can replay there the tasks that were performed on the Design node.

Important

By project configuration, we mean items like project settings, notebooks, visual analyses, recipes, scenarios, shared project code, and the metadata from objects like datasets, saved models, and managed folders.

Once a project is ready to be deployed, creating a project bundle is simple. Just navigate to the Bundles page from the More Options menu, and click to create a new bundle.

Dataiku screenshot of where to create a new bundle.

Additional bundle content#

You then have the option of adding additional content to the project bundle. By default, a project bundle does NOT include the actual data or any saved models that may be deployed to the Flow. This is because, when the project runs on the Automation node, new production data will be running through the Flow.

Slide depicting what's included and not included in a project bundle.

Depending on your use case, however, you may want to add additional objects such as certain datasets, managed folders, saved models, etc. For example:

  • You may need to include datasets for enrichment or reference datasets not recomputed in production.

  • If you plan to score data with a model that has been trained in the Design node, you need to include the model in the bundle.

Dataiku screenshot showing the bundle configuration page.
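The split between what always travels in a bundle and what travels only when selected can be pictured with a small conceptual model. This is plain Python for illustration only, not Dataiku's API or its internal bundle format. The class, its fields, and the object names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BundlePlan:
    """Conceptual model: metadata always ships, content ships only when selected."""

    flow_objects: dict  # object name -> kind, e.g. "dataset" or "saved_model"
    additional_content: set = field(default_factory=set)

    def ships_with_content(self):
        """Objects whose data or model files are explicitly added to the bundle."""
        return sorted(n for n in self.flow_objects if n in self.additional_content)

    def ships_metadata_only(self):
        """Objects represented in the bundle by their metadata alone."""
        return sorted(n for n in self.flow_objects if n not in self.additional_content)


# Hypothetical Flow: a reference dataset and a trained model are added explicitly.
plan = BundlePlan(
    flow_objects={
        "transactions": "dataset",
        "country_codes": "dataset",
        "fraud_model": "saved_model",
    },
    additional_content={"country_codes", "fraud_model"},
)

print(plan.ships_with_content())
print(plan.ships_metadata_only())
```

Here the `transactions` dataset is expected to be rebuilt from production data, so only its metadata travels, while the reference dataset and the trained model are carried along.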

Deploying a bundle#

Once you’ve created the bundle, you can publish it on the Project Deployer. From the Deployer, you can manage all of the bundles from all of the projects on the instance, whatever their stage of production. If your infrastructure is in place, deploying the bundle takes only a few more clicks.
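The same publish-and-deploy steps can be scripted. The sketch below assumes the `dataikuapi` Python client and its `publish_bundle`, `get_projectdeployer`, `create_deployment`, `start_update`, and `wait_for_result` methods. Verify these against the API reference for your version. The helper function and every identifier in the usage comment are hypothetical.

```python
def publish_and_deploy(client, project_key, bundle_id, infra_id, deployment_id):
    """Publish an existing bundle to the Project Deployer and deploy it to an infrastructure."""
    project = client.get_project(project_key)
    # Push the bundle from the Design node to the Project Deployer.
    project.publish_bundle(bundle_id)

    # Create a deployment of that bundle on the chosen infrastructure ...
    deployer = client.get_projectdeployer()
    deployment = deployer.create_deployment(deployment_id, project_key, infra_id, bundle_id)

    # ... then push it to the Automation node and wait for the update to finish.
    update = deployment.start_update()
    update.wait_for_result()
    return deployment


# Example wiring (hypothetical URL, API key, and identifiers):
# import dataikuapi
# client = dataikuapi.DSSClient("https://design.example.com", "DESIGN_API_KEY")
# publish_and_deploy(client, "MYPROJECT", "v1", "prod-infra", "myproject-on-prod")
```

This sketch assumes a local Deployer, so a single client handles both steps. With a remote Deployer, the deployment steps would likely need a client connected to the Deployer node instead.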

For any particular deployment, you can manage settings, such as remapping connections between the development and production environments.

Dataiku screenshot of where to manage deployment settings from the Deployer.
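Conceptually, connection remapping is a lookup from development connection names to their production counterparts, with unmapped connections left unchanged. The following plain-Python sketch illustrates that idea only. It does not use Dataiku's API, and the function, dataset, and connection names are hypothetical.

```python
def remap_connections(dataset_connections, remapping):
    """Return a copy of the dataset-to-connection map with remapped connection names."""
    return {
        dataset: remapping.get(connection, connection)
        for dataset, connection in dataset_connections.items()
    }


# Hypothetical connections used by datasets on the Design node.
design_connections = {
    "orders": "postgres_dev",
    "customers": "postgres_dev",
    "exports": "s3_shared",
}
# Hypothetical remapping applied for the production deployment.
remapping = {"postgres_dev": "postgres_prod"}

production_connections = remap_connections(design_connections, remapping)
print(production_connections)
```

In this example, both datasets on `postgres_dev` would point to `postgres_prod` after deployment, while `exports` keeps its shared connection.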

To verify your deployment is working, open your Automation node to see the project now running in a production environment!

What’s next?#

Those are the basics of creating a project bundle on a Design node and transferring it to an Automation node. See Tutorial | Batch deployment basics to gain experience doing this yourself!

Note

You can learn more about production deployments and bundles in the reference documentation.