Concept: Batch Deployment
In this article, we’ll discuss how to push a batch-processing project to production: that is, how to go from a project on a Design node to a project bundle deployed on an Automation node, with the help of the Project Deployer.
This content is also included in a free Dataiku Academy course on Projects in Production, which is part of the MLOps Practitioner learning path. Register for the course there if you’d like to track and validate your progress alongside concept videos, summaries, hands-on tutorials, and quizzes.
Ready for Production
Let’s assume we have a batch-processing project ready to be deployed into production.
First, recall what a “ready” project is. A project “ready” to be deployed into production has robust scenarios with metrics, checks, and reporters; optimized pipelines; and well-documented workflows, among other qualities.
For review, return to the Preparing for Production course.
Development vs. Production Environments
In order to deploy a project into production, we need to transfer it from its development environment (a Design node) to a production environment (an Automation node).
Recall that a Design node is an experimental sandbox for developing data projects. Since you are creating new data pipelines and models there, you can expect jobs to fail occasionally.
An Automation node, on the other hand, is a production environment. It is for operational jobs serving external consumers.
Recall that the Automation node also needs to be “ready” for production. Among other requirements, it should have the correct connections, such as the production version of databases, and the same plugins and code environments found in the project on the Design node.
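The readiness requirement above can be pictured as a simple comparison between what the project needs and what the Automation node provides. The following is only a minimal sketch in plain Python: `NodeInventory`, `missing_on_automation`, and all of the names in the example data are hypothetical and are not part of any Dataiku API. It also assumes items are matched by name, which may not hold if connections are remapped at deployment time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeInventory:
    """Hypothetical summary of names a project needs or a node provides."""
    connections: frozenset = field(default_factory=frozenset)
    plugins: frozenset = field(default_factory=frozenset)
    code_envs: frozenset = field(default_factory=frozenset)


def missing_on_automation(required: NodeInventory, automation: NodeInventory) -> dict:
    """Return, per category, the names the project needs but the node lacks."""
    return {
        "connections": sorted(required.connections - automation.connections),
        "plugins": sorted(required.plugins - automation.plugins),
        "code_envs": sorted(required.code_envs - automation.code_envs),
    }


# Illustrative names only (assumptions, not taken from the article).
required = NodeInventory(
    connections=frozenset({"warehouse"}),
    plugins=frozenset({"geo-tools"}),
    code_envs=frozenset({"py39_scoring"}),
)
automation = NodeInventory(
    connections=frozenset({"warehouse"}),
    plugins=frozenset(),
    code_envs=frozenset({"py39_scoring"}),
)

gaps = missing_on_automation(required, automation)
# The node is considered ready only when every category has no gaps.
is_ready = not any(gaps.values())
```

Here the Automation node would not yet be ready, because the hypothetical `geo-tools` plugin is missing from it.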
The Project Deployer
How do we transfer a project from a Design node to an Automation node?
We could manually download the project bundle from the Design node and upload it to the Automation node. However, this is not the preferred way. We’ll use the Project Deployer to make this process even easier.
Some of the following details are of interest only to administrators, so we won’t cover them in great detail.
The Project Deployer is one of two components of the Deployer:
The Project Deployer is used to batch deploy project bundles to an Automation node (which is our focus here).
The API Deployer is used to deploy real-time API services to an API node (as will be done in the Real-time APIs course).
If your architecture has a single Design or Automation node, the Project Deployer can be part of this DSS node itself—a local Deployer. In that case, no additional setup is required. It comes pre-configured.
If your infrastructure includes multiple Design and/or Automation nodes, a separate node can act as the centralized Deployer for all Design and Automation nodes. This is a standalone or remote Deployer.
Your instance administrator will need to follow the product documentation to set up the Deployer and the infrastructure that enables these nodes to “talk” to each other. Since our focus here is on deploying, we’ll assume this task is already complete.
The component that we do need to concern ourselves with is the project bundle being deployed.
You’re probably familiar with creating an export of a Dataiku project from the homepage of a Design node. This kind of export can be imported to other Design nodes for further development.
Unlike an exported project, a project bundle is a versioned snapshot of the project’s configuration. On an Automation node, this snapshot can replay the tasks that were performed on the Design node.
By project configuration, we mean things like project settings, notebooks, visual analyses, recipes, scenarios, shared project code, and the metadata from objects like datasets, saved models, and managed folders.
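To make the idea of a "versioned snapshot" concrete, here is a small conceptual model in plain Python. It is a sketch under stated assumptions, not how Dataiku stores bundles: `ProjectBundle`, `create_bundle`, and the configuration keys are hypothetical. The point it illustrates is that each bundle captures the configuration at the moment it is created, so later edits to the project do not alter earlier bundles.

```python
import copy
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ProjectBundle:
    """Hypothetical model of a bundle: a labeled copy of project configuration."""
    project_key: str
    bundle_id: str                   # version label, e.g. "v1"
    configuration: MappingProxyType  # read-only top-level view of the copy


def create_bundle(project_key, bundle_id, live_configuration):
    """Deep-copy the live configuration so later edits cannot leak into the bundle."""
    frozen = MappingProxyType(copy.deepcopy(live_configuration))
    return ProjectBundle(project_key, bundle_id, frozen)


# Illustrative configuration; the keys and names are assumptions.
live = {"recipes": ["prepare_orders"], "scenarios": ["nightly_build"]}
v1 = create_bundle("ORDERS", "v1", live)

# Development continues on the Design node after v1 was created.
live["recipes"].append("score_orders")
v2 = create_bundle("ORDERS", "v2", live)
```

In this model, `v1` still lists only the first recipe, while `v2` reflects the later state of the project.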
Once a project is ready to be deployed, creating a project bundle is simple. Just navigate to the Bundles page from the “More Options” menu, and click to create a new bundle.
Additional Bundle Content
You can then add more content to the project bundle. By default, a project bundle includes neither the actual data nor the saved models deployed to the Flow. This is because, once the project is running on the Automation node, new production data will run through the Flow.
Depending on your use case, however, you may want to add additional content from certain datasets, managed folders, or saved models. For example, you may need to include datasets for enrichment or reference datasets not recomputed in production. Or, if you plan to score data with a model that has been trained in the Design node, you need to add the model to the bundle.
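One way to reason about which items deserve to be added is to ask, for each one, whether production will rebuild it. The sketch below encodes that rule of thumb in plain Python. It is an illustration only: `FlowItem`, `select_additional_content`, the `rebuilt_in_production` flag, and the example item names are all hypothetical, and a real decision may depend on factors this simple rule ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowItem:
    """Hypothetical description of one item in the Flow."""
    name: str
    kind: str                    # "dataset", "managed_folder", or "saved_model"
    rebuilt_in_production: bool  # True if production recomputes or retrains it


def select_additional_content(items):
    """Group, by kind, the items that production will not rebuild on its own."""
    selected = {"dataset": [], "managed_folder": [], "saved_model": []}
    for item in items:
        if not item.rebuilt_in_production:
            selected[item.kind].append(item.name)
    return selected


# Illustrative Flow; all names are assumptions.
flow = [
    FlowItem("orders_raw", "dataset", rebuilt_in_production=True),
    FlowItem("country_codes", "dataset", rebuilt_in_production=False),
    FlowItem("churn_model", "saved_model", rebuilt_in_production=False),
]
extra = select_additional_content(flow)
```

With this example, the reference dataset and the model trained on the Design node would be candidates for inclusion, while the dataset fed by fresh production data would not.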
Deploying a Bundle
After creating the bundle, you can publish it on the Project Deployer. From the Deployer, you can manage all of the bundles from all of your projects in their various stages of production. If your infrastructure is in place, deploying the bundle takes only a few more clicks.
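The same create, publish, and deploy steps can also be scripted rather than clicked. The sketch below assumes the method names of the public `dataikuapi` Python client as the editor understands them (`get_project`, `export_bundle`, `publish_bundle`, `get_projectdeployer`, `create_deployment`, `start_update`, `wait_for_result`); please verify them against the API reference for your installed version before relying on this. The function name `publish_and_deploy` and every identifier passed to it are hypothetical.

```python
def publish_and_deploy(design_client, deployer_client, project_key,
                       bundle_id, infra_id, deployment_id):
    """Create a bundle, publish it to the Project Deployer, and deploy it.

    Both clients are expected to behave like dataikuapi.DSSClient objects
    (an assumption). With a local Deployer, the same client can be passed
    for both arguments; with a remote Deployer, deployer_client should be
    connected to the Deployer node.
    """
    # 1. Create the bundle on the Design node and publish it to the Deployer.
    project = design_client.get_project(project_key)
    project.export_bundle(bundle_id)
    project.publish_bundle(bundle_id)

    # 2. Create a deployment of that bundle on a target infrastructure.
    deployer = deployer_client.get_projectdeployer()
    deployment = deployer.create_deployment(
        deployment_id, project_key, infra_id, bundle_id
    )

    # 3. Push the bundle to the Automation node and wait for completion.
    return deployment.start_update().wait_for_result()


# Example wiring (not executed here; host and key are placeholders):
#   import dataikuapi
#   client = dataikuapi.DSSClient("https://design.example.com", "API_KEY")
#   publish_and_deploy(client, client, "ORDERS", "v1", "prod", "orders-on-prod")
```

Passing the clients in as arguments keeps the function easy to dry-run against a stub before pointing it at real nodes.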
For any particular deployment, you can manage settings, such as remapping connections between the development and production environments.
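Connection remapping can be thought of as a lookup table applied to each dataset's connection when the bundle lands on the Automation node. The following plain-Python sketch illustrates that idea only; `remap_connections` and the connection names are hypothetical, and the actual remapping is configured in the deployment settings rather than in code like this.

```python
def remap_connections(dataset_connections, remapping):
    """Return a new dataset-to-connection map with development names replaced.

    Connections that have no entry in `remapping` are kept unchanged.
    """
    return {
        dataset: remapping.get(connection, connection)
        for dataset, connection in dataset_connections.items()
    }


# Illustrative names only (assumptions, not taken from the article).
design = {
    "orders": "pg_dev",
    "customers": "pg_dev",
    "exports": "filesystem_managed",
}
remapping = {"pg_dev": "pg_prod"}

production = remap_connections(design, remapping)
```

In this example, both datasets on the development database would point at the production database after deployment, while the dataset on the unmapped connection keeps its original connection name.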
To verify your deployment is working, open your Automation node to see the project now running in a production environment!