Quick Start | Dataiku for MLOps#
Get started#
Recent advancements in generative AI have made it easy to apply for jobs. But be careful! Scammers have also been known to create fake job postings in the hopes of stealing personal information. Let’s see if you, with Dataiku’s help, can tell a real job posting from a fake one!
Objectives#
In this quick start, you’ll:
Create an API endpoint from a prediction model.
Deploy a version of the API endpoint to a production environment.
Automate the building of a data pipeline.
Tip
To check your work, you can review a completed version of this entire project from data preparation through MLOps on the Dataiku gallery.
Create an account#
To follow along with the steps in this tutorial, you need access to a 12.6+ Dataiku instance. If you do not already have access, you can get started in one of two ways:
Start a 14-day free trial. See How-to | Begin a free trial from Dataiku for help if needed.
The locally-installed free edition is not fully compatible.
Open Dataiku#
The first step is getting to the homepage of your Dataiku Design node.
Go to the Launchpad.
Click Open Instance in the Design node tile of the Overview panel once your instance has powered up.
Important
If using a self-managed version of Dataiku, open the Dataiku Design node directly in your browser.
Once you are on the Design node homepage, you can create the tutorial project.
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select MLOps Quick Start.
Click Install.
From the project homepage, click Go to Flow (or g + f).
From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by Quick Starts.
Select MLOps Quick Start.
From the project homepage, click Go to Flow (or g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Review the Flow#
One of the first concepts a user needs to understand about Dataiku is the Flow. The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.
See the Flow’s visual grammar#
Dataiku has its own visual grammar to organize AI, machine learning, and analytics projects in a collaborative way.
Shape | Item | Icon |
---|---|---|
Square | Dataset | The icon on the square represents the dataset’s storage location, such as Amazon S3, Snowflake, PostgreSQL, etc. |
Circle | Recipe | The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe. |
Diamond | Model | The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc. |
Tip
In addition to shape, color has meaning too.
Datasets are blue. Those shared from other projects are black.
Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.
Machine learning elements are green.
Take a look now!
If not already there, from the left-most menu in the top navigation bar, select the Flow (or use the keyboard shortcut g + f).
Important
This project begins in the Data Preparation Flow zone from a labeled dataset named job_postings composed of 95% real and 5% fake job postings. The pipeline builds a prediction model capable of classifying a job posting as real or fake. Your job will be to deploy the model found in the Machine Learning Flow zone as a real-time API endpoint.
Take a moment to review the objects in the Flow. Gain a high-level understanding of how the recipes first prepare, join, and split the data, then train a model, and finally use it to score new data.
Tip
There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu or see the Accessibility page in the reference documentation.
Build the Flow#
Unlike the initial uploaded datasets, the downstream datasets appear as outlines. This is because they have not been built, meaning that the relevant recipes have not been run to populate these datasets. However, this is not a problem because the Flow contains the recipes required to create these outputs at any time.
Click to open the Flow Actions menu in the bottom right.
Click Build all.
Leaving the default options, click Build to run the recipes necessary to create the items furthest downstream.
When the job completes, refresh the page to see the built Flow.
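To make the idea of building the items furthest downstream more concrete, here is a minimal Python sketch that treats a Flow as a dependency graph and derives one order in which its items could be built. It is only an illustration, not how Dataiku schedules jobs. The names job_postings, job_postings_prepared, and test_scored come from this project; the intermediate names (train, test, model) are hypothetical stand-ins.

```python
from graphlib import TopologicalSorter

# Map each item to the upstream items it depends on.
# Names other than job_postings, job_postings_prepared, and test_scored
# are hypothetical placeholders used only for this illustration.
flow = {
    "job_postings_prepared": {"job_postings"},
    "train": {"job_postings_prepared"},
    "test": {"job_postings_prepared"},
    "model": {"train"},
    "test_scored": {"test", "model"},
}


def build_order(dependencies):
    """Return one valid order in which every item comes after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(flow)
print(order)
```

Whatever order is returned, an upstream item such as job_postings always appears before the items computed from it.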
Inspect the saved model#
Let’s take a closer look at the model found in the Flow.
From the Flow, double click to open the diamond-shaped Predict fraudulent (binary) model in the Machine Learning Flow zone.
Note that the model has only one version, and so this version is also the active version. As you retrain the model and deploy new versions, the history of model versions is tracked — making it easy to roll back between versions.
Click on the model version name Random forest (s2) - v1 to see the full report, including visualizations of its explainability and performance.
Return to the Flow (g + f) when finished inspecting the model.
Tip
In this case, the saved model in the Flow was built with Dataiku’s visual AutoML. However, it’s also possible to import models packaged with MLflow as saved models into Dataiku. See this blog on importing MLflow saved models to learn more.
See also
To learn more about creating the model, see the Machine Learning Quick Start.
Create an API endpoint#
Dataiku’s architecture for MLOps supports both batch and real-time API frameworks. In this case, let’s implement a real-time API strategy to individually score a new job posting as real or fake.
Create an API service including a prediction endpoint#
The first step is packaging the saved model in the Flow as a prediction endpoint within an API service.
From the Flow, click on the saved model Predict fraudulent (binary) once to select it.
Click to open the Actions tab in the right sidebar.
Select Create API.
Name the service ID job_postings.
Name the endpoint ID predict_fake_job.
Click Append.
Note
This path was a shortcut to the API Designer found in the top navigation bar’s More Options menu.
Add test queries#
Before deploying, let’s add some test queries to the API endpoint to make sure that it is working correctly.
For the predict_fake_job endpoint, navigate to the Test queries panel.
Click + Add Queries.
Add 5 queries.
Choose to add them from the test dataset.
Click Add.
Click Run Test Queries.
Examine some of the test queries, including the features that were sent to the endpoint, the prediction returned, and additional details.
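As a rough sketch of what adding queries from a dataset amounts to, the Python below takes the first five rows of a small in-memory table, leaves out the target column, and wraps each row in a `{"features": {...}}` object. The rows and feature names are made up for illustration, and the target name fraudulent is inferred from the model name; the real queries use the columns the model was trained on.

```python
import json

TARGET = "fraudulent"  # assumed target column, inferred from the model name

# Hypothetical rows standing in for the test dataset.
test_rows = [
    {"title": f"Job {i}", "has_company_logo": i % 2, TARGET: 0}
    for i in range(8)
]


def make_test_queries(rows, n, target=TARGET):
    """Wrap the first n rows as queries, leaving out the target column."""
    queries = []
    for row in rows[:n]:
        features = {key: value for key, value in row.items() if key != target}
        queries.append({"features": features})
    return queries


queries = make_test_queries(test_rows, 5)
print(json.dumps(queries[0], indent=2))
```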
Deploy an API endpoint#
Although you have created a version of an API service including the endpoint, it exists only on the Design node, which is a development environment. A production use case requires separate environments for development and production. For example:
A batch deployment use case would require enabling an Automation node.
A real-time API deployment use case (as shown here) could use an API node if staying within Dataiku’s ecosystem. Additionally, depending on your MLOps strategy, you also have external deployment options such as Amazon SageMaker, Azure ML, Google Vertex AI, or Snowflake.
At a high level, you can think of the entire process in three steps:
Create the API service on the Design node (already done!).
Publish the API service on the Design node to the API Deployer.
Deploy the API service on the API Deployer to a production environment (normally an API node).
Note
Many organizations incorporate an additional governance framework throughout this process. They utilize a Govern node to manage the deployment of projects and models with a built-in sign-off process. Learn more in the Academy course on Dataiku Govern.
Configure an API node#
Before deploying, you first need to configure a production environment. In this example, we’ll use an API node.
Free trial users (or any Dataiku Cloud users) need to activate the API node extension from their Launchpad.
Users on self-managed instances need to follow API Node & API Deployer: Real-time APIs in the reference documentation.
From the Design node to the Deployer#
Once you have the necessary infrastructure in place, it’s a few more clicks to actually deploy the endpoint.
From the job_postings API service on the Design node, click Publish on Deployer.
Click Publish, accepting the default version ID.
From the Deployer to an API node#
You have now pushed the API service from the Design node to the API Deployer, so let’s navigate there.
Immediately after publishing, you can click the popup notification to Open API Deployer.
If you miss it, open the waffle menu in the top right.
Choose Local/Remote Deployer.
Then click Deploying API Services.
Now that you have published the API service to the API Deployer, there is one more step to deploy the service to an API node.
On the API Deployer, find your API service.
Click Deploy.
If not already chosen for you, select an available infrastructure.
Click Deploy again.
Click Deploy once more to confirm.
You now have an API endpoint running in a production environment!
Send test queries to the API node#
Once again, let’s test the endpoint with a few more queries — this time sending them to an API node.
From the Status tab of the predict_fake_job endpoint on the API Deployer, navigate to the Run and test subtab.
Click Run All.
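Outside the Deployer interface, a client application would typically reach the deployed endpoint over HTTP. The sketch below only prepares such a request with Python's standard library and does not send it. It assumes the API node exposes prediction endpoints under `/public/api/v1/<service>/<endpoint>/predict`; the host name and feature values are hypothetical, so check the path and any API key requirement against your own deployment.

```python
import json
from urllib.request import Request


def build_predict_request(base_url, service_id, endpoint_id, features):
    """Prepare (but do not send) a POST request for a single prediction."""
    # Assumed URL layout; verify it against your API node's documentation.
    url = f"{base_url.rstrip('/')}/public/api/v1/{service_id}/{endpoint_id}/predict"
    body = json.dumps({"features": features}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_predict_request(
    "https://api-node.example.com",  # hypothetical API node address
    "job_postings",                  # service ID from this tutorial
    "predict_fake_job",              # endpoint ID from this tutorial
    {"title": "Data Analyst", "has_company_logo": 1},  # hypothetical features
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it to a reachable API node.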
See also
Once you’ve deployed an API service, the next step would be to monitor it using an Evaluate recipe and a model evaluation store. You’ll learn about these tools in the MLOps Practitioner learning path!
Automate the Flow#
Once you’ve mastered the basics, you can start automating your MLOps processes with Dataiku’s system of scenarios. A scenario in Dataiku is a set of actions to run, along with conditions for when they should execute and who should be notified of the results.
Since the same tools can be used to retrain models or redeploy new versions of API services, let’s start small by designing a scenario that rebuilds the furthest downstream dataset only if an upstream dataset satisfies certain conditions.
See also
These automation tools can be implemented visually, with code, or a mixture of both. To get started using code in your MLOps workflows, see the Developer Guide.
View the existing scenario#
This project already has a basic one-step scenario for rebuilding the data pipeline.
Navigate back to the Design node project.
From the Jobs menu in the top navigation bar, open the Scenarios page.
Click to open the Score Data scenario.
On the Settings tab, note that the scenario already has a weekly trigger, but does not yet have a reporter.
Navigate to the Steps tab.
Click on the Build step to see that this scenario will build the test_scored dataset (and its upstream dependencies, if required) whenever the scenario is triggered.
Recognize that this step will only run if no previous step in the scenario has failed.
Tip
You’ll learn about build modes in the Data Pipelines course of the Advanced Designer learning path.
Select a data quality rule type#
As of now, on a weekly basis, this scenario will attempt to build the test_scored dataset if its upstream dependencies have changed.
In addition to offering many options for when a scenario should execute (e.g. time periods, dataset changes, or code), Dataiku also provides tools to control how a scenario executes. For example, you may want to interrupt (or proceed with) a scenario’s execution if a condition is met (or not met).
Let’s demonstrate this principle by adding a data quality rule to an upstream dataset of interest.
In the Data Preparation Flow zone, open the job_postings_prepared dataset.
Navigate to the Data Quality tab.
Click Edit Rules.
Select the rule type Record count in range.
Configure a data quality rule#
Now let’s configure the details of this rule assuming you have expectations on the number of records at the start of the pipeline.
Set the min as 100 and the soft min as 300.
Set the soft max as 20000 and the max as 25000. Make sure all are turned ON.
Click Run Test, and confirm that the record count is indeed within the expected range.
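The logic behind a range rule with hard and soft bounds can be sketched in a few lines of Python. This is an illustration of the idea, not Dataiku's implementation: it assumes that a count outside the hard bounds is an error, a count between a hard and a soft bound is a warning, and that the bounds themselves count as in range. The function name and status labels are hypothetical.

```python
def check_record_count(count, minimum=100, soft_min=300,
                       soft_max=20000, maximum=25000):
    """Classify a record count against hard and soft bounds."""
    if count < minimum or count > maximum:
        return "ERROR"    # outside the hard bounds
    if count < soft_min or count > soft_max:
        return "WARNING"  # inside the hard bounds, outside the soft ones
    return "OK"


for n in (50, 200, 5000, 22000, 30000):
    print(n, check_record_count(n))
```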
Tip
Feel free to adjust these values to simulate warnings or errors on your own!
Verify a data quality rule in a scenario#
If this rule were to fail (that is, the number of upstream records falls outside the expected range), you could avoid computing the rest of the pipeline and send a notification about the unexpected result.
Let’s have the scenario verify this rule is met before building the pipeline.
From the Jobs menu in the top navigation bar, return to the Scenarios page, and click to open the Score Data scenario.
Navigate to the Steps tab.
Click Add Step to view the available steps, and choose Verify rules or run checks.
Click + Add Item > Dataset > job_postings_prepared > Add Item.
Using the dots on the left side of the step, drag the verification step above the build step.
Click the green Run button to manually trigger the scenario’s execution.
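The effect of placing the verification step before the build step can be sketched as a small Python model, where each step runs only if no earlier step has failed. This is a simplified illustration rather than Dataiku's scenario engine; the function names, status labels, and record counts are hypothetical.

```python
def run_scenario(steps):
    """Run (name, action) steps in order; skip the rest after a failure."""
    outcomes = []
    failed = False
    for name, action in steps:
        if failed:
            outcomes.append((name, "SKIPPED"))
        elif action():
            outcomes.append((name, "SUCCESS"))
        else:
            outcomes.append((name, "FAILED"))
            failed = True
    return outcomes


def make_verify_step(record_count, minimum=100, maximum=25000):
    """Stand-in for the verification step: pass if the count is in range."""
    return lambda: minimum <= record_count <= maximum


def build_step():
    """Stand-in for building test_scored; always succeeds in this sketch."""
    return True


healthy = run_scenario([
    ("verify rules", make_verify_step(17000)),  # hypothetical record count
    ("build test_scored", build_step),
])
broken = run_scenario([
    ("verify rules", make_verify_step(50)),     # below the hard minimum
    ("build test_scored", build_step),
])
print(healthy)
print(broken)
```

In the second run, the build step never executes because the verification step failed first.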
Inspect the scenario run#
Let’s take a closer look at what should be a successful scenario run.
Navigate to the Last runs tab of the scenario.
Click on the most recent run to view its details.
The scenario’s build step triggered a job. Click on the job for the build step, and see that there was Nothing to do for it.
All that for nothing? What happened?
The data in the Flow has not changed. Not surprisingly then, the scenario was first able to successfully verify the Record count in range rule. This is the same result as when you directly tested the rule on the dataset. With this verification step done, the scenario could proceed to the build step.
The build step on the downstream test_scored dataset was set to build required dependencies. As this dataset was not out of date, Dataiku did not waste resources rebuilding it.
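The "Nothing to do" outcome can be pictured with a deliberately simplified Python sketch: an output is rebuilt only if it has never been built or if one of its inputs changed after the last build. The timestamps are plain integers and the function names are hypothetical; Dataiku's actual change detection is more involved than this.

```python
def is_out_of_date(output_built_at, inputs_changed_at):
    """An output is stale if never built or if any input changed since."""
    if output_built_at is None:
        return True
    return any(changed > output_built_at for changed in inputs_changed_at)


def build_if_needed(name, output_built_at, inputs_changed_at):
    """Report what a dependency-aware build would do for one dataset."""
    if is_out_of_date(output_built_at, inputs_changed_at):
        return f"Building {name}"
    return "Nothing to do"


# Inputs last changed at times 3 and 7, output built at time 10: up to date.
unchanged = build_if_needed("test_scored", 10, [3, 7])
# One input changed at time 12, after the last build: rebuild needed.
changed = build_if_needed("test_scored", 10, [3, 12])
print(unchanged)
print(changed)
```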
Tip
To see this job do some actual work, try the AI Collaboration Quick Start, where you’ll execute the same scenario via a reusable Dataiku Application!
What’s next?#
Congratulations! You’ve taken your first steps toward MLOps with Dataiku.
If you’ve already explored the Core Designer, ML Practitioner, and Advanced Designer learning paths, you’ll want to begin the MLOps Practitioner learning path and challenge yourself to earn the MLOps Practitioner certification.
Another option is to shift your attention to AI collaboration. In the Quick Start | Dataiku for AI collaboration, you can learn about how users with different profiles and responsibilities can securely work together to build advanced projects.
See also
You can also find more resources on MLOps and operationalization in the following spaces: