Quick Start | Dataiku for MLOps#

Get started#

Recent advancements in generative AI have made it easy to apply for jobs. But be careful! Scammers have also been known to create fake job postings in the hopes of stealing personal information from applicants. Let's see if you, with Dataiku's help, can spot a real job posting from a fake one!

Objectives#

In this quick start, you’ll:

  • Create an API endpoint from a prediction model.

  • Deploy a version of the API endpoint to a Dataiku API node.

  • Automate the building of a data pipeline.

Tip

To check your work, you can review a completed version of this entire project from data preparation through MLOps on the Dataiku gallery.

Create an account#

To follow along with the steps in this tutorial, you need access to an 11.0+ Dataiku instance. If you do not already have access:

  • Start a 14-day free trial. See this how-to for help if needed.

  • Note that the locally-installed free edition is not fully compatible with this tutorial.

Open Dataiku#

The first step is getting to the homepage of your Dataiku Design node.

  1. Go to the Launchpad.

  2. Click Open Instance in the Design node tile of the Overview panel once your instance has powered up.

  3. See this how-to if you run into any difficulties.

Important

If using a self-managed version of Dataiku, open the Dataiku Design node directly in your browser.

Create the project#

Once you are on the Design node homepage, you can create the tutorial project.

  1. From the Dataiku Design homepage, click + New Project.

  2. Click DSS tutorials in the dropdown menu.

  3. In the dialog, click Quick Starts on the left-hand panel.

  4. Choose MLOps Quick Start, and then click OK.

Dataiku screenshot of the dialog for creating a new project.

Note

You can also download the starter project from this website and import it as a zip file.

Build the Flow#

See a screencast covering this section’s steps

One of the first concepts a user needs to understand about Dataiku is the Flow. The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.

In fact, Dataiku has its own visual grammar to organize AI, machine learning, and analytics projects in a collaborative way.

  • Squares represent datasets. The icon on the square signifies the storage location, such as cloud storage, a relational database, or a local filesystem.

  • Circles represent recipes. The icon on the circle signifies the type of transformation; the color signifies visual or code.

  • Diamonds represent machine learning models.

Take a look now!

  1. If not already there, from the left-most menu in the top navigation bar, click on the Flow (or use the keyboard shortcut g + f).

    Dataiku screenshot of the MLOps starting Flow.

    This project begins in the Data Preparation Flow zone from a labeled dataset named job_postings composed of 95% real and 5% fake job postings. The pipeline builds a prediction model capable of classifying a job posting as real or fake. Your job will be to deploy the model found in the Machine Learning Flow zone as a real-time API endpoint.

  2. Take a moment to review the objects in the Flow. Gain a high-level understanding of how the recipes prepare, join, and split the data, train a model, and use it to score new data.

Unlike the initial uploaded datasets, the downstream datasets appear as outlines. This is because they have not been built, meaning that the relevant recipes have not been run to populate these datasets. However, this is not a problem because the Flow contains the recipes required to create these outputs at any time.

  1. Click to open the Flow Actions menu in the bottom right.

  2. Click Build all.

  3. Leaving the default options, click Build to run the recipes necessary to create the items furthest downstream.

  4. When the job completes, refresh the page to see the built Flow.

    Dataiku screenshot of the dialog for building the Flow.

    Let’s also take a closer look at the model itself.

  5. Double click to open the diamond-shaped Predict fraudulent (binary) model, and then return to the Flow when finished inspecting the model.

    • Note that it has only one version. As you retrain the model, the history of model versions is tracked, so you can easily roll back from the active version to an older one.

    • Click on the model version name Random forest (s2) - v1 at the top left of the tile to see the full report.

Dataiku screenshot of the saved model object.
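
The Build all action you just ran can also be scripted, for example from a CI job. Below is a minimal sketch using the dataikuapi client; the instance URL, API key, and project key are placeholders, and the DSSDataset.build helper is assumed to be available in your client version.

```python
# Minimal sketch: rebuild the furthest-downstream dataset and its dependencies
# with the dataikuapi client. Host URL, API key, and project key are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("https://your-dataiku-instance", "your-api-key")
project = client.get_project("MLOPS_QUICK_START")  # hypothetical project key

# Recursively build test_scored, running any upstream recipes that are out of date.
job = project.get_dataset("test_scored").build(job_type="RECURSIVE_BUILD")
print(job.get_status())
```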

Note

To learn more about creating the model, see the Quick Start | Dataiku for machine learning.

Create an API endpoint#

See a screencast covering this section’s steps

Dataiku’s architecture for MLOps supports both batch and real-time scoring. In this case, let’s implement a real-time API strategy to individually score new job postings as real or fake.

  1. From the Flow, click on the saved model Predict fraudulent (binary) once to select it.

  2. Click to open the Actions tab on the right.

  3. Select Create API.

  4. Name the service ID job_postings.

  5. Name the endpoint ID predict_fake_job.

  6. Click Append.

Dataiku screenshot of dialog for creating an API endpoint.

Note

This path was a shortcut to the API Designer found in the top navigation bar’s More Options (…) menu.
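
If you prefer to script this step, the same service and endpoint can be created with the dataikuapi client. The sketch below follows the usual get_settings/save pattern, but treat the endpoint-creation call as an assumption and verify the exact method names in the client reference for your Dataiku version; the host, API key, and project key are placeholders.

```python
# Hedged sketch: create the job_postings API service and a prediction endpoint
# programmatically. Placeholders: host URL, API key, project key.
import dataikuapi

client = dataikuapi.DSSClient("https://your-dataiku-instance", "your-api-key")
project = client.get_project("MLOPS_QUICK_START")

# Look up the ID of the saved model deployed in the Flow.
saved_model_id = project.list_saved_models()[0]["id"]

# Create the service, then add an endpoint backed by the saved model
# (add_prediction_endpoint is assumed; check your dataikuapi reference).
service = project.create_api_service("job_postings")
settings = service.get_settings()
settings.add_prediction_endpoint("predict_fake_job", saved_model_id)
settings.save()
```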

Add test queries#

Before deploying, let’s add some test queries to the API endpoint to make sure that it is working correctly.

  1. For the predict_fake_job endpoint, navigate to the Test queries panel on the left.

  2. Click + Add Queries.

  3. Add 5 queries.

  4. Choose to add them from the test dataset.

  5. Click Add.

  6. Click the blue Run Test Queries button at the top right.

  7. Examine some of the test queries, including the features that were sent to the endpoint, the prediction returned, and additional details.

Dataiku screenshot of test queries in the API designer.
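
For reference, each test query is a small JSON document whose features object mirrors the model's input columns. The sketch below shows the general shape as a Python dict; the feature names are illustrative placeholders, not the actual job_postings schema.

```python
# Illustrative shape of a single test query for a prediction endpoint.
# The keys under "features" must match the model's input columns; the names
# below are placeholders rather than the real job_postings columns.
test_query = {
    "features": {
        "title": "Data Engineer",
        "description": "We are hiring...",
        "telecommuting": 0,
    }
}
```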

Deploy an API endpoint#

See a screencast covering this section’s steps

Although you have created a version of an API service including the endpoint, it exists only on the Design node, which is a development environment. A production use case requires separate environments for development and production.

A batch deployment use case would require enabling an Automation node. However, since this is a real-time API deployment use case, you need an API node.

At a high level, you can think of the entire process as three steps:

  1. Create the API service on the Design node (already done!).

  2. Publish the API service on the Design node to the API Deployer.

  3. Deploy the API service on the API Deployer to an API node.

Configure an API node#

Before deploying, you first need to configure a production environment.

  • Free trial users (or any Dataiku Cloud users) need to activate the API node extension from their Launchpad.

  • Users on self-managed instances need to follow the reference documentation for setting up the API Deployer and API node.

From the Design node to the Deployer#

Once you have the necessary infrastructure in place, it’s a few more clicks to actually deploy the endpoint.

  1. From the job_postings API service on the Design node, click Publish on Deployer.

  2. Click OK, accepting the default version ID.

Dataiku screenshot of the dialog for publishing an API service.

From the Deployer to an API node#

You have now pushed the API service from the Design node to the API Deployer, so let’s navigate there.

  1. Immediately after publishing, you can click the popup notification to Open API Deployer.

  2. If you miss it, open the Applications menu in the top right.

  3. Choose Local Deployer.

  4. Then click Deploying API Services.

Dataiku screenshot of the path to find the local deployer.

Now that the service is on the API Deployer, there is one more step to deploy the endpoint to an API node.

  1. On the API Deployer, find your API service.

  2. Click Deploy.

  3. If not using Dataiku Cloud, select an available infrastructure; otherwise, one will already be chosen for you.

  4. Click Deploy again.

  5. Click Deploy once more to confirm.

Dataiku screenshot of the dialog for deploying an API service.

You now have an API endpoint running in a production environment!

Dataiku screenshot of an active API deployment.

Send test queries to the API node#

Once again, let’s test the endpoint with a few more queries — this time sending them to an API node.

  1. From the Status tab of the predict_fake_job endpoint on the API Deployer, navigate to the Run and test subtab.

  2. Click Run All.

Dataiku screenshot of test queries run on the API node.
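
Beyond the Run and test subtab, any client application can now query the deployed endpoint. Here is a minimal sketch using the dataikuapi package's APINodeClient; the API node URL and the feature names are placeholders (Dataiku Cloud users can find the endpoint URL on the deployment's Status tab).

```python
# Minimal sketch: query the deployed endpoint from client code.
# The API node URL and feature names are placeholders for your own deployment.
from dataikuapi import APINodeClient

client = APINodeClient("https://your-api-node", "job_postings")

record = {
    "title": "Data Engineer",           # placeholder features; use the actual
    "description": "We are hiring...",  # columns expected by the model
    "telecommuting": 0,
}

prediction = client.predict_record("predict_fake_job", record)
print(prediction)  # includes the predicted class and probabilities
```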

Note

Once you’ve deployed an API service, the next step would be to monitor it using an Evaluate recipe and a model evaluation store. You’ll learn about these tools as you progress further with Dataiku!

Automate the Flow#

See a screencast covering this section’s steps

Once you’ve mastered the basics, you can start automating your MLOps processes with Dataiku’s system of scenarios. A scenario in Dataiku is a set of actions to run, along with conditions for when they should execute.

Let’s design a scenario that rebuilds the furthest downstream dataset (the test_scored dataset) if all checks on the initial upstream data pass successfully.

Note

These automation tools can be implemented visually, with code, or a mixture of both. To get started using code in your MLOps workflows, see the Developer Guide.
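
For example, scripting against a Dataiku instance typically starts from a single client connection like the one sketched below; the host URL, API key, and project key are placeholders for your own instance.

```python
# Common starting point for scripting Dataiku from outside the instance.
# Placeholders: host URL, API key, project key.
import dataikuapi

client = dataikuapi.DSSClient("https://your-dataiku-instance", "your-api-key")
print(client.list_project_keys())  # quick check that the connection works

project = client.get_project("MLOPS_QUICK_START")  # hypothetical project key
```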

Compute metrics on Flow objects#

You can begin by computing default metrics (or defining your own) on Flow objects, such as:

  • The size of a dataset

  • The AUC of a model

  • The data drift of a model evaluation

For example:

  1. Navigate back to the Design node project.

  2. Open the job_postings dataset in the Data Preparation Flow zone.

  3. Click on the Status tab.

  4. Click Compute to calculate the default metrics.

Dataiku screenshot of metrics on a dataset.
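
These same metrics can also be computed outside the UI, for example from a scheduled script. A minimal sketch with the dataikuapi client follows; the connection details are placeholders, and the metric accessors are assumed from the client reference.

```python
# Sketch: compute the dataset's metrics programmatically and list their IDs.
# Placeholders: host URL, API key, project key.
import dataikuapi

client = dataikuapi.DSSClient("https://your-dataiku-instance", "your-api-key")
dataset = client.get_project("MLOPS_QUICK_START").get_dataset("job_postings")

dataset.compute_metrics()
metrics = dataset.get_last_metric_values()
print(metrics.get_all_ids())  # e.g. records:COUNT_RECORDS
```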

Create checks on metrics#

You can then create checks on these metrics, such as:

  • Is the size of a dataset within a certain range?

  • Is the AUC of a model within a certain range?

  • Is the data drift of a model evaluation within a certain range?

For example:

  1. From the Status tab of the job_postings dataset, navigate to the Edit subtab.

  2. Within that, go to the Checks panel.

  3. Click Check to test the existing check that the number of records in the dataset is within some expected range.

Dataiku screenshot of a check on a dataset.
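
The same check can be run from code as well. Below is a short hedged sketch with the dataikuapi client; connection details are placeholders.

```python
# Sketch: run the dataset's checks programmatically and inspect the results.
# Placeholders: host URL, API key, project key.
import dataikuapi

client = dataikuapi.DSSClient("https://your-dataiku-instance", "your-api-key")
dataset = client.get_project("MLOPS_QUICK_START").get_dataset("job_postings")

results = dataset.run_checks()
print(results)  # per-check outcome, e.g. OK / WARNING / ERROR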

Run a scenario#

Lastly, you can incorporate these checks into the conditions of a scenario. Examples include:

  • If all checks on an upstream dataset pass, rebuild the final downstream dataset.

  • If any check on a model’s performance fails, retrain the model.

  • If one check on a model evaluation fails, send an email alert.

This project has the beginnings of a scenario to accomplish the first example. You can finish it!

  1. From the Jobs menu in the top navigation bar, open the Scenarios page.

  2. Click to open the Score Data scenario.

  3. Navigate to the Steps tab, and read the first two actions included in the scenario.

  4. Click Add Step to view the available steps.

  5. Choose Build / Train.

  6. Click Add Dataset to Build.

  7. Select test_scored, and click Add.

  8. Click the green Run button at the top right to manually trigger the scenario.

Dataiku screenshot of the Steps tab of a scenario.
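
Scenarios can also be triggered from outside the UI, which is handy for integrating Dataiku into external schedulers or CI pipelines. The sketch below assumes the scenario ID is ScoreData (check the ID shown in the scenario's URL); connection details are placeholders.

```python
# Sketch: run the scenario remotely and wait for it to finish.
# Placeholders: host URL, API key, project key, scenario ID.
import dataikuapi

client = dataikuapi.DSSClient("https://your-dataiku-instance", "your-api-key")
scenario = client.get_project("MLOPS_QUICK_START").get_scenario("ScoreData")

scenario.run_and_wait()  # blocks until the run completes; raises if it fails
print("Scenario run finished")
```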

Note

Here you manually ran the scenario, but on the Settings tab, you’ll find a tile for Triggers that can automatically execute a scenario based on conditions such as time, a dataset change, or even Python code.
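
As a hedged illustration of the Python option, a custom trigger is a short script that runs periodically and decides whether to fire the scenario; the condition below is a placeholder for your own logic.

```python
# Sketch of a custom Python trigger (defined in the scenario's Settings > Triggers).
# The firing condition is a placeholder; replace it with a real check, such as
# polling an external system for new job postings.
from dataiku.scenario import Trigger

t = Trigger()

new_postings_available = True  # placeholder condition
if new_postings_available:
    t.fire()
```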

Inspect the scenario run#

Let’s take a closer look at this scenario run.

  1. Navigate to the Last runs tab of the scenario.

  2. Click on the run in the left hand panel to view its details.

  3. Step 3 of the scenario triggered a job. Click to open it, and see that there was “Nothing to do” for it.

Dataiku screenshot of the last runs tab of a scenario.

With no new data in the pipeline, the check on the upstream job_postings dataset passed as it did before. However, the build step on the downstream test_scored dataset was set to build required dependencies. Because this dataset was not out of date, Dataiku did not waste resources rebuilding it.

Tip

To see this job do some actual work, try the Quick Start | Dataiku for AI collaboration, where you’ll execute the same scenario via a reusable Dataiku Application!

What’s next?#

Congratulations! You’ve taken your first steps toward MLOps with Dataiku.

You’re now ready to begin the MLOps Practitioner learning path and challenge yourself to earn the MLOps Practitioner certification.

Another option is to shift your attention to AI collaboration. In the Quick Start | Dataiku for AI collaboration, you can learn about how users with different profiles and responsibilities can securely work together to build advanced projects.

Note

You can also find more resources on MLOps and operationalization in the following spaces: