Tutorial | Surface external models within Dataiku#

Background#

Organizations often rely on a variety of tools and architectures across their technology stack. For instance, it is common to train models with one tool and deploy them with another.

Fortunately, an organization that trains and deploys models outside of Dataiku can benefit from Dataiku’s flexible MLOps offering by “hopping on or off” at a number of different integration points.

This tutorial will review an integration that lets you take a model deployed on a cloud ML platform and surface it within Dataiku as an external model.

Slide of Dataiku's MLOps approach.

Objectives#

In this tutorial, you will:

  • Surface within Dataiku a model externally deployed on a cloud ML platform.

  • Use that external model for explainability reports, comparisons with other types of models in Dataiku, and scoring data.

Prerequisites#

Completing the setup for this tutorial requires the assistance of your instance administrator and knowledge of the security and configuration of your cloud ML platform.

Before beginning this tutorial, you’ll need to:

  • Have access to an instance of Dataiku (version 12.2 or above).

  • Have an active deployed endpoint in Amazon SageMaker, Azure Machine Learning, or Google Vertex AI. We’ll point to external resources for creating a sample endpoint in each case. Note that completing those external tutorials may require knowledge or rights on the cloud platform that are unrelated to Dataiku.

  • Have an instance administrator create the external models code environment in the Administration > Settings > Misc section, as noted in the reference documentation.

  • Grant Dataiku permission to access your external endpoint. Your instance may not already have these permissions, since they depend on specific scope rights. We will specify the minimum required permissions to the extent possible.

  • Have an instance administrator create a dedicated connection to your cloud ML platform.

Create the project#

To get started, you can import a Dataiku project including many of the starter objects for your chosen cloud ML platform.

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > MLOps Practitioner > External Models.

Note

You can also download the starter project from this website and import it as a zip file.

Deploy a working external endpoint#

Before you can surface an external model in Dataiku, you need an endpoint running in your cloud ML platform. In each case, we have created an example endpoint following the cloud ML platform’s documentation.

  1. Create an Amazon SageMaker endpoint by following their tutorial on customer churn prediction with XGBoost.

When finished, you should be able to get predictions, such as:

import numpy as np

# `predict` is the helper function defined in the SageMaker tutorial notebook.

dummy = "5,0,12.524227291335833,2,5.639471129269478,5,4.937219265876419,150,5.881787271425925,5,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0"
dummy_ar = np.fromstring(dummy, sep=',')
predictions = predict(dummy_ar)
print(predictions)

It should print something like:

['0.9981379508972168']
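
The `predict` helper above comes from the SageMaker tutorial notebook. As a rough sketch of what a direct call to the endpoint might look like outside of that notebook, the following uses the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client. The function name, the endpoint name, the `text/csv` content type, and the text response format are assumptions based on the XGBoost churn example, so adapt them to your own endpoint.

```python
def invoke_csv_endpoint(runtime_client, endpoint_name, features):
    """Send one row of numeric features as CSV and return the scores as floats.

    `runtime_client` is expected to behave like boto3.client("sagemaker-runtime").
    """
    payload = ",".join(str(value) for value in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a stream; this sketch assumes the scores come back
    # as plain text separated by commas or newlines.
    raw = response["Body"].read().decode("utf-8")
    return [float(score) for score in raw.replace("\n", ",").split(",") if score.strip()]


# Usage (requires boto3 and valid AWS credentials; the endpoint name is hypothetical):
# import boto3
# client = boto3.client("sagemaker-runtime", region_name="eu-west-1")
# print(invoke_csv_endpoint(client, "my-churn-endpoint", [5, 0, 12.5]))
```

Passing the client in as an argument keeps the function easy to try with a stand-in object before pointing it at a real endpoint.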

Create a connection in Dataiku to access the external endpoint#

After confirming that you have an endpoint running in your cloud ML platform, the next step is to create a connection in Dataiku with sufficient credentials to access the endpoint.

Important

Creating the connection requires administrative access. However, once the connection is created, other users only need to be included in its Details readable by security setting to use it.

  1. From the Applications menu of the Design node, click Administration.

  2. Navigate to the Connections tab.

  3. Click + New Connection.

  4. Choose your cloud ML platform from the Managed Model Deployment Infrastructures section.

  5. Provide any name, such as sagemaker-ext.

  6. Configure an authentication method that provides a role with the right to query the SageMaker endpoint in the correct account and region. (This screenshot uses STS with AssumeRole.)

Dataiku screenshot of an Amazon SageMaker connection.

Dataiku external models require the following rights:

  • sagemaker:DescribeEndpoint

  • sagemaker:DescribeEndpointConfig

  • sagemaker:InvokeEndpoint

  • sagemaker:ListEndpointConfigs

  • sagemaker:ListEndpoints

Note

See the Amazon SageMaker documentation on API permissions for details.
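
To make the rights listed above concrete, here is a minimal sketch that assembles them into an IAM policy document. The `build_policy` helper is hypothetical, and the `*` resource is an assumption that keeps the example short; your security team may prefer a narrower scope for the actions that support one.

```python
import json

# The five actions listed above as required by Dataiku external models.
REQUIRED_ACTIONS = [
    "sagemaker:DescribeEndpoint",
    "sagemaker:DescribeEndpointConfig",
    "sagemaker:InvokeEndpoint",
    "sagemaker:ListEndpointConfigs",
    "sagemaker:ListEndpoints",
]


def build_policy(resource="*"):
    """Return an IAM policy document (as a dict) allowing the required actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(REQUIRED_ACTIONS),
                "Resource": resource,  # assumption: broad scope for illustration
            }
        ],
    }


# Print the JSON that could be attached to the role used by the connection.
print(json.dumps(build_policy(), indent=2))
```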

Add an evaluation dataset in Dataiku#

Once you have a connection in Dataiku to the external endpoint, you can create an external saved model.

However, before creating the actual model, a good practice is to have a dataset for Dataiku to evaluate the model performance. For each case here, we will use the model’s training data found in the starter project.

If you prefer not to evaluate the model, you can make that selection when creating a version of the external saved model. However, you’ll still need to provide a dataset so that Dataiku can save the feature names and types.

  1. In the Amazon SageMaker Flow zone, find the sagemaker_churn_prepared dataset.

Note

We initially uploaded the dataset from the churn tutorial as sagemaker_churn. However, as SageMaker models require a very specific format, we needed to transform this dataset into sagemaker_churn_prepared with a Python recipe using the exact same code as in the tutorial notebook.

Create an external saved model in Dataiku#

Now that you have an evaluation dataset, let’s proceed with the creation of the external saved model itself.

  1. From the Visual Analyses menu in the top navigation bar, click on Saved Models.

  2. Click + New External Saved Model.

  3. Choose your cloud ML platform.

Dataiku screenshot of the dialog for creating a new external saved model.

Important

Although you’ll see MLflow as an option here, be aware of a key difference. The MLflow option refers to a model designed outside of Dataiku, packaged using the MLflow format, and imported into Dataiku. The external models discussed here, by contrast, remain deployed where they are and are simply made available to Dataiku users, as you will see.

Fill in the basic model information:

  1. Give a meaningful name to your model like sagemaker-churn.

  2. The prediction type is a Two-class classification.

  3. For the Authentication field, select the connection used in the previous step; sagemaker-ext in our case.

  4. Select the Region where your endpoint is deployed; eu-west-1 in our case.

Dataiku screenshot of the dialog for creating a new Amazon SageMaker external model.

Add a model version#

We now have an external model, but we need to create a model version that will contain all the details to query the endpoint.

  1. Open the saved model, and click on + Add Model Version.

  2. In the first section, click on Get Endpoints List to fetch a list of accessible endpoints.

  3. Select your endpoint from the list. Alternatively, you can enter its ID directly.

  4. Enter a simple version ID, such as v1.

Then complete the rest of the window with details specific to the case at hand:

  1. Enter the classes: 0 and 1 in our case.

  2. Leaving the default option to evaluate the model selected, enter the evaluation dataset, sagemaker_churn_prepared, to get performance metrics on the model directly in Dataiku.

  3. Select Churn?_True. as the target column.

  4. As the computation of explainability data is not very long in this sample, uncheck the Skip expensive reports box.

  5. In this case, you can leave the default option of allowing Dataiku to guess the input/output format using the evaluation dataset.

  6. Click Create to let Dataiku create and evaluate this model. The creation is fairly fast, but the evaluation can take some time.

Dataiku screenshot of the dialog for creating a new Amazon SageMaker model version.

Note

Unchecking the default option to Skip expensive reports allows Dataiku to perform the requests required to compute the explainability sections. Although this adds value to the model, it triggers thousands of requests.

This is acceptable for our purpose, but for production endpoints, the additional stress on the endpoint may be a factor to consider. If need be, these reports can also be computed afterward individually.
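
If the endpoint list in the model version dialog comes back empty, it can help to check what the same credentials see outside of Dataiku. The sketch below is not how Dataiku fetches the list; it is an assumed, minimal equivalent built on the boto3 SageMaker `list_endpoints` call, with a hypothetical helper name.

```python
def list_in_service_endpoints(sagemaker_client):
    """Return the names of all endpoints whose status is InService.

    `sagemaker_client` is expected to behave like boto3.client("sagemaker").
    """
    names = []
    kwargs = {"StatusEquals": "InService"}
    while True:
        page = sagemaker_client.list_endpoints(**kwargs)
        names.extend(item["EndpointName"] for item in page.get("Endpoints", []))
        # Results are paginated; keep going while a NextToken is returned.
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


# Usage (requires boto3 and valid AWS credentials):
# import boto3
# client = boto3.client("sagemaker", region_name="eu-west-1")
# print(list_in_service_endpoints(client))
```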

Manage external endpoint changes#

Before demonstrating how you can use an external model that has been surfaced in Dataiku, it is important to recognize that Dataiku does not control the endpoint behind this object. Accordingly, these endpoints can change without notice. This could be a technical change, such as the addition of memory to the underlying infrastructure, or a change in the exposed model, such as a new deployed version.

  1. From the Saved Models page, open your saved model.

  2. Open the report for your model version (such as v1).

  3. In the Summary panel, find the following details about the external endpoint and the Check Endpoint button:

Dataiku screenshot of the model summary for an Amazon SageMaker endpoint.

The Check Endpoint button verifies whether the data Dataiku fetched and stored at creation time is still accurate. If there are differences, Dataiku will offer to create a new saved model version, since the existing performance and explainability data may no longer be correct.

Dataiku screenshot of the result of the check endpoint dialog.
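
Conceptually, this kind of check compares a stored snapshot of the endpoint description with its current state. The sketch below illustrates that idea only; it is not Dataiku's implementation. The `changed_fields` helper and the sample values are hypothetical, while the field names follow the keys returned by the SageMaker `DescribeEndpoint` API.

```python
def changed_fields(stored, current):
    """Return {field: (stored_value, current_value)} for every field that differs."""
    keys = sorted(set(stored) | set(current))
    return {
        key: (stored.get(key), current.get(key))
        for key in keys
        if stored.get(key) != current.get(key)
    }


# Hypothetical snapshots: the endpoint now points to a new endpoint configuration.
stored = {"EndpointName": "churn", "EndpointConfigName": "churn-config-v1"}
current = {"EndpointName": "churn", "EndpointConfigName": "churn-config-v2"}

print(changed_fields(stored, current))
# {'EndpointConfigName': ('churn-config-v1', 'churn-config-v2')}
```

An empty result would mean the stored details still match, while any entry signals that metrics computed against the earlier state deserve a second look.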

Use an external model in Dataiku#

Once you have a version of an external saved model in Dataiku, you can use it much like other visual models in Dataiku. In particular, we’ll highlight the following three benefits:

Model explainability#

To avoid black box models, Dataiku offers many built-in visualizations to understand the nature of a model. The same interface for investigating a model built in Dataiku can be used for investigating models deployed on a cloud ML platform.

  1. From the report of your model version, browse the Explainability and Performance sections on the left that were computed during the evaluation process.

  2. If you left the Skip expensive reports option checked when creating the model version, feel free to request the computation of these reports manually.

Dataiku screenshot of the explainability panel for an Amazon SageMaker endpoint.

Model comparison#

In order to determine a champion model from any number of challengers, you may be familiar with creating a model comparison in Dataiku. This feature enables you to compare metrics side by side for any combination of models built in Dataiku, MLflow models imported into Dataiku, and models deployed externally.

The starter project already includes a model comparison containing a simple AutoML model built in Dataiku. You just need to add the external model to the existing model comparison object.

  1. From the Visual Analyses menu in the top navigation bar, click on Model Comparisons.

  2. Click to open the SageMaker vs AutoML comparison.

  3. Click Configure.

  4. Click + Add Items.

  5. Click Select Saved Model Versions from the Flow.

  6. Select the model version of your external endpoint.

  7. Click Add.

  8. Click Apply.

Dataiku screenshot of the model comparison including an Amazon SageMaker endpoint.

Model scoring#

If the external model is indeed your champion, you can use it in a Score recipe in a Dataiku Flow as you would a visual model. You can run the recipe manually for ad hoc scoring or through a scenario for scheduled, high-volume batch scoring.

  1. From the Flow zone dedicated to your cloud ML platform, select the external model and its evaluation dataset.

  2. In the Actions tab, select the Score recipe.

  3. Click Create Recipe.

  4. In the Settings tab of the Score recipe, notice the engine selection of External Model at the bottom left. Recall that although you have surfaced the model in Dataiku, it remains deployed externally.

  5. Click Run to score the data using the external endpoint’s engine.

Dataiku screenshot of the engine selection for a Score recipe with an external model.

Dataiku screenshot of a scoring Flow including an Amazon SageMaker endpoint.

Additional considerations#

Endpoint scalability amidst increased democratization#

A key advantage of surfacing an external endpoint within Dataiku is the expanded audience of users who will be able to interact with the model. However, as with any democratization benefit, you’ll need to consider the corresponding impact on performance and volume on the endpoint.

The operations shown here generate queries to the external endpoint. This is especially true with explainability requests, which can easily reach into the thousands. To serve this new usage, you’ll need to plan for your endpoint to scale accordingly.

End user discovery and import of external models#

Although using external models in Dataiku is straightforward, actually surfacing them requires some knowledge of the underlying model and, potentially, administration rights to Dataiku and the cloud ML platform.

To simplify this process for end users, administrators might consider creating a dedicated Dataiku project that serves as a home for all external models and their corresponding sample datasets.

By making these models shareable and granting permissions to this project, a wider audience of Dataiku users can search for and add the desired external model into their own projects.

Deployment to a production environment#

This tutorial only demonstrated the usage of an external model in a Design sandbox. However, external models can also be surfaced in a scheduled, operationalized project. Deploying such a project to an Automation node is perfectly workable, provided the required connection also exists on the Automation node.

What’s next?#

Congratulations! You have taken an endpoint deployed on your cloud ML platform, and surfaced it as an external model within Dataiku so that you, as well as a wider audience, can use it for key functions like model explainability, comparisons, and scoring.

Note

You can learn more about external models in the reference documentation.

You can also learn more about other MLOps integration points, such as: