Tutorial | Surface external models within Dataiku#
Get started#
Often, an organization employs a variety of tools and architectures in its technological schema. For instance, organizations commonly work with different tools for model training and deployment.
Fortunately, an organization that trains and deploys models outside of Dataiku can benefit from Dataiku’s flexible MLOps offering by “hopping on or off” at a number of different integration points.
This tutorial will review an integration that lets you take a model deployed on a cloud ML platform and surface it within Dataiku as an external model.
Objectives#
In this tutorial, you will:
Surface within Dataiku a model externally deployed on a cloud ML platform.
Use that external model for explainability reports, comparisons with other types of models in Dataiku, and scoring data.
Prerequisites#
Completing the setup for this tutorial requires the assistance of your instance administrator and knowledge of the security and configuration of your cloud ML platform.
Before beginning this tutorial, you’ll need:
Dataiku 12.2 or later.
A Full Designer user profile on the Dataiku for AI/ML or Enterprise AI packages.
An active deployed endpoint in Amazon SageMaker, Azure Machine Learning, or Google Vertex AI. We’ll point to external resources for creating a sample endpoint in each case. Completing those external tutorials may require knowledge or rights unrelated to Dataiku.
An instance administrator to have created the external models code environment in the Administration > Code envs > Internal envs setup section, as noted in the reference documentation.
Permissions granted to Dataiku to access your external endpoint. This may not be granted to your current instance as it relates to specific scope rights. We will specify the minimum required permissions to the extent possible.
An instance administrator to have created a dedicated connection to your cloud ML platform.
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select External Models.
Click Install.
From the project homepage, click Go to Flow (or g + f).
From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by MLOps Practitioner.
Select External Models.
From the project homepage, click Go to Flow (or g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Deploy a working external endpoint#
Before you can surface an external model in Dataiku, you need an endpoint running in your cloud ML platform. In each case, we have created an example endpoint following the cloud ML platform’s documentation.
Create an Amazon SageMaker endpoint by following their tutorial on customer churn prediction with XGBoost.
When finished, you should be able to get predictions, such as:
dummy = "5,0,12.524227291335833,2,5.639471129269478,5,4.937219265876419,150,5.881787271425925,5,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0"
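# np is NumPy and predict() is the helper assumed to be defined earlier in the SageMaker tutorial notebook.
dummy_ar = np.fromstring(dummy, sep=',')  # parse the comma-separated features into a float array
predictions = predict(dummy_ar)
print(predictions)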
It should print something like:
['0.9981379508972168']
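The exchange above boils down to sending feature rows as headerless CSV and reading scores back as text. The following is a minimal, standard-library sketch of those two conversions. The function names `to_csv_payload`, `parse_scores`, and `to_class`, as well as the 0.5 threshold, are assumptions made for illustration; they are not part of the SageMaker tutorial or of Dataiku.

```python
def to_csv_payload(rows):
    """Serialize numeric feature rows as headerless CSV, one row per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_scores(raw_scores):
    """Convert score strings such as ['0.9981379508972168'] to floats."""
    return [float(score) for score in raw_scores]


def to_class(score, threshold=0.5):
    """Map a score to class 1 or 0 (the 0.5 threshold is an assumption)."""
    return 1 if score >= threshold else 0


payload = to_csv_payload([[5, 0, 12.5]])
scores = parse_scores(["0.9981379508972168"])
classes = [to_class(score) for score in scores]
print(payload, scores, classes)
```

The two classes, 0 and 1, are the same ones you will later declare when adding a model version in Dataiku.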
Create an Azure Machine Learning endpoint by following their tutorial on deploying a model to predict credit card defaults as an online endpoint.
When finished, this code block from the tutorial should return '[1,0]'.
# test the blue deployment with the sample data
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="blue",
    request_file="./deploy/sample-request.json",
)
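The returned value is a JSON array encoded as a string, holding one class label per input row rather than a probability. A minimal sketch of decoding and validating it follows; the helper name `parse_labels` is hypothetical, and the check for labels 0 and 1 assumes the two-class model from the Azure tutorial.

```python
import json


def parse_labels(response_text):
    """Decode the JSON array returned by the endpoint into a list of labels."""
    labels = json.loads(response_text)
    # Assumption: a two-class model whose only valid labels are 0 and 1.
    if not all(label in (0, 1) for label in labels):
        raise ValueError(f"unexpected label in response: {labels!r}")
    return labels


labels = parse_labels("[1,0]")
print(labels)
```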
Create a Google Vertex AI endpoint by following their tutorial on AutoML tabular training and prediction.
When finished, you should be able to get predictions, such as:
prediction = endpoint.predict(
    [
        {
            "Type": "Cat",
            "Age": "3",
            "Breed1": "Tabby",
            "Gender": "Male",
            "Color1": "Black",
            "Color2": "White",
            "MaturitySize": "Small",
            "FurLength": "Short",
            "Vaccinated": "No",
            "Sterilized": "No",
            "Health": "Healthy",
            "Fee": "100",
            "PhotoAmt": "2",
        }
    ]
)
print(prediction)
It should print something like:
Prediction(predictions=[
{'classes': ['Yes', 'No'],
'scores': [0.6285621523857117, 0.3714377582073212]
}],
...
)
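Each entry in `predictions` pairs a list of classes with a list of scores in the same order. As a minimal sketch, the predicted class can be read off by taking the position of the highest score. The helper name `top_class` is hypothetical, and the sample dictionary simply copies the output shown above.

```python
def top_class(prediction):
    """Return (class, score) for the highest-scoring class of one prediction."""
    classes = prediction["classes"]
    scores = prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores[best]


# One entry of `predictions`, copied from the sample output.
sample = {
    "classes": ["Yes", "No"],
    "scores": [0.6285621523857117, 0.3714377582073212],
}
label, score = top_class(sample)
print(label, score)
```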
Create a connection in Dataiku to access the external endpoint#
After confirming that you have an endpoint running in your cloud ML platform, the next step is to create a connection in Dataiku with sufficient credentials to access the endpoint.
Important
Creating the connection requires administrative access. However, once the connection is created, other users only require the security setting Details readable by to access it.
From the waffle menu of the Design node, click Administration.
Navigate to the Connections tab.
Click + New Connection.
Choose your cloud ML platform from the Managed Model Deployment Infrastructures section.
Provide any name, such as sagemaker-ext.
Enter credentials that let Dataiku obtain a role with the right to query the SageMaker endpoint in the correct account and region. (This screenshot uses STS with AssumeRole.)
Dataiku external models require the following rights:
sagemaker:DescribeEndpoint
sagemaker:DescribeEndpointConfig
sagemaker:InvokeEndpoint
sagemaker:ListEndpointConfigs
sagemaker:ListEndpoints
Note
See the Amazon SageMaker documentation on API permissions for details.
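As an illustration, the five rights listed above could be grouped into a single IAM policy statement. The sketch below builds such a policy document as JSON. The `"Resource": "*"` value is an assumption chosen for brevity, and the helper name `build_policy` is hypothetical; your security team may prefer to narrow the resource scope where the actions allow it.

```python
import json

# The rights listed above as required by Dataiku external models.
REQUIRED_ACTIONS = [
    "sagemaker:DescribeEndpoint",
    "sagemaker:DescribeEndpointConfig",
    "sagemaker:InvokeEndpoint",
    "sagemaker:ListEndpointConfigs",
    "sagemaker:ListEndpoints",
]


def build_policy(resource="*"):
    """Build an IAM policy document allowing the required actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": REQUIRED_ACTIONS,
                "Resource": resource,  # assumption: "*" for brevity
            }
        ],
    }


policy_json = json.dumps(build_policy(), indent=2)
print(policy_json)
```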
From the waffle menu of the Design node, click Administration.
Navigate to the Connections tab.
Click + New Connection.
Choose your cloud ML platform from the Managed Model Deployment Infrastructures section.
Provide any name, such as azureml-ext.
Enter credentials that let Dataiku obtain a role with the right to query the Azure Machine Learning endpoints you want to use. (This screenshot uses the OAuth from App method.)
From the waffle menu of the Design node, click Administration.
Navigate to the Connections tab.
Click + New Connection.
Choose your cloud ML platform from the Managed Model Deployment Infrastructures section.
Provide any name, such as vertex-ext.
Enter the credentials to use. Ensure that this account is allowed to query Vertex endpoints, which is given by the role Vertex AI User.
If you want more granular access, Dataiku requires the following rights:
aiplatform.endpoints.get
aiplatform.endpoints.list
aiplatform.endpoints.predict
Note
See the Vertex AI documentation on access control with IAM for details.
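If you opt for granular access, one way to package the three permissions above is a custom IAM role. The sketch below prints a role definition as JSON. The field names (`title`, `description`, `stage`, `includedPermissions`) follow the layout of Google Cloud custom role definition files as we understand it, and the title and description are placeholders; treat all of this as an assumption to verify against the Google Cloud IAM documentation.

```python
import json

# Hypothetical custom role limited to the permissions listed above.
ROLE_DEFINITION = {
    "title": "Dataiku external models (example)",
    "description": "Read, list, and query Vertex AI endpoints.",
    "stage": "GA",
    "includedPermissions": [
        "aiplatform.endpoints.get",
        "aiplatform.endpoints.list",
        "aiplatform.endpoints.predict",
    ],
}

role_json = json.dumps(ROLE_DEFINITION, indent=2)
print(role_json)
```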
Add an evaluation dataset in Dataiku#
Once you have a connection in Dataiku to the external endpoint, you can create an external saved model.
However, before creating the actual model, a good practice is to have a dataset for Dataiku to evaluate the model performance. For each case here, we will use the model’s training data found in the starter project.
If you prefer not to evaluate the model, you can make that selection when creating a version of the external saved model. However, you’ll still need to provide a dataset so that Dataiku can save the feature names and types.
In the Amazon SageMaker Flow zone, find the sagemaker_churn_prepared dataset.
Note
We initially uploaded the dataset from the churn tutorial as sagemaker_churn. However, as SageMaker models require a very specific format, we needed to transform this dataset into sagemaker_churn_prepared with a Python recipe using the exact same code as in the tutorial notebook.
In the Azure ML Flow zone, find the default_of_credit_card_clients_prepared dataset.
Note
We uploaded the dataset from this Azure ML example data, removed the first line (X1, X2, ...), and dropped the ID column in a Prepare recipe.
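The starter project performs this cleanup in a Prepare recipe. Purely as an illustration, the same two steps (skip the first line of X1, X2, ... and drop the ID column) can be sketched in Python with the standard library. The helper name `prepare_credit_csv` and the tiny sample are hypothetical, and the sketch assumes the second line of the file holds the real column names.

```python
import csv
import io


def prepare_credit_csv(raw_text, id_column="ID"):
    """Drop the first header line and the ID column from CSV text."""
    reader = csv.reader(io.StringIO(raw_text))
    next(reader)           # skip the first line: X1, X2, ...
    header = next(reader)  # assumption: the second line holds the column names
    id_index = header.index(id_column)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in [header, *reader]:
        writer.writerow(row[:id_index] + row[id_index + 1:])
    return out.getvalue()


raw = (
    ",X1,X2,Y\n"
    "ID,LIMIT_BAL,SEX,default payment next month\n"
    "1,20000,2,1\n"
)
prepared = prepare_credit_csv(raw)
print(prepared)
```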
In the Vertex AI Flow zone, find the petfinder-tabular-classification dataset.
Note
We uploaded it from this Google Cloud sample data.
Create an external saved model in Dataiku#
Now that you have an evaluation dataset, let’s proceed with the creation of the external saved model itself.
From the Flow menu in the top navigation bar, click on Models & Agents.
Click + New Saved Model, then New External Model.
Choose your cloud ML platform.
Important
Although you’ll see MLflow as an option here, you should be aware of a key difference. This MLflow option refers to a model designed outside of Dataiku, packaged using the MLflow format, and imported into Dataiku. On the other hand, for the external models discussed here, the models remain deployed where they are, but are made available to Dataiku users — as you will see.
Fill in the basic model information:
Give a meaningful name to your model like sagemaker-churn.
The prediction type is a Two-class classification.
For the Authentication field, select the connection used in the previous step; sagemaker-ext in our case.
Select the Region where your endpoint is deployed; eu-west-1 in our case.
Give a meaningful name to your model like az-credit-fraud.
The prediction type is a Two-class classification.
For the Authentication field, select the connection used in the previous step; azureml-ext in our case.
The other fields (Subscription ID, Resource group, and Workspace) are specific to your setup.
Give a meaningful name to your model like vertex-petfinder.
The prediction type is a Two-class classification.
For the Authentication field, select the connection used in the previous step; vertex-ext in our case.
Select the Region where your endpoint is deployed; europe-west1 in our case.
Project name is specific to your setup.
Add a model version#
We now have an external model, but we need to create a model version that will contain all the details to query the endpoint.
Open the saved model, and click on + Add Model Version.
In the first section, click on Get Endpoints List to fetch a list of accessible endpoints.
Select your endpoint from the list. Alternatively, you can enter its ID directly.
Enter a simple version ID, such as v1.
Then complete the rest of the window with details specific to the case at hand:
Enter the classes: 0 and 1 in our case.
Leaving the default option to evaluate the model selected, enter the evaluation dataset, sagemaker_churn_prepared, to have performance metrics on the model directly in Dataiku.
Select Churn?_True. as the target column.
As the computation of explainability data is not very long in this sample, uncheck the Skip expensive reports box.
In this case, you can leave the default option of allowing Dataiku to guess the input/output format using the evaluation dataset.
Click Create to let Dataiku create and evaluate this model. The creation is fairly fast, but the evaluation can take some time.
Enter the classes: 0 and 1 in our case.
Leaving the default option to evaluate the model selected, enter the evaluation dataset, default_of_credit_card_clients_prepared, to have performance metrics on the model directly in Dataiku.
Select default payment next month as the target column.
As the computation of explainability data is not very long in this sample, uncheck the Skip expensive reports box.
In this case, you can leave the default option of allowing Dataiku to guess the input/output format using the evaluation dataset.
Click Create to let Dataiku create and evaluate this model. The creation is fairly fast, but the evaluation can take some time.
Enter the classes: Yes and No in our case.
Leaving the default option to evaluate the model selected, enter the evaluation dataset, petfinder-tabular-classification, to have performance metrics on the model directly in Dataiku.
Select Adopted as the target column.
As the computation of explainability data is not very long in this sample, uncheck the Skip expensive reports box.
Click Create to let Dataiku create and evaluate this model. The creation is fairly fast, but the evaluation can take some time.
Note
Unchecking the default option to Skip expensive reports allows Dataiku to perform the requests required to compute the explainability sections. Although this adds value to the model, it triggers thousands of requests.
This is acceptable for our purpose, but for production endpoints, the additional stress on the endpoint may be a factor to consider. If need be, these reports can also be computed afterward individually.
Manage external endpoint changes#
Before demonstrating how you can use an external model that has been surfaced in Dataiku, it is important to recognize that Dataiku does not control the endpoint behind this object. Accordingly, these endpoints can change without notice. This could be a technical change, such as the addition of memory to the underlying infrastructure, or a change in the exposed model, such as a new deployed version.
From the Saved Models page, open your saved model.
Open the report for your model version (such as v1).
In the Summary panel, find the following details about the external endpoint and the Check Endpoint button:
The Check Endpoint button verifies whether the endpoint data fetched and stored at creation time is still accurate. If there are differences, Dataiku will offer to create a new saved model version, as the stored performance and explainability data may no longer be correct.
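Conceptually, this kind of check compares a snapshot taken at creation time with the endpoint's current description. The sketch below illustrates the idea with plain dictionaries. It is not Dataiku's implementation, and the field names (`endpoint_name`, `model_version`, `instance_type`) and values are hypothetical.

```python
def diff_endpoint_metadata(stored, current):
    """Return {field: (stored_value, current_value)} for fields that differ."""
    changed = {}
    for key in sorted(set(stored) | set(current)):
        if stored.get(key) != current.get(key):
            changed[key] = (stored.get(key), current.get(key))
    return changed


# Hypothetical snapshot saved when the model version was created.
stored = {
    "endpoint_name": "churn-endpoint",
    "model_version": "1",
    "instance_type": "ml.m5.large",
}
# Hypothetical description fetched from the platform today.
current = {
    "endpoint_name": "churn-endpoint",
    "model_version": "2",
    "instance_type": "ml.m5.large",
}

changes = diff_endpoint_metadata(stored, current)
print(changes)
```

An empty result would mean the stored details still match; any entry signals that the metrics computed earlier may describe a different model.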
Use an external model in Dataiku#
Once you have a version of an external saved model in Dataiku, you can use it much like other visual models in Dataiku. In particular, we’ll highlight the following three benefits:
Model explainability#
To avoid black box models, Dataiku offers many built-in visualizations to understand the nature of a model. The same interface for investigating a model built in Dataiku can be used for investigating models deployed on a cloud ML platform.
From the report of your model version, browse the Explainability and Performance sections on the left that were computed during the evaluation process.
If you left the Skip expensive reports option checked when creating the model version, feel free to request their computation manually.
Note
The endpoint in the Azure ML tutorial model happens to be non-probabilistic (it only outputs 0 or 1), and so many of the reports cannot be computed.
Model comparison#
In order to determine a champion model from any number of challengers, you may be familiar with creating a model comparison in Dataiku. This feature enables you to compare metrics side by side for any combination of models built in Dataiku, MLflow models imported into Dataiku, and models deployed externally.
The starter project already includes a model comparison containing a simple AutoML model built in Dataiku. You just need to add the external model to the existing model comparison object.
From the Visual ML menu in the top navigation bar, click on Model Comparisons.
Click to open the SageMaker vs AutoML comparison.
Click Configure.
Click + Add Items.
Click Select Saved Model Versions from the Flow.
Select the model version of your external endpoint.
Click Add.
Click Apply.
From the Visual ML menu in the top navigation bar, click on Model Comparisons.
Click to open the Azure ML vs AutoML comparison.
Click Configure.
Click + Add Items.
Click Select Saved Model Versions from the Flow.
Select the model version of your external endpoint.
Click Add.
Click Apply.
From the Visual ML menu in the top navigation bar, click on Model Comparisons.
Click to open the Vertex AI vs AutoML comparison.
Click Configure.
Click + Add Items.
Click Select Saved Model Versions from the Flow.
Select the model version of your external endpoint.
Click Add.
Click Apply.
Model scoring#
If the external model is indeed your champion, you can use it in a Score recipe in a Dataiku Flow as you would for a visual model, whether manually or through a scenario, for ad hoc scoring or high-volume batch scoring.
From the Flow zone dedicated to your cloud ML platform, select the external model and its evaluation dataset.
In the Actions tab, select the Score recipe.
Click Create Recipe.
In the Settings tab of the Score recipe, notice the engine selection of External Model at the bottom left. Recall that although you have surfaced the model in Dataiku, it is only deployed externally.
Click Run to score the data using the external endpoint’s engine.
Additional considerations#
Endpoint scalability amidst increased democratization#
A key advantage of surfacing an external endpoint within Dataiku is the expanded audience of users who will be able to interact with the model. However, as with any democratization benefit, you’ll need to consider the corresponding impact on performance and volume on the endpoint.
The operations shown here generate queries to the external endpoint. This is especially true with explainability requests, which can easily reach into the thousands. To serve this new usage, you’ll need to plan for your endpoint to scale accordingly.
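A rough estimate can help size the endpoint before you open it to more users. The sketch below is a back-of-the-envelope calculation only: the function names are hypothetical, every number in the example is invented, and Dataiku's actual request pattern for explainability reports may differ.

```python
import math


def estimate_requests(n_rows, perturbations_per_row, rows_per_request):
    """Estimate how many endpoint calls are needed to score all perturbed rows."""
    total_rows = n_rows * perturbations_per_row
    return math.ceil(total_rows / rows_per_request)


def seconds_at_rate(n_requests, requests_per_second):
    """Time to serve the requests at a sustained throughput."""
    return n_requests / requests_per_second


# Hypothetical workload: 1,000 rows, 20 perturbed copies each, 10 rows per call.
n_requests = estimate_requests(1000, 20, 10)
duration = seconds_at_rate(n_requests, 50)  # assumes 50 requests per second
print(n_requests, duration)
```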
End user discovery and import of external models#
Although using external models in Dataiku is straightforward, actually surfacing them requires some knowledge of the underlying model and, potentially, administration rights to Dataiku and the cloud ML platform.
To simplify this process for end users, administrators might consider creating a dedicated Dataiku project that serves as a home for all external models and their corresponding sample datasets.
By making these models shareable and granting permissions to this project, a wider audience of Dataiku users can search for and add the desired external model into their own projects.
Deployment to a production environment#
This tutorial only demonstrated the usage of an external model in a Design sandbox. However, external models can also be surfaced in a scheduled, operationalized project. Deploying such a project to an Automation node is perfectly workable, provided the required connection also exists on the Automation node.
What’s next?#
Congratulations! You have taken an endpoint deployed on your cloud ML platform, and surfaced it as an external model within Dataiku so that you, as well as a wider audience, can use it for key functions like model explainability, comparisons, and scoring.
Note
You can learn more about external models in the reference documentation.
You can also learn more about other MLOps integration points.