Concept | Dataiku project walkthrough
Read about the role of the AI lifecycle in developing a Dataiku project.
Most courses in the Dataiku Academy take a hands-on approach to teach you how to become a proficient Dataiku user. This course, however, takes a broader approach. Rather than teaching any specific features of Dataiku, this course examines the AI lifecycle through a sample Dataiku project.
When developing a Dataiku project, it is often beneficial to speak in terms of an analytics framework or methodology, such as CRISP-DM or ASUM-DM. A Dataiku project can be envisioned through the AI Lifecycle diagram below.
As the diagram below illustrates, this framework comprises a series of iterative cycles spanning five stages: Question, Discover, Experiment, Deploy, and Operationalize. These stages are executed with the Design, Automation, and API nodes working together.
In our example, the question is simple: how can we predict taxi fares in New York City, given a starting location, a destination, and a particular time of day?
With this question in mind, the AI lifecycle moves to the Discovery stage. Users begin in the Design node of Dataiku to explore and transform the necessary data, providing valuable insights into the question at hand.
Informed by these insights, in the next stage, users build and assess the performance of machine learning models. This Experiment stage is also accomplished in the Design node.
Having chosen a model from the Experiment stage, it is time to define the API service used to deploy the model to the API node. Finally, users can operationalize the model in the Automation node.
Of course, Dataiku supports many different paths to realizing Everyday AI. Although the AI lifecycle in Dataiku always begins in the Design node, an enterprise’s particular use case, project objectives, and infrastructure choices may determine how the Automation and/or API nodes are utilized.
With a clear business question to investigate, let’s take a closer look at the following stages, beginning with Discovery.
The Discovery stage includes three important phases, the first of which is Data Acquisition. The goal of the Data Acquisition phase is to establish connections to data sources while managing the size of the raw data and the speed at which it changes.
This section addresses:
The nature of datasets in Dataiku
How everyone can connect to existing infrastructure through the visual interface of Dataiku
How coders can develop custom connectors through a plugin system for the use and benefit of everyone in an organization
How Dataiku detects the format and schema of a dataset
Datasets in Dataiku
Before connecting to any data, it is first important to understand how Dataiku defines a dataset. In Dataiku, a dataset is any piece of data in a tabular format. A CSV file, an Excel sheet, or an SQL table are just a few examples of possible datasets in Dataiku.
Generally, creating a dataset in Dataiku means that the user merely informs Dataiku how it can access the data. Dataiku remembers the location of the original external or source datasets. The data is not copied into Dataiku. Rather, the dataset in Dataiku is a view of the data in the original system. Only a sample of the data, as configured by the user, is transferred via the browser.
Furthermore, Dataiku can partition datasets. Users can instruct Dataiku to split a dataset along a meaningful dimension, such as date or a discrete value, so that each partition contains a subset of the complete dataset. This feature can be useful when working with large datasets or analyzing distinct subgroups.
Hovering over a dataset in the Flow shows its storage system, tags, description, and more.
Connecting to existing infrastructure
Dataiku allows users to natively connect to more than 25 data storage systems, through a visual interface or code. Possibilities include traditional relational databases, Hadoop and Spark supported distributions, NoSQL sources, and cloud object storage.
This means that, when taking advantage of the push-down computation strategy, Dataiku itself imposes no memory-based limits on dataset size. A dataset in Dataiku can be as large as its underlying storage system allows.
The page for introducing a new dataset to a project depends upon which storage systems have been connected to the instance. In our example, it includes data sources made available through plugins, such as Freshdesk and Twitter.
According to the visual grammar of Dataiku, blue squares represent datasets, and the icon represents the storage system. A single visual grammar also makes it easy to remap connections to new sources when the underlying infrastructure changes.
In the Flow of the NY Taxi Fares project, we see an up arrow for an UploadedFiles dataset, two cubes for a dataset in Amazon S3, and an elephant for HDFS.
Extending existing connectivity
The list of available out-of-the-box data storage connections is always expanding. However, a core principle of Dataiku is its extensibility. Dataiku plugins or an enterprise’s own Python or R scripts can be used to create custom visual connectors for any APIs, databases, or file-based formats. These can easily be shared within a team or the wider community.
The Dataiku Plugin Store includes connections for sources such as Tableau, Salesforce, Microsoft Power BI, Freshdesk, and Airtable.
You can find a wide range of additional functionality, including custom connectors, in the Plugin Store.
Format & schema detection
When connecting to a data source, Dataiku automatically infers both the file format and the schema (the list of column names and types). Detailed settings for reading the data source and important properties, such as the dataset’s lineage, can be previewed and adjusted in a simple interface.
Dataiku automatically detects the storage type and meaning of all columns in a dataset. These settings can be adjusted from the Schema tab.
Each column in Dataiku has both a storage type and a high-level meaning:
The storage type indicates how the dataset backend should store the column data. Examples include string, integer, float, boolean, and date.
The meaning, on the other hand, is a rich, semantic type. Meanings have a high-level definition such as IP address, country code, or URL. They are particularly useful for suggesting relevant transformations and validating rows that may not match a given meaning. Users can also define their own meanings for items like internal department codes or Likert scales.
The difference between type and meaning can be seen in a dataset’s Explore tab, introduced just below. Under the column’s name in bold is first the storage type and then, in blue, the meaning predicted by Dataiku. A validation gauge also shows, in green, the proportion of rows that satisfy the predicted meaning.
In our example, Dataiku has assigned a meaning of Integer to “Contact Phone Number”, but has stored it as a string because some rows do not match this expectation. Those that do not are highlighted in red.
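Conceptually, a meaning pairs a semantic label with a validation rule, and the validation gauge is simply the fraction of sampled rows that pass the rule. The sketch below is our own simplified analogy using regular expressions, not Dataiku's implementation; the meaning names and sample values are invented for illustration.

```python
import re

# Hypothetical sketch: a "meaning" maps a semantic label to a validation
# rule, loosely analogous to Dataiku's semantic types.
MEANINGS = {
    "Integer": re.compile(r"^-?\d+$"),
    "IPAddress": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}

def validation_gauge(values, meaning):
    """Return the fraction of values that satisfy the given meaning."""
    pattern = MEANINGS[meaning]
    matches = sum(1 for v in values if pattern.match(v))
    return matches / len(values)

# Like the "Contact Phone Number" example: some rows fail the Integer rule.
phone_numbers = ["5551234567", "555-123-4567", "5559876543", "N/A"]
print(validation_gauge(phone_numbers, "Integer"))  # 0.5
```

Rows that fail the rule would be the ones a tool like Dataiku highlights in red.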
The goal of the Data Exploration phase is to investigate data with iterative visualizations and statistical summaries.
Dataiku allows users to quickly explore data through a visual UI or code, depending on their preference. As explained in the following lessons, this can be done from within the project’s Flow or in a separate workspace for experimentation, known as the Lab.
The Explore tab
For any dataset, the Explore tab shows a tabular representation that will appear familiar to any spreadsheet user. However, as this is an environment built for big data, the Explore tab shows only a sample of the data. The user has full control over the design of the sample. Working with just a sample in the browser allows for quicker exploration of large datasets.
The default sample in the Explore tab contains the first ten thousand records of a dataset.
Within the Explore tab, users have access to tools like Analyze. Users need only click on a column to produce a quick summary of the column’s distribution, statistics, and outliers.
When exploring the raw training data, using the Analyze tool on the fare amount column shows the distribution and key metrics that can guide what data cleaning might be necessary. These statistics can be calculated on the sample or the whole dataset.
The Charts tab
The ability to quickly iterate on data visualizations and share results with colleagues is a key part of any data exploration.
The Charts tab contains a drag and drop interface, allowing users to produce quick visualizations of the data in the sample. Default chart options include standard types like histograms and scatter plots, but also geographic maps.
It is also easy to save any charts as insights or publish them to a webapp or dashboard through the visual UI.
By just dragging two variables in place, we have created an interactive Leaflet map of taxi dropoff locations in New York City colored by fare amounts.
Although it is possible to experiment directly in the project Flow, it is often helpful to have a complementary workspace to iterate between data preparation, visualization, and machine learning. The Lab is this space within Dataiku.
Keeping experimentation in the Lab helps avoid overcrowding the Flow with unnecessary items that will not be used in production.
As with many aspects of Dataiku, the choice between visual tools and code is left to the user. In the Lab, users can create a Visual Analysis or start a Code Notebook. In either case, work in the Lab can be deployed to the Flow when ready.
A Visual Analysis will appear similar to the Prepare recipe (discussed below) found in the Flow. However, a Visual Analysis is a pure “Lab” object, which does not have persistent output data. As such, columns in an analysis do not have a notion of storage types (since they are not stored). When a script is ready to be used in the Flow, it can easily be deployed as a Prepare recipe.
This Visual Analysis allows for experimenting on the parking garage data using any transformation in the Processors library.
Even for coders, the visual data exploration tools in Dataiku will often be a time saver. However, for those who would prefer to explore their data via the programming language of their choice, Dataiku offers interactive code notebooks. Depending on the storage of the dataset, these could be Python, R, SQL, Scala, Hive or Impala.
Here we have a Jupyter notebook exploring how to import clustered NYC boroughs in order to enrich the pickup and dropoff location data. We can use the notebook in an exploratory way or, as done in this project, save it back to a recipe so it can be deployed in the Flow.
Moreover, although Jupyter notebooks are integrated into the Dataiku interface, Dataiku offers integrations with several popular IDEs (integrated development environments), including PyCharm, Sublime Text, RStudio, and VS Code. After configuring a connection between the IDE and their Dataiku instance, these integrations allow developers to pull code from existing Dataiku recipes and plugins into their IDE. After editing the code in their IDE, they can save it back to the recipe or plugin.
The goal of the Data Preparation phase is to wrangle and enrich data as input for model building.
Regardless of the data’s underlying storage system, data preparation in Dataiku is accomplished in the same manner: by running recipes.
A Dataiku recipe is a repeatable set of actions to perform on one or more input datasets, resulting in one or more output datasets. As with many functions throughout Dataiku, recipes can be created with a visual UI or with code. The name “recipe” underscores that it is a series of steps designed to be repeated.
The following lessons focus on the different types of recipes (visual, code, and plugin) and how these recipes execute jobs.
Dataiku provides a standard set of visual recipes that provide a simple UI for accomplishing the most common data transformations. Filtering, sorting, splitting and joining datasets are just some of the operations that can be performed with a few clicks.
According to the visual grammar of Dataiku, yellow circles signify a visual recipe, where the center icon represents the particular transformation. For example, the icon of the Prepare recipe is a broom.
Visual recipes highlighted in this Flow include a Sample/Filter recipe to sample the raw training data, a Stack recipe to perform a union of the training sample and testing data, and a Prepare recipe to perform cleaning and enrichment steps.
Moreover, unlike many visual tools such as spreadsheets, visual recipes in Dataiku provide a recorded history of actions that can be repeated, copied, or edited at any time. The idea is to iteratively construct a workflow that can be run as needed, while keeping the original input dataset separate from, and unmodified by, the output.
The Prepare recipe handles the bulk of the data cleaning and feature generation with the help of the processors library. Over 100 built-in visual processors make code-free data wrangling possible. These processors can be used to accomplish a huge range of tasks like rounding numbers, extracting regular expressions, concatenating or splitting columns, and much more. Users can also write Formulas in an Excel-like language for more customized operations. An alternative option is to write a custom Python function as a step within the Prepare recipe.
This Prepare recipe consists of 8 steps, including creating geo-points from geographic coordinates, removing rows under certain conditions, and parsing dates.
Dataiku will even automatically suggest transformations to a column based on its predicted meaning.
Here Dataiku suggests using the Extract date components processor because it recognizes the column meaning as a Date.
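As a rough illustration of the custom Python function option, here is a minimal sketch of a row-oriented step: a `process(row)` function that receives each row as a dict of column names to values and returns the transformed row. The column names and the distance threshold are invented for this example; check the Prepare recipe documentation for the exact modes and signatures supported.

```python
# Sketch of a custom Python function step inside a Prepare recipe
# (row mode): the function is called once per row, and the returned
# dict becomes the output row. Columns below are illustrative.
def process(row):
    # Round the fare to cents and flag long trips with a made-up
    # 20 km threshold (both column names are assumptions).
    row["fare_amount"] = round(float(row["fare_amount"]), 2)
    row["long_trip"] = float(row["distance_km"]) > 20.0
    return row

print(process({"fare_amount": "12.456", "distance_km": "25.3"}))
```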
In cases where the need is more customized than what a visual recipe can provide, or just according to the user’s preference, Dataiku allows users to code any recipe in a variety of languages. This code can be written in a Jupyter notebook or using any of the IDE integrations.
As shown above, code notebooks can be used for interactive design or debugging, but can be saved and deployed to the Flow as a recipe when ready for production. In the Flow, orange circles with the icon of the programming language represent code recipes.
The project uses a Python recipe to compute driving times for every cluster combination. From an input of geographic coordinates of pickup and dropoff locations, this Python recipe produces an output including distance and travel time for every cluster combination.
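A Python recipe like this might rely on a distance helper such as the haversine sketch below. Actual driving times would come from a routing engine, so this straight-line great-circle distance is only an illustrative stand-in, and the coordinates are approximate.

```python
import math

# Sketch of the kind of computation a Python recipe might perform:
# great-circle (haversine) distance between pickup and dropoff points.
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Roughly Midtown Manhattan to JFK Airport
print(round(haversine_km(40.7549, -73.9840, 40.6413, -73.7781), 1))
```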
In addition to the standard library of visual recipes, users can also code their own visual recipes by creating a plugin. This option allows users to create reusable components that wrap additional functionality into a visual UI, thereby extending the capabilities of Dataiku to a wider audience. Users can also search the Plugin store for existing plugins shared by the community. Red circles in the Flow represent plugin recipes.
The three red circles to the right belong to the Forecast plugin. In three visual steps, it covers the full forecasting cycle of data cleaning, model training, and evaluation and prediction.
Actions in Dataiku, such as running a recipe (whether visual, code, or plugin) or training a model, generate a job. The flexible computation environment of Dataiku grants users control over how a job is executed.
Wherever possible, Dataiku pushes down the computation to the underlying location of the data. However, in addition to choosing the execution engine for a particular job, users can also control the composition of tasks within a job. This is especially important when managing long data pipelines, computationally-expensive operations, and/or the continuous arrival of new data.
Accordingly, before executing any job, users can choose between a range of different build strategies. For example, a non-recursive build runs only the current recipe. A “smart” rebuild only rebuilds out-of-date datasets. The most computationally-intensive option rebuilds all dependent datasets. Users can also prevent datasets from being rebuilt by write-protecting them.
Before running this recipe, we can choose whether to run only this recipe or to recursively build dependent datasets that may be out of date.
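To make the build strategies concrete, here is a toy sketch (our own, not Dataiku's actual implementation) of a "smart" rebuild over a three-dataset pipeline: only datasets that are stale, because an input is newer or an upstream dataset is itself stale, get rebuilt. The dataset names and timestamps are invented.

```python
# Toy sketch of a "smart" rebuild on a linear pipeline
# raw -> prepared -> scored, using fake build timestamps.
DEPS = {"raw": [], "prepared": ["raw"], "scored": ["prepared"]}  # dataset -> inputs
BUILT_AT = {"raw": 3, "prepared": 1, "scored": 2}                # last-build times

def out_of_date(ds):
    """A dataset is stale if any input is newer than it, or itself stale."""
    return any(out_of_date(dep) or BUILT_AT[dep] > BUILT_AT[ds]
               for dep in DEPS[ds])

def smart_build_plan(target):
    """List the datasets to rebuild, upstream first, skipping fresh ones."""
    plan = []
    for dep in DEPS[target]:
        plan += smart_build_plan(dep)
    if out_of_date(target):
        plan.append(target)
    return plan

# "raw" was rebuilt most recently, so only its downstream datasets are stale.
print(smart_build_plan("scored"))  # ['prepared', 'scored']
```

A non-recursive build would run only the final recipe, while a forced recursive build would rebuild every dataset in the chain regardless of timestamps.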
The Jobs menu shows the status of all current and past jobs, including log files. Users can observe how long each activity in a job took to complete. When trying to optimize a Flow for production, this may be a useful place to start, as it can help identify bottlenecks and direct attention to parts of the Flow that could be refactored.
This current job in progress is rebuilding the final output dataset in the project. In this case, it needs to complete eight activities in order to do so.
Having sufficiently explored and prepared the taxi fare data, the next stage of the AI lifecycle is to experiment with machine learning models.
This experimentation stage encompasses two key phases: model building and model assessment.
Model building: Users have full control over the choice and design of a model — its features, algorithms, hyperparameters and more.
Model assessment: Tools such as visualizations and statistical summaries allow users to compare model performance.
These two phases work in tandem to realize the idea of Responsible AI. Either through a visual interface or code, building models with Dataiku can be transparently done in an automated fashion. At the same time, the model assessment tools provide a window into ensuring the model is not a black box.
Dataiku is fully equipped to build models for both supervised (prediction) and unsupervised (clustering) learning tasks. After deciding on the type of task, users can decide if they wish to enter Expert mode and write a fully customized model or enter the Automated Machine Learning mode, where a visual interface guides model design.
While in the Lab, after choosing a prediction task targeting the “fare_amount” variable, Dataiku presents the option of Automated Machine Learning or Expert Mode.
When building a visual model, users can choose a template instructing Dataiku to prioritize considerations like speed, performance, and interpretability. Having decided on the basic type of machine learning task, users retain full freedom to adjust the default settings chosen by Dataiku before training any models. These options include the split of the train and test set, the metric for which to optimize, what features to include, and what algorithms should be tested.
While trying to predict fares, the visual UI provides full control over what features to consider and how the algorithms should handle those features, including strategies for missing values and rescaling. Here, we can confirm that distance in kilometers is designated as a numeric input variable.
The automated machine learning capabilities allow users to train dozens of algorithms using a visual interface, while still leveraging state-of-the-art open source machine learning libraries, such as Scikit-Learn, MLlib, and XGBoost. In addition to these standard algorithms, users can also import custom Python models.
Here we are ready to train models using algorithms such as Random Forest and XGBoost. Users can control the parameters for each of these algorithms through the visual UI.
Furthermore, when training a prediction model on partitioned data, Dataiku is able to build partitioned (or stratified) models. Such a model is trained separately on each data partition, and so partition-level results can be compared against the overall model metrics. Partitioned models can lead to better predictions when relevant predictors for a target variable vary widely across partitions.
Here we have imported the LightGBM model into Dataiku. Originally developed by Microsoft, this algorithm performed slightly better than Random Forest and XGBoost in our example project.
After having trained as many models as desired, Dataiku offers tools for full training management to track and compare model performance across different algorithms. Dataiku also makes it easy to update models as new data becomes available and to monitor performance across sessions over time.
In the Result pane of any machine learning task, Dataiku provides a single interface to compare performance in terms of sessions or models, making it easy to find the best performing model in terms of the chosen metric.
Here we can compare the performance of various models and see at a glance how they differ in terms of the most important variables. In this case, LightGBM, an imported model built in Python, had the lowest Root Mean Square Error (RMSE).
Just clicking on any model produces a full report of tables and visualizations of performance against a range of different possible metrics.
Here we can review the report of the LightGBM model. Each tab in the left panel provides a different insight into the model’s interpretation, performance, or other information.
After experimenting with a range of models built on historic training data, the next stage of the AI lifecycle is to deploy a chosen model to score new, unseen records. The same flexibility Dataiku brings to model building extends to model deployment. Whether batch or real-time scoring is the right option for a particular use case, Dataiku has robust tools in place.
For many AI applications, batch scoring, where new data is collected over some period of time before being passed to the model, is the most effective scoring pattern. To score a dataset full of records in batch, a user deploys a model to the Flow, attaches a Score recipe to the dataset to be scored, and selects which model to use.
Deploying a model creates a “saved” model in the Flow, together with its lineage. A saved model is the output of a Training recipe which takes as input the original training data used while designing the model.
The color green in the Flow represents machine learning processes. The first green circle is the Training recipe. The diamond is the model, which is the output of the Training recipe. The green circle to the right is the Score recipe, which is used to generate predictions for the unseen “kaggle_submission” dataset.
Models aim to capture patterns in a moving and complex world. Accordingly, they must be monitored and re-trained over time in order to prevent model drift. A saved model can be retrained directly from the Flow as new data becomes available.
Despite frequent model re-training sessions over the course of a project’s lifetime, Dataiku makes it clear what version of the model is active at any given time and makes it easy to fall back to a previously-deployed version when necessary.
In the NY Taxi Fares project, we have deployed five versions of the LightGBM model, but it’s clear which is the active version.
For use cases with a real-time need, batch scoring may be an insufficient answer. To score records as they arrive in real-time, an infrastructure dedicated to handling the scoring requests is required. Dataiku API nodes can be installed on static servers or on a scalable Kubernetes cluster.
In the Design node, users can create an API service with a chosen model as its endpoint. Once this service is pushed to the API Deployer, it can be deployed to (possibly many) API nodes. Using the API Deployer, one can monitor the performance of the model for a given stage of deployment.
From the API Deployer, we can examine different versions of the API and the endpoints we have created. In our example, we are using a static infrastructure defined over the Dataiku static API node. We also have the ability to run test queries against endpoints. For each of these five test queries, we receive different fare predictions.
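A test query like these is ultimately just an HTTP request against the API node. The sketch below only constructs the URL and JSON body for a prediction call; the host, service and endpoint names, and feature columns are illustrative assumptions, and the exact path should be checked against the API node's REST documentation.

```python
import json

# Hypothetical sketch of a real-time scoring request to an API node.
# All names below (host, service, endpoint, features) are assumptions.
def build_predict_request(host, service, endpoint, features):
    """Build the URL and JSON body for a single-record prediction query."""
    url = f"{host}/public/api/v1/{service}/{endpoint}/predict"
    body = json.dumps({"features": features})
    return url, body

url, body = build_predict_request(
    "https://apinode.example.com:12000",
    "taxi_fares", "fare_prediction",
    {"pickup_lat": 40.75, "pickup_lon": -73.99,
     "dropoff_lat": 40.64, "dropoff_lon": -73.78, "hour": 17},
)
print(url)
print(body)
```

In practice the body would be sent as an HTTP POST, and the response would contain the fare prediction for that record.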
For an enterprise looking to scale, the ability to deploy a model into production is not enough. Realizing the full potential of a model requires orchestration — a repeatable, efficient process for creating and effectively deploying models into production.
In the case of the NY Taxi Fares project, we have deployed a model for real-time scoring, but managing the model lifecycle remains a manual process. How do we monitor its progress over time? When should the model be re-trained? When should a new version be deployed?
To effectively manage the process of deploying hundreds of models from development to testing to production environments, automation becomes a clear necessity.
Dataiku lets users establish validation feedback loops in order to automate the updating, monitoring, and quality control of a project’s Flow by continually pushing work from the Design node to the Automation node.
This stage of the AI lifecycle encompasses:
Scenarios to automate key processes like rebuilding, retraining, and deployment
Metrics and Checks to monitor model performance in production
Deploying projects to the Automation node
In addition to deploying models into production, another key aspect of orchestration is the automation of reporting. In order to make it easy to communicate real-time results, Dataiku provides users with drag-and-drop tools, such as dashboards, or code options, such as webapps and R Markdown reports.
In Dataiku, the scenario is the place to begin automating tasks, such as rebuilding datasets, retraining models, or redeploying application bundles.
A scenario has two required components:
A trigger that activates a scenario and causes it to run
The steps, or actions, that a scenario takes when it runs
As in many other places in Dataiku, scenarios are a task that can be completed through a visual interface or entirely customized with code.
Triggers are typically time-bound (for example, run a scenario every day at a particular hour) or dataset-bound (run a scenario whenever a certain dataset is modified).
The “Update Parking List” scenario will run the job described in the Steps tab whenever its trigger is activated — in this case, whenever the NYC_Point_Of_Interest dataset is modified.
Dataiku includes a library of predefined steps, such as those for retraining a model, computing certain metrics, creating an export, or updating variables. At the same time, coders can devise their own custom scenarios, executing SQL or Python code.
The “Update Parking List” scenario consists of three predefined steps: rebuilding a particular dataset, executing a macro, and building an export.
In order to keep all stakeholders informed of scenario activity, users can also attach reporters. Scenario reporters can be executed before, during or after the completion of a scenario run. Messages can be sent via email or integrated with services like Hipchat, Twilio, Webhook, Microsoft Teams, or Slack (as done for the “Update Parking List” scenario above). For greater customization, such as pulling less common variables from the Dataiku instance, a custom Python script can be used to generate messages.
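Conceptually, a scenario pairs a trigger predicate with an ordered list of steps. The toy sketch below is our own simplification, not Dataiku's internals: a dataset-bound trigger fires when the dataset has changed since the last run, and only then are the steps executed in order. The step labels loosely mirror the "Update Parking List" example.

```python
# Toy sketch of the two required scenario components:
# a trigger and the steps it causes to run.
def dataset_modified_trigger(last_run, dataset_modified_at):
    """Dataset-bound trigger: fire if the dataset changed since the last run."""
    return dataset_modified_at > last_run

# Stand-ins for predefined steps (rebuild, macro, export)
steps = [
    lambda: "rebuild dataset",
    lambda: "execute macro",
    lambda: "build export",
]

def run_scenario(last_run, dataset_modified_at):
    """Run every step in order, but only if the trigger fires."""
    if not dataset_modified_trigger(last_run, dataset_modified_at):
        return []
    return [step() for step in steps]

print(run_scenario(last_run=100, dataset_modified_at=150))
```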
Monitoring with metrics and checks
Having established scenarios to monitor the status of a project, it becomes important to be able to track patterns of success and failure over time. Dataiku provides a Monitoring menu, where users can examine which scenarios are currently running and the performance of past runs.
In addition to this dashboard, Dataiku provides two important monitoring tools, metrics and checks:
Metrics provide a way to compute various measurements on objects in the Flow, such as the number of records in a dataset or the time to train a model.
Checks allow users to establish conditions for monitoring metrics. For example, users can define a check that verifies that the number of records in a dataset never falls to zero. If the check condition is no longer true, the check will fail, and the scenario will fail too, triggering alerts.
Metrics for a dataset can be found on the Status tab. Common examples include the size of a dataset, the number of records, or basic statistics of a certain column. Metrics are automatically logged, which makes it easy to track the evolution of the status of a dataset.
In this example, we have chosen to track metrics that are important to the project, such as the dataset’s last build date.
Once certain metrics have been identified, users can establish checks on those metrics. A check attaches hard or soft minimum or maximum values to a key metric: crossing a soft bound triggers a warning, while crossing a hard bound triggers a failure.
For example, if the performance of a model is established as a metric, a check can be created to ensure a model is not put into production if it falls below a certain hard minimum. Similarly, the same tools can be used to perform data health checks, ensuring that the data meets certain standards before being used in reports or modeling tasks.
When creating a new Check, we can use a simple UI to monitor if a chosen metric is in a certain numeric range or in a particular set of values. Alternatively, we can entirely customize the Check with Python code.
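The hard/soft bound behavior can be sketched as a small function (our own simplification, with invented thresholds): a soft violation produces a warning, while a hard violation produces an error that would fail the scenario running the check.

```python
# Sketch of a check with soft and hard minimum bounds.
def check_metric(value, soft_min=None, hard_min=None):
    """Classify a metric value against optional soft/hard minimums."""
    if hard_min is not None and value < hard_min:
        return "ERROR"    # would fail the scenario
    if soft_min is not None and value < soft_min:
        return "WARNING"  # scenario continues, but alerts are raised
    return "OK"

# e.g. a model performance score checked before deployment
print(check_metric(0.82, soft_min=0.85, hard_min=0.70))  # WARNING
print(check_metric(0.65, soft_min=0.85, hard_min=0.70))  # ERROR
print(check_metric(0.90, soft_min=0.85, hard_min=0.70))  # OK
```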
Having created scenarios, metrics and checks in the Design node, one final step in the AI lifecycle of this project is to deploy the project to a production environment, in this case, the Automation node.
While the Design node is a development environment for testing new analyses, the Automation node is an isolated production environment for operational projects.
Starting from the Design node, projects are deployed by creating bundles, or complete snapshots of the project. A bundle includes the project configuration and, possibly, selected data, folders, and models. These bundles are then uploaded to the Automation node. Once a bundle is activated there, scenarios created in the Design node can be turned on, and production data can be used as inputs in the Flow.
With a project deployed in a separate production environment, it is essential to be able to monitor performance. Dataiku makes it easy to maintain version control over bundles deployed in production through a simple UI. Whenever updates are made in the Design node, a new project bundle can be uploaded and activated on the Automation node. Moreover, the bundle currently in production can easily be rolled back to a previous version whenever necessary so that end users never experience failure.
In order to upload this project to the Automation node, we first need to create a bundle, including the saved model.
A key part of any analytics process is communicating results. In addition to scenario reporters providing notifications about the status of automated tasks, users can create insights and dashboards, webapps, and R markdown reports to communicate progress with diverse groups of stakeholders.
Insights & dashboards
Any Dataiku project can have an arbitrary number of dashboards. Dashboards consist of slides, which themselves are made of tiles. These tiles can hold insights. Insights can include any Dataiku object, such as a chart, dataset, webapp, model reports, metrics, and more.
Thanks to the groups-based permissions framework, dashboards are also particularly useful for communicating end results to users who may not have full access to a project and, perhaps, the sensitive data it may contain. This provides a path for enterprises to enforce robust data governance policies without hampering collaboration.
This project dashboard displays network analysis visualizations, including maps of dropoff locations and plots of ride revenue forecast.
R Markdown reports
Given that Dataiku allows users to code in languages like Python and R, it should not be surprising that R Markdown reports are another tool at the disposal of users. The familiar collaboration layer of Dataiku, however, remains on top, providing users with tools like tags, to-do lists, and discussions.