Dataiku DSS is a collaborative, end-to-end data science and machine learning platform that brings data analysts, data scientists, data engineers, architects, and business users together in a common space to deliver business insights faster.
In Dataiku DSS, connecting to data, ingesting it, and preparing it for tasks such as machine learning all happen in one place. In this one-hour Quick Start, you'll work with flight and airport data to discover ways to connect to data, cleanse it using both code and visual recipes, and set up automated metrics and checks.
To follow along or reproduce the tutorial steps, you will need access to the following:
Dataiku DSS - version 9.0 or above
An SQL connection, such as Snowflake or PostgreSQL
If you do not already have your own instance of Dataiku DSS with an SQL connection, you can start a free Dataiku Online Trial from Snowflake Partner Connect. This trial gives you access to an instance of Dataiku Online with a Snowflake connection.
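One of the things the SQL connection enables is SQL-based cleansing: a SQL recipe in Dataiku DSS runs a query against the connection and writes the result to an output dataset. As a rough, standalone illustration of that idea (using Python's built-in sqlite3 in place of Snowflake or PostgreSQL, with hypothetical table and column names):

```python
import sqlite3

# In-memory stand-in for a SQL connection such as Snowflake or
# PostgreSQL; the flights table and its columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (flight_id INTEGER, dep_delay INTEGER, origin TEXT)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, 12, "JFK"), (2, None, "LAX"), (3, -5, "ORD")],
)

# A SQL recipe is, at its core, a query like this one, which drops
# rows with a missing departure delay.
cleansed = conn.execute(
    "SELECT flight_id, dep_delay, origin FROM flights "
    "WHERE dep_delay IS NOT NULL"
).fetchall()
print(cleansed)
```

In the tutorial itself you will write queries like this inside a recipe rather than in a script, and Dataiku DSS handles running them in-database.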
For each section of this quick start, the written instructions appear as bullet points; follow them, using the screenshots as a guide. We also suggest keeping these instructions open in one browser tab and your Dataiku DSS instance open in another.
You can find a read-only completed version of the final project in the public gallery.
Open the Snowflake Partner Connect Instance (Optional)
If you started a Dataiku Online Trial from Snowflake Partner Connect, the first step is to go to your launchpad, where you'll find your Snowflake Partner Connect instance. If you are using your own instance of Dataiku DSS, skip this part and go to Create the Project.
Click Open Dataiku DSS.
Dataiku DSS displays the homepage where you can see two projects.
Dataiku DSS can run on-premises or in the cloud.
Create the Project
Dataiku DSS projects are the central place for all of a team's work and collaboration. Each Dataiku project has a visual Flow, including the pipeline of datasets and recipes associated with the project.
This tutorial uses a fictitious project dedicated to predicting flight delays. In it, you will look for data quality issues, then set up metrics and checks that tell Dataiku DSS to take certain actions if those issues find their way into the Flow again when new data is added or the datasets are refreshed.
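To make the metric/check distinction concrete before you build them in the UI: a metric is a measurement taken on a dataset (for example, record count or the number of missing values in a column), and a check is a rule on a metric that reports OK or ERROR so automation can react. A minimal stdlib sketch of the idea, with purely illustrative names (this is not the Dataiku API):

```python
# Toy dataset standing in for the flights data.
flights = [
    {"flight_id": 1, "dep_delay": 12},
    {"flight_id": 2, "dep_delay": None},
    {"flight_id": 3, "dep_delay": -5},
]

# Metrics: measurements recomputed each time the dataset is rebuilt.
def record_count(rows):
    return len(rows)

def missing_delays(rows):
    return sum(1 for r in rows if r["dep_delay"] is None)

# Check: a rule on a metric that yields OK / ERROR, so an automated
# scenario can warn or stop when new data introduces quality issues.
def check_no_missing_delays(rows):
    return "OK" if missing_delays(rows) == 0 else "ERROR"

print(record_count(flights), check_no_missing_delays(flights))
```

In Dataiku DSS you configure metrics and checks on a dataset's Status tab rather than writing functions like these, but the logic they encode is the same.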
From the Dataiku homepage, click +New Project.
Select DSS Tutorials from the list.
Click on the Quick Start section and select Data Engineer Quick Start (Tutorial).
Dismiss any warning messages by clicking OK. The imported project was created on a design node with specific plugins installed; these plugins are not needed to complete this tutorial.
DSS opens the Summary tab of the project, also known as the project homepage.
About the Visual Flow
From the top navigation bar, click the Flow icon to go to the Flow.
The project's Flow is organized into two Flow Zones: one for ingesting and checking the data, and another for building the machine learning model pipeline. In this tutorial, we will cleanse the data used by the machine learning model, but we will not build the model itself.
The Flow is composed of data pipeline elements, including datasets and recipes. Recipes in Dataiku DSS prepare and transform datasets; their icons are easy to spot in the Flow because they are round, whereas dataset icons are square.
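Conceptually, the Flow is a directed pipeline: datasets are the nodes and recipes are the transformations between them, each recipe reading one or more input datasets and writing an output dataset. A toy sketch of that dataset → recipe → dataset chaining, with entirely hypothetical names:

```python
# Toy model of a Flow: datasets are plain data, recipes are functions
# that take an input dataset and produce an output dataset.
raw_flights = [
    {"flight_id": 1, "origin": " jfk "},
    {"flight_id": 2, "origin": "LAX"},
]

def prepare_recipe(rows):
    # A prepare-style recipe: trim and uppercase the origin column.
    return [dict(r, origin=r["origin"].strip().upper()) for r in rows]

def filter_recipe(rows):
    # A filter-style recipe: keep only JFK departures.
    return [r for r in rows if r["origin"] == "JFK"]

# Chaining recipes mirrors the dataset -> recipe -> dataset Flow.
flights_prepared = prepare_recipe(raw_flights)
flights_jfk = filter_recipe(flights_prepared)
print(flights_jfk)
```

In the actual Flow, each intermediate result is a persisted dataset you can inspect, and Dataiku DSS tracks the dependencies so downstream datasets can be rebuilt when upstream data changes.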