Quick Start | Alteryx to Dataiku#
Get started#
If you have experience working in a data analytics platform such as Alteryx, the prospect of migrating to a cloud-native platform like Dataiku may be exciting, but also somewhat daunting.
With that transition in mind, let’s take a quick tour of a Dataiku project to introduce some of the main conceptual differences that will help you make the leap!
Objectives#
In this quick start, you’ll:
Review a completed Dataiku project from the perspective of a user with an Alteryx background.
Use this experience to understand the high-level differences between working as an analyst in Alteryx vs. Dataiku.
Prerequisites#
For this tutorial, you won’t actually need a Dataiku account of your own. All you need is internet access and a supported web browser (Google Chrome, Mozilla Firefox, or Microsoft Edge).
View the project#
In Alteryx, you would start by creating a new workflow. In Dataiku, you start with a new project.
Before you begin creating your own Dataiku projects, let’s study a completed one from the gallery. The Dataiku gallery is a public read-only Dataiku instance. Accordingly, it has some limitations compared to a normal instance, but it has most of what you need to get started.
Visit the Dataiku gallery: https://gallery.dataiku.com/
Find the deactivated + New Project button, where you would normally get started.
On the left, filter for Fake Job Postings Quick Start, and select the project. Or, directly visit: https://gallery.dataiku.com/projects/QS_JOB_POSTINGS/flow/
Tip
This project starts with a dataset of real and fake job postings, applies some basic data preparation steps, and eventually trains a model to classify real from fake job postings. You’ll have a chance to build it from scratch in the Data Preparation Quick Start!
Data pipelines in Dataiku vs. Alteryx#
Data lineage#
The Flow in Dataiku is similar to a workflow in Alteryx. Both provide a visual representation of a data pipeline. Let’s take a look!
If not already there, use the keyboard shortcut g + f to go to the project's Flow.
Both an Alteryx workflow and a Dataiku Flow begin with input data sources on the left. While preserving the input data, both platforms provide a visual way to understand the lineage of data transformations, before producing a final output on the right.
Of course, Dataiku introduces its own visual grammar that you’ll quickly learn:
Shape | Item | Icon
---|---|---
Square | Dataset | The icon on the square represents the dataset's storage location, such as Amazon S3, Snowflake, PostgreSQL, etc.
Circle | Recipe | The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe.
Diamond | Model | The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc.
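If the coiled snakes caught your attention: a code recipe is ultimately just a script that reads the datasets to its left and writes the datasets to its right, and the Flow is drawn from those inputs and outputs. Here is a minimal sketch of a Python recipe body, assuming the dataiku package that is available inside DSS code recipes; the dataset and column names are illustrative rather than taken from this project's actual recipes.

```python
# Minimal Python recipe sketch: read an input dataset, transform it, write an output.
# Assumes the `dataiku` package available inside a DSS code recipe;
# dataset and column names here are illustrative.
import dataiku

# Input dataset (appears to the left of the recipe in the Flow)
postings = dataiku.Dataset("job_postings").get_dataframe()

# Any pandas logic can stand in for the transformation step
postings["description_length"] = postings["description"].fillna("").str.len()

# Output dataset (appears to the right of the recipe in the Flow)
dataiku.Dataset("job_postings_enriched").write_with_schema(postings)
```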
Tip
In addition to shape, color has meaning too.
Datasets are blue. Those shared from other projects are black.
Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.
Machine learning elements are green.
Pipeline organization#
One of the biggest differences between a workflow in Alteryx and a project in Dataiku is that you cannot rearrange the objects in a Dataiku Flow as you would on an Alteryx canvas. This can be a challenging mindset shift for users accustomed to dragging and dropping objects to create a personal pipeline.
Dataiku automatically determines the location of objects in the Flow. Think of it as freeing you from the obligation of organizing the pipeline yourself, which can become a major burden as pipelines grow long. Moreover, since you'll be building these Flows alongside colleagues, having a standardized presentation is helpful.
In Alteryx, you would use the Tool Container to organize tools in a workflow. In Dataiku, you can organize a Flow by dividing it into separate Flow zones.
This project, for example, has two zones (Data Preparation and Machine Learning) for the two main stages of the project.
Click on the full screen icon at the top right corner of the Data Preparation Flow zone to focus on that part of the project.
Click the X in the top right corner to go back to the main Flow (or g + f).
In the bottom left, open the Flow views menu.
Select Flow Zones to see the Flow color-coded by zone.
Click Hide Zones to see the entire undivided Flow.
When finished, click the X to return to the default view.
Tip
Another helpful strategy can be hiding parts of a Flow. For example, right click on a dataset, and select Hide all upstream or Hide all downstream to customize the Flow as you need it.
Dataset details#
One of the most frequent tasks for an analyst is tracking basic details or metadata about a particular dataset. In Dataiku, this comes built into every dataset.
Let’s highlight the Details tab of the right panel and the records count Flow view as two ways to demonstrate this point.
From the Flow, click once to select the job_postings dataset.
Click the Details icon to see associated metadata such as a description, tags, and status.
In the bottom left corner, open the Flow views menu, and select Records count.
Observe the color-coded record count of each dataset.
When finished, click the X to return to the default view.
Exploratory data analysis (EDA) in Dataiku vs. Alteryx#
Dataset sampling#
In Alteryx, you may have used the Sample tool to limit the size of a data stream. In Dataiku, you may find yourself adding a Sample/Filter recipe at the start of the Flow for a similar purpose.
However, sampling is also baked into all datasets in the Flow. By default, the Explore tab displays the first 10,000 rows of a dataset.
This can be surprising to new users, but it is necessary for a platform where you may be working with very large datasets as opposed to smaller datasets stored locally on your computer.
Double click on the job_postings dataset at the start of the pipeline to open its Explore tab.
Click the Sample button to check the current sample settings and other available methods.
When finished, close it by clicking the Sample button once more.
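If you later read this dataset from a notebook or code recipe, you can mirror the same head-of-table sample rather than loading everything. A small sketch, assuming the dataiku package available inside DSS; the limit argument is what reproduces the default 10,000-row sample.

```python
import dataiku

ds = dataiku.Dataset("job_postings")

# Mirror the Explore tab's default sample: only the first 10,000 rows
sample_df = ds.get_dataframe(limit=10000)
print(len(sample_df))

# Dropping the limit would read the entire dataset, which may be very large:
# full_df = ds.get_dataframe()
```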
Tip
If your reality is wrangling small datasets, such as Excel worksheets, rather than big data, be sure to also check out the Excel to Dataiku Quick Start.
Dataset profiling#
To quickly understand the contents of a dataset in Alteryx, you may have added the Browse tool after each tool. As before, this kind of information in Dataiku can be accessed from all datasets by default, which means less work cleaning up extraneous investigations that distract from the pipeline’s core activities.
Important
Remember that all such metrics are only computed based on the current sample (unless otherwise requested). In this case, that means the first 10,000 rows instead of the full 17,880. Sampling keeps things snappy!
Underneath the name of each column in the job_postings dataset, see the data quality bar representing the percentage of missing values in each column (according to the current sample). Several columns, such as salary_range, have a lot of missing values.
Click the Quick column stats icon to view distributions for every column.
Click the header on a column such as location, and select Analyze from the dropdown menu (which has fewer options than normal because this is a read-only instance).
Use the arrows to scroll through statistics of other columns, and then close the dialog when finished.
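The data quality bar and the Analyze window you just used have straightforward pandas analogues if you ever want the same numbers in a notebook. A sketch computed on the same head sample; salary_range and location are the columns you just inspected in this dataset.

```python
import dataiku

# Same 10,000-row head sample that the Explore tab uses by default
df = dataiku.Dataset("job_postings").get_dataframe(limit=10000)

# Data quality bar analogue: share of missing values per column
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))  # salary_range should appear near the top

# Analyze analogue for a categorical column: most frequent values
print(df["location"].value_counts().head(10))
```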
Tip
Rather than separately adding Interactive Chart or Data Investigation tools on an ad-hoc basis, Dataiku builds this need into every dataset. Navigate to the Charts tab to create drag and drop style visualizations. You’ll find examples for the job_postings dataset.
Data preparation in Dataiku vs. Alteryx#
Dataset actions#
To build a data pipeline in Alteryx, you would drag and drop tools from the tool palette onto the canvas. To do the same in Dataiku, you apply recipes to datasets through the Actions menu of the right panel.
Navigate back to the Flow (g + f). If needed, type shift + z to reset the zoom.
From the Flow, click once to select any dataset (a blue square), such as job_postings_prepared.
Click the Actions icon to see what actions can be applied to the selected dataset.
In particular, see the (inactive) menu of Visual recipes.
Tip
The available actions in the right panel change depending on what you currently have selected. To demonstrate this, select two datasets at the same time, or one dataset and one recipe together.
Tool library#
Compared to Alteryx, it may initially seem that Dataiku has fewer visual data preparation helpers. This is because the Prepare recipe is your Swiss army knife. This one visual recipe contains approximately 100 processors that cover functionality you’d find in Alteryx tools like Select, Formula, Transpose — as well as much more.
The Prepare recipe found in this project, for example, has eight processor steps in its script. Having all of these actions completed in one recipe keeps the overall data pipeline more readable.
From the Flow, double click to open the Prepare recipe (the yellow circle with the broom icon).
Scroll to the end of the script, and click + Add a New Step to open the processor library.
Browse the available steps. You can even add some! The Geography section may be particularly interesting if you work with geospatial data. Close the library when finished.
Experiment with toggling the eye and on/off icons to view the impact of a step or to disable a step.
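For comparison, here is roughly what a few processor-style steps would look like if you coded them yourself in a Python recipe instead. This is a hypothetical sketch for orientation, not the eight steps actually used in this project's Prepare recipe; the column names come from the job postings dataset.

```python
import dataiku

df = dataiku.Dataset("job_postings").get_dataframe()

# Select-tool analogue: keep only the columns you need
df = df[["title", "location", "description", "fraudulent"]]

# Formula-tool analogue: derive a new column
df["has_location"] = df["location"].notna()

# Cleansing analogue: fill missing values in a text column
df["location"] = df["location"].fillna("Unknown")

# A real Python recipe would then write df to an output dataset,
# as in the earlier recipe sketch.
```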
Joining datasets#
In addition to the Prepare recipe, Dataiku offers many other visual recipes for common data transformations. For example, in Alteryx, you would typically use the Join tool to combine datasets based on a common field. In Dataiku, you use a Join recipe.
Let’s take a closer look at the Join recipe in this project. To get there, instead of going back to the Flow, let’s demonstrate another navigation method.
From inside the Prepare recipe, type shift + a to open the Flow navigator.
Press the right arrow key until you reach the downstream Join recipe (the yellow circle with the Venn diagram). Then press the enter/return key to open it.
Click on Left join to view the available types of joins.
Recognize how unmatched rows from the join operation are sent to the unmatched dataset.
Browse the recipe’s steps on the left, such as the Post-filter. Although not used here, many visual recipes include steps to filter data or compute new columns before or after the recipe’s primary operation.
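If you're more used to expressing joins in code, the left join and unmatched-rows behavior you just inspected maps onto a familiar pandas pattern. A standalone sketch with made-up data rather than this project's datasets:

```python
import pandas as pd

postings = pd.DataFrame({"job_id": [1, 2, 3], "title": ["Analyst", "Engineer", "Designer"]})
details = pd.DataFrame({"job_id": [1, 2], "industry": ["Finance", "Software"]})

# Left join on the common key; indicator=True records whether each row found a match
joined = postings.merge(details, on="job_id", how="left", indicator=True)

# Rows without a match play the role of the recipe's "unmatched" output dataset
unmatched = joined[joined["_merge"] == "left_only"].drop(columns="_merge")
print(unmatched)
```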
Tip
Later, you can enroll in the Visual Recipes course in the Dataiku Academy to dive into the complete menu of visual recipes!
Data processing in Dataiku vs. Alteryx#
Computation engines#
In addition to its own engine for local processing, Alteryx has a limited number of in-database tools that enable a workflow to be run in-database. Dataiku’s computational environment is quite different.
As a cloud-native platform, you connect to Dataiku through a remote server. In addition to being more secure from IT’s perspective, this enables you to access much more powerful computing resources, which can speed up processing times for large jobs.
Despite all this impressive technology underneath, as a user building a pipeline, you won’t need to worry about selecting a computation engine. Dataiku selects the optimal engine for your recipe based on where your data is stored, the type of processing at hand, and the infrastructure available to your instance.
With the Join recipe open, click the engine icon to open the recipe engine dialog.
In the dialog, click Show to view the non-selectable engines as well, and read why they are not available.
Close the dialog when finished.
Tip
You’ll learn more about how Dataiku selects recipe engines in the Data Pipelines course of the Advanced Designer learning path.
Data connections#
All visual recipes in this project use the DSS engine because all datasets are stored in filesystem connections. This would not be the case in a real project!
There’s a handy Flow view to see which connections are used in your project.
Navigate back to the Flow (g + f).
At the bottom left, open the Flow views menu.
Select Connections to observe that all datasets downstream of the initial uploaded files are stored on the filesystem.
Depending on your organization’s infrastructure investments, your data may be stored anywhere from a traditional relational database to a cloud data warehouse: essentially, wherever the data ecosystem moves!
Rather than transferring data across a network, Dataiku creates a view into where the data is stored. Think of the Flow as a visual layer on top of your organization’s existing storage and computing infrastructure.
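If you prefer checking this outside the Flow view, the public API client can also report each dataset's storage type. A rough sketch assuming the dataikuapi package and a hypothetical host and API key; the exact return type of list_datasets varies by client version (older versions return plain dicts, as assumed here).

```python
import dataikuapi

# Hypothetical connection details for an instance you can access
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("QS_JOB_POSTINGS")

# Each entry describes a dataset, including its storage type (Filesystem, Snowflake, S3, ...)
for ds in project.list_datasets():
    print(ds["name"], "->", ds["type"])
```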
Tip
Take the Join recipe in this project for example:
If both input and output datasets were stored in Snowflake, Dataiku would have selected the in-database SQL engine.
If both input and output datasets were stored in Amazon S3 buckets, Dataiku would have selected the Spark engine.
These engines would do the actual computation. Dataiku would show a sample of the results (and call upon the complete results whenever you request them).
Running data pipelines#
Although Alteryx provides a Detour tool to bypass processes or the ability to turn off a particular container, you generally have limited control over how a workflow will run. Dataiku has much more flexible options for executing a data pipeline.
First, Dataiku has a concept of a dataset being “out of date”. If an upstream recipe has changed, the downstream datasets are out of date. Because Dataiku is aware of the Flow’s dependencies, it can skip computation steps when they are not required.
Second, Dataiku can build pipelines, or sections of pipelines, with respect to an upstream or downstream reference. For example, while working on this project, you might want to:
Build just the job_postings_prepared dataset by computing only the Prepare recipe immediately prior to it.
Build only outputs in the Machine Learning Flow zone.
Select the upstream job_postings dataset, and build all outputs downstream.
Select the downstream test_scored dataset, and compute only the necessary upstream dependencies to get an up-to-date output.
Unfortunately, you can’t try this on a read-only instance, but you can still see where the Build option lives:
Right click on the test_scored dataset at the end of the pipeline.
Note the inactive Build option.
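Outside this read-only gallery, the same kind of build can also be launched programmatically through the public API. A sketch assuming the dataikuapi package, a hypothetical host and API key, and a reasonably recent client version (method names and job_type values can differ across versions):

```python
import dataikuapi

# Hypothetical connection details for a Design node where you have write access
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("QS_JOB_POSTINGS")

# Build test_scored along with any out-of-date upstream dependencies
project.get_dataset("test_scored").build(job_type="RECURSIVE_BUILD")
```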
Tip
If you would use Batch or Iterative macros to build data pipelines in succession with different parameters, look into Dataiku’s concept of dynamic datasets and repeating recipes.
Intermediate datasets#
One consequence of Dataiku’s architecture is that intermediate datasets in a Flow should not be a major concern. In fact, the presence of intermediate datasets allows you to view and analyze a sample of data at any point in the Flow using standardized methods — without having to add additional tools on an ad hoc basis.
Also, it’s important to keep in mind several points:
No data is stored locally on your computer.
While prototyping your Flow, you may only be building one dataset or one zone instead of the entire pipeline from start to finish every time.
Given the smart computation options, you won’t need to recompute the same data if it is not required.
Since you’re not using a desktop tool, while a job is running, you are free to move on to other tasks. Remember you can have Dataiku open in multiple browser tabs!
As discussed below, Dataiku has separate environments for development and production, meaning that you may refactor your Flow to be more efficient before moving it to a production environment.
Automation in Dataiku vs. Alteryx#
Job scheduling#
When you have finished creating a workflow in Alteryx, the next step may be to publish it to Alteryx Server, where you can then schedule the workflow’s execution. Dataiku’s answer for job scheduling has two main components: scenarios and the Automation node.
A scenario is the way to automate actions in Dataiku. These actions could be tasks such as rebuilding a dataset at the end of a pipeline, retraining a model, or exporting a dashboard. You can even dynamically control how these actions execute. For example, if the average of a certain column is outside of a particular range, you can have the scenario stop its execution.
Once you have defined the set of actions to take place, you can define a trigger for when those actions should execute. In addition to time-based triggers, you can also define other types of triggers, such as when a dataset changes or with Python code. The completion of one scenario can even trigger the start of another scenario!
Finally, you can attach reporters to scenarios that send alerts through various messaging channels. For example, after a successful (or failed) run, the scenario can send an email with the results (or error message).
From the top navigation bar, go to the Jobs menu (the play button), and select Scenarios.
Click to open the Score Data scenario.
Click Add Trigger to see options for when the scenario should start.
Click Add Reporter to see options for what kinds of alerts can be sent.
Navigate to the Steps tab near the top right to explore the actions included.
Click Add Step to see what other steps are available.
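One of the step types you'll see there is a custom Python step, which is how the dynamic control described above is usually expressed. A hedged sketch of such a step, assuming the dataiku.scenario helper available in scenario scripts and a hypothetical guardrail on a column's contents:

```python
import dataiku
from dataiku.scenario import Scenario

scenario = Scenario()

# Hypothetical guardrail: only rebuild the scored output if the data looks sane
df = dataiku.Dataset("job_postings_prepared").get_dataframe(limit=10000)
avg_length = df["description"].fillna("").str.len().mean()

if avg_length > 0:
    scenario.build_dataset("test_scored")  # continue the pipeline
else:
    # Raising an exception stops the scenario and marks the run as failed
    raise ValueError("description column looks empty; stopping the scenario")
```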
Tip
If you would often build analytics apps or standard macros in Alteryx, you may be interested in Dataiku applications or Dataiku applications-as-recipes, respectively. Both are ways to repackage Dataiku projects into reusable applications. You’ll learn more in the Dataiku Applications Academy course.
Production environments#
The second component of Dataiku’s answer to job scheduling is an Automation node.
One reason to use Alteryx Server is to get the workflow’s computation off your desktop. Dataiku scenarios are already running remotely. Still, Dataiku as a platform actually consists of multiple nodes for specific stages of the AI lifecycle.
The Design node (what you’ve been looking at in the gallery) is where you’ll spend the vast majority of your time. It is the development sandbox where you actively experiment with building Flows.
Production workflows, on the other hand, require a separate environment to avoid any unforeseen mishaps and allow for proper monitoring.
For a batch workload, this environment is an Automation node.
For a real-time API use case, this would often be an API node, but there are also external deployment options (AWS SageMaker, Azure ML, Google Vertex AI, Snowflake).
There’s even a Govern node for monitoring data and model quality at the organization level.
For batch workloads most common to Alteryx users, when you have finished building your Flow and created your scenario, the next step is to publish the project as a bundle on an Automation node, where the scenario can run undisturbed.
On a normal instance, you’d be able to see a page listing the project’s bundles.
Tip
You’ll learn more about pushing project bundles to an Automation node in the Project Deployment Academy course.
Collaboration in Dataiku vs. Alteryx#
Tracking changes#
If accustomed to working individually in a desktop application, real-time collaboration in a web browser promises many exciting opportunities, but it can also be slightly intimidating.
For example, if you and a colleague are working on a recipe simultaneously, Dataiku will show an alert so that you don’t overwrite each other’s work. (Think of the floating heads at the top of a shared Google Doc!) If this is a frequent problem, one strategy is prototyping your Flow in a separate Flow zone.
You may also be concerned about colleagues making changes to your Flow. The right panel’s Timeline tab, which tracks changes to an item, is one way to address this question. If you have a question for a colleague, use the Discussions tab just above it. There are also Flow views to inform you about recent modifications.
If something unforeseen does happen, Dataiku projects have a built-in Git repository so you can revert a project to a previous state.
Navigate back to the Flow (g + f), and select a dataset or recipe.
Locate the Discussions and Timeline tabs of the right panel.
From the bottom left corner, open the Flow views menu.
Explore views such as Last modification and Recent modifications (which are not too interesting for this simple project!).
From the More Options (…) menu in the top navigation bar, select Version control to view the project’s Git commit history.
Security#
The prospect of collaboration also introduces a security question: will unauthorized colleagues gain access to my project or data? Dataiku addresses this critical issue with a groups-based permission framework.
To summarize, a user can belong to any number of groups. On a per-project basis, project owners grant groups various permissions, such as the ability to write project content, export datasets, or run scenarios.
As you’ve already seen on this gallery project, you have permission to read project content, but do not have write access.
Open the Profile menu at the very top right corner of the screen.
Click the gear icon to see Profile & Settings.
Recognize that this generic Gallery user belongs to a group called public_users.
What’s next?#
Now that you have examined a completed Dataiku project, it’s time to build one yourself!
After signing up for a free trial, you’ll build the same project you’ve just examined from the ground up in the Data Preparation Quick Start.
Once that is completed, if you work a great deal with geospatial data, you’ll also be interested in our course on Geospatial Analytics.
Alternatively, if you were a user of Alteryx Machine Learning, you should also check out the Machine Learning Quick Start for a quick tour of AutoML with Dataiku.