Quick Start | Alteryx to Dataiku#

Get started#

If you have experience working in a data analytics platform such as Alteryx, the prospect of migrating to a cloud-native platform like Dataiku may be exciting, but also somewhat daunting.

With that transition in mind, let’s take a quick tour of a Dataiku project to introduce some of the main conceptual differences that will help you make the leap!

Objectives#

In this quick start, you’ll:

  • Review a completed Dataiku project from the perspective of a user with an Alteryx background.

  • Use this experience to understand the high-level differences between working as an analyst in Alteryx vs. Dataiku.

Prerequisites#

For this tutorial, you won’t actually need a Dataiku account of your own. All you need is internet access and a supported web browser (Google Chrome, Mozilla Firefox, or Microsoft Edge).

View the project#

In Alteryx, you would start by creating a new workflow. In Dataiku, you start with a new project.

Before you begin creating your own Dataiku projects, let’s study a completed one from the gallery. The Dataiku gallery is a public read-only Dataiku instance. Accordingly, it has some limitations compared to a normal instance, but it has most of what you need to get started.

  1. Visit the Dataiku gallery: https://gallery.dataiku.com/

  2. Find the deactivated + New Project button, where you would normally get started.

  3. On the left, filter for Fake Job Postings Quick Start, and select the project. Or, directly visit: https://gallery.dataiku.com/projects/QS_JOB_POSTINGS/flow/

Dataiku screenshot of the home page of the gallery.

Tip

This project starts with a dataset of real and fake job postings, applies some basic data preparation steps, and eventually trains a model to classify real from fake job postings. You’ll have a chance to build it from scratch in the Data Preparation Quick Start!

Data pipelines in Dataiku vs. Alteryx#

Data lineage#

The Flow in Dataiku is similar to a workflow in Alteryx. Both provide a visual representation of a data pipeline. Let’s take a look!

  1. If not already there, use the keyboard shortcut g + f to go to the project’s Flow.

Dataiku screenshot of the Flow.

Both an Alteryx workflow and a Dataiku Flow begin with input data sources on the left. While preserving the input data, both platforms provide a visual way to understand the lineage of data transformations, before producing a final output on the right.

Of course, Dataiku introduces its own visual grammar that you’ll quickly learn:

  • Square: Dataset. The icon on the square represents the dataset’s storage location, such as Amazon S3, Snowflake, PostgreSQL, etc.

  • Circle: Recipe. The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe.

  • Diamond: Model. The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc.

Tip

In addition to shape, color has meaning too.

  • Datasets are blue. Those shared from other projects are black.

  • Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.

  • Machine learning elements are green.

See also

The Flow presents a dataset-level view of data lineage. For a column-level view of data lineage, select a dataset, navigate to the Schema (Schema icon.) tab in the right panel, and click the data lineage (Data lineage icon.) icon.
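
To make dataset-level lineage more concrete, here is a minimal Python sketch that models a Flow as a mapping from each output dataset to the recipe and inputs that produce it. It is a conceptual illustration only, not the Dataiku API. The dataset names come from this project, but the wiring shown and the `upstream` helper are assumptions made for the example.

```python
# Conceptual sketch only: a Flow modeled as a plain dictionary.
# Keys are output datasets; values are (recipe type, input datasets).
# The wiring below is a simplified assumption, not the real project layout.
FLOW = {
    "job_postings_prepared": ("prepare", ["job_postings"]),
    "test_scored": ("score", ["job_postings_prepared"]),
}


def upstream(dataset, flow=FLOW):
    """Return every dataset that `dataset` depends on, nearest first."""
    lineage = []
    to_visit = list(flow.get(dataset, ("", []))[1])
    while to_visit:
        current = to_visit.pop(0)
        if current not in lineage:
            lineage.append(current)
            to_visit.extend(flow.get(current, ("", []))[1])
    return lineage


print(upstream("test_scored"))
```

Reading the Flow from right to left in this way is roughly what you do visually when you trace a final output back to its input data sources.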

Pipeline navigation#

A new software platform brings its own way of navigation. While you can always click around to inspect the Flow, it can also be quite helpful to start learning some of the keyboard shortcuts.

Let’s use one shortcut to preview the data in the Flow.

  1. From the Flow, type a question mark (?) to bring up the menu of keyboard shortcuts.

  2. Close that window, and type shift + p to bring up the dataset preview window. (Or click Preview near the bottom right).

  3. Click once on the job_postings dataset at the start of the pipeline to preview its first 50 rows.

  4. Click on other downstream datasets to preview them as well.

  5. When finished, minimize the Preview menu (or shift + p again).

Dataiku screenshot of a dataset preview.

Tip

See these references for the navigation bar and right panel to get better acquainted with the interface.

Pipeline organization#

In Alteryx, you would drag and drop objects onto a canvas to create a personal pipeline. Dataiku, on the other hand, automatically determines the location of objects in the Flow.

While this can be a challenging mindset shift, it frees you from the obligation of organizing the pipeline yourself. As pipelines become long, this can become a major burden. Moreover, since you’ll be building these Flows alongside colleagues, having a standardized presentation is helpful.

In Alteryx, you would use the Tool Container to organize tools in a workflow. In Dataiku, you can organize a Flow by dividing it into separate Flow zones. By default, the position of Flow zones is automatically set. However, if you enable manual Flow zone positioning (as was done for this gallery project when it was upgraded to 13.3), you can drag zones as needed.

This project has two zones (Data Preparation and Machine Learning) for the two main stages of the project.

  1. Click on the full screen (Full screen Flow zone icon.) icon at the top right corner of the Data Preparation Flow zone to focus on that part of the project.

  2. Click the X in the top right corner to go back to the main Flow (or g + f).

  3. In the bottom left, open the Flow views menu.

  4. Select Flow Zones to see the Flow color-coded by zone.

  5. Click Hide Zones to see the entire undivided Flow.

  6. When finished, click the X to return to the default view.

  7. Experiment with manually positioning Flow zones by clicking and dragging from the zone header (or alt + click + drag from anywhere in the zone).

Dataiku screenshot of the Flow zones view.

Tip

Another helpful strategy can be hiding parts of a Flow. For example, right click on a dataset, and select Hide all upstream or Hide all downstream to customize the Flow as you need it.

Dataset details#

One of the most frequent tasks for an analyst is tracking basic details or metadata about a particular dataset. In Dataiku, this is built into every dataset.

Let’s highlight the Details (Details icon.) tab of the right panel and the records count Flow view as two ways to demonstrate this point.

  1. From the Flow, click once to select the job_postings dataset.

  2. Click the Details (Details icon.) icon to see associated metadata such as a description, tags, and status.

  3. In the bottom left corner, open the Flow view menu, and select Records count.

  4. Observe the color-coded record count of each dataset.

  5. When finished, click the X to return to the default view.

Dataiku screenshot of the records count Flow view.

Exploratory data analysis (EDA) in Dataiku vs. Alteryx#

Dataset sampling#

When using standard tools in Alteryx, all connected datasets are read into memory when you run a workflow. To get around this fact, you may have set a record limit to cap the size of a data stream and then removed it when needing the full results.

Dataiku, on the other hand, builds sampling into all datasets in the Flow. By default, the Explore tab displays the first 10,000 rows of a dataset. This can be surprising to new users, but it is necessary during the design phase of a project where you may be working with very large datasets as opposed to smaller datasets stored locally on your computer.

  1. Double click on the job_postings dataset at the start of the pipeline to open its Explore tab.

  2. Click the Sample button to check the current sample settings and other available methods.

  3. When finished, close it by clicking the Sample button once more.

Dataiku screenshot of the sampling settings of a dataset.
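
The idea behind a "first records" sample can be sketched in a few lines of Python. This is not how Dataiku implements sampling. It is only an illustration of reading the head of a large stream and computing statistics on that head alone. The rows are synthetic, and the pattern of missing `salary_range` values is invented for the example.

```python
from itertools import islice

SAMPLE_SIZE = 10_000   # default "first rows" sample size described above
TOTAL_ROWS = 17_880    # full size of job_postings, per this tutorial


def rows():
    """Yield synthetic rows; every third row lacks a salary_range (made-up data)."""
    for i in range(TOTAL_ROWS):
        yield {"job_id": i, "salary_range": None if i % 3 == 0 else "20000-30000"}


# Take only the head of the stream, without reading the remaining rows.
sample = list(islice(rows(), SAMPLE_SIZE))

# Any statistic computed here describes the sample, not the full dataset.
missing = sum(1 for row in sample if row["salary_range"] is None)
print(len(sample), "rows sampled out of", TOTAL_ROWS)
print(f"missing salary_range in sample: {missing / len(sample):.1%}")
```

Because the generator is consumed lazily, the rows beyond the sample are never produced, which is the property that keeps interactive work fast on very large datasets.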

Tip

If, rather than big data, your reality is wrangling small datasets, such as Excel worksheets, be sure to also check out the Excel to Dataiku Quick Start.

Dataset profiling#

To quickly understand the contents of a dataset in Alteryx, you may have added the Browse tool after each tool. As before, this kind of information in Dataiku can be accessed from all datasets by default, which means less work cleaning up extraneous investigations that distract from the pipeline’s core activities.

Important

Remember that all such metrics are only computed based on the current sample (unless otherwise requested). In this case, that means the first 10,000 rows instead of the full 17,880. Sampling keeps things snappy!

  1. Underneath the name of each column in the job_postings dataset, see the data quality bar representing the percentage of missing values in each column (according to the current sample). Several columns, such as salary_range, have a lot of missing values.

  2. Click the Quick column stats (Quick column stats icon.) icon to view distributions for every column.

  3. Click the header on a column such as location, and select Analyze from the dropdown menu (which has fewer options than normal because this is a read-only instance).

  4. Use the arrows to scroll through statistics of other columns, and then close the dialog when finished.

Dataiku screenshot of the Explore tab of a dataset.

Tip

Rather than separately adding Interactive Chart or Data Investigation tools on an ad-hoc basis, Dataiku builds this need into every dataset. Navigate to the Charts tab to create drag and drop style visualizations. You’ll find examples for the job_postings dataset.

Data preparation in Dataiku vs. Alteryx#

Dataset actions#

To build a data pipeline in Alteryx, you would drag and drop tools from the tool palette onto the canvas. To do the same in Dataiku, you apply recipes to datasets through the Actions menu of the right panel.

  1. Navigate back to the Flow (g + f). If needed, type shift + z r to reset the zoom.

  2. From the Flow, click once to select any dataset (a blue square), such as job_postings_prepared.

  3. Click the Actions (Actions icon.) icon to see what actions can be applied to the selected dataset.

  4. In particular, see the (inactive) menu of Visual recipes.

Dataiku screenshot of the Actions menu of a dataset.

Tip

The available actions in the right panel change depending on what you currently have selected. To demonstrate this, select two datasets at the same time, or one dataset and one recipe together.

Tool library#

Compared to Alteryx, it may initially seem that Dataiku has fewer visual data preparation helpers. This is because the Prepare recipe is your Swiss army knife. This one visual recipe contains approximately 100 processors that cover functionality you’d find in Alteryx tools like Select, Formula, Transpose — as well as much more.

The Prepare recipe found in this project, for example, has eight processor steps in its script. Having all of these actions completed in one recipe keeps the overall data pipeline more readable.

  1. From the Flow, double click to open the Prepare recipe (the yellow circle with the broom icon).

  2. Scroll to the end of the script, and click + Add a New Step to open the processor library.

  3. Browse the available steps. You can even add some! The Geography section may be particularly interesting if you work with geospatial data. Close the library when finished.

  4. Experiment with toggling the eye (View impact icon.) and power (Disable step icon.) icons to view the impact of a step or to disable a step.

Dataiku screenshot of a Prepare recipe.

Tip

The last two steps in the Prepare recipe feature Dataiku formulas, which use a spreadsheet-like expression language for performing row-by-row calculations, manipulating strings, and more. Although you might look to the Window recipe for grouped aggregations like cumulative sums and moving averages, formulas also support relative referencing for the kind of offset calculations you may be accustomed to in Alteryx’s multi-row formula tool.
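
For readers who think in code, the three kinds of calculation mentioned in this tip can be sketched in plain Python. This is not Dataiku formula syntax or a Window recipe definition. The daily counts and the `moving_average` helper are hypothetical and serve only to show what each calculation produces.

```python
from itertools import accumulate

# Hypothetical daily counts of new postings, ordered by date.
daily_postings = [12, 7, 9, 15, 4]

# Cumulative sum: the kind of ordered aggregation a window operation handles.
cumulative = list(accumulate(daily_postings))


def moving_average(values, window):
    """Average each value with up to `window - 1` preceding values."""
    averages = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


three_day_average = moving_average(daily_postings, 3)

# Offset calculation: change versus the previous row.
# The first row has no predecessor, so its change is None.
change = [None] + [
    current - previous
    for previous, current in zip(daily_postings, daily_postings[1:])
]

print(cumulative)
print(change)
```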

Joining datasets#

In addition to the Prepare recipe, Dataiku offers many other visual recipes for common data transformations. For example, in Alteryx, you would typically use the Join tool to combine datasets based on a common field. In Dataiku, you use a Join recipe.

Let’s take a closer look at the Join recipe in this project. To get there, instead of going back to the Flow, let’s demonstrate another navigation method.

  1. From inside the Prepare recipe, click the navigator (Flow navigator icon.) icon (or type shift + a).

  2. Press the right arrow key until you reach the downstream Join recipe (the yellow circle with the Venn diagram). Then press the enter/return key to open it.

  3. Click on Left join to view the available types of joins.

  4. Recognize how unmatched rows from the join operation are sent to the unmatched dataset.

  5. Browse the recipe’s steps on the left, such as the Post-filter. Although not used here, many visual recipes include steps to filter data or compute new columns before or after the recipe’s primary operation.

Dataiku screenshot of a Join recipe.
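
As a rough mental model of a left join that also routes unmatched rows to a separate output, consider the Python sketch below. It assumes that the unmatched output collects right-side rows that found no partner on the left. The `left_join` helper, the column names, and the values are all hypothetical, and the code is not how Dataiku executes the recipe.

```python
def left_join(left_rows, right_rows, key):
    """Left join on `key`; also return right rows that matched nothing."""
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined, matched_keys = [], set()
    for left in left_rows:
        partners = right_by_key.get(left[key])
        if partners:
            matched_keys.add(left[key])
            joined.extend({**right, **left} for right in partners)
        else:
            joined.append(dict(left))  # kept, but without right-side columns
    unmatched = [row for row in right_rows if row[key] not in matched_keys]
    return joined, unmatched


# Made-up example data.
postings = [
    {"job_id": 1, "education": "Bachelor"},
    {"job_id": 2, "education": "PhD"},
]
earnings = [
    {"education": "Bachelor", "median_weekly": 1000},
    {"education": "Associate", "median_weekly": 800},
]

joined, unmatched = left_join(postings, earnings, "education")
print(joined)
print(unmatched)
```

Every left row survives the join, while the right row with no partner lands in the second output instead of silently disappearing.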

Tip

Later, you can enroll in the Visual Recipes course in the Dataiku Academy to dive into the complete menu of visual recipes!

Data processing in Dataiku vs. Alteryx#

Dataiku’s built-in sampling is a great advantage when it comes to working interactively in the design phase of a project. Of course, you’ll also need to execute recipes to compute the full outputs.

Computation engines#

In addition to its own engine for local processing, Alteryx has a limited number of in-database tools that enable a workflow to be run in-database. Dataiku’s computational environment is quite different.

Because Dataiku is a cloud-native platform, you connect to it through a remote server. In addition to being more secure from IT’s perspective, this enables you to access much more powerful computing resources, which can speed up processing times for large jobs.

Despite all this impressive technology underneath, as a user building a pipeline, you won’t need to worry about selecting a computation engine. Dataiku selects the most suitable engine for your recipe based on where your data is stored, the type of processing at hand, and the infrastructure available to your instance.

  1. With the Join recipe open, click the gear (Gear icon.) to open the recipe engine dialog.

  2. In the dialog, click Show to view the non-selectable engines as well, and read why they are not available.

  3. Close the dialog when finished.

Dataiku screenshot of the recipe engine dialog.

Tip

You’ll learn more about how Dataiku selects recipe engines in the Data Pipelines course of the Advanced Designer learning path.

Data connections#

All visual recipes in this project use the DSS engine because all datasets are stored in filesystem connections. This would not be the case in a real project!

There’s a handy Flow view to see which connections are used in your project.

  1. Navigate back to the Flow (g + f).

  2. At the bottom left, open the Flow views menu.

  3. Select Connections to observe that all datasets downstream of the initial uploaded files are stored on the filesystem.

Dataiku screenshot of the Connections Flow view.

Depending on your organization’s infrastructure investments, your data may be stored anywhere from a traditional relational database to a cloud data warehouse. Essentially, Dataiku can follow wherever the data ecosystem moves!

Rather than transferring data across a network, Dataiku creates a view into where the data is stored. Think of the Flow as a visual layer on top of your organization’s existing storage and computing infrastructure.

Tip

Take the Join recipe in this project for example:

  • If both input and output datasets were stored in Snowflake, Dataiku would have selected the in-database SQL engine.

  • If both input and output datasets were stored in Amazon S3 buckets, Dataiku would have selected the Spark engine.

These engines would do the actual computation. Dataiku would show a sample of the results (and call upon the complete results whenever you request).
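
The three cases described in this tutorial (filesystem, Snowflake, and Amazon S3) can be written as a toy decision rule. The Python sketch below is a deliberate simplification: real engine selection also depends on the recipe type and on what your instance has available. The `select_engine` function, the storage labels, and the fallback to the DSS engine for mixed storage are assumptions for illustration.

```python
# Illustrative only: covers just the storage types mentioned in this tutorial.
SQL_STORAGE = {"snowflake"}
SPARK_STORAGE = {"s3"}


def select_engine(input_storage, output_storage):
    """Pick an engine label from where the inputs and the output are stored."""
    storage = set(input_storage) | {output_storage}
    if len(storage) == 1:
        (only,) = storage
        if only in SQL_STORAGE:
            return "in-database SQL"
        if only in SPARK_STORAGE:
            return "Spark"
    # Assumed fallback for filesystem data or mixed storage.
    return "DSS"


print(select_engine(["filesystem", "filesystem"], "filesystem"))
print(select_engine(["snowflake", "snowflake"], "snowflake"))
print(select_engine(["s3", "s3"], "s3"))
```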

Running data pipelines#

Although Alteryx provides a Detour tool to bypass processes or the ability to turn off a particular container, you generally have limited control over how a workflow will run. Dataiku has much more flexible options for executing a data pipeline.

First, Dataiku has a concept of a dataset being “out of date”. If an upstream recipe has changed, the downstream datasets are out of date. Because Dataiku is aware of the Flow’s dependencies, it can skip computation steps when they are not required.

Second, Dataiku can build pipelines, or sections of pipelines, with respect to an upstream or downstream reference. For example, while working on this project, you might want to:

  • Build just the job_postings_prepared dataset by computing only the Prepare recipe immediately prior to it.

  • Build only outputs in the Machine Learning Flow zone.

  • Select the upstream job_postings dataset, and build all outputs downstream.

  • Select the downstream test_scored dataset, and compute only the necessary upstream dependencies to get an up-to-date output.

Unfortunately, you can’t try this on a read-only instance. Below is a screenshot of a build dialog you would normally see in this last case.

  1. Right click on the test_scored dataset at the end of the pipeline.

  2. Note the inactive Build option.

Dataiku screenshot of the build mode dialog.
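
The logic of building only the necessary upstream dependencies can be approximated with a short recursive function. The Python sketch below is a conceptual model, not Dataiku’s build algorithm. The `build_plan` helper is hypothetical, and the wiring between datasets is a simplified assumption.

```python
# Each output dataset maps to the inputs of the recipe that builds it.
# Simplified, assumed wiring for illustration.
RECIPE_INPUTS = {
    "job_postings_prepared": ["job_postings"],
    "test_scored": ["job_postings_prepared"],
}


def build_plan(target, out_of_date, recipe_inputs=RECIPE_INPUTS):
    """List the datasets to rebuild, in order, so that `target` is up to date."""
    plan = []

    def visit(dataset):
        inputs = recipe_inputs.get(dataset)
        if inputs is None:
            return False  # source dataset: nothing to compute
        # Visit every input (no short-circuit) so all stale inputs get planned.
        rebuilt_inputs = [visit(name) for name in inputs]
        stale = dataset in out_of_date or any(rebuilt_inputs)
        if stale and dataset not in plan:
            plan.append(dataset)
        return stale

    visit(target)
    return plan


# The Prepare recipe changed, so its output and everything after it is stale.
print(build_plan("test_scored", {"job_postings_prepared"}))
# Nothing is out of date, so nothing needs to run.
print(build_plan("test_scored", set()))
```

The second call returns an empty plan, which mirrors the point above: when nothing upstream has changed, no computation is required.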

Tip

If you would use Batch or Iterative macros to build data pipelines in succession with different parameters, look into Dataiku’s concept of dynamic datasets and repeating recipes.

Intermediate datasets#

One consequence of Dataiku’s architecture is that intermediate datasets in a Flow should not be a major concern. In fact, the presence of intermediate datasets allows you to view and analyze a sample of data at any point in the Flow using standardized methods — without having to add additional tools on an ad hoc basis.

Dataiku screenshot of intermediate datasets in a Flow.

Also, it’s important to keep in mind several points:

  • No data is stored locally on your computer.

  • While prototyping your Flow, you may only be building one dataset or one zone instead of the entire pipeline from start to finish every time.

  • Given the smart computation options, you won’t need to recompute the same data if it is not required.

  • Since you’re not using a desktop tool, while a job is running, you are free to move on to other tasks. Remember you can have Dataiku open in multiple browser tabs!

  • As discussed below, Dataiku has separate environments for development and production, meaning that you may refactor your Flow to be more efficient before moving it to a production environment.

Orchestration in Dataiku vs. Alteryx#

After creating a workflow in Alteryx, the next step is often to publish it to Alteryx Server, where it can be scheduled and shared. However, orchestrating multiple workflows, such as triggering a new one upon the completion of another, can be challenging.

Dataiku’s answer for workflow orchestration has three main components:

  • Scenarios to automate actions

  • Data quality rules, metrics, and checks to validate data

  • Production environments to run jobs outside of the development environment

Automation scenarios#

A scenario is the way to automate actions in Dataiku. These actions could be tasks such as rebuilding a dataset at the end of a pipeline, retraining a model, or exporting a dashboard. You can even dynamically control how these actions execute. For example, if the average of a certain column is outside of a particular range, you can have the scenario stop its execution.

Once you have defined the set of actions to take place, you can define a trigger for when those actions should execute. In addition to time-based triggers, you can also define other types of triggers, such as when a dataset changes or with Python code. The completion of one scenario can even trigger the start of another scenario!

Finally, you can attach reporters to scenarios that send alerts through various messaging channels. For example, after a successful (or failed) run, the scenario can send an email with the results (or error message).

  1. From the top navigation bar, go to the Jobs (Play button icon.) menu, and select Scenarios.

  2. Click to open the Score Data scenario.

  3. Click Add Trigger to see options for when the scenario should start.

  4. Click Add Reporter to see options for what kinds of alerts can be sent.

  5. Navigate to the Steps tab near the top right to explore the actions included.

  6. Click Add Step to see what other steps are available.

Dataiku screenshot of a scenario.
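
The control flow of a scenario (a condition, a sequence of steps, then a report) can be mimicked in plain Python. The sketch below does not use the Dataiku scenario API. The `run_scenario` function, the range bounds, and the step names are hypothetical, and a print statement stands in for a reporter.

```python
from statistics import mean


def run_scenario(values, low, high, steps):
    """Run `steps` in order unless the mean of `values` is outside [low, high]."""
    average = mean(values)
    if not low <= average <= high:
        return f"ABORTED: average {average:.2f} outside [{low}, {high}]"
    for step in steps:
        step()
    return "SUCCESS"


log = []
steps = [
    lambda: log.append("rebuild dataset"),
    lambda: log.append("retrain model"),
]

# A column whose average is within range: all steps run.
outcome = run_scenario([1, 2, 3], low=0, high=5, steps=steps)
print(outcome, log)  # the print stands in for a reporter such as an email

# A column whose average is out of range: the scenario stops before any step.
aborted = run_scenario([10, 20], low=0, high=5, steps=steps)
print(aborted)
```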

Tip

If you would often build analytics apps or standard macros in Alteryx, you may be interested in Dataiku applications or Dataiku applications-as-recipes, respectively. Both are ways to repackage Dataiku projects into reusable applications. You’ll learn more in the Dataiku Applications Academy course.

Data validation#

When you have scenarios running on automated schedules, you’ll need tools to verify that these jobs proceed as planned. This objective can be achieved with data quality rules, metrics, and checks.

The project at hand is not well-monitored. It has just one existing data quality rule that verifies if the record count of the input data is within a certain range.

  1. From the Jobs (Play button icon.) menu in the top navigation bar, click Data Quality (or use the keyboard shortcut g + q).

  2. Normally, this view would show a breakdown of all rules in the project, including those that may be returning errors or warnings.

  3. Click job_postings to view the data quality rule attached to this dataset.

  4. From the Data Quality tab of the job_postings dataset, click View Rules to see the expected record count.

Dataiku screenshot of project-level data quality.
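
A record count rule of this kind boils down to a range comparison. The Python sketch below assumes a rule with hard bounds that produce an error and soft bounds that produce a warning. The `record_count_rule` function and every threshold are hypothetical, since the actual values of the rule in this project are only visible in the gallery.

```python
def record_count_rule(count, hard_min, soft_min, soft_max, hard_max):
    """Classify a record count against hard (error) and soft (warning) bounds."""
    if not hard_min <= count <= hard_max:
        return "ERROR"
    if not soft_min <= count <= soft_max:
        return "WARNING"
    return "OK"


# Hypothetical thresholds around the 17,880 rows of job_postings.
bounds = dict(hard_min=10_000, soft_min=15_000, soft_max=20_000, hard_max=25_000)

print(record_count_rule(17_880, **bounds))
print(record_count_rule(12_000, **bounds))
print(record_count_rule(500, **bounds))
```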

Tip

Learn more about metrics, checks, and data quality rules in the Data Quality & Automation course in the Advanced Designer learning path.

Production environments#

The last leg of Dataiku’s answer to workflow orchestration is an Automation node.

One reason to use Alteryx Server is to get the workflow’s computation off your desktop. Dataiku scenarios are already running remotely. Still, Dataiku as a platform actually consists of multiple nodes for specific stages of the AI lifecycle.

The Design node (what you’ve been looking at in the gallery) is where you’ll spend the vast majority of your time. It is the development sandbox where you actively experiment with building Flows.

Production workflows, on the other hand, require a separate environment to avoid any unforeseen mishaps and allow for proper monitoring.

  • For a batch workload, this environment is an Automation node.

  • For a real-time API use case, this would often be an API node, but there are also external deployment options (AWS SageMaker, Azure ML, Google Vertex AI, Snowflake).

  • There’s even a Govern node for monitoring data and model quality at the organization level.

For batch workloads most common to Alteryx users, when you have finished building your Flow and created your scenario, the next step is to publish the project as a bundle on an Automation node, where the scenario can run undisturbed.

On a normal instance, you’d be able to see a page of project bundles, like this one for example:

Dataiku screenshot of a project bundle.

Tip

You’ll learn more about pushing project bundles to an Automation node in the Project Deployment Academy course.

Collaboration in Dataiku vs. Alteryx#

Sharing assets#

With Alteryx Designer, you would locally download assets, and then manually sync outputs or email files to distribute them. In Dataiku, however, if you make a change in a project, your colleague just needs to refresh the page to see it!

Think of repeatedly saving and emailing new versions of Microsoft Word files (my_project_final_final_final.docx) vs. Google Docs where you always have the latest version.

In fact, when you have an asset ready to share, Dataiku provides a wide variety of options to do so, depending on the situation. Of course, you can export a dataset to a format like CSV or Excel, or use a plugin to share it with a tool like Power BI or Tableau, but more often you might:

Again, you will run into a limitation on the read-only instance, but you can imagine the possibilities!

  1. Navigate back to the Flow (g + f).

  2. Select the test_scored dataset at the end of the data pipeline.

  3. Open the Actions (Actions icon.) tab in the right panel.

  4. Click the vertical dots (Vertical dots icon.) to find the disabled options for Share and Publish.

  5. Type g + p, and open the existing Project dashboard.

Dataiku screenshot of sharing and publish menus.

Tracking changes#

If you are accustomed to working individually in a desktop application, real-time collaboration in a web browser promises many exciting opportunities, but it can also be slightly intimidating.

For example, if you and a colleague are working on a recipe simultaneously, Dataiku will show an alert to keep you from overwriting each other’s work. (Think of the floating heads at the top of a shared Google Doc!) If this is a frequent problem, one strategy is prototyping your Flow in a separate Flow zone.

You may also be concerned about colleagues making changes to your Flow. The right panel’s Timeline tab, which tracks changes to an item, is one way to address this question. If you have a question for a colleague, use the Discussions tab just above it. There are also Flow views to inform you about recent modifications.

If something unforeseen does happen, Dataiku projects have a built-in Git repository that automatically commits changes so you can revert a project to a previous state.

  1. Navigate back to the Flow (g + f), and select a dataset or recipe.

  2. Locate the Discussions (Discussions icon.) and Timeline (Timeline icon.) tabs of the right panel.

  3. From the bottom left corner, open the Flow views menu.

  4. Explore views such as Last modification and Recent modifications (which are not too interesting for this simple project!).

  5. From the More Options (Horizontal dots icon.) menu in the top navigation bar, select Version control to view the project’s Git commit history.

Dataiku screenshot of the last modification Flow view.

Security#

The prospect of collaboration also introduces a security question: will unauthorized colleagues gain access to my project or data? Dataiku addresses this critical issue with a groups-based permission framework.

To summarize, a user can belong to any number of groups. On a per-project basis, project owners grant groups various permissions, such as the ability to write project content, export datasets, or run scenarios.

As you’ve already seen on this gallery project, you have permission to read project content, but do not have write access.

  1. Open the Profile menu at the very top right corner of the screen.

  2. Click the gear icon to see Profile & Settings.

  3. Recognize that this generic Gallery user belongs to a group called public_users.

Dataiku screenshot of the profile page.
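
A groups-based permission check can be pictured as two lookups: which groups a user belongs to, and which permissions a project grants to each group. The Python sketch below is a conceptual model, not Dataiku’s security API. Apart from the public_users group seen in the gallery, the user names, group names, and permission labels are invented for illustration.

```python
# Hypothetical users, groups, and permission labels.
USER_GROUPS = {
    "gallery_user": {"public_users"},
    "analyst": {"public_users", "data_team"},
}

# Permissions granted to each group on one project.
PROJECT_PERMISSIONS = {
    "public_users": {"read_project_content"},
    "data_team": {"read_project_content", "write_project_content", "run_scenarios"},
}


def can(user, permission):
    """Return True if any of the user's groups grants `permission`."""
    return any(
        permission in PROJECT_PERMISSIONS.get(group, set())
        for group in USER_GROUPS.get(user, set())
    )


print(can("gallery_user", "read_project_content"))
print(can("gallery_user", "write_project_content"))
print(can("analyst", "run_scenarios"))
```

In this model, the gallery user can read but not write, which matches what you have experienced throughout this tour.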

What’s next?#

Now that you have examined a completed Dataiku project, it’s time to build one yourself!

After signing up for a free trial, you’ll build the same project you’ve just examined from the ground up in the Data Preparation Quick Start.

See also

  • Once that is completed, if you work a great deal with geospatial data, you’ll also be interested in our course on Geospatial Analytics.

  • Alternatively, if you were a user of Alteryx Machine Learning, you should also check out the Machine Learning Quick Start for a quick tour of AutoML with Dataiku.

  • If you tried coding with Python and/or R in Alteryx, you’ll find a much wider set of capabilities for coders inside Dataiku. See the Developer Guide or the Developer learning path to get started.