Diagnosing Performance Issues in Dataiku

This article will discuss tips to help scope and troubleshoot performance issues in Dataiku DSS (DSS).

Performance issues can be some of the most difficult issues to diagnose. We’ll walk through the common causes for performance issues, how to approach each, and discuss the types of questions to ask when scoping a performance issue. Then, we’ll walk through some specific examples of performance issues and the steps taken to resolve these example issues.

Scoping the Issue

Start by asking a few questions to scope the issue you are experiencing. Knowing the answer to each question will help you better understand how to approach the issue and can lead to faster resolution.

  • What is slow? Is the DSS UI loading slowly? Is a specific job taking longer than expected to complete? Is the problem limited to performance, or are you also seeing errors or other symptoms (e.g., your instance keeps crashing)?

  • Who is experiencing the performance issue? Is every user experiencing the same issue (e.g., the Flow is slow to load for every user), or are only some users experiencing it (e.g., a notebook is running slowly for one user)? While both may have the same level of urgency for your DSS usage, it’s important to figure out who is impacted to accurately diagnose the issue. For instance, if only one user is experiencing a problem, that particular issue may be related to a specific project or browser. If all users are experiencing the issue, it might be related to server resources.

  • When does the issue occur? Is this an intermittent issue that users sometimes experience, or does this happen every time a user performs a specific action (e.g., every time a user explores a specific dataset)? Have you noticed any other type of pattern? For instance, are you experiencing slowness when running a job at the same time each day but not if you run it at a different time?

  • Where is the execution slow? If you are experiencing slowness for a particular job, is the job running locally on the DSS server, or is it a job that’s running remotely on a Kubernetes cluster or a Spark cluster? Is it a SQL query that is executing on a remote SQL database? This information will narrow down what type of performance issue you are encountering. You may be able to explore alternatives based on where the job is executing.

Common Performance Issues

One of the most common categories of performance issues is when a job takes longer than expected to run. In this case, you’ll first want to identify what type of job it is.

Code Recipes

In the case of a code recipe (Python, PySpark, R), it’s possible that the code simply has some inefficiencies that are slowing down the code execution. To diagnose this, a good strategy is to run your code line by line in a notebook. If one line or section takes a very long time to complete, you can try to alter that part of the code to be more performant.

A code recipe running on the DSS server is similar to running the same code outside DSS. This means that the operating system resources will impact a code recipe running in DSS in similar ways that they’ll impact code running outside of DSS. Specifically, concurrent processes on the DSS server can impact the performance of a code recipe.
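When stepping through a notebook is impractical, one alternative is to time each section of the recipe code directly. The following is a minimal sketch using only the Python standard library; the `timed` helper and the section labels are hypothetical names for illustration and are not part of any DSS API.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder workload standing in for the sections of a real recipe.
with timed("build rows", timings):
    rows = [{"id": i, "value": i * 2} for i in range(100_000)]

with timed("aggregate", timings):
    total = sum(row["value"] for row in rows)

# Print the slowest section first.
for label, secs in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {secs:.4f}s")
```

The section with the largest duration is the first candidate for optimization.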

Visual Recipes

In the case of a visual recipe, two general causes for slow performance are using an inefficient execution engine, or using data formats that don’t allow for the most optimal execution engine.

Inefficient Execution Engine

Let’s say you are trying to join a local filesystem dataset with a PostgreSQL dataset. In order to use the more optimal “In-database (SQL)” engine, you can sync your filesystem dataset to a PostgreSQL dataset, so that both of your input datasets are in PostgreSQL.

Let’s imagine that this is your initial Flow, joining a Postgres table with a filesystem dataset, and you find that the job takes a while to run:

Project flow with PostgreSQL datasets.

In this case, within the Join recipe, you’ll see a warning that alerts you that Dataiku will use the DSS engine, and that this engine is indeed not optimal for this particular setup:

Recipe with inefficient execution engine warning.

You can optimize this Flow by syncing your filesystem dataset to PostgreSQL, and then performing your Join recipe on two input Postgres datasets instead. This is what the Flow would look like after optimizing it:

Optimized project flow with two input PostgreSQL datasets.

This will allow you to select the In-database (SQL) engine for your Join recipe. As a general rule of thumb, if your data is stored in a database, the In-database (SQL) engine is the best choice for computation and will be the most performant.

Recipe with in-database SQL engine selected.

As in the example above, DSS provides specific fast-path engines that will usually be the most performant. You’ll want to investigate whether any recipe could use a fast-path engine if it were configured differently.

You can check which engine you are using for a recipe by looking at the bottom left-hand side of the visual recipe. You will see all selectable engines by clicking on the wheel icon next to your engine selection:

Recipe with selectable execution engines.

Suboptimal Dataset Formats

Another way to tell if a visual recipe is not optimized is by looking in the job log for any reference to “Computation will not be distributed.” That’s an indicator that there is something suboptimal in your input/output dataset format, the engine you’ve selected, or the permissions on the input/output dataset connection.

For example, using the fast-path when writing to an S3 CSV dataset requires that the output dataset does not have a header row configured. If you attempt to write to an output S3 CSV dataset that does, you’ll notice an entry in the job log that indicates that this is the case, and that this can lead to a performance issue:

[2022/01/21-17:47:35.980] [null-err-43] [INFO] [dku.utils]  - [2022/01/21-17:47:35.978] Cannot use Csv write fast-path for Csv-S3 dataset: Csv fast-path output is disabled in configuration

[2022/01/21-17:47:35.982] [null-err-43] [INFO] [dku.utils]  - [2022/01/21-17:47:35.978] Writing S3 dataset as remote dataframe. Computation will not be distributed

In each of the above cases, it’s usually best to modify your Flow in a way that will allow you to use the fast-path and preferred engine.
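If a job log is long, a short script can search it for this message. The sketch below scans log lines for the phrase quoted above; the `find_non_distributed` helper is a hypothetical name, and the sample lines are the two log entries shown in this section.

```python
MARKER = "Computation will not be distributed"


def find_non_distributed(log_lines):
    """Return (line_number, line) pairs for log lines that contain the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if MARKER in line
    ]


sample_log = [
    "[2022/01/21-17:47:35.980] [null-err-43] [INFO] [dku.utils]  - "
    "[2022/01/21-17:47:35.978] Cannot use Csv write fast-path for Csv-S3 dataset: "
    "Csv fast-path output is disabled in configuration",
    "[2022/01/21-17:47:35.982] [null-err-43] [INFO] [dku.utils]  - "
    "[2022/01/21-17:47:35.978] Writing S3 dataset as remote dataframe. "
    "Computation will not be distributed",
]

hits = find_non_distributed(sample_log)
for number, line in hits:
    print(number, line)
```

The function accepts any iterable of lines, so an open file handle for a downloaded job log can be passed in place of `sample_log`.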

Instance-wide Performance

You could encounter different kinds of instance-wide issues on your DSS instance.

For example, you might encounter multiple jobs appearing to slow down when executed on your DSS node. This is most commonly the case when there are multiple Python processes running on the same instance, including Jupyter notebooks.

You can prevent this situation by ensuring that cgroups are set up on your instance.

If cgroups are not yet enabled on your instance, a single Python process can consume as much RAM and CPU as possible on the instance. Let’s say that there are multiple Python processes running on the same instance. An individual Python job will still be able to consume as much RAM and CPU as is available to it, but the available resources might be considerably lower at certain times, depending on what else is running on the DSS instance.

While a job may still run and complete successfully, this competition for resources between concurrently running Python processes can increase the runtime of each job if they are now bottlenecked by the available resources on the instance.

In addition, if one user is running an extremely intensive job and the instance does not have cgroups set up, this single job can cause slowdown for all other jobs executed on the instance at the same time.

You can take several steps to prevent this situation:

  • Enable cgroups on your instance in order to control memory consumption of processes on your DSS server.

  • Set up automatic termination of Jupyter notebooks on a regular cadence. Jupyter notebook sessions are not automatically terminated, which can lead to unexpected load on the DSS server if users leave notebooks running. You can automate the cleanup by creating a scenario that runs on a regular basis (e.g., weekly) and runs the Macro step “Kill Jupyter Sessions.”

  • Offload jobs to a Kubernetes cluster. This will allow you to scale resources up and down as you need them, so that users can run more memory-intensive jobs without interrupting other jobs. You will likely want to set memory request and CPU request parameters so that each job is allocated appropriate resources on your cluster, which prevents similar competition between processes on your cluster nodes. For more information, see the Kubernetes documentation on memory request and CPU request parameters.

Takeaways

Sometimes slow performance is simply a reality.

Let’s say that you’ve reviewed your code for improvements and checked to see if your visual recipes are using the optimal dataset types and execution engines. Consider how much data you are processing. Is it a very wide dataset? Does it have a huge number of rows? If so, it might be difficult to reduce runtime further. Think about ways you can reduce the amount of data used in your recipe.

If you are processing a large number of rows, you can investigate partitioning your data, which builds in a structure to allow you to process specific slices of your data instead of your entire dataset at once. This can be a good option to help improve performance if this will work for your use case.
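The idea behind partitioning can be illustrated outside of DSS. The sketch below is only a conceptual illustration in plain Python: in DSS, partitioning is configured on the dataset, not written as code like this, and the `partition_by` helper and sample rows are hypothetical.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


rows = [
    {"day": "2022-01-20", "amount": 10},
    {"day": "2022-01-21", "amount": 5},
    {"day": "2022-01-21", "amount": 7},
]

partitions = partition_by(rows, "day")

# Process only the slice that changed instead of the whole dataset.
latest_total = sum(row["amount"] for row in partitions["2022-01-21"])
print(latest_total)
```

Only the rows in the requested slice are touched by the final computation, which is the property that makes partitioned processing cheaper on large datasets.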

Think creatively about whether you can alter your dataset, change the order in which you process your data, or sync your data to another dataset type that can improve performance.

Job Performance Troubleshooting

Now that we have a sense for the different types of performance issues that come up and how to approach them, let’s walk through some examples and how you might go about troubleshooting each.

Use Case 1: A Sync Recipe From Snowflake to S3 is Taking Many Hours to Complete

  • What: A specific visual sync recipe job in DSS is slower than expected. Our input dataset is a Snowflake dataset and our output is an S3 dataset with CSV formatting.

  • Who: Because this is impacting a single job, only one user is experiencing this performance issue, and only for this specific job.

  • When: The job consistently takes many hours to complete, and the runtime is about the same on every run.

  • Where: This job is using the Spark engine, and it’s executing on our Spark cluster and not on the DSS server.

Visual recipe with Spark execution engine configured.

Troubleshooting Steps

We know this issue is consistent for this particular job. We also know the job is using the Spark engine and is executing on a Spark cluster, so it’s not an issue with the DSS server. Because this is a visual recipe and not a code recipe, we can rule out inefficient user code as the cause. In addition, because it’s a visual recipe, we’ll want to pay special attention to the engine this recipe uses.

Looking through the job logs, we see a line referencing “Computation will not be distributed”:

[2021/10/06-11:10:19.408] [null-err-110] [INFO] [dku.utils]  - [2021/10/06-15:10:19.400] [main] [WARN] [dip.spark.fast-path]: Reading Snowflake dataset as a remote table. Computation will not be distributed

This indicates that there is something suboptimal about our recipe. Let’s start by taking a look at the available execution engines for the sync visual recipe. In this case, there is a specific fast-path engine for Snowflake-to-S3 syncing, but we’re using the Spark engine instead.

Let’s review the requirements to use the fast-path engine from Snowflake-to-S3 and make sure we meet all of them. If we don’t already, it’s worth investigating if we can modify our setup so that we do meet all of the requirements and are able to leverage this engine.

For example, if the input dataset is a SQL query dataset, you won’t be able to use the optimal Snowflake-to-S3 engine. In this case, one option would be to alter your Flow so that the input dataset is a SQL table dataset and there is an additional SQL Query recipe that performs any necessary transformations on top of the initial dataset. This would then allow the final Sync recipe step to use the optimal Snowflake to S3 engine.

After reviewing the requirements and switching to the Snowflake-to-S3 engine, we see that the job’s runtime is significantly reduced.

Visual recipe with Snowflake to S3 engine configured.

Use Case 2: A Python or PySpark Job is Taking Several Hours to Complete

  • What: Only this single Python recipe is taking longer than expected. The job completes successfully without any errors, but it’s just slower than we would like.

  • Who: Both the user who created the recipe and an admin have executed this job, and both observed that the job takes several hours to complete.

  • When: The user has tried executing the job several times over the course of the last couple days, so it doesn’t appear to be specific to any particular time.

  • Where: This is a Python job that is executed locally, meaning that it runs on the DSS server itself.

Troubleshooting Steps

We know the issue is specific to this job, so this isn’t a generalized performance issue. We also know that the execution time is consistently slow.

To start, let’s evaluate:

  • The size of our input and output data

  • What kind of transformations the recipe is performing

  • The kind of dataset of our input and output datasets (SQL, Cloud Storage etc.)

If you are reading from a dataset that has slow read times or writing out to a dataset type that has slow write times, it’s possible that switching the input or output dataset to a different connection type could improve performance.

DSS contains a number of optimizations that will sometimes handle this for you, but it’s worth checking whether your data format is optimal. In particular, if you are writing out to a database, you might find that writing out to a cloud storage dataset is actually more performant. In addition, most analytics databases have a fast-write option you can enable in the database connection settings. If you are writing to Snowflake, BigQuery, or Redshift, you’ll want to look into this as a first step.

It can help to approach a problem like this by testing out a couple different iterations to narrow down specifically where the issue is. For example, if you test using both S3 and PostgreSQL input and output datasets, this might help narrow down if the issue is related to the dataset type or not. If you see no difference in performance by changing the dataset type, this points to a performance issue with the code itself and the transformations in the code rather than a read or write issue.

One good way to troubleshoot if the actual transformations in the Python recipe are the cause for the slow performance is to:

  • Convert the recipe to a notebook.

  • Run the code in the notebook line by line (or a small set of lines at a time).

  • Add %%time as the first line of each cell before running a small chunk of code. This shows how long each section of your code transformations takes.

This will usually help determine exactly which part of the code is taking a long time to compute. We can then investigate if we can optimize this piece of code to make it more efficient.
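When the slow section contains many function calls, a profiler can complement cell-by-cell timing. This is a different technique from the %%time approach described above: the sketch below uses Python’s built-in cProfile module, and `slow_step`, `fast_step`, and `transform` are placeholder functions standing in for real recipe code.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Placeholder for an expensive transformation."""
    return sum(i * i for i in range(n))


def fast_step(n):
    """Placeholder for a cheap transformation."""
    return n * 2


def transform():
    return slow_step(200_000) + fast_step(10)


profiler = cProfile.Profile()
profiler.enable()
result = transform()
profiler.disable()

# Sort by cumulative time and keep the five most expensive entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Functions near the top of the report account for most of the runtime and are the best place to start optimizing.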

If you notice instead that a PySpark job is taking a long time to complete, you can take a somewhat similar approach. Because Spark will not execute transformations until an action is called, you can narrow down if a specific transformation is particularly slow by forcing an action after each transformation or small subset of transformations.

For more information, visit the Apache Spark documentation on transformations and actions.

For example, let’s say our code uses PySpark DataFrames and performs a number of transformations on a DataFrame. A helpful way to troubleshoot this is by adding a df.show() line after every few transformations, like this: pyspark_df.show(5). Note that show() prints the rows itself and returns None, so there is no need to wrap it in print().

This will allow you to narrow down if a particular set of transformations are particularly slow. Once you’ve narrowed this down, you can troubleshoot this specific piece of code and work on optimizing it. Once you’ve successfully optimized it, you can remove all of the df.show() lines, as this method is intended for troubleshooting.
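To attach a duration to each forced action, the call can be wrapped in a small timing helper. The sketch below is an assumption-laden illustration: `timed_action` is a hypothetical helper, and `FakeFrame` is a stand-in object so that the example runs without a Spark session. With a real PySpark DataFrame, `count()` is an action and triggers execution of the pending transformations.

```python
import time


def timed_action(label, df):
    """Force evaluation by calling count() and report how long it took."""
    start = time.perf_counter()
    row_count = df.count()  # on a PySpark DataFrame, count() is an action
    elapsed = time.perf_counter() - start
    print(f"{label}: {row_count} rows, {elapsed:.2f}s")
    return row_count, elapsed


class FakeFrame:
    """Stand-in for a DataFrame so this sketch runs without Spark."""

    def __init__(self, rows):
        self.rows = rows

    def count(self):
        return len(self.rows)


# With PySpark, pass the DataFrame produced by the latest transformations.
count, secs = timed_action("after filter", FakeFrame([1, 2, 3]))
```

Keep in mind that each action re-runs all preceding transformations unless the DataFrame is cached, so the timings are cumulative; compare consecutive measurements to estimate the cost of each step.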

Instance Performance Troubleshooting

Use Case 3: The DSS UI is Slow to Load For All Users

  • What: The DSS UI is taking about a minute to load across all projects and different parts of DSS. This is not specific to viewing a specific project Flow or attempting to view a specific dataset; it is slow across the board for all users.

  • Who: Multiple users have reported that DSS is taking a minute or more to load for them.

  • When: We noticed that this issue started yesterday. Interestingly, we did restart DSS this morning and everything seemed fine again. However, about an hour ago, we started to experience slow load times again.

  • Where: This issue is with the DSS UI, so the issue is likely restricted to the DSS server and not related to externally processed jobs.

Troubleshooting Steps

This issue is impacting all users on the DSS server, so it’s probably not specific to a user’s environment or an individual project. At this point, you can perform some initial investigations of the DSS server. We can break this down into a couple of different steps:

  • Check resource usage on the DSS server. It’s always good to do some brief checks to see if you might be facing a resource issue on your server.

Run the following checks to see what processes are running on your DSS server and how much space you have available:

  • ps auxf

  • top

  • df -h
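If you would rather collect similar information from a script, the Python standard library exposes some of it. The sketch below is a rough counterpart to df -h and the load average shown by top, not a replacement for those commands; the `disk_report` helper is a hypothetical name, and the path "/" is an assumption that should be changed to the filesystem that holds your DSS data directory.

```python
import os
import shutil


def disk_report(path="/"):
    """Return free space and percentage used for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "free_gb": usage.free / 1024**3,
        "used_pct": used_pct,
    }


report = disk_report("/")
print(f"{report['path']}: {report['free_gb']:.1f} GB free, {report['used_pct']:.1f}% used")

# Load average is only available on Unix-like systems.
try:
    load1, load5, load15 = os.getloadavg()
    print(f"load average: {load1:.2f} {load5:.2f} {load15:.2f}")
except (AttributeError, OSError):
    print("load average not available on this platform")
```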

  • Do some initial investigation of the DSS backend logs.

DSS has detailed logging which can help you diagnose what might be happening. As a general tip, you can tail the backend logs if you are ever trying to identify what’s currently impacting the DSS server at a time of slowness:

tail -f <DATA_DIR>/run/backend.log

If you are seeing a slow UI issue specifically, you also might want to check if you are running into a garbage collection issue on the server. A quick way to do this is by running the following command:

grep "Full GC" <DATA_DIR>/run/backend.log | grep -v JEK | grep -v FEK

It’s common to see entries returned with this command, even if everything is fine on your DSS server. However, it is a problem if each entry takes several seconds, as this will create a lag in the UI. For example, the following shows garbage collection entries that each take roughly 30 to 40 seconds:

36401.998: [Full GC (Allocation Failure)  12268M->12266M(12288M), 29.3987391 secs]

36431.480: [Full GC (Allocation Failure)  12268M->12266M(12288M), 39.1208729 secs]

36470.651: [Full GC (Allocation Failure)  12268M->12266M(12288M), 39.0166883 secs]

This means your DSS server is encountering a memory issue. Sometimes, this means that increasing your DSS backend.xmx memory setting is necessary.
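When there are many garbage collection entries, a script can pick out the long pauses. The sketch below assumes the log format shown in the sample entries above; the `long_gc_pauses` helper is a hypothetical name, and the 5-second threshold is an arbitrary cutoff chosen here to represent “several seconds.”

```python
import re

# Matches entries such as:
# [Full GC (Allocation Failure)  12268M->12266M(12288M), 29.3987391 secs]
FULL_GC = re.compile(
    r"\[Full GC \([^)]*\)\s+(\d+)M->(\d+)M\((\d+)M\), ([\d.]+) secs\]"
)


def long_gc_pauses(log_text, threshold_secs=5.0):
    """Return Full GC entries whose pause is at least `threshold_secs`."""
    pauses = []
    for match in FULL_GC.finditer(log_text):
        secs = float(match[4])
        if secs >= threshold_secs:
            pauses.append({
                "before_mb": int(match[1]),
                "after_mb": int(match[2]),
                "heap_mb": int(match[3]),
                "secs": secs,
            })
    return pauses


sample = (
    "36401.998: [Full GC (Allocation Failure)  12268M->12266M(12288M), 29.3987391 secs] "
    "36431.480: [Full GC (Allocation Failure)  12268M->12266M(12288M), 39.1208729 secs] "
    "36470.651: [Full GC (Allocation Failure)  12268M->12266M(12288M), 39.0166883 secs]"
)

pauses = long_gc_pauses(sample)
for pause in pauses:
    reclaimed = pause["before_mb"] - pause["after_mb"]
    print(f"{pause['secs']:.1f}s pause, reclaimed {reclaimed}M of {pause['heap_mb']}M heap")
```

In the sample entries, each collection reclaims only 2M of a 12288M heap, which is consistent with a backend that is running out of memory.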

It’s also possible that a user performed a particular action that caused a memory issue on the server. For example, let’s say a user attempted to view an entire 30GB dataset in the Explore tab of DSS by increasing the sample size setting to view the entire dataset.

This can cause a performance issue on the system.

You will want to educate your users on why the sample size defaults to 10,000 rows: the default exists to prevent this kind of negative performance impact.

How to Get Support

If you still need help diagnosing an issue, reach out to support.

Please generate a diagnostic report on your job or instance issue and send it to support for troubleshooting.

If DSS is down, you can ssh into your DSS server and create a diagnostic from the command line using this command: <DATA_DIR>/bin/dssadmin run-diagnosis -cfls diag.zip