Troubleshoot | Python or PySpark job takes several hours to complete#

  • What: Only this single Python recipe is taking longer than expected. The job completes successfully without any errors, but it’s just slower than we would like.

  • Who: Both the user who created the recipe and an admin have executed this job, and both observed that the job takes several hours to complete.

  • When: The user has tried executing the job several times over the course of the last couple days, so it doesn’t appear to be specific to any particular time.

  • Where: This is a Python job that runs locally, meaning it executes directly on the DSS server.

Troubleshooting steps#

We know the issue is specific to this job, so this isn’t a generalized performance issue. We also know that the execution time is consistently slow.

To start, let’s evaluate:

  • The size of our input and output data

  • What kind of transformations the recipe is performing

  • The type of our input and output datasets (SQL, cloud storage, etc.)

If you are reading from a dataset that has slow read times or writing out to a dataset type that has slow write times, it’s possible that switching the input or output dataset to a different connection type could improve performance.

DSS includes a number of optimizations that can sometimes handle this for you, but it’s worth checking whether your data format is optimal. In particular, if you are writing out to a database, you might find that writing out to a cloud storage dataset is actually more performant. In addition, most analytics databases have a fast-write option you can enable in the database connection settings. If you are writing to Snowflake, BigQuery, or Redshift, looking into this is a good first step.

It can help to approach a problem like this by testing a few different iterations to narrow down where the issue lies. For example, if you test using both S3 and PostgreSQL input and output datasets, this can help establish whether the issue is related to the dataset type. If you see no difference in performance after changing the dataset type, this points to a performance issue with the code and its transformations rather than a read or write issue.
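Another rough way to separate read, transform, and write time, without changing any connections, is to wrap each phase of the recipe in a timer. Here is a minimal sketch assuming a standard DSS Python recipe; the dataset names and the placeholder transformation are hypothetical:

    import time
    import dataiku

    t0 = time.perf_counter()
    df = dataiku.Dataset("my_input").get_dataframe()  # hypothetical input dataset
    print(f"Read: {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    df = df.dropna()  # placeholder for the recipe's real transformations
    print(f"Transform: {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    dataiku.Dataset("my_output").write_with_schema(df)  # hypothetical output dataset
    print(f"Write: {time.perf_counter() - t0:.1f}s")

If the read or write phase dominates, that points back to the dataset or connection type; if the transform phase dominates, the code itself is the likely bottleneck.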

One good way to troubleshoot whether the transformations in the Python recipe are the cause of the slow performance is to:

  • Convert the recipe to a notebook.

  • Run the code in the notebook line by line (or a small set of lines at a time).

  • Add %%time at the top of each cell (it must be the first line of the cell) before executing each small chunk of code. This reports how long each section of your code transformations takes, as shown in the sketch after this list.
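For instance, a minimal sketch of one timed notebook cell (assume df was loaded in an earlier cell; the column names are hypothetical):

    %%time
    # %%time must be the first line of the cell; Jupyter prints the cell's CPU and wall time
    df["revenue"] = df["price"] * df["quantity"]
    df = df.groupby("customer_id", as_index=False)["revenue"].sum()

Comparing the reported wall times across cells shows which chunk dominates the overall runtime.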

This will usually help determine exactly which part of the code is taking a long time to compute. We can then investigate if we can optimize this piece of code to make it more efficient.

If you notice instead that a PySpark job is taking a long time to complete, you can take a similar approach. Because Spark does not execute transformations until an action is called, you can narrow down whether a specific transformation is slow by forcing an action after each transformation or small group of transformations.

For more information, visit the Apache Spark documentation on transformations and actions.

For example, let’s say our code uses PySpark DataFrames and performs a number of transformations on one of them. A helpful way to troubleshoot this is to add a call such as pyspark_df.show(5) after every few transformations. Note that show() prints its output itself and returns None, so it does not need to be wrapped in print().
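A minimal sketch of this pattern (the input path, column names, and transformations are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    pyspark_df = spark.read.parquet("/path/to/input")  # hypothetical input

    # Transformation group 1
    pyspark_df = pyspark_df.filter(F.col("amount") > 0)
    pyspark_df = pyspark_df.withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))
    pyspark_df.show(5)  # forces an action; note how long this takes

    # Transformation group 2
    pyspark_df = pyspark_df.groupBy("customer_id").agg(F.sum("amount_usd").alias("total_usd"))
    pyspark_df.show(5)  # if this action is much slower, the aggregation is the likely culprit

Because each show(5) forces an action, Spark evaluates the lazy transformations up to that point, so a large jump in elapsed time between two show() calls localizes the slow group of transformations.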

This will allow you to narrow down whether a particular group of transformations is the slow one. Once you’ve identified it, you can troubleshoot that specific piece of code and work on optimizing it. After optimizing, remove all of the show() lines, as they are intended only for troubleshooting.