Troubleshoot | Diagnosing instance-wide performance
You could encounter different kinds of instance-wide issues on your DSS instance.
For example, you might encounter multiple jobs appearing to slow down when executed on your DSS node. This is most commonly the case when there are multiple Python processes running on the same instance, including Jupyter notebooks.
You can prevent this situation by ensuring that cgroups are set up on your instance.
If cgroups are not yet enabled on your instance, nothing caps the RAM and CPU that a single Python process can consume. Let’s say that there are multiple Python processes running on the same instance. Each Python job can still consume as much RAM and CPU as is available to it, but the available resources might be considerably lower at certain times, depending on what else is running on the DSS instance.
While a job may still run and complete successfully, this competition for resources between concurrently running Python processes can increase the runtime of each job whenever the jobs become bottlenecked by the resources available on the instance.
In addition, if one user is running an extremely intensive job and the instance does not have cgroups set up, this single job can cause slowdown for all other jobs executed on the instance at the same time.
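To check whether concurrent Python processes are in fact competing for resources, it can help to take a snapshot of the heaviest ones on the DSS host. The following is a minimal sketch, not a DSS feature: it assumes a Linux host where `ps -eo pid,rss,pcpu,comm` is available, and the helper names (`parse_ps_output`, `top_python_processes`, `snapshot`) are hypothetical.

```python
import subprocess
from typing import List, NamedTuple


class ProcUsage(NamedTuple):
    pid: int
    rss_mb: float       # resident memory, converted from KiB to MiB
    cpu_percent: float  # CPU utilization as reported by ps
    command: str


def parse_ps_output(text: str) -> List[ProcUsage]:
    """Parse the output of `ps -eo pid,rss,pcpu,comm`, header line included."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore malformed lines
        pid, rss_kb, pcpu, comm = parts
        rows.append(ProcUsage(int(pid), int(rss_kb) / 1024, float(pcpu), comm.strip()))
    return rows


def top_python_processes(rows: List[ProcUsage], limit: int = 5) -> List[ProcUsage]:
    """Return the Python processes using the most resident memory."""
    # Assumption: Python jobs and Jupyter kernels show up with a command
    # name that starts with "python" (python, python3, python3.9, ...).
    python_rows = [r for r in rows if r.command.startswith("python")]
    return sorted(python_rows, key=lambda r: r.rss_mb, reverse=True)[:limit]


def snapshot(limit: int = 5) -> List[ProcUsage]:
    """Run ps on the local host and return the heaviest Python processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,rss,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return top_python_processes(parse_ps_output(out), limit)
```

Calling `snapshot()` a few times while jobs are slow gives a rough picture of whether one process dominates the instance or many processes are sharing it.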
You can take several steps to prevent this situation:
Enable cgroups on your instance in order to control memory consumption of processes on your DSS server.
Set up automatic termination of Jupyter notebooks on a regular cadence. Jupyter notebook sessions are not terminated automatically, so notebooks that users leave running can put unexpected load on the DSS server. To prevent this, create a scenario that runs on a regular basis (e.g., weekly) and executes the Macro step “Kill Jupyter Sessions.”
Offload jobs to a Kubernetes cluster. This allows you to scale resources up and down as you need them, so that users can run more memory-intensive jobs without interrupting other jobs. You will likely want to set the memory request and CPU request parameters so that each job is allocated appropriate resources on your cluster, which prevents similar competition between processes on your cluster nodes. For more information, see the Kubernetes documentation on memory request and CPU request parameters.
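When choosing memory and CPU request values, a quick sanity check is whether the requests of the jobs you expect to run concurrently fit on a single node. The sketch below is illustrative only: it parses standard Kubernetes quantity strings (such as `4Gi` or `500m`), but the function names and the example node sizes are assumptions and are not part of DSS or the Kubernetes client.

```python
# Multipliers for common Kubernetes memory suffixes (binary and decimal).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '4Gi' to bytes."""
    # Check two-letter suffixes first so that 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _MEMORY_UNITS[suffix])
    return int(quantity)  # a plain number means bytes


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores
    return float(quantity)


def fits_on_node(job_requests, node_memory: str, node_cpu: str) -> bool:
    """Check whether the summed requests fit within one node's capacity."""
    total_memory = sum(parse_memory(r["memory"]) for r in job_requests)
    total_cpu = sum(parse_cpu(r["cpu"]) for r in job_requests)
    return (total_memory <= parse_memory(node_memory)
            and total_cpu <= parse_cpu(node_cpu))


# Hypothetical example: two concurrent jobs on a node with 8Gi and 4 cores.
jobs = [
    {"memory": "4Gi", "cpu": "2"},
    {"memory": "2Gi", "cpu": "500m"},
]
print(fits_on_node(jobs, node_memory="8Gi", node_cpu="4"))
```

In practice, the capacity available to pods is lower than a node's raw size because the system reserves part of it, so leave some headroom when applying a check like this.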