Product Pillar: Performance Scalability¶
Dataiku believes that enterprises must be able to execute their own path to AI in ways that make sense for the history, constraints, and goals of an organization. Achieving this mission requires providing full flexibility in the deployment of AI solutions in terms of data location, security, scalability, network availability, and infrastructure costs. Accordingly, DSS provides a flexible computation environment that can allocate large-scale computational resources on-demand and optimize the runtime of any workload with performant, scalable technologies.
Alongside removing computational bottlenecks, DSS also seeks to optimize user performance. DSS offers a number of reusable components that save time for both clickers and coders, whether copying and pasting Flow objects or reusing code libraries, environments, and plugins.
As an enterprise advances on its path to AI, it often finds itself attempting a greater number of projects, using larger datasets, for purposes that require more intensive computation, and which must be accessible to more concurrent users.
DSS solves this problem by offloading computation from the DSS server. As will be explained in greater detail in later courses, DSS does not ingest more than a sample of a dataset. Datasets in DSS only point to a location in an organization’s existing storage infrastructure. Rather than execute intensive processing itself, DSS can delegate the heavy lifting to more appropriate computational resources.
Depending on the specific data sources and recipes at hand, DSS automatically chooses the most effective execution engine among those available (such as local, in-database, or Spark). This is possible because DSS, whether visually or through code, connects to the underlying data storage infrastructure and pushes down the calculations to that infrastructure in order to minimize data movement over the network. For example, typical ETL work can be pushed down to a SQL database or Spark cluster. Notebook processing can be offloaded to containers.
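The push-down idea can be illustrated outside of DSS with a minimal sketch: rather than loading every row into the application and aggregating there, the aggregation is expressed as SQL and executed where the data lives, so only the small result set travels back. The in-memory SQLite database and the `orders` table below are purely illustrative stand-ins for an organization's own SQL engine.

```python
import sqlite3

# A stand-in for the organization's SQL database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)],
)

# Push the computation down: the database performs the GROUP BY,
# and only two summary rows cross the "network".
summary = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # [('APAC', 50.0), ('EMEA', 200.0)]
```

The same principle applies at scale: whether the engine is a SQL warehouse or a Spark cluster, the query moves to the data instead of the data moving to the query.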
This Join recipe joins two datasets stored in HDFS, and so it uses the Spark engine.
Taking advantage of the push-down computation philosophy is the first step to optimizing the runtime of a Flow in DSS. To further increase the efficiency of a project Flow, DSS supports Spark and SQL pipelines and partitioned data.
When running on Spark or an SQL database, a long chain of recipes can be executed without building intermediate output. This helps avoid typical cycles of reading and writing data at each intermediate step.
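Conceptually, such a pipeline behaves like chained lazy transformations: each step consumes the previous step's output record by record, and nothing is materialized until the final output. A rough Python analogy using generators (the step names below are invented for illustration):

```python
def parse(lines):
    # Step 1: split raw lines into (name, value) pairs.
    for line in lines:
        name, value = line.split(",")
        yield name, float(value)

def keep_positive(rows):
    # Step 2: filter, without writing an intermediate dataset.
    for name, value in rows:
        if value > 0:
            yield name, value

def to_euros(rows, rate=0.9):
    # Step 3: convert; again streamed, not stored.
    for name, value in rows:
        yield name, round(value * rate, 2)

raw = ["a,10", "b,-3", "c,20"]
# Only this final result is ever materialized.
result = list(to_euros(keep_positive(parse(raw))))
print(result)  # [('a', 9.0), ('c', 18.0)]
```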
When working with big data in production, records that have already been processed once typically do not need to be processed again simply because new records have arrived. Partitioned datasets in DSS allow users to segment the data by date or a discrete value and process only the records that need to be worked on.
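The logic of partitioned processing can be sketched in plain Python: records are grouped by a partition key (here a date), and a rebuild touches only the requested partition, leaving all previously processed partitions untouched. All function and field names below are invented for illustration.

```python
from collections import defaultdict

def partition_by(records, key):
    # Group records into partitions by a discrete value (e.g. a date).
    parts = defaultdict(list)
    for rec in records:
        parts[rec[key]].append(rec)
    return parts

def rebuild(partitions, partition_id, transform):
    # Recompute only the partition that received new records.
    return [transform(rec) for rec in partitions.get(partition_id, [])]

records = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-02", "amount": 20},
    {"day": "2024-01-02", "amount": 5},
]
parts = partition_by(records, "day")

# Only the 2024-01-02 partition is processed; 2024-01-01 is left as-is.
out = rebuild(parts, "2024-01-02", lambda r: r["amount"] * 2)
print(out)  # [40, 10]
```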
The Path to Elastic AI¶
Pushing down computation to the underlying storage infrastructure will not remove resource bottlenecks if that infrastructure is not fully equipped for the task. To advance on the path to Enterprise AI, organizations often need elastic computing resources.
The advent of cloud computing has enabled systems that can automatically scale up and scale down computational resources according to current demand. The unique data and computation abstraction approach of DSS can support organizations in moving towards full AI elasticity. Moreover, this power can be provided to not only advanced coders and system administrators, but also to business users.
Organizations pursuing elastic AI may find that a hybrid infrastructure of on-premises and cloud deployments allows them to allocate large-scale computational resources on-demand, while maintaining security and governance.
Elastic compute promises several key advantages, allowing users to:
Scale local processes beyond a single DSS node; for example, when running Python and R recipes, or training and re-training in-memory machine learning models.
Leverage processing nodes that may have different computing capabilities; for example, to leverage remote machines with GPUs in order to build deep learning models.
Restrict resource usage with Kubernetes’ resource management capabilities.
To bring these advantages to enterprises, DSS is a fully-managed Kubernetes solution compatible with all of the major cloud container services (Amazon EKS, Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS)), as well as with on-premises Kubernetes/Docker clusters. Enterprises can scale workloads across multiple clusters. For example, users can create a DSS project that fully automates the lifecycle of an EMR cluster: creating the cluster, running the project’s job activities on it, and then destroying it.
Another dimension of performance scalability is minimizing repeated or duplicated effort so that users can focus on more productive tasks. DSS has features that help both clickers and coders maximize reuse of past work.
To avoid duplication of work, project Flows in DSS, including visual components, are reusable and portable. Individual steps of a Prepare recipe can be copied and pasted in the Flow for use elsewhere in the project. These steps, or entire sections of a Flow (datasets and recipes together), can also be shared with other projects, allowing users to rename and re-tag objects in the process.
We might be interested in working separately on a specific piece of this project, such as the time series revenue forecast. We could use the View dropdown menu to filter the Flow for DSS items with a specific tag and then copy those objects to a new project.
For coders, code notebooks come with pre-built templates to help users get started quickly performing tasks such as principal component analysis, topic modeling, or time series forecasting. Users can also define their own templates for shared use across the enterprise. Moreover, coders can contribute to a shared repository of code samples to prevent duplication of work across teams and projects.
This instance has dozens of Python code snippets ready to be copied and pasted into a Jupyter notebook. This particular code sample writes a PySpark dataframe to a DSS dataset.
Reusing code in DSS, however, goes far beyond sharing helpful snippets. DSS provides several other mechanisms for reusing Python and R code. Through the Code Libraries menu, users can package code as functions or modules and make them available in specific projects or across all projects. Users can also import code by cloning a library from a remote Git repository. Coders can take code reuse a step further by packaging functionality into plugins that can be used by anyone, as previously discussed.
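As a minimal illustration of the library idea (the module path and function name below are hypothetical, not DSS APIs), shared logic lives in one place and every recipe or notebook imports it rather than re-implementing it:

```python
# In a real project, this function would live in a shared library module
# (e.g. lib/python/cleaning.py, a hypothetical path) and be imported by
# every recipe and notebook that needs it.
def normalize_customer_id(raw):
    """Canonicalize a customer identifier: trim, uppercase, zero-pad to 8."""
    return raw.strip().upper().zfill(8)

# Each recipe then calls the shared function instead of re-implementing it,
# so a fix or change to the rule propagates everywhere at once.
ids = [" ab12 ", "cd34"]
print([normalize_customer_id(i) for i in ids])  # ['0000AB12', '0000CD34']
```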
We can use the Code Libraries menu of our project to manage the project’s codebase or import additional code from remote Git repositories.
In order to share reusable code, users also need a way to manage the environment in which each piece of code runs. DSS allows users to create and manage an arbitrary number of code environments on an instance. Individual notebooks, recipes, plugins, web apps, and visual machine learning models can all have their own code environments running specific versions of Python or R, along with any corresponding packages.
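The core contract of a code environment, namely that a given piece of code runs against known package versions, can be sketched as a simple pin check. The helper below is an invented illustration of that idea, not a DSS API:

```python
def missing_pins(required, installed):
    """Return the pins that the current environment does not satisfy.

    required:  mapping of package name -> exact pinned version
    installed: mapping of package name -> installed version (absent if missing)
    """
    return {
        pkg: pin
        for pkg, pin in required.items()
        if installed.get(pkg) != pin
    }

# A hypothetical environment spec, like one code environment's package list.
required = {"pandas": "1.5.3", "scikit-learn": "1.2.2"}
installed = {"pandas": "1.5.3"}  # scikit-learn not installed

print(missing_pins(required, installed))  # {'scikit-learn': '1.2.2'}
```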
This project uses a code environment running Python 2.7. Here we can find the full list of installed packages and permissions.