Dataiku Knowledge Base

User Guide

  • Getting Started
    • Quick Starts
      • Quick Start | Dataiku for data preparation
      • Quick Start | Dataiku for machine learning
      • Quick Start | Dataiku for MLOps
      • Quick Start | Dataiku for AI collaboration
      • Quick Start | Excel to Dataiku
        • Concept | From Excel to Dataiku
      • Quick Start | Alteryx to Dataiku
      • Quick Start | Dataiku for manufacturing data preparation and visualization
    • Dataiku User Interface
      • Concept | Dataiku Cloud Launchpad
      • Concept | Dataiku Design homepage
      • Concept | Project
      • Concept | Flow
      • Concept | Searching in Dataiku
      • Concept | Flow views, search, and filter
      • Tutorial | Explore the Flow
      • Tutorial | Flow zones
      • Reference | Navigation bar
      • Reference | Right panel navigation
      • How-to | Duplicate a Dataiku project
      • How-to | Find the Dataiku version
      • How-to | Rearrange Flow zones
      • Tip | Flow navigation shortcuts
      • Tip | Anchoring for Flow management
      • Tip | Hide or show Flow items
      • Tip | Using project folders
  • Data Sourcing
    • Data Connections
      • Concept | Data connections
      • Concept | Architecture model for databases
      • Concept | Connection changes
      • Tutorial | Configure a connection between Dataiku and an SQL database
      • Tutorial | Data transfer with visual recipes
      • Reference | A primer on connecting to data sources
      • How-to | Remap a connection when importing a project to a Dataiku instance
      • How-to | Utilize MS Access
    • Dataiku Datasets
      • Concept | Dataiku datasets
      • Concept | Dataset characteristics
      • Concept | Sampling on datasets
      • Concept | Dataset conditional formatting
      • Concept | Analyze data quality in the Explore tab
      • Tutorial | Getting started with datasets
      • How-to | Rename a dataset
      • How-to | Reorder or hide dataset columns
      • How-to | Export a filtered dataset
      • How-to | Apply a filter to summary statistics in the Analyze window
      • Tip | Good dataset naming schemes
      • FAQ | Why can’t I drag a folder into Dataiku?
      • FAQ | Where can I see how many records are in my entire dataset?
  • Data Preparation
    • Visual Recipes
      • Concept | Recipes in Dataiku
      • Concept | Sync recipe
      • Concept | Group recipe
      • Concept | Join recipe
      • Concept | Distinct recipe
      • Concept | Pivot recipe
      • Concept | Sample/Filter recipe
      • Concept | Sort recipe
      • Concept | Split recipe
      • Concept | Stack recipe
      • Concept | Top N recipe
      • Concept | Window recipe
      • Concept | Fuzzy join recipe
      • Concept | Geo join recipe
      • Concept | Labeling recipe
      • Concept | Common steps in visual recipes: Pre-filter, Post-filter, & Computed columns
      • Concept | Dynamic dataset and recipe repeat
      • Concept | Generate recipes using Generative AI
      • Tutorial | Group recipe
      • Tutorial | Join recipe
      • Tutorial | Distinct recipe
      • Tutorial | Pivot recipe
      • Tutorial | Top N recipe
      • Tutorial | Window recipe
      • Tutorial | Fuzzy join recipe
      • Tutorial | In-database data visualization and preparation
      • Tutorial | Geo join recipe
      • Tutorial | Compute isochrones and routes with the Geo Router plugin
      • Tutorial | Working with shapefiles and US census data
      • Tutorial | Dynamic recipe repeat
      • How-to | Insert or delete a recipe within the Flow
      • How-to | Segment your data using statistical quantiles
    • Prepare Recipe
      • Concept | Prepare recipe
      • Tutorial | Prepare recipe
      • Tutorial | Smart pattern builder for string pattern extraction
      • Tutorial | Visual logic processors for data preparation
      • Tutorial | Geographic processors
      • Tutorial | Enrich web logs in the Prepare recipe
      • Reference | Performing joins in the Prepare recipe
      • Reference | Using custom Python functions in the Prepare recipe
      • Reference | Handling decimal notations
      • How-to | Normalize number formats in a Prepare recipe
      • How-to | Handle accounting-style negative numbers
      • How-to | Copy-paste Prepare recipe steps
      • How-to | Apply Prepare steps to multiple columns
      • How-to | Standardize text fields using fuzzy values clustering
      • How-to | Reshape data from wide to long format
      • How-to | Generate Prepare recipe steps with AI
    • Dataiku Formulas
      • Concept | Dataiku formulas
      • Concept | Dataiku formulas cheat sheet
      • Concept | Safe sums across columns in Dataiku formulas
      • Tutorial | Relative referencing in Dataiku formulas
      • How-to | Remove scientific notation in a column
      • How-to | Pad a number with leading zeros
      • How-to | Fill empty cells of a column with the value of the corresponding row from another column
      • FAQ | In a formula, how can I check if a variable belongs to a set of values?
    • Data Pipelines & Computation Engines
      • Concept | Computation engines
      • Concept | Build modes
      • Concept | Data pipeline optimization
      • Concept | Where computation happens in Dataiku
      • Tutorial | Build modes
      • Tutorial | Recipe engines
      • How-to | Access job information
      • How-to | Enable SQL pipelines in the Flow
    • The Lab
      • Concept | Visual analyses in the Lab
      • Tutorial | Visual analyses in the Lab
    • Managing Dates
      • Concept | Date handling in Dataiku
      • Reference | How Dataiku handles and displays date and time
    • From Excel to Dataiku
      • Tutorial | Relative referencing in Dataiku formulas
      • How-to | Work with editable datasets
      • How-to | Import an Excel workbook
      • Reference | Data cleaning
      • Reference | Using formulas
      • Reference | Working with dates
      • Reference | Removing duplicates
      • Reference | Filtering rows
      • Reference | Sampling rows
      • Reference | Split a dataset
      • Reference | Append datasets
      • Reference | Joining datasets
      • Reference | Aggregate and pivot
      • Reference | Sorting values
      • Reference | Top values
    • From Alteryx to Dataiku
      • Reference | Alteryx to Dataiku concept mapping
  • Data Visualization
    • Charts
      • Concept | Charts
      • Concept | In-database charts
      • Tutorial | Charts
      • Tutorial | Pivot tables
      • Tutorial | Paneled and animated charts
      • Tutorial | Custom aggregation for charts
      • Tutorial | No-code maps
      • FAQ | How do I display non-aggregated metrics in charts?
      • FAQ | How do I sort on a measure not displayed in charts?
    • Dashboards
      • Concept | Dashboards
      • Tutorial | Use dashboards to build reports
      • Tutorial | Dashboard management
      • How-to | Manage sampling on insights
      • Reference | Understand source data for filters
      • Troubleshoot | Can’t display a web content insight in a dashboard
    • Webapps
      • Concept | Webapps
      • How-to | Display an image in a Bokeh webapp
    • Static Insights
      • Concept | Static insights
      • Tutorial | Static insights
    • Visualization Plugins
      • Concept | Data visualization plugins
  • Collaboration
    • Collaboration Overview
      • Concept | Collaboration
    • Wikis & Flow Documentation
      • Concept | Explain the Flow with generative AI
      • Concept | Workflow documentation in a wiki
      • Reference | Using the project wiki
      • Reference | Sharing and promoting wikis
      • How-to | Create a wiki article
      • How-to | Export a wiki to a PDF
      • How-to | Generate and export Flow documentation
      • Tip | Link Dataiku objects in a wiki article
    • Tags & Object Descriptions
      • Concept | Tags
      • Tip | Suggestions for using tags
      • Tip | Commenting to document Dataiku objects
    • Sharing Projects & Dataiku Assets
      • Concept | Project permissions and asset sharing
      • Concept | Data Catalog
      • Reference | Managing project access
      • How-to | Set up limited access to projects
      • How-to | Manage project access requests
      • How-to | Share project to non-Dataiku users
      • How-to | Manage object sharing
      • How-to | Enable quick sharing of datasets and objects
      • How-to | Copy Flow items to a new or existing project
    • Discussions
      • Concept | Discussions
      • Reference | Managing discussions
      • How-to | Start discussions in a Dataiku object
    • Workspaces
      • Concept | Workspaces
      • Reference | Centralized versus delegated workspaces
      • How-to | Create a workspace
      • How-to | Share a workspace to non-Dataiku users
    • Project Version Control
      • Concept | Version control for Dataiku projects
      • Tutorial | Git for projects
      • How-to | Undo actions in Dataiku
    • Stories
      • Concept | Dataiku stories
      • Tutorial | Dataiku stories with Generative AI
      • Tutorial | Dataiku stories
      • Reference | Story user interface
      • How-to | Enable Story AI
      • How-to | Import a story
  • Data Quality & Automation
    • Variables
      • Concept | Variables in Dataiku
      • Tutorial | Project variables in visual recipes
      • Tutorial | Coding with variables
    • Data Quality
      • Concept | Metrics
      • Concept | Checks
      • Concept | Data quality rules
      • Concept | Metrics & checks (pre-12.6)
      • Concept | Data lineage
      • Tutorial | Data quality
      • Tutorial | Custom metrics, checks, and data quality rules
      • Tutorial | Data quality and SQL metrics
      • FAQ | What’s the difference between distinct and unique value count metrics?
    • Automation Scenarios
      • Concept | Automation scenarios
      • Concept | Custom metrics, checks, data quality rules & scenarios
      • Tutorial | Automation scenarios
      • Tutorial | Scenario reporters
      • Tutorial | Webhook reporters in scenarios
      • Tutorial | Custom step-based scenarios
      • Tutorial | Custom script scenarios
      • How-to | Automate documentation exports in a scenario
      • How-to | Build missing partitions with a scenario
      • Code Sample | Set a timeout for a scenario build step
      • Code Sample | Set email recipients in a “Send email” reporter
      • FAQ | Can I control which datasets in my Flow get rebuilt during a scenario?
    • Dataiku Applications
      • Concept | Dataiku applications
      • Tutorial | Dataiku applications
      • Reference | Use cases of Dataiku applications
    • Partitioning
      • Concept | Partitioning
      • Concept | How partitioning adds value
      • Concept | Partitioned datasets
      • Concept | Jobs with partitioned datasets
      • Concept | Partitioning by pattern
      • Concept | Partitioning in a scenario
      • Concept | Partition redispatch and collection
      • Tutorial | File-based partitioning
      • Tutorial | Column-based partitioning
      • Tutorial | Partitioning in a scenario
      • Tutorial | Repartition a non-partitioned dataset
      • Tip | Interacting with partitioned datasets using the Python API
  • Machine Learning & Analytics
    • Interactive Statistics
      • Concept | Statistics worksheets
      • Concept | Statistics cards
      • Concept | Generate statistics recipe
      • Concept | Variable types for interactive statistics
      • Concept | Factor and response roles in statistics cards
      • Concept | Statistics cards for fit curves and distributions
      • Concept | Correlation matrices in statistical worksheets
      • Concept | Principal Component Analysis (PCA)
      • Concept | Hypothesis testing
      • Concept | Hypothesis test categories
      • Concept | Grouping variables in statistical testing
      • Concept | Adjustment methods for hypothesis test cards
      • Tutorial | Interactive statistics
      • How-to | Export a statistics card as a recipe
    • Machine Learning Concepts
      • Concept | Introduction to machine learning
      • Concept | Predictive modeling
      • Concept | Model validation
      • Concept | Model evaluation
      • Concept | Regression algorithms
      • Concept | Classification algorithms
      • Concept | Clustering algorithms
    • Feature Engineering
      • Concept | Data preparation for machine learning
      • Concept | Generate Features recipe
      • Tutorial | Generate Features recipe
      • Tutorial | Events aggregator plugin
    • AutoML Model Design
      • Concept | Quick models in Dataiku
      • Concept | The Design tab within the visual ML tool
      • Concept | Features handling
      • Concept | Multimodal ML using LLMs
      • Concept | Feature generation & reduction
      • Concept | Algorithm and hyperparameter selection
      • Concept | ML diagnostics
      • Concept | ML assertions
      • Tutorial | Machine learning basics
      • Tutorial | Model overrides
      • Tutorial | ML diagnostics
      • Tutorial | ML assertions
      • Tutorial | Clustering (unsupervised) models with visual ML
      • Tutorial | MLlib with Dataiku
      • How-to | Distributed hyperparameter search
      • FAQ | How does the AutoML tool automatically select or reject features when training a model?
      • Troubleshoot | In visual ML, I get the error “All values of the target are equal” when they’re not
    • AutoML Model Results
      • Concept | The Result tab within the visual ML tool
      • Concept | Model summaries within the visual ML tool
      • Concept | Explainable AI
      • Concept | Partial dependence plots
      • Concept | Subpopulation analysis
      • Concept | Individual prediction explanations
      • Concept | What if? analysis
      • Concept | Advanced What if? simulators
      • Concept | Interpretation of regression model output
      • Tutorial | Advanced What if simulators
      • Tutorial | Exporting a model’s preprocessed data with a Jupyter notebook
      • How-to | Set up What if analysis for a dashboard consumer
      • FAQ | Why don’t the values in the Visual ML chart match the final scores for each algorithm?
    • Model Scoring
      • Concept | Model deployment to the Flow
      • Concept | Scoring data
      • Concept | Model validation and evaluation
      • Tutorial | Model scoring basics
    • Custom Models in Visual ML
      • Concept | Custom preprocessing within the visual ML tool
      • Concept | Custom modeling within the visual ML tool
      • Concept | Tuning XGBoost models in Python
      • Tutorial | Custom preprocessing & modeling within visual ML
      • Tutorial | Azure AutoML from a Dataiku notebook
    • Time Series
      • Concept | Introduction to time series
      • Concept | Time series data types and formats
      • Concept | Time series components
      • Concept | Objectives of time series analysis
      • Concept | Time series analysis with interactive statistics
      • Concept | Time series preparation
      • Concept | Time series resampling
      • Concept | Time series interval extraction
      • Concept | Time series windowing
      • Concept | Time series extrema extraction
      • Concept | Time series forecasting
      • Tutorial | Time series analysis
      • Tutorial | Time series forecasting (Visual ML)
      • Tutorial | Time series preparation
      • Tutorial | Forecasting time series data with R and Dataiku
      • Tutorial | Deep learning for time series
      • Tutorial | Export preprocessed data (for time series models)
    • Causal Prediction
      • Concept | Causal prediction
      • Tutorial | Causal prediction
    • Text Processing
      • Concept | Regular expressions in Dataiku
      • Concept | Introduction to natural language processing (NLP)
      • Concept | Challenges of natural language processing (NLP)
      • Concept | Cleaning text data
      • Concept | Handling text features for machine learning
      • Tutorial | Build a text classification model
    • Images
      • Concept | Pre-trained image classification models
      • Concept | Optimization of image classification models
      • Concept | Object detection
      • Tutorial | Image classification without code
      • Tutorial | Image classification with code
      • Tutorial | Object detection without code
      • How-to | Prepare images for use in a model
    • Geospatial Analytics
      • Concept | Geo join recipe
      • Tutorial | Geographic processors
      • Tutorial | No-code maps
      • Tutorial | Geo join recipe
      • Tutorial | Compute isochrones and routes with the Geo Router plugin
      • Tutorial | Working with shapefiles and US census data
      • Reference | Overview of Dataiku’s geospatial features
    • Partitioned Models
      • Concept | Partitioned models
      • Tutorial | Partitioned models
      • How-to | Train a stratified or partitioned model
    • Deep Learning
      • Tutorial | Deep learning within visual ML
      • Tutorial | Deep learning for time series
    • Active Learning
      • Tutorial | Active learning for classification problems
      • Tutorial | Active learning for object detection problems
      • Tutorial | Help on active learning webapp
      • Tutorial | Active learning for object detection problems using Dataiku apps
      • Tutorial | Active learning for tabular data classification problems using Dataiku apps
    • Responsible AI
      • Concept | Responsible AI
      • Concept | Dangers of irresponsible AI
      • Concept | Responsible AI in the data science practice
      • Concept | Basics of bias
      • Concept | Model fairness
      • Concept | Evaluating group fairness
      • Concept | Interpretability
      • Concept | Model transparency
      • Concept | Deployment biases
      • Tutorial | Responsible AI training
      • Reference | RAI further reading
  • Generative AI and Large Language Models (LLMs)
    • LLM Administration
      • Concept | LLM connections
      • Concept | Guardrails against risks from Generative AI and LLMs
    • Text Processing with Visual LLM Recipes
      • Concept | Large language models and the LLM Mesh
      • Concept | Classify text recipe
      • Concept | Summarize text recipe
      • Concept | Prompt Studios and Prompt recipe
      • Tutorial | Classify text with Generative AI
      • Tutorial | Summarize text with Generative AI
      • Tutorial | Prompt engineering with LLMs
      • Tutorial | Processing text with the Prompt recipe
    • Retrieval Augmented Generation (RAG)
      • Concept | Embed recipes and Retrieval Augmented Generation (RAG)
      • Tutorial | Retrieval Augmented Generation (RAG) with the Embed dataset recipe
      • Tutorial | Build a multimodal knowledge bank for a RAG project
      • Tutorial | Build a conversational interface with Dataiku Answers
    • LLMOps
      • Tutorial | LLM evaluation
  • Code
    • Getting Started with Code in Dataiku
      • Concept | Code notebooks
      • Concept | Code recipes
      • Tutorial | Code notebooks and recipes
    • Python and Dataiku
      • Tutorial | Code notebooks and recipes
      • Tutorial | SQL from a Python recipe in Dataiku
      • Tutorial | Sessionization in SQL, Hive, Python, and Pig
      • Tutorial | PySpark in Dataiku
      • Reference | Reading or writing a dataset with custom Python code
      • How-to | Enable auto-completion in a Jupyter notebook
      • Code Sample | Access info about datasets
    • SQL and Dataiku
      • Concept | SQL notebooks
      • Concept | SQL code recipes
      • Concept | AI SQL Assistant
      • Tutorial | SQL notebooks and recipes
    • R and Dataiku
      • Tutorial | Dataiku for R users
      • Tutorial | R Markdown reports
      • Tutorial | Forecasting time series data with R and Dataiku
      • Tutorial | R Shiny webapps
      • Reference | Upgrading and rolling back the R version used in Dataiku
      • How-to | Edit Dataiku recipes in RStudio
      • Troubleshoot | R recipes aren’t working after upgrading or migrating the instance
    • Work Environment
      • Concept | Code environments
      • Concept | External IDE integrations
      • Tutorial | My first Code Studio
      • How-to | Create a code environment
      • How-to | Set a code environment
      • How-to | Edit Dataiku projects and plugins in VS Code
      • How-to | Edit Dataiku projects and plugins in PyCharm
      • How-to | Edit Dataiku projects and plugins in Sublime
      • How-to | Edit Dataiku recipes in RStudio
      • FAQ | Why should I use a code environment?
    • Shared Code
      • Concept | Introduction to shared code
      • Concept | Shared code libraries
      • Concept | Importing code from a remote Git repository
      • Concept | Code samples
      • Tutorial | Shared code
      • Tutorial | Cloning a library from a remote Git repository
      • How-to | Import a notebook from GitHub
      • Tip | Best practices for notebook development between GitHub and Dataiku
    • Dataiku APIs
      • Concept | Dataiku APIs
      • Concept | The dataiku package
      • Concept | Dataiku public API
      • Concept | Usage of Dataiku APIs outside of Dataiku
      • Tutorial | Dataiku public API
      • Tip | Using the API within Dataiku (Basics)
      • Tip | Automating work in Dataiku with the API
      • Tip | Administering Dataiku remotely
    • Managed Folders
      • Concept | Managed folders
      • Tutorial | Managed folders
  • MLOps & Operationalization
    • MLOps Architecture
      • Concept | Definition, challenges, and principles of MLOps
      • Concept | How model development impacts MLOps
      • Concept | Model packaging for deployment
      • Concept | Dataiku architecture for MLOps
    • Batch Deployment
      • Concept | Automation node preparation
      • Concept | Batch deployment
      • Tutorial | Batch deployment
    • Test Scenarios
      • Tutorial | Test scenarios
    • API Deployment
      • Concept | Real-time APIs
      • Concept | API endpoints
      • Concept | API query enrichments
      • Concept | API Deployer
      • Tutorial | Real-time API deployment
    • Model Monitoring
      • Concept | Process governance for MLOps
      • Concept | Model comparisons
      • Concept | Model evaluation stores
      • Concept | Monitoring model performance and drift in production
      • Concept | Monitoring and feedback in the AI project lifecycle
      • Tutorial | Model monitoring with a model evaluation store
      • Tutorial | API endpoint monitoring
      • Tutorial | Model monitoring in different contexts
      • Tutorial | Deployment automation
      • FAQ | How can I get model monitoring metrics in a dataset format?
    • External Models
      • Tutorial | Surface external models within Dataiku
    • Dataiku Govern
      • Concept | Introducing Dataiku Govern
      • Concept | Centralization in Dataiku Govern
      • Concept | Governance layers
      • Concept | Govern item pages
      • Concept | Workflows and project qualification
      • Concept | Governed projects
      • Concept | Business initiatives
      • Concept | Sign-off process
      • Concept | Model maintenance in Dataiku Govern
      • Concept | Govern roles and permissions
      • Concept | Customizing a Dataiku Govern instance
      • Tutorial | Dataiku Govern framework
      • Tutorial | Govern roles and permissions
      • Tutorial | Blueprint Designer
      • Tutorial | Custom Pages Designer
      • Tutorial | Use imported templates in the Blueprint Designer
      • How-to | Export Govern items
      • How-to | Switch artifact templates (blueprint versions)
      • How-to | Subscribe to email notifications
      • How-to | Export and import blueprints and blueprint versions
      • How-to | Add role assignment rules to a Govern item
      • Tip | Embed a dashboard in Dataiku Govern
    • CI/CD Pipelines
      • Tutorial | Getting started with CI/CD pipelines with Dataiku
      • Tutorial | Jenkins pipeline for API services in Dataiku
      • Tutorial | Jenkins pipeline for Dataiku with the Project Deployer
      • Tutorial | Azure pipeline for Dataiku with the Project Deployer
      • Tutorial | Jenkins pipeline for Dataiku without the Project Deployer
    • Feature Store
      • Tutorial | Building your feature store in Dataiku
      • How-to | Add a dataset to the feature store
      • How-to | Add a feature group to the Flow
  • Plugins
    • Plugin Usage
      • Concept | Plugin management
      • Concept | Plugins in Dataiku
      • How-to | Install a plugin
      • How-to | Update a plugin
      • FAQ | Are plugins supported?
      • FAQ | Where can I find the details of a plugin?
    • Plugin Development
      • Concept | Plugin development
      • Concept | Development plugins
      • Concept | Git integration for plugins
      • Reference | Plugin naming policies and conventions
      • Reference | IDE setup to develop Dataiku plugins
      • How-to | Clone a plugin from a remote Git repository
      • How-to | Share a plugin as a zip archive
      • How-to | Edit a plugin
      • FAQ | Why should I create plugins?
      • FAQ | What are some examples of plugins?
      • FAQ | Where can I find the code for a plugin?

Dataiku Cloud

  • Space Management
    • Free Trials of Dataiku Cloud
      • How-to | Begin a free trial from Dataiku
      • How-to | Begin a free trial from Snowflake Partner Connect
      • Tip | Working with Snowflake Partner Connect sample projects
    • Users, Profiles & Groups on Dataiku Cloud
      • Reference | Permission management on Dataiku Cloud
      • How-to | Invite users to your Dataiku Cloud space
      • How-to | Automatically attribute profiles and groups to users
      • How-to | Automatically invite users to your instance
      • How-to | Use trial seats
      • How-to | Activate single sign-on (SSO)
      • Troubleshoot | The invited user didn’t receive an email
    • Support on Dataiku Cloud
      • How-to | Contact support on Dataiku Cloud
      • How-to | Grant Dataiku support access to your instance
      • FAQ | Should I email support@dataiku.com if I need help?
    • Production Nodes on Dataiku Cloud
      • How-to | Install the Automation node
      • How-to | Install the API node
      • How-to | Use the referenced data deployment mode on Dataiku Cloud
      • How-to | Deploy an API service from the Automation node on Dataiku Cloud
  • Data Transfer and Security on Dataiku Cloud
    • Reference | Relocatable datasets
    • Reference | Data transfer between cloud storage locations
    • How-to | Secure data connections through AWS PrivateLink
    • How-to | Secure data connections through Azure Private Link
    • How-to | Secure data connections through GCP Private Service Connect
    • How-to | Restrict access to Dataiku Cloud IP addresses
    • How-to | Access data sources through a VPN server
  • Compute and Resource Quotas on Dataiku Cloud
    • Reference | Overview of compute engines on Dataiku Cloud
    • Reference | Leveraging fully managed elastic AI compute
    • Reference | Managing elastic AI compute capacity
    • Reference | Managing containerized execution configurations
    • Reference | Resource quota management
    • Tip | Choosing container sizes
    • Tip | Using Spark
    • Troubleshoot | The job takes an unusually long time to complete
    • Troubleshoot | The job queues for a long time and then fails without ever starting

Additional Offerings

  • Dataiku Solutions
    • Retail & CPG
      • Solution | Customer Satisfaction Reviews
      • Solution | Demand Forecast
      • Solution | Distribution Spatial Footprint
      • Solution | Market Basket Analysis
      • Solution | Product Recommendation
      • Solution | Customer Lifetime Value Forecasting
      • Solution | RFM Segmentation
      • Solution | Inventory Allocation Optimization with Grid Dynamics
      • Solution | Markdown Optimization
      • Solution | Store Segmentation
    • Financial Services & Insurance
      • Solution | AML Alerts Triage
      • Solution | Credit Card Fraud
      • Solution | Insurance Claims Modeling
      • Solution | Credit Scoring
      • Solution | Customer Segmentation for Banking
      • Solution | News Sentiment Stock Alert System
      • Solution | Next Best Offer for Banking
      • Solution | Credit Risk Stress Testing (CECL, IFRS9)
      • Solution | Lead Scoring
    • Health & Life Sciences
      • Solution | Optimizing Omnichannel Marketing
      • Solution | Pharmacovigilance
      • Solution | Social Determinants of Health
How-to | Secure data connections through AWS PrivateLink#

For certain plans, Dataiku enables Launchpad administrators to protect access to the following data sources through AWS PrivateLink.

AWS PrivateLink provides private connectivity between your Dataiku instance and supported AWS services without exposing your traffic to the public internet. Once activated, Dataiku Cloud connects to your data only through a virtual private cloud (VPC) endpoint.

Important

AWS PrivateLink isn’t available in all Dataiku plans. To enable it, you may need to reach out to your Dataiku Account Manager or Customer Success Manager.

If you run into any errors, contact the support team.

  • Amazon S3 data

  • An AWS-hosted Snowflake database

  • An AWS-hosted Databricks database

  • An AWS-hosted arbitrary data source

  • An AWS-hosted RDS

  • An on-premise data source

Amazon S3 data#

To configure AWS PrivateLink for an Amazon S3 data source:

  1. First, contact the support team so they can provide you with the endpoint to use. You will need to know the AWS region of your S3 buckets.

  2. Add or edit an S3 connection in the Launchpad’s Connections panel, check the Use Path mode box, and fill in the Region or Endpoint field with the value provided by support.

  3. Ensure your S3 policy authorizes access to the endpoint.

Note

Athena’s Glue feature won’t work with S3 connections using AWS PrivateLink.

Here is an example of an S3 bucket policy configured to only accept requests from a specific VPC endpoint:

{
    "Version": "2012-10-17",
    "Id": "Policy1415115909152",
    "Statement": [
        { "Sid": "Access-to-specific-VPCE-only",
        "Principal": "ARN-OF-IAM-USER-ASSUMED-BY-DATAIKU",
        "Action": "s3:*",
        "Effect": "Deny",
        "Resource": ["S3-BUCKET-ARN",
                    "S3-BUCKET-ARN/*"],
        "Condition": {"StringNotEquals": {"aws:sourceVpce": "VPCE-ID"}}
        }
    ]
}
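If you manage bucket policies in code, the policy above can also be built and applied programmatically. Below is a minimal sketch: the bucket ARN, principal ARN, and endpoint ID are placeholders you must replace, and the final boto3 call is shown as a comment because it requires AWS credentials with the s3:PutBucketPolicy permission.

```python
import json

def build_vpce_only_policy(bucket_arn: str, principal_arn: str, vpce_id: str) -> dict:
    """Build a bucket policy that denies all S3 actions unless the request
    arrives through the given VPC endpoint (mirrors the example above)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Access-to-specific-VPCE-only",
                "Principal": principal_arn,
                "Action": "s3:*",
                "Effect": "Deny",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"StringNotEquals": {"aws:sourceVpce": vpce_id}},
            }
        ],
    }

# Placeholder values -- replace with your own.
policy = build_vpce_only_policy(
    bucket_arn="arn:aws:s3:::my-bucket",
    principal_arn="arn:aws:iam::123456789012:user/dataiku",
    vpce_id="vpce-0123456789abcdef0",
)
print(json.dumps(policy, indent=4))

# To apply the policy (requires AWS credentials):
# boto3.client("s3").put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```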

An AWS-hosted Snowflake database#

To configure AWS PrivateLink for a Snowflake database hosted on AWS:

Ensure your Snowflake region is available in Dataiku Cloud#

  1. In the Dataiku Cloud Launchpad, navigate to the Extensions panel.

  2. Click + Add an Extension.

  3. Select AWS Snowflake endpoint.

  4. Select the AWS region of your Snowflake account. If the region you need isn’t available, contact the support team to enable it.

  5. Keep this page open, and continue to the next step in the Snowflake console.

Ask Snowflake support to allow AWS PrivateLink from Dataiku’s AWS account#

  1. In the Snowflake console, go to the Support section in the left panel.

  2. Create a new support case by clicking on Support Case in the top right corner.

  3. Give the case a meaningful title, for example, Enable AWS PrivateLink.

  4. Copy the message from the Dataiku Cloud Launchpad extension page into the support case details.

    ../_images/support-case-message.png
  5. Replace the Snowflake account ID and region in the message with your own values. You can find both in the bottom left corner of the Snowflake console.

  6. In the Where did the issue occur? section, select AWS PrivateLink under the Managing Security & Authentication category, leave the severity at Sev-4, and click on Create Case.

    ../_images/snowflake-support-case.png
  7. Wait for Snowflake support to enable PrivateLink before continuing to the next set of instructions.

Retrieve the PrivateLink config from Snowflake#

  1. Once Snowflake support confirms that PrivateLink is enabled, create a new SQL worksheet in Snowflake.

  2. Run the following SQL commands with the ACCOUNTADMIN role:

    alter account set ENABLE_INTERNAL_STAGES_PRIVATELINK = true;
    select SYSTEM$GET_PRIVATELINK_CONFIG();
    
  3. Click on the output to open a new panel on the right.

  4. Click on the Click to Copy icon to copy the JSON result.

    ../_images/snowflake-result.png
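Before pasting the result into the Launchpad, you can sanity-check that the copied JSON contains the fields the extension relies on. A minimal sketch, assuming the key names that SYSTEM$GET_PRIVATELINK_CONFIG() typically returns; the sample values below are hypothetical:

```python
import json

# Hypothetical sample of what SYSTEM$GET_PRIVATELINK_CONFIG() may return.
raw = json.dumps({
    "privatelink-account-name": "xy12345.eu-west-1.privatelink",
    "privatelink-account-url": "xy12345.eu-west-1.privatelink.snowflakecomputing.com",
    "privatelink-vpce-id": "com.amazonaws.vpce.eu-west-1.vpce-svc-0123456789abcdef0",
    "privatelink_ocsp-url": "ocsp.xy12345.eu-west-1.privatelink.snowflakecomputing.com",
})

config = json.loads(raw)

# Fail early if an expected field is missing before pasting into the Launchpad.
expected = ("privatelink-account-url", "privatelink-vpce-id")
missing = [key for key in expected if key not in config]
assert not missing, f"PrivateLink config is missing: {missing}"
```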

Create the AWS Snowflake endpoint extension in the Dataiku Cloud Launchpad#

  1. Return to the Extensions tab of the Dataiku Cloud Launchpad.

  2. If the page is no longer open from the first section, click + Add an Extension, and select AWS Snowflake endpoint.

  3. Provide a name for the endpoint. A unique, descriptive identifier is most helpful.

  4. Select your Snowflake AWS region; it should be available by now.

  5. Check the box to confirm that Snowflake support has enabled PrivateLink for Dataiku’s account.

  6. Paste the JSON you copied from the above set of instructions into the Snowflake PrivateLink config input.

  7. Click Add.

Use the AWS Snowflake endpoint in your Snowflake connections#

You can now use the endpoint you created both in new and existing Snowflake connections. To do that:

  1. In the Dataiku Cloud Launchpad, navigate to a new or existing Snowflake connection.

  2. Select Enable AWS PrivateLink in the Snowflake connection form.

  3. Select the endpoint you created.

../_images/use-endpoint.png

Note

If your Snowflake connection uses an S3 fast-write connection, you must also set up PrivateLink for that connection as described in Amazon S3 data if you want its traffic to go through PrivateLink as well.

An AWS-hosted Databricks database#

To configure AWS PrivateLink for a Databricks database hosted on AWS:

  1. First, verify that your Databricks account and workspace meet the requirements to enable PrivateLink.

Ensure your Databricks region is available in Dataiku Cloud#

  1. In the Dataiku Cloud Launchpad, navigate to the Extensions panel.

  2. Click + Add an Extension.

  3. Select AWS Databricks endpoint.

  4. Select the AWS region of your Databricks account. If the region you need isn’t available, contact the support team to enable it.

  5. Keep this page open, and continue to the next step in the Databricks console.

Configure your Databricks account for PrivateLink#

A Databricks administrator must perform the following steps in your Databricks console.

  1. Register Dataiku’s VPC endpoint provided in the extension form. You can refer to Databricks’s documentation to do so. Note that the region to fill in is your Databricks workspace AWS region, not Dataiku’s.

  2. Ensure your Private Access Settings (PAS) configuration allows this registered VPC endpoint to connect to your workspace. See Databricks’s article to learn more.

Create the AWS Databricks endpoint extension in the Dataiku Cloud Launchpad#

  1. Return to the Extensions tab of the Dataiku Cloud Launchpad.

  2. If the page is no longer open from the first section, click + Add an Extension, and select AWS Databricks endpoint.

  3. Provide a name for the endpoint. A unique, descriptive identifier is most helpful.

  4. Select your Databricks AWS region; it should be available by now.

  5. Check the box to confirm that your Databricks account is configured for PrivateLink.

  6. Fill in the URL of your Databricks workspace.

  7. Click Add.

Use the AWS Databricks endpoint in your Databricks connections#

You can now use the endpoint you created both in new and existing Databricks connections. To do that:

  1. In the Dataiku Cloud Launchpad, navigate to a new or existing Databricks connection.

  2. Select Enable Databricks PrivateLink in the Databricks connection form.

  3. Select the endpoint you created.

Note

If your Databricks connection uses an S3 fast-write connection, you must also set up PrivateLink for that connection as described in Amazon S3 data if you want its traffic to go through PrivateLink as well.

An AWS-hosted arbitrary data source#

Administrators can leverage AWS PrivateLink to expose any service running inside their VPCs to Dataiku Cloud.

Ensure your data source region is available in Dataiku Cloud#

  1. In the Dataiku Cloud Launchpad, navigate to the Extensions panel.

  2. Click + Add an Extension.

  3. Select AWS private endpoint.

  4. Select the AWS region of your source. If the region you need isn’t available, contact the support team to enable it.

  5. Once your region is available, select it to display the availability zone IDs and the IAM role of your instance.

Create the VPC endpoint service#

You will need a VPC endpoint service in one of the regions and availability zone IDs supported by Dataiku. To create it:

  1. Follow the AWS documentation to create a network load balancer (NLB) in one of the availability zone IDs displayed in the previous step.

  2. Once your VPC endpoint service is created, add the IAM role of your instance from the previous step to the Allow Principals section.

Note

If security group rules are enforced, you must allow Dataiku’s internal CIDR ranges (10.0.0.0/16 and 10.1.0.0/16) on your network ACLs and security groups.
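If you prefer to script the two steps above, the boto3 ec2 client exposes create_vpc_endpoint_service_configuration and modify_vpc_endpoint_service_permissions. A minimal sketch: the ARNs and IDs below are placeholders, and the calls themselves are shown as comments because they require AWS credentials:

```python
# Placeholders -- replace with your NLB ARN and the IAM role shown in the
# Launchpad extension form.
nlb_arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/net/my-nlb/abc123"
dataiku_role_arn = "arn:aws:iam::123456789012:role/dataiku-instance-role"

# Step 1: parameters for ec2.create_vpc_endpoint_service_configuration(...)
create_params = {
    "NetworkLoadBalancerArns": [nlb_arn],
    "AcceptanceRequired": True,  # you will accept Dataiku's connection request manually
}

# Step 2: parameters for ec2.modify_vpc_endpoint_service_permissions(...),
# allowing Dataiku's instance role to connect to the endpoint service.
permission_params = {
    "ServiceId": "vpce-svc-REPLACE_ME",  # returned by the create call
    "AddAllowedPrincipals": [dataiku_role_arn],
}

# With AWS credentials configured, you would run:
# import boto3
# ec2 = boto3.client("ec2")
# svc = ec2.create_vpc_endpoint_service_configuration(**create_params)
# permission_params["ServiceId"] = svc["ServiceConfiguration"]["ServiceId"]
# ec2.modify_vpc_endpoint_service_permissions(**permission_params)
```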

Configure the PrivateLink with Dataiku#

  1. Return to the Extensions tab of the Dataiku Cloud Launchpad.

  2. If the page is no longer open from the first section, click + Add an Extension, and select AWS private endpoint.

  3. Provide a name for the endpoint. A unique, descriptive identifier is most helpful.

  4. Select your Endpoint AWS region; it should be available by now.

  5. Check the box to confirm that your service is configured for PrivateLink.

  6. Fill in your service name.

  7. Click Add.

  8. Accept the PrivateLink request on AWS (Endpoint services > Your endpoint > Endpoint connections).

  9. Click on the PrivateLink extension to retrieve the Connection Host value to use in your downstream connections.
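Step 8 can also be performed programmatically with the ec2 client's accept_vpc_endpoint_connections call. A sketch with placeholder IDs; the calls are commented out because they require AWS credentials:

```python
# Placeholders -- your endpoint service ID and the pending endpoint
# connection request coming from Dataiku.
accept_params = {
    "ServiceId": "vpce-svc-REPLACE_ME",
    "VpcEndpointIds": ["vpce-REPLACE_ME"],
}

# With AWS credentials configured:
# import boto3
# ec2 = boto3.client("ec2")
# # List pending connection requests to find Dataiku's endpoint ID.
# pending = ec2.describe_vpc_endpoint_connections(
#     Filters=[{"Name": "vpc-endpoint-state", "Values": ["pendingAcceptance"]}]
# )
# ec2.accept_vpc_endpoint_connections(**accept_params)
```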

Create the connection#

You can now use the endpoint you created both in new and existing connections:

  1. In the Dataiku Cloud Launchpad, navigate to the Connections panel.

  2. Add a Connection or select the one you want to edit.

  3. Fill in the form, using the Connection Host value displayed in the PrivateLink extension as the host parameter.

An AWS-hosted RDS#

To set up AWS PrivateLink with an RDS instance, expose it through a network load balancer, and then follow the steps from An AWS-hosted arbitrary data source.

Note

The IP address of your RDS instance can change over time. See the AWS blog post on Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer for more details and a possible solution.
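The root cause is that an NLB registers IP addresses as targets, while an RDS endpoint is a DNS name whose underlying IP can change (for example, after a failover). The usual fix is a scheduled job that re-resolves the endpoint and updates the target group. The resolution step can be sketched as follows, using localhost as a hypothetical stand-in for an RDS endpoint:

```python
import socket

def resolve_ipv4(hostname: str) -> list[str]:
    """Return the current IPv4 addresses behind a DNS name (e.g. an RDS endpoint)."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# A scheduled job would compare this result with the NLB target group's
# registered targets and register/deregister targets when they drift.
print(resolve_ipv4("localhost"))
```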

An on-premise data source#

You can configure AWS PrivateLink for on-premise data sources if you have access to an AWS account:

  1. Connect your on-premise data source to your VPC as described in this AWS white paper on Network-to-Amazon VPC connectivity options.

  2. Follow the steps from An AWS-hosted arbitrary data source to connect to your data source.

Copyright © 2025, Dataiku