Quick Start | Dataiku for AI collaboration#
Get started#
Recent advancements in generative AI have made it easy to apply for jobs. But be careful! Scammers have also been known to create fake job postings in the hopes of stealing personal information. Let's see if you, with Dataiku's help, can spot a real job posting from a fake one!
Objectives#
Rather than designing new elements like in the quick starts for data preparation, machine learning, or MLOps, this quick start focuses on how to collaborate with colleagues and use the AI capabilities they have already created as inputs for your own objectives.
In this quick start, you’ll:
Understand a project’s objectives by reviewing the Flow.
Recognize how group assignments impact project security.
Communicate insights with a dashboard.
Run a colleague’s workload by using both an automation scenario and a Dataiku application.
Note
All capabilities featured in this quick start can be completed with the AI Consumer user profile.
Tip
To check your work, you can review a completed version of this entire project from data preparation through MLOps on the Dataiku gallery.
Create an account#
To follow along with the steps in this tutorial, you need access to a 12.6+ Dataiku instance. If you do not already have access, you can get started in one of two ways:
Start a 14-day free trial. See this how-to for help if needed.
Install the free edition locally for your operating system.
Open Dataiku#
The first step is getting to the homepage of your Dataiku Design node.
Create the project#
Once you are on the Design node homepage, you can create the tutorial project.
From the Dataiku Design homepage, click + New Project.
Click DSS tutorials in the dropdown menu.
In the dialog, click Quick Starts in the left-hand panel.
Choose AI Collaboration Quick Start, and then click OK.
Note
You can also download the starter project from this website and import it as a zip file.
Are you using an AI Consumer profile?
AI Consumer profiles do not include the permission to create a new project. However, as a Designer on a trial or free edition, you’ll be able to do this on your own!
If using an AI Consumer profile, have your instance administrator follow the steps below so you can complete the quick start:
Create the project above.
Build the Flow.
Assume the role of the Score Data scenario’s last author by making an arbitrary change to the scenario (such as to the trigger) and saving it.
Grant you permission to access the project.
Review the Flow#
See a screencast covering this section’s steps
One of the first concepts a user needs to understand about Dataiku is the Flow. The Flow is the visual representation of how datasets, recipes (steps for data transformation), and models work together to move data through an analytics pipeline.
See the Flow’s visual grammar#
Dataiku has its own visual grammar to organize AI, machine learning, and analytics projects in a collaborative way.
| Shape | Item | Icon |
|---|---|---|
| Square | Dataset | The icon on the square represents the dataset's storage location, such as Amazon S3, Snowflake, PostgreSQL, etc. |
| Circle | Recipe | The icon on the circle represents the type of data transformation, such as a broom for a Prepare recipe or coiled snakes for a Python recipe. |
| Diamond | Model | The icon on the diamond represents the type of modeling task, such as prediction, clustering, time series forecasting, etc. |
Tip
In addition to shape, color has meaning too.
Datasets are blue. Those shared from other projects are black.
Visual recipes are yellow. Code recipes are orange. LLM recipes are pink. Plugin recipes are red.
Machine learning elements are green.
Take a look at the items in the Flow now!
If not already there, from the left-most menu in the top navigation bar, click on the Flow (or use the keyboard shortcut g + f).
Tip
There are many other keyboard shortcuts beyond g + f. Type ? to pull up a menu, or see the Accessibility page in the reference documentation.
Use the right panel to review an item’s details#
To collaborate on a project, you’ll need to quickly get up to speed on what someone else’s Flow accomplishes. Let’s try to figure out the purpose of this one.
Click once on the job_postings dataset to select it.
Click the Details icon to learn more about this item.
Click on the Schema tab underneath to see its columns.
Click on the test_scored dataset at the end of the pipeline, and review the same tabs. Note the addition of a prediction column.
Review the recipes that transform job_postings to test_scored beginning with the Prepare recipe at the start of the pipeline. Click once to select each one, and review the Details tabs to help determine what they do.
Note
The model in this project happens to be a simple AutoML model. However, you can think of it as a placeholder for any kind of model — not only those built in Dataiku, but also custom models imported into Dataiku.
You could read the project's wiki (use the keyboard shortcut g + w) for more information, but from just browsing the Flow, you probably already have a good idea of what this project does. The pipeline prepares some data and builds a prediction model in order to classify a job posting as real or fake.
The readability of the Flow eases the challenge of bringing users of diverse skill sets and responsibilities onto the same platform. For example:
The Flow has visual recipes (in yellow) that can be understood by all, but also custom code (in orange).
The Flow is divided into two interconnected Flow zones, which can be useful for teams focused on different stages of a project.
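If you prefer code, you can also explore a Flow programmatically. The snippet below is a minimal sketch using Dataiku's public Python client (dataikuapi); the host URL, API key, and project key are placeholders you would replace with your own.

```python
import dataikuapi

# Connect to the Design node with a personal API key (placeholder values).
client = dataikuapi.DSSClient("https://my-dataiku-instance:11200", "my-api-key")
project = client.get_project("MY_PROJECT_KEY")  # use your project's key

# Inspect the input dataset's columns, much like the Details panel's Schema tab.
dataset = project.get_dataset("job_postings")
schema = dataset.get_schema()
for column in schema["columns"]:
    print(column["name"], column["type"])
```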
Build the Flow#
Unlike the initial uploaded datasets, the downstream datasets appear as outlines. This is because they have not been built, meaning that the relevant recipes have not been run to populate these datasets. However, this is not a problem because the Flow contains the recipes required to create these outputs at any time.
Click to open the Flow Actions menu in the bottom right.
Click Build all.
Leaving the default options, click Build to run the recipes necessary to create the items furthest downstream.
When the job completes, refresh the page to see the built Flow.
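For reference, the same kind of build can be scripted. The sketch below reuses the hypothetical client and project handles from the earlier snippet and asks Dataiku to rebuild the furthest-downstream dataset together with its upstream dependencies; treat it as an illustration rather than a required step.

```python
# Reusing the `project` handle from the earlier sketch (an assumption).
# RECURSIVE_FORCED_BUILD rebuilds the dataset and everything upstream of it.
dataset = project.get_dataset("test_scored")
dataset.build(job_type="RECURSIVE_FORCED_BUILD", wait=True)
print("Build finished")
```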
Collaborate in real-time through a browser#
See a screencast covering this section’s steps
Now that you have built the project, you might want to get straight to work. However, let’s take a moment to review a few collaboration principles.
Work in a browser#
One point not to be overlooked is that you access Dataiku through a web browser (rather than, say, a desktop application). This has a number of advantages:
You can work with large datasets in a secure and governed way.
You can avoid time lost transferring data across networks.
You can better track a project’s version history and user contributions.
Understand the groups-based permission framework#
A browser-based tool also enables a groups-based permission framework. Start by recognizing some basic details about your account.
In the top right corner, click on the Profile icon.
Click the gear icon to open Profile and settings.
Find your user profile and the groups to which you belong.
Common user profiles include Data Designer, Advanced Analytics Designer, Full Designer, and AI Consumer. As an example, if on a free trial, your profile will be designer, and you’ll be a member of the designers and space_administrators groups.
Based on your group membership, you may have projects or workspaces shared with you, and permissions set for what you can do in these items (such as writing project content or only reading project content).
Assuming you created the job postings project yourself, you’ll be able to view the project’s security settings. These settings include information such as the project owner and the specific project permissions granted to each group or user.
Return to the project (for example, using the back arrow in your browser).
From the top navigation bar, go to the More Options (…) menu.
Click on Security to view the permissions matrix for the project.
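The same security settings can be read through the public API. This is only a rough sketch (again reusing the hypothetical project handle from earlier); the exact structure of the returned dictionary can vary between Dataiku versions.

```python
# Read the project's security settings: the owner and per-group permissions.
security = project.get_permissions()
print("Owner:", security.get("owner"))
for perm in security.get("permissions", []):
    # Each entry describes what one group (or user) can do in the project.
    print(perm)
```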
Tip
Users can invite a colleague to their space from the Users, Profiles & Groups panel of their Launchpad.
Self-managed Dataiku users with the appropriate permissions can do the same from the Administration > Security > Users panel. Then, grant this user access to your project from the Permissions panel of the Project security page described above!
Communicate with colleagues#
Once you have your colleagues on the same instance space, you’ll be able to collaborate in real-time.
Start discussions on objects, such as from the Discussions tab of the right sidebar.
Manage requests and review discussions from your Inbox found in the Applications menu in the top right of the navigation bar.
Run a colleague’s workload using an automation scenario#
See a screencast covering this section’s steps
In addition to being a place for communicating insights, dashboards can also be a tool to interact with project elements that colleagues have created.
For example, a team member may have embedded a custom webapp on a dashboard. You can use the webapp’s functionality through the dashboard.
Another good example of this pattern is scenarios. In Dataiku, scenarios are a set of actions to run, along with conditions for when they should execute. Although scenarios can trigger automatically based on factors like time or dataset changes, you can also trigger them manually — including from a dashboard.
This could be helpful for tasks such as:
Refreshing the Flow with the latest batch of data.
Exporting objects like dashboards, notebooks, reports, wikis, and other kinds of documentation.
Executing SQL or Python code (see the sketch below).
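To illustrate that last point, a custom Python step inside a scenario can drive the Flow through the scenario API available on the instance. The snippet below is only a sketch of what such a step might contain; the dataset name matches this project, but everything else is an assumption.

```python
# Inside a scenario's custom Python step, the dataiku.scenario API is available.
from dataiku.scenario import Scenario

scenario = Scenario()
# Rebuild the furthest-downstream dataset, much like the Score Data scenario does.
scenario.build_dataset("test_scored")
```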
Note
See the Quick Start | Dataiku for MLOps for a walkthrough of your first scenario.
View a scenario#
As an example, take a look at the scenario in the project.
From the Jobs menu in the top navigation bar, click on Scenarios.
Click Score Data to open the scenario.
Navigate to the Steps tab.
Click on the steps to see what actions will run when the scenario is triggered.
In many cases, you may not need to know all details of a colleague’s scenario, but this one is easy to understand. It rebuilds the furthest downstream test_scored dataset (and any necessary upstream dependencies), but only if the data quality rules on the upstream job_postings_prepared dataset pass verification.
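Manually triggering a scenario does not have to happen in the UI either. If you have an API key, a sketch like the one below would launch the same scenario through the public Python client; the scenario ID shown is an assumption, so check the ID in the scenario's URL (it can differ from the display name).

```python
# Reusing the hypothetical `project` handle from the earlier sketches.
scenario = project.get_scenario("SCOREDATA")  # replace with the actual scenario ID
scenario.run_and_wait()  # blocks until the scenario run completes
print("Scenario run finished")
```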
Caution
If you created the project yourself, you’ll need to become the scenario’s last author in order to run it. To do this, make an arbitrary change in the scenario, and then save it. For example, on the Settings tab, change something about the trigger (which won’t be used anyway).
If another user shared the project with you (for example, if you are an AI Consumer), that user needs to have made this arbitrary change to become the scenario's last author in order for you to execute the scenario on their behalf.
Add a scenario tile to a dashboard#
To have a convenient way of triggering a scenario, the project’s author can add a tile for the scenario to a dashboard page.
Navigate back to the Project dashboard (g + p).
Click Edit.
Go to Page 2.
Click + New Tile.
Choose Scenario.
In the dialog, choose the Run button option.
Open the Source scenario dropdown, and select Score Data.
Click Add.
Adjust the size of the tile by dragging the corners, and click Save (Cmd/Ctrl + s).
Run a scenario from a dashboard#
Now let’s trigger the scenario!
Click View from the dashboard.
Click Run Now to trigger the Score Data scenario.
Click Logs on the pop-up notification or, beginning from the top navigation bar, go to Scenarios > Score Data > Last runs.
Click on the job log for the build step to view the output.
Tip
Can you see why there was “nothing to do” for the build step? The short answer is build modes! Contrast this outcome with what happens when completing the next section!
Run a colleague’s workload using a Dataiku application#
See a screencast covering this section’s steps
Having a scenario run button on a dashboard reduces complexity for end users. But in many situations, end users may not even need access to the original project!
Dataiku applications enable users to package a project as a reusable application and share it with an audience of end users, such as business analysts. These end users can create their own instances of the application, and use it to complete their tasks without ever seeing the original project.
Your project has already been packaged as a Dataiku application. This application allows users to upload a dataset, apply the model to the uploaded data, and download the predictions, using the same scenario you ran above.
View the Dataiku application designer#
Typically, as an end user of a Dataiku application, you won’t need to see the originating project. However, in this case, you can still take a look!
From the top navigation bar, navigate to the More Options (…) menu.
Choose Application Designer.
Scroll through the application, and try to understand how it can be used.
Create an instance of a Dataiku application#
Similar to creating your own copy of the starter project for this tutorial, you need to create your own copy (or instance) of the Dataiku application.
From the top navigation bar, open the waffle menu, and select Dataiku Applications.
Click on the Score Job Postings application.
Click Create App Instance.
Give it a unique name, such as MYNAME SCORE JOBS.
Click Create.
Use a Dataiku application#
The first field asks you to upload a dataset to be scored by the model. For this example, let’s use an export of the original job_postings dataset, but filtered for jobs in New Zealand.
Download the nz_job_postings.csv file.
Once you have a file to upload, you can use the application to produce a batch of model predictions.
In the Upload data to be scored tile of your Dataiku application instance, click Add a File, and select the nz_job_postings.csv file.
In the Generate predictions tile, click Run Now.
When the run is finished, in the Download predictions tile, click Download.
Without ever seeing the original project, clicking Run Now triggered the same Score Data scenario that you ran from the dashboard in the previous section. This time, however, Dataiku detected a new upstream dataset in place of the job_postings dataset. Therefore, the scenario had actual work to do!
Tip
Import the file you downloaded into Dataiku or any other data tool to confirm that it is indeed the same as the test_scored dataset found in the Flow — but including only results from New Zealand!
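If you would rather check the download outside of Dataiku, a few lines of pandas will do. This sketch assumes the exported file sits in your working directory and keeps the prediction column noted earlier; adjust the file and column names to match your download.

```python
import pandas as pd

# Load the predictions exported from the Dataiku application instance.
df = pd.read_csv("test_scored.csv")  # file name is an assumption; use your download's name

# Each row is a New Zealand job posting with the model's verdict attached.
print(df["prediction"].value_counts())
```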
What’s next?#
Congratulations! You’ve taken your first steps toward AI collaboration with Dataiku.
If you are interested in learning more about Designer capabilities, please see the quick starts for data preparation, machine learning, or MLOps.
See also
You can also find more Dataiku resources in the following spaces: