Tutorial | Git for projects#

Let’s practice using the Git integration for version control in a Dataiku project!

Get started#

Objectives#

In this tutorial, you will:

  • Create and update new branches for a project.

  • Resolve merge conflicts between branches locally.

  • Connect a local project to a remote Git repository.

  • Push, pull, and merge changes remotely.

Prerequisites#

To complete this tutorial, you should have a firm understanding of the Git model and terminology.

Requirements by section:

Section: Version control with the local Git repository

  • Dataiku 13.1 or later.

Section: Version control with remote Git repositories

  • Dataiku 12.0 or later.

  • An empty remote Git repository for this project specifically.

  • A Dataiku instance that has been set up to work with remote Git repositories. Refer to Working with Git for help.

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Git for Projects.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Once you have the project, open the Version Control page from the More Options (…) menu in the top navigation bar.

Version control with the local Git repository#

Branch the project#

Let’s make a new working branch from the master branch of the project. New projects are automatically on master.

  1. From the branch indicator menu, click Create new branch.

  2. Name the new branch update-wiki.

  3. Choose Duplicate project to work on new branch.

  4. Select Next.

    Dataiku screenshot of the steps to create a new branch for the project.

  5. Select a project folder if you wish.

  6. Click Duplicate and Create Branch.

Your new project on the new branch will open automatically. Your starter project will stay on the master branch.

Important

A Dataiku project can only be on one branch at any given time. If you switch the branch of the current project, the switch also applies to every collaborator, who might then accidentally make a change on the wrong branch. This is why, when working in a team, it’s best to duplicate the project for the new branch.
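If you think in command-line Git, the branch step above corresponds roughly to creating and checking out a new branch from master. The sketch below only builds the argument lists for those commands. It illustrates the concept and is not what Dataiku runs internally. The helper names `create_branch_commands` and `run_all` are hypothetical, and `git switch -c` assumes Git 2.23 or later.

```python
import subprocess
from typing import List


def create_branch_commands(new_branch: str, base: str = "master") -> List[List[str]]:
    """Return the Git commands that create `new_branch` from `base`."""
    if not new_branch or " " in new_branch:
        raise ValueError("branch names must be non-empty and contain no spaces")
    return [
        ["git", "switch", base],              # start from the base branch
        ["git", "switch", "-c", new_branch],  # create and check out the new branch
    ]


def run_all(commands: List[List[str]], repo_dir: str) -> None:
    """Run each command inside repo_dir, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=repo_dir, check=True)


# Only print the commands here; call run_all(...) yourself inside a real repository.
print(create_branch_commands("update-wiki"))
```

Because a working copy can only have one branch checked out at a time, a second copy (here, the duplicated project) is what lets two branches be edited side by side.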

Edit branched wiki#

Next, we’ll make wiki changes on both the master branch and the update-wiki working branch. Since we duplicated the project, we’ll have to make changes on two separate projects.

In this case, we’re simulating a situation in which two people unknowingly make changes to the same part of the project on separate branches.

  1. From the top navigation bar, open the branched project’s wiki (or g + w).

  2. Click the Edit tab of the Model Training and Design Requirements article.

  3. Change the first heading from Introduction to Overview.

  4. Delete the line that includes Sources: Customer databases, transaction logs, CRM systems.

  5. Save your changes.

    Dataiku screenshot of the edited wiki page on the working branch.

  6. Open the Version Control page.

  7. Note that the latest commit should reflect that you saved your wiki article.

Note

Recall that any change that is saved in a Dataiku project is automatically committed to the local Git repository. In other words, you do not have to stage and commit your changes manually.

Edit master wiki#

Now, we’ll create a conflicting change on the master branch.

  1. In a new tab, open the original project that is on master.

  2. Open the wiki and switch to Edit mode.

  3. This time, change the first heading from Introduction to Purpose.

  4. Save your changes.

Dataiku screenshot of the edited wiki page on the master branch.
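To see why these two edits will collide, it can help to model a three-way merge on a single line. The sketch below is a simplified illustration, not Git's actual algorithm: the function name `merge_line` is hypothetical, and an empty string stands in for a deleted line.

```python
from typing import Optional


def merge_line(base: str, ours: str, theirs: str) -> Optional[str]:
    """Three-way merge of one line. Return the merged line, or None on conflict."""
    if ours == theirs:
        return ours    # both sides agree (or neither changed the line)
    if ours == base:
        return theirs  # only the other branch changed the line
    if theirs == base:
        return ours    # only our branch changed the line
    return None        # both branches changed the line differently


# The heading was edited on both branches, so the merge cannot pick a side.
print(merge_line("# Introduction", "# Purpose", "# Overview"))  # None

# The Sources line was deleted on update-wiki only, so the deletion wins cleanly.
sources = "Sources: Customer databases, transaction logs, CRM systems"
print(repr(merge_line(sources, sources, "")))  # ''
```

Here "ours" is the master branch and "theirs" is update-wiki. Only the heading needs a manual decision.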

Create a merge request#

Let’s see what happens when we try to merge the changes from the update-wiki branch into the master branch.

  1. From the project on master, open the Version Control page.

  2. Click on the Merge dropdown.

  3. Select Create a new merge request.

  4. For the Title, type Update wiki heading.

  5. Select the update-wiki branch project to merge into master.

  6. Click Create Merge Request.

    Dataiku screenshot of the create merge request dialogue window.

  7. In the Commits tab, review the commits that will be merged into master.

  8. In the Changed files tab, you should see the wiki article that you modified.

Resolve a merge conflict#

Since we changed the same line of the wiki on both branches, we need to resolve a merge conflict. You’ll see that you cannot click Merge until the conflict is resolved.

  1. Navigate to the Conflicts to resolve tab.

  2. Delete every line from <<<<<<< HEAD through >>>>>>> fork/update-wiki, inclusive, except the # Purpose line.

    Note

    You can also make other changes to the file during this time if you want them to be included in the merge.

  3. Notice that the Sources line that you removed on the update-wiki branch does not appear in the article.

  4. Save and select Mark as Resolved.

  5. Merge the changes.

  6. Close the merge request.

Dataiku screenshot highlighting the lines to delete to resolve the merge conflict.

If you want, return to the wiki article page and make sure that the changes are correctly reflected in the master branch.
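The manual edit above keeps the HEAD side of the conflict block and discards the rest. As a minimal sketch of the same idea, the function below resolves each conflict block in a piece of text by keeping one side. The function name `resolve_conflicts`, the sample body sentence, and the exact layout of the conflict block are assumptions made for illustration.

```python
def resolve_conflicts(text: str, keep: str = "ours") -> str:
    """Resolve every Git conflict block in text by keeping one side.

    keep="ours" keeps the lines under <<<<<<<; keep="theirs" keeps the lines under =======.
    """
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")
    resolved = []
    state = "outside"  # one of "outside", "ours", "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            state = "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>> ") and state == "theirs":
            state = "outside"
        elif state == "outside" or state == keep:
            resolved.append(line)
    if state != "outside":
        raise ValueError("unterminated conflict block")
    return "\n".join(resolved)


conflicted = "\n".join([
    "<<<<<<< HEAD",
    "# Purpose",
    "=======",
    "# Overview",
    ">>>>>>> fork/update-wiki",
    "",
    "This article describes the model.",  # hypothetical body text
])
print(resolve_conflicts(conflicted, keep="ours"))
```

Keeping one side wholesale is only one strategy. As the note above says, you can also edit the file by hand and combine both sides before marking the conflict as resolved.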

Version control with remote Git repositories#

Connect to a remote Git repository#

In this section, you’ll connect your project to a remote Git repository. Each project must have its own repository.

  1. On the master branch project, navigate to the Version Control page.

  2. Click on the change tracking indicator and select Add remote.

    Dataiku screenshot highlighting the Add remote option.

  3. Enter the SSH URL of the remote and click OK.

  4. From the change tracking indicator, select Push.

  5. In your remote Git repository, verify that the master branch has been successfully pushed.

GitHub screenshot of the project pushed.
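In command-line terms, the steps above resemble adding a remote and pushing master to it. The sketch below checks that a URL looks like an SSH remote and then builds the matching Git commands without running them. The helper names, the remote name `origin`, and the example URLs are assumptions for illustration, and the pattern is deliberately loose rather than a full URL validator.

```python
import re
from typing import List

# Matches scp-like SSH remotes (user@host:path) and ssh:// URLs.
_SSH_REMOTE = re.compile(r"^(ssh://[^\s/]+/\S+|[\w.-]+@[\w.-]+:\S+)$")


def is_ssh_remote(url: str) -> bool:
    """Return True if url looks like an SSH Git remote."""
    return bool(_SSH_REMOTE.match(url))


def add_remote_and_push(url: str, branch: str = "master",
                        remote: str = "origin") -> List[List[str]]:
    """Return the Git commands that register the remote and push the branch."""
    if not is_ssh_remote(url):
        raise ValueError(f"not an SSH remote URL: {url!r}")
    return [
        ["git", "remote", "add", remote, url],
        ["git", "push", "--set-upstream", remote, branch],
    ]


# Hypothetical repository URL, used only to show the shape of the commands.
for argv in add_remote_and_push("git@github.com:example-org/example-repo.git"):
    print(" ".join(argv))
```

An HTTPS URL such as `https://github.com/example-org/example-repo.git` is rejected by this check, which mirrors the tutorial's instruction to enter the SSH URL.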

Branch the project#

Next, we’ll create a new branch.

  1. From the branch indicator menu, click Create new branch.

  2. Name the new branch prune-flow.

  3. Choose Duplicate project to work on new branch.

  4. Select Next.

    Screenshot of the Create branch dialogue window.

  5. Select a project folder if you wish.

  6. Click Duplicate and Create Branch.

This creates a duplicate project on the prune-flow branch.

Make changes on the branch#

Now, we can make our changes to the duplicate project on the prune-flow branch without disturbing the rest of the data team’s use of the master branch of the project.

  1. In the new project, go to the Flow.

  2. Delete the Orders_by_Country_Category dataset.

Dataiku screenshot highlighting the dataset to delete.

Push changes to the remote repository#

To make your changes appear in the remote repository:

  1. Return to the Version Control page.

  2. From the change tracking indicator menu, select Push.

You will see that the prune-flow branch has been pushed to your remote Git repository.

Screenshot of the repository page in GitHub.

Merge branches#

To merge the changes on prune-flow into master, you can either:

  • Merge the changes locally and push the merge commit to the remote repository.

  • Merge the changes in the remote repository and pull the changes locally.

For instance, if Pull Requests are part of your team’s workflow, you might choose to merge on GitHub.

  1. In this case, merge the changes remotely. (You can use the command line, a Git client, or whatever you are comfortable with.)

  2. Return to the original project on the master branch.

  3. From the change tracking indicator menu, Fetch the changes from the remote Git repo.

    Note

    You’ll notice on the change tracking indicator that your branch is behind the remote master branch, as expected.

  4. Pull the changes into your local Git repository.

Dataiku screenshot highlighting the Fetch and Pull options of the Project version control page.
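The "behind" status that appears after fetching can be pictured with a small model. The sketch below treats each branch as an oldest-first list of commit ids and counts what each side is missing. It is a simplification: real Git histories form a graph rather than a list, and the function names and commit ids are hypothetical.

```python
from typing import List, Tuple


def ahead_behind(local: List[str], remote: List[str]) -> Tuple[int, int]:
    """Count commits found only in local (ahead) and only in remote (behind).

    Both arguments are oldest-first lists of commit ids sharing a common prefix.
    """
    shared = 0
    for mine, theirs in zip(local, remote):
        if mine != theirs:
            break
        shared += 1
    return len(local) - shared, len(remote) - shared


def fast_forward(local: List[str], remote: List[str]) -> List[str]:
    """Model a pull in which local has no commits of its own."""
    ahead, _ = ahead_behind(local, remote)
    if ahead:
        raise ValueError("local has commits the remote lacks; a real merge is needed")
    return list(remote)


local_master = ["c1", "c2"]
remote_master = ["c1", "c2", "c3", "m1"]  # c3 from prune-flow, m1 the merge commit

print(ahead_behind(local_master, remote_master))  # (0, 2)
local_master = fast_forward(local_master, remote_master)
print(ahead_behind(local_master, remote_master))  # (0, 0)
```

Fetch corresponds to learning what `remote_master` contains, and Pull corresponds to moving the local branch forward to match it.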

Note

Branching and Merge Conflicts: This tutorial describes an extremely simple branch and merge. If multiple collaborators each create a separate branch off of master, and then try to merge their separate branches back to master, they are likely to encounter Git merge conflicts. These can be difficult to resolve, and we may not be able to solve them for you. Your data team should agree on a plan for how to collaborate on projects using Git in order to avoid merge conflicts.

What’s next?#

To learn more about other integrations between Git and Dataiku, see Working with Git.