Concept: Importing Code from a Remote Git Repository
In this lesson, we’ll learn about sharing code by importing it from a remote Git repository. We’ll discuss ways we can do this in both project libraries and notebooks.
The product documentation provides in-depth guidance on setting up a connection between Dataiku and a remote repository.
Imported Git code is known as a Git reference. Once imported into a project library, we can use the classes and functions in notebooks, recipes, and webapps in the same way we use any other code in our project library.
To demonstrate, we’ll import a specific path from a remote Git repository that contains a function that we can use to export Pandas DataFrames as Excel files.
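To make the goal concrete, here is a minimal sketch of what such a function might look like. The lesson does not show the repository's code, so the name dataframe_to_excel, its parameters, and the habit of appending the .xlsx extension are all assumptions made for illustration.

```python
from pathlib import Path
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:  # only used for the type hint; pandas is not imported at runtime here
    import pandas as pd


def dataframe_to_excel(
    df: "pd.DataFrame",
    path: Union[str, Path],
    sheet_name: str = "Sheet1",
) -> Path:
    """Hypothetical helper: write df to an .xlsx file and return the path written."""
    target = Path(path)
    # Append the .xlsx extension when the caller left it off.
    if target.suffix.lower() != ".xlsx":
        target = target.with_name(target.name + ".xlsx")
    # DataFrame.to_excel needs an Excel engine such as openpyxl to be installed.
    df.to_excel(target, sheet_name=sheet_name, index=False)
    return target
```

With pandas and an Excel engine installed, calling dataframe_to_excel(df, "sales_report") on a DataFrame would write sales_report.xlsx to the current working directory.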
First, we’ll go to the project’s library editor. Then we’ll select the Import from Git feature.
For a secure connection, use SSH, which requires setting up SSH credentials. Learn more in the product documentation on working with Git.
Then, we’ll use the URL of our public repository to connect to it.
Next, we’ll enter the name of the branch we want to “checkout”. The branch we select contains the content we want to import.
Then, we’ll enter the name of the subpath. This lets us import just a part of the repository.
Since we want only the “Custom Excel Functions” code, we’ll enter that path as the subpath.
Then, we’ll enter the “target path” to let Dataiku know where we want to store the imported code in our project library. The “target path” is essential. If we leave it blank, Dataiku will replace the entire Python library in our project, removing any existing files and code.
The target path starts and ends with a forward slash.
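The four values entered in this dialog can be summarized in a small sketch. This is not a Dataiku class or API; GitReferenceSettings, its field names, and the example values are hypothetical. It only encodes the two rules stated above: the target path must not be blank, and it starts and ends with a forward slash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitReferenceSettings:
    """Illustrative container for the import dialog fields (not part of Dataiku)."""

    repository_url: str  # URL of the remote repository
    branch: str          # branch to check out
    subpath: str         # part of the repository to import
    target_path: str     # destination inside the project library

    def __post_init__(self) -> None:
        # A blank target path would replace the whole Python library, so reject it.
        if not self.target_path.strip():
            raise ValueError("target_path must not be blank")
        # The lesson states that the target path starts and ends with a forward slash.
        if not (self.target_path.startswith("/") and self.target_path.endswith("/")):
            raise ValueError("target_path must start and end with '/'")


# Example values are placeholders, not the repository used in the lesson.
settings = GitReferenceSettings(
    repository_url="https://example.com/acme/shared-code.git",
    branch="main",
    subpath="custom_excel_functions",
    target_path="/python/custom_excel_functions/",
)
```

Checking the target path before saving is a cheap safeguard, since a blank value has the most destructive effect of any field in the dialog.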
Now we’ll just Save and Retrieve the repository to fetch it. Dataiku updates the “External Libraries JSON” file with our new Git reference.
Now that we have retrieved the code from the remote repository, we can use its functions in our project by including an import statement. For example, we can import the DataFrame to Excel function in our Jupyter notebook.
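The import itself is a single line. The sketch below simulates it outside Dataiku so that it can run anywhere: a temporary folder stands in for the project library, and a stand-in module is written into it. The module name custom_excel_functions and the function name dataframe_to_excel are assumptions, since the lesson does not give the real names, and the sketch assumes the library folder is on the Python path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the project library folder; inside Dataiku this already exists.
library_dir = Path(tempfile.mkdtemp())

# Hypothetical module standing in for the code retrieved from the Git repository.
(library_dir / "custom_excel_functions.py").write_text(
    "def dataframe_to_excel(df, path):\n"
    "    df.to_excel(path, index=False)\n"
    "    return path\n"
)

# Make the stand-in library importable, as if it were on the Python path.
sys.path.insert(0, str(library_dir))
importlib.invalidate_caches()

# This is the only line a notebook would need once the code is in the library.
from custom_excel_functions import dataframe_to_excel
```

Inside a Dataiku notebook, only the final import line would be written, using whatever module path matches the target path chosen during the import.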
We can always edit, update, delete, or add references using Manage References. For example, to keep our project library in sync with the repository it was imported from, we can Update the reference. “Update” pulls from the remote repository.
As a best practice, avoid making changes locally because modifications made in a project library cannot be pushed back to the original Git repository. A more automated way to manage Git references would be to create a step in a scenario.
If you have Jupyter notebooks that have been developed outside of Dataiku and are available in a Git repository, you can import these notebooks inside a Dataiku project.
To import a notebook from your remote Git repository, add a new notebook and then select Import from Git. After entering the URL and branch, Dataiku lists the notebooks available. Then just select which one you want to import. Imported notebooks are identified with an icon.
You can then manage your imported notebook, including pushing changes back to the remote repository.
To find out more, visit Importing Jupyter Notebooks from Git.
In this lesson, we learned about using code from outside of Dataiku in our project using the Git integration feature in both project libraries and notebooks.
In the lesson on shared code libraries, we learned about sharing the project library with other projects.
While the Git integration feature allows us to connect to a remote Git repository, we have to manage the references ourselves: whenever the repository changes and we want those changes in the project, we must update the reference.
By contrast, using a shared code library, we could manage changes all within the parent project without having to update any references in the child projects. However, when project libraries are shared between projects, and we plan to deploy our project to the automation node, the parent project must be deployed so that the project library is available.
To go further, check out the other lessons included in this course!