Performing Joins in the Prepare Recipe¶
Dataiku DSS supports a number of different ways to join datasets.
In terms of non-code options, the Join recipe is a natural choice. However, simple left-joins on relatively small datasets can be performed using a Prepare recipe or visual analysis.
In addition, the Prepare recipe can also execute two other kinds of powerful joins: a fuzzy join and a geo-join.
Left-join for small datasets¶
The processors library includes an option to perform a memory-based left-join.
In the Prepare recipe below, we join the columns gender and ip_address from the Customers dataset to the Orders dataset.
Fuzzy join¶
The Fuzzy join processor performs a join where the values of the join key do not need to match precisely, but are “close” according to the Damerau-Levenshtein distance.
For example, you may want to enrich a customer database with information on the economies of the countries where your customers live. Using the country names as the key in a traditional join may be difficult, because the country names in one dataset may not exactly match the country names in the other, either because of typos or differences in use of the official or common name of a country. A fuzzy join can help you to successfully merge this data.
With Damerau-Levenshtein distance of 1, United States will match with the typo Untied States, but a distance of 2 is required to match more typos like Untied Stats, and a greater distance to match United States of America. Of course, the difficulty at that point is that increasing the Damerau-Levenshtein distance to match United States and United States of America can also create a match with United Kingdom, so you may need to recode values.
In addition to specifying the Damerau-Levenshtein distance, you can also apply string operations like normalizing the text or stemmings words before attempting the join.
Geo-join¶
The Geo-join processor performs a geographic nearest-neighbor join between two datasets with latitude-longitude coordinates.
For example, you may have determined the locations of customers from their IP addresses. You can enrich this data with information about the closest airport to each customer by using a Geo-join that finds the airport with the shortest distance based on geographic location.
The latitude and longitude of the locations to be joined are used as the keys, and must be available in both datasets. If you have a GeoPoint, you can first apply the Extract lat/lon from GeoPoint processor. You can select which columns from the other dataset should be copied to this one, optionally including the distance between the joined locations.