How to standardize text fields using fuzzy values clustering¶
When working with large amounts of disparate, user-entered text data, we often need to standardize or collapse entries into a resolved form.
For example, how can we get a computer to recognize that strings like “Abraham Lincoln”, “Abe Lincoln”, and “Abrahm Lincoln” are actually the same category? We want to map these close variations into a smaller number of clean labels.
In the image below, the emp_title column has thousands of distinct values and a long tail of unique values (those that appear only once).
Text cleaning strategies like normalization that standardize on case and remove punctuation may help in this regard, but they don’t get us all of the way there. Similarly, if it is user-entered data, you might not be able to introduce a rule-based strategy using regular expressions.
The answer lies in a technique called fuzzy matching.
When viewing a dataset in the Explore tab, select Analyze from the context menu of the column header to review the number of distinct and unique values and their distribution.
If you switch over from the Categorical tab to the Values Clustering tab, you have the option to search for clusters.
You can choose a clustering strategy of “Fuzzy” or “Highly fuzzy” depending on how close you require values in the cluster to be. In this case, we’ll choose “Fuzzy”.
You can also compare words or characters. For this use case, we’ll compare words.
If you repeat these steps in a Prepare recipe, you can take action on these clusters.
You can adjust the final, standardized label if needed. Then, you can click “Merge Selected” to perform a merge on all or a selection of clusters so that the original values are mapped to the new cluster label.
To take the first example, all of these 5 variations will now have the same label– “Dept of Homeland Security”.
This merge action creates a fully-transparent and editable Find and Replace step in the Prepare recipe with hundreds - or even thousands - of individual replacements.
Super simple and a huge time saver!