How-to | Standardize text fields using fuzzy values clustering#

When working with large amounts of disparate, user-entered text data, you often need to standardize or collapse entries into a resolved form.

For example, how can a computer recognize that strings like “Abraham Lincoln”, “Abe Lincoln”, and “Abrahm Lincoln” are actually the same category? You want to map these close variations into a smaller number of clean labels.

In the image below, the emp_title column has thousands of distinct values and a long tail of unique values (those that appear only once).

Text cleaning strategies like normalization that standardize on case and remove punctuation may help in this regard, but they don’t entirely solve the problem. Similarly, if it’s user-entered data, you might not be able to introduce a rule-based strategy using regular expressions.

Fuzzy matching#

The answer lies in a technique called fuzzy matching.

When viewing a dataset in the Explore tab, select Analyze from the context menu of the column header to review the number of distinct and unique values and their distribution.

If you switch over from the Categorical tab to the Values Clustering tab, you have the option to search for clusters.

You can choose a clustering strategy of Fuzzy or Highly fuzzy depending on how close you require values in the cluster to be. In this case, choose Fuzzy.
You can also compare words or characters. For this use case, compare words.

If you repeat these steps in a Prepare recipe, you can act on these clusters.

You can adjust the final, standardized label if needed. Then, you can click “Merge Selected” to perform a merge on all or a selection of clusters so that the original values are mapped to the new cluster label.

To take the first example, all five variations will now have the same label— Dept of Homeland Security.

This merge action creates a fully transparent and editable Find and Replace step in the Prepare recipe with hundreds—or even thousands—of individual replacements.

Super simple and a huge time saver!