How-to | Standardize text fields using fuzzy values clustering#

When working with large amounts of disparate, user-entered text data, you often need to standardize or collapse entries into a resolved form.

For example, how can a computer recognize that strings like “Abraham Lincoln”, “Abe Lincoln”, and “Abrahm Lincoln” are actually the same category? You want to map these close variations into a smaller number of clean labels.

In the image below, the emp_title column has thousands of distinct values and a long tail of unique values (those that appear only once).

../../_images/sshot1.png

Text cleaning strategies like normalization that standardize on case and remove punctuation may help in this regard, but they don’t entirely solve the problem. Similarly, if it’s user-entered data, you might not be able to introduce a rule-based strategy using regular expressions.

Fuzzy matching#

The answer lies in a technique called fuzzy matching.

When viewing a dataset in the Explore tab, select Analyze from the context menu of the column header to review the number of distinct and unique values and their distribution.

../../_images/sshot2.png

If you switch over from the Categorical tab to the Values Clustering tab, you have the option to search for clusters.

  • You can choose a clustering strategy of Fuzzy or Highly fuzzy depending on how close you require values in the cluster to be. In this case, choose Fuzzy.

  • You can also compare words or characters. For this use case, compare words.

If you repeat these steps in a Prepare recipe, you can act on these clusters.

You can adjust the final, standardized label if needed. Then, you can click “Merge Selected” to perform a merge on all or a selection of clusters so that the original values are mapped to the new cluster label.

../../_images/sshot3.png

To take the first example, all five variations will now have the same label— Dept of Homeland Security.

This merge action creates a fully transparent and editable Find and Replace step in the Prepare recipe with hundreds—or even thousands—of individual replacements.

Super simple and a huge time saver!