Concept | Challenges of natural language processing (NLP)#

This article looks at some of the problems you might face when using the bag of N-grams approach and ways to solve those problems. Each case will demonstrate the concept with a simple example.

The following conceptual examples draw on the four simple sentences in the image below.

Redundant features#

Dividing these four short sentences into N-grams shows that some features are redundant. Should North with a capital N be treated as a different feature than north with a lowercase n? What about the singular king and the plural kings?

Without any pre-processing, the N-gram approach will consider them as separate features. However, are they really conveying different information? Ideally, one feature should encapsulate all information conveyed by a word.
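As a rough illustration, the sketch below builds a bag of unigrams and bigrams with no pre-processing. The two sentences are hypothetical stand-ins (not the ones from the image), and the `ngrams` helper is an assumption made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sentences, used only for illustration.
sentences = [
    "North of the wall the king rules",
    "The kings met in the north",
]

# Split on whitespace only: no case folding or other pre-processing.
features = Counter()
for sentence in sentences:
    tokens = sentence.split()
    for n in (1, 2):  # unigrams and bigrams
        features.update(ngrams(tokens, n))

# Show the single-word features: case and plural variants stay separate.
print(sorted(f for f in features if " " not in f))
```

Here `North` and `north`, as well as `king` and `kings`, end up as four distinct features even though they carry nearly the same information.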

Sparse features#

You may also notice that this table of features is quite sparse. Most words, and so most features, are only present in one sentence. Only a few words like king are found in more than one sentence. This sparsity makes it difficult for an algorithm to find similarities between sentences as it searches for patterns.
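One way to put a number on sparsity is to build a small document-term table and count the empty cells. This is a minimal sketch on hypothetical sentences; the variable names are assumptions for this example.

```python
# Hypothetical sentences, used only for illustration.
sentences = [
    "the king rules the north",
    "the queen rules the south",
    "winter is coming",
]

tokenized = [sentence.split() for sentence in sentences]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Binary document-term table: one row per sentence, one column per word.
matrix = [
    [1 if term in tokens else 0 for term in vocabulary]
    for tokens in tokenized
]

# Sparsity is the share of cells that are zero.
total_cells = len(matrix) * len(vocabulary)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

# Words that appear in more than one sentence are the only shared signal.
doc_freq = {
    term: sum(row[j] for row in matrix) for j, term in enumerate(vocabulary)
}
shared = [term for term, count in doc_freq.items() if count > 1]

print(f"sparsity: {sparsity:.2f}")
print("shared words:", shared)
```

With so few shared columns, an algorithm has little overlap to work with when comparing sentences.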

High dimensionality#

Finally, the number of features of a dataset is called its dimensionality. A bag of N-grams approach generates a huge number of features. In this case, four short sentences generated 23 columns. Imagine how many columns a book would generate!

The more features present, the more storage and memory are needed to process them. High dimensionality also creates another challenge: the more features there are, the more possible combinations of feature values exist.

As a result, you need more data to train a model that learns efficiently. That is why ML practitioners often apply techniques that reduce the dimensionality of the training data.
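To get a feel for how quickly the feature count grows, the sketch below counts distinct features as longer N-grams are added. The sentences and the `vocabulary_size` helper are hypothetical, chosen only to keep the example small.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary_size(sentences, max_n):
    """Count distinct features when using all n-grams from 1 up to max_n."""
    features = set()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            features.update(ngrams(tokens, n))
    return len(features)


# Hypothetical sentences, used only for illustration.
sentences = [
    "the king rules the north",
    "the queen rules the south",
]

sizes = [vocabulary_size(sentences, max_n) for max_n in (1, 2, 3)]
for max_n, size in zip((1, 2, 3), sizes):
    print(f"n-grams up to length {max_n}: {size} features")
```

Even on two five-word sentences, each extra N-gram length adds more columns than the one before it contributes on its own.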

Text cleaning tools#

To reduce these three problems, let’s look at three of the most basic yet most valuable data cleaning techniques for NLP:

Normalizing text.
Removing stopwords.
Stemming.

Normalizing text#

Text normalization is about transforming text into a standard format. This involves a number of steps, such as:

Converting all characters to the same case.
Removing punctuation and special characters.
Removing diacritical accent marks.

Applying normalization to the example allows removal of two columns (the duplicate versions of north and but) without losing any valuable information. Combining the title case and lowercase variants also has the effect of reducing sparsity, since these features are now found across more sentences.
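The three steps listed above can be sketched with the Python standard library alone. The `normalize` function below is an illustrative assumption, not the implementation used in Dataiku, and `string.punctuation` covers ASCII punctuation only.

```python
import string
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics, and remove ASCII punctuation."""
    # 1. Convert all characters to the same case.
    text = text.lower()
    # 2. Remove diacritics: decompose characters (NFD), then drop the
    #    combining marks that carry the accents.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 3. Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any repeated whitespace left behind.
    return " ".join(text.split())


print(normalize("The North remembers; but the north is far, très far!"))
```

After this step `North` and `north` map to the same token, so they share one column. Note that deleting punctuation outright also merges hyphenated words and contractions, which may or may not be what you want.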

Removing stopwords#

Next, you might notice that many of the features are common words, such as the, is, and in. In NLP, the collection of these common words is called stopwords.

Because words like these carry little information by themselves, removing them from the training data is usually a good idea.

In this example, removing stopwords reduced the dataset from 21 columns to 11 columns.
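Stopword removal amounts to filtering tokens against a list. The sketch below uses a tiny hypothetical list; real stopword lists are much longer and depend on the language and the tool.

```python
# Tiny illustrative stopword list (an assumption for this example).
STOPWORDS = {"the", "is", "in", "a", "of", "but"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = "the king is in the north".split()
print(remove_stopwords(tokens))  # ['king', 'north']
```

The filter assumes the text has already been normalized; otherwise a capitalized `The` would slip past a lowercase stopword list.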

Stemming#

These simple techniques have made good progress in reducing the dimensionality of the training data, but there is more to do. Note that the singular king and the plural kings remain as separate features in the image above despite containing nearly the same information.

You can apply another pre-processing technique called stemming to reduce words to their word stem. For example, words like assignee, assignment, and assigning all share the same word stem — assign. Reducing words to their word stem can collect more information in a single feature.

Applying stemming to the four sentences reduces the plural kings to its singular form king, which removes one more feature from the dataset.

However, this operation wasn’t without consequence. You can’t expect stemming to be perfect. Some features were damaged. In this case, the words everywhere and change both lost their last e. In another course, we’ll discuss how another technique called lemmatization can correct this problem by returning a word to its dictionary form.
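To make both the benefit and the damage concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy written for this example, not the Porter algorithm or any library's stemmer; the suffix list and the minimum stem length are assumptions.

```python
# Suffixes tried in order; the first match is stripped (toy rule set).
SUFFIXES = ("ment", "ing", "ee", "s", "e")


def toy_stem(word, min_stem_length=3):
    """Strip one known suffix, unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for word in ["kings", "king", "assignee", "assignment", "assigning",
             "change", "everywhere"]:
    print(word, "->", toy_stem(word))
```

The plural `kings` collapses to `king`, and the three `assign` variants share one stem, but `change` and `everywhere` come out as `chang` and `everywher`, the same kind of damage described above.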

Next steps#

Thus far, you’ve seen three problems linked to the bag of N-grams approach and been introduced to three techniques for improving the quality of features. Next, we’ll walk through how to implement these techniques in Dataiku.

Then, you’ll start building models!