Concept Summary: Preparing Text Data

In the previous section, we saw how the bag of N-grams approach allowed us to understand the intuition behind creating features from raw text in an NLP project.

In this lesson, we’ll look at some of the problems we might run into when using the bag of N-grams approach and ways to solve them. For each case, we’ll first demonstrate the concept with a simple example and then walk through its implementation in Dataiku DSS.

Challenges of NLP

For the following conceptual examples, we’ll draw on the four simple sentences in the image below.

Redundant Features

After dividing these four short sentences into N-grams, we find that some features are redundant. Should “North” with a capital “N” be treated as a different feature than “north” with a lowercase “n”? What about the singular “king” and the plural “kings”?


Without any pre-processing, our N-gram approach will consider them as separate features, but are they really conveying different information? Ideally, we want all of the information conveyed by a word encapsulated into one feature.

Sparse Features

You may also notice that this table of features is quite sparse. Most words, and so most features, are only present in one sentence. Only a few words like “king” are found in more than one sentence. This sparsity will make it difficult for an algorithm to find similarities between sentences as it searches for patterns.
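To make the idea concrete, here is a minimal sketch that measures sparsity as the fraction of zero entries in a document-term matrix. The matrix values below are invented for illustration, since the lesson’s real feature table appears only in an image:

```python
# Toy document-term matrix: rows are sentences, columns are words.
# (Invented values for illustration; the lesson's real table is an image.)
rows = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
]

# Sparsity = fraction of entries that are zero.
zeros = sum(value == 0 for row in rows for value in row)
total = len(rows) * len(rows[0])
print(f"{zeros / total:.0%} of entries are zero")  # prints "60% of entries are zero"
```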


High Dimensionality

Finally, the number of features of a dataset is called its dimensionality. A bag of N-grams approach generates a huge number of features. In this case, four short sentences generated 23 columns. Imagine how many columns a book would generate!
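A quick sketch in Python shows how fast the vocabulary, and hence the column count, grows. The four sentences here are invented stand-ins, since the lesson’s originals appear only in an image:

```python
# Four invented example sentences (the lesson's own sentences are in an image).
sentences = [
    "The king rules in the north",
    "A king has many castles",
    "Winter is coming to the north",
    "The queen sails south but returns",
]

# In a bag-of-unigrams model, every distinct token becomes a column.
vocabulary = sorted({word for s in sentences for word in s.lower().split()})
print(len(vocabulary), "columns for", len(sentences), "short sentences")
```

Adding bigrams and trigrams on top of the unigrams would multiply the column count further.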


The more features you have, the more storage and memory you need to process them. High dimensionality also creates another challenge: the more features you have, the more possible combinations between features there are, and the more data you’ll need to train a model that learns efficiently. That is why we often apply techniques that reduce the dimensionality of the training data.

Text Cleaning Tools

To lessen the three problems of redundant features, sparsity in features, and high dimensionality, let’s look at three of the most basic and most valuable data cleaning techniques for NLP:

  • normalizing text,

  • removing stopwords,

  • and stemming.

Normalizing Text

Text normalization is about transforming text into a standard format. This involves a number of steps, such as:

  • converting all characters to the same case,

  • removing punctuation and special characters,

  • and removing diacritical accent marks.
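As a rough sketch of these steps in Python (one possible implementation using only the standard library, not the exact one Dataiku DSS uses):

```python
import string
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and drop diacritical accent marks."""
    text = text.lower()
    # Decompose accented characters (e.g. "e" with an accent becomes
    # "e" plus a combining mark), then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation and special characters.
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalize("The King's café, up North!"))  # prints "the kings cafe up north"
```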


Applying normalization to our example allowed us to eliminate two columns – the duplicate versions of “north” and “but” – without losing any valuable information. Combining the titlecase and lowercase variants also has the effect of reducing sparsity, since these features are now found across more sentences.


Removing Stopwords

Next, you might notice that many of the features are very common words – like “the”, “is”, and “in”. In NLP, these very common words are called stopwords.

Removing words like these from the training data is a good idea, since they carry little information on their own.
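A minimal sketch of the idea follows; the stopword set here is a tiny invented sample, whereas real NLP tools ship full language-specific lists:

```python
# A tiny illustrative stopword set; real lists are much longer.
STOPWORDS = {"the", "is", "in", "a", "of", "and", "to"}

def remove_stopwords(text: str) -> str:
    """Drop any token found in the stopword set."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(remove_stopwords("the king is in the north"))  # prints "king north"
```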

In this example, we’ve reduced the dataset from 21 columns to 11 columns just by removing stopwords.



Stemming

We’ve made good progress in reducing the dimensionality of the training data, but there is more we can do. Note that the singular “king” and the plural “kings” remain as separate features in the image above despite containing nearly the same information.

We can apply another pre-processing technique called stemming to reduce words to their “word stem”. For example, words like “assignee”, “assignment”, and “assigning” all share the same word stem – “assign”. By reducing words to their word stem, we can capture more information in a single feature.
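To see how a stemmer can both help and over-stem, here is a deliberately crude suffix-stripper. This is an illustrative toy, not the ordered rule sets that real stemmers such as Porter or Snowball use:

```python
def crude_stem(word: str) -> str:
    """Strip a few common suffixes; deliberately naive for illustration."""
    for suffix in ("ing", "ment", "s", "e"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("kings"))      # prints "king"
print(crude_stem("assigning"))  # prints "assign"
print(crude_stem("change"))     # prints "chang" (over-stemming, as in the lesson)
```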


Applying stemming to our four sentences reduces the plural “kings” to its singular form, “king”, eliminating one more feature from the dataset.


However, this operation was not without consequence. We cannot expect stemming to be perfect. Some features were damaged. In this case, the words “everywhere” and “change” both lost their last “e”. In another course, we’ll discuss how another technique called lemmatization can correct this problem by returning a word to its dictionary form.

Exploring Text Data

Having introduced these concepts, let’s see how we can implement these techniques in Dataiku DSS.

Consider a simple dataset of SMS messages. One column contains the raw SMS messages. The other is a label: 1 for a spam message and 0 for a non-spam message. Our task is to train a model that can classify SMS messages into these two categories.


Just browsing these messages, we can see that normal human language is far from clean. It is filled with abbreviations, misspellings, and unusual punctuation.

It is helpful to explore the data using the Analyze window before cleaning it with a Prepare recipe or attempting to create a model.

After computing clusters, we can see that many messages, particularly spam messages, follow very similar formats, perhaps only changing the phone number where the recipient should reply.


We can also compute counts of the most common words. If we fail to normalize text, we can see that lowercase and uppercase “u” are treated as two different words. Keeping this distinction probably won’t help our classifier, as it will have to learn that these two words carry the same information.
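The effect is easy to reproduce with a quick count. The messages below are invented examples, not rows from the actual SMS dataset:

```python
from collections import Counter

# Invented messages for illustration only.
messages = ["U won a prize", "call u back", "U R a winner"]

# Without normalization, "U" and "u" are counted as different words.
raw_counts = Counter(word for m in messages for word in m.split())
print(raw_counts["U"], raw_counts["u"])  # prints "2 1"

# After lowercasing, they merge into a single feature.
norm_counts = Counter(word for m in messages for word in m.lower().split())
print(norm_counts["u"])  # prints "3"
```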


Cleaning Text Data

With some knowledge of our dataset, let’s start cleaning in a Prepare recipe.

Simplify Text

In the processors library, the most important step for text cleaning is to apply the Simplify text processor.


The Simplify text processor takes a column of text as input, and outputs the transformed text to a new column, or in place if the output column field is left empty. This processor offers four kinds of text transformations.

The first, “Normalize text”, transforms all text to lowercase, removes punctuation and accents, and performs unicode normalization.


The second, “Sort words alphabetically”, returns the input string with its words sorted in alphabetical order. This allows us to match strings that are written with the same words, but in a different order.
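In plain Python, the transformation amounts to something like this sketch:

```python
def sort_words(text: str) -> str:
    """Return the words of the input sorted alphabetically."""
    return " ".join(sorted(text.split()))

# Two messages with the same words in a different order now match.
print(sort_words("call me now"))  # prints "call me now"
print(sort_words("now call me"))  # prints "call me now"
```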


The third option, “Stem words”, tries to reduce words to their grammatical root. This option is available in several different European languages. Note how a word like “watching” becomes “watch”. But it’s not without consequences. Note how “remember” becomes “rememb”.


The last option is to remove stopwords. Recall that stopwords are very common words like “the”, “I”, “a”, and “of” that do not carry much information; they mostly add noise to the text data. This transformation is also language-specific.

We can see that a message like “I have a date on Sunday with Will” becomes just “date sunday will”.


Other Processors

The Simplify text processor is not the only tool to help you prepare natural language data in Dataiku DSS.

Other processors, for example, can help you extract numbers or count occurrences of patterns. The Formula language can also be helpful to build new features from text data.

A built-in function like length(), for example, may provide useful information to classify spam messages.


Perhaps a feature like the ratio of the length of the raw SMS to the length of the simplified SMS may also contain some useful information.
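As a sketch of this hypothetical feature in Python, using the example message from the stopword-removal illustration earlier in this lesson:

```python
def length_ratio(raw: str, simplified: str) -> float:
    """Ratio of raw message length to simplified message length.
    Guards against division by zero when simplification empties the text."""
    return len(raw) / max(len(simplified), 1)

raw = "I have a date on Sunday with Will"
simplified = "date sunday will"
print(length_ratio(raw, simplified))  # prints "2.0625"
```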

When we are satisfied with our data cleaning and preparation steps, we can run the Prepare recipe to apply the steps to the whole dataset.

What’s next?

Thus far, we have seen three problems linked to the bag of N-grams approach and introduced three techniques for improving the quality of features. We then walked through how to implement these techniques in Dataiku DSS.

Now, as we’ll show in the next section, we are ready to start building models!