How to apply the nlp-primitives library using Featuretools.
When trying to leverage real-world data in a machine learning pipeline, it is common to come across text. Text data can include a lot of valuable information, but is often ignored because it is hard to translate text into meaningful numbers that algorithms can interpret.
In this article, I explore the use of the nlp-primitives library to create features from text data by using an example dataset to investigate these additional features in a machine learning model. Then, I explain why these primitives have such an outsized effect on the accuracy of the model.
A Machine Learning Demo Using nlp-primitives
In this demo, I will be using data from this Kaggle dataset, which contains 100 reviews for 57 restaurants in the San Francisco area. The dataset is relatively small, but it is adequate to demonstrate feature engineering with Featuretools and the nlp-primitives library. You can follow along using the Jupyter Notebook from this repository.
First, let’s establish the problem: to identify each reviewer’s opinion about a restaurant based on their feedback. The success of the model will be judged on how accurately it can predict the rating given by each reviewer from the review text and the other available data. In the dataset, there are 5 possible ratings, from 1 star to 5 stars, so this is a 5-class classification problem.
To evaluate the effectiveness of the NLP primitives, I first created a baseline feature matrix using Deep Feature Synthesis (DFS) without these new primitive functions. A machine learning model trained on this baseline feature matrix reached an accuracy score of about 50%.
baseline_feature_matrix, baseline_features = ft.dfs(entityset=es, target_entity='reviews', verbose=True, ignore_variables=ignore)
Built 33 features
base_rfc = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=-1)
base_rfc.fit(baseline_train_fm, baseline_y_train)
base_rfc.score(baseline_test_fm, baseline_y_test)
This model predicted most reviews as being in the most common class of reviews, so it was not very accurate with its predictions, as you can see in the confusion matrix below.
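To see how a model can reach about 50% accuracy while learning very little, here is a minimal sketch in plain Python. The labels are invented for illustration and are not the notebook's data, and the `confusion_matrix` helper is a hypothetical stand-in written for this example rather than a library function.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Count predictions: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical star ratings, invented for illustration only.
y_true = [5, 5, 5, 5, 5, 4, 4, 3, 2, 1]

# A degenerate model that always predicts the most common class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_true, y_pred, labels)
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)

for label, row in zip(labels, cm):
    print(label, row)
print("accuracy:", accuracy)
```

Every count lands in the last column, so the accuracy only reflects how common the majority class is.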
Between the Baseline Model and the model using NLP Primitives, only one factor changed: the use of the nlp-primitives library.
trans = [DiversityScore, LSA, MeanCharactersPerWord, PartOfSpeechCount, PolarityScore, PunctuationCount, StopwordCount, TitleWordCount, UniversalSentenceEncoder, UpperCaseCount]
features = ft.dfs(entityset=es, target_entity='reviews', trans_primitives=trans, verbose=True, features_only=True, ignore_variables=ignore, drop_contains=drop_contains, max_depth=4)
Built 333 features
From this small change in the DFS call, the number of features generated increased by a factor of 10.
This library was extremely easy to incorporate: only a few extra lines of code were needed to import the library and its primitives and then add these primitives to the default primitives used by the ft.dfs function to create features. In both the baseline model and the nlp-primitives model, DFS was used to find features, though the nlp-primitives model used a modified max_depth setting that allowed DFS to create features that stack other primitives on top of the NLP features.
After running DFS and creating the resultant feature matrix, we can split the data into training and testing sets, and use these sets in sklearn machine learning models to test their accuracy.
vot = VotingClassifier(voting='soft', estimators=[('lgr', lgr), ('rfc', rfc), ('hgbc', hgbc)], weights=[3, 1, 6])
vot.fit(train_feature_matrix, y_train)
vot.score(test_feature_matrix, y_test)
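Soft voting with weights can be pictured as a weighted average of each model's predicted class probabilities, followed by picking the class with the highest average. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation: the `soft_vote` helper and the probability values are hypothetical, and only three classes are used for brevity.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities, then argmax."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, best

# Hypothetical predicted probabilities for a single review (three classes).
lgr_proba = [0.6, 0.3, 0.1]
rfc_proba = [0.2, 0.5, 0.3]
hgbc_proba = [0.1, 0.2, 0.7]

# Same weights as the VotingClassifier call: the third model counts the most.
averaged, best = soft_vote([lgr_proba, rfc_proba, hgbc_proba], weights=[3, 1, 6])
print(averaged, best)
```

Because the third model carries a weight of 6 out of 10, its preferred class wins here even though the other two models disagree with it.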
When using the nlp-primitives library, the model achieved about 70% accuracy. In the confusion matrix, darker blue marks where the predictions fall: they are spread across the classes rather than concentrated in one, and most of the incorrect guesses land within one star (±1) of the actual rating, which shows up as a more apparent dark diagonal. A perfect model would have a 1 in every cell of the downward diagonal, where the predicted and true labels agree, and a 0 in every other cell (learn more about confusion matrices here).
Both of these models use similar training and testing steps (the baseline model uses a slightly less complex function because more complex functions didn’t change the accuracy), yet the accuracy of the model with the NLP features is about 40% higher than the baseline in relative terms (roughly 70% versus 50%, a gain of about 20 percentage points). Since everything else stayed the same, this large increase in accuracy can be attributed to the nlp-primitives library. Furthermore, when we examine the feature importances, we see that the features using the NLP primitives are ranked the highest (see the notebook for more details).
Why Do These Primitives Make a Difference?
Data must be formatted as numbers in order for a machine learning model to “learn” from it. Text is hard to put into numbers, or at least hard to put into numbers without losing a lot of meaning. For example, it is fairly simple to get the word count of a body of text, but, oftentimes, this is not an adequate measure of meaning. Even though this might be a useful feature at times, there is so much more to text than the number of words it contains.
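A small sketch makes the limitation concrete. The function below computes a few simple descriptive metrics in plain Python. It is a hypothetical illustration, not how the library's primitives are implemented, and the two review strings are invented.

```python
import string

def simple_text_features(text):
    """Compute a few surface-level metrics that ignore what the words mean."""
    words = text.split()
    stripped_lengths = [len(w.strip(string.punctuation)) for w in words]
    return {
        "word_count": len(words),
        "mean_chars_per_word": sum(stripped_lengths) / len(words) if words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "upper_case_count": sum(ch.isupper() for ch in text),
    }

# Two invented reviews with opposite sentiment.
positive = simple_text_features("The food was great, truly great!")
negative = simple_text_features("The food was awful, truly awful!")

print(positive)
print(positive == negative)
```

The two reviews express opposite opinions, yet every metric is identical, so a model given only these numbers cannot tell them apart.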
So, what is the solution? How can text be encoded into numbers in a meaningful way? One solution is to vectorize the meaning of the text. NLP primitives such as the UniversalSentenceEncoder, LSA (Latent Semantic Analysis), and PartOfSpeechCount use this method. They are all multi-output primitives, meaning that they take one field and create a feature with many fields. In this case, these fields represent the dimensions of a vector. In the example below, each string of text corresponds to two outputs, as LSA creates a vector of length two for each string given.
from nlp_primitives import LSA
import pandas as pd
data = ["hello, this is a new featuretools library", "this will add new natural language primitives", "we hope you like it!"]
lsa = LSA()
pd.DataFrame(lsa(data).tolist()).T
In this next example, the primitive, PartOfSpeechCount, generates 15 values for each input. Each dimension of this vector represents a part of speech, and the number of times that part of speech appears in the input text field.
from nlp_primitives import PartOfSpeechCount
import pandas as pd
data = ["hello, this is a new featuretools library", "this will add new natural language primitives", "we hope you like it!"]
pscount = PartOfSpeechCount()
pd.DataFrame(pscount(data).tolist()).T
These primitives encode the meaning of the text field in the vector in such a way that two text fields with similar meanings have similar vectors, even if they are composed of different words. This makes these methods especially useful: because similar meanings are encoded in similar ways, a machine learning model can learn the outcome associated with specific vectors and carry that result over to similar vectors.
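One common way to quantify how similar two vectors are is cosine similarity. The sketch below uses plain Python and hypothetical three-dimensional vectors chosen by hand; they are not real encoder outputs, which have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for encoded review text.
great_meal = [0.9, 0.1, 0.2]
tasty_dinner = [0.8, 0.2, 0.1]
rude_waiter = [-0.7, 0.6, 0.1]

sim_close = cosine_similarity(great_meal, tasty_dinner)
sim_far = cosine_similarity(great_meal, rude_waiter)
print(sim_close, sim_far)
```

Vectors that point in nearly the same direction score close to 1, while vectors that point in different directions score lower, which is the property a model can exploit.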
However, it is often challenging to deal with many outputs, especially when trying to stack primitives, that is, to use the outputs of some primitives as the inputs of others. Stacking yields even more information when many entities, or sources of data, exist. Featuretools handles this well, which lets users gather information across entities as well as within them and leverage the available data to its fullest extent. Featuretools’ ability to stack primitives automatically extends the power of each individual primitive even further in the feature engineering step.
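As a rough picture of what stacking means, the sketch below applies a transform to each review and then aggregates the result per restaurant. It is a hand-written illustration in plain Python under assumed data: the `reviews` rows, `punctuation_count`, and `mean_by_parent` are all hypothetical, and Featuretools builds such features automatically rather than through code like this.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: each review belongs to one restaurant.
reviews = [
    {"restaurant_id": "a", "text": "Great food!"},
    {"restaurant_id": "a", "text": "Slow service, but tasty."},
    {"restaurant_id": "b", "text": "Never again!!!"},
]

def punctuation_count(text):
    """Transform step: one number per review."""
    return sum(ch in "!?.,;:" for ch in text)

def mean_by_parent(rows, key, value_fn):
    """Aggregation step stacked on the transform: one number per parent."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(value_fn(row["text"]))
    return {parent: mean(values) for parent, values in grouped.items()}

stacked = mean_by_parent(reviews, "restaurant_id", punctuation_count)
print(stacked)
```

The per-restaurant average is a feature that exists only because one primitive's output was fed into another across two entities.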
- The nlp-primitives library enhances the accuracy of models when dealing with text data. This additional accuracy stems from encoding the meaning of the text, rather than just calculating simple descriptive metrics of it.
- Using the right tool makes the machine learning process easier. When we changed a few lines to incorporate the nlp-primitives library, the accuracy went up, and the complexity of the code stayed the same.
- Correctness can be evaluated in more ways than one. It is important to understand what is going wrong when a model isn’t accurate—is it just wrong (like the baseline model) or is it wrong, but close to being right (like the nlp-primitives model)?
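The last point can be made measurable by reporting, alongside exact accuracy, the share of predictions that land within one star of the true rating. The sketch below uses plain Python with invented labels; `within_one_accuracy` is a hypothetical helper and is not a metric computed in the notebook.

```python
def exact_accuracy(y_true, y_pred):
    """Share of predictions that match the true rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    """Share of predictions that are at most one star away from the truth."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented ratings and two invented sets of predictions.
y_true = [5, 4, 3, 2, 1, 5, 4, 3]
always_five = [5, 5, 5, 5, 5, 5, 5, 5]   # just wrong for most classes
near_misses = [4, 4, 2, 3, 1, 5, 5, 3]   # wrong, but close to being right

for name, y_pred in [("always_five", always_five), ("near_misses", near_misses)]:
    print(name, exact_accuracy(y_true, y_pred), within_one_accuracy(y_true, y_pred))
```

The second set of predictions misses half the exact ratings but never by more than one star, a distinction that exact accuracy alone hides.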
This library has been released and can be installed on its own or through Featuretools using pip. However, it is still in the early stages of development, and we welcome any feedback on it. Furthermore, the field of Natural Language Processing is a fast-moving one, so if you see an opportunity to add a new primitive, please suggest it as a GitHub issue, or try to create it yourself. The library is open source, so anyone can contribute!