Using Text Data in EvalML with Woodwork – Alteryx

Text can be a rich and informative type of data. It can be used in a variety of tasks, including sentiment analysis, topic extraction, and spam detection. However, raw text cannot be fed directly to machine learning algorithms, because most models can only understand numeric values. Thus, to utilize text as data in machine learning, it must first be processed and transformed to numeric values.

In this post, we will learn how we can use EvalML to detect spam text messages by framing it as a binary classification problem using text data. EvalML is an AutoML library written in Python that uses Woodwork to detect and specify how data should be treated, and the nlp-primitives library to create meaningful numeric features from raw text data.

Spam Dataset

The dataset we will be using in this demo consists of SMS text messages in English, some of which are tagged as legitimate (“ham”) and others as spam. For this demo, we have modified the original dataset from Kaggle by joining all of the input text columns and downsampling the majority class (“ham”) so that the “ham” to “spam” ratio is 3:1. From here on, any reference to the data means this modified, smaller dataset.
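To make the downsampling idea concrete, here is a minimal sketch in plain Python. It is not the script used to prepare the dataset: the helper name `downsample_majority`, the seed, and the label counts are all hypothetical and chosen only for illustration.

```python
import random

def downsample_majority(labels, majority="ham", ratio=3, seed=0):
    """Return indices that keep every minority row and at most
    ratio * (minority count) randomly chosen majority rows."""
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    minority_idx = [i for i, label in enumerate(labels) if label != majority]

    keep = min(len(majority_idx), ratio * len(minority_idx))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(majority_idx, keep)
    return sorted(sampled + minority_idx)

# Hypothetical, heavily imbalanced labels (not the real dataset)
labels = ["ham"] * 900 + ["spam"] * 100
kept = downsample_majority(labels)
kept_labels = [labels[i] for i in kept]
print(kept_labels.count("ham"), kept_labels.count("spam"))  # 300 100
```

All minority (“spam”) rows are kept, and only the majority class is sampled down, which is what yields the 3:1 ratio.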

Let’s load in our data and display a few rows to understand what our text messages look like:

from urllib.request import urlopen
import pandas as pd

input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')
data = pd.read_csv(input_data)
X = data.drop(['Category'], axis=1)
y = data['Category']
display(X.head())

We can plot the frequency of our target values to verify that the ratio of “ham” to “spam” in our modified dataset is approximately 3:1.

y.value_counts().plot.pie(figsize=(10,10))

The ratio of “ham” to “spam” is approximately 3:1

Because the ratio of “ham” to “spam” is 3:1, a trivial model that always predicts the majority “ham” class achieves 75% accuracy. Treating “spam” as the positive class, this model has a recall of 0%, since it never classifies a “spam” sample correctly, and a balanced accuracy of 50%. A useful machine learning model should therefore score above 75% accuracy, above 0% recall, and above 50% balanced accuracy.

	Baseline model (always guesses majority class)
Accuracy	75%
Balanced Accuracy	50%
Recall	0%
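These baseline numbers can be checked with a few lines of plain Python. The sketch below uses a hypothetical label list with a 3:1 ratio rather than the actual dataset, and the helper `baseline_scores` is our own illustration, not part of EvalML.

```python
def baseline_scores(labels, positive="spam"):
    """Score a model that always predicts the majority class."""
    majority = max(set(labels), key=labels.count)
    predictions = [majority] * len(labels)

    correct = sum(p == t for p, t in zip(predictions, labels))
    accuracy = correct / len(labels)

    # Per-class recall: fraction of each class's samples predicted correctly
    recalls = {}
    for cls in set(labels):
        total = labels.count(cls)
        hits = sum(p == t == cls for p, t in zip(predictions, labels))
        recalls[cls] = hits / total

    # Balanced accuracy is the mean of the per-class recalls
    balanced_accuracy = sum(recalls.values()) / len(recalls)
    return accuracy, balanced_accuracy, recalls[positive]

# Hypothetical labels with a 3:1 ham-to-spam ratio
labels = ["ham"] * 75 + ["spam"] * 25
accuracy, balanced_accuracy, recall = baseline_scores(labels)
print(accuracy, balanced_accuracy, recall)  # 0.75 0.5 0.0
```

Balanced accuracy lands at 50% because the model gets every “ham” sample right (recall of 100% for that class) and every “spam” sample wrong (recall of 0%).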

Let’s generate a model using EvalML and see if we can do better than this trivial model!

Introducing Woodwork

Before feeding our data into EvalML, we have a more fundamental issue to address: How can we specify that our data should be treated as text data? Using pandas alone, we can’t distinguish between text data and non-text data (such as categorical data) because pandas uses the same object data type to store both. How can we make sure that our models correctly treat our text messages as text data, and not as hundreds of different unique categories?

pandas treats “Message” as an “object” data type by default

EvalML utilizes the open-source Woodwork library to detect and specify how each feature should be treated, independent of its underlying physical data type. This means we can treat columns with the same physical data type differently. For example, we can specify that we want some columns that contain text to be treated as categorical columns, while we treat other columns with text as natural language columns, even if these columns have the same underlying object datatype. This differentiation allows us to clear up the ambiguity between features that may have the same underlying datatype in pandas, but ultimately represent different types of data.

Here, we initialize a Woodwork DataTable with our feature. Our single Message feature is automatically detected as a natural language or text column.

import woodwork as ww
X = ww.DataTable(X)

# Note: We could have also manually set the Message column to
# natural language if Woodwork had not automatically detected it
from evalml.utils import infer_feature_types
X = infer_feature_types(X, {'Message': 'NaturalLanguage'})

Our “Message” feature is automatically detected as a natural language (text) column

We can also initialize a Woodwork DataColumn for our target.

y = ww.DataColumn(y)

Our target is automatically detected as a categorical column. This makes sense, since we have a binary classification problem with two categories of text messages: spam and ham.

Our target (“y”) is automatically detected as a categorical column

Running AutoMLSearch

Now, let’s feed our data to AutoMLSearch to see if we can produce a nontrivial machine learning model to detect spam. AutoML is the process of automating the construction, training, and evaluation of machine learning models. AutoMLSearch is EvalML’s interface for AutoML.

First, we will split our data into training and test data sets. We will use the training data set to train and find the best model, and then validate our model’s performance on the test data.

EvalML offers a utility method that makes this easy. All we need to do is specify that we have a binary classification problem, and that we want to reserve 20% of our data as test data.

from evalml.preprocessing import split_data
X_train, X_holdout, y_train, y_holdout = split_data(X, y, problem_type='binary', test_size=0.2)

Next, we can set up AutoMLSearch by specifying the problem type and passing in our training data. Again, we have a binary classification problem because we are trying to classify our messages as one of two categories: ham or spam.

from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')

Calling the constructor initializes an AutoMLSearch object that is configured for our data. Now, we can call automl.search() to start the AutoML process. This will automatically generate pipelines for our data, and then train a collection of various models.

automl.search()

EvalML’s AutoML search has trained and evaluated nine different models.

To understand the type of pipelines AutoMLSearch has built, we can grab the best performing pipeline and examine it in greater detail. We can call automl.describe_pipeline(id) to see detailed information about the pipeline’s components and performance, or automl.graph(pipeline) to see a visual representation of our pipeline as a flow of components.

# rankings are ordered from best to worst,
# so the 0th index is the best pipeline
best_pipeline_id = automl.rankings.iloc[0]["id"]
automl.describe_pipeline(best_pipeline_id)

# We can also grab the best performing pipeline like this
automl.best_pipeline
automl.graph(automl.best_pipeline)

Graphical representation of our best pipeline

By examining the best performing pipeline, we can better understand what AutoMLSearch is doing, and what pipelines it built with our text data. The best pipeline consists of an Imputer, a Text Featurization Component and a Random Forest Classifier component. Let’s break this down and understand how this pipeline was constructed:

AutoMLSearch always adds an Imputer to each generated pipeline to handle missing values. By default, the Imputer will fill the missing values in numeric columns with the mean of each column, and fill the missing values in categorical columns with the most frequent category of each column. Because we don’t have any categorical or numeric columns in our input, the Imputer does not transform our data.
Since AutoMLSearch identified a text column (our Message feature), it appended a Text Featurization Component to each pipeline. This component first cleans the text input by removing all non-alphanumerical characters (except spaces) and converting the text input to lowercase. The component then processes the cleaned text features by replacing each text feature with representative numeric features using LSA and the nlp-primitives package. This component is necessary if we want to handle text features in machine learning, because most machine learning models are not able to handle text data natively. Thus, we need this component to help extract useful information from the raw text input and convert it to numeric values that the models can understand.
Finally, each pipeline has an estimator (a model) which is fitted on our transformed training data and is used to make predictions. Our best pipeline has a Random Forest classifier. If we took a look at some other pipelines, we would also see other pipelines constructed with a LightGBM classifier, Decision Tree classifier, XGBoost classifier, etc.
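The cleaning step described above can be sketched with Python’s standard `re` module. This is only an illustration of the behavior as described in this post (strip non-alphanumeric characters except spaces, then lowercase); the helper `clean_text` and the example message are hypothetical, and EvalML’s actual component may differ in its details.

```python
import re

def clean_text(message):
    """Drop non-alphanumeric characters (except spaces) and lowercase."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", message).lower()

# Made-up example message, not taken from the dataset
cleaned = clean_text("WINNER!! Claim your FREE prize now: call 0800-123")
print(cleaned)  # winner claim your free prize now call 0800123
```

The cleaned string is still text. Turning it into numeric features is the job of the LSA and nlp-primitives steps that follow the cleaning.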

Best Pipeline Performance

Now, let’s score our best pipeline on the test data to see how it performs on various metrics and whether it beats the trivial baseline model.

>>> import evalml
>>> best_pipeline = automl.best_pipeline
>>> scores = best_pipeline.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary') + ['recall'])
>>> scores
OrderedDict([('MCC Binary', 0.9278003804626707),
             ('Log Loss Binary', 0.1137465525638786),
             ('AUC', 0.9823022077397945),
             ('Precision', 0.9716312056737588),
             ('F1', 0.9448275862068964),
             ('Balanced Accuracy Binary', 0.9552772006397513),
             ('Accuracy Binary', 0.9732441471571907),
             ('Recall', 0.9194630872483222)])

Our best pipeline performs much better than the baseline

	Baseline model (always guesses majority class)	Pipeline with Text Featurization Component
Accuracy	75%	97.32%
Balanced Accuracy	50%	95.53%
Recall	0%	91.95%

We have significantly outperformed the baseline model in the three metrics (accuracy, balanced accuracy, and recall) we were focused on! With EvalML, we were able to build a model that is able to detect spam fairly well with just a few lines of code, and even before doing any tuning of the binary classification decision threshold.

The Importance of Text

We previously discussed that Woodwork had automatically detected that our Message column was a natural language feature. We now understand that AutoMLSearch was able to create a Text Featurization Component because it identified this natural language column. To see why this matters, we can manually set our Message feature as a categorical feature, run the same steps, and compare our scores.

from evalml.utils import infer_feature_types

# manually set "Message" feature as categorical
X = infer_feature_types(X, {'Message': 'Categorical'})
X_train, X_holdout, y_train, y_holdout = split_data(X, y, problem_type='binary', test_size=0.2, random_seed=0)
automl_no_text = AutoMLSearch(X_train=X_train, y_train=y_train,
                              problem_type='binary')
automl_no_text.search()

If we score the best pipeline found this time, we get an accuracy score of 75.25%, a balanced accuracy score of 50.34%, and a recall score of 0.67%. These scores are only marginally better than the scores for our baseline model!

>>> best_pipeline_no_text = automl_no_text.best_pipeline
>>> scores = best_pipeline_no_text.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary') + ['recall'])
>>> scores
OrderedDict([('MCC Binary', 0.0710465299061946),
             ('Log Loss Binary', 0.5576891229036224),
             ('AUC', 0.5066740407467751),
             ('Precision', 1.0),
             ('F1', 0.013333333333333332),
             ('Balanced Accuracy Binary', 0.5033557046979866),
             ('Accuracy Binary', 0.7525083612040134),
             ('Recall', 0.006711409395973154)])

The scores for our best pipeline here are not much better than our baseline scores

	Baseline model (always guesses majority class)	Pipeline with Text Featurization Component	Pipeline without Text Featurization Component
Accuracy	75%	97.32%	75.25%
Balanced Accuracy	50%	95.53%	50.34%
Recall	0%	91.95%	0.67%

This means that, unlike the previous best model, this model is barely better than the trivial baseline that always guesses the majority “ham” class. By observing the components that make up this pipeline, we can better understand why.

automl_no_text.graph(best_pipeline_no_text)

Graph of our best pipeline if we treat “Message” as a categorical feature

Because AutoMLSearch was told to treat “Message” as a categorical feature this time, each pipeline included a one-hot encoder (rather than a text featurization component). The one-hot encoder encoded the top 10 most frequent “categories” of these texts; however, because each text is unique, this means that 10 unique text messages were encoded while the rest of the messages were dropped. Doing this removed almost all of the information from our data, so our best pipeline could not do much better than our trivial baseline model.
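The information loss is easy to reproduce in plain Python. The sketch below is a simplified stand-in for a top-N one-hot encoder, not EvalML’s own implementation; the helper `top_n_one_hot` and the generated messages are hypothetical.

```python
from collections import Counter

def top_n_one_hot(values, top_n=10):
    """One-hot encode only the top_n most frequent values.

    Values outside the top_n become all-zero rows.
    """
    top = [value for value, _ in Counter(values).most_common(top_n)]
    return [[int(v == category) for category in top] for v in values]

# Hypothetical stand-in for 1,000 unique text messages
messages = [f"message number {i}" for i in range(1000)]
encoded = top_n_one_hot(messages, top_n=10)

informative_rows = sum(1 for row in encoded if any(row))
print(informative_rows, "of", len(encoded), "rows carry any signal")
```

When every value is unique, only 10 rows get a non-zero encoding and the remaining rows are indistinguishable from one another, so the estimator has almost nothing to learn from.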

What’s Next?

In this post, we covered how EvalML can be used to classify text messages as spam or ham (non-spam), and we learned how EvalML can detect and automatically handle text features with the help of Woodwork and the nlp-primitives library. You can learn more about Woodwork and nlp-primitives through their documentation, linked in the resources below. Finally, be sure to check out a blog post our former intern Clara Duffy wrote to learn more about nlp-primitives.

Special thanks to Becca McBrayer for writing the demo which this blog post is based on!