Over the past decade, Natural Language Processing (NLP) has emerged as a rapidly growing field that every business wants to dip its toes into. Natural language is defined as any kind of human-interpretable freeform text. This is anything from grocery lists to text messages to book titles.

Humans take for granted the ability to understand our own natural language, but freeform text is not a data format that computers can easily interpret. NLP bridges that gap and gives a computer the ability to understand human language – whether it’s tweets, grocery lists, or scientific papers.

One step in this kind of natural language data analysis is recognizing whether our data qualifies as natural language in the first place. Some textual data, such as phone numbers, street addresses, or hashes, should not be treated as natural language, since it lacks the complex grammatical structures and syntax that make up human language.

Typically, users must manually specify types for columns in their dataset, which can get tedious if their datasets have many columns. Think of a dataset with 1000 columns! Even if you spent only a second on each column, it would take over 15 minutes to label them all.

Type inference aims to automate this labeling of columns by looking at the information in the column, like a human would. Type inference should be able to look at a column and classify it as natural language or not.

In this post, we’ll walk through the steps that we at Alteryx Machine Learning took to build a function that could perform natural language type inference. To do this, we employed three of our open-source libraries:

  • Woodwork to narrow down our training data to columns whose types could not otherwise be inferred.
  • Featuretools to automate the process of generating aggregated input features.
  • EvalML to automate model creation and selection.

Armed with a new natural language type inference function, we were then able to use it to improve those same open-source libraries. Adding the new type inference to Woodwork meant that Featuretools and EvalML, which have natural language-specific capabilities (NLP Primitives), could now employ those capabilities without any manual labeling on the part of the user.

Creating the Inference Function

Now that we know why natural language processing and type inference matter, it's time to delve into how we built a natural language type inference function that can correctly classify columns of data as either “natural language” or “not natural language.” We started by defining the problem and our requirements for it explicitly.

The function should:

  • Accurately determine what is/is not natural language
  • Be performant (it has to be faster than manually labeling)
  • Err on the side of caution (a wrong inference is worse than not inferring at all, as including natural language columns in model-building is computationally intensive and only useful if the column is, indeed, natural language)

Additionally, we needed tests. These helped us determine the accuracy of our inference function and whether it did what we intended it to do. Raymond Peck assembled a multitude of datasets gathered from sources such as Kaggle.

We then filtered for all columns that Woodwork classified as Unknown (since any column whose type is inferable as something else should not be considered a viable natural language candidate), randomly sampled 10,000 rows, and manually labeled each column as natural language or not; a sketch of this filtering step follows the sample below. Here are a few of the 506 columns that make up our training dataset. As an exercise, you could try to label the columns below as either Natural Language or Not Natural Language.

|   | c_420 | c_109 | c_456 | c_197 | c_499 | ... | c_245 | c_290 | c_500 | c_502 | c_68 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cinemax: Andheri (E) | candidates-biography | yangians and their applications , handbook of ... | ai molev | www5 / computer networks , | ... | PAYPAL DATA SERVICES, INC. | Poultry Processor | 13.0 | 107500.0 | Engineering, Business Administration, or relat... |
| 1 | Kasturba Cinema: Malad | taxes | not another inventory , rather a catalyst for ... | jl lumley , ha panofsky | proceedings of acm workshop on formal methods ... | ... | YALE NEW HAVEN HOSPITAL | GM and Head of International Sales & Business ... | 120786.0 | 120,000.00 | CS/Engg, Electronics or related Appl Dvlpt Eng... |
| 2 | Darpana, Bidhan Sarani | social-security | prolonged survival after liver transplantation... | bi carr , r selby , j madariaga , s iwatsuki | preparation , ibm almaden research center , | ... | EOK TECHNOLOGIES INC | Software Developer | 36,990.00 | 107,422.00 | Engineering, Math or Physics job offered or Or... |
| 3 | Konark: Dilsukhnagar | federal-budget,pensions,retirement | reinterpreting the empathy-altruism relationsh... | rb cialdini , sl brown , bp lewis , c luce , sl … | on infrared physics , zurich , swit... | ... | HSBC BANK USA, N.A. | Associate Programmer Analyst | 90995.0 | 185000.0 | Computer Information Systems Consultant/Softwa... |
| 4 | Mahalakshmi Talkies, Strahans Road | candidates-biography | an annotation management system for relational... | l lamport | new directions for teaching and learning , | ... | COGNIZANT TECHNOLOGY SOLUTIONS US CORPORATION | Revenue Insights Analyst | 115,000.00 | 92,602.00 | Computer or Electrical Engineering, or related... |
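Here is a rough sketch of that filtering step (not the exact code we used); it assumes a recent version of Woodwork in which columns that can't be inferred as any other type fall back to the Unknown logical type:

```python
import pandas as pd
import woodwork as ww  # registers the .ww accessor on pandas objects
from woodwork.logical_types import Unknown


def unknown_columns(df: pd.DataFrame, sample_rows: int = 10000) -> pd.DataFrame:
    """Return a sample of the columns Woodwork could not type-infer."""
    df.ww.init()  # run Woodwork's standard type inference
    # Keep only columns whose inferred logical type is Unknown
    unknown = [
        name
        for name, ltype in df.ww.logical_types.items()
        if isinstance(ltype, Unknown)
    ]
    subset = df[unknown]
    # Sample up to 10,000 rows for manual labeling
    return subset.sample(n=min(sample_rows, len(subset)), random_state=0)
```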

Creating a Binary Classifier for Natural Language

For those familiar with machine learning, the problem as we've defined it above is quickly recognizable as a binary classification problem: one in which we try to classify objects (in this case, entire columns of data) into one of two classes (containing natural language or not). To build a binary classification model that could become our natural language type inference function, we needed to treat it like a machine learning problem. That means creating input features and picking a classifier.

Because binary classification is such a well-known machine learning problem, there are many classifiers to choose from. To streamline picking the best one for our problem, we used EvalML to automate model selection. That left us with the problem of generating useful input features from the unclassified column whose type we are trying to infer.
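Once we have a feature matrix X (one row per candidate column) and hand labels y, a minimal EvalML search might look like the following sketch:

```python
from evalml.automl import AutoMLSearch

# X: one row of aggregated features per candidate column
# y: 1 if the column was hand-labeled as natural language, 0 otherwise
automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary")
automl.search()                       # tries a variety of binary classification pipelines
best_pipeline = automl.best_pipeline  # highest-ranked pipeline found by the search
print(automl.rankings.head())         # leaderboard of the candidate pipelines
```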

Feature Engineering

The goal of feature engineering for classifiers is to create features that distinguish the target class (natural language) from the other classes (all other types). Since the inference function needs to be performant, it's also desirable to use features that don't take too long to generate. After much thought, we came up with an array of features we thought would be useful in distinguishing natural language from other column types. Since we're working with raw strings, these are mostly simple heuristics we can calculate per value.

Here’s what we came up with:

  • Total string length
  • Number of words
  • Percentage of whitespace separators
  • Presence of other common separators (commas, periods, exclamation points, question marks)
  • Number of unique separators
  • Average number of characters per word
  • Number of words found in a bank of the 10/100/1000… most common English words
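To make these concrete, here's a sketch of how a few of the per-row heuristics could be computed with plain Python and pandas (the helper name and the tiny word bank are ours, purely for illustration):

```python
import pandas as pd

# Placeholder word bank; a real one would hold the 10/100/1000 most common English words
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "it"}
SEPARATORS = [",", ".", "!", "?"]


def row_heuristics(text: str) -> dict:
    """Compute the per-row heuristics for a single cell of a column."""
    text = str(text)
    words = text.split()
    return {
        "total_length": len(text),
        "num_words": len(words),
        "whitespace_pct": text.count(" ") / len(text) if text else 0.0,
        "other_separator_count": sum(text.count(sep) for sep in SEPARATORS),
        "unique_separator_count": sum(1 for sep in SEPARATORS if sep in text),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_common_words": sum(1 for w in words if w.lower() in COMMON_WORDS),
    }


# `column` is a pandas Series holding the raw values of the column being inferred
per_row = pd.DataFrame([row_heuristics(value) for value in column])
```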

Each of these heuristics describes a single row of a column. Since we're trying to classify the column as a whole, we need a way of aggregating this information from each of the rows down to a single feature value. To do this, we used our feature generation library, Featuretools, which allowed us to easily apply multiple aggregations, such as mean, median, max, standard deviation, and skew.
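Conceptually, that aggregation step looks something like the sketch below, which assumes the Featuretools 1.x API and a dataframe per_row_features holding one row of heuristics per cell, keyed by the column it came from:

```python
import featuretools as ft

# per_row_features: one row per cell, with a "column_name" key plus the heuristic columns above
es = ft.EntitySet(id="nl_inference")
es = es.add_dataframe(
    dataframe_name="data",
    dataframe=per_row_features,
    index="row_id",
    make_index=True,
)
# Create a parent dataframe with one row per original column
es = es.normalize_dataframe(
    base_dataframe_name="data",
    new_dataframe_name="columns",
    index="column_name",
)
# Aggregate every row-level heuristic up to the column level
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="columns",
    agg_primitives=["mean", "median", "max", "std", "skew"],
    trans_primitives=[],
)
```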

Voila! Now we have tons of features that are both human-interpretable and that our model can learn from.

| column_name | COUNT(data) | MAX(data.NL_UNIQUE_DELIMITER_COUNT(data)) | MAX(data.OTHER_DELIMITER_COUNT(data)) | MAX(data.WHITESPACE_COUNT(data)) | MEAN(data.NL_UNIQUE_DELIMITER_COUNT(data)) | ... | STD(data.NUM_COMMON_WORDS(data)) | STD(data.TOTAL_LENGTH(data)) | SUM(data.MEAN_WORD_LENGTH(data)) | SUM(data.NUM_COMMON_WORDS(data)) | SUM(data.TOTAL_LENGTH(data)) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| c_459 | 2554 | 3 | 8 | 30 | 1.608457 | ... | 1.530754 | 17.500624 | 12769.547275 | 4502 | 116544 |
| c_315 | 2390 | 3 | 5 | 3 | 0.317992 | ... | 0.149314 | 2.672392 | 14341.800000 | 42 | 18045 |
| c_419 | 10000 | 0 | 0 | 0 | 0.000000 | ... | 0.000000 | 0.000000 | 100000.000000 | 0 | 100000 |
| c_418 | 10000 | 6 | 59 | 36 | 2.251500 | ... | 4.931030 | 27.693656 | 42011.192778 | 77453 | 532141 |
| c_386 | 10000 | 6 | 59 | 36 | 1.490600 | ... | 4.666425 | 27.608222 | 44778.795966 | 42286 | 294903 |

Model Optimization Objective

Although EvalML handles model training and selection, we still get to choose which objective to optimize for. In this case, we were looking to optimize for:

  • Precision – fewer false positives.
  • Recall – fewer false negatives.

We value precision more highly than recall here, because false positives (incorrectly classifying a column as natural language) are worse than false negatives (missing a natural language column completely). It is simpler for a user to verify that the relatively few columns labeled natural language really are natural language than it is to comb through all the remaining columns for one that was missed.

In performing this optimization, we did not have to choose between the two metrics. Optimizing for f1 score, which combines precision and recall, produced precision very similar to models optimized for precision alone, but with significantly higher recall. Therefore, we optimized for f1 score.
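EvalML also made it easy to compare a candidate pipeline on all three metrics at once. Here's a rough sketch (X_holdout and y_holdout are placeholder names for a holdout split, and recall that f1 = 2 * precision * recall / (precision + recall)):

```python
# Score the chosen pipeline against a holdout set on all three objectives at once
scores = best_pipeline.score(
    X_holdout,
    y_holdout,
    objectives=["precision", "recall", "f1"],
)
print(scores)  # maps each objective name to its score on the holdout data
```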

| Modeling Run | precision | recall | f1 |
|---|---|---|---|
| precision_3 | 1.000000 | 0.886364 | 0.886364 |
| 0.01_max_depth_2 | 0.975610 | 0.909091 | 0.844181 |
| 0.03_max_depth_2 | 0.975610 | 0.909091 | 0.844181 |
| 3_max_depth_2 | 0.975610 | 0.909091 | 0.844181 |

Feature Selection

The initial model we trained using all these differently aggregated features produced useful results. But we wanted to optimize the model both for compute time and for accuracy: including more input features increases compute time and can even cause the model to perform worse in some cases.

Therefore, we went through the process of performing feature selection, removing features that weren’t having an impact on our model.

This can be done by looking at the feature’s permutation importance. Permutation importance explains how important individual features are to a model’s performance, and it is calculated by looking at any changes in model performance that come from randomly shuffling the feature values.

  • If randomly shuffling the feature improves the model’s performance, then the presence of the feature in its un-shuffled state is making the model worse, and removing it will improve model performance. This would be seen as a negative feature importance value.
  • If randomly shuffling the feature decreases the model performance, then that feature is important to the performance of the model and should be kept.
  • If random shuffling doesn’t change the model’s performance, then that individual feature is not having any impact on the model. However, we shouldn’t be too quick to remove every unimportant column: two identical features will each show zero permutation importance, since the model can use them interchangeably, but if we drop one of them, the other suddenly becomes very important.

Thus, we drop all features with negative permutation importance and, for any interchangeable pair with zero importance, only one of the two. We then retrain the model and repeat the process until only important features are left.
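As a rough sketch of that loop, here's a generic version using scikit-learn's permutation_importance and a stand-in classifier rather than our actual EvalML pipeline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


def prune_features(X, y, n_repeats=5, random_state=0):
    """Iteratively drop features whose permutation importance is negative or zero."""
    X = X.copy()
    while True:
        model = RandomForestClassifier(random_state=random_state).fit(X, y)
        result = permutation_importance(
            model, X, y, scoring="f1", n_repeats=n_repeats, random_state=random_state
        )
        importances = dict(zip(X.columns, result.importances_mean))
        negative = [col for col, imp in importances.items() if imp < 0]
        zero = [col for col, imp in importances.items() if imp == 0]
        if not negative and not zero:
            return X  # only features that matter are left
        # Drop everything that hurts the model, but only one zero-importance feature
        # at a time, since interchangeable pairs each look unimportant on their own
        X = X.drop(columns=negative + zero[:1])
```

In practice, the importances would be computed on a validation split rather than on the training data itself.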

Note: It’s quite expensive to retrain a model this many times, but spending the time upfront to optimize our model allows us to deploy it with more confidence later on.

Fast Model Deployment

Once we had a classifier, we could store and deploy the model and pass it any new column that needed to be classified, getting our type inference results. But if we could instead distill the classifier’s decisions into a decision tree, and that tree into “if” statements, we could reduce the complexity of our inference function and make incorporating it as simple as adding a few “if” statements.

Simply plotting the tree can show how effective each “if” statement is.
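One way to build and inspect such a tree is to fit a shallow scikit-learn decision tree on the same feature matrix and labels and plot it; a sketch (X and y are the aggregated features and hand labels from earlier, and the depth is just an example):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# A shallow tree keeps the distilled logic down to a handful of "if" statements
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

plt.figure(figsize=(14, 6))
plot_tree(
    tree_clf,
    feature_names=list(X.columns),
    class_names=["not natural language", "natural language"],
    filled=True,
)
plt.show()
```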

Decision Tree – Blue leaves of the tree label a column as containing Natural Language

If we look at the node in the green box, all of the decisions on its right side (the yellow box) declare a column as natural language. Therefore, we can traverse the right side of the decision tree and choose to combine all the statements. Here’s that logic from train_model.py:

def infer_rightside_of_tree(pipeline, X, max_depth=1, verbose=False):
    # Pull the fitted sklearn tree and its input feature names out of the EvalML pipeline
    tree = pipeline.get_component(pipeline._estimator_name)._component_obj.tree_
    features = pipeline.input_feature_names[pipeline._estimator_name]
    masks = []
    nodeid = 0
    # Walk down the right side of the tree, collecting one "feature > threshold"
    # condition per level
    for i in range(max_depth):
        feature = features[tree.feature[nodeid]]
        threshold = tree.threshold[nodeid]
        if verbose:
            print("{} > {}".format(feature, threshold))
        masks.append(X[feature] > threshold)
        nodeid = tree.children_right[nodeid]

    # Combine the conditions: a column is flagged only if it satisfies all of them
    m = masks.pop()
    while masks:
        n = masks.pop()
        m = m & n
    return m
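For instance, a call like infer_rightside_of_tree(best_pipeline, feature_matrix, max_depth=1, verbose=True), with best_pipeline and feature_matrix being the fitted pipeline and its input features, prints the root decision as "feature > threshold" and returns a boolean mask marking which columns the right side of the tree would label as natural language.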

Summary

The best set of “if” statements from the best decision tree reduced to a single condition, ‘mean(number_1000_common_words)’ > 1.14 (the green box above), which gives us over 95% precision and somewhere between 85% and 90% recall.

To rephrase this in our own natural language:

  • If the column has on average more than one common English word per row, then it is most likely a natural language column.

This is a simple, fast, and accurate solution that is easily transferable to actionable Python code. We were able to immediately add this logic to Woodwork, giving it effective and efficient natural language type inference and, through it, giving that inference to both Featuretools and EvalML.
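As a rough illustration of what that actionable code can look like (a sketch, not Woodwork's actual implementation), the entire check fits in a few lines of Python:

```python
import pandas as pd

# Placeholder word bank; the real check uses the 1,000 most common English words
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "it"}


def looks_like_natural_language(column: pd.Series, threshold: float = 1.14) -> bool:
    """Label a column as natural language if, on average, each row contains
    more than `threshold` common English words."""
    def common_word_count(value: object) -> int:
        return sum(1 for word in str(value).lower().split() if word in COMMON_WORDS)

    return column.dropna().map(common_word_count).mean() > threshold
```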

Although this conclusion seems very simple, a lot of analysis went into figuring out that this was the best way to classify natural language given our criteria; without that analysis, we had no way of knowing whether any of the other features we created would have been better. Additionally, if we ever think of other features that could work better, we now have a framework (and the code) to determine whether they do.

Next Steps

There are a couple of things we can do to improve upon this experiment:

  1. Create more features! Maybe only use the top 10 or 100 most common English words?
  2. Expand to other languages beyond English.
  3. Find a better way to automate feature selection. Hopefully, we can come up with something more efficient. Can we optimize not only for precision and recall but also for how expensive it is to calculate the feature?
  4. Use all the models. If model deployment and feature interpretability are not issues to worry about, what’s the best model we can create?
  5. Train on more data. The more diverse our natural language dataset is, the more accurate our model will be; much of our training data looked similar.

Contributions

If you have ideas for enhancing or improving Woodwork, open-source contributions are welcome. To get started, check out the Woodwork Contributing Guide on GitHub.

Special thanks to Ethan Tu who worked on this project from inception to design to implementation.