A normalized, relational dataset makes feature engineering easier. Unfortunately, raw data for machine learning often arrives as a single denormalized table, and normalizing it by hand is tedious and time-consuming.

Well, I am happy to introduce you to AutoNormalize, an open-source library that automates the normalization process and integrates seamlessly with Featuretools, another open-source library for automated feature engineering. AutoNormalize detects relationships between columns in your data, then normalizes the dataset accordingly. The normalized dataset can then be returned as either an EntitySet or a collection of DataFrames.

Using AutoNormalize makes it easier to get started with Featuretools and can give you a quick preview of what Featuretools is capable of. AutoNormalize is also useful for table normalization on its own, especially in situations where the normalization process is not intuitive.

A Machine Learning Demo Using AutoNormalize

Let’s take a quick look at how AutoNormalize easily integrates with Featuretools and makes automated feature engineering more accessible. To do this, we will walk through a machine learning example with a dataset of customer transactions, and we will predict, one hour in advance, whether customers will spend over $1,200 within the next hour. In this demo, we will use another open source library, Compose, to generate our labels. Below is the pandas DataFrame for our transactions.

transaction_id session_id transaction_time product_id amount customer_id device session_start zip_code join_date date_of_birth brand
0 9202 1 2014-01-01 00:00:00 36 81.46 8 desktop 2014-01-01 13244 2013-05-07 20:59:29 1973-07-28 A
1 5776 1 2014-01-01 00:01:05 36 107.59 8 desktop 2014-01-01 13244 2013-05-07 20:59:29 1973-07-28 A
2 1709 1 2014-01-01 00:32:30 36 57.57 8 desktop 2014-01-01 13244 2013-05-07 20:59:29 1973-07-28 A
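The sample above resembles the mock customer transactions dataset that ships with Featuretools; a comparable single-table DataFrame can be loaded as a starting point with the snippet below (the size parameters are illustrative assumptions and may not match the exact data used in this demo).

import featuretools as ft

# Load a denormalized, single-table transactions DataFrame similar to the
# one shown above. The size parameters below are illustrative assumptions.
transaction_df = ft.demo.load_mock_customer(n_customers=80,
                                            n_products=50,
                                            n_sessions=200,
                                            n_transactions=10000,
                                            return_single_table=True)
transaction_df.head(3)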

Generating Labels

First, we are going to generate labels using Compose. For more detail, please refer to the docs here.

To begin, we create our labeling function and use Compose to create a label maker, which we use to extract labels. We transform these labels to have a threshold of $1,200, and shift cut-off times an hour earlier to predict in advance. This gives us the labels that we will use to train our model.

import composeml as cp

def total_spent(df_slice):
    label = df_slice["amount"].sum()
    return label

label_maker = cp.LabelMaker(target_entity='customer_id',
                            time_index='transaction_time',
                            labeling_function=total_spent,
                            window_size='1h')

labels = label_maker.search(transaction_df,
                            minimum_data='2h',
                            num_examples_per_instance=50,
                            gap='2min')

labels = labels.threshold(1200)
labels = labels.apply_lead('1h')
labels.head(4)

customer_id cutoff_time total_spent
label_id
0 1 2014-01-01 05:30:50 True
1 1 2014-01-03 17:35:10 True
2 1 2014-01-03 17:37:20 True
3 1 2014-01-03 17:39:30 True

Using AutoNormalize to Create an EntitySet

Our next step is to create an EntitySet, which we will use to generate features. In order to take full advantage of Featuretools, we want our EntitySet to be normalized. This is where AutoNormalize comes in.

Ordinarily, we would need to go through our data, decipher the relationships between different columns, and then decide how to factor out duplicated data. Not only is this unnecessarily time-consuming, it’s also error-prone, especially in situations where we don’t understand our data thoroughly or when columns are not intuitively named.

After deciding how to normalize, we would also need to create the new DataFrames and make sure the data is deduplicated correctly. AutoNormalize takes care of all of this, splitting DataFrames and deduplicating the data. This is particularly significant when entries are not 100% accurate, as AutoNormalize retains the dominant entries and saves you from having to manually decide which data to keep and which to drop.

AutoNormalize does all of this with just one simple function call. In order to normalize our DataFrame, all we need to do is call auto_entityset(df, accuracy=0.98, index=None, name=None, time_index=None), which returns a normalized EntitySet.

from autonormalize import autonormalize as an  # import path as shown in the ReadMe

es = an.auto_entityset(transaction_df, accuracy=1,
                       name="transactions",
                       time_index='transaction_time')
print(es)
es.plot()
Entityset: transactions
  Entities:
    transaction_id [Rows: 10000, Columns: 5]
    product_id [Rows: 50, Columns: 2]
    session_id [Rows: 200, Columns: 4]
    customer_id [Rows: 75, Columns: 4]
  Relationships:
    transaction_id.session_id -> session_id.session_id
    transaction_id.product_id -> product_id.product_id
    session_id.customer_id -> customer_id.customer_id

As you can see, AutoNormalize successfully split up our transaction DataFrame into separate product, session, and customer tables. We can now use this EntitySet to calculate features and train our model.

Automated Feature Engineering with Featuretools

To create features, we call dfs() on our EntitySet. Below, you can see the first twenty of the seventy-three features that were created.

feature_matrix, features_defs = ft.dfs(
    entityset=es,
    target_entity='customer_id',
    cutoff_time=labels,
    cutoff_time_in_index=True,
    verbose=True,
)
features_defs[:20]
[<Feature: zip_code>,
 <Feature: COUNT(session_id)>,
 <Feature: NUM_UNIQUE(session_id.device)>,
 <Feature: MODE(session_id.device)>,
 <Feature: SUM(transaction_id.amount)>,
 <Feature: STD(transaction_id.amount)>,
 <Feature: MAX(transaction_id.amount)>,
 <Feature: SKEW(transaction_id.amount)>,
 <Feature: MIN(transaction_id.amount)>,
 <Feature: MEAN(transaction_id.amount)>,
 <Feature: COUNT(transaction_id)>,
 <Feature: NUM_UNIQUE(transaction_id.product_id)>,
 <Feature: MODE(transaction_id.product_id)>,
 <Feature: DAY(join_date)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: YEAR(join_date)>,
 <Feature: YEAR(date_of_birth)>,
 <Feature: MONTH(join_date)>,
 <Feature: MONTH(date_of_birth)>,
 <Feature: WEEKDAY(join_date)>]

Training and Testing our Model

Now, we preprocess the features and split our labels and features into training and testing sets. We train a random forest classifier on the training set, then test its performance by evaluating its predictions against the test set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

y = feature_matrix.pop(labels.name)
x = feature_matrix.fillna(0)
x, features_enc = ft.encode_features(x, features_defs)

x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    train_size=.8,
    test_size=.2,
    random_state=0,
)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(x_train, y_train)

y_hat = clf.predict(x_test)
print(classification_report(y_test, y_hat))
              precision    recall  f1-score   support

       False       0.67      0.06      0.11       129
        True       0.71      0.99      0.83       304

    accuracy                           0.71       433
   macro avg       0.69      0.52      0.47       433
weighted avg       0.70      0.71      0.61       433

Below, you can see a plot depicting which features are considered important for predictions.

As you can see, many of the important features, such as STD(session_id.MAX(transaction_id.amount)), rely on the relational feature calculations of Featuretools. These features would not be attainable without a normalized EntitySet.
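While the plot itself is not reproduced here, a minimal sketch of how such a feature-importance chart can be produced from the trained random forest is shown below; it assumes the x_train columns and clf from the training step above.

import pandas as pd

# Rank encoded features by the random forest's impurity-based importances
# and plot the top entries as a horizontal bar chart.
importances = pd.Series(clf.feature_importances_, index=x_train.columns)
importances.nlargest(15).sort_values().plot(kind='barh', figsize=(8, 6))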

How AutoNormalize Works

The normalization process AutoNormalize performs can be split into two main steps: discovering dependencies in the data, and breaking up the table according to those discovered dependencies. Each step can also be executed individually if the user desires.

Discovering Dependencies

The first step AutoNormalize takes is finding the functional dependencies between columns, which it does using the DFD algorithm. When determining dependencies, AutoNormalize applies an accuracy threshold that can be set by the user; this threshold determines what percentage of the data must support a dependency in order for it to be accepted. This process outputs a Dependencies object, which the user can alter if desired. To see an example of a situation where this is necessary, check out this demo.

Normalizing the Dataset

The next step is normalizing the DataFrame according to the discovered dependencies. AutoNormalize normalizes from 1st normal form to 3rd normal form. When creating new tables, AutoNormalize chooses the new primary key according to this priority:
1. Fewest attributes
2. “ID” present in the name of an attribute
3. Contains the attribute positioned farthest to the left in the original DataFrame
When the data is not 100% accurate, AutoNormalize automatically retains the dominant table entries and drops the anomalous ones. If normalizing to an EntitySet, a new index column is created when a primary key has more than one attribute. This process can output either an EntitySet or a collection of DataFrames, as the user prefers.
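For instance, here is a minimal sketch of getting the normalized tables back as plain pandas DataFrames rather than an EntitySet, assuming the auto_normalize helper listed in the ReadMe.

# Normalize the single table into a collection of DataFrames
# (auto_normalize is assumed here as described in the ReadMe).
normalized_dfs = an.auto_normalize(transaction_df)

for df in normalized_dfs:
    print(df.columns.tolist(), len(df))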

Pipelined Together

When auto_entityset() is called, these two steps are pipelined together. However, both of these steps can be executed separately; please refer to the ReadMe for more details.
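As a rough sketch, running the steps separately might look like the following, assuming the find_dependencies and make_entityset functions listed in the ReadMe.

# Step 1: discover functional dependencies (returns a Dependencies object,
# which can be inspected or edited before normalizing).
deps = an.find_dependencies(transaction_df, accuracy=0.98)

# Step 2: normalize the DataFrame according to those dependencies
# and build the EntitySet.
es = an.make_entityset(transaction_df, deps,
                       name="transactions",
                       time_index="transaction_time")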

In Conclusion

AutoNormalize makes it easier to create and work with normalized EntitySets and thus create important features. Its normalization capabilities can also be applied to many other situations, as the normalized data can also be returned as a collection of pandas DataFrames.

We welcome you to give AutoNormalize a try and leave feedback on GitHub. For additional resources and installation instructions, take a look at the ReadMe and additional demos linked below.


Demos: