Since releasing Featuretools three years ago, we’ve strived to create tools that make machine learning easier to use. As part of our goal to help everyone solve impactful problems, we’re building innovative open source tools for each step of the machine learning pipeline.
In this article, we’ll introduce Compose, a new open source library that automates prediction engineering.
What is Prediction Engineering?
Prediction engineering is the process of generating training examples from historical data to train a machine learning model.
When data scientists want to solve a prediction problem, they typically start by writing a labeling script specific to that problem, which finds the outcome they want to predict in historical data. A labeling script contains many lines of code for processing raw data, along with any constraints needed to deal with challenges as they arise. One common challenge occurs when the training examples skew towards a specific group, which requires additional code to limit the number of training examples per group.
Eventually, labeling scripts become increasingly tied to the specific details of an individual prediction problem. They are not reusable for other prediction problems, even with the same dataset, and cannot adapt to the changing parameters in a dynamic environment.
As a result, data scientists spend a significant amount of time and effort on the tedious process of developing scripts that find training examples for machine learning models.
By eliminating the bottleneck of writing extensive scripts, data scientists can substantially increase their productivity in training machine learning models. They can also explore variations of their prediction problem, which is essential for homing in on the formulation that best solves the business need.
Here at the Alteryx Innovation Labs, we’re working on solutions that make machine learning tasks easier to define and solve. In our effort to structure the labeling process, we’ve released Compose — one of the first libraries to generate training examples automatically by using labeling functions.
What is Compose?
Compose is a machine learning tool to automate prediction engineering. If you have time-driven data, Compose can help you structure prediction problems to generate training examples for supervised learning. An end-user defines an outcome of interest by writing a labeling function, then runs a search to extract training examples from historical data.
By structuring the labeling process with Compose, data scientists can take advantage of many powerful benefits to improve their workflow.
- They can efficiently generate training examples for many prediction problems, even across different domains.
- Their raw data is processed automatically with constraints that can be quickly applied and removed.
- The details of their prediction problem become adjustable parameters of the labeling script rather than fixed values. By changing the parameters, they can generate variations of the prediction problem from the same dataset, which is crucial for exploring different scenarios and making better decisions.
Considering a retail example
Predict if a customer will make a purchase in the grocery department next month
Suppose the owner of a retail store wants to predict whether customers will make purchases in the grocery department next month. Data scientists will typically write a labeling script that scans the customers’ historical transactions to identify instances where grocery purchases were (and were not) made. These instances are positive and negative examples for training a machine learning model.
Writing such a labeling script is an extensive process that entails many lines of code to deal with the various challenges of working with raw data (a rough sketch follows the list below). For example:
- Scanning through historical transactions requires grouping the transactions per customer, then filtering the data into consecutive monthly transaction periods.
- One customer might shop in the grocery department more frequently than the rest, so the training examples skew towards that single customer. In these cases, data scientists append additional lines of code to limit the number of training examples per customer and reduce the bias from learning disproportionately across customers.
- Even once the proper number of training examples is selected, more code may still be needed to enforce other constraints, such as separating consecutive training examples by at least two months to balance across time.
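To make the tedium concrete, here is a rough sketch of what such a hand-written labeling script might look like. The pandas code below is a hypothetical illustration (the function name and skew-limiting logic are our own assumptions), not code from Compose or from the original script:

```python
import pandas as pd

def label_grocery_purchases(transactions, max_examples_per_customer=3):
    """Label each customer-month with whether a grocery purchase was made."""
    transactions = transactions.sort_values('transaction_time')
    transactions['month'] = transactions['transaction_time'].dt.to_period('M')

    rows = []
    for customer_id, customer_df in transactions.groupby('customer_id'):
        examples = []
        for month, month_df in customer_df.groupby('month'):
            made_purchase = month_df['department'].eq('grocery').any()
            examples.append((customer_id, month.to_timestamp(), made_purchase))
        # Extra code just to limit examples per customer and reduce skew.
        rows.extend(examples[:max_examples_per_customer])

    return pd.DataFrame(rows, columns=['customer_id', 'time', 'made_purchase'])
```

Even this sketch hard-codes the department and ignores constraints like the two-month gap between examples; every new requirement means more hand-written code tied to this one prediction problem.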
As the labeling script continues to grow, it becomes increasingly challenging to generate new training examples for different departments or transaction periods. If the store owner now wants to make predictions for the electronics department next week, the script cannot quickly adapt to the new prediction problem.
Using Compose to find training examples
Let’s see how we can use Compose for the retail prediction problem. We will make a labeling script that finds training examples from customers’ historical transactions.
transaction_id | transaction_time | amount | department | customer_id |
---|---|---|---|---|
309 | 2020-06-01 19:36:53 | $26.69 | games | 5 |
428 | 2020-07-01 08:21:26 | $58.68 | clothing | 198 |
60 | 2020-07-01 15:27:06 | $25.11 | clothing | 376 |
762 | 2020-07-01 18:33:42 | $35.03 | clothing | 655 |
56 | 2020-09-01 12:54:27 | $17.89 | electronics | 60 |
The first step is structuring the prediction problem by using a labeling function and a label maker. A labeling function is a user-defined function that receives a slice of the training data as input along with any parameters, then returns an associated label as the output. The label maker is the search algorithm that automatically processes the raw data, applies any constraints, and uses our labeling function to extract the training examples.
For this specific example, we define a labeling function to check whether any transactions occurred in a given department. The name of the department is a parameter of the labeling function, so we can easily create variations of the prediction problem for different departments. We also configure the label maker parameters to apply our labeling function over monthly transaction periods per customer. The label maker parameters are described below:
Parameter | Description |
---|---|
target_entity | The target entity is the customer ID since we want to process transactions for each customer. |
time_index | The time index is the transaction time. The transaction periods follow this time index to filter out data. |
labeling_function | The labeling function is the function we define that checks for purchases in a given department. |
window_size | The window size is the length of a transaction period. We set this parameter value to one month. |
```python
import composeml as cp

def made_purchase(df, department):
    return df['department'].eq(department).any()

lm = cp.LabelMaker(
    target_entity='customer_id',
    time_index='transaction_time',
    labeling_function=made_purchase,
    window_size='1MS',  # one month for the transaction period
)
```
Now, we can run a search to generate training examples. During the search process, the label maker automatically processes the raw data and applies our configured constraints. We reduce the bias from learning disproportionately by limiting the search to 3 training examples per customer. We also separate consecutive training examples by two months to balance across time. The search parameters are described below:
Parameter | Description |
---|---|
df | The data frame is the table of transactions sorted by the transaction time. |
minimum_data | The minimum data is the time when we want to start the first transaction period. |
num_examples_per_instance | The number of examples per instance is how many training examples to find per customer. |
gap | The gap is the period between training examples to balance across time. |
department | The department is the department parameter to our labeling function. |
```python
lt = lm.search(
    df=df.sort_values('transaction_time'),
    minimum_data='2020-07-01',
    num_examples_per_instance=3,
    gap='2MS',  # two months to balance across time
    department='grocery',
)
```
The output from the search is a label times table with three columns:
- The first column is the customer ID associated with the training example.
- The second column is the start time of the transaction period, also known as the cutoff time for building features. Only data that existed beforehand is valid to use for predictions. Otherwise, there is data leakage in model training.
- The third column is the value calculated by our labeling function.
customer_id | time | made_purchase |
---|---|---|
293 | 2020-09-01 | TRUE |
386 | 2020-09-01 | FALSE |
571 | 2020-09-01 | TRUE |
655 | 2020-07-01 | FALSE |
885 | 2020-09-01 | TRUE |
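Before using these labels, it helps to sanity-check how they are distributed. The label times object in Compose provides summary and plotting helpers for this; the calls below are a sketch based on our reading of the composeml documentation and may differ across versions:

```python
# Summarize the label distribution along with the search settings
# that produced it (assumes composeml's LabelTimes helpers).
lt.describe()

# Plot how the labels are distributed across the training examples.
lt.plot.distribution()
```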
We now have training examples for the retail prediction problem. If the environment changes, the script can quickly adapt to variations of the prediction problem. For instance, if the store owner now wants to make predictions for the electronics department next week, then we can easily change the window size and department parameters.
```python
lm.window_size = '7d'
lt = lm.search(..., department='electronics')
```
Next Steps
At this point, we have a dynamic labeling script in a few lines of code that generates training examples by scanning the customers’ historical transactions to identify instances where purchases were (and were not) made in a given department. The script can quickly apply constraints that balance training examples across time and reduce bias. It can also quickly adapt to variations of the prediction problem.
The generated training examples are now ready to use in Featuretools for automated feature engineering and EvalML for automated machine learning. Since Compose easily integrates with Featuretools and EvalML, we can automate the entire machine learning workflow.
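As a rough sketch of that end-to-end flow, the label times can be passed to Featuretools as cutoff times (extra columns such as made_purchase are carried through into the feature matrix) and the result fed into EvalML. The code below uses the pre-1.0 Featuretools interface and illustrative entity names, so the exact argument names may need adjusting for newer library versions:

```python
import featuretools as ft
from evalml import AutoMLSearch

# Build an entity set from the raw transactions (Featuretools 0.x API;
# 1.0+ renames these methods to add_dataframe / normalize_dataframe).
es = ft.EntitySet(id='customer_transactions')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=df,
    index='transaction_id',
    time_index='transaction_time',
)
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='customers',
    index='customer_id',
)

# Automated feature engineering: compute features for each customer
# as of each cutoff time in the label times table.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='customers',  # target_dataframe_name in Featuretools 1.0+
    cutoff_time=lt,
)

# Automated machine learning: separate the label and search for the best pipeline.
y = feature_matrix.pop('made_purchase')
automl = AutoMLSearch(X_train=feature_matrix, y_train=y, problem_type='binary')
automl.search()
```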
Although we only considered a retail example in this article, Compose works in many other use cases like predictive maintenance, customer churn, and transactional fraud. For more details on how to use Compose for building AutoML applications, take a look at the tutorials in the documentation.