Deep Feature Synthesis (DFS), the core feature generation algorithm behind our Featuretools library, operates on relational data to generate complex, stacked features that are often highly predictive in machine learning models. This relational data is frequently stored as tables in relational database management systems (RDBMSs) like Postgres and MySQL.
Before Featuretools can perform automated feature engineering, it needs to construct an EntitySet, a data structure that encapsulates information about the tables and foreign key relationships in the data. To do that, the data needs to be imported from the database into DataFrame(s).
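To make the manual workflow concrete, here is a minimal sketch of building an EntitySet by hand with the standard Featuretools API, assuming hypothetical DataFrames orders_df and order_products_df exported from two of the database tables (the column names here are illustrative):

import featuretools as ft

# Register each exported DataFrame, declaring its primary key (index) column
es = ft.EntitySet(id="retail")
es.add_dataframe(dataframe_name="orders", dataframe=orders_df, index="order_id")
es.add_dataframe(
    dataframe_name="order_products",
    dataframe=order_products_df,
    index="order_product_id",  # hypothetical primary key column
)

# Every foreign key relationship must be redeclared by hand
es.add_relationship(
    parent_dataframe_name="orders",
    parent_column_name="order_id",
    child_dataframe_name="order_products",
    child_column_name="order_id",
)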
Subtle differences between RDBMSs and connector APIs mean that importing data from an RDBMS can be time-consuming, error-prone work. If you frequently import data from different database systems, you may find yourself with more than a handful of stray, ad hoc scripts lying around, each with its own dependencies and its own connector API. Additionally, relationships between tables may have to be manually redefined in Featuretools each time you experiment with a different dataset. This hampers productivity by adding friction and clutter to the machine learning process.
Our new Featuretools-SQL library offers an API to run automated feature engineering on relational data. Simply pass in the connection information for your RDBMS instance and the Featuretools-SQL library will create the appropriate EntitySet object. From there, you can call DFS and perform feature engineering!
Featuretools-SQL currently supports importing data from MySQL, Postgres, and Snowflake. If you would like to import data from a system not listed here, please file an issue on our GitHub repository.
How Our SQL Add-on Can Help
Before we can get started, we need to install Featuretools with the SQL add-on. We can do this through either pip or conda:
Install with pip:
python -m pip install "featuretools[sql]"
Install with conda:
conda install -c conda-forge featuretools
Let’s walk through a small demo. For the purposes of this blog post, I’ve uploaded some of the sample data available here. The data is already available in DataFrame form at that link, but for now we will assume it was stored in a Postgres database instead.
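As a rough sketch of how that upload might have been done, assuming pandas and SQLAlchemy (and the same hypothetical orders_df and order_products_df DataFrames as before; the other two tables follow the same pattern):

from sqlalchemy import create_engine, text

# Requires a Postgres driver such as psycopg2; assumes local trust
# authentication -- add a password if your instance needs one
engine = create_engine("postgresql://postgres@localhost:5432/postgres")

# Write the DataFrames as tables; index=False keeps pandas' index out of the schema
orders_df.to_sql("orders", engine, index=False, if_exists="replace")
order_products_df.to_sql("order_products", engine, index=False, if_exists="replace")

# Declare the key constraints that Featuretools-SQL will later discover
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE orders ADD PRIMARY KEY (order_id);"))
    conn.execute(text(
        "ALTER TABLE order_products "
        "ADD FOREIGN KEY (order_id) REFERENCES orders (order_id);"
    ))

Listing the relations in psql confirms the tables are in place: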
postgres=# \dt
            List of relations
 Schema |      Name      | Type  |  Owner
--------+----------------+-------+----------
 public | customers      | table | postgres
 public | order_products | table | postgres
 public | orders         | table | postgres
 public | products       | table | postgres
There are four tables. Each table has primary key uniqueness constraints and foreign key dependencies.
The first step in our script will be to connect to our database instance.
from featuretools_sql.connector import DBConnector

# Connection details for the demo Postgres instance
connector_object = DBConnector(
    system_name="postgresql",
    user="postgres",
    host="localhost",
    port="5432",
    database="postgres",
    schema="public",
)
Then we call the get_entityset method, and the DBConnector object handles the rest.
es = connector_object.get_entityset()
There’s our EntitySet. But suppose we want to make sure that all our tables were indeed imported.
list(es.dataframe_dict.keys())
['orders', 'customers', 'products', 'order_products']
Looks good!
Let’s take a quick peek at the products table to make sure everything was imported correctly.
es.dataframe_dict["products"].head(5)
+------------+-------------------------------------+----------------------------+
| product_id | description                         | first_order_products_time  |
|------------+-------------------------------------+----------------------------|
| 85123A     | WHITE HANGING HEART T-LIGHT HOLDER  | 2010-12-01 08:26:00        |
| 71053      | WHITE METAL LANTERN                 | 2010-12-01 08:26:00        |
| 84406B     | CREAM CUPID HEARTS COAT HANGER      | 2010-12-01 08:26:00        |
| 84029G     | KNITTED UNION FLAG HOT WATER BOTTLE | 2010-12-01 08:26:00        |
| 84029E     | RED WOOLLY HOTTIE WHITE HEART.      | 2010-12-01 08:26:00        |
+------------+-------------------------------------+----------------------------+
It looks like the rows in the table were imported without any problems.
Now that we have imported our tables, we can define the foreign key relationships between them. Populating the relationships in an EntitySet is not strictly necessary, but it allows DFS to create aggregation features, a class of features computed across related tables. DFS can combine these features with transformation features to automatically generate complex, stacked features.
Featuretools-SQL automatically creates the relationship data structure by querying the metadata tables in the RDBMS to discover the primary and foreign key constraints for each table in the database.
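As a simplified illustration (not Featuretools-SQL's actual implementation), the kind of catalog query involved looks like this on Postgres, using the standard information_schema views:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres@localhost:5432/postgres")

# Standard information_schema query for single-column foreign keys
FK_QUERY = """
SELECT tc.table_name   AS child_table,
       kcu.column_name AS child_column,
       ccu.table_name  AS parent_table,
       ccu.column_name AS parent_column
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

with engine.connect() as conn:
    for row in conn.execute(text(FK_QUERY)):
        print(f"{row.child_table}.{row.child_column} -> "
              f"{row.parent_table}.{row.parent_column}")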
Let’s make sure the data structure was populated correctly.
es.relationships
[<Relationship: order_products.product_id -> products.product_id>,
 <Relationship: order_products.order_id -> orders.order_id>,
 <Relationship: orders.customer_name -> customers.customer_name>]
We can see the list was populated with all three relationships. Each arrow points from the foreign key column in the child table to the corresponding primary key column in the parent table.
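If a relationship exists in the data but was never declared as a foreign key constraint in the schema, automatic discovery cannot see it. In that case you can fall back on Featuretools' own API; for example, supposing (hypothetically) that order_products.product_id had no constraint:

# Manually declare a relationship that schema discovery missed
es.add_relationship(
    parent_dataframe_name="products",
    parent_column_name="product_id",
    child_dataframe_name="order_products",
    child_column_name="product_id",
)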
We are now ready to pass the es object into Featuretools's DFS algorithm to perform feature engineering on the data from our database.
import featuretools as ft
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="orders")
We can examine the last five elements of the feature definitions.
feature_defs[-5:]
[<Feature: customers.NUM_UNIQUE(order_products.WEEKDAY(order_date))>,
<Feature: customers.NUM_UNIQUE(order_products.YEAR(order_date))>,
<Feature: customers.NUM_UNIQUE(order_products.orders.country)>,
<Feature: customers.NUM_UNIQUE(order_products.orders.customer_name)>,
<Feature: customers.PERCENT_TRUE(order_products.orders.cancelled)>]
What about the feature matrix?
feature_matrix[:5]
+------------+------------------------------------------------------------+---------------------------------------------------------+-------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
| order_id | customers.NUM_UNIQUE(order_products.WEEKDAY(order_date)) | customers.NUM_UNIQUE(order_products.YEAR(order_date)) | customers.NUM_UNIQUE(order_products.orders.country) | customers.NUM_UNIQUE(order_products.orders.customer_name) | customers.PERCENT_TRUE(order_products.orders.cancelled) |
|------------+------------------------------------------------------------+---------------------------------------------------------+-------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------|
| 581583 | 6 | 2 | 1 | 1 | 0.100457 |
| 581584 | 6 | 2 | 1 | 1 | 0.100457 |
| 581585 | 6 | 1 | 1 | 1 | 0.040146 |
| 581586 | 6 | 2 | 1 | 1 | 0.282143 |
| 581587 | 4 | 1 | 1 | 1 | 0 |
+------------+------------------------------------------------------------+---------------------------------------------------------+-------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------+
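The call above used DFS's default primitives. If you want to steer the search, ft.dfs accepts the standard Featuretools options; as a sketch, here is a run restricted to the primitives that appear in the output above:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    agg_primitives=["num_unique", "percent_true"],  # aggregation features
    trans_primitives=["weekday", "year"],           # transformation features
    max_depth=2,                                    # how deeply features may stack
)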
To summarize, we went from a Postgres database to a feature matrix ready for machine learning in three function calls: the DBConnector constructor, the get_entityset call, and the call to DFS.
This means that you can replace complex scripts that import data from your database instance with just a handful of lines of Python – enough to fit neatly into one Jupyter Notebook cell.
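Put together, the whole pipeline from this post fits in a single cell:

import featuretools as ft
from featuretools_sql.connector import DBConnector

# 1. Describe the database connection
connector_object = DBConnector(
    system_name="postgresql",
    user="postgres",
    host="localhost",
    port="5432",
    database="postgres",
    schema="public",
)

# 2. Import the tables and their relationships into an EntitySet
es = connector_object.get_entityset()

# 3. Run Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="orders")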
We hope this library empowers you to dedicate your time and energy to the feature engineering and machine learning process without having to worry about how to import your data. Go ahead and try it out!
Contributions
If you have feedback, ideas for new features, or bug reports, please feel free to reach out on our GitHub repository or our Slack community.
Acknowledgements
A special thanks to my managers, Dave Reed and Gaurav Sheni, for their help and guidance with this project. I would also like to thank the rest of the Features team and Chris Park for their help throughout this project.