QAFS - Quality Aware Feature Store
Just released QAFS, another machine learning engineering tool to help build and maintain ML products. Building machine learning models is becoming easier every year: calling .fit on a toy dataset and looking at the metrics is great, but once we move to the production side the fun is almost over. How do I know the metrics are not killing our company? And what about serving bad recommendations for four months? Honestly, I do not know, but we can try to minimize the risks using a Feature Store + Data Quality Checks + Monitoring stack.
The main idea is that ML teams keep their data pipelines simple while still being able to scale; for this reason QAFS was built to use existing infrastructure, e.g., run on a local machine and scale out to a Dask or Spark (not supported yet) cluster.
Feature Store
Using the same features at training and serving time is by far the main advantage of a feature store: this way we know the models are consuming the same features they were trained on. Other characteristics are sharing features across projects/teams, feature versioning, querying and processing data by datetime, and metadata tagging.
Feast is a popular feature store; however, deploying and maintaining Feast can be painful. See this overview for more details about the feature store concept.
Data Quality Check
Code bugs are hard to find and fix, but once you meet data bugs the definition of "hard" moves to another level. Code testing and documentation are standard software engineering tasks in any project, but an ML project is basically code + data, which makes it even harder for engineers to put these projects into production. What "correct data" means also changes over time depending on the application: a validation that held yesterday can be invalid for tomorrow's data. By the way, an interesting paper, Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, shows that bad data can effectively reduce model capacity by 3 times (if you like papers 😉).
So QAFS couples the feature store and data quality check concepts by integrating pandera, a dataframe validation library, when registering features. This way we can easily build and extend the data validations to any kind, e.g., performing data distribution checks (data/model drift).
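As a rough illustration of how such a distribution check could be expressed with pandera (the column type, bounds, and error message below are made up for this sketch; this is plain pandera, not a qafs API):

import pandera as pa
from pandera import Check, Column

# Illustrative drift check: besides a per-value rule, validate that the
# batch mean stays inside bounds observed at training time (bounds invented).
drift_check = Column(pa.Float, [
    Check.greater_than(0),
    Check(lambda s: 0.5 < s.mean() < 1.5, error='possible data drift'),
])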
Install
Install the Python package through pip:
$ pip install qafs
Below is an example of using qafs where we'll create a feature store and register a numbers feature and a squared feature transformation. First we need to import the packages and create the feature store; for this example we are using a SQLite database and persisting the features in the filesystem:
import qafs
import pandas as pd
import pandera as pa
from pandera import Check, Column

fs = qafs.FeatureStore(
    connection_string='sqlite:///test.sqlite',
    url='/tmp/featurestore/example'
)
Features can be stored in namespaces, which help organize the data. When creating numbers we specify the feature as example/numbers to point the feature numbers at the namespace example; we could use the arguments name='numbers', namespace='example' as well. Then we must specify the data validation using pandera; in this case the feature is an integer and the values should be greater than 0:
fs.create_namespace('example', description='Example datasets')
fs.create_feature(
    'example/numbers',
    description='Timeseries of numbers',
    check=Column(pa.Int, Check.greater_than(0))
)
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'numbers': list(range(1, len(dts) + 1))})
fs.save_dataframe(df, name='numbers', namespace='example')
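Since the registered check requires values greater than 0, saving invalid data fails validation. A minimal sketch of what that looks like (whether qafs surfaces pandera's SchemaError directly, rather than wrapping it, is an assumption here):

# A dataframe with a negative value that violates the greater_than(0) check.
bad_df = pd.DataFrame({'time': dts[:3], 'numbers': [1, -2, 3]})
try:
    fs.save_dataframe(bad_df, name='numbers', namespace='example')
except pa.errors.SchemaError as err:  # assumption: pandera's error is not wrapped
    print('validation failed:', err)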
To register our squared transformation feature we use the fs.transform decorator, fetching the data from the numbers feature and applying the same data validation as numbers:
@fs.transform(
    'example/squared',
    from_features=['example/numbers'],
    check=Column(pa.Int, Check.greater_than(0))
)
def squared(df):
    return df ** 2
When fetching our features we should see:
df_query = fs.load_dataframe(
    ['example/numbers', 'example/squared'],
    from_date='2021-01-01',
    to_date='2021-01-31'
)
print(df_query.tail(1))
##----
#             example/numbers  example/squared
# time
# 2021-01-31              397           157609
##----
Monitoring
At training time a failing data validation check can break the pipeline, and that is fine: we can stop and carefully think about the data. At serving time, however, that would turn into a broken application. QAFS is able to log data validation errors instead of raising them, avoiding application breaks; for that we must set the environment variable QAFS_RAISE_ERROR=false, which makes QAFS report through logging.error, where an existing application monitoring stack excels at tasks such as log grouping, filtering, building reports, and sending notifications.
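For example, here is a minimal sketch of pointing those error logs at a handler the monitoring stack can consume (the handler setup is plain Python logging, not a qafs API, and the file path is made up):

import logging
import os

# Tell QAFS to log validation errors instead of raising them.
os.environ['QAFS_RAISE_ERROR'] = 'false'

# Route ERROR records somewhere the monitoring stack can pick them up;
# a file handler stands in for a real log shipper here.
handler = logging.FileHandler('/tmp/qafs_validation_errors.log')
handler.setLevel(logging.ERROR)
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(message)s'))
logging.getLogger().addHandler(handler)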
Ending
Remember this is a beta release and we hope to make QAFS a stable tool; check the GitHub page and open an issue for questions or features you would like to see implemented 👊