Typically, your favorite machine learning model doesn't care whether or not your input dataset is technically correct. However, particularly for machine learning algorithms, the old truth "garbage in, garbage out" holds, and it is therefore strongly advised to validate datasets before feeding them into a machine learning algorithm.
Generally, validating datasets is a tedious task, since we have to write a plethora of checks to ensure that the dataset contains all required columns and that the columns contain only expected values. Having written many dataset tests by hand, I was quite happy to stumble upon the Python library great_expectations, which is a promising tool for validating datasets in a painless way.
In this blog post, I want to introduce great_expectations and share some of my thoughts about why I think this tool is a helpful addition to the toolset of every data person.
The problem – why validate datasets?
From a high-level point of view, there are (at least) two kinds of problems that occur while engineering a dataset. First, there are more or less obvious technical errors such as missing rows or columns and wrong datatypes. Second, even when the actual data pipelines are solid and the datasets are put together in a technically correct way, there are often issues with data degenerating over time. Here, too, we have obvious changes, e.g. additional categories in a categorical column. However, many changes in the data go undetected for a long time. For example:
- the values of a binary column might be approximately evenly distributed between 0 and 1 at the beginning, and the distribution could become skewed over time.
- the mean value and standard deviation of the readings emitted by a physical sensor could drift over time (see the short sketch after this list).
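To make the second point concrete, here is a tiny, hand-rolled sketch (plain numpy, not part of great_expectations) of how such a drift can stay invisible: the pipeline keeps running, nothing throws an error, only the statistics move.

import numpy as np

rng = np.random.default_rng(0)

# readings at training time vs. readings a few months later: same pipeline, silently drifted data
sensor_then = rng.normal(loc=20.0, scale=1.0, size=1000)
sensor_now = rng.normal(loc=21.5, scale=1.8, size=1000)

print(sensor_then.mean(), sensor_then.std())  # roughly 20 and 1
print(sensor_now.mean(), sensor_now.std())    # roughly 21.5 and 1.8 -- and no error is raised anywhere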
Obvious changes in the data or mistakes made while engineering the dataset typically lead to errors in the machine learning pipeline and are therefore addressed as soon as they occur. The silent changes, however, are more subtle and can impair the performance of the machine learning model, as visualized in the following picture. For this reason, data monitoring and validation of datasets are crucial when operating machine learning systems.
In the following, we will look at a small example to introduce great_expectations as a tool for dataset validation.
Small example
In our example, we use the public domain hmeq dataset from Kaggle. The context of the dataset is the automation of the decision-making process for approving lines of credit. However, in this blog post we are not interested in the machine learning aspect of the problem. Instead, our goal is to use this dataset to illustrate some ideas of the great_expectations library.
In this small example, we will take a short look at:
- Basic table expectations
- Expectations for categorical data
- Expectations for numeric data
- Saving expectations and validating other datasets
Preliminaries
The recommended way to follow the small example is to create a fresh Python 3.8 environment and install great_expectations and jupyter via

pip install great_expectations
pip install jupyter
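If you want to reproduce the exact calls and output shown in this post, it may make sense to pin the library to the version it was written against (0.8.7, as reported in the expectation suite printed further below), since the API has changed in newer releases:

pip install great_expectations==0.8.7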
Then, we start a Jupyter notebook and import the library with

import great_expectations as ge
Because great_expectations wraps the popular pandas library, we can use pandas functionality to import datasets. Hence, we may use

df = ge.read_csv('hmeq.csv')
to read the dataset. In our example, we want to simulate a situation where we generate expectations for a dataset and then apply these expectations to validate, for example, a newer version of the dataset. For this reason, we execute
df = df.sample(frac=1).reset_index(drop=True)
split = int(len(df) / 2)
df1 = df[:split]
df2 = df[split:]
to shuffle the dataset and split it into two subsets. Now, we can create expectations using df1 and validate the dataset df2 with them.
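As a side note, the object returned by ge.read_csv still behaves like an ordinary pandas DataFrame, just with additional expect_* methods on top, which is why the pandas-style shuffling and slicing above works. A quick check (assuming pandas is installed alongside great_expectations):

import pandas as pd

print(isinstance(df1, pd.DataFrame))  # True: the great_expectations dataset subclasses the pandas DataFrame
print(len(df1), len(df2))             # sizes of the two halves of the shuffled dataset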
Basic table expectations
We can formulate expectations for the table as a whole with great_expectations. For example, we can use

min_table_length = 2500
max_table_length = 3500
df1.expect_table_row_count_to_be_between(min_table_length, max_table_length)
if we have an idea how many rows our dataset should have. Typically, we require specific feature columns in our dataset for our machine learning algorithm. We can create expectations for columns to exist via
feature_columns = ['LOAN', 'VALUE', 'JOB', 'YOJ', 'CLNO', 'DEBTINC']
for col in feature_columns:
    df1.expect_column_to_exist(col)
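Each expectation call is evaluated against df1 right away and returns a result object; in the version used here this is a dict-like structure that contains, among other things, a success flag. A minimal sketch, assuming that structure:

result = df1.expect_column_to_exist('LOAN')
print(result['success'])  # True, since the column is present in df1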
Table expectations provide simple sanity checks for the dataset. great_expectations manages all expectations in a json file. We can print all established expectations with

df1.get_expectation_suite()
So far, the json file should look something like this:

{'data_asset_name': None,
 'expectation_suite_name': 'default',
 'meta': {'great_expectations.__version__': '0.8.7'},
 'expectations': [{'expectation_type': 'expect_table_row_count_to_be_between',
   'kwargs': {'min_value': 2500, 'max_value': 3500}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'LOAN'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'VALUE'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'JOB'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'YOJ'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'CLNO'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'DEBTINC'}}],
 'data_asset_type': 'Dataset'}
Expectations for categorical data
Besides checking the whole dataframe, we can also address specific columns. As an example of categorical data, we use the column 'JOB'. First, we employ

df1.expect_column_values_to_be_of_type('JOB', 'object')
to expect the correct dtype, which typically is 'object' in the case of categorical data. Next, we can create an expectation for the expected values in the column with

expected_jobs = ['Other', 'ProfExe', 'Office', 'Mgr', 'Self', 'Sales']
df1.expect_column_values_to_be_in_set('JOB', expected_jobs)
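If a few unexpected labels should not immediately count as a failure, the check can be relaxed with the mostly argument; the 0.95 threshold below is an arbitrary choice for illustration:

# pass as long as at least 95% of the values are in the expected set
df1.expect_column_values_to_be_in_set('JOB', expected_jobs, mostly=0.95)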
A very nice feature of great_expectations is the possibility to create expectations concerning the distribution of the column values. For this purpose, we start by creating a categorical partition of the data.

expected_job_partition = ge.dataset.util.categorical_partition_data(df1.JOB)
Then, we can use
df1.expect_column_chisquare_test_p_value_to_be_greater_than('JOB', expected_job_partition)
to prepare a Chi-squared test for comparing categorical distributions.
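The partition object is simply a summary of the categories observed in df1 and their relative frequencies, and the significance level of the test can, if I read the API correctly, be set explicitly via the p argument:

# inspect the partition the Chi-squared test will compare against
print(expected_job_partition)

# p sets the significance threshold for the test (0.05 should be the default)
df1.expect_column_chisquare_test_p_value_to_be_greater_than('JOB', expected_job_partition, p=0.05)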
Expectations for numeric data
As an example of numeric data, we use the column 'LOAN'. Again, we start with

df1.expect_column_values_to_be_of_type('LOAN', 'int64')
to prepare a check for the correct dtype. In addition, we can use expectations such as
df1.expect_column_mean_to_be_between('LOAN', 10000, 20000)
df1.expect_column_max_to_be_between('LOAN', 50000, 100000)
df1.expect_column_min_to_be_between('LOAN', 1000, 5000)
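Analogous expectations exist for other summary statistics and for row-level range checks; the bounds below are made up for illustration and would have to be adapted to the actual data:

# made-up bounds; tighten or widen them depending on the spread you expect
df1.expect_column_stdev_to_be_between('LOAN', 5000, 15000)

# row-level check: every single LOAN value has to fall into this range
df1.expect_column_values_to_be_between('LOAN', min_value=1000, max_value=100000)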
to ensure that the minimum, maximum and mean of our data lie within the expected ranges. Moreover, we can create a continuous partition of the data with

expected_loan_partition = ge.dataset.util.continuous_partition_data(df1.LOAN)
and use
df1.expect_column_bootstrapped_ks_test_p_value_to_be_greater_than('LOAN', expected_loan_partition)
to prepare a bootstrapped Kolmogorov-Smirnov test for comparing continuous distributions.
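Besides the distribution tests, simple completeness checks are often just as useful. As a small addition (LOAN should be fully populated in this dataset, while columns such as DEBTINC contain gaps):

# expect no missing values in the LOAN column
df1.expect_column_values_to_not_be_null('LOAN')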
Save expectations and validate other datasets
So far, we have defined multiple expectations regarding the dataset df1. In practice, we would require additional expectations concerning the other columns of our dataset; for the purpose of our (small) example, we stop here. We can save the json file containing our expectations via

df1.save_expectation_suite('some_expectations.json')
In our workflow, we can (and usually should) place the file some_expectations.json under version control. Now, we can use the expectations to validate other datasets.

df2.validate(expectation_suite='some_expectations.json', only_return_failures=True)
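The call returns a validation report that can also be used programmatically; in the version used here it is a dict-like structure exposing an overall success flag and the list of individual results. A sketch under that assumption:

results = df2.validate(expectation_suite='some_expectations.json')
print(results['success'])       # overall outcome of the validation run
print(len(results['results']))  # number of expectations that were evaluated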
In this case, we do not expect to encounter any errors because we randomly split the dataset into two subsets. However, we can see the validation come into play, for example, by dropping a column
df2_missing = df2.drop(columns=['LOAN'])
df2_missing.validate(expectation_suite='some_expectations.json', only_return_failures=True)
or by setting a loan value that is too small

df2_min_low = df2.copy()
# force an implausibly small loan value into one of the existing rows
df2_min_low.at[df2_min_low.index[4], 'LOAN'] = 10
df2_min_low['LOAN'] = df2_min_low['LOAN'].astype('int64')
df2_min_low.validate(expectation_suite='some_expectations.json', only_return_failures=True)
Conclusion
In the example, we only covered a small subset of the available features of great_expectations. The tool offers more functionality such as
- more built-in expectations and even custom expectations
- ways to integrate into data pipelines, e.g. with support for Spark
- web-based data profiling and exploration
- Slack notifications for failed validations
which I have not used outside of small tests.
In my opinion, great_expectations appears to be a useful addition to the toolkit of every data scientist and data engineer. It has a low barrier to entry, since it can basically be reduced to an additional json file living in the code repository, but it has the potential to significantly simplify validating datasets and, in particular, debugging data pipelines.
At the moment, I am not a great fan of the initialization via great_expectations init and the resulting folder structure in the project directory. However, I have not used great_expectations under real conditions, and maybe there are advantages to this setup that I do not see yet.
Overall, great_expectations appears to integrate nicely into many machine learning pipelines, and I cannot wait to test the tool extensively in future projects. If you have any experience with great_expectations, feel free to share it in the comments.