Typically, your favorite machine learning model doesn't care whether or not your input dataset is technically correct. However, particularly for machine learning algorithms, the old truth "garbage in, garbage out" holds, and it is therefore strongly advised to validate datasets before feeding them into a machine learning algorithm.
Generally, validating datasets is a tedious task, since we have to write a plethora of checks to ensure that the dataset contains all required columns and that the columns contain only expected values. Having written many dataset tests by hand, I was quite happy to stumble upon the Python library great_expectations, which is a promising tool for validating datasets in a painless way.
In this blog post, I want to introduce great_expectations and share some of my thoughts about why I think this tool is a helpful addition to the toolset of every data person.
The problem – why validate datasets?
From a high-level point of view, there are (at least) two kinds of problems that occur while engineering a dataset. First, there are more or less obvious technical errors such as missing rows or columns and wrong datatypes. Second, even when the actual data pipelines are solid and the datasets are put together in a technically correct way, there are often issues with data degenerating over time. Here, too, we have obvious changes, e.g. additional categories in a categorical column. However, many changes in the data go undetected for a long time. For example:
- the values of a binary column might be approximately evenly distributed between 0 and 1 at the beginning, and the distribution could become skewed over time.
- the mean value and standard deviation of the readings emitted by a physical sensor could drift over time (see the short sketch after this list).
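To make the second point concrete, here is a tiny, hand-rolled sketch (plain numpy, not part of great_expectations) of how such a drift can stay invisible: the pipeline keeps running, nothing throws an error, only the statistics move.

import numpy as np

rng = np.random.default_rng(0)

# readings at training time vs. readings a few months later: same pipeline, silently drifted data
sensor_then = rng.normal(loc=20.0, scale=1.0, size=1000)
sensor_now = rng.normal(loc=21.5, scale=1.8, size=1000)

print(sensor_then.mean(), sensor_then.std())  # roughly 20 and 1
print(sensor_now.mean(), sensor_now.std())    # roughly 21.5 and 1.8 -- and no error is raised anywhere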
Obvious changes in the data or mistakes made while engineering the dataset typically lead to errors in the machine learning pipeline and are therefore addressed as soon as they occur. The silent changes, however, are more subtle and can impair the performance of the machine learning model, as visualized in the following picture. For this reason, data monitoring and validation of datasets are crucial when operating machine learning systems.
In the following, we will look at a small example to introduce great_expectations as a tool for dataset validation.
Small example
In our example, we use the public domain hmeq dataset from Kaggle. The context of the dataset is the automation of the decision-making process for approving lines of credit. However, in this blog post we are not interested in the machine learning aspect of the problem. Instead, our goal is to use this dataset to illustrate some ideas of the great_expectations library.
In this small example, we will take a short look at:
- Basic table expectations
- Expectations for categorical data
- Expectations for numeric data
- Saving expectations and validating other datasets
Preliminaries
The recommended way to follow the small example is to create a fresh Python 3.8 environment and install great_expectations and jupyter via

pip install great_expectations
pip install jupyter
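If you want to reproduce the exact calls and output shown in this post, it may make sense to pin the library to the version it was written against (0.8.7, as reported in the expectation suite printed further below), since the API has changed in newer releases:

pip install great_expectations==0.8.7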
Then, we start a Jupyter notebook and import the library with

import great_expectations as ge
Because great_expectations wraps the popular pandas library, we can use pandas functionality to import datasets. Hence, we may use

df = ge.read_csv('hmeq.csv')
to read the dataset. In our example, we want to simulate a situation where we generate expectations for a dataset and then apply these expectations to validate, for example, a newer version of the dataset. For this reason, we execute
df = df.sample(frac=1).reset_index(drop=True)
split = int(len(df) / 2)
df1 = df[:split]
df2 = df[split:]
to shuffle the dataset and split it into two subsets. Now, we can create expectations using df1 and validate the dataset df2 with them.
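As a side note, the object returned by ge.read_csv still behaves like an ordinary pandas DataFrame, just with additional expect_* methods on top, which is why the pandas-style shuffling and slicing above works. A quick check (assuming pandas is installed alongside great_expectations):

import pandas as pd

print(isinstance(df1, pd.DataFrame))  # True: the great_expectations dataset subclasses the pandas DataFrame
print(len(df1), len(df2))             # sizes of the two halves of the shuffled dataset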
Basic table expectations
We can formulate expectations for the table as a whole with great_expectations. For example, we can use

min_table_length = 2500
max_table_length = 3500
df1.expect_table_row_count_to_be_between(min_table_length, max_table_length)
if we have an idea how many rows our dataset should have. Typically, we require specific feature columns in our dataset for our machine learning algorithm. We can create expectations for columns to exist via
feature_columns = ['LOAN', 'VALUE', 'JOB', 'YOJ', 'CLNO', 'DEBTINC']
for col in feature_columns:
    df1.expect_column_to_exist(col)
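Each expectation call is evaluated against df1 right away and returns a result object; in the version used here this is a dict-like structure that contains, among other things, a success flag. A minimal sketch, assuming that structure:

result = df1.expect_column_to_exist('LOAN')
print(result['success'])  # True, since the column is present in df1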
Table expectations provide simple sanity checks for the dataset. great_expectations manages all expectations in a json file. We can print all established expectations with

df1.get_expectation_suite()
So far, the json file should look something like this:

{'data_asset_name': None,
 'expectation_suite_name': 'default',
 'meta': {'great_expectations.__version__': '0.8.7'},
 'expectations': [{'expectation_type': 'expect_table_row_count_to_be_between',
   'kwargs': {'min_value': 2500, 'max_value': 3500}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'LOAN'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'VALUE'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'JOB'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'YOJ'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'CLNO'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'DEBTINC'}}],
 'data_asset_type': 'Dataset'}
Expectations for categorical data
Besides checking the whole dataframe, we can also address specific columns. As an example of categorical data, we use the column 'JOB'. First, we employ

df1.expect_column_values_to_be_of_type('JOB', 'object')
to expect the correct dtype, which typically is 'object' in the case of categorical data. Next, we can create an expectation for the expected values in the column with

expected_jobs = ['Other', 'ProfExe', 'Office', 'Mgr', 'Self', 'Sales']
df1.expect_column_values_to_be_in_set('JOB', expected_jobs)
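If a few unexpected labels should not immediately count as a failure, the check can be relaxed with the mostly argument; the 0.95 threshold below is an arbitrary choice for illustration:

# pass as long as at least 95% of the values are in the expected set
df1.expect_column_values_to_be_in_set('JOB', expected_jobs, mostly=0.95)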
A very nice feature of great_expectations is the possibility to create expectations concerning the distribution of the column values. For this purpose, we start by creating a categorical partition of the data.

expected_job_partition = ge.dataset.util.categorical_partition_data(df1.JOB)
Then, we can use
df1.expect_column_chisquare_test_p_value_to_be_greater_than('JOB', expected_job_partition)
to prepare a Chi-squared test for comparing categorical distributions.
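The partition object is simply a summary of the categories observed in df1 and their relative frequencies, and the significance level of the test can, if I read the API correctly, be set explicitly via the p argument:

# inspect the partition the Chi-squared test will compare against
print(expected_job_partition)

# p sets the significance threshold for the test (0.05 should be the default)
df1.expect_column_chisquare_test_p_value_to_be_greater_than('JOB', expected_job_partition, p=0.05)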
Expectations for numeric data
As an example of numeric data, we use the column 'LOAN'. Again, we start with

df1.expect_column_values_to_be_of_type('LOAN', 'int64')
to prepare a check for the correct dtype. In addition, we can use expectations such as
df1.expect_column_mean_to_be_between('LOAN', 10000, 20000)
df1.expect_column_max_to_be_between('LOAN', 50000, 100000)
df1.expect_column_min_to_be_between('LOAN', 1000, 5000)
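Analogous expectations exist for other summary statistics and for row-level range checks; the bounds below are made up for illustration and would have to be adapted to the actual data:

# made-up bounds; tighten or widen them depending on the spread you expect
df1.expect_column_stdev_to_be_between('LOAN', 5000, 15000)

# row-level check: every single LOAN value has to fall into this range
df1.expect_column_values_to_be_between('LOAN', min_value=1000, max_value=100000)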
to ensure that the minimum, maximum and mean of our data lie within the expected ranges. Moreover, we can create a continuous partition of the data with

expected_loan_partition = ge.dataset.util.continuous_partition_data(df1.LOAN)
and use
df1.expect_column_bootstrapped_ks_test_p_value_to_be_greater_than('LOAN', expected_loan_partition)
to prepare a bootstrapped Kolmogorov-Smirnov test for comparing continuous distributions.
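Besides the distribution tests, simple completeness checks are often just as useful. As a small addition (LOAN should be fully populated in this dataset, while columns such as DEBTINC contain gaps):

# expect no missing values in the LOAN column
df1.expect_column_values_to_not_be_null('LOAN')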
Save expectations and validate other datasets
So far, we have defined multiple expectations regarding the dataset df1. In practice, we would require additional expectations concerning the other columns of our dataset; for the purpose of our (small) example, we stop here. We can save the json file containing our expectations via

df1.save_expectation_suite('some_expectations.json')
In our workflow, we can (and usually should) place the file some_expectations.json under version control. Now, we can use the expectations to validate other datasets.

df2.validate(expectation_suite='some_expectations.json', only_return_failures=True)
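The call returns a validation report that can also be used programmatically; in the version used here it is a dict-like structure exposing an overall success flag and the list of individual results. A sketch under that assumption:

results = df2.validate(expectation_suite='some_expectations.json')
print(results['success'])       # overall outcome of the validation run
print(len(results['results']))  # number of expectations that were evaluated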
In this case, we do not expect to encounter any errors because we randomly split the dataset into two subsets. However, we can see the validation come into play, for example, by dropping a column
df2_missing = df2.drop(columns=['LOAN'])
df2_missing.validate(expectation_suite='some_expectations.json', only_return_failures=True)
or by setting a loan value that is too small

df2_min_low = df2.copy()
# force an implausibly small loan value into one of the existing rows
df2_min_low.at[df2_min_low.index[4], 'LOAN'] = 10
df2_min_low['LOAN'] = df2_min_low['LOAN'].astype('int64')
df2_min_low.validate(expectation_suite='some_expectations.json', only_return_failures=True)
Conclusion
In the example, we only covered a small subset of the available features of great_expectations. The tool offers more functionality such as
- more built-in expectations and even custom expectations
- ways to integrate into data pipelines, e.g. with support for Spark
- web-based data profiling and exploration
- Slack notifications for failed validations
which I have not used outside of small tests.
In my opinion, great_expectations appears to be a useful addition to the toolkit of every data scientist and data engineer. It has a low barrier to entry, since it can basically be reduced to an additional json file living in the code repository, but it has the potential to significantly simplify validating datasets and, in particular, debugging data pipelines.
At the moment, I am not a great fan of the initialization via great_expectations init and the resulting folder structure in the project directory. However, I have not used great_expectations under real conditions, and maybe there are advantages to this setup that I do not see yet.
Overall, great_expectations appears to integrate nicely into many machine learning pipelines, and I cannot wait to test the tool extensively in future projects. If you have any experience with great_expectations, feel free to share it in the comments.