
Move n-gram extraction into your Keras model!

18.7.2019 | 7 minutes of reading time


In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it with bigrams and trigrams instead of single characters. For his experiments he could modify the preprocessing and the model as he wished, but for production it was much preferable to just replace the model served by TensorFlow and leave all other code unchanged. And that is what we did: we moved the bigram and trigram extraction into our neural network. In this blog post, I’ll show you the basic idea, the implementation, an application and the limitations of our approach.

The idea: n-gram extraction via convolution

Suppose we want to process the quote

“I’d far rather be happy than right any day”

by Douglas Adams. Instead of looking at the text as a sequence of characters

I, ’, d, ␣, f, a, r, ␣, r, a, t, h, e, r, …

a neural network may profit from looking at pairs of adjacent characters (writing ␣ for a space), that is, at the sequence of bigrams

I’, ’d, d␣, ␣f, fa, ar, r␣, ␣r, ra, at, th, he, er, …

or even at the sequence of trigrams or n-grams for n larger than 3. To feed the neural network, we need to convert characters into numbers, for example, using their ASCII or UTF-8 codes. Our bigrams then become a sequence of pairs of numbers:

(73, 39), (39, 100), (100, 32), (32, 102), (102, 97), …

If we encode these bigrams using the rule (a, b) ↦ N · a + b, where N is the size of our alphabet, we obtain a sequence of numbers again: in the case N = 256, this would be

73 · 256 + 39 = 18727, 39 · 256 + 100 = 10084, 100 · 256 + 32 = 25632, 32 · 256 + 102 = 8294, …

More generally, we can encode n-grams for arbitrary n using the rule

(a₀, …, aₙ₋₁) ↦ Nⁿ⁻¹ · a₀ + Nⁿ⁻² · a₁ + … + N · aₙ₋₂ + aₙ₋₁.

Here comes the key observation: with this encoding rule,

extracting n-grams becomes a convolution of the sequence of character codes with the kernel (1, N, …, Nⁿ⁻¹).

And this preprocessing step can easily be inserted as a first step into any character-level text-processing neural network.

The implementation

As a warm-up, let us implement the n-gram extraction as a convolution with NumPy . Given a NumPy array of character codes, the n-gram length n and the size of the alphabet N, the following function returns the sequence of encoded n-grams as an array:

import numpy as np

def ngrams_numpy(array, n, alphabet_size):
    # Kernel (1, N, ..., N^(n-1)); np.convolve flips it, so each output
    # is N^(n-1) * a_0 + ... + N * a_(n-2) + a_(n-1), as derived above.
    kernel = np.power(alphabet_size, range(0, n))
    return np.convolve(array, kernel, mode='valid')
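
As a quick check, the function reproduces the encoded bigrams of “I’d f” computed by hand above:

codes = np.array([73, 39, 100, 32, 102])   # character codes of "I'd f"
print(ngrams_numpy(codes, 2, 256))         # [18727 10084 25632  8294]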

Next, how about the deep learning library Keras? Suppose we already have a working text-processing model whose input is (batches of) sequences of character codes. Then we can add bigram or n-gram extraction as a first layer using a lambda layer in one line. Indeed, given a batch of samples in the form of a tensor of shape (batch_size, sample_length), the following function returns a batch of encoded bigrams in the form of a tensor of shape (batch_size, sample_length - 1):

from keras.layers import Lambda

def bigrams_lambda_layer(alphabet_size):
    # Encodes each adjacent pair (a, b) as a + N * b. This is the mirror
    # image of the rule N * a + b used above (tensor slicing does not flip
    # anything, unlike np.convolve), but it is just as injective.
    return Lambda(lambda x: x[:, :-1] + x[:, 1:] * alphabet_size)
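
Used in a functional model, this is a one-liner (here input_tensor stands for any tensor of character codes, as in the usage example further below):

bigram_tensor = bigrams_lambda_layer(alphabet_size)(input_tensor)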

However, lambda layers in Keras may cause problems when saving, loading or checkpointing the model.

For further deployment of a model, for example with TensorFlow Serving, it might be better to avoid a lambda layer and to use a 1d-convolutional layer with fixed weights as follows:

import numpy as np
from keras import layers, backend

def ngram_block(n, alphabet_size):
    def wrapped(inputs):
        # One feature map, window size n, no bias, frozen weights.
        layer = layers.Conv1D(1, n, use_bias=False, trainable=False)
        # Conv1D expects sequences of vectors, not of scalars.
        x = layers.Reshape((-1, 1))(inputs)
        x = layer(x)
        # Fix the kernel to (1, N, ..., N^(n-1)) once the layer is built.
        kernel = np.power(alphabet_size, range(0, n),
                          dtype=backend.floatx())
        layer.set_weights([kernel.reshape(n, 1, 1)])
        return layers.Reshape((-1,))(x)

    return wrapped

This function can be used like a layer:

bigrams_tensor = ngram_block(2, alphabet_size)(input_tensor)

See also the source code for the experiment below. What this function does is

  • create a 1d-convolutional layer layer with one feature map, window size n, no bias vector and frozen weights that are not changed during training,
  • reshape the input inputs, which is a tensor of shape (batch_size, sample_length), to a tensor x with shape (batch_size, sample_length, 1) (necessary because convolutional layers operate on sequences of vectors and not on sequences of scalars),
  • apply the convolutional layer to the reshaped input,
  • set the kernel of the convolutional layer to the fixed weights (1, N, …, Nⁿ⁻¹) and
  • reshape the output of the convolutional layer from (batch_size, sample_length - n + 1, 1) to (batch_size, sample_length - n + 1).
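
As a sanity check (not part of the original pipeline), we can wrap the block in a minimal model and compare it with the NumPy warm-up. One subtlety: Keras convolutions do not flip the kernel, while np.convolve does, so the block encodes an n-gram as a₀ + N · a₁ + … + Nⁿ⁻¹ · aₙ₋₁, the mirror image of the rule above; the encoding is still injective, so the downstream model does not care:

import numpy as np
from keras import layers, models

inp = layers.Input(shape=(5,))
check = models.Model(inputs=inp, outputs=ngram_block(2, 256)(inp))

codes = np.array([[73, 39, 100, 32, 102]])  # "I'd f"
print(check.predict(codes))  # [[10057. 25639.  8292. 26144.]]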

An experiment

Let us finally see how this idea works out for a classical test case, the 20 newsgroups dataset, where the task is to guess the topic of a given post from its text. We shall use a simple character-level convolutional network and see how n-gram extraction inside the model affects classification accuracy and training time.

To load the data, we use the datasets module of scikit-learn:

from sklearn.datasets import fetch_20newsgroups as fetch

data = fetch(subset="train", remove=("headers", "footers", "quotes"))
posts, topics = data["data"], data["target"]

Now posts is a list of newsgroup posts as strings, and topics is a list of numbers representing the respective newsgroup topics. For each topic, we have 350 to 600 samples.
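
One quick way to verify this (a sketch, not from the original pipeline) is to count the labels directly:

import numpy as np

counts = np.bincount(topics)  # samples per topic
print(counts.min(), counts.max())  # roughly 350 to 600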

Note that this is way too little data for a character-level model to perform well. But let us try nevertheless.

We apply some minimal preprocessing and

  • convert the characters to lower case,
  • filter out all characters that are not contained in our chosen ALPHABET,
  • replace the remaining characters by their index in the ALPHABET,
  • trim (or, for short posts, cyclically pad) the sequence of indices to a fixed length MAX_LEN, which is exactly what np.resize does,
  • stack all those sequences into one large NumPy array:
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz1234567890 !$#()-=+:;,.?/"
MAX_LEN = 1000

def encode_sample(sample, index):
    indices = [index[char] for char in sample if char in index]
    return np.resize(np.array(indices), MAX_LEN)

# Character indices start at 1, so 0 never encodes a character.
index = {char: i + 1 for i, char in enumerate(ALPHABET)}
X = np.stack([encode_sample(x.lower(), index) for x in posts])
y = np.eye(20)[topics]

Now X is an array of shape (len(posts), MAX_LEN), and y is an array of shape (len(posts), 20) containing the one-hot encoded topics.

As a baseline, we train a simple convolutional model:

from keras import layers, models, optimizers

LAYER_PARAMS = [[64, 3, 3], [128, 3, 3]]
EMBEDDING_DIM = 16

def build_model():
    inputs = layers.Input(shape=(MAX_LEN,))
    # len(ALPHABET) + 1 embedding rows, since the character indices
    # run from 1 up to and including len(ALPHABET)
    x = layers.Embedding(len(ALPHABET) + 1, EMBEDDING_DIM)(inputs)
    for filters, kernel_size, pool_size in LAYER_PARAMS:
        x = layers.Conv1D(filters, kernel_size, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.SpatialDropout1D(0.15)(x)
        x = layers.MaxPooling1D(pool_size)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(20, activation="softmax")(x)
    model = models.Model(inputs=inputs, outputs=x)
    model.compile(optimizer=optimizers.Adadelta(),
                  loss="categorical_crossentropy", metrics=["acc"])
    return model

model = build_model()
history = model.fit(X, y, epochs=60, batch_size=20,
                    validation_split=0.2)

The results are quite poor: the validation accuracy reaches just 60 percent.

By careful tuning of hyperparameters, things certainly could be improved a bit.
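
If you want to inspect the training history yourself, a minimal plot will do. This sketch assumes matplotlib and the history object from the fit call above (the keys acc and val_acc correspond to the metrics=["acc"] setting):

import matplotlib.pyplot as plt

# Plot training and validation accuracy per epoch.
plt.plot(history.history["acc"], label="training accuracy")
plt.plot(history.history["val_acc"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()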

Now let us see how bigram and trigram extraction affects the performance of the model. Using the function ngram_block, we only need to insert the line x = ngram_block(n, len(ALPHABET))(inputs) between the Input and Embedding layers in build_model as follows:

def build_ngram_model(n):
    inputs = layers.Input(shape=(MAX_LEN,))
    x = ngram_block(n, len(ALPHABET))(inputs)
    # (len(ALPHABET) + 1)**n embedding rows, so that every n-gram code
    # built from indices 1 to len(ALPHABET) stays in range
    x = layers.Embedding(pow(len(ALPHABET) + 1, n), n * EMBEDDING_DIM)(x)
    for filters, kernel_size, pool_size in LAYER_PARAMS:
        x = layers.Conv1D(filters, kernel_size, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.SpatialDropout1D(0.05 + 0.1 * n)(x)
        x = layers.MaxPooling1D(pool_size)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(20, activation="softmax")(x)
    model = models.Model(inputs=inputs, outputs=x)
    model.compile(optimizer=optimizers.Adadelta(),
                  loss="categorical_crossentropy", metrics=["acc"])
    return model

We also raise the embedding dimension (because now we embed bigrams and trigrams instead of single characters) and use a spatial dropout rate that grows with n. Let us see how the n-gram model performs:

for n in range(1, 4):
    build_ngram_model(n).fit(X, y, epochs=40,
                             batch_size=20, validation_split=0.2)

The training histories show that n-gram extraction yields a significant improvement.

Indeed, the mean validation accuracy of the last 5 training epochs increased by more than 10 percentage points:

n                           1        2        3
mean validation accuracy    0.5796   0.6401   0.7064
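
Such means can be read off the History object that fit returns. A minimal sketch, assuming we keep the history of a single run:

import numpy as np

history = build_ngram_model(3).fit(X, y, epochs=40,
                                   batch_size=20, validation_split=0.2)
# Mean validation accuracy over the last 5 epochs.
print(np.mean(history.history["val_acc"][-5:]))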

One limitation of the technique

Why did we stop at trigrams in the experiment above? The reason is that we do not only encode the n-grams that actually occur in our samples, but reserve codes for all n-grams that could possibly occur. And that makes a huge difference as n grows larger:

n                       1     2       3         4           5
#(occurring n-grams)    52    2,596   47,203    214,362     551,904
#(potential n-grams)    51    2,601   132,651   6,765,201   345,025,251

And therefore, the embedding layer needs an amount of memory that grows exponentially with n. This is the reason why we stick to bigrams or trigrams. By the way, the numbers above were computed as follows:

import pandas as pd

def all_ngrams(n):
    length = MAX_LEN - n + 1
    def ngrams(x):
        # All n-grams of one sample as a set of n-tuples.
        return set(zip(*[x[i:length + i] for i in range(0, n)]))

    return set().union(*[ngrams(x) for x in X])

ns = range(1, 6)
alphabet_size = len(ALPHABET)
cts = {'#(occurring n-grams)': [len(all_ngrams(n)) for n in ns],
       '#(potential n-grams)': [pow(alphabet_size, n) for n in ns]}
pd.DataFrame(cts, index=pd.Index(ns, name='n')).transpose()
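
To see just how far out of reach n = 5 is, here is a back-of-the-envelope estimate of the embedding memory, assuming float32 weights and the n * EMBEDDING_DIM sizing from build_ngram_model:

n = 5
rows = pow(len(ALPHABET), n)        # 345,025,251 potential 5-grams
floats = rows * n * EMBEDDING_DIM   # 80-dimensional embeddings
print(floats * 4 / 2**30, "GiB")    # roughly 103 GiB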
