How do you solve deep learning problems with too little labelled data? The answer, of course, is transfer learning. In this post, we will apply this concept to named entity recognition (NER) and
- fine-tune a pre-trained BERT to extract information from legal texts,
- encounter a token misalignment problem due to BERT’s preference for sub-word tokens, and
- observe tremendous improvements on difficult classes compared to the hand-made bi-LSTM model of our previous posts.
Let’s get started!
The NER dataset and task
We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider. It consists of German court decisions annotated with entities referring to legal norms, court decisions, legal literature, and so on.
The task for our model will be to annotate, given a sample sentence, each word with a tag that indicates whether this word is part of a reference to a legal norm, a court decision, and so on. For more details, see the first post of this series.
The transformer revolution
In case you haven’t read about transformers, here’s a summary. For details on the original transformer architecture, see the original paper or one of the many blog posts on the topic.
Transformers transformed natural language processing (NLP) with
- a revolutionary attention mechanism that replaces convolutional or recurrent architectures,
- a shift in transfer learning from pre-training (word vectors) for feature extraction to training generic language models plus fine-tuning on downstream tasks, and
- an exponential growth of model size that brought us performance on par with humans on a number of NLP tasks, but also exploding resource consumption with diminishing returns.
To leverage transformers for our custom NER task, we’ll use the Python library huggingface transformers which provides
- a model repository including BERT, GPT-2 and others, pre-trained in a variety of languages,
- wrappers for downstream tasks like classification, named entity recognition, summarization, et cetera, and
- convenient ways of fine-tuning on downstream tasks, e.g. in end-to-end pipelines (see the sketch below) or via TensorFlow or PyTorch.
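To get a feeling for these pipelines, here is a minimal sketch: pipeline('ner') downloads a default, English pre-trained NER model and tags a sentence out of the box. This only illustrates the API; below we fine-tune our own German model instead.

```python
from transformers import pipeline

# Minimal illustration of the pipeline API: 'ner' pulls a default,
# English pre-trained token classification model from the model hub.
ner = pipeline('ner')
print(ner('Hugging Face is based in New York City.'))
```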
Get your keyboard ready or follow along just reading!
Setting up the environment
Set up a virtual environment, install the required dependencies and download the dataset, just as in the preceding blog posts:
```bash
mkdir transformers_ner_project && cd transformers_ner_project
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas tqdm scikit-learn transformers[tf-cpu]
mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw
```
Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.
Step 1: Loading a pre-trained BERT
With huggingface transformers, it’s super-easy to get a state-of-the-art pre-trained transformer model nicely packaged for our NER task: we choose a pre-trained German BERT model from the model repository and request a wrapped variant with an additional token classification layer for NER, all in just a few lines:
```python
from transformers import AutoConfig, TFAutoModelForTokenClassification

MODEL_NAME = 'bert-base-german-cased'

config = AutoConfig.from_pretrained(MODEL_NAME, num_labels=len(schema))
model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                          config=config)
model.summary()
```
The result is a TensorFlow model consisting of the pre-trained BERT transformer, followed by a dropout layer and a dense classifier layer that predicts the tag of each token:
Model: "tf_bert_for_token_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBERTMainLayer) multiple 109081344
_________________________________________________________________
dropout_37 (Dropout) multiple 0
_________________________________________________________________
classifier (Dense) multiple 16149
=================================================================
Total params: 109,097,493
Trainable params: 109,097,493
Non-trainable params: 0
_________________________________________________________________
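As a quick sanity check on these numbers: the classifier contributes 16,149 = (768 + 1) × 21 parameters, which fits a dense layer mapping BERT’s 768-dimensional hidden states (plus a bias) to 21 labels, i.e. the dataset’s 19 entity classes plus the 'O' tag and the padding label '_' that we introduce during preprocessing below.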
Step 2: Preprocessing
The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line:
```
an                     O
Kapitalgesellschaften  O
(                      O
§                      B-GS
17                     I-GS
Abs.                   I-GS
1                      I-GS
und                    I-GS
2                      I-GS
EStG                   I-GS
)                      O
```
We read two data files line by line, store the sentences as lists of token-tag pairs, and determine the annotation schema, just as we did for training our bi-LSTM model:
```python
def load_data(filename: str):
    with open(filename, 'r') as file:
        lines = [line[:-1].split() for line in file]
    samples, start = [], 0
    for end, parts in enumerate(lines):
        if not parts:                      # blank line: a sentence ends here
            sample = [(token, tag.split('-')[-1])   # strip the B-/I- prefix
                      for token, tag in lines[start:end]]
            samples.append(sample)
            start = end + 1
    if start < end:                        # trailing sentence without a final blank line
        samples.append([(token, tag.split('-')[-1])
                        for token, tag in lines[start:end]])
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
samples = train_samples + val_samples
schema = ['_'] + sorted({tag for sentence in samples
                         for _, tag in sentence})
```
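A quick way to check the result (the exact output depends on the downloaded files, so it is omitted here):

```python
# Inspect the first parsed sentence and the derived annotation schema.
print(train_samples[0][:5])            # first five (token, tag) pairs
print(len(train_samples), len(val_samples))
print(schema)
```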
Gotcha! Sub-word tokenization?
But how do we feed the data into our transformer? The answer depends on the model we chose, because it has been pre-trained together with a custom sub-word tokenizer. This tokenizer splits an input sentence into a sequence of sub-word tokens instead of words, using an algorithm like byte-pair encoding or unigram language models. Let’s get hold of the tokenizer that was used to pre-train our model,
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```
and apply it to some samples. The results are dictionaries in which we’re mainly interested in the component input_ids:
| sample | tokenizer(sample)['input_ids'] |
| --- | --- |
| 'Das ist' | [3, 295, 127, 4] |
| 'eine Frage' | [3, 155, 1685, 4] |
| 'eine hochinteressante Frage' | [3, 155, 2426, 21477, 5004, 1685, 4] |
What do we see?
- The tokenizer marks the beginning and the end of a sample with a 3 and a 4, respectively.
- Common words like 'Das', 'ist', 'eine', 'Frage' are treated as single tokens.
- Less frequent words like 'hochinteressante' are split up into a sequence of sub-word tokens (the snippet below makes this visible).
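To see the splitting explicitly, we can map the ids back to the tokenizer’s vocabulary; the exact sub-word pieces depend on the pre-trained vocabulary:

```python
# Map the sub-word ids back to human-readable pieces; the less frequent word
# comes back as several '##'-prefixed fragments between [CLS] and [SEP].
ids = tokenizer('eine hochinteressante Frage')['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
```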
So we need to
- apply the sub-word tokenizer to every word in our input samples, and
- whenever it does split up a word, tag each sub-word with the tag of the entire word.
This can be done as follows:
```python
import numpy as np
from tqdm import tqdm

def tokenize_sample(sample):
    seq = [
        (subtoken, tag)
        for token, tag in sample
        for subtoken in tokenizer(token)['input_ids'][1:-1]   # drop per-word [CLS]/[SEP]
    ]
    return [(3, 'O')] + seq + [(4, 'O')]   # re-add [CLS] (3) and [SEP] (4) once per sentence

def preprocess(samples):
    tag_index = {tag: i for i, tag in enumerate(schema)}
    tokenized_samples = list(tqdm(map(tokenize_sample, samples)))
    max_len = max(map(len, tokenized_samples))
    X = np.zeros((len(samples), max_len), dtype=np.int32)
    y = np.zeros((len(samples), max_len), dtype=np.int32)
    for i, sentence in enumerate(tokenized_samples):
        for j, (subtoken_id, tag) in enumerate(sentence):
            X[i, j] = subtoken_id
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
```
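At this point, X_train and y_train are integer matrices of shape (number of sentences, length of the longest tokenized sentence), zero-padded, i.e. padded with the label '_':

```python
# Both matrices share the same padded shape; unused positions stay 0 ('_').
print(X_train.shape, y_train.shape)
```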
Step 3: Fine-tuning BERT on our custom NER task
Training the model is now more or less the same as in the preceding post with our bi-LSTM model:
```python
import tensorflow as tf

EPOCHS = 10
BATCH_SIZE = 16

optimizer = tf.keras.optimizers.Adam(learning_rate=0.00001)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
history = model.fit(tf.constant(X_train), tf.constant(y_train),
                    validation_split=0.2, epochs=EPOCHS,
                    batch_size=BATCH_SIZE)
```
Well, except that now the model has some more parameters and training for just one epoch might take … some hours, depending on your hardware. Here’s the validation accuracy (note the lower bound):
Note the restricted range of the accuracy axis, and that the x-axis measures the training time in seconds.
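Since the fine-tuning run is expensive, it can pay off to persist the result before evaluating. This step is not part of the original walkthrough, but the transformers API provides save_pretrained for both the model and the tokenizer (the target directory is arbitrary):

```python
# Persist the fine-tuned weights and the tokenizer for later reuse;
# reload them with the corresponding from_pretrained calls.
model.save_pretrained('model/german-legal-ner')
tokenizer.save_pretrained('model/german-legal-ner')
```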
Step 4: Evaluation — gotcha again!
Now that we have trained our custom-NER-BERT, we want to apply it and … face another problem: the model predicts tag annotations on the sub-word level, not on the word level. To obtain word-level annotations, we need to aggregate the sub-word level predictions for each word. Two obvious solutions come to mind:
- for each sub-word, choose the tag with highest probability, and then use a majority vote, or
- average the predicted probabilities over all sub-words of a word, and then take the tag with highest average probability.
Given predictions pred for a sequence seq of sub-words, with shape (len(seq), len(schema)), this amounts to taking the tag indexed by

1. scipy.stats.mode(np.argmax(pred, axis=-1)), using the package SciPy, or
2. np.argmax(np.mean(pred, axis=0)),

respectively, or, in the picture below, to going 1. first right, then down, or 2. first down, then right:
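To make the difference concrete, here is a toy example with made-up probabilities for a word split into three sub-words and four tags; note that the two variants can disagree:

```python
import numpy as np
from scipy import stats

# Hypothetical predictions: three sub-words, four tags (made-up numbers).
pred = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.4, 0.3, 0.2, 0.1],
                 [0.5, 0.2, 0.2, 0.1]])

# Variant 1: majority vote over the per-sub-word argmax tags -> tag 0.
# (keepdims=False requires SciPy >= 1.9)
print(stats.mode(np.argmax(pred, axis=-1), keepdims=False).mode)

# Variant 2: argmax of the averaged probabilities -> tag 1.
print(np.argmax(np.mean(pred, axis=0)))
```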
We choose variant 2 and apply it to the model’s predictions as follows:
```python
def aggregate(sample, predictions):
    results = []
    i = 1                                  # skip the leading [CLS] token
    for token, y_true in sample:
        # number of sub-words this token was split into (without [CLS]/[SEP])
        nr_subtoken = len(tokenizer(token)['input_ids']) - 2
        pred = predictions[i:i + nr_subtoken]
        i += nr_subtoken
        # variant 2: sum (equivalently, average) over the sub-word predictions
        y_pred = schema[np.argmax(np.sum(pred, axis=0))]
        results.append((token, y_true, y_pred))
    return results

y_probs = model.predict(X_val)[0]
predictions = [aggregate(sample, predictions)
               for sample, predictions in zip(val_samples, y_probs)]
```
Finally, we can evaluate the predictions on the level of tokens as a multi-class classification problem, using scikit-learn again as in the preceding blog post, and then plot the resulting F1 score of each tag class against its support (the number of occurrences in the validation data).
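A minimal sketch of what that evaluation might look like, using scikit-learn's classification_report on the flattened (token, y_true, y_pred) triples (the full details are in the previous post):

```python
from sklearn.metrics import classification_report

# Flatten the aggregated (token, y_true, y_pred) triples into label lists
# and report per-class precision, recall, F1 and support.
true_tags = [t for sentence in predictions for _, t, _ in sentence]
pred_tags = [p for sentence in predictions for _, _, p in sentence]
print(classification_report(true_tags, pred_tags))
```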
Conclusion
Let’s see how our new results compare to those of the previous post, and note that I’ve let BERT train 50 times as long as the bi-LSTM:
We see that BERT significantly outperforms the bi-LSTM on the difficult classes of our task. Is this only because of the more powerful network architecture and the longer training time? No! The scatterplot above shows a significant correlation between the F1 score and the amount of training data per class, and points us to the key advantage of the present approach: transfer learning.
- Before (bi-LSTM), we used transfer learning only in the form of pre-trained word embeddings.
- Now (BERT), we start from a fully pre-trained language model that embodies much more knowledge.
The upshot is:
The less labelled data we have, the more important transfer learning becomes.