This post shows how to extract information from text documents with the high-level deep learning library Keras: we build, train and evaluate a bidirectional LSTM model by hand for a custom named entity recognition (NER) task on legal texts.
In a previous post, we solved the same NER task on the command line with the NLP library spaCy. The present approach requires some work and knowledge, but yields a much more flexible solution which we can tune, scale and modify to our needs.
The NER dataset and task
We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider. It consists of decisions from several German federal courts, annotated with named entities that refer to legal norms, court decisions, legal literature and other categories.
The task will be to build, train and evaluate a model that, given sample sentences, annotates each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature and so on.
NER with bi-LSTM for dummies
We implement a standard deep-learning architecture for NER — a bi-directional recurrent neural network — which works as follows:
- Each sentence is split into a sequence of tokens, and each token is represented by a word vector. These word vectors or embeddings are usually pre-trained on a huge corpus of documents so that they encode semantic information. We thus bring general language proficiency to our specific task, a technique known as transfer learning. Common methods for pre-training are word2vec, GloVe or fastText; we use the word vectors provided by spaCy (see the short sketch after this list).
- The model processes the input sequence step by step and maintains an internal memory along the way,
- reading the corresponding input vector,
- combining this input with the internal memory,
- producing an output vector and
- updating the internal memory
at each step. This magic is carried out by a long short-term memory (LSTM) cell. As a result, we obtain an output sequence of the same length as the input sequence, and an internal memory state.
- Going backwards, the model reads the input again and produces a second output sequence.
- At each position, the outputs of steps 2 and 3 are combined and fed into a classifier which outputs, for the input word at this position, the probability that it should be annotated with the first tag, the second tag, and so on.
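To make the embedding step of the first bullet point concrete, here is a minimal sketch of how a single token is looked up in spaCy's pre-trained vector table; the token is an arbitrary example, and the model is the one we install below:
import spacy

# load the medium German model, which ships with pre-trained word vectors
nlp = spacy.load('de_core_news_md')

# look up the vector of a single, arbitrary token
vector = nlp.vocab.get_vector('Bundesgerichtshof')
print(vector.shape)  # dimensionality of the pre-trained embeddings, e.g. (300,)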
To improve performance, one can replace the last feed-forward layer with a conditional random field (CRF). The resulting architecture is called a bi-LSTM-CRF model.
Setting up the environment
First, set up a virtual environment as described in the preceding blog post, and install the required dependencies:
mkdir keras_ner_project
cd keras_ner_project
python3 -m venv .venv
source .venv/bin/activate
pip install spacy
python -m spacy download de_core_news_md
pip install tensorflow
Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.
Next, download the data as in the preceding blog post (in case you are inside a Jupyter notebook, put an exclamation mark ! in front of each command to have it executed by the shell):
mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw
Step 1: Preprocessing for NER
The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line as follows:
an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
We read such a data file line-by-line and store the sentences as lists of token-tag pairs:
def load_data(filename: str):
    with open(filename, 'r') as file:
        lines = [line[:-1].split() for line in file]
    samples, start = [], 0
    for end, parts in enumerate(lines):
        # a blank line marks the end of a sentence
        if not parts:
            # strip the B-/I- prefix from the tags
            sample = [(token, tag.split('-')[-1]) for token, tag in lines[start:end]]
            samples.append(sample)
            start = end + 1
    # handle a possible last sentence that is not followed by a blank line
    if start < end:
        sample = [(token, tag.split('-')[-1]) for token, tag in lines[start:]]
        samples.append(sample)
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
all_samples = train_samples + val_samples
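As a quick check of the result, a sentence like the example above is stored as a list of (token, tag) pairs with the B-/I- prefixes stripped; the exact content of the first sample depends on the file, so the output shown here is only illustrative:
print(train_samples[0][:4])
# e.g. [('an', 'O'), ('Kapitalgesellschaften', 'O'), ('(', 'O'), ('§', 'GS')]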
For simplicity, we’ll truncate the sentences to a maximum length and pad shorter input sequences. But first, let us determine the set of all tags in the data and add an extra tag for the padding:
schema = ['_'] + sorted({tag for sentence in all_samples for _, tag in sentence})
Next, we represent each token by a word vector, using a pre-trained German language model of the NLP library spaCy:
import spacy
import numpy as np

nlp = spacy.load('de_core_news_md')
EMB_DIM = nlp.vocab.vectors_length
MAX_LEN = 50

def preprocess(samples):
    tag_index = {tag: index for index, tag in enumerate(schema)}
    X = np.zeros((len(samples), MAX_LEN, EMB_DIM), dtype=np.float32)
    y = np.zeros((len(samples), MAX_LEN), dtype=np.uint8)
    vocab = nlp.vocab
    for i, sentence in enumerate(samples):
        for j, (token, tag) in enumerate(sentence[:MAX_LEN]):
            X[i, j] = vocab.get_vector(token)
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
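As a small sanity check, we can inspect the shapes of the resulting tensors; the exact numbers depend on the data and the spaCy model:
print(X_train.shape)  # (number of training sentences, MAX_LEN, EMB_DIM)
print(y_train.shape)  # (number of training sentences, MAX_LEN)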
Now we have the data ready for NER and can assemble our model!
Step 2: Build the bi-LSTM model
With the wide range of layers offered by Keras, we can construct a bi-directional LSTM model as a sequence of two compound layers:
- The bidirectional LSTM layer encapsulates a forward- and a backward-pass of an LSTM layer, followed by the stacking of the sequences returned by both passes.
- The second layer applies a dense classification layer to every position of the stacked sequences. Here, the softmax activation function scales the output so that we obtain sequences of probability distributions:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_model(nr_filters=256):
    input_shape = (MAX_LEN, EMB_DIM)
    lstm = LSTM(nr_filters, return_sequences=True)
    bi_lstm = Bidirectional(lstm, input_shape=input_shape)
    tag_classifier = Dense(len(schema), activation='softmax')
    sequence_labeller = TimeDistributed(tag_classifier)
    return Sequential([bi_lstm, sequence_labeller])

model = build_model()
For more complex architectures involving multiple inputs or outputs, residual connections or the like, Keras offers a more flexible functional API. With this, we can create directed acyclic graphs of tensors connected by applications of layers, and specify a model in terms of its input and output tensors.
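As an illustration only, here is a minimal sketch of the same bi-LSTM model expressed with the functional API; build_model_functional is a hypothetical name, not part of the code above:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_model_functional(nr_filters=256):
    # input: a sequence of MAX_LEN word vectors of dimension EMB_DIM
    inputs = Input(shape=(MAX_LEN, EMB_DIM))
    # bidirectional LSTM returning one output vector per position
    hidden = Bidirectional(LSTM(nr_filters, return_sequences=True))(inputs)
    # position-wise classification over the tags of our schema
    outputs = TimeDistributed(Dense(len(schema), activation='softmax'))(hidden)
    return Model(inputs=inputs, outputs=outputs)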
Step 3: Train the model
To train a model means to optimize its weights or parameters on data so that the model’s predictions approximate the truth. For Keras to perform this optimization, we need to specify
- how to measure the distance of the prediction to the truth, that is, a loss function,
- the optimization strategy, which is a variant of batch-wise gradient descent.
Additionally, we can specify metrics to monitor the training progress. Once this has been done using the compile method, we can call the fit method for training:
def train(model, epochs=10, batch_size=32):
    model.compile(optimizer='Adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=epochs,
                        batch_size=batch_size)
    return history.history

history = train(model)
Keras provides implementations of all the standard optimizers, loss functions and metrics, and also allows us to supply our own.
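For instance, instead of the string shortcuts used above, we could pass configured objects to compile. A small sketch, assuming we want to set the learning rate explicitly:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# equivalent to the string shortcuts above, but with an explicit learning rate
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss=SparseCategoricalCrossentropy(),
              metrics=['accuracy'])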
The training history contains the losses and metrics achieved on the training and validation data after each epoch. Here, I got the following result:
Note the scale on the y-axis, but don’t get excited by accuracies of 99%: almost all tokens are labelled with the trivial tag O, and hence accuracy does not tell us much about the detection of the non-trivial tags.
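A quick way to see this imbalance is to count the tags in the loaded samples; a minimal sketch:
from collections import Counter

# count how often each tag occurs across all samples
tag_counts = Counter(tag for sentence in all_samples for _, tag in sentence)
print(tag_counts.most_common(3))  # the trivial tag 'O' dominates by a wide margin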
Step 4: Evaluate the model
To assess the performance of the model, we apply it to the preprocessed validation data and obtain a tensor of shape (len(val_samples), MAX_LEN, len(schema)). This tensor contains, for each sample sentence and each token in this sentence, a predicted probability distribution over the tags. We choose the tag with the highest probability and return, for each sentence and each token, the true and the predicted tag:
def predict(model):
    y_probs = model.predict(X_val)
    y_pred = np.argmax(y_probs, axis=-1)
    return [
        [(token, tag, schema[index]) for (token, tag), index in zip(sentence, tag_pred)]
        for sentence, tag_pred in zip(val_samples, y_pred)
    ]

predictions = predict(model)
Finally, we compute precision, recall and f1-score on the level of tag categories using scikit-learn’s classification_report:
import pandas as pd
from sklearn.metrics import classification_report

def evaluate(predictions):
    y_t = [pos[1] for sentence in predictions for pos in sentence]
    y_p = [pos[2] for sentence in predictions for pos in sentence]
    report = classification_report(y_t, y_p, output_dict=True)
    return pd.DataFrame.from_dict(report).transpose().reset_index()

evaluate(predictions)
Training a model with 1024 filters for 10 epochs, we reach the following scores:
| tag | f1-score (%) | precision (%) | recall (%) | support |
|-----|--------------|---------------|------------|---------|
| EUN | 56.9 | 67.0 | 49.5 | 398 |
| GRT | 65.9 | 91.0 | 51.6 | 643 |
| GS  | 94.5 | 96.1 | 92.9 | 6774 |
| INN | 41.3 | 88.9 | 26.9 | 119 |
| LD  | 74.0 | 67.0 | 82.6 | 86 |
| LDS | 0.0 | 0.0 | 0.0 | 9 |
| LIT | 79.5 | 74.3 | 85.4 | 1681 |
| MRK | 0.0 | 0.0 | 0.0 | 49 |
| ORG | 25.3 | 32.4 | 20.8 | 159 |
| PER | 0.0 | 0.0 | 0.0 | 473 |
| RR  | 92.0 | 94.4 | 89.8 | 560 |
| RS  | 90.7 | 97.1 | 85.0 | 8380 |
| ST  | 71.9 | 93.9 | 58.2 | 79 |
| STR | 0.0 | 0.0 | 0.0 | 35 |
| UN  | 32.7 | 64.9 | 21.8 | 110 |
| VO  | 2.2 | 4.0 | 1.5 | 66 |
| VS  | 0.0 | 0.0 | 0.0 | 10 |
| VT  | 18.0 | 11.7 | 38.9 | 144 |
Let’s see how this compares to the results achieved with spaCy:
It seems that our hand-built NER model does very well! But beware that these experiments do not crown a winner: neither of the two approaches has been optimized, and we compared neither training time nor compute resources. The main differentiating factors are that
- spaCy can be used out of the box with no understanding of deep learning, while
- the approach presented here is much more flexible and tuneable (see below).
What next?
With the deep learning library Keras, building and training our custom NER model took just a few lines of code, but setting up the data and the training required much more understanding than the command-line approach with spaCy.
To improve performance, we could try to tune the model and
- increase the number of filters, that is, the size of the LSTM cell,
- stack several bidirectional layers on top of each other (see the sketch after this list),
- replace the time-distributed classification layer with a conditional random field (CRF) model or
- address the imbalance of the tag distribution with a focal loss instead of categorical cross-entropy.
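For example, stacking a second bidirectional layer takes only one extra line. The following build_stacked_model is a hypothetical variant for illustration, not a model we trained here:
def build_stacked_model(nr_filters=256):
    input_shape = (MAX_LEN, EMB_DIM)
    return Sequential([
        # first bidirectional LSTM, returning a full output sequence
        Bidirectional(LSTM(nr_filters, return_sequences=True), input_shape=input_shape),
        # second bidirectional LSTM stacked on top of the first
        Bidirectional(LSTM(nr_filters, return_sequences=True)),
        # position-wise classification as before
        TimeDistributed(Dense(len(schema), activation='softmax')),
    ])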
But to achieve a significant boost, we need to provide our model with more input by
- labeling more task-specific training data or
- bringing more task-independent language proficiency to bear on our task.
In the next blog post, we will fine-tune a pre-trained NLP transformer model for our NER task and get state-of-the-art performance.
Stay tuned!