Topic Modeling of the codecentric Blog Articles

3.1.2017 | 15 minutes reading time

Most big data is unstructured. When an organization wants to leverage its own data or external information from social media in order to make better business decisions, one challenge is to retrieve the important information from unstructured text documents written in natural language. The main goal of natural language processing (NLP) techniques is to turn text into structured data that can be used for further analysis.

Probabilistic topic models, which seek to discover common topics in a collection of documents, are a particular example of NLP. Unsupervised machine learning algorithms have been developed to find such topics, which can be used for organizing and managing the collection of documents. Topic models make it possible to address interesting data science questions concerning, for instance, recommendations (“What articles are most relevant for a certain topic?”) and clustering (“What are newly published articles discussing, and how similar are two articles?”). The derived topics can also be viewed as a dimensionality reduction and used as features for subsequent machine learning tasks (feature engineering).

In this article, we present the results of applying topic modeling to the codecentric blog. The topics are used to analyze the blog content and how it changes over time. Of course, one could argue that authors usually assign their blog posts to a category and might add tags that hint at the content. However, when no such labels are available in a very large collection of documents, or when one wants a more objective clustering, topic modeling is an appropriate tool.

We perform the analysis using Apache Spark with its Python API in a Jupyter Notebook, which you may download here. Spark allows us to build a scalable machine learning (ML) pipeline containing latent Dirichlet allocation (LDA) topic modeling from its machine learning library (MLlib). A small Spark cluster can be set up easily, as described in this post. Another advantage of using Spark is that prototypes of a data product can be easily transferred to a production environment.

This post is organized in five sections:

LDA Topic Model
Data Preprocessing
Model Training and Evaluation
Results
Summary and Conclusion

The first three rather technical sections describe some theoretical concepts of LDA topic modeling as well as the implementation of data preprocessing and model training. Some readers might want to directly jump to the Results section.

LDA Topic Model
In natural language processing, a probabilistic topic model describes the semantic structure of a collection of documents, the so-called corpus. Latent Dirichlet allocation (LDA) is one of the most popular and successful models for discovering common topics as a hidden structure of the collection. According to the LDA model, text documents are represented by mixtures of topics: a document concerns one or several topics in different proportions. A topic can be viewed as a cluster of similar words. More formally, the model assumes each topic to be characterized by a distribution over a fixed vocabulary, and each text document to be generated by a distribution over topics.
The basic assumption of LDA is that the documents have been generated in a two-step random process rather than having been written by a human. The most important model parameter is the number of topics k, which has to be chosen in advance. The generative process for a document consisting of N words is as follows. In the first step, the document's mixture of topics is drawn from a Dirichlet distribution over the k topics. In the second step, a topic is randomly chosen from this mixture, and the chosen topic then generates a word from its distribution over the vocabulary. The second step is repeated for each of the N words of the document. Note that LDA is a bag-of-words model: the order of the words in the text and the order of the documents in the collection are neglected.
When starting with a collection of documents and considering the reverse direction of the generative process, LDA topic modeling is the method to infer which topics might have generated the collection. Further details about LDA can be found in the original paper by Blei et al. or in this nice review about probabilistic topic models. Probabilistic topic models comprise a suite of algorithms that estimate the topic distributions from a corpus of text documents approximately, because no exact solution for these distributions is available.
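
The two-step generative process described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy vocabulary, the two topic-word distributions, and the symmetric Dirichlet parameter `alpha` are made up for the example, and the helper names are hypothetical rather than part of Spark or of the notebook.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, topic_word_dists, vocabulary, alpha, rng):
    """Generate a bag of n_words words following the two-step LDA process."""
    k = len(topic_word_dists)
    # Step 1: draw the document's mixture of the k topics.
    theta = sample_dirichlet([alpha] * k, rng)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic according to the mixture.
        topic = rng.choices(range(k), weights=theta)[0]
        # Step 2b: let the chosen topic emit a word from its distribution.
        word = rng.choices(vocabulary, weights=topic_word_dists[topic])[0]
        words.append(word)
    return theta, words


# Hypothetical toy setup: a five-word vocabulary and k = 2 topics.
vocabulary = ["test", "team", "scrum", "server", "agil"]
topic_word_dists = [
    [0.6, 0.0, 0.0, 0.4, 0.0],  # a "testing"-like topic
    [0.0, 0.4, 0.3, 0.0, 0.3],  # an "agility"-like topic
]
rng = random.Random(1)
theta, words = generate_document(8, topic_word_dists, vocabulary, 0.5, rng)
print(theta, words)
```

Training an LDA model corresponds to the reverse task: given only the generated words of many documents, estimate the topic mixtures and the topic-word distributions.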

Data Preprocessing

We follow a typical workflow of data preparation for natural language processing (NLP). Textual data is transformed into numerical feature vectors required as input for the LDA machine learning algorithm. A similar approach is described in a recent blog post about spam detection.

A MySQL table of the blog posts is loaded into a Spark DataFrame using JDBC; an additional Spark submit argument contains the MySQL Connector jar file.

# read from the MySQL table; keep only published posts, sorted by date
# (`spark` is the active SparkSession)
df_posts = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost/ccblog")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "wp_2_posts")
    .option("user", "*****")
    .option("password", "**********")
    .load()
    .filter("post_type == 'post'")
    .filter("post_status == 'publish'")
    .sort("post_date"))

From the post content, we first have to extract the text, which is decorated with various HTML tags. A beautiful Python library for this purpose is BeautifulSoup. An example of the raw text is shown in the notebook for the first entry of the post content. The text extracted from the HTML is then normalized by removing numbers, punctuation, and other special characters and by converting it to lowercase. A so-called tokenizer splits the sentences into words (tokens) that are separated by whitespace. These operations on the Spark DataFrame columns are performed via Spark’s user-defined functions (UDFs).

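import re
from bs4 import BeautifulSoup
from pyspark.ml.feature import RegexTokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# strip the HTML markup and keep only the text
extractText = udf(
    lambda d: BeautifulSoup(d, "lxml").get_text(strip=False), StringType())
# replace all characters other than letters and digits by spaces, then trim and lower-case
removePunct = udf(
    lambda s: re.sub(r'[^a-zA-Z0-9]', r' ', s).strip().lower(), StringType())

# normalize the post content (remove HTML tags and punctuation, convert to lower case)
df_posts_norm = df_posts.withColumn("text", removePunct(extractText(df_posts.post_content)))

# break the text into words (tokens) of at least two characters
tokenizer = RegexTokenizer(inputCol="text", outputCol="words",
                           gaps=True, pattern=r'\s+', minTokenLength=2)
df_tokens = tokenizer.transform(df_posts_norm)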

The RegexTokenizer is an example of a Spark transformer. Inspired by the concept of scikit-learn, transformers and estimators can be connected to a pipeline, i.e., a machine learning workflow comprising the various stages of preprocessing, feature generation, and model training and evaluation.
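
To see what the normalization and tokenization steps do to a single string, they can be reproduced without Spark. The following is a minimal sketch in plain Python that mirrors the regular expression and the minimum token length used above; the HTML extraction is left out, and the function names and the sample sentence are hypothetical.

```python
import re


def normalize(text):
    """Replace non-alphanumeric characters by spaces, trim, and lower-case."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text).strip().lower()


def tokenize(text, min_token_length=2):
    """Split on whitespace and drop tokens shorter than min_token_length."""
    return [t for t in re.split(r"\s+", text) if len(t) >= min_token_length]


sample = "Spark's LDA: a scalable topic model!"
tokens = tokenize(normalize(sample))
print(tokens)  # ['spark', 'lda', 'scalable', 'topic', 'model']
```

The single-letter fragments "s" and "a" are dropped by the minimum token length of two characters.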

Language identification

We only want to analyze English blog posts and have to identify the language ourselves, since no such tag is available in our data set. A simple classification into English or German as the primary language is achieved by comparing the fractions of stop words in the text. Stop words are the most common words of a given language, such as “a”, “of”, “the”, and “and” in English. Lists of stop words for different languages are provided by NLTK. The fraction of English stop words in a given article is obtained by counting the number of English stop words that appear at least once in the text and dividing by the total number of stop words in the list. Similarly, we calculate the fraction of German stop words and assign the article to the language with the larger of the two fractions.

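from nltk.corpus import stopwords
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

englishSW = set(stopwords.words('english'))
germanSW = set(stopwords.words('german'))

nEngSW = len(englishSW)
nGerSW = len(germanSW)

# fraction of the stop word list that occurs in the document;
# float() avoids integer division under Python 2, and DoubleType makes the
# ratios numeric columns (a UDF without a return type yields strings)
RatioEng = udf(lambda l: len(set(l).intersection(englishSW)) / float(nEngSW), DoubleType())
RatioGer = udf(lambda l: len(set(l).intersection(germanSW)) / float(nGerSW), DoubleType())

# keep only documents whose English ratio exceeds the German ratio
df_tokens_en = (df_tokens.withColumn("ratio_en", RatioEng(df_tokens['words']))
                         .withColumn("ratio_ge", RatioGer(df_tokens['words']))
                         .withColumn("Eng", col('ratio_en') > col('ratio_ge'))
                         .filter('Eng'))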
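
The stop word heuristic itself is independent of Spark. As a minimal sketch, assuming two heavily shortened, hypothetical stop word lists in place of the NLTK lists, the decision rule looks as follows:

```python
# Hypothetical, heavily shortened stop word lists; the article uses the NLTK lists.
ENGLISH_SW = {"a", "of", "the", "and", "is", "in", "to", "with"}
GERMAN_SW = {"der", "die", "das", "und", "ist", "in", "zu", "mit"}


def stop_word_fraction(tokens, stop_words):
    """Fraction of the stop word list that occurs at least once in tokens."""
    return len(set(tokens) & stop_words) / len(stop_words)


def is_english(tokens):
    """True if the English fraction exceeds the German fraction."""
    return stop_word_fraction(tokens, ENGLISH_SW) > stop_word_fraction(tokens, GERMAN_SW)


english_doc = "the model is trained with the articles of the blog".split()
german_doc = "das modell ist mit den artikeln und der analyse trainiert".split()
print(is_english(english_doc), is_english(german_doc))  # True False
```

Because each fraction is normalized by the length of its own list, lists of different sizes can be compared directly.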

Filtering out stop words and stemming

The last preprocessing steps are filtering out the English stop words, as these common words presumably do not help in identifying meaningful topics, and stemming the words such that, for instance, “test”, “tests”, “tested”, and “testing” are all reduced to their word stem “test”. The list of stop words is extended by moreStopWords, which we collect manually as follows. After having trained an LDA model, we inspect the topics and identify additional stop words, which are filtered out for the subsequent model training. This procedure is repeated as long as stop words appear in the lists of top words.

from nltk.stem.snowball import SnowballStemmer
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# moreStopWords: the manually collected list of additional stop words described above
swRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
swRemover.setStopWords(swRemover.getStopWords() + moreStopWords)

df_finalTokens = swRemover.transform(df_tokens_en)

# stemming: reduce every token to its word stem
stemmer = SnowballStemmer("english", ignore_stopwords=False)
udfStemmer = udf(lambda l: [stemmer.stem(s) for s in l], ArrayType(StringType()))

df_finalTokens = df_finalTokens.withColumn("filteredStemmed",
                                           udfStemmer(df_finalTokens["filtered"]))

Feature generation

The feature vectors are then generated following a simple bag-of-words approach using Spark’s CountVectorizer. Each document is represented as a vector of counts whose length equals the size of the vocabulary, which we limit to 2500 words. The CountVectorizer is an estimator that generates a model from which the tokenized documents are transformed into count vectors. A word has to appear in at least two different documents (minDF) and at least four times within a document (minTF) to be taken into account.

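from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="filteredStemmed", outputCol="features",
                     vocabSize=2500, minDF=2, minTF=4)

# fit the vocabulary on the stemmed tokens
cvModel = cv.fit(df_finalTokens)

# transform the documents into count vectors and cache them for model training
countVectors = (cvModel
                .transform(df_finalTokens)
                .select("ID", "features").cache())

cvModel.save("path/to/model/file")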
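
To illustrate how the three parameters interact, here is a simplified bag-of-words sketch in plain Python. It is not Spark's implementation: the function names are hypothetical, the tie-breaking rule is an assumption, and the toy corpus uses much smaller thresholds than the values above.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Select the vocab_size most frequent terms that occur in at least min_df documents."""
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    candidates = [t for t in term_counts if doc_counts[t] >= min_df]
    # Sort by corpus frequency (descending); ties are broken alphabetically.
    candidates.sort(key=lambda t: (-term_counts[t], t))
    return candidates[:vocab_size]


def to_count_vector(tokens, vocabulary, min_tf):
    """Count vocabulary terms in one document, ignoring counts below min_tf."""
    counts = Counter(tokens)
    return [counts[t] if counts[t] >= min_tf else 0 for t in vocabulary]


docs = [
    ["test", "test", "spring", "test"],
    ["spring", "agil", "spring"],
    ["test", "jvm"],
]
vocabulary = fit_vocabulary(docs, vocab_size=2, min_df=2)
vectors = [to_count_vector(d, vocabulary, min_tf=2) for d in docs]
print(vocabulary)  # ['test', 'spring']
print(vectors)     # [[3, 0], [0, 2], [0, 0]]
```

In this toy corpus, "agil" and "jvm" are excluded from the vocabulary because each appears in only one document, and a single occurrence of a vocabulary word within a document is ignored because it falls below the minimum term frequency.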

Model Training and Evaluation

The Spark implementation of LDA allows online variational inference as a method for learning the model. Data is incrementally processed in small batches, which allows scaling to very large data sets that might even arrive in a streaming fashion.

from pyspark.ml.clustering import LDA

# hold out 10 percent of the documents for evaluation (seed = 1)
df_training, df_testing = countVectors.randomSplit([0.9, 0.1], 1)

numTopics = 20  # number of topics

lda = LDA(k=numTopics, seed=1, optimizer="online", optimizeDocConcentration=True,
          maxIter=50,             # number of iterations
          learningDecay=0.51,     # kappa, exponential decay rate of the learning rate
          learningOffset=64.0,    # tau_0, larger values downweight early iterations
          subsamplingRate=0.05,   # mini-batch fraction
          )

ldaModel = lda.fit(df_training)

# upper bound on the log perplexity of the held-out documents
lperplexity = ldaModel.logPerplexity(df_testing)

# `path` is the output location for the trained model
ldaModel.save(path)

In general, the data set is split into a training set and a testing set in order to evaluate the model performance via a measure such as the perplexity, i.e., a measure of how well the word counts of the test documents are represented by the topics’ word distributions. However, we find it more useful to evaluate the model manually by looking at the resulting topics and the corresponding distributions of words. A good result is obtained by training a 20-topic LDA model on the entire corpus of English codecentric blog articles. A more quantitative performance measure would allow hyper-parameter tuning; a grid search for the optimal parameters, such as the number of topics, is facilitated by Spark’s pipeline concept. The ML models are saved for later use.
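
For reference, the perplexity of a held-out set of M test documents is commonly written as in Blei et al. (2003). The notation is an assumption made here for illustration: w_d denotes the words of document d and N_d its number of words.

```latex
\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Lower values indicate a better fit. Since the likelihood p(w_d) cannot be evaluated exactly under LDA, implementations typically work with a variational bound; Spark's logPerplexity is documented as returning an upper bound on the log perplexity.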

Results
In the following, we present the results of a 20-topic model trained on the entire data set of English codecentric blog articles that were published up to and including November 2016. A visualization of the distribution of words for the two top topics is given by the word clouds in Fig.1 and Fig.2. The size of a word corresponds to its relative weight; words with a large weight are generated more often by the topic. With the top words and by inspecting some documents that discuss a given topic, it is often possible to manually assign a summarizing label to the topic. The topics that correspond to the word clouds in Fig.1 and Fig.2 are labeled “Agility” and “Testing”, respectively. Note that some words are reduced to a non-valid word stem like “stori” or “softwar”.
Figure 1. Word cloud of the topic labeled “Agility”.
Figure 2. Word cloud of the topic labeled “Testing”.
Labeling of topics and identifying top documents

The twelve most meaningful topics of our 20-topic model are listed in Tab.1. These topics are selected by hand, and “meaningful” is of course a rather subjective criterion. For instance, we exclude topics in which two very different themes appear. For each topic, we suggest a label that summarizes what the topic is about and provide the top words in the order of their probability of being generated. In order to identify the top document for a given topic, we order the documents by their probability of discussing that topic. The top document is defined as the document with the largest contribution from the given topic compared to all other documents.

Table 1. The twelve top topics of a 20-topic model trained on all English codecentric blog posts.
topic	label	top words	top document
0	Testing	test, file, application, server, project	Testing JavaScript on various platforms with Karma and SauceLabs, Ben Ripkens
1	DevOps	build, plugin, run, imag, maven	How to enter a Docker container, Alexander Berresch
2	Memory Management	java, gc, time, jvm, memory	Useful JVM Flags – Part 2 (Flag Categories and JIT Compiler Diagnostics), Patrick Peschlow
3	Data/Search	data, index, field, query, operator	Big Data – What to do with it? (Part 1 of 2), Jan Malcomess
4	Reactive Systems	state, node, system, cluster, data	A Map of Akka, Heiko Seeberger
5	Math	method, latex, value, point, parameter	The Machinery behind Machine Learning – Part 1, Stefan Kühn
6	Spring	spring, public, class, configure, batch	Boot your own infrastructure – Extending Spring Boot in five steps, Tobias Flohre
7	Frontend	module, type, grunt, html, import	Elm Friday: Imports (Part VIII), Bastian Krol
8	Database	mongodb, document, id, db, name	Spring Batch and MongoDB, Tobias Trelle
9	Functional Programming	function, name, var, node, call	Functional JavaScript using Lo-Dash, an underscore.js alternative, Ben Ripkens
10	Agility	team, develop, time, agile, work	What Agile Software Development has in common with Sailing, Thomas Jaspers
11	Mobile App	app, notif, object, return, null	New features in iOS 10 Notifications, Martin Berger

Next we determine the number of documents having the same main topic. Remember that a document usually concerns several topics in different proportions. The main topic of a document is defined as the topic with the largest probability.

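import numpy
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# index of the topic with the largest probability in a document's topic distribution
getMainTopicIdx = udf(lambda l: int(numpy.argmax([float(x) for x in l])), IntegerType())

# count the documents per main topic
countTopDocs = (ldaModel
                .transform(countVectors)
                .select(getMainTopicIdx("topicDistribution").alias("idxMainTopic"))
                .groupBy("idxMainTopic").count().sort("idxMainTopic"))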

For each document in our data set, we identify the topic index for which the probability is the largest, i.e., the main topic. Grouping by the topic index, counting, and sorting yields the number of documents per topic plotted in Fig.3. The most discussed topics in the entire collection of blog articles are topic 0 – “Testing”, topic 6 – “Spring”, and topic 10 – “Agility”.

Figure 3. For each topic, we count the number of documents that discuss the topic with the largest probability (main topic). Only the 12 most meaningful topics of the 20-topic model are shown.
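
The two definitions used in this section, the main topic of a document and the top document of a topic, read a matrix of topic probabilities along different directions: the main topic is the largest entry within a document, whereas the top document is the largest entry within a topic. A minimal sketch in plain Python, with made-up topic distributions and hypothetical function names:

```python
from collections import Counter

# Hypothetical topic distributions of four documents over three topics.
topic_distributions = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.10, 0.30, 0.60],
    "doc_c": [0.50, 0.40, 0.10],
    "doc_d": [0.05, 0.15, 0.80],
}


def main_topic(distribution):
    """Index of the topic with the largest probability in one document."""
    return max(range(len(distribution)), key=lambda i: distribution[i])


def top_document(distributions, topic):
    """Document with the largest contribution from the given topic."""
    return max(distributions, key=lambda doc: distributions[doc][topic])


docs_per_main_topic = Counter(main_topic(d) for d in topic_distributions.values())
top_docs = [top_document(topic_distributions, t) for t in range(3)]
print(dict(docs_per_main_topic))  # {0: 2, 2: 2}
print(top_docs)                   # ['doc_a', 'doc_c', 'doc_d']
```

Note that topic 1 is the main topic of no document in this toy data, yet it still has a top document.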

Evolution of blog content over time

How many blog articles were published on a specific topic during one year? This question is addressed in Fig.4, which shows for the top topics “Testing”, “Spring”, and “Agility” the number of documents that discuss the topic with the largest probability as a function of time. At first glance, it appears that “Agility” became less important after a hype in 2009, as seen by the red line in Fig.4. However, another explanation would be that in later years agile methodologies are no longer discussed exclusively as the main topic of a document but rather co-appear with other topics in smaller proportions. A growing number of articles are dedicated to the topics “Spring” and “Testing”, with some oscillations for the latter. It might also be interesting to look at the number of documents that discuss a given topic with a probability larger than some threshold value, rather than considering only the largest probability as in Fig.4. However, we do not go into detail here and only provide a glimpse of possible analyses.

Figure 4. Time evolution of the number of documents with the same main topic. Results are shown for the three top topics obtained from the LDA model trained on the entire data set.
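
The difference between counting by main topic and counting by a probability threshold can be made concrete with a small sketch in plain Python. The years, topic distributions, and the threshold of 0.2 are invented for illustration and do not stem from the blog data; the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (year, topic distribution) pairs for five documents and three topics.
documents = [
    (2009, [0.80, 0.10, 0.10]),
    (2009, [0.60, 0.30, 0.10]),
    (2015, [0.30, 0.50, 0.20]),
    (2015, [0.25, 0.15, 0.60]),
    (2016, [0.10, 0.20, 0.70]),
]


def count_main_topic_per_year(documents, topic):
    """Number of documents per year whose most probable topic is `topic`."""
    counts = Counter()
    for year, dist in documents:
        if max(range(len(dist)), key=lambda i: dist[i]) == topic:
            counts[year] += 1
    return dict(counts)


def count_above_threshold_per_year(documents, topic, threshold):
    """Number of documents per year that discuss `topic` with probability > threshold."""
    counts = Counter()
    for year, dist in documents:
        if dist[topic] > threshold:
            counts[year] += 1
    return dict(counts)


print(count_main_topic_per_year(documents, topic=0))                      # {2009: 2}
print(count_above_threshold_per_year(documents, topic=0, threshold=0.2))  # {2009: 2, 2015: 2}
```

In this toy data, topic 0 seems to vanish after 2009 when only main topics are counted, while the threshold count shows that it still co-appears in two documents from 2015.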

Evolution of topics over time
Another interesting question is at what time topics appear or disappear and how the words representing a topic change over time. For the results in Fig.4 only a single LDA model was trained on the entire data set. The resulting topic distribution is fixed and does not change over time. In order to study the evolution of topics over time in a systematic way, machine learning researchers have developed dynamic topic models.
Here, we take a simpler approach to investigating how the distribution of topics changes over time. Several different LDA models are trained, each on the blog articles of a specific year together with the articles from all previous years. Thus, we obtain topic distributions for the collections of documents published during the years 2008-2010, 2008-2011, …, 2008-2016. We then try to identify the same topics across the models, which might contain different words. In principle, this approach makes it possible to predict next year’s topics given all the articles from the previous years. Without going into details, we present as an example in Tab.2 the top ten words for the topic “Agility” from the different LDA models trained with data up to and including consecutive years.
Table 2. Top ten words for the topic “Agility” from different LDA models trained on blog articles up to and including the given year. From top to bottom, the words are ordered by decreasing probability of being generated.
2010	2011	2012	2013	2014	2015	2016
agil	team	team	agil	develop	develop	team
scrum	develop	agil	team	agil	work	develop
team	project	develop	role	team	agil	time
project	scrum	scrum	session	scrum	time	agil
develop	agil	product	product	work	team	work
manag	manag	softwar	develop	time	softwar	project
stori	time	project	manag	session	test	product
point	product	manag	peopl	product	problem	softwar
sprint	softwar	continu	plan	peopl	code	scrum
meet	test	stage	time	softwar	point	problem
As can be seen in Fig.5, the probability of top words to appear in a text about “Agility” changes over time. For instance, there is a slight decrease in the use of the words “agile” and “scrum” in the period from 2010 until 2016.
Figure 5. Time evolution of some words in the topic “Agility”. The weights of the words, shown as a function of time, correspond to the probability to appear in a document about agility.
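
The cumulative training sets used in this section, i.e., all articles from 2008 up to and including a given year, can be enumerated with a short sketch in plain Python. The article records and function names are hypothetical; in the actual analysis, one LDA model would be trained per window.

```python
def expanding_windows(first_year, first_end_year, last_year):
    """Year ranges that all start at first_year and end one year later each."""
    return [(first_year, end) for end in range(first_end_year, last_year + 1)]


def select_articles(articles, window):
    """Articles published within the window, both bounds included."""
    start, end = window
    return [a for a in articles if start <= a["year"] <= end]


# Hypothetical article records; only the publication year matters here.
articles = [
    {"id": 1, "year": 2008},
    {"id": 2, "year": 2010},
    {"id": 3, "year": 2013},
    {"id": 4, "year": 2016},
]

windows = expanding_windows(2008, 2010, 2016)
for window in windows:
    subset = select_articles(articles, window)
    print(window, [a["id"] for a in subset])
```

This yields the seven windows 2008-2010 through 2008-2016, one per column of Tab.2.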


The topic distribution of this article
In order to test the trained LDA topic model, we now predict the topics of the present article. We use the LDA model trained on the entire data set and apply it to the text of this article as it stood before this paragraph was written. The resulting topic distribution is depicted in Fig.6 as a pie chart. The two main topics, with about 20 percent, are “Functional Programming” and “Data/Search”, which is quite appropriate. All topics with a probability of less than 5 percent are collected in the “Other” part.
Figure 6. The topic distribution for this article predicted by the LDA model trained on the entire dataset of all English codecentric blog articles.

Summary and Conclusion
In this article, we analyze the content of the codecentric blog by means of Spark’s implementation of LDA topic modeling. The data preprocessing steps necessary for NLP are described. Training a 20-topic model on all blog posts allows us to identify a number of meaningful topics. Some exploratory investigations of the time evolution of the blog content and the topics are performed using different LDA models trained on articles up to a specified year. We thereby obtain hints on how topics and words have changed over time. In the last part, we successfully predict the topics of the present blog article.
In a follow-up post, it would be interesting to use the German blog posts and see whether the topics depend on the language. It might be worthwhile to compare LDA with, e.g., non-negative matrix factorization and more elaborate (dynamic) topic models, using different features such as tf-idf. Further insight into how topics tend to co-occur could be gained by modeling the connections between topics in a graph in order to study the relations between different topics. As a concluding remark, note that topic modeling is not restricted to text documents but can also be applied to other unstructured data such as images or video clips, e.g., video behavior mining, where visual features are interpreted as words.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022. 2003.

David M. Blei. “Probabilistic Topic Models.” Communications of the ACM 55.4: 77-84. 2012.

Matthew Hoffman, Francis R. Bach, and David M. Blei. “Online Learning for Latent Dirichlet Allocation.” Advances in Neural Information Processing Systems. 2010.

David M. Blei and John D. Lafferty. “Dynamic Topic Models.” In Proceedings of the 23rd International Conference on Machine Learning. ACM, 113-120. 2006.

Blog author

Matthias Radtke
