How to use Wikipedia’s full dump as a corpus for text classification with NLTK
26.3.2013 | 1 minute reading time
Wikipedia is not only a never-ending rabbit hole of information: you start with an article on a topic you want to know about and, hours later, end up on an article that has nothing to do with what you originally looked up, having done nothing but click your way from one article to the next.
But from a different perspective, Wikipedia is probably the biggest crowd-sourced information platform, with a built-in review process and as many languages as its users care to maintain (even if, together with Google, it has almost completely ousted printed encyclopaedias). So if this is not Big Data, then what is (pardon my sarcasm)?
And here is the part that matters most for this tiny post: Wikipedia comes with a more or less consistently maintained categorisation. Categories plus the article text are exactly what you need for text classification in natural language processing (NLP): the categories serve as classes, the articles as training documents. So I just thought: why not use Wikipedia for text classification? I ended up implementing an NLP corpus based on Wikipedia’s full article dump, using groups of categories as classes and anti-classes. It can be used for whatever text you want to classify, as long as you follow Wikipedia’s terms of use and accept its categorisation and article quality. If you don’t, then, well, contribute and improve the quality like others do.
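To give a rough idea of what "categories as classes" means in practice, here is a minimal sketch using plain NLTK. It is not the actual wpcorpus API (see the GitHub link below for that); it assumes a hypothetical directory of plain-text articles extracted from the dump, grouped into one subdirectory per category group, and trains a simple Naive Bayes classifier on top of it.

```python
import random

from nltk import FreqDist, NaiveBayesClassifier, classify
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical layout: articles extracted from the Wikipedia dump as plain
# text, one subdirectory per category group, e.g. corpus/sports/12345.txt
corpus = CategorizedPlaintextCorpusReader(
    'corpus',                # hypothetical root directory
    r'.*\.txt',              # file pattern
    cat_pattern=r'([^/]+)/'  # category = first path component
)

# Use the most frequent words in the corpus as simple binary features.
all_words = FreqDist(w.lower() for w in corpus.words())
top_words = [w for w, _ in all_words.most_common(2000)]

def features(words):
    present = set(w.lower() for w in words)
    return {w: (w in present) for w in top_words}

# One (features, category) pair per article, then a simple train/test split.
labeled = [(features(corpus.words(fid)), cat)
           for cat in corpus.categories()
           for fid in corpus.fileids(categories=cat)]
random.shuffle(labeled)
split = int(len(labeled) * 0.8)
train_set, test_set = labeled[:split], labeled[split:]

classifier = NaiveBayesClassifier.train(train_set)
print(classify.accuracy(classifier, test_set))
```

The directory name, the 2,000-word feature cutoff and the 80/20 split are arbitrary choices for illustration; with a full dump you would of course want a smarter feature extraction and a streaming approach instead of loading everything into memory.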
The whole code, including step-by-step usage instructions, is out on GitHub: https://github.com/pavlobaron/wpcorpus . Any constructive feedback and help are welcome.