Recently, Matthias Radtke wrote a very nice blog post on Topic Modeling of the codecentric Blog Articles, in which he gives a comprehensive introduction to topic modeling. In this article, I show a real-world example of how we can use data science to gain insights from text data and social network analysis.
I am using publicly available Twitter data to characterize codecentric’s friends and followers for
- identifying the most “influential” followers and using text analysis tools like sentiment analysis to characterize their interests from their user descriptions
- performing Social Network Analysis on friends, followers and a subset of second degree connections to identify key players who will be able to pass on information to a wide reach of other users and
- combining this network analysis with topic modeling to identify meta-groups with similar interests.
Knowing the interests and social network positions of our followers allows us to identify key users who are likely to retweet posts that fall within their range of interests and who will reach a wide audience.
Twitter Mining
Via the Twitter REST API, anybody can access the Tweets, Timelines, Friends and Followers of users or hashtags. One drawback of the REST API is its rate limit of 15 requests per application per rate limit window (15 minutes). An alternative would be Twitter’s Streaming API, if you wanted to continuously stream data on specific users, topics or hashtags. Here, though, I want to look at a snapshot of codecentric’s Twitter followers to show some of the possibilities that analyzing this information holds.
On July 15th, codecentric had 449 friends (users who codecentric follows) and 2732 followers (users who follow codecentric); 261 of these users were both friends and followers.
We now have the following information about these friends and followers:
- user name
- user screen name
- user description (the short introduction that each user can write about themselves)
- number of tweets per user
- number of followers per user
- number of friends per user
- date of account creation
- account location
- account language
- etc.
This data can tell us a lot about who is interested in codecentric and what we do. We can e.g. start with a simple exploratory data analysis and look at what languages the accounts are set to – no need for fancy models (just yet)!
As we can see, the vast majority of friends and followers have English and German account settings. The insight derived from this is that tweeting in both German and English will find an audience among our followers (even though English would probably be more inclusive, assuming that most, if not all, German followers will also be able to understand English tweets).
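As a rough illustration of this kind of exploratory count (not the code behind this post, which was written in R), here is a minimal Python sketch. The list of language codes is made-up sample data and `language_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample of account language settings; not the real follower data.
account_languages = ["en", "de", "en", "en", "de", "fr", "en", "de", "es", "en"]

def language_shares(languages):
    """Return (language, count, share) tuples sorted by descending count."""
    counts = Counter(languages)
    total = sum(counts.values())
    return [(lang, n, n / total) for lang, n in counts.most_common()]

for lang, n, share in language_shares(account_languages):
    print(f"{lang}: {n} accounts ({share:.0%})")
```

A plain frequency table like this is often all that is needed before reaching for a model.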
Who are codecentric’s most influential followers and what are they interested in?
We can also try to identify our most influential followers. These would be followers with a big network (i.e. who have many followers) and who also tweet/re-tweet a lot. If we capture these followers’ interests with one of our tweets, they a) are more likely to re-tweet and b) will reach a bigger audience by doing so!
Correlation between follower count and the average number of tweets per day of codecentric’s Twitter followers.
The plot above shows the correlation between the number of followers codecentric’s followers have and how often they tweet.
Now that we know who our most influential followers are, we can analyze their short descriptions about themselves to find out what they are interested in. By proxy, this will give us an idea about which kind of tweets are most likely to capture their interest. Of course, this is not to say that these are the only people who (should) matter and that tweets should be tailored towards these interests only! Covering a wide range of topics makes for an interesting and authentic profile but since “knowledge is power”, it can be extremely valuable to know which tweets/posts are likely to increase visibility!
In order to extract information from the descriptions of the most influential followers (defined as the top 100 followers based on a score of follower count * average tweets per day), I am making use of text analysis and natural language processing tools.
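The score described above (follower count * average tweets per day) can be sketched in a few lines of Python. This is an illustration only: the field names, the sample users and the reference date are assumptions, and the original analysis was done in R.

```python
from datetime import date, timedelta

def influence_score(followers, tweets, created, as_of):
    """Follower count times average tweets per day since account creation."""
    age_days = max((as_of - created).days, 1)  # guard against brand-new accounts
    return followers * (tweets / age_days)

def top_influencers(users, as_of, n=100):
    """Return the screen names of the n highest-scoring users."""
    ranked = sorted(
        users,
        key=lambda u: influence_score(u["followers"], u["tweets"], u["created"], as_of),
        reverse=True,
    )
    return [u["screen_name"] for u in ranked[:n]]

# Hypothetical sample data, not real accounts; the reference date is arbitrary.
as_of = date(2020, 1, 1)
users = [
    {"screen_name": "alice", "followers": 1000, "tweets": 500,
     "created": as_of - timedelta(days=100)},
    {"screen_name": "bob", "followers": 50000, "tweets": 10,
     "created": as_of - timedelta(days=1000)},
    {"screen_name": "carol", "followers": 200, "tweets": 4000,
     "created": as_of - timedelta(days=200)},
]

print(top_influencers(users, as_of, n=2))
```

Note how "bob" drops out despite having by far the most followers: multiplying by tweet frequency penalizes large but inactive accounts.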
To prepare the data, I split the user descriptions into words, convert each word to its word stem and remove stop words.
We can now identify the most common words in these descriptions.
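The preparation steps (tokenizing, removing stop words, stemming, counting) might look roughly like this in Python. Everything here is a simplified sketch: the stop word list is a tiny hand-picked sample, and `crude_stem` is a toy suffix stripper standing in for a real stemming algorithm, not the stemmer used for this post.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real lists contain hundreds of words.
STOP_WORDS = {"a", "an", "and", "at", "for", "i", "in", "of", "the", "to", "with"}

def crude_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_counts(descriptions):
    """Count stemmed, non-stop-word tokens across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(crude_stem(w) for w in tokenize(text) if w not in STOP_WORDS)
    return counts

# Hypothetical user descriptions.
descriptions = [
    "Software developer and agile coach",
    "Developing software for IoT devices",
    "Data science at scale",
]

print(word_counts(descriptions).most_common(3))
```

Stemming is what lets "developer" and "developing" count as the same word.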
Not surprisingly, software development, agile and business are among the most common words. But also IoT, data and science occur frequently in our influential followers’ descriptions!
Instead of looking for the most common words, we can also look for the most common word pairs (bigrams).
This graph shows the most common word pairs in our influential followers’ descriptions (arrow colors represent how often the pair occurs). Because we are looking at a relatively small set of followers, none of the word pairs occur exceptionally often. Still, data science is the most common word pair!
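Counting word pairs works much like counting single words. The following Python sketch is an illustration with made-up descriptions; `bigram_counts` is a hypothetical helper, and dropping pairs that contain a stop word is one common convention, assumed here rather than taken from the original analysis.

```python
import re
from collections import Counter

def bigram_counts(descriptions, stop_words=frozenset()):
    """Count adjacent word pairs, skipping pairs that contain a stop word."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for first, second in zip(words, words[1:]):
            if first not in stop_words and second not in stop_words:
                counts[(first, second)] += 1
    return counts

# Hypothetical user descriptions.
descriptions = [
    "Data science and machine learning",
    "Passionate about data science",
    "Software development, data science",
]

pairs = bigram_counts(descriptions, stop_words={"and", "about"})
print(pairs.most_common(2))
```

Pairs are formed within one description only, so words from different users never end up in the same bigram.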
Sentiment analysis
Sentiment analysis describes a collection of natural language processing tools and resources that are used to identify subjective information in text, like positive or negative sentiment, joy, disgust, fear, anger, etc.
Here, we can also use bigram analysis to identify negated meanings, i.e. words preceded by “not”, “no”, etc. In sentiment analysis, the meanings of negated words can then be reversed.
This plot shows the overall sentiment in the user descriptions of the most influential followers. Based on Bing Liu’s sentiment lexicon, we can score how many positive and negative words were used in each follower’s description. Because this lexicon is only available for the English language, we can only get reliable scores for followers with an English description (68 out of 100 followers have an English language setting). As we can see, the majority of followers have predominantly positive descriptions.
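To make lexicon-based scoring with negation handling concrete, here is a minimal Python sketch. The three word sets are tiny hypothetical stand-ins (the real Bing Liu lexicon contains several thousand words), and flipping the polarity of a word directly preceded by a negation is a simplification.

```python
import re

# Tiny hypothetical lexicon; a stand-in for a full sentiment lexicon.
POSITIVE = {"love", "passionate", "great", "happy", "enthusiastic"}
NEGATIVE = {"hate", "bad", "boring", "angry"}
NEGATIONS = {"not", "no", "never", "without"}

def sentiment_score(text):
    """Positive minus negative word count; a preceding negation flips polarity."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, word in enumerate(words):
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if i > 0 and words[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

for description in [
    "Passionate developer, love great code",
    "Not happy with boring meetings",
    "Never boring",
]:
    print(description, "->", sentiment_score(description))
```

A score above zero marks a predominantly positive description, a score below zero a predominantly negative one.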
Social Network Analysis
Social networks describe interactions between people, e.g. Twitter friends and followers. The analysis of such networks makes use of graph theory.
Here, we can show codecentric’s Twitter followers and friends as a directed network: each node represents a user and edge arrows indicate who a user follows.
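A directed follower network can be stored as a simple edge list, where each edge points from a user to someone they follow. The Python sketch below is an illustration with invented screen names; `build_graph` is a hypothetical helper, not part of the original R code.

```python
from collections import defaultdict

def build_graph(edges):
    """Build two lookups from (follower, followed) pairs."""
    follows = defaultdict(set)      # user -> accounts the user follows (friends)
    followed_by = defaultdict(set)  # user -> accounts following the user (followers)
    for source, target in edges:
        follows[source].add(target)
        followed_by[target].add(source)
    return follows, followed_by

# Hypothetical edges: (who follows, who is followed).
edges = [
    ("codecentric", "alice"),
    ("codecentric", "bob"),
    ("alice", "codecentric"),
    ("carol", "codecentric"),
    ("bob", "carol"),
]

follows, followed_by = build_graph(edges)
friends = follows["codecentric"]
followers = followed_by["codecentric"]
print("friends:", sorted(friends))
print("followers:", sorted(followers))
print("both:", sorted(friends & followers))
```

The intersection of the two sets gives the users who are simultaneously friends and followers.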
Because of Twitter’s API rate limit, I have only mined the friends lists of 106 of codecentric’s friends. Still, this leaves us with a network of 39929 second degree connections!
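A quick back-of-the-envelope calculation shows why the rate limit bites. The Python sketch below assumes, purely for illustration, that fetching one user's friends list costs exactly one request; the real cost depends on the endpoint and on list length.

```python
import math

def windows_needed(n_requests, requests_per_window=15):
    """Number of rate limit windows required for n_requests."""
    return math.ceil(n_requests / requests_per_window)

def minimum_wait_minutes(n_requests, requests_per_window=15, window_minutes=15):
    """Minutes spent waiting for new windows (the first window starts immediately)."""
    return max(windows_needed(n_requests, requests_per_window) - 1, 0) * window_minutes

# Assumption: one request per friends list.
for n_users in (106, 449):
    print(n_users, "users:", windows_needed(n_users), "windows,",
          minimum_wait_minutes(n_users), "minutes of waiting")
```

Under that assumption, mining the friends lists of all 449 friends would already take more than seven hours of waiting.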
With graph theory we can calculate a number of metrics that allow us to identify key players in the network:
- centrality and node degree to find nodes with many adjacent edges (i.e. users who are highly connected)
- closeness to find central nodes (i.e. users that can spread information to many other users)
- transitivity or clustering coefficient, which measures the probability that adjacent nodes are connected
- PageRank or eigenvector centrality, which scores nodes according to their connections with high-degree nodes
- betweenness centrality, which measures how often a node lies on the shortest paths between other nodes, and diameter, which is the longest of all shortest paths in the network
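To give a feeling for how some of these metrics are computed, here is a small Python sketch of in-degree, closeness and PageRank on a toy network. It is not the code behind this post (graph libraries provide all of this ready-made). The function names and the sample edges are hypothetical, closeness is simplified by treating edges as undirected, and the PageRank damping factor of 0.85 is the commonly used default.

```python
from collections import deque

def nodes_of(edges):
    return sorted({node for edge in edges for node in edge})

def in_degree(edges):
    """Number of incoming edges per node, i.e. number of followers."""
    degree = {node: 0 for node in nodes_of(edges)}
    for _, target in edges:
        degree[target] += 1
    return degree

def closeness(edges):
    """(reachable nodes) / (sum of distances), treating edges as undirected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    result = {}
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search yields shortest path lengths
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        result[start] = (len(dist) - 1) / total if total else 0.0
    return result

def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration; nodes without outgoing edges spread their rank evenly."""
    nodes = nodes_of(edges)
    out = {node: [] for node in nodes}
    for source, target in edges:
        out[source].append(target)
    rank = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new = {node: (1 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = out[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank

# Hypothetical network: a, b and c follow hub; hub follows a.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
print(in_degree(edges))
print(closeness(edges))
print(pagerank(edges))
```

In this toy network, "hub" scores highest on all three metrics, while "a" benefits in PageRank from being followed by the hub.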
Below, I am showing the network graph with node size representing betweenness centrality. Nodes with high betweenness centrality are on the path between many other nodes, which makes them key connections or bridges between different groups of nodes. These users are very important because they are likely to pass on information to a wide reach of other users. Node positions are calculated with the Fruchterman-Reingold layout algorithm.
Social network of codecentric’s Twitter friends and followers. Node size represents betweenness centrality.
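Betweenness centrality can be computed with Brandes' algorithm, which runs one breadth-first search per node and then accumulates path counts backwards. The Python sketch below is an unnormalized version for unweighted, directed graphs on a made-up network; it only illustrates the idea and is not the implementation used to produce the graph above.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness for an unweighted, directed graph (Brandes)."""
    nodes = set(adjacency) | {t for targets in adjacency.values() for t in targets}
    score = {node: 0.0 for node in nodes}
    for source in nodes:
        order = []                          # nodes in order of discovery
        preds = {node: [] for node in nodes}
        sigma = {node: 0 for node in nodes}  # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in nodes}
        for w in reversed(order):           # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score

# Hypothetical network: two groups that are only connected through "bridge".
adjacency = {
    "a1": ["bridge"],
    "a2": ["bridge"],
    "bridge": ["b1", "b2"],
}
print(betweenness(adjacency))
```

All four shortest paths from the "a" group to the "b" group pass through "bridge", which is exactly the kind of node that stands out in the plot.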
Topic Modeling
We can now use the follower descriptions again to identify groups of users with similar interests. For a detailed introduction to topic modeling, see Matthias Radtke’s “Topic Modeling of the codecentric Blog Articles”.
Here, I am using Latent Dirichlet Allocation with the VEM algorithm to group codecentric’s first and second degree connections into five topics.
The wordcloud below visualizes the most characteristic words for each topic.
Now, we want to know which topic each user in our network belongs to. We can find this out with the so-called gamma score: each user is assigned the topic with the highest gamma score.
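Assigning each user to the topic with the highest gamma score is a simple argmax. In the Python sketch below, the gamma values are invented for illustration, `assign_topics` is a hypothetical helper, and ties are resolved in favor of the lower topic number, which is an arbitrary choice.

```python
def assign_topics(gamma):
    """Map each user to the 1-based index of the topic with the highest score."""
    assignments = {}
    for user, probabilities in gamma.items():
        best = max(range(len(probabilities)), key=lambda k: probabilities[k])
        assignments[user] = best + 1  # topics are numbered from 1
    return assignments

# Hypothetical per-user topic probabilities for five topics.
gamma = {
    "alice": [0.70, 0.10, 0.10, 0.05, 0.05],
    "bob": [0.05, 0.15, 0.60, 0.10, 0.10],
    "carol": [0.20, 0.20, 0.20, 0.20, 0.20],
}

print(assign_topics(gamma))
```

A user like "carol", whose scores are spread evenly, still gets a single label, so it can be worth keeping the full score vector around for borderline cases.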
Social network of codecentric’s Twitter friends and followers. Colors indicate the topic assigned by topic modeling.
This network shows the different interest groups of codecentric’s Twitter friends based on which topics they and their friends were assigned to (one user, for example, seems to follow many users assigned to topic 1, which is about software development).
Even though this network is far from representative, because it only shows a subgroup of second degree friends, we can already see the potential that this information contains! We now have a very good idea about the interests of our friends from a) their Twitter descriptions and b) from the descriptions of the users that they in turn follow. We could now, for example, generate a similar network with first and second degree followers. It would give us a good idea about the interests of users who are not (yet) followers. This information could be used to target specific interest groups by expanding or focusing more on topics where we see a potential for reaching many users via existing followers.
We could even imagine combining this approach with machine learning techniques to predict follower interests and sharing-potential.
All analyses have been done with R version 3.4.0.
Code is available via GitHub.
Blog author
Shirin Elsinghorst
People Lead & Principal Consultant Data/AI