Analysis Of Brexit-Centric Twitter Activity

This is a rather long blog post, so we’ve created a PDF for you to download, if you’d like to read it offline. You can download that from here.

Executive Summary

This report explores Brexit-related Twitter activity occurring between December 4, 2018 and February 13, 2019. Using the standard Twitter API, researchers collected approximately 24 million tweets that matched the word “brexit” published by 1.65 million users.

A node-edge graph created from the collected data was used to delineate pro-leave and pro-remain communities active on Twitter during the collection period. Using these graphs, researchers were able to identify accounts on both sides of the debate that play influential roles in shaping the Brexit conversation on Twitter. A subsequent analysis revealed that while both communities exhibited inorganic activity, this activity was far more pronounced in the pro-leave group. Given the degree of abnormal activity observed, the researchers conclude that the pro-leave Twitter community is receiving support from far-right Twitter accounts based outside of the UK. Some of the exceptional behaviors exhibited by the pro-leave community included:

  • The top two influencers in the pro-leave community received a disproportionate number of retweets, as compared to influencer patterns seen in the pro-remain community
  • The pro-leave group relied on support from a handful of non-authoritative news sources
  • A significant number of non-UK accounts were involved in pro-leave conversations and retweet activity
  • Some pro-leave accounts tweeted a mixture of Brexit and non-Brexit issues (specifically #giletsjaunes, and #MAGA)
  • Some pro-leave accounts participated in the agitation of French political issues (#franceprotests)

The scope of this report is too limited to conclusively determine whether or not there is a coordinated astroturfing campaign underway to manipulate the public or political climate surrounding Brexit. However, it does provide a solid foundation for more investigation into the matter.

Introduction

Social networks have come under fire for their inability to prevent the manipulation of news and information by potentially malicious actors. These activities can expose users to a variety of threats. Recently, the spread of disinformation and factually inaccurate statements to socially engineer popular opinion has become a significant public concern. Of particular concern is the coordination of actions across multiple accounts in order to amplify specific content and fool underlying algorithms into falsely promoting that content to users in their news feeds, searches, and recommendations. Participants in these campaigns can include fully automated accounts (“bots”), cyborgs (accounts that use a combination of manual and automated actions), full-time human operators, and users who inadvertently amplify content due to their beliefs or political affiliations. Architects of sophisticated social engineering campaigns, or astroturfing campaigns (fabricated social network interactions designed to deceive the observer into believing that the activity is part of a grass-roots campaign), sometimes create and operate convincing-looking personas to assist in the propagation of content and messages relevant to their cause. It is extremely difficult to distinguish these “fake” personas from real accounts.

Identifying suspicious activities in social networks is becoming more and more difficult. Adversaries have learned from their past experiences, and are now using better tactics, building better automation, and creating much more human-like sock puppets. Social networks now employ more sophisticated algorithms for detecting suspicious activity, and this forces adversaries to develop new techniques aimed at evading those detection algorithms. Services that sell Twitter followers, Twitter retweets, YouTube views, YouTube subscribers, app store reviews, TripAdvisor reviews, Facebook likes, Instagram followers, Instagram likes, Facebook accounts, Twitter accounts, eBay reviews, Amazon ratings, and anything else related to social networks you could possibly imagine can be purchased cheaply online. These services can all be found with simple web searches. For the more tech-savvy, a plethora of tools exist for automating the control of multiple social media accounts, for automating the creation and publishing of text and video-based content, for scraping and copying web sites, and for automating search engine optimization tasks. As a result, identifying suspicious activity requires more complex analysis techniques, and a much more in-depth study of the data obtainable from social networks, than ever before.

Because of its open nature, and fully-featured API support, Twitter is an ideal platform for research into suspicious social network activity. By studying what happens on Twitter, we can gain insight into the techniques adversaries use to “game” the platform’s users and underlying algorithms. The findings from such research can help us build more robust recommendation mechanisms for both current and future social networking platforms.

Background

Between December 4, 2018 and February 13, 2019, we used the standard Twitter API (from Python) to collect Twitter data against the search term “brexit”. The collected data was written to disk and then subsequently analyzed (using primarily Python and Jupyter notebooks) in order to search for suspicious activity such as disinformation campaigns, astroturfing, sentiment amplification, or other “meddling” operations.
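A minimal sketch of the persistence step might look like this. Field names follow Twitter’s v1.1 status object; this is an illustration of the approach, not our exact collection code, and in a live collector the function would be invoked from the streaming API’s per-status callback:

```python
import json

def handle_status(status: dict, out) -> bool:
    """Append a status to a JSON-lines stream if its text matches the
    tracked term. Kept pure (no network, no client object) so it can run
    against any iterable of v1.1-style status dicts."""
    # Extended (>140 character) tweets carry their text in a sub-object.
    text = status.get("extended_tweet", {}).get("full_text") or status.get("text", "")
    if "brexit" in text.lower():
        out.write(json.dumps(status) + "\n")
        return True
    return False
```

Writing raw JSON lines to disk keeps the collector simple and lets all analysis happen later, offline, in notebooks.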

At the time of writing, our dataset consisted of approximately 24 million tweets published by over 1.65 million users. 18 million of those were retweets published by 1.5 million users from tweets posted by 300,000 unique users. The dataset included 145,000 different hashtags, 412,000 different URLs and 700,000 unique tweets.

Suspicious activity (activity that appears inorganic or unnatural) can be difficult to separate from organic activity on a social network. For instance, a tweet from a user with very few followers will normally fall on deaf ears. However, that user may occasionally post something so catchy that it goes “viral”, shared first by other users and eventually by influencers with many followers. Malicious actors can amplify a tweet to similar effect by instructing or coordinating a large number of accounts to share an equally unknown user’s tweet. This can be achieved via bots or manually operated accounts (as Tweetdeckers did in 2017 and 2018). Retweets can also be purchased online. Vendors that provide such services publish the purchased retweets from their own fleets of Twitter accounts, which likely don’t participate in any other related conversations on the platform. Retweets purchased in this way are often published over a period of time (and not all at once, since that would arouse suspicion). Hence, detecting that a tweet has been amplified by such a service (and identifying the accounts that participated in the amplification) is only possible if those retweets are captured as they are published. Finding a small group of users that retweeted one account over several days, and that may have themselves appeared only once in a dataset containing over 20 million tweets and 300,000 users, is rather difficult.
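As an illustration of what such a search involves, a naive detector for this pattern can be sketched as follows. The thresholds are arbitrary illustrative values, not figures from our analysis:

```python
from collections import defaultdict

def find_retweet_rings(retweets, min_days=2, max_other_activity=1):
    """retweets: iterable of (retweeter, original_author, date) tuples.
    Returns {author: set_of_retweeters} for retweeters who amplified a
    single author across several distinct days while barely appearing
    elsewhere in the dataset -- the purchased-retweet pattern described
    above."""
    days = defaultdict(set)       # (retweeter, author) -> distinct days seen
    activity = defaultdict(set)   # retweeter -> authors they amplified
    for rt, author, day in retweets:
        days[(rt, author)].add(day)
        activity[rt].add(author)
    rings = defaultdict(set)
    for (rt, author), d in days.items():
        # Sustained amplification of one author, little activity elsewhere.
        if len(d) >= min_days and len(activity[rt]) - 1 <= max_other_activity:
            rings[author].add(rt)
    return {a: users for a, users in rings.items() if len(users) > 1}
```

In practice the difficulty is less the logic than the scale: the candidate accounts may appear only a handful of times across tens of millions of tweets.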

Groups of accounts that heavily retweet similar content or users over and over can be indicative of automation or malicious behaviour, but finding such groups can sometimes be tricky. Nowadays sophisticated bot automation exists that can easily hide the usual tell-tale signs of artificial amplification. Automation can be used to queue a list of tweets to be published or retweeted, randomly select a portion of potentially thousands of slave accounts, and perform actions at random times, while specifically avoiding tweeting at certain times of the day to give the impression that real users are in control of those accounts. Real tweets and tweets from “share” buttons on news sites can be mixed with retweets to improve realism.

Another approach to finding suspicious behaviour on Twitter is to search for account activity patterns indicative of automation. In a vacuum, these patterns cannot be used to conclusively determine whether an account is automated, or designed to act as part of an astroturfing or disinformation campaign. However, identifying accounts with one or more suspicious traits can help lead researchers to other accounts, or suspicious phenomena, which may ultimately lead to finding evidence of foul-play. Here are some traits that may indicate suspiciousness:

  • While it is entirely possible for a bored human to tweet hundreds of times per day (especially when most of the activity is pressing the retweet button), accounts with high tweet volumes can sometimes be indicative of automation. In fact, some of the accounts we found during this research that tweeted at high volume tended to publish hundreds of tweets at certain times during the day, whilst remaining dormant the rest of the time, or published tweets at a uniform volume, with no pauses for sleep.
  • Accounts that are just a few days or weeks old tend to not have thousands of followers, unless they belong to a well-known celebrity who just joined the platform. New accounts that develop huge followings in a short period of time are suspicious, unless those followings can be explained by particular activity or some sort of pre-existing public status.
  • Accounts with a similar number of followers and friends can occasionally be suspicious. For instance, accounts controlled by a bot herder are sometimes programmatically instructed to follow each other, and end up having similar follower/friend counts. However, mechanisms also exist that promote a “follow-back” culture on Twitter. These mechanisms are often present in isolated communities, such as the far-right Twittersphere. Commercial services also exist that automate follow-back actions for accounts that followed them. A follower list that closely mirrors an account’s friend list can, unfortunately, be indicative of any of the above.
  • Accounts that follow thousands of other accounts, but are themselves followed by only a fraction of that number can occasionally be indicative of automation. Automated accounts that advertise adult services (such as porn, phone sex, “friend finders”, etc.) use this tactic to attract followers. However, there are also certain communities on Twitter that tend to reciprocate follows, and hence following a great deal of accounts (including “egg” accounts) is a way of “fishing” for follow-backs, and normal in those circles.
  • While it is true that many users on Twitter tend to like and retweet content a lot more than they write their own tweets, accounts that retweet more than 99% of the time might be controlled by automation (especially since it’s a very easy thing to automate, and can be used to boost the engagement of specific content). A few accounts that we encountered during our research had one pinned tweet published by “Twitter Web Client” whilst the rest of the account’s tweets were retweets published by “Twitter for Android”. This sort of pattern raises suspicion, since it could indicate that the account was manually created (and seeded with a single hand-written tweet) by a user at a computer, and then subsequently automated.
  • Accounts that publish tweets using apps created with the Twitter API, or from sources that are often associated with automation are not conclusively suspicious, but may warrant further examination. This is covered more extensively later in the article.
  • Temporal analysis techniques (discussed later in this article) can reveal robot-like behaviour indicative of automation. Some accounts are automated by design (e.g. automated marketing accounts, news feeds). However, if an account behaves in an automated fashion, and publishes politically polarizing content, it may be cause for suspicion.
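Heuristics like these can be encoded as simple profile checks. The sketch below is illustrative only: field names follow Twitter’s v1.1 user object, every threshold is a guess rather than a validated cut-off, the `retweet_count` input is a derived value (retweets observed among the account’s recent tweets, not a standard profile field), and a returned flag is a lead for manual review, not evidence of automation:

```python
from datetime import datetime, timezone

def suspicion_flags(acct: dict, now=None) -> list:
    """Score an account profile against the heuristic traits listed above."""
    now = now or datetime.now(timezone.utc)
    flags = []
    age_days = max((now - acct["created_at"]).days, 1)
    if acct["statuses_count"] / age_days > 200:
        flags.append("high tweet volume")
    if age_days < 30 and acct["followers_count"] > 5000:
        flags.append("young account, large following")
    followers, friends = acct["followers_count"], acct["friends_count"]
    if friends and 0.9 <= followers / friends <= 1.1 and friends > 1000:
        flags.append("near-equal followers/friends")
    if friends > 5000 and followers < friends / 10:
        flags.append("mass following, few followers")
    if acct["statuses_count"] and acct.get("retweet_count", 0) / acct["statuses_count"] > 0.99:
        flags.append("almost exclusively retweets")
    return flags
```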

During the first few weeks of our research, we focused on building up an understanding of the trends and topology of conversations around the Brexit topic. We created a simple tool designed to collect counts and interactions from the previous 24 hours’ worth of data, and present the results in an easily readable format. This analysis included:

  • counts of how many times each user tweeted
  • counts of how many times each user retweeted another user (amplifiers)
  • counts of how many times each user was retweeted by another user (influencers)
  • counts of hashtags seen
  • counts of URLs shared
  • counts of words seen in tweet text
  • a map of interactions between users
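The counters behind a tool like this are straightforward to sketch. Tweet dicts are assumed to follow Twitter’s v1.1 shape; this is an illustration rather than our exact implementation:

```python
from collections import Counter

def daily_summary(tweets):
    """Compute the per-day counts listed above from v1.1-style tweet
    dicts. Amplifiers retweet others; influencers get retweeted; the
    edges counter feeds the node-edge interaction graph."""
    tweeted, amplifiers, influencers = Counter(), Counter(), Counter()
    hashtags, urls, edges = Counter(), Counter(), Counter()
    for t in tweets:
        user = t["user"]["screen_name"]
        tweeted[user] += 1
        hashtags.update(h["text"].lower() for h in t["entities"].get("hashtags", []))
        urls.update(u["expanded_url"] for u in t["entities"].get("urls", []))
        if "retweeted_status" in t:
            src = t["retweeted_status"]["user"]["screen_name"]
            amplifiers[user] += 1
            influencers[src] += 1
            edges[(user, src)] += 1
    return {"tweeted": tweeted, "amplifiers": amplifiers,
            "influencers": influencers, "hashtags": hashtags,
            "urls": urls, "edges": edges}
```

A `.most_common(50)` call against any of these counters produces the kind of “top 50” list we reviewed each day.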

By mapping the interactions between users (users interact when they retweet, mention, or reply to each other), a node-edge graph representation of observed conversations can be built. Here’s a simple representation of what that looks like:

Lines connecting users in the diagram above represent interactions between those users. Communities are groups of nodes within a network that are more densely connected to one another than to other nodes, and can be discovered using community detection algorithms. To visualize the topology of conversation spaces, we used a graph analysis tool called Gephi, which uses the Louvain Method for community detection. For programmatic community detection purposes, we used the “multilevel” algorithm that is part of the python-igraph package (which is very similar to the algorithm used in Gephi). We often used graph analysis and visualization techniques during our research, since they were able to fairly accurately partition conversations between large numbers of accounts. As an example of the accuracy of these tools, the illustration below is a graph visualization created using about 24 hours’ worth of data collected around December 4, 2018.

Names with a larger font indicate Twitter accounts that are mentioned more often. It can be noted from the above illustration that conversations related to pro-Brexit (leave) topics are clustered at the top (in orange) and conversations related to anti-Brexit (remain) topics are clustered at the bottom (in blue). The green cluster represents conversations related to Labour, and the purple cluster contains conversations about Scotland. People familiar with the Twitter users in this visualization will appreciate how accurately this methodology separated out each political viewpoint. Visualizations like these illustrate that separate groups of users discuss opposing topics, with very little interaction between the two groups. Highly polarized issues, such as the Brexit debate (and many political topics around the world), usually generate graph visualizations that look like the above.
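For readers who want to experiment, the partitioning step can be approximated without any graph libraries. The sketch below uses simple connected components, a much coarser split than the Louvain/multilevel community detection we used via Gephi and python-igraph, but for highly polarized conversations with very little cross-group interaction even this begins to separate the camps:

```python
def connected_components(edges):
    """Partition users into groups with no interactions between them,
    using a small union-find. Real community detection (e.g. Louvain)
    additionally subdivides densely connected regions within a
    component; this is only a first-cut illustration."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```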

December 11: #franceprotests hashtag

On December 11, 2018, we observed the #franceprotests hashtag trending in our data (something we had not previously seen). Isolating all tweets from 24 hours’ worth of previously collected data, we found 56 separate tweets that included the #franceprotests hashtag. We mapped interactions between these tweets and the users that interacted with them, resulting in this visualization:

From the above visualization, we can clearly observe a large number of users interacting with a single tweet. This particular tweet (id: 1069955399917350912) was responsible for a majority of the occurrences of the #franceprotests hashtag on that day. This is the tweet:

The reason this tweet showed up in our data was because of the presence of the #BREXIT hashtag. From this 24 hours’ worth of collected data, we isolated a list of 1,047 users that retweeted the above tweet. Interactions between these users from across the 24-hour period looked like this:

Of note in the above visualization are accounts such as @Keithbird59Bird (which retweeted pro-leave content at high volume across our entire dataset), @stephenhawes2 (a pro-leave account that exhibits potentially suspicious activity patterns), @SteveMcGill52 (an account that tweets pro-leave, anti-muslim, and US-related right wing content at high volume). The @lvnancy account that published the original tweet is a US-based alt-right account with over 50,000 followers.

At the time of writing, 23 of these 1,047 accounts (2.2%) had been suspended by Twitter.

We performed a Twitter history search for “#franceprotests” in order to determine which accounts had been sharing this hashtag. The search captured roughly 5,800 tweets published by 3,617 accounts (retweets are not included in historical searches). Searching back historically allowed us to determine that the current wave of #franceprotests tweets started to pick up momentum around November 28, 2018. In addition to the #franceprotests hashtag, this group of users also published tweets with hashtags related to the yellow vests movement (#yellowvest, #yellowjackets, #giletsjaunes), and to US right-wing topics (#MAGA, #qanon, #wwg1wga). Interactions between the accounts found in that search look like this:

Some of the accounts in this group are quite suspicious looking. For instance, @tagaloglang is an account that claims to be themed towards learning the Tagalog language. The pinned tweet at the top of @tagaloglang’s timeline makes the account appear in-theme when the page loads:

However, scroll down, and you’ll notice that the account frequently publishes political content.

Another odd account is @HallyuWebsite – a Korean-themed account about Kpop. Here’s what the account looks like when you visit it:

Again, this is just a front. Scroll down and you will see plenty of political content.

Both @tagaloglang and @HallyuWebsite look like accounts that might be owned by a “Twitter marketing” service that sells retweets.

The 5,800 tweets captured in this search had accumulated a total of 53,087 retweets by mid-February 2019. Here are a few of the tweets that received the most retweets:

At the time of writing, 66 of the 3,617 accounts (1.83%) identified as historically sharing this hashtag had been suspended.

Throughout our research, we observed many English-language accounts participating in activism related to the French protests, often in conjunction with UK, US, and other far-right themes. We would imagine that a separate research thread devoted to the study of far-right activism around the French protests would likely expose plenty of additional suspicious activity.

December 20: suspicious pro-leave amplification

During our time spent studying the day-to-day user interactions, we became familiar with the names of accounts that most often tweeted, and of those that were most often retweeted. On December 20, 2018 we noticed a few accounts that weren’t normally highly retweeted make it onto our “top 50” list. We isolated the interactions between these accounts and the accounts that retweeted them, and produced the following visualization:

As illustrated above, several separate groups of accounts participated in the amplification of a small number of tweets from brexiteer30, jackbmontgomery, unitynewsnet and stop_the_eu. Here is a visualization of tweets from those accounts, and the users who interacted with them:

5,876 accounts participated in the amplification captured on December 20, 2018. In order to discover what other accounts these 5,876 accounts were amplifying, we collected the last 200 tweets from each of the accounts, and mapped all interactions found, generating this graph:

Zooming in on this, we can see that the yellow cluster at the bottom contains US-based “alt-right” Twitter personalities (such as Education4Libs, and MrWyattEarpLA – an account that is now suspended), and US-based non-authoritative news accounts (such as INewsNet).

The large blue center cluster contains many EU-based right-wing accounts (such as Stop_The_EU, darrengrimes_, and BasedPoland), and non-authoritative news sources (such as UnityNewsNet, V_of_Europe). It also contains AmyMek, a radical racist US Twitter personality with over 200,000 followers.

The orange cluster at the top contains interactions with pro-remain accounts. Although we weren’t expecting to see any interactions of this nature, they were most likely introduced by accounts in the dataset that retweet content from both sides of the debate (such as Brexit-themed tweet aggregators).

Many of the 5,876 accounts that participated in the December 20, 2018 amplification included #MAGA (the “Make America Great Again” hashtag commonly used by the alt-right) in their profiles, had US cities and states set as their locations, or otherwise identified as American. At the time of writing, 79 of these 5,876 accounts (1.34%) had been suspended by Twitter.
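Mapping the interactions found in each collected timeline can be sketched as a small extraction step per tweet (v1.1 field names; illustrative rather than our exact code):

```python
def interactions(tweet: dict):
    """Yield (interaction_type, target_screen_name) pairs for one tweet.
    Users interact when they retweet, mention, or reply to each other;
    running this over the last 200 tweets of each amplifying account
    yields the edges for a graph like the one shown above."""
    if "retweeted_status" in tweet:
        yield "retweet", tweet["retweeted_status"]["user"]["screen_name"]
    if tweet.get("in_reply_to_screen_name"):
        yield "reply", tweet["in_reply_to_screen_name"]
    for m in tweet.get("entities", {}).get("user_mentions", []):
        yield "mention", m["screen_name"]
```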

February 12: non-authoritative news accounts

The presence of interactions with a number of non-authoritative, pro-leave news sources that are supportive of far-right activist Tommy Robinson (such as UnityNewsNet, PoliticalUK, and PoliticaLite) in this data led us to explore the phenomenon a little further. We ran analysis over our entire collected dataset in order to discover which accounts were interacting with, and sharing links to these sources. The data reveals that some users retweeted the accounts associated with these news sources, others shared links directly, and some retweeted content that included those links. Using our collected data, we were able to build up a picture of how these links were being shared between early December and mid-February. The script we ran looked for interactions with the following accounts: “UnityNewsNet”, “AltNewsMedia”, “UK_ElectionNews”, “LivewireNewsUK”, “Newsflash_UK”, “PoliticsUK1”, “Politicaluk”, “politicalite”. It also performed string searches on any URLs embedded in tweets for the following: “unitynewsnet”, “politicalite”, “altnewsmedia”, “www-news”, “patriotnewsflash”, “puknews”.
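A simplified sketch of that matching logic might look like this. Tweet dicts follow the v1.1 shape; the account names and URL substrings are taken verbatim from the description above:

```python
NEWS_ACCOUNTS = {s.lower() for s in [
    "UnityNewsNet", "AltNewsMedia", "UK_ElectionNews", "LivewireNewsUK",
    "Newsflash_UK", "PoliticsUK1", "Politicaluk", "politicalite"]}
URL_SUBSTRINGS = ["unitynewsnet", "politicalite", "altnewsmedia",
                  "www-news", "patriotnewsflash", "puknews"]

def matches_news_source(tweet: dict) -> bool:
    """True if the tweet interacts with one of the non-authoritative
    news accounts or embeds a link to one of their sites."""
    if "retweeted_status" in tweet:
        if tweet["retweeted_status"]["user"]["screen_name"].lower() in NEWS_ACCOUNTS:
            return True
    for m in tweet.get("entities", {}).get("user_mentions", []):
        if m["screen_name"].lower() in NEWS_ACCOUNTS:
            return True
    for u in tweet.get("entities", {}).get("urls", []):
        url = u.get("expanded_url", "").lower()
        if any(s in url for s in URL_SUBSTRINGS):
            return True
    return False
```

Note that matching is done against the `expanded_url` entity, since the text of a tweet usually contains only a shortened t.co link.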

Overall, we discovered that 7,233 accounts had either shared (or retweeted) links to these news sites, or retweeted their associated Twitter accounts. A total of 15,337 retweets were found in the dataset. The UnityNewsNet Twitter account was the most popular news source present in our dataset, receiving 8,119 retweets from a total of 4,185 unique users. In second place was the UK_ElectionNews account with 1,293 retweets from 1,182 unique users, and in third place was politicalite with 494 retweets from 351 unique users.

A total of 9,193 tweets were found in the dataset that shared URLs matching the string searches mentioned above. Again, Unity News Network was the most popular – URLs that matched “unitynewsnet” were tweeted a total of 5,542 times by 2,928 unique users. Politicalite came in second – URLs that matched “politicalite” were tweeted a total of 3,300 times by 2,197 unique users. In third place was Newsflash_UK – URLs that matched “patriotnewsflash” were tweeted a total of 239 times by 65 unique users. Here is a graph visualization of all the activity that took place between the beginning of December 2018 and mid-February 2019:

Names that appear larger in the above visualization are account names that were retweeted more often. We can see more names here than the originally queried accounts because many links to these sites were shared by users retweeting other accounts that shared a link. Here’s a closer zoom-in:

At the time of writing, 130 of the 7,233 accounts (1.79%) identified as sharing content related to these non-authoritative news sources had been suspended by Twitter.

The figures and illustrations shown above were obtained from a dataset of tweets that matched the term “brexit”. This particular analysis, unfortunately, didn’t give us full visibility into all Twitter activity around these “non-authoritative news” accounts and their associated websites between early December 2018 and mid-February 2019. In order to explore this phenomenon further, we performed historical Twitter searches for each of the account names in question (collecting data from between December 4, 2018 and February 12, 2019). This allowed us to examine tweets and interactions that weren’t captured using the search term “brexit”. Historical Twitter searches only return tweets from the accounts themselves, and tweets where the accounts were mentioned. Unfortunately, no retweets are returned by a search of this kind.

The combined dataset (over all 7 searches) included 30,846 tweets and 12,026 different users.

Combining the data from historical searches against all seven account names, we were able to map interactions between each news account and users that mentioned it. Here’s what it looked like:

Here’s a zoomed-in view of the graph around politicalite and altnewsmedia:

Note the presence of @prisonplanet (Paul Joseph Watson) and @jgoddard230616 (James Goddard, the star of several recent “yellow vests” harassment videos), amongst other highly-mentioned far-right personalities.

Also of interest is the set of users coloured in purple in the following visualization:

The users in the purple cluster were found from the data we collected using a Twitter search for “unitynewsnet”. With the exception of V_of_Europe, each of these accounts is mentioned exactly the same number of times (522 times) by other users in that dataset. This particular phenomenon appears to have been created by a rather long conversation between those roughly 40 accounts between January 14 and 16, 2019. The conversation started with a question about where to find “yellow vest”-related news. Since mentions are always inherited between replies, and both V_of_Europe and UnityNewsNet were mentioned in the first tweet in the thread, this explains why these tweets are present in this dataset. Using temporal analysis techniques (explained below), we were able to ascertain that a majority of the involved accounts pause, or tweet at reduced volume, between 06:00 and 12:00 UK time, which is indicative of night time in US time zones. In fact, examining these accounts manually reveals that they are mostly US-based. The V_of_Europe account is a non-authoritative news account (Voice of Europe) with over 200,000 followers.

This finding illustrates that a suspicious-looking trend or spike may present itself when data is viewed from a certain angle, while further inspection proves the phenomenon to be largely benign.

At the time of writing, 159 of the 12,026 accounts (1.32%) discovered above had been suspended by Twitter.

Temporal Analysis

Temporal analysis methods can be useful for determining whether a Twitter account might be publishing tweets using automation. This section describes techniques, the results of which are included later in this document. Here are some common temporal analysis methods:

  • Gather a histogram of counts of time intervals between the publishing of tweets. Numerous high counts of similar time intervals between tweets can indicate robotic behaviour.
  • Gather a “heatmap” of the times of day, and days of the week, at which an account tweets. The heatmap can then be examined (either by eye, or programmatically) for anomalous patterns. Using this technique, it is easy to identify accounts that tweet non-stop, with no breaks. If this is the case, it is possible that some (or all) of the tweets are being published via automation.
  • A heatmap analysis may also illustrate that certain accounts publish tweets en-masse at specific times of the day, and remain dormant for many hours in between. This behavior can also be an indicator that an account is automated – for instance, this is somewhat common with marketing automation or news feeds.
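Both techniques can be sketched in a few lines of Python. This is illustrative standard-library code; the five-second bucket size is an arbitrary choice:

```python
from collections import Counter
from datetime import datetime

def interarrival_histogram(timestamps, bucket=5):
    """Bucket the gaps (in seconds) between consecutive tweets. Tall
    spikes of near-identical gaps suggest robotic scheduling."""
    ts = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    return Counter(int(g // bucket) * bucket for g in gaps)

def hour_day_heatmap(timestamps):
    """Map (weekday, hour) -> tweet count. Accounts with no quiet hours
    at all, or with activity only in narrow windows, stand out."""
    return Counter((t.weekday(), t.hour) for t in timestamps)
```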

Here are some interesting examples found from the dataset. Note that these examples are intended as illustration, and not as indications that the associated accounts are bots.

The stephanhawes2 account tweets in short bursts at specific times of the day, with no activity at any other time. The precise time windows during which this user tweets (18:00-20:59 and 00:00-01:59) look odd. This account retweets a great deal of far-right content.

Here are the time deltas (in seconds) observed between the account’s last 3200 tweets. You’ll notice that a majority of the tweets are published between 5 and 15 seconds apart.

The JimNola42035005 account, which amplifies a lot of pro-leave content, pauses tweeting between 08:00 UTC and 13:00 UTC. This is indicative of a user not residing in the UK’s time zone.

The interarrival pattern for this account shows a strong tendency for multiple tweets to be published in rapid succession (5-30 seconds apart).

The tobytortoise1 account tweets at very high volume, and almost always shows up at or near the top of the most active users tweeting about Brexit. This is a pro-leave account. Here’s the heatmap for that account. Note the bursts of activity exceeding 100 tweets in an hour:

Here is the interarrival pattern for that account:

The walshr108 account, which publishes pro-leave content, appears to pause roughly around UK night-time hours. However, the interarrival pattern of this account raises suspicion.

Over 350 of walshr108’s last 3200 tweets were published less than one second apart.

Unconventional source fields

Each published tweet includes a “source” field that is set by the agent that was used to publish that tweet. For instance, tweets published from the web interface have their source field set to “Twitter Web Client”. Tweets published from an iPhone have a source field set to “Twitter for iPhone”. And so on. Tweets can be published from a variety of sources, including services that allow tweets to be scheduled for publishing (for instance “IFTTT”), services that allow users to track follows and unfollows (such as “Unfollowspy”), apps within web pages, and social media aggregators. Twitter sources can be roughly grouped into:

  • Sources associated with manual tweeting (such as “Twitter Web Client”, “Twitter for iPhone”)
  • Sources associated with known automation services (such as “IFTTT”)
  • Sources that don’t match either of the above

While services that allow the automation of tweeting (such as “IFTTT”) can be used for malicious purposes, they can also be used for legitimate purposes (such as brand marketing, news feeds, and aggregators). Malicious actors sometimes shy away from such services for two reasons:

  • It is easy for researchers to identify tweet automation by examining source fields
  • Sophisticated tools exist that allow bot herders to publish tweets from multiple accounts without the use of the API, and which can spoof their user agent to match legitimate sources (often “Twitter for Android”)

Despite the availability of professional bot tools, there are still some malicious actors that use Twitter’s API and attempt to disguise what they’re doing. One way to do this is to create an app whose source field is a string similar to that of a known source (e.g. “Twitter for  Android” <- note that this string has two spaces between the words “for” and “Android”). Another way is to replace ASCII characters with non-ASCII characters (e.g. “Оwly” <- the “O” in this string is non-ASCII). API-based apps can also “hide” by using source strings that look like legitimate product names – there are a plethora of legitimate apps available that all have similar-looking names: Twuffer, Twibble, Twitterfy, Tweepsmap, Tweetsmap. It’s easy enough to create a similarly absurd, nonsensical word, and hide amongst all of these.
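The whitespace and non-ASCII tricks described above are easy to check for programmatically. The known-source list in the sketch below is illustrative and far from exhaustive:

```python
import unicodedata

KNOWN_SOURCES = {"Twitter Web Client", "Twitter for iPhone",
                 "Twitter for Android", "IFTTT"}

def source_anomalies(source: str) -> list:
    """Flag source strings that imitate legitimate clients via doubled
    whitespace or non-ASCII look-alike characters."""
    issues = []
    # Collapsing runs of whitespace exposes the double-space trick.
    if source not in KNOWN_SOURCES and " ".join(source.split()) in KNOWN_SOURCES:
        issues.append("extra whitespace")
    if not source.isascii():
        # Show what survives an ASCII fold, to aid manual review.
        folded = unicodedata.normalize("NFKD", source).encode("ascii", "ignore").decode()
        issues.append("non-ASCII characters (folds to %r)" % folded)
    return issues
```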

Over 6000 unique source strings were found in the dataset. There is no definitive list of “legitimate” Twitter sources available, and hence each and every one of the unique source strings found must be examined manually in order to build a list of acceptable versus unacceptable sources. This process involves either searching for the source string, locating a website, and reading it, or visiting the account that is using the unknown source string and manually checking the “legitimacy” of that account. At the time of writing, we had managed to hand-verify about 150 source strings that belonged to Twitter clients, known automation services, and custom apps used by legitimate services (such as news sites and aggregators). We found roughly 2 million tweets across the entire dataset that were published with source strings that we had yet to hand-verify. These tweets were published by just under 17,000 accounts.

As mentioned previously, since there are dozens of legitimate services that allow Twitter to be automated, it isn’t easy to programmatically identify whether these automation sources are being used for malicious purposes. Each use of such a service found in the dataset would need to be examined by hand (or by the use of custom filtering logic for each subset of examples). This is simply not feasible. As such, using Twitter’s source field to determine whether suspicious, malicious, or automated behaviour is occurring is a complex endeavour, and one that is outside of the scope of the research described in this document.

Comparison of remain and leave-centric communities

We collected retweet interactions over our entire dataset and created a large node-edge graph. We captured only retweet interactions on the assumption that users wishing to extend the reach of a particular tweet are more likely to retweet it than to reply to it or simply mention an account name. While the process of “liking” a tweet also seems to amplify a tweet’s visibility (via Twitter’s underlying recommendation mechanisms), instances of users “liking” tweets are, unfortunately, not available via Twitter’s streaming API.

The graph of all retweet activity across the entire collection period contained 219,328 nodes (unique Twitter accounts) and 1,184,262 edges (each edge representing one or more observed retweets). Using python-igraph’s multilevel community detection algorithm, we partitioned the graph into communities. A total of 8,881 communities were discovered during this process. We performed string searches on the members of each identified community for high-profile accounts we’d seen engaging in leave and remain conversations throughout the research period, and were able to identify a leave-centric community containing 39,961 users and a remain-centric community containing 52,205 users. We then separately queried our full dataset with each list of users to isolate all relevant data (tweets, interactions, hashtags, and URLs). Below are the findings from that analysis work.

Leave community

The leave-centric community comprised 39,961 users, who published 1.1 million unique tweets and a total of 4.3 million retweets across the dataset.

  • 2779 (6.95%) accounts that were seen at least 100 times in the dataset retweeted 95% (or more) of the time.
  • 278 (0.70%) accounts retweeted over 2000 times across the entire dataset, for a total of 880,620 retweets (20.5%). Of these, temporal analysis suggests that 33 of the accounts exhibited potentially suspicious behavior (11.9%) and 14 accounts tweeted during non-UK time schedules.
  • At the time of writing, 133 accounts in this community (0.33%) had been suspended by Twitter.

Remain community

The remain-centric community comprised 52,205 users, who published 1.7 million unique tweets and a total of 6.2 million retweets across the dataset.

  • 3413 (6.54%) accounts that were seen at least 100 times in the dataset retweeted 95% (or more) of the time.
  • 436 (0.84%) accounts retweeted over 2000 times across the entire dataset, for a total of 1,471,515 retweets (23.7%). Of these, temporal analysis suggests that 41 of the accounts exhibited potentially suspicious behavior (9.4%) and 18 accounts tweeted during non-UK time schedules.
  • At the time of writing, 54 accounts in this community (0.10%) had been suspended by Twitter.
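The retweet-ratio screen used in both community summaries above can be sketched as follows (illustrative only; `events` assumes tweet records already reduced to (account, is-retweet) pairs):

```python
from collections import Counter

def high_volume_retweeters(events, min_seen=100, min_ratio=0.95):
    """Return accounts seen at least `min_seen` times in the dataset
    that retweeted at least `min_ratio` of the time."""
    seen, retweets = Counter(), Counter()
    for user, is_retweet in events:
        seen[user] += 1
        if is_retweet:
            retweets[user] += 1
    return {u for u, n in seen.items()
            if n >= min_seen and retweets[u] / n >= min_ratio}
```

The same pass, with the thresholds changed, yields the over-2000-retweets cohorts reported above.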

Findings

Although we were initially suspicious of high-volume retweeters, their presence in roughly equal proportions in both groups led us to believe that this sort of behaviour might be somewhat standard on Twitter. The remain-centric group’s top high-volume retweeters published more often than their leave-centric counterparts. We also observed that many of the top retweeters from the remain-centric group tended to tweet a lot about Labour.

The top retweeted account in the leave-centric group received substantially more retweets than the next most highly retweeted account. This was not the case for the remain-centric group.

  • Top hashtags used by the remain-centric group included: #peoplesvote, #stopbrexit, #eu, #fbpe, #remain, #labour, #finalsay, and #revokea50. All of the top-50 hashtags in this group were themed around anti-Brexit sentiment, around politicians, or around political events that happened in the UK during the data collection period (#specialplaceinhell, #theresamay, #corbyn, #donaldtusk, #newsnight).
  • Top hashtags used by the leave-centric group included: #eu, #nodeal, #standup4brexit, #ukip, #leavemeansleave, #projectfear, and #leave. Notable other hashtags on the top-50 list for this group were other “no-deal” hashtags (#gowto, #wto, #letsgowto, #nodealnoproblem, #wtobrexit), hashtags referring to protests in France and the adoption of high-vis vests by far-right UK protesters (#giletsjaunes, #yellowvestsuk, #yellowvestuk), and the hashtag #trump.
  • Both groups heavily advertised links to UK Parliament online petitions relevant to the outcome of Brexit. The remain group advertised links to petitions requesting a second referendum, whilst the leave group advertised links to petitions demanding the UK leave the EU, regardless of the outcome of negotiations.
  • From the users we identified as having retweeted more than 95% of the time, we found 62 accounts from the leave-centric group that were clearly American right-wing personas. These accounts associated with #Trump and #MAGA, amplified US political content, and interacted with US-based alt-right personalities (in addition to amplifying Brexit-related content). The description fields of these accounts usually included words such as “Patriot”, “Christian”, “NRA”, and “Israel”. Many of these accounts had their locations set to a state or city in the US. The most common locations set for these accounts were: Texas, Florida, California, New York, and North Carolina. We found no evidence of equivalent accounts in the remain-centric group.
  • Following on from our previous discovery, using a simple string search, we found 1294 accounts in the leave-centric group and 12 accounts in the remain-centric group that had #MAGA in either their name or description fields. We manually visited a random selection of these accounts to verify that they were alt-right personas. A few of the #MAGA accounts identified in the remain group were not what we would consider alt-right – they showed up in the results due to the presence of negative comments about MAGA culture in their account description fields.
  • As detailed earlier, some of the accounts in the leave-centric group interacted with non-authoritative, far-right “news” accounts, or shared links to sites associated with these accounts (such as UnityNewsNet, BreakingNLive, LibertyDefenders, INewsNet, Voice of Europe, ZNEWSNET, PoliticalUK, and PoliticaLite). We found no analogous activity in the remain-centric group.

We created a few plots of the number of times a hashtag was observed during each hour of the day. For a baseline reference, here’s what that plot looks like for the #brexit hashtag:

You can clearly see a lull in activity during night-time hours in the UK. Compare the above baseline with the plot for the #yellowvestuk hashtag:

This clearly shows that the #yellowvestuk hashtag is most frequently used in the late evening UK time (mid-afternoon US time). Here is the plot for #yellowvestsuk:

Note that this hashtag follows a different pattern to #yellowvestuk, and is used most often around lunchtime in the UK. Both of these graphs show a lull in activity during night-time hours in the UK, indicating that the accounts pushing these hashtags most likely belong to people living in the UK, and that possibly different groups are promoting these two competing hashtags.
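The hour-of-day counts behind plots like these can be computed with a short script. The following is a sketch, assuming timestamps in the standard API’s created_at format:

```python
from collections import Counter
from datetime import datetime

def parse_created_at(s):
    """Parse the standard API's created_at format,
    e.g. "Wed Feb 13 12:00:00 +0000 2019"."""
    return datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y")

def hourly_histogram(created_ats):
    """Count tweets per hour of day (UTC); a lull in the early-morning UTC
    hours is what we'd expect from UK-based accounts."""
    counts = Counter(parse_created_at(s).hour for s in created_ats)
    return [counts.get(h, 0) for h in range(24)]
```

Plotting the returned 24-slot list per hashtag reproduces the comparisons above.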

Final thoughts

It is very difficult to determine whether a Twitter account is a bot, or acting as part of a coordinated astroturfing campaign, simply by performing queries with the standard API. Twitter’s programmatic interface imposes many limitations on what can be done when analyzing an account. For instance, by default, only the last 3200 tweets can be collected from any given account, and Twitter restricts how often such a query can be run. Most of the potentially suspicious accounts identified during this research have published tens, or even hundreds of thousands of tweets over their lifetimes, most of which are now inaccessible.

Since Twitter’s API doesn’t support observing when a user “likes” a tweet, and has limited support for querying which accounts retweeted a tweet, or replied to a tweet, it is impossible to track all actions that occur on the platform. Nowadays, a user’s Twitter timeline contains a series of recommendations (for instance, tweets that appear on user’s timeline may indicate that they are there because “user x that you follow liked this tweet”). The timeline is no longer just a sequential list of tweets published by accounts a user follows. Hence it is important to understand which actions might increase the likelihood that a tweet appears on a user’s timeline, is recommended to a user (via notifications) or appears in a curated list when performing a search.

We do know that Twitter’s systems track an internal representation of the quality of every account, and give more engagement weight to higher quality accounts. Although it is likely that many of the potentially suspicious accounts identified during our research have low quality scores, it is still possible that their collective actions may incorrectly modify the sentiment of certain viewpoints and opinions, or cause content to be shown to users when it otherwise shouldn’t have.

From analysis of the “leave” and “remain” communities obtained by graph analysis, it seems clear to us that the remain-centric group looks quite organic, whilst the leave-centric group is being bolstered by non-UK far-right Twitter accounts. Leave users also utilize a number of “non-authoritative” news sources to spread their messages. Given that we also observed a subset of leave accounts amplifying political content related to French and US politics, we wouldn’t be surprised if coordinated astroturfing activity is being used to amplify pro-Brexit sentiment. Confirming such a phenomenon would require additional work – most of the tweets published by this group likely weren’t captured by our stream-search for the word “brexit”. It’s clear that an internationally-coordinated collective of far-right activists is promoting content on Twitter (and likely other social networks) in order to steer discussions and amplify sentiment and opinion towards their own goals, Brexit being one of them.

During the course of our research, we created over 90 separate Jupyter notebooks and custom analysis tools. We estimate that roughly 90% of the approaches we tried ended in dead ends. Despite all of this analysis work, we didn’t find the “next big” political disinformation botnet. We did, however, find many phenomena that were both interesting and odd.

A week in security (February 25 – March 3)

Last week, we delved into the realm of K-12 schools and security, explored the world of compromised websites and Golang bruteforcers, and examined the possible realms of pay for privacy. We also looked at identity management solutions, Google’s Universal Read Gadget, and did the deepest of dives into the life of Max Schrems.

Other security news

  • Big coin, big problems: Founder of My Big Coin charged with seven counts of fraud (Source: The Register)
  • Another day, another exposed list: Specifically, the paid-for Dow Jones watchlist (Source: Security Discovery)
  • Mobile malware continues to rise: Mobile threats may have been a little quiet recently, but that certainly doesn’t mean they’ve gone away. Ignore at your peril (Source: CBR)
  • PDF tracking: Viewing some samples in Chrome can lead to tracking behaviour (Source: Edgespot)
  • Verification bait and switch: Instagram users who desire verification status should be wary of a phish currently in circulation (Source: PCMag)
  • Missile warning sent from hacked Twitter account: The dangers of not securing your social media profile take on a whole new terrifying angle (Source: Naked Security)
  • Graphics card security update: NVIDIA rolls out a fix patching no less than 8 flaws for their display driver (Source: NVIDIA)
  • Momo, oh no: The supposed Momo challenge has predictably turned out to be an urban myth – one that was known to be a so-called creepypasta hoax for a long time (Source: IFLScience)
  • Police arrest supplier of radios: Turns out you really don’t want to install fraudulent software from someone Homeland Security considers to be a security threat (Source: CBC news)

Stay safe, everyone!

The post A week in security (February 25 – March 3) appeared first on Malwarebytes Labs.

Smashing Security #117: SWATs on a plane

Why is Tampa’s mayor tweeting about blowing up the airport? Are hackers trying to connect with you via LinkedIn? And has Maria succeeded in her attempt to survive February without Facebook?

All this and much much more in the latest edition of the “Smashing Security” podcast by computer security veterans Graham Cluley and Carole Theriault, joined this week by Maria Varmazis.

Plus, after last week’s discussion about the legal battle between Mondelez and Zurich Insurance, we have a chat with security veteran Martin Overton to take a deeper look into cyberinsurance.

Will pay-for-privacy be the new normal?

Privacy is a human right, and online privacy should be no exception.

Yet, as the US considers new laws to protect individuals’ online data, at least two proposals—one statewide law that can still be amended and one federal draft bill that has yet to be introduced—include an unwelcome bargain: exchanging money for privacy.

This framework, sometimes called “pay-for-privacy,” is plain wrong. It casts privacy as a commodity that individuals with the means can easily purchase. But a move in this direction could further deepen the separation between socioeconomic classes. The “haves” can operate online free from prying eyes. But the “have nots” must forfeit that right.

Though this framework has been used by at least one major telecommunications company before, and there are no laws preventing its practice today, those in cybersecurity and the broader technology industry must put a stop to it. Before pay-for-privacy becomes law, privacy as a right should become industry practice.

Data privacy laws prove popular, but flawed

Last year, the European Union put into effect one of the most sweeping set of data privacy laws in the world. The General Data Protection Regulation, or GDPR, regulates how companies collect, store, share, and use EU citizens’ data. The law has inspired countries everywhere to follow suit, with Italy (an EU member) issuing regulatory fines against Facebook, Brazil passing a new data-protective bill, and Chile amending its constitution to include data protection rights.

The US is no exception to this ripple effect.

In the past year, Senators Ron Wyden of Oregon, Marco Rubio of Florida, Amy Klobuchar of Minnesota, and Brian Schatz of Hawaii (joined by 14 other senators as co-sponsors) proposed separate federal bills to regulate how companies collect, use, and protect Americans’ data.

Sen. Rubio’s bill asks the Federal Trade Commission to write its own set of rules, which Congress would then vote on two years later. Sen. Klobuchar’s bill would require companies to write clear terms of service agreements and to send users notifications about privacy violations within 72 hours. Sen. Schatz’s bill introduces the idea that companies have a “duty to care” for consumers’ data by providing a “reasonable” level of security.

But it is Sen. Wyden’s bill, the Consumer Data Protection Act, that stands out, and not for good reason. Hidden among several privacy-forward provisions, like stronger enforcement authority for the FTC and mandatory privacy reports for companies of a certain size, is a dangerous pay-for-privacy stipulation.

According to the Consumer Data Protection Act, companies that require user consent for their services could charge users a fee if those users have opted out of online tracking.

If passed, here’s how the Consumer Data Protection Act would work:

Say a user, Alice, no longer feels comfortable having companies collect, share, and sell her personal information to third parties for the purpose of targeted ads and increased corporate revenue. First, Alice would register with the Federal Trade Commission’s “Do Not Track” website, where she would choose to opt-out of online tracking. Then, online companies with which Alice interacts would be required to check Alice’s “Do Not Track” status.

If a company sees that Alice has opted out of online tracking, that company is barred from sharing her information with third parties and from following her online to build and sell a profile of her Internet activity. Companies that are run almost entirely on user data—including Facebook, Amazon, Google, Uber, Fitbit, Spotify, and Tinder—would need to heed users’ individual decisions. However, those same companies could present Alice with a difficult choice: She can continue to use their services, free of online tracking, so long as she pays a price.

This represents a literal price for privacy.

Electronic Frontier Foundation Senior Staff Attorney Adam Schwartz said his organization strongly opposes pay-for-privacy systems.

“People should be able to not just opt out, but not be opted in, to corporate surveillance,” Schwartz said. “Also, when they choose to maintain their privacy, they shouldn’t have to pay a higher price.”

Pay-for-privacy schemes can come in two varieties: individuals can be asked to pay more for more privacy, or they can pay a lower (discounted) amount and be given less privacy. Both options, Schwartz said, incentivize people not to exercise their privacy rights, either because the cost is too high or because the monetary gain is too appealing.

Both options also harm low-income communities, Schwartz said.

“Poor people are more likely to be coerced into giving up their privacy because they need the money,” Schwartz said. “We could be heading into a world of the ‘privacy-haves’ and ‘have-nots’ that conforms to current economic statuses. It’s hard enough for low-income individuals to live in California with its high cost-of-living. This would only further aggravate the quality of life.”

Unfortunately, a pay-for-privacy provision is also included in the California Consumer Privacy Act, which the state passed last year. Though the law includes a “non-discrimination” clause meant to prevent just this type of practice, it also includes an exemption that allows companies to provide users with “incentives” to still collect and sell personal information.

In a larger blog about ways to improve the law, which was then a bill, Schwartz and other EFF attorneys wrote:

“For example, if a service costs money, and a user of this service refuses to consent to collection and sale of their data, then the service may charge them more than it charges users that do consent.”

Real-world applications

The alarm over pay-for-privacy isn’t theoretical: the model has been implemented in the past, and there is no law stopping companies from doing it again.

In 2015, AT&T offered broadband service for a $30-a-month discount if users agreed to have their Internet activity tracked. According to AT&T’s own words, that Internet activity included the “webpages you visit, the time you spend on each, the links or ads you see and follow, and the search terms you enter.”

Paying for privacy isn’t always so obvious, with real dollars coming out of or going into a user’s wallet or checking account. Often, it happens behind the scenes, and it isn’t the user getting richer—it’s the companies.

Powered by mountains of user data for targeted ads, Google-parent Alphabet recorded $32.6 billion in advertising revenue in the last quarter of 2018 alone. In the same quarter, Twitter recorded $791 million in ad revenue. And, notable for its CEO’s insistence that the company does not sell user data, Facebook’s prior plans to do just that were revealed in documents posted this week. Signing up for these services may be “free,” but that’s only because the product isn’t the platform—it’s the user.

A handful of companies currently reject this approach, though, refusing to sell or monetize users’ private information.

In 2014, CREDO Mobile separated itself from AT&T by promising users that their privacy “is not for sale. Period.” (The company does admit in its privacy policy that it may “sell or trade mailing lists” containing users’ names and street addresses, though.) ProtonMail, an encrypted email service, positions itself as a foil to Gmail because it does not advertise on its site, and it promises that users’ encrypted emails will never be scanned, accessed, or read. In fact, the company claims it couldn’t access these emails even if it wanted to.

As for Google’s very first product – online search – the clearest privacy alternative is DuckDuckGo. The privacy-focused service does not track users’ searches, and it does not build individualized profiles of its users to deliver unique results.

Even without monetizing users’ data, DuckDuckGo has been profitable since 2014, said community manager Daniel Davis.

“At DuckDuckGo, we’ve been able to do this with ads based on context (individual search queries) rather than personalization.”

Davis said that DuckDuckGo’s decisions are steered by a long-held belief that privacy is a fundamental right. “When it comes to the online world,” Davis said, “things should be no different, and privacy by default should be the norm.”

It is time other companies follow suit, Davis said.

“Control of one’s own data should not come at a price, so it’s essential that [the] industry works harder to develop business models that don’t make privacy a luxury,” Davis said. “We’re proof this is possible.”

Hopefully, other companies are listening, because it shouldn’t matter whether pay-for-privacy is codified into law—it should never be accepted as an industry practice.

The post Will pay-for-privacy be the new normal? appeared first on Malwarebytes Labs.

Why Social Network Analysis Is Important

I got into social network analysis purely for nerdy reasons – I wanted to write some code in my free time, and Python modules that wrap Twitter’s API (such as tweepy) allowed me to do simple things with just a few lines of code. I started off with toy tasks (like mapping the time of day that @realDonaldTrump tweets) and then moved on to creating tools to fetch and process streaming data, which I used to visualize trends during some recent elections.

The more I work on these analyses, the more I’ve come to realize that there are layers upon layers of insights that can be derived from the data. There’s data hidden inside data – and there are many angles you can view it from, all of which highlight different phenomena. Social network data is like a living organism that changes from moment to moment.

Perhaps some pictures will help explain this better. Here’s a visualization of conversations about Brexit that happened between the 3rd and 4th of December, 2018. Each dot is a user, and each line represents a reply, mention, or retweet.

Tweets supportive of the idea that the UK should leave the EU are concentrated in the orange-colored community at the top. Tweets supportive of the UK remaining in the EU are in blue. The green nodes represent conversations about UK’s Labour party, and the purple nodes reflect conversations about Scotland. Names of accounts that were mentioned more often have a larger font.

Here’s what the conversation space looked like between the 14th and 15th of January, 2019.

Notice how the shape of the visualization has changed. Every snapshot produces a different picture that reflects the opinions, issues, and participants in that particular conversation space at the moment it was recorded. Here’s one more – this time from the 20th to 21st of January, 2019.

Every interaction space is unique. Here’s a visual representation of interactions between users and hashtags on Twitter during the weekend before the Finnish presidential elections that took place in January of 2018.

And here’s a representation of conversations that happened in the InfoSec community on Twitter between the 15th and 16th of March, 2018.

I’ve been looking at Twitter data on and off for a couple of years now. My focus has been on finding scams, social engineering, disinformation, sentiment amplification, and astroturfing campaigns. Even though the data is readily available via Twitter’s API, and plenty of the analysis can be automated, oftentimes finding suspicious activity just involves blind luck – the search space is so huge that you have to be looking in the right place, at the right time, to find it. One approach is, of course, to think like the adversary. Social networks run on recommendation algorithms that can be probed and reverse engineered. Once an adversary understands how those underlying algorithms work, they’ll game them to their advantage. These tactics share many analogies with search engine optimization methodologies. One approach to countering malicious activities on these platforms is to devise experiments that simulate the way attackers work, and then design appropriate detection methods, or countermeasures against these. Ultimately, it would be beneficial to have automation that can trace suspicious activity back through time, to its source, visualize how the interactions propagated through the network, and provide relevant insights (that can be queried using natural language). Of course, we’re not there yet.

The way social networks present information to users has changed over time. In the past, Twitter feeds contained a simple, sequential list of posts published by the accounts a user followed. Nowadays, Twitter feeds are made up of recommendations generated by the platform’s underlying models – what they understand about a user, and what they think the user wants to see.

A potentially dystopian outcome of social networks was outlined in a blog post written by François Chollet in May 2018, in which he describes social media becoming a “psychological panopticon”.

The premise for his theory is that the algorithms that drive social network recommendation systems have access to every user’s perceptions and actions. Algorithms designed to drive user engagement are currently rather simple, but if more complex algorithms (for instance, based on reinforcement learning) were to be used to drive these systems, they may end up creating optimization loops for human behavior, in which the recommender observes the current state of each target (user) and keeps tuning the information that is fed to them, until the algorithm starts observing the opinions and behaviors it wants to see. In essence the system will attempt to optimize its users. Here are some ways these algorithms may attempt to “train” their targets:

  • The algorithm may choose to only show its target content that it believes the target will engage or interact with, based on the algorithm’s notion of the target’s identity or personality. Thus, it will cause a reinforcement of certain opinions or views in the target, based on the algorithm’s own logic. (This is partially true today)
  • If the target publishes a post containing a viewpoint that the algorithm doesn’t wish the target to hold, it will only share it with users who would view the post negatively. The target will, after being flamed or down-voted enough times, stop sharing such views.
  • If the target publishes a post containing a viewpoint the algorithm wants the target to hold, it will only share it with users that would view the post positively. The target will, after some time, likely share more of the same views.
  • The algorithm may place a target in an “information bubble” where the target only sees posts from friends that share the target’s views (that are desirable to the algorithm).
  • The algorithm may notice that certain content it has shared with a target caused their opinions to shift towards a state (opinion) the algorithm deems more desirable. As such, the algorithm will continue to share similar content with the target, moving the target’s opinion further in that direction. Ultimately, the algorithm may itself be able to generate content to those ends.
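The last of these behaviors can be illustrated with a purely hypothetical toy model (ours, not any platform’s actual algorithm): a recommender that serves content biased toward a desired opinion, and a user whose opinion drifts a little toward whatever they are shown.

```python
def nudge_opinion(opinion, desired, rounds=50, assimilation=0.1):
    """Toy optimization loop: each round the recommender serves content
    halfway between the user's current opinion and the opinion it wants
    them to hold, and the user drifts toward what they were shown."""
    for _ in range(rounds):
        shown = opinion + 0.5 * (desired - opinion)   # content biased toward the goal
        opinion += assimilation * (shown - opinion)   # user assimilates the content
    return opinion
```

Even with a weak per-round influence, the simulated opinion steadily converges on the recommender’s goal – which is the essence of the optimization loop Chollet describes.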

Chollet goes on to mention that, although social network recommenders may start to see their users as optimization problems, a bigger threat still arises from external parties gaming those recommenders in malicious ways. The data available about users of a social network can already be used to predict when a user is suicidal, or when a user will fall in love or break up with their partner, and content delivered by social networks can be used to change users’ moods. We also know that this same data can be used to predict which way a user will vote in an election, and the probability of whether that user will vote or not.

If this optimization problem seems like a thing of the future, bear in mind that, at the beginning of 2019, YouTube made changes to its recommendation algorithms precisely because of problems they were causing for certain members of society. Guillaume Chaslot posted a Twitter thread in February 2019 that described how YouTube’s algorithms favored recommending conspiracy theory videos, guided by the behaviors of a small group of hyper-engaged viewers. Fiction is often more engaging than fact, especially for users who spend all day, every day watching YouTube. As such, the conspiracy videos watched by this group of chronic users received high engagement, and were thus pushed up by the recommendation system. Driven by these high engagement numbers, the makers of these videos created more and more content, which was, in turn, viewed by this same group of users. YouTube’s recommendation system was optimized to pull more and more users into a hole of chronic YouTube addiction. Many of the users sucked into this hole have since become indoctrinated with right-wing extremist views. One such user became convinced that his brother was a lizard, and killed him with a sword. Chaslot has since created a tool that allows users to see which of these types of videos are being promoted by YouTube.

Social engineering campaigns run by entities such as the Internet Research Agency, Cambridge Analytica, and the far-right demonstrate that social media advert distribution platforms (such as those on Facebook) have provided a weapon for malicious actors that is incredibly powerful, and damaging to society. The disruption caused by their recent political campaigns has created divides in popular thinking and opinion that may take generations to repair. Now that the effectiveness of these social engineering techniques is apparent, I expect what we’ve seen so far is just an omen of what’s to come.

The disinformation we hear about is only a fraction of what’s actually happening. It requires a great deal of time and effort for researchers to find evidence of these campaigns. As I already noted, Twitter data is open and freely available, and yet it can still be extremely tedious to find evidence of disinformation campaigns on that platform. Facebook’s targeted ads are only seen by the users who were targeted in the first place. Unless those who were targeted come forward, it is almost impossible to determine what sort of ads were published, who they were targeted at, and what the scale of the campaign was. Although social media platforms now enforce transparency on political ads, the source of these ads must still be determined in order to understand who’s being targeted, and by what content.

Many individuals on social networks share links to “clickbait” headlines that align with their personal views or opinions (sometimes without having read the content behind the link). Fact checking is uncommon, and often difficult for people who don’t have a lot of time on their hands. As such, inaccurate or fabricated news, headlines, or “facts” propagate through social networks so quickly that even if they are later refuted, the damage is already done. This mechanism forms the very basis of malicious social media disinformation. A well-documented example of this was the UK’s “Leave” campaign that was run before the Brexit referendum. Some details of that campaign are documented in the recent Channel 4 film: “Brexit: The Uncivil War”.

It's not just the engineers of social networks who need to understand how they work and how they might be abused. Social networks are a relatively new form of human communication, and have only been around for a few decades. But they're part of our everyday lives, and obviously they're here to stay. Social networks are a powerful tool for spreading information and ideas, and an equally powerful weapon for social engineering, disinformation, and propaganda. As such, research into these systems should be of interest to governments, law enforcement, cyber security companies and organizations that seek to understand human communications, culture, and society.

The potential avenues of research in this field are numerous. Whilst my research with Twitter data has largely focused on graph analysis methodologies, I’ve also started experimenting with natural language processing techniques, which I feel have a great deal of potential.

The Orville, “Majority Rule”. A vote badge worn by all citizens of the alien world Sargus 4, allowing the wearer to receive positive or negative social currency. Source: youtube.com

We don’t yet know how much further social networks will integrate into society. Perhaps the future will end up looking like the “Majority Rule” episode of The Orville, or the “Nosedive” episode of Black Mirror, both of which depict societies in which each individual’s social “rating” determines what they can and can’t do and where a low enough rating can even lead to criminal punishment.

Historical OSINT – Profiling a Typosquatted Facebook and Twitter Impersonating Fraudulent and Malicious Domains Portfolio

With cybercriminals continuing to populate the cybercrime ecosystem with hundreds of malicious releases, including a variety of typosquatted domains, it shouldn't be surprising that hundreds of thousands of users continue falling victim to fraudulent and malicious malware- and exploit-serving schemes. In this post I'll profile a currently active portfolio of fraudulent and malicious typosquatted domains.

Hackers seize dormant Twitter accounts to push terrorist propaganda

As much progress as Twitter has made kicking terrorists off its platform, it still has a long way to go. TechCrunch has learned that ISIS supporters are hijacking long-dormant Twitter accounts to promote their ideology. Security researcher WauchulaGhost found that the extremists were using a years-old trick to get in. Many of these idle accounts used email addresses that either expired or never existed, often with names identical to their Twitter handles -- the social site didn't confirm email addresses for roughly a decade, making it possible to use the service without a valid inbox. As Twitter only partly masks those addresses, it's easy to create those missing addresses and reset those passwords.

Source: TechCrunch

Project Lakhta: Putin’s Chef spends $35M on social media influence

Project Lakhta is the name of a Russian project that was further documented by the Department of Justice last Friday in the form of sharing a Criminal Complaint against Elena Alekseevna Khusyaynova, said to be the accountant in charge of running a massive organization designed to inject distrust and division into the American elections and American society in general.

https://www.justice.gov/opa/press-release/file/1102316/download
In a fairly unusual step, the 39-page Criminal Complaint against Khusyaynova, filed just last month in Alexandria, Virginia, has already been unsealed, prior to any indictment or specific criminal charges being brought against her before a grand jury.  US Attorney G. Zachary Terwilliger says "The strategic goal of this alleged conspiracy, which continues to this day, is to sow discord in the U.S. political system and to undermine faith in our democratic institutions."

The data shared below, intended to summarize the 39-page criminal complaint, contains many direct quotes from the document, which has been shared by the DOJ. (Click for the full Criminal Complaint against Elena Khusyaynova.)

The complaint shows that, since May 2014, the following organizations were used as cover to spread distrust toward candidates for political office and the political system in general.

Internet Research Agency LLC ("IRA")
Internet Research LLC
MediaSintez LLC
GlavSet LLC
MixInfo LLC
Azimut LLC
NovInfo LLC
Nevskiy News LLC ("NevNov")
Economy Today LLC
National News LLC
Federal News Agency LLC ("FAN")
International News Agency LLC ("MAN")

These entities employed hundreds of individuals in support of Project Lakhta's operations with an annual global budget of millions of US dollars.  Only some of their activity was directed at the United States.

Prigozhin and Concord 

Concord Management and Consulting LLC and Concord Catering (collectively referred to as "Concord") are related Russian entities with various Russian government contracts.  Concord was the primary source of funding for Project Lakhta, controlling funding, recommending personnel, and overseeing activities through reporting and interaction with the management of various Project Lakhta entities.

Yevgeniy Viktorovich Prigozhin is a Russian oligarch closely identified with Russian President Vladimir Putin.  He began his career in the food and restaurant business and is sometimes referred to as "Putin's Chef."  Concord has Russian government contracts to feed school children and the military.

Prigozhin was previously indicted, along with twelve others and three Russian companies, with committing federal crimes while seeking to interfere with the US elections and political process, including the 2016 presidential election.

Project Lakhta internally referred to their work as "information warfare against the United States of America" which was conducted through fictitious US personas on social media platforms and other Internet-based media.

Lakhta has a management group which organized the project into departments, including a design and graphics department, an analysts department, a search-engine optimization ("SEO") department, an IT department and a finance department.

Khusyaynova has been the chief accountant of Project Lakhta's finance department since April of 2014, which included the budgets of most or all of the previously named organizations.  She submitted hundreds of financial vouchers, budgets, and payment requests for the Project Lakhta entities.  The money was managed through at least 14 bank accounts belonging to other Project Lakhta affiliates, including:

Glavnaya Liniya LLC
Merkuriy LLC
Obshchepit LLC
Potentsial LLC
RSP LLC
ASP LLC
MTTs LLC
Kompleksservis LLC
SPb Kulinariya LLC
Almira LLC
Pishchevik LLC
Galant LLC
Rayteks LLC
Standart LLC

Project Lakhta Spending 

Khusyaynova provided monthly reports to Concord covering spending for at least the period from January 2016 through July 2018.

A document sent in January 2017 included the projected budget for February 2017 (60 million rubles, or roughly $1 million USD), and an accounting of spending for all of calendar 2016 (720 million rubles, or $12 million USD).  Expenses included:

Registration of domain names
Purchasing proxy servers
Social media marketing expenses, including:
 - purchasing posts for social networks
 - advertisements on Facebook
 - advertisements on VKontakte
 - advertisements on Instagram
 - promoting posts on social networks

Other expenses were for Activists, Bloggers, and people who "developed accounts" on Twitter to promote online videos.

In January 2018, the "annual report" for 2017 showed 733 million Russian rubles of expenditure ($12.2M USD).

More recent expenses, between January 2018 and June 2018, included more than $60,000 in Facebook ads, and $6,000 in Instagram ads, as well as $18,000 for Bloggers and Twitter account developers.

Project Lakhta Messaging

From December 2016 through May 2018, Lakhta analysts and activists spread messages "to inflame passions on a wide variety of topics" including:
  • immigration
  • gun control and the Second Amendment 
  • the Confederate flag
  • race relations
  • LGBT issues 
  • the Women's March 
  • and the NFL national anthem debate.


Events in the United States were seized upon "to anchor their themes" including the Charleston church shootings, the Las Vegas concert shootings, the Charlottesville "Unite the Right" rally, police shootings of African-American men, and the personnel and policy decisions of the Trump administration.

Many of the graphics that were shared will be immediately recognizable to most social media users.

"Rachell Edison" Facebook profile
The graphic above was shared by a confirmed member of the conspiracy on December 5, 2016. "Rachell Edison" was a Facebook profile controlled by someone on payroll from Project Lakhta.  Their comment read  "Whatever happens, blacks are innocent. Whatever happens, it's all guns and cops. Whatever happens, it's all racists and homophobes. Mainstream Media..."

The Rachell Edison account was created in September 2016 and controlled the Facebook page "Defend the 2nd".  Between December 2016 and May 2017, "while concealing its true identity, location, and purpose" this account was used to share over 700 inflammatory posts related to gun control and the Second Amendment.

Other accounts specialized in other themes.  Another account, using the name "Bertha Malone", was created in June 2015, using fake information to claim that the account holder lived in New York City and attended a university in NYC.   In January 2016, the account created a Facebook page called "Stop All Invaders" (StopAI) which shared over 400 hateful anti-immigration and anti-Islam memes, implying that all immigrants were either terrorists or criminals.  Posts shared by this account reached 1.3 million individuals and at least 130,851 people directly engaged with the content (for example, by liking, sharing, or commenting on materials that originated from this account.)

Some examples of the hateful posts shared by "Bertha Malone" that were included in the DOJ criminal complaint:




The latter image was accompanied by the comment:

"Instead this stupid witch hunt on Trump, media should investigate this traitor and his plane to Islamize our country. If you are true enemy of America, take a good look at Barack Hussein Obama and Muslim government officials appointed by him."

Directions to Project Lakhta Team Members


The directions shared with the propaganda spreaders gave very specific examples of how to influence American thought, with guidance on which sources and techniques should be used to influence particular portions of our society.  For example, to further drive wedges into the Republican party, Republicans who spoke out against Trump were attacked in social media (all of these are marked in the Criminal Complaint as "preliminary translations of Russian text"):

"Brand McCain as an old geezer who has lost it and who long ago belonged in a home for the elderly. Emphasize that John McCain's pathological hatred towards Donald Trump and towards all his initiatives crosses all reasonable borders and limits.  State that dishonorable scoundrels, such as McCain, immediately aim to destroy all the conservative voters' hopes as soon as Trump tries to fulfill his election promises and tries to protect the American interests."

"Brand Paul Ryan a complete and absolute nobody incapable of any decisiveness.  Emphasize that while serving as Speaker, this two-faced loudmouth has not accomplished anything good for America or for American citizens.  State that the only way to get rid of Ryan from Congress, provided he wins in the 2018 primaries, is to vote in favor of Randy Brice, an American veteran and an iron worker and a Democrat."

Frequently the guidance related to a particular news headline, and directions on how to use that headline to spread their message of division were shared. A couple of examples:

After a news story "Trump: No Welfare To Migrants for Grants for First 5 Years" was shared, the conspiracy was directed to twist the messaging like this:

"Fully support Donald Trump and express the hope that this time around Congress will be forced to act as the president says it should. Emphasize that if Congress continues to act like the Colonial British government did before the War of Independence, this will call for another revolution.  Summarize that Trump once again proved that he stands for protecting the interests of the United States of America."

In response to an article about scandals in the Robert Mueller investigation, the direction was to use this messaging:

"Special prosecutor Mueller is a puppet of the establishment. List scandals that took place when Mueller headed the FBI.  Direct attention to the listed examples. State the following: It is a fact that the Special Prosector who leads the investigation against Trump represents the establishment: a politician with proven connections to the U.S. Democratic Party who says things that should either remove him from his position or disband the entire investigation commission. Summarize with a statement that Mueller is a very dependent and highly politicized figure; therefore, there will be no honest and open results from his investigation. Emphasize that the work of this commission is damaging to the country and is aimed to declare impeachement of Trump. Emphasize that it cannot be allowed, no matter what."

Many more examples are given, some targeted at particular concepts, such as this direction regarding "Sanctuary Cities":

"Characterize the position of the Californian sanctuary cities along with the position of the entire California administration as absolutely and completely treacherous and disgusting. Stress that protecting an illegal rapist who raped an American child is the peak of wickedness and hypocrisy. Summarize in a statement that "sanctuary city" politicians should surrender their American citizenship, for they behave as true enemies of the United States of America"

Some more basic guidance shared by Project Lakhta was about how to target conservatives vs. liberals, such as "if you write posts in a liberal group, you must not use Breitbart titles.  On the contrary, if you write posts in a conservative group, do not use Washington Post or BuzzFeed's titles."

We see the "headline theft" implied by this in some of their memes.  For example, this Breitbart headline:


Became this Project Lakhta meme (shared by Stop All Immigrants):


Similarly this meme originally shared as a quote from the Heritage Foundation, was adopted and rebranded by Lakhta-funded "Stop All Immigrants": 



Twitter Messaging and Specific Political Races

Many Twitter accounts shown to be controlled by paid members of the conspiracy were making very specific posts in support of or in opposition to particular candidates for Congress or Senate.  Some examples listed in the Criminal Complaint include:

@CovfefeNationUS posting:

Tell us who you want to defeat!  Donate $1.00 to defeat @daveloebsack Donate $2.00 to defeat @SenatorBaldwin Donate $3.00 to defeat @clairecmc Donate $4.00 to defeat @NancyPelosi Donate $5.00 to defeat @RepMaxineWaters Donate $6.00 to defeat @SenWarren

Several of the Project Lakhta Twitter accounts got involved in the Alabama Senate race, but, underscoring that Lakhta's objective is to CREATE DISSENT AND DISTRUST, they actually tweeted on opposite sides of the campaign:

One Project Lakhta Twitter account, @KaniJJackson, posted on December 12, 2017: 

"Dear Alabama, You have a choice today. Doug Jones put the KKK in prison for murdering 4 young black girls.  Roy Moore wants to sleep with your teenage daughters. This isn't hard. #AlabamaSenate"

while on the same day @JohnCopper16, also a confirmed Project Lakhta Twitter account, tweeted:

"People living in Alabama have different values than people living in NYC. They will vote for someone who represents them, for someone who they can trust. Not you.  Dear Alabama, vote for Roy Moore."

@KaniJJackson was a very active voice for Lakhta.  Here are some additional tweets for that account:

"If Trump fires Robert Mueller, we have to take to the streets in protest.  Our democracy is at stake." (December 16, 2017)

"Who ended DACA? Who put off funding CHIP for 4 months? Who rejected a deal to restore DACA? It's not #SchumerShutdown. It's #GOPShutdown." (January 19, 2018)

@JohnCopper16 also tweeted on that topic: 
"Anyone who believes that President Trump is responsible for #shutdown2018 is either an outright liar or horribly ignorant. #SchumerShutdown for illegals. #DemocratShutdown #DemocratLosers #DemocratsDefundMilitary #AlternativeFacts"   (January 20, 2018)

@KaniJJackson on Parkland, Florida and the 2018 Midterm election: 
"Reminder: the same GOP that is offering thoughts and prayers today are the same ones that voted to allow loosening gun laws for the mentally ill last February.  If you're outraged today, VOTE THEM OUT IN 2018. #guncontrol #Parkland"

They even tweet about themselves, as shown in this pair of tweets!

@JemiSHaaaZzz (February 16, 2018):
"Dear @realDonaldTrump: The DOJ indicted 13 Russian nationals at the Internet Research Agency for violating federal criminal law to help your campaign and hurt other campaigns. Still think this Russia thing is a hoax and a witch hunt? Because a lot of witches just got indicted."

@JohnCopper16 (February 16, 2018): 
"Russians indicted today: 13  Illegal immigrants crossing Mexican border indicted today: 0  Anyway, I hope all those Internet Research Agency f*ckers will be sent to gitmo." 

The Russians are also involved in "getting out the vote" - especially of those who hold strongly divisive views:

@JohnCopper16 (February 27, 2018):
"Dem2018 platform - We want women raped by the jihadists - We want children killed - We want higher gas prices - We want more illegal aliens - We want more Mexican drugs And they are wondering why @realDonaldTrump became the President"

@KaniJJackson (February 19, 2018): 
"Midterms are 261 days, use this time to: - Promote your candidate on social media - Volunteer for a campaign - Donate to a campaign - Register to vote - Help others register to vote - Spread the word We have only 261 days to guarantee survival of democracy. Get to work! 

More recent tweets have been on a wide variety of topics, with other accounts expressing strong views around racial tensions, and then speaking to the Midterm elections: 

@wokeluisa (another confirmed Project Lakhta account): 
"Just a reminder that: - Majority black Flint, Michigan still has drinking water that will give you brain damage if consumed - Republicans are still trying to keep black people from voting - A terrorist has been targeting black families for assassination in Austin, Texas" 

and then, also @wokeluisa: (March 19, 2018): 
"Make sure to pre-register to vote if you are 16 y.o. or older. Don't just sit back, do something about everything that's going on because November 6, 2018 is the date that 33 senate seats, 436 seats in the House of Representatives and 36 governorships will be up for re-election." 

And from @johncopper16 (March 22, 2018):
"Just a friendly reminder to get involved in the 2018 Midterms. They are motivated They hate you They hate your morals They hate your 1A and 2A rights They hate the Police They hate the Military They hate YOUR President" 

Some of the many additional Twitter accounts controlled by the conspiracy mentioned in the Criminal Complaint: 

@UsaUsafortrump, @USAForDTrump, @TrumpWithUSA, @TrumpMov, @POTUSADJT, @imdeplorable201, @swampdrainer659, @maga2017trump, @TXCowboysRawk, @covfefeNationUS, @wokeluisa (2,000 tweets and at least 55,000 followers), @JohnCopper16, @Amconvoice, @TheTrainGuy13, @KaniJJackson, @JemiSHaaaZzz 




Pr0nbots2: Revenge Of The Pr0nbots

A month and a half ago I posted an article in which I uncovered a series of Twitter accounts advertising adult dating (read: scam) websites. If you haven’t read it yet, I recommend taking a look at it before reading this article, since I’ll refer back to it occasionally.

To start with, let's recap. In my previous research, I used a script to recursively query Twitter accounts for specific patterns, and found just over 22,000 Twitter bots using this process. That figure reflects the fact that I concluded my research (stopped my script) after querying only 3,000 of the 22,000 discovered accounts; I suspect my script would have uncovered many more accounts had I let it run longer.

This week, I decided to re-query all the Twitter IDs I found in March, to see if anything had changed. To my surprise, I was only able to query 2895 of the original 21964 accounts, indicating that Twitter has taken action on most of those accounts.

In order to find out whether the culled accounts were deleted or suspended, I wrote a small Python script that used the requests module to query each account's URL directly. A 404 error indicated that the account had been removed or renamed, while a successful reply indicated that the account was suspended. Of the 19,069 culled accounts checked, 18,932 were suspended, and 137 were deleted or renamed.
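The status-check logic can be sketched roughly as follows. This is a minimal stand-in for the script described above: it uses only the standard library (the original used the `requests` module), and the "suspended" marker string is an assumption about what the returned HTML contains, not a confirmed detail.

```python
import urllib.request
import urllib.error

def classify_response(status_code, body):
    """Pure classification logic, split out so it can be tested offline."""
    if status_code == 404:
        return "deleted"        # account removed or renamed
    if "suspended" in body.lower():
        return "suspended"      # assumed marker string in the returned HTML
    return "active"

def check_account(screen_name):
    """Fetch a profile page and classify the account's status."""
    url = "https://twitter.com/" + screen_name
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify_response(resp.status,
                                     resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        return classify_response(err.code, "")
```

The same idea extends to the "restricted" check in the next step: fetch the page and look for the relevant marker strings in the HTML.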

I also checked the surviving accounts in a similar manner, using requests to identify which ones were “restricted” (by checking for specific strings in the html returned from the query). Of the 2895 surviving accounts, 47 were set to restricted and the other 2848 were not.

As noted in my previous article, the accounts identified during my research had creation dates ranging from a few days old to over a decade in age. I checked the creation dates of both the culled set and the survivor set (using my previously recorded data) for patterns, but couldn't find any. Here they are, for reference:

Based on the connectivity I recorded between the original bot accounts, I’ve created a new graph visualization depicting the surviving communities. Of the 2895 survivors, only 402 presumably still belong to the communities I observed back then. The rest of the accounts were likely orphaned. Here’s a representation of what the surviving communities might look like, if the entity controlling these accounts didn’t make any changes in the meantime.
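A sketch of how those surviving communities can be recovered from the recorded connectivity data: drop every follow edge that touches a culled account, then take the connected components of what remains. The edge format (pairs of account IDs) and the IDs in the test are illustrative assumptions, not the actual dataset.

```python
def surviving_communities(edges, survivors):
    """Group surviving accounts into communities via union-find.

    edges     -- iterable of (follower_id, followed_id) pairs
    survivors -- set of account IDs that still exist
    """
    parent = {a: a for a in survivors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    # Union only edges whose endpoints both survived the cull.
    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    # Collect accounts by their component root.
    groups = {}
    for a in survivors:
        groups.setdefault(find(a), set()).add(a)
    return list(groups.values())
```

Singleton components in the result correspond to the orphaned accounts mentioned above.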

By the way, I’m using Gephi to create these graph visualizations, in case you were wondering.

Erik Ellason (@slickrockweb) contacted me recently with some evidence that the bots I’d discovered might be re-tooling. He pointed me to a handful of accounts that contained the shortened URL in a pinned tweet (instead of in the account’s description). Here’s an example profile:

Fetching a user object using the Twitter API will also return the last tweet that account published, but I’m not sure it would necessarily return the pinned Tweet. In fact, I don’t think there’s a way of identifying a pinned Tweet using the standard API. Hence, searching for these accounts by their promotional URL would be time consuming and problematic (you’d have to iterate through their tweets).

Fortunately, automating discovery of Twitter profiles similar to those Erik showed me was fairly straightforward. Like the previous botnet, the accounts could be crawled because they follow each other. Also, all of these new accounts had text in their descriptions that followed a predictable pattern. Here are a few of those sentences:

look url in last post
go on link in top tweet
go at site in last post

It was trivial to construct a simple regular expression to find all such sentences:

desc_regex = "(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"
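Compiled and run against the sample sentences, the pattern behaves as expected. The first three bios below are the examples shown above; the last is an invented non-matching bio:

```python
import re

# The description pattern quoted above, compiled for reuse.
desc_regex = re.compile(
    r"(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"
)

samples = [
    "look url in last post",      # observed bot bios
    "go on link in top tweet",
    "go at site in last post",
    "just a normal account bio",  # invented non-match
]

matches = [s for s in samples if desc_regex.search(s)]
```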

I modified my previous script to include the above regular expression, seeded it with the handful of accounts that Erik had provided me, and let it run. After 24 hours, my new script had identified just over 20,000 accounts. Mapping the follower/following relationships between these accounts gave me the following graph:

As we zoom in, you’ll notice that these accounts are way more connected than the older botnet. The 20,000 or so accounts identified at this point map to just over 100 separate communities. With roughly the same amount of accounts, the previous botnet contained over 1000 communities.

Zooming in further shows the presence of "hubs" in each community, similar to those in our previous botnet.

Given that this botnet showed a greater degree of connectivity than the previous one studied, I decided to continue my discovery script and collect more data. The discovery rate of new accounts slowed slightly after the first 24 hours, but remained steady for the rest of the time it was running. After 4 days, my script had found close to 44,000 accounts.

And eight days later, the total was just over 80,000.

Here’s another way of visualizing that data:


Here’s the size distribution of communities detected for the 80,000-node graph. Smaller community sizes may indicate places where my discovery script didn’t yet look. The largest communities contained over 1000 accounts. There may be a way of searching more efficiently for these accounts by prioritizing crawling within smaller communities, but this is something I’ve yet to explore.

I shut down my discovery script at this point, having queried just over 30,000 accounts. I’m fairly confident this rabbit hole goes a lot deeper, but it would have taken weeks to query the next 50,000 accounts, not to mention the countless more that would have been added to the list during that time.

As with the previous botnet, the creation dates of these accounts spanned over a decade.

Here’s the oldest account I found.

Using the same methodology I used to analyze the survivor accounts from the old botnet, I checked which of these new accounts were restricted by Twitter. There was an almost exactly even split between restricted and non-restricted accounts in this new set.

Given that these new bots show many similarities to the previously discovered botnet (similar avatar pictures, same URL shortening services, similar usage of the English language) we might speculate that this new set of accounts is being managed by the same entity as those older ones. If this is the case, a further hypothesis is that said entity is re-tooling based on Twitter’s action against their previous botnet (for instance, to evade automation).

Because these new accounts use a pinned Tweet to advertise their services, we can test this hypothesis by examining the creation dates of the most recent Tweet from each account. If the entity is indeed re-tooling, all of the accounts should have Tweeted fairly recently. However, a brief examination of last tweet dates for these accounts revealed a rather large distribution, tracing back as far as 2012. The distribution had a long tail, with a majority of the most recent Tweets having been published within the last year. Here’s the last year’s worth of data graphed.

Here’s the oldest Tweet I found:

This data, on its own, would refute the theory that the owner of this botnet has recently been re-tooling. However, a closer look at some of the discovered accounts reveals an interesting story. Here are a few examples.

This account took a 6 year break from Twitter, and switched language to English.

This account mentions a “url in last post” in its bio, but there isn’t one.

This account went from posting in Korean to posting in English, with a 3 year break in between. However, the newer Tweet mentions “url in bio”. Sounds vaguely familiar.

Examining the text contained in the last Tweets from these discovered accounts revealed around 76,000 unique Tweets. Searching these Tweets for links containing the URL shortening services used by the previous botnet revealed 8,200 unique Tweets. Here’s a graph of the creation dates of those particular Tweets.

As we can see, the Tweets containing shortened URLs date back only 21 days. Here’s a distribution of domains seen in those Tweets.
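The filtering-and-tallying step can be sketched like this. The shortener domain list is the one from the earlier post; the example tweets in the test are invented:

```python
import re
from collections import Counter

# Shortener domains observed in the previous botnet's bios.
SHORTENERS = {"me2url.info", "url4.pro", "click2go.info",
              "move2.pro", "zen5go.pro", "go9to.pro"}

# Capture the host part of any http(s) link in the tweet text.
url_re = re.compile(r"https?://([^/\s]+)")

def shortener_domains(tweets):
    """Tally known shortener domains across a collection of tweet texts."""
    counts = Counter()
    for text in tweets:
        for domain in url_re.findall(text):
            domain = domain.lower()
            if domain in SHORTENERS:
                counts[domain] += 1
    return counts
```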

My current hypothesis is that the owner of the previous botnet has purchased a batch of Twitter accounts (of varying ages) and has been, at least for the last 21 days, repurposing those accounts to advertise adult dating sites using the new pinned-Tweet approach.

One final thing – I checked the 2895 survivor accounts from the previously discovered botnet to see if any had been reconfigured to use a pinned Tweet. At the time of checking, only one of those accounts had been changed.

If you’re interested in looking at the data I collected, I’ve uploaded names/ids of all discovered accounts, the follower/following mappings found between these accounts, the gephi save file for the 80,000 node graph, and a list of accounts queried by my script (in case someone would like to continue iterating through the unqueried accounts.) You can find all of that data in this github repo.

Marketing “Dirty Tinder” On Twitter

About a week ago, a Tweet I was mentioned in received a dozen or so “likes” over a very short time period (about two minutes). I happened to be on my computer at the time, and quickly took a look at the accounts that generated those likes. They all followed a similar pattern. Here’s an example of one of the accounts’ profiles:

This particular avatar was very commonly used as a profile picture in these accounts.

All of the accounts I checked contained similar phrases in their description fields. Here’s a list of common phrases I identified:

  • Check out
  • Check this
  • How do you like my site
  • How do you like me
  • You love it harshly
  • Do you like fast
  • Do you like it gently
  • Come to my site
  • Come in
  • Come on
  • Come to me
  • I want you
  • You want me
  • Your favorite
  • Waiting you
  • Waiting you at

All of the accounts also contained links to URLs in their description field that pointed to domains such as the following:

  • me2url.info
  • url4.pro
  • click2go.info
  • move2.pro
  • zen5go.pro
  • go9to.pro

It turns out these are all shortened URLs, and the service behind each of them has the exact same landing page:

“I will ban drugs, spam, porn, etc.” Yeah, right.

My colleague, Sean, checked a few of the links and found that they landed on “adult dating” sites. Using a VPN to change the browser’s exit node, he noticed that the landing pages varied slightly by region. In Finland, the links ended up on a site called “Dirty Tinder”.

Checking further, I noticed that some of the accounts either followed, or were being followed by other accounts with similar traits, so I decided to write a script to programmatically “crawl” this network, in order to see how large it is.

The script I wrote was rather simple. It was seeded with the dozen or so accounts that I originally witnessed, and was designed to iterate through each user’s friends and followers, looking for other accounts displaying similar traits. Whenever a new account was discovered, it was added to the query list, and the process continued. Due to Twitter API rate limits, the crawler loop was throttled so as not to perform more queries than the API allowed, and hence crawling the network took quite some time.
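The crawl logic described above amounts to a breadth-first traversal. Here’s a minimal Python 3 sketch of that loop, with a stand-in `get_neighbors` function (backed by a toy graph) in place of the rate-limited Twitter API calls, and a `looks_suspicious` predicate standing in for the profile-trait checks; all names here are illustrative, not from the original script.

```python
from collections import deque

def crawl_network(seed_accounts, get_neighbors, looks_suspicious):
    """Breadth-first crawl: starting from seed accounts, repeatedly fetch
    friends/followers and queue any newly discovered account that matches
    our traits. Also record follow edges between discovered accounts."""
    discovered = set(seed_accounts)
    to_query = deque(seed_accounts)
    queried = set()
    edges = []
    while to_query:
        account = to_query.popleft()
        if account in queried:
            continue
        queried.add(account)
        # In the real script this is a throttled Twitter API call
        for neighbor in get_neighbors(account):
            if looks_suspicious(neighbor) and neighbor not in discovered:
                discovered.add(neighbor)
                to_query.append(neighbor)
            if neighbor in discovered:
                edges.append((account, neighbor))
    return discovered, edges

# Toy follower graph standing in for the Twitter API
graph = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
found, edges = crawl_network(["a"], lambda u: graph.get(u, []), lambda u: True)
```

The `queried` set is what lets the crawl be resumed later: the repo’s list of already-queried accounts plays exactly this role.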

My script recorded a graph of which accounts were following/followed by which other accounts. After a few hours I checked the output and discovered an interesting pattern:

Graph of follower/following relationships between identified accounts after about a day of running the discovery script.

The discovered accounts seemed to be forming independent “clusters” (through follow/friend relationships). This is not what you’d expect from a normal social interaction graph.

After running for several days the script had queried about 3000 accounts, and discovered a little over 22,000 accounts with similar traits. I stopped it there. Here’s a graph of the resulting network.

Pretty much the same pattern I’d seen after one day of crawling still existed after one week. Only a few of the clusters weren’t “flower” shaped. Here are a few zoomed-in views of the graph.
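In graph terms, these isolated clusters are connected components: groups of accounts linked to each other by chains of follow/friend relationships, with no links to the rest of the network. Here’s a small Python 3 sketch (toy edge data, not the actual dataset) of how such components could be extracted from a list of follower edges:

```python
def connected_components(edges):
    """Group accounts into clusters: two accounts share a cluster if any
    chain of follow/friend links connects them (direction ignored)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = set()
    clusters = []
    for node in adjacency:
        if node in seen:
            continue
        # Flood-fill outward from this node to collect its component
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Two separate "flower"-shaped clusters around two hub accounts
edges = [("hub1", "a"), ("hub1", "b"), ("hub2", "x"), ("hub2", "y")]
clusters = connected_components(edges)
```

A normal social graph would yield one giant component with dense cross-links; here, the crawl kept turning up small, mutually disconnected flowers.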

 

Since I’d originally noticed several of these accounts liking the same tweet over a short period of time, I decided to check if the accounts in these clusters had anything in common. I started by checking this one:

Oddly enough, there were absolutely no similarities between these accounts. They were all created at very different times and all Tweeted/liked different things at different times. I checked a few other clusters and obtained similar results.

One interesting thing I found was that the accounts were created over a very long time period. Some of the accounts discovered were over eight years old. Here’s a breakdown of the account ages:

As you can see, this group has fewer new accounts in it than older ones. The big spike in the middle of the chart represents accounts that are about six years old. One reason there are fewer new accounts in this network is that Twitter’s automation seems able to flag behaviors or patterns in fresh accounts and automatically restrict or suspend them. In fact, while my crawler was running, many of the accounts in the graphs above were restricted or suspended.
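An age breakdown like the one above can be derived from the `created_at` field of each Twitter User object. A small Python 3 sketch, using made-up timestamps and a fixed “now” purely for illustration:

```python
from collections import Counter
from datetime import datetime

def age_in_years(created_at, now):
    """Twitter's created_at field looks like 'Sat Mar 10 12:34:56 +0000 2018';
    the '+0000' is matched as literal text since the offset is always UTC."""
    created = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
    return (now - created).days // 365

# Made-up creation timestamps and a fixed reference date, for illustration
now = datetime(2018, 3, 1)
created_dates = [
    "Mon Jan 02 10:00:00 +0000 2012",
    "Tue Feb 07 09:30:00 +0000 2012",
    "Wed May 05 17:45:00 +0000 2016",
]
ages = Counter(age_in_years(c, now) for c in created_dates)
```

Plotting such a counter (age in years against number of accounts) produces the kind of chart shown above.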

Here are a few more breakdowns – Tweets published, likes, followers and following.

Here’s a collage of some of the profile pictures found. I modified a python script to generate this – far better than using one of those “free” collage making tools available on the Internets. 🙂

So what are these accounts doing? For the most part, it seems they’re simply trying to advertise the “adult dating” sites linked in the account profiles. They do this by liking, retweeting, and following random Twitter accounts at random times, fishing for clicks. I did find one that had been helping to sell stuff:

Individually the accounts probably don’t break any of Twitter’s terms of service. However, all of these accounts are likely controlled by a single entity. This network of accounts seems quite benign, but in theory, it could be quickly repurposed for other tasks including “Twitter marketing” (paid services to pad an account’s followers or engagement), or to amplify specific messages.

If you’re interested, I’ve saved a list of both screen_name and id_str for each discovered account here. You can also find the scraps of code I used while performing this research in that same github repo.

How To Get Twitter Follower Data Using Python And Tweepy

In January 2018, I wrote a couple of blog posts outlining some analysis I’d performed on followers of popular Finnish Twitter profiles. A few people asked that I share the tools used to perform that research. Today, I’ll share a tool similar to the one I used to conduct that research, and at the same time, illustrate how to obtain data about a Twitter account’s followers.

This tool uses Tweepy to connect to the Twitter API. In order to enumerate a target account’s followers, I like to start by using Tweepy’s followers_ids() function to get a list of Twitter ids of accounts that are following the target account. This call completes in a single query, and gives us a list of Twitter ids that can be saved for later use (since both screen_name and name can be changed, but the account’s id never changes). Once I’ve obtained a list of Twitter ids, I can use Tweepy’s lookup_users(user_ids=batch) to obtain Twitter User objects for each Twitter id. As far as I know, this isn’t exactly the documented way of obtaining this data, but it suits my needs. /shrug

Once a full set of Twitter User objects has been obtained, we can perform analysis on it. In the following tool, I chose to look at the account age and friends_count of each account returned, print a summary, and save a summarized form of each account’s details as json, for potential further processing. Here’s the full code:

from tweepy import OAuthHandler
from tweepy import API
from collections import Counter
from datetime import datetime, date, time, timedelta
import sys
import json
import os
import io
import re
import time

# Helper functions to load and save intermediate steps
def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except:
            pass
    return ret

def try_load_or_process(filename, processor_fn, function_arg):
    load_fn = None
    save_fn = None
    if filename.endswith("json"):
        load_fn = load_json
        save_fn = save_json
    else:
        load_fn = load_bin
        save_fn = save_bin
    if os.path.exists(filename):
        print("Loading " + filename)
        return load_fn(filename)
    else:
        ret = processor_fn(function_arg)
        print("Saving " + filename)
        save_fn(ret, filename)
        return ret

# Some helper functions to convert between different time formats and perform date calculations
def twitter_time_to_object(time_string):
    twitter_format = "%a %b %d %H:%M:%S %Y"
    match_expression = "^(.+)\s(\+[0-9][0-9][0-9][0-9])\s([0-9][0-9][0-9][0-9])$"
    match = re.search(match_expression, time_string)
    if match is not None:
        first_bit = match.group(1)
        second_bit = match.group(2)
        last_bit = match.group(3)
        new_string = first_bit + " " + last_bit
        date_object = datetime.strptime(new_string, twitter_format)
        return date_object

def time_object_to_unix(time_object):
    return int(time_object.strftime("%s"))

def twitter_time_to_unix(time_string):
    return time_object_to_unix(twitter_time_to_object(time_string))

def seconds_since_twitter_time(time_string):
    input_time_unix = int(twitter_time_to_unix(time_string))
    current_time_unix = int(get_utc_unix_time())
    return current_time_unix - input_time_unix

def get_utc_unix_time():
    dts = datetime.utcnow()
    return time.mktime(dts.timetuple())

# Get a list of follower ids for the target account
def get_follower_ids(target):
    return auth_api.followers_ids(target)

# Twitter API allows us to batch query 100 accounts at a time
# So we'll create batches of 100 follower ids and gather Twitter User objects for each batch
def get_user_objects(follower_ids):
    batch_len = 100
    num_batches = len(follower_ids) / 100
    batches = (follower_ids[i:i+batch_len] for i in range(0, len(follower_ids), batch_len))
    all_data = []
    for batch_count, batch in enumerate(batches):
        sys.stdout.write("\r")
        sys.stdout.flush()
        sys.stdout.write("Fetching batch: " + str(batch_count) + "/" + str(num_batches))
        sys.stdout.flush()
        users_list = auth_api.lookup_users(user_ids=batch)
        users_json = (map(lambda t: t._json, users_list))
        all_data += users_json
    return all_data

# Creates one week length ranges and finds items that fit into those range boundaries
def make_ranges(user_data, num_ranges=20):
    range_max = 604800 * num_ranges
    range_step = range_max/num_ranges

# We create ranges and labels first and then iterate these when going through the whole list
# of user data, to speed things up
    ranges = {}
    labels = {}
    for x in range(num_ranges):
        start_range = x * range_step
        end_range = x * range_step + range_step
        label = "%02d" % x + " - " + "%02d" % (x+1) + " weeks"
        labels[label] = []
        ranges[label] = {}
        ranges[label]["start"] = start_range
        ranges[label]["end"] = end_range
    for user in user_data:
        if "created_at" in user:
            account_age = seconds_since_twitter_time(user["created_at"])
            for label, timestamps in ranges.iteritems():
                if account_age > timestamps["start"] and account_age < timestamps["end"]:
                    entry = {} 
                    id_str = user["id_str"] 
                    entry[id_str] = {} 
                    fields = ["screen_name", "name", "created_at", "friends_count", "followers_count", "favourites_count", "statuses_count"] 
                    for f in fields: 
                        if f in user: 
                            entry[id_str][f] = user[f] 
                    labels[label].append(entry) 
    return labels

if __name__ == "__main__": 
    account_list = [] 
    if (len(sys.argv) > 1):
        account_list = sys.argv[1:]

    if len(account_list) < 1:
        print("No parameters supplied. Exiting.")
        sys.exit(0)

    consumer_key=""
    consumer_secret=""
    access_token=""
    access_token_secret=""

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    auth_api = API(auth)

    for target in account_list:
        print("Processing target: " + target)

# Get a list of Twitter ids for followers of target account and save it
        filename = target + "_follower_ids.json"
        follower_ids = try_load_or_process(filename, get_follower_ids, target)

# Fetch Twitter User objects from each Twitter id found and save the data
        filename = target + "_followers.json"
        user_objects = try_load_or_process(filename, get_user_objects, follower_ids)
        total_objects = len(user_objects)

# Record a few details about each account that falls between specified age ranges
        ranges = make_ranges(user_objects)
        filename = target + "_ranges.json"
        save_json(ranges, filename)

# Print a few summaries
        print
        print("\t\tFollower age ranges")
        print("\t\t===================")
        total = 0
        following_counter = Counter()
        for label, entries in sorted(ranges.iteritems()):
            print("\t\t" + str(len(entries)) + " accounts were created within " + label)
            total += len(entries)
            for entry in entries:
                for id_str, values in entry.iteritems():
                    if "friends_count" in values:
                        following_counter[values["friends_count"]] += 1
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print
        print("\t\tMost common friends counts")
        print("\t\t==========================")
        total = 0
        for num, count in following_counter.most_common(20):
            total += count
            print("\t\t" + str(count) + " accounts are following " + str(num) + " accounts")
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print
        print

Let’s run this tool against a few accounts and see what results we get. First up: @realDonaldTrump

Age ranges of new accounts following @realDonaldTrump

As we can see, over 80% of @realDonaldTrump’s last 5000 followers are very new accounts (less than 20 weeks old), with a majority of those being under a week old. Here’s the top friends_count values of those accounts:

Most common friends_count values seen amongst the new accounts following @realDonaldTrump

No obvious pattern is present in this data.

Next up, an account I looked at in a previous blog post – @niinisto (the president of Finland).

Age ranges of new accounts following @niinisto

Many of @niinisto’s last 5000 followers are new Twitter accounts, though not in as large a proportion as in the @realDonaldTrump case. In both of the above cases, this is to be expected, since both accounts are recommended to new users of Twitter. Let’s look at the friends_count values for the above set.

Most common friends_count values seen amongst the new accounts following @niinisto

In some cases, clicking through the creation of a new Twitter account (next, next, next, finish) will create an account that follows 21 Twitter profiles. This can explain the high proportion of accounts in this list with a friends_count value of 21. However, we might expect to see the same (or an even stronger) pattern with the @realDonaldTrump account, and we don’t. I’m not sure why this is the case, but it could be that Twitter has some automation in place to auto-delete programmatically created accounts. If you look at the output of my script, you’ll see that between fetching the list of Twitter ids for the last 5000 followers of @realDonaldTrump and fetching the full Twitter User objects for those ids, 3 accounts “went missing” (and hence the tool only collected data for 4997 accounts).
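Spotting such disappearances is just a set difference between the ids we requested and the ids that actually came back from the lookup. A minimal Python 3 sketch with hypothetical ids:

```python
def find_missing_accounts(requested_ids, user_objects):
    """Accounts suspended or deleted between the two API calls simply don't
    appear in the lookup response; a set difference reveals which ones."""
    returned_ids = set(user["id_str"] for user in user_objects)
    return [uid for uid in requested_ids if uid not in returned_ids]

# Hypothetical data: we asked for four ids, only two User objects came back
requested = ["100", "101", "102", "103"]
returned = [{"id_str": "100"}, {"id_str": "102"}]
missing = find_missing_accounts(requested, returned)
```

Iterating over the requested list (rather than subtracting sets directly) preserves the original ordering of the missing ids.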

Finally, just for good measure, I ran the tool against my own account (@r0zetta).

Age ranges of new accounts following @r0zetta

Here you see a distribution that’s probably common for non-celebrity Twitter accounts. Not many of my followers have new accounts. What’s more, there’s absolutely no pattern in the friends_count values of these accounts:

Most common friends_count values seen amongst the new accounts following @r0zetta

Of course, there are plenty of other interesting analyses that can be performed on the data collected by this tool. Once the script has been run, all data is saved on disk as json files, so you can process it to your heart’s content without having to run additional queries against Twitter’s servers. As usual, have fun extending this tool to your own needs, and if you’re interested in reading some of my other guides or analyses, here’s a full list of those articles.
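As one example of such post-processing, the saved User objects can be mined for follower-to-friend ratios, a crude signal often cited as a trait of bulk-created accounts (following many, followed by few). A Python 3 sketch over made-up account summaries; the field names match those saved by the script above, but the sample data and threshold interpretation are illustrative only:

```python
def follow_ratios(user_objects):
    """Compute followers/friends for each account; very low ratios are
    one crude indicator of bulk-created or spammy accounts."""
    ratios = []
    for user in user_objects:
        friends = user.get("friends_count", 0)
        followers = user.get("followers_count", 0)
        # Guard against division by zero for accounts following nobody
        ratios.append((user["screen_name"], round(followers / max(friends, 1), 2)))
    return ratios

# Made-up summaries in the shape the script saves to its json files
sample = [
    {"screen_name": "normal_user", "friends_count": 200, "followers_count": 180},
    {"screen_name": "likely_bot", "friends_count": 2000, "followers_count": 3},
]
ratios = follow_ratios(sample)
```

In practice you’d load one of the saved `*_followers.json` files with `json.load()` and feed the resulting list straight into a function like this.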

Searching Twitter With Twarc

Twarc makes it really easy to search Twitter via the API. Simply create a twarc object using your own API keys and then pass your search query into twarc’s search() function to get a stream of Tweet objects. Remember that, by default, the Twitter API will only return results from the last 7 days. However, this is useful enough if we’re looking for fresh information on a topic.

Since this methodology is so simple, posting code for a tool that simply prints the resulting tweets to stdout would make for a boring blog post. Here I present a tool that collects a bunch of metadata from the returned Tweet objects. Here’s what it does:

  • records frequency distributions of URLs, hashtags, and users
  • records interactions between users and hashtags
  • outputs csv files that can be imported into Gephi for graphing
  • downloads all images found in Tweets
  • records each Tweet’s text along with the URL of the Tweet

The code doesn’t really need explanation, so here’s the whole thing.

from collections import Counter
from itertools import combinations
from twarc import Twarc
import requests
import sys
import os
import shutil
import io
import re
import json

# Helper functions for saving json, csv and formatted txt files
def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def save_csv(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    handle.write(u"Source,Target,Weight\n")
    for source, targets in sorted(data.items()):
      for target, count in sorted(targets.items()):
        if source != target and source is not None and target is not None:
          handle.write(source + u"," + target + u"," + unicode(count) + u"\n")

def save_text(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    for item, count in data.most_common():
      handle.write(unicode(count) + "\t" + item + "\n")

# Returns the screen_name of the user retweeted, or None
def retweeted_user(status):
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        return user["screen_name"]

# Returns a list of screen_names that the user interacted with in this Tweet
def get_interactions(status):
  interactions = []
  if "in_reply_to_screen_name" in status:
    replied_to = status["in_reply_to_screen_name"]
    if replied_to is not None and replied_to not in interactions:
      interactions.append(replied_to)
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "quoted_status" in status:
    orig_tweet = status["quoted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "entities" in status:
    entities = status["entities"]
    if "user_mentions" in entities:
      for item in entities["user_mentions"]:
        if item is not None and "screen_name" in item:
          mention = item['screen_name']
          if mention is not None and mention not in interactions:
            interactions.append(mention)
  return interactions

# Returns a list of hashtags found in the tweet
def get_hashtags(status):
  hashtags = []
  if "entities" in status:
    entities = status["entities"]
    if "hashtags" in entities:
      for item in entities["hashtags"]:
        if item is not None and "text" in item:
          hashtag = item['text']
          if hashtag is not None and hashtag not in hashtags:
            hashtags.append(hashtag)
  return hashtags

# Returns a list of URLs found in the Tweet
def get_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
    if "urls" in entities:
      for item in entities["urls"]:
        if item is not None and "expanded_url" in item:
          url = item['expanded_url']
          if url is not None and url not in urls:
            urls.append(url)
  return urls

# Returns the URLs to any images found in the Tweet
def get_image_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
    if "media" in entities:
      for item in entities["media"]:
        if item is not None:
          if "media_url" in item:
            murl = item["media_url"]
            if murl not in urls:
              urls.append(murl)
  return urls

# Main starts here
if __name__ == '__main__':
# Add your own API key values here
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Check that search terms were provided at the command line
  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]
  else:
    print("No search terms provided. Exiting.")
    sys.exit(0)

  num_targets = len(target_list)
  for count, target in enumerate(target_list):
    print(str(count + 1) + "/" + str(num_targets) + " searching on target: " + target)
# Create a separate save directory for each search query
# Since search queries can be a whole sentence, we'll check the length
# and simply number it if the query is overly long
    save_dir = ""
    if len(target) < 30:
      save_dir = target.replace(" ", "_")
    else:
      save_dir = "target_" + str(count + 1)
    if not os.path.exists(save_dir):
      print("Creating directory: " + save_dir)
      os.makedirs(save_dir)
# Variables for capturing stuff
    tweets_captured = 0
    influencer_frequency_dist = Counter()
    mentioned_frequency_dist = Counter()
    hashtag_frequency_dist = Counter()
    url_frequency_dist = Counter()
    user_user_graph = {}
    user_hashtag_graph = {}
    hashtag_hashtag_graph = {}
    all_image_urls = []
    tweets = {}
    tweet_count = 0
# Start the search
    for status in twarc.search(target):
# Output some status as we go, so we know something is happening
      sys.stdout.write("\r")
      sys.stdout.flush()
      sys.stdout.write("Collected " + str(tweet_count) + " tweets.")
      sys.stdout.flush()
      tweet_count += 1
    
      screen_name = None
      if "user" in status:
        if "screen_name" in status["user"]:
          screen_name = status["user"]["screen_name"]

      retweeted = retweeted_user(status)
      if retweeted is not None:
        influencer_frequency_dist[retweeted] += 1
      else:
        influencer_frequency_dist[screen_name] += 1

# Tweet text can be in either "text" or "full_text" field...
      text = None
      if "full_text" in status:
        text = status["full_text"]
      elif "text" in status:
        text = status["text"]

      id_str = None
      if "id_str" in status:
        id_str = status["id_str"]

# Assemble the URL to the tweet we received...
      tweet_url = None
      if id_str is not None and screen_name is not None:
        tweet_url = "https://twitter.com/" + screen_name + "/status/" + id_str

# ...and capture it
      if tweet_url is not None and text is not None:
        tweets[tweet_url] = text

# Record mapping graph between users
      interactions = get_interactions(status)
      if interactions is not None:
        for user in interactions:
          mentioned_frequency_dist[user] += 1
          if screen_name not in user_user_graph:
            user_user_graph[screen_name] = {}
          if user not in user_user_graph[screen_name]:
            user_user_graph[screen_name][user] = 1
          else:
            user_user_graph[screen_name][user] += 1

# Record mapping graph between users and hashtags
      hashtags = get_hashtags(status)
      if hashtags is not None:
        if len(hashtags) > 1:
          hashtag_interactions = []
# This code creates pairs of hashtags in situations where multiple
# hashtags were found in a tweet
# This is used to create a graph of hashtag-hashtag interactions
          for comb in combinations(sorted(hashtags), 2):
            hashtag_interactions.append(comb)
          if len(hashtag_interactions) > 0:
            for inter in hashtag_interactions:
              item1, item2 = inter
              if item1 not in hashtag_hashtag_graph:
                hashtag_hashtag_graph[item1] = {}
              if item2 not in hashtag_hashtag_graph[item1]:
                hashtag_hashtag_graph[item1][item2] = 1
              else:
                hashtag_hashtag_graph[item1][item2] += 1
          for hashtag in hashtags:
            hashtag_frequency_dist[hashtag] += 1
            if screen_name not in user_hashtag_graph:
              user_hashtag_graph[screen_name] = {}
            if hashtag not in user_hashtag_graph[screen_name]:
              user_hashtag_graph[screen_name][hashtag] = 1
            else:
              user_hashtag_graph[screen_name][hashtag] += 1

      urls = get_urls(status)
      if urls is not None:
        for url in urls:
          url_frequency_dist[url] += 1

      image_urls = get_image_urls(status)
      if image_urls is not None:
        for url in image_urls:
          if url not in all_image_urls:
            all_image_urls.append(url)

# Iterate through image URLs, fetching each image if we haven't already
    print
    print("Fetching images.")
    pictures_dir = os.path.join(save_dir, "images")
    if not os.path.exists(pictures_dir):
      print("Creating directory: " + pictures_dir)
      os.makedirs(pictures_dir)
    for url in all_image_urls:
      m = re.search("^http:\/\/pbs\.twimg\.com\/media\/(.+)$", url)
      if m is not None:
        filename = m.group(1)
        print("Getting picture from: " + url)
        save_path = os.path.join(pictures_dir, filename)
        if not os.path.exists(save_path):
          response = requests.get(url, stream=True)
          with open(save_path, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
          del response

# Output a bunch of files containing the data we just gathered
    print("Saving data.")
    json_outputs = {"tweets.json": tweets,
                    "urls.json": url_frequency_dist,
                    "hashtags.json": hashtag_frequency_dist,
                    "influencers.json": influencer_frequency_dist,
                    "mentioned.json": mentioned_frequency_dist,
                    "user_user_graph.json": user_user_graph,
                    "user_hashtag_graph.json": user_hashtag_graph,
                    "hashtag_hashtag_graph.json": hashtag_hashtag_graph}
    for name, dataset in json_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_json(dataset, filename)

# These files are created in a format that can be easily imported into Gephi
    csv_outputs = {"user_user_graph.csv": user_user_graph,
                   "user_hashtag_graph.csv": user_hashtag_graph,
                   "hashtag_hashtag_graph.csv": hashtag_hashtag_graph}
    for name, dataset in csv_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_csv(dataset, filename)

    text_outputs = {"hashtags.txt": hashtag_frequency_dist,
                    "influencers.txt": influencer_frequency_dist,
                    "mentioned.txt": mentioned_frequency_dist,
                    "urls.txt": url_frequency_dist}
    for name, dataset in text_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_text(dataset, filename)

Running this tool will create a directory for each search term provided at the command-line. To search for a sentence, or to include multiple terms, enclose the argument with quotes. Due to Twitter’s rate limiting, your search may hit a limit, and need to pause to wait for the rate limit to reset. Luckily twarc takes care of that. Once the search is finished, a bunch of files will be written to the previously created directory.

Since I use a Mac, I can use its Quick Look functionality from the Finder to browse the output files created. Since pytorch is gaining a lot of interest, I ran my script against that search term. Here are some examples of how I can quickly view the output files.

The preview pane is enough to get an overview of the recorded data.

 

Pressing spacebar opens the file in Quick Look, which is useful for data that doesn’t fit neatly into the preview pane

Importing the user_user_graph.csv file into Gephi provided me with some neat visualizations about the pytorch community.

A full zoom out of the pytorch community

Here we can see who the main influencers are. It seems that Yann LeCun and François Chollet are Tweeting about pytorch, too.

Here’s a zoomed-in view of part of the network.

Zoomed in view of part of the Gephi graph generated.

If you enjoyed this post, check out the previous two articles I published on using the Twitter API here and here. I hope you have fun tailoring this script to your own needs!

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once, and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret
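The caching behaviour described above is easiest to see with a toy processor function. Here’s a minimal standalone sketch (the square_all processor and the temporary file path are hypothetical stand-ins, and plain json is used in place of the save/load helpers above):

```python
import json
import os
import tempfile

def try_load_or_process(filename, processor_fn, function_arg):
  # Load a previously saved result if it exists; otherwise compute and cache it.
  if os.path.exists(filename):
    with open(filename) as f:
      return json.load(f)
  ret = processor_fn(function_arg)
  with open(filename, "w") as f:
    json.dump(ret, f)
  return ret

def square_all(nums):  # stand-in for an expensive processing step
  return [n * n for n in nums]

cache_file = os.path.join(tempfile.mkdtemp(), "squares.json")
first = try_load_or_process(cache_file, square_all, [1, 2, 3])
second = try_load_or_process(cache_file, square_all, [4, 5, 6])
print(first)   # [1, 4, 9]
print(second)  # [1, 4, 9] -- loaded from disk, square_all never ran again
```

The second call returns the cached output even though its argument differs, which is exactly why it’s worth eyeballing the saved files between runs.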

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for the list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored to the raw input data, so you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret
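To see what those regular expressions actually strip, here’s a quick sketch run against a single made-up Tweet (the character-whitelist and deduplication steps are omitted for brevity):

```python
import re

# The same patterns used in process_raw_data() above.
url_match = r"(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
name_match = r"\@[\_0-9a-zA-Z]+\:?"

# A made-up example Tweet.
text = u"RT @someuser: Great thread about #vaalit2018 https://example.com/abc"
text = re.sub(url_match, u"", text)   # strip URLs
text = re.sub(name_match, u"", text)  # strip @usernames
text = re.sub(r"^RT\:?", u"", text)   # strip the retweet prefix
cleaned = text.strip()
print(cleaned)  # Great thread about #vaalit2018
```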

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    for w in words:
      w = w.strip().lower()
      # Drop the whitespace runs captured by the split, plus any empty tokens.
      if len(w) > 0:
        tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret
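Note that the split pattern captures the whitespace runs as tokens too, which is why the loop has to discard them. A standalone sketch, using a made-up sentence:

```python
import re

words = re.split(r'(\s+)', u"Great thread about   #vaalit2018")
print(words)   # ['Great', ' ', 'thread', ' ', 'about', '   ', '#vaalit2018']

# Keep only the non-whitespace tokens, lowercased, as tokenize_sentences() does.
tokens = [w.strip().lower() for w in words if len(w.strip()) > 0]
print(tokens)  # ['great', 'thread', 'about', '#vaalit2018']
```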

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None and token in stopwords:
          token = None
        if token is not None:
          if re.search("^[0-9\.\-\s\/]+$", token):
            token = None
        if token is not None:
          cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq
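Here’s what that looks like on a tiny hand-made corpus, including the frequency-based trimming that the main routine performs later (the corpus and the min_frequency_val value are illustrative):

```python
from collections import Counter

corpus = [["presidentti", "vaalit"], ["presidentti", "suomi"], ["presidentti", "vaalit"]]
frequencies = Counter()
for sentence in corpus:
  for word in sentence:
    frequencies[word] += 1
freq = frequencies.most_common()
print(freq[0])  # ('presidentti', 3)

# Vocabulary trimming: keep only words seen at least min_frequency_val times.
min_frequency_val = 2
trimmed_vocab = [word for word, count in freq if count >= min_frequency_val]
print(trimmed_vocab)  # ['presidentti', 'vaalit']
```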

Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the directory where this script was run:

tensorboard --logdir=analysis

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard's projector view

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is finding similarities between words, via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns the num_similar most similar words. The returned value is a list containing the queried word and a list of similar words ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graph visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()
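If you just want to see the shape of what T-SNE produces before wiring up matplotlib, here’s a minimal sketch that substitutes random vectors for the word2vec output (41 vectors standing in for a query word plus its 40 neighbours):

```python
import numpy as np
from sklearn.manifold import TSNE

# 41 random 200-dimensional vectors standing in for word embeddings.
arr = np.random.RandomState(1).rand(41, 200).astype('f')

tsne = TSNE(n_components=2, random_state=1)
Y = tsne.fit_transform(arr)
x_coords = Y[:, 0]
y_coords = Y[:, 1]
print(Y.shape)  # (41, 2): one 2-D point per input vector
```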

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming, it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print("")
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for the 40 most similar words, and plot both labelled graphs of each set, and a “big plot” of that set on the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.

NLP Analysis And Visualizations Of #presidentinvaalit2018

During the lead-up to the January 2018 Finnish presidential elections, I collected a dataset consisting of raw Tweets gathered from search words related to the election. I then performed a series of natural language processing experiments on this raw data. The methodology, including all the code used, can be found in an accompanying blog post. This article details the results of my experiments, and shows some of the visualizations generated.

I pre-processed the raw dataset, used it to train a word2vec model, and then used that model to perform analyses using word2vec.wv.most_similar(), T-SNE, and Tensorboard.

My first experiment involved creating scatterplots of words found to be similar to frequently encountered tokens within the Twitter data. I looked at the 50 most frequent tokens encountered in this way, and used T-SNE to reduce the dimensionality of the set of vectors generated in each case. Results were plotted using matplotlib. Here are a few examples of the output generated.

T-SNE scatterplot of the 40 most similar words to #laura2018

Here you can see that word2vec easily identified other hashtags related to the #laura2018 campaign, including #suomitakaisin, #suomitakas, #siksilaura and #siksips. Laura Huhtasaari was candidate number 5 on the voting slip, and that was also identified, along with other hashtags associated with her name.

T-SNE scatterplot of the 40 most similar words to #turpo

Here’s an analysis of the hashtag #turpo (short for turvallisuuspolitiikka – National Security). Here you can see that word2vec identified many references to NATO (one issue that was touched upon during election campaigning), jäsenyys (membership), #ulpo – ulkopolitiikka (Foreign Policy), and references to regions and countries (venäjä – Russia, ruotsi – Sweden, itämeri – Baltic).

T-SNE scatterplot of the 40 most similar words to venäjä

On a similar note, here’s a scatterplot of words similar to venäjä (Russia). As expected, word2vec identified NATO as closely related. Names of countries are expected to register as similar in word2vec, and we see Ruotsi (Sweden), Ukraine, USA, Turkki (Turkey), Syria, and Kiina (China). Word2vec also finds the word Putin to be similar, and, interestingly, Neuvostoliitto (USSR) was mentioned in the Twitter data.

T-SNE scatterplot of the 40 most similar words to presidentti

Above is a scatterplot based on the word “presidentti” (president). Note how word2vec identified Halonen, Urho, Kekkonen, Donald, and Trump.

Moving on, I took the names of the eight presidential candidates in Sunday’s election, and plotted them, along with the 40 most similar guesses from word2vec, on scatterplots of the entire vocabulary. Here are the results.

All candidates plotted against the full vocabulary. The blue dot is the target. Red dots are similar tokens.

As you can see above, all of the candidates occupied separate spaces on the graph, and there was very little overlap amongst words similar to each candidate’s name.

I created word embeddings using Tensorflow, and opened the resulting log files in Tensorboard in order to produce some visualizations with that tool. Here are some of the outputs.

Tensorboard visualization of words related to #haavisto2018 on a 2D representation of word embeddings, dimensionally reduced using T-SNE

The above shows word vectors in close proximity to #haavisto2018, based on the embeddings I created (from the word2vec model). Here you can find references to Tavastia, a club in Helsinki where Pekka Haavisto’s campaign hosted an event on 20th January 2018. Words clearly associated with this event include liput (tickets), ilta (evening), livenä (live), and biisejä (songs). The event was called “Siksipekka”. Here’s a view of that hashtag.

Again, we see similar words, including konsertti (concert). Another nearby word vector identified was #vihreät (the green party).

In my last experiment, I compiled lists of similar words for all of the top 50 most frequent words found in the Twitter data, and recorded associations between the lists generated. I imported this data into Gephi, and generated some graphs with it.

I got interested in Gephi after recently collaborating with Erin Gallagher (@3r1nG) to visualize the data I collected on some bots found to be following Finnish recommended Twitter accounts. I highly recommend that you check out some of her other blog posts, where you’ll see some amazing visualizations. Gephi is a powerful tool, but it takes quite some time to master. As you’ll see, my attempts at using it pale in comparison to what Erin can do.

A zoomed-out view of the mapping between the 40 most similar words to the 50 most frequent words in the Twitter data collected

The above is a graph of all the words found. Larger circles indicate that a word has more other words associated with it.

A zoomed-in view of some of the candidates

Here’s a zoom-in on some of the candidates. Note that I treated hashtags as unique words, which turned out to be useful for this analysis. For reference, here are a few translations: äänestää = to vote, vaalit = elections, puhuu = to speak, presidenttiehdokas = presidential candidate.

Words related to foreign policy and national security

Here is a zoomed-in view of the words associated with foreign policy and national security.

Words associated with Suomi (Finland)

Finally, here are some words associated with #suomi (Finland). Note lots of references to nature (luonto), winter (talvi), and snow (lumi).

As you might have gathered, word2vec finds interesting and fairly accurate associations between words, even in messy data such as Tweets. I plan on delving further into this area in hopes of finding some techniques that might improve the Twitter research I’ve been doing. The dataset collected during the Finnish elections was fairly small (under 150,000 Tweets). Many of the other datasets I work with are orders of magnitude larger. Hence I’m particularly interested in figuring out if there’s a way to accurately cluster Twitter data using these techniques.


How To Get Tweets From A Twitter Account Using Python And Tweepy

In this blog post, I’ll explain how to obtain data from a specified Twitter account using tweepy and Python. Let’s jump straight into the code!

As usual, we’ll start off by importing dependencies. I’ll use the datetime and Counter modules later on to do some simple analysis tasks.

from tweepy import OAuthHandler
from tweepy import API
from tweepy import Cursor
from datetime import datetime, date, time, timedelta
from collections import Counter
import sys

The next bit creates a tweepy API object that we will use to query for data from Twitter. As usual, you’ll need to create a Twitter application in order to obtain the relevant authentication keys and fill in those empty strings. You can find a link to a guide about that in one of the previous articles in this series.

consumer_key=""
consumer_secret=""
access_token=""
access_token_secret=""

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
auth_api = API(auth)

Names of accounts to be queried will be passed in as command-line arguments. I’m going to exit the script if no args are passed, since there would be no reason to continue.

account_list = []
if (len(sys.argv) > 1):
  account_list = sys.argv[1:]
else:
  print("Please provide a list of usernames at the command line.")
  sys.exit(0)

Next, let’s iterate through the account names passed and use tweepy’s API.get_user() to obtain a few details about the queried account.

if len(account_list) > 0:
  for target in account_list:
    print("Getting data for " + target)
    item = auth_api.get_user(target)
    print("name: " + item.name)
    print("screen_name: " + item.screen_name)
    print("description: " + item.description)
    print("statuses_count: " + str(item.statuses_count))
    print("friends_count: " + str(item.friends_count))
    print("followers_count: " + str(item.followers_count))

Twitter User Objects contain a created_at field that holds the creation date of the account. We can use this to calculate the age of the account, and since we also know how many Tweets that account has published (statuses_count), we can calculate the average Tweets-per-day rate of that account. Tweepy provides time-related values as datetime objects, which make it easy to calculate things like time deltas.

    tweets = item.statuses_count
    account_created_date = item.created_at
    delta = datetime.utcnow() - account_created_date
    account_age_days = delta.days
    print("Account age (in days): " + str(account_age_days))
    if account_age_days > 0:
      print("Average tweets per day: " + "%.2f"%(float(tweets)/float(account_age_days)))
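As a self-contained version of the same calculation, with hypothetical numbers standing in for item.statuses_count and item.created_at:

```python
from datetime import datetime

# Hypothetical stand-ins for item.statuses_count and item.created_at.
statuses_count = 14600
account_created_date = datetime(2015, 1, 1)

delta = datetime.utcnow() - account_created_date
account_age_days = delta.days
print("Account age (in days): " + str(account_age_days))
if account_age_days > 0:
  rate = float(statuses_count) / float(account_age_days)
  print("Average tweets per day: " + "%.2f" % rate)
```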

Next, let’s iterate through the user’s Tweets using tweepy’s API.user_timeline(). Tweepy’s Cursor allows us to stream data from the query without having to manually fetch it in batches. The Twitter API will return around 3200 Tweets using this method (which can take a while). To make things quicker, and to show another example of datetime usage, we’re going to break out of the loop once we hit Tweets that are more than 30 days old. While looping, we’ll collect lists of all hashtags and mentions seen in Tweets.

    hashtags = []
    mentions = []
    tweet_count = 0
    end_date = datetime.utcnow() - timedelta(days=30)
    for status in Cursor(auth_api.user_timeline, id=target).items():
      tweet_count += 1
      if hasattr(status, "entities"):
        entities = status.entities
        if "hashtags" in entities:
          for ent in entities["hashtags"]:
            if ent is not None:
              if "text" in ent:
                hashtag = ent["text"]
                if hashtag is not None:
                  hashtags.append(hashtag)
        if "user_mentions" in entities:
          for ent in entities["user_mentions"]:
            if ent is not None:
              if "screen_name" in ent:
                name = ent["screen_name"]
                if name is not None:
                  mentions.append(name)
      if status.created_at < end_date:
        break

Finally, we’ll use Counter.most_common() to print out the ten most used hashtags and mentions.

    print()
    print("Most mentioned Twitter users:")
    for item, count in Counter(mentions).most_common(10):
      print(item + "\t" + str(count))

    print()
    print("Most used hashtags:")
    for item, count in Counter(hashtags).most_common(10):
      print(item + "\t" + str(count))

    print()
    print("All done. Processed " + str(tweet_count) + " tweets.")
    print()

And that’s it. A simple tool. But effective. And, of course, you can extend this code in any direction you like.
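
For instance, here’s a minimal sketch of one possible extension: flagging accounts whose average tweet rate looks abnormally high. The 100-tweets-per-day threshold is an arbitrary illustrative value, not a figure from any study.

```python
# Hypothetical extension: flag accounts with an unusually high average
# tweet rate. The threshold is illustrative only.
def tweets_per_day(statuses_count, account_age_days):
    # Treat brand-new accounts as one day old to avoid dividing by zero.
    return float(statuses_count) / float(max(account_age_days, 1))

def looks_hyperactive(statuses_count, account_age_days, threshold=100.0):
    return tweets_per_day(statuses_count, account_age_days) > threshold

print(looks_hyperactive(365000, 365))  # 1000 tweets/day
print(looks_hyperactive(3650, 365))    # 10 tweets/day
```

Plug the statuses_count and account age values gathered above into a check like this, and the tool starts to triage accounts rather than just describe them.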

How To Get Streaming Data From Twitter

I occasionally receive requests to share my Twitter analysis tools. After a few recent requests, it finally occurred to me that it would make sense to create a series of articles that describe how to use Python and the Twitter API to perform basic analytical tasks. Teach a man to fish, and all that.

In this blog post, I’ll describe how to obtain streaming data using Python and the Twitter API.

I’m using twarc instead of tweepy to gather data from Twitter streams. I recently switched to twarc because it has a simpler interface than tweepy and handles most network and Twitter errors automatically.

In this article, I’ll provide two examples. The first one covers the simplest way to get streaming data from Twitter. Let’s start by importing our dependencies.

from twarc import Twarc
import sys

Next, create a twarc session. For this, you’ll need to create a Twitter application in order to obtain the relevant authentication keys and fill in those empty strings. You can find many guides on the Internet for this. Here’s one.

if __name__ == '__main__':
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

For the sake of brevity, let’s assume search terms will be passed as a list on the command-line. We’ll simply accept that list without checking its validity. Your own implementation should probably do more.

  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]

Finally, we’ll check if we have any search targets. If we do, we’ll create a search query. If not, we’ll attach to the sample stream.

  if len(target_list) > 0:
    query = ",".join(target_list)
    print("Search: " + query)
    for tweet in twarc.filter(track = query):
      print_tweet(tweet)
  else:
    print("Getting 1% sample.")
    for tweet in twarc.sample():
      print_tweet(tweet)

Here’s a function to print the “text” field of each tweet we receive from the stream. Note that in the assembled script, this definition needs to appear above the __main__ block that calls it.

def print_tweet(status):
  if "text" in status:
    print(status["text"])

And that’s it. In just over 20 lines of code, you can attach to a Twitter stream, receive Tweets, and process (or in this case, print) them.

In my second example, incoming Tweet objects will be pushed onto a queue in the main thread, while a second processing thread will pull those objects off the queue and process them. The reason we would want to separate gathering and processing into separate threads is to prevent any blocking by the processing step. Although in this example, simply printing a Tweet’s text out is unlikely to block under normal circumstances, once your processing code becomes more complex, blocking is more likely to occur. By offloading processing to a separate thread, your script should be able to handle things such as heavy Tweet volume spikes, writing to disk, communicating over the network, using machine learning models, and working with large frequency distribution maps.

As before, we’ll start by importing dependencies. We’re including threading (for multithreading), Queue (to manage a queue), and time (for time.sleep).

from twarc import Twarc
import Queue
import threading
import sys
import time

The following two functions will run in our processing thread. One will process a Tweet object. In this case, we’ll do exactly the same as in our previous example, and simply print the Tweet’s text out.

# Processing thread
def process_tweet(status):
  if "text" in status:
    print(status["text"])

The other function that will run in the context of the processing thread is a function to get items that were pushed into the queue. Here’s what it looks like.

def tweet_processing_thread():
  while True:
    item = tweet_queue.get()
    process_tweet(item)
    tweet_queue.task_done()

There are also two functions in our main thread. This one implements the same logic for attaching to a Twitter stream as in our first example. However, instead of calling process_tweet() directly, it pushes tweets onto the queue.

# Main thread
def get_tweet_stream(target_list, twarc):
  if len(target_list) > 0:
    query = ",".join(target_list)
    print("Search: " + query)
    for tweet in twarc.filter(track = query):
      tweet_queue.put(tweet)
  else:
    print("Getting 1% sample.")
    for tweet in twarc.sample():
      tweet_queue.put(tweet)

Now for our main function. We’ll start by creating a twarc object, and getting command-line args (as before):

if __name__ == '__main__':
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]

Next, let’s create the queue and start our processing thread.

  tweet_queue = Queue.Queue()
  thread = threading.Thread(target=tweet_processing_thread)
  thread.daemon = True
  thread.start()

Since listening to a Twitter stream is essentially an endless loop, let’s add the ability to catch ctrl-c and clean up if needed.

  while True:
    try:
      get_tweet_stream(target_list, twarc)
    except KeyboardInterrupt:
      print("Keyboard interrupt...")
      # Handle cleanup (save data, etc)
      sys.exit(0)
    except Exception:
      print("Error. Restarting...")
      time.sleep(5)

If you want to observe a queue buildup, add a sleep into the process_tweet() function, and attach to a stream with high enough volume (such as passing “trump” as a command-line parameter). Have fun listening to Twitter streams!
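
To see that effect without connecting to Twitter at all, here’s a self-contained sketch (written for Python 3, where the Queue module is named queue) in which a fast fake producer outpaces a deliberately slow process_tweet(). The sleep durations and fake Tweet objects are invented for illustration.

```python
import queue
import threading
import time

tweet_queue = queue.Queue()

def process_tweet(status):
    time.sleep(0.01)  # simulate slow processing (disk, network, ML model...)

def tweet_processing_thread():
    while True:
        item = tweet_queue.get()
        process_tweet(item)
        tweet_queue.task_done()

thread = threading.Thread(target=tweet_processing_thread)
thread.daemon = True
thread.start()

# Fast "producer": push 100 fake Tweet objects in quick succession.
for n in range(100):
    tweet_queue.put({"text": "fake tweet " + str(n)})

print("Queue size right after the burst: " + str(tweet_queue.qsize()))
tweet_queue.join()
print("Queue size after draining: " + str(tweet_queue.qsize()))
```

The first print should show a backlog of queued Tweets; after join() returns, the processing thread has drained the queue to zero.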

Further Analysis Of The Finnish Themed Twitter Botnet

In a blog post I published yesterday, I detailed the methodology I have been using to discover “Finnish themed” Twitter accounts that are most likely being programmatically created. In my previous post, I called them “bots”, but for the sake of clarity, let’s refer to them as “suspicious accounts”.

These suspicious accounts all follow a subset of recommended profiles presented to new Twitter users. In many cases, these automatically created Twitter accounts follow exactly 21 users. I pursued this line of research because it resembled a phenomenon I’d seen happening in the US last year. Check this post for more details about that case.

In an attempt to estimate the number of accounts created by the automated process described in my previous post, I ran the same analysis tool against a list of 114 Twitter profiles recommended to new Finnish users. Here is the list.

juhasipila
TuomasEnbuske
alexstubb
hsfi
mikko
rikurantala
yleuutiset
jatkoaika
smliiga
Valavuori
SarasvuoJari
niinisto
iltasanomat
Tami2605
KauppalehtiFi
talouselama
TeemuSel8nne
nokia
HeikelaJussi
hjallisharkimo
Linnanahde
tapio_suominen
vrantanen
meteorologit
tikitalk10
yleurheilu
JaajoLinnonmaa
hirviniemi
pvesterbacka
taloussanomat
TuomasKyr
MTVUutiset
Haavisto
SuomenKuvalehti
MikaelJungner
paavoarhinmaki
KajKunnas
SamiHedberg
VilleNiinisto
HenkkaHypponen
SaskaSaarikoski
jhiitela
Finnair
TarjaHalonen
leijonat
JollaHQ
filsdeproust
makinenantti
lottabacklund
jyrkikasvi
JethroRostedt
Ulkoministerio
valtioneuvosto
Yleisradio
annaperho
liandersson
pekkasauri
neiltyson
villetolvanen
akiriihilahti
TampereenPoika
madventures
Vapaavuori
jkekalainen
AppelsinUlla
pakalupapito
rakelliekki
kyleturris
tanelitikka
SlushHQ
arcticstartup
lindaliukas
goodnewsfinland
docventures
jasondemers5
Retee27
H_Kovalainen
ipaananen
FrenzziiiBull
ylenews
digitoday
jraitamaa
marmai
MikaVayrynen
LKomarov
ovi8
paulavesala
OsmoSoininvaara
juuuso
JaanaPelkonen
saaraaalto
yletiede
TimoHaapala
Huuhkajat
ErvastiPekka
JussiPullinen
rsiilasmaa
moia
Palloliitto
teroterotero
ARaanta31
kirsipiha
JPohjanpalo
startupsauna
aaltoes
Villebla
MariaVeitola
merjaya
MikiKuusi
MTVSportfi
EHaula
svuorikoski
andrewickstroem
kokoomus

For each account, my script saved a list of accounts suspected of being automatically created. After completing the analysis of these 114 accounts, I iterated through all collected lists in order to identify all unique account names across those lists.
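
With invented account names standing in for the real per-profile result files, that deduplication step boils down to a set union:

```python
# Hypothetical per-profile result lists; the real ones came from my
# script's saved output for each of the 114 recommended profiles.
per_profile_results = [
    ["acct_a", "acct_b"],
    ["acct_b", "acct_c"],
    ["acct_a", "acct_d"],
]

unique_accounts = set()
for result_list in per_profile_results:
    unique_accounts.update(result_list)

print("Unique accounts: " + str(len(unique_accounts)))
```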

Across the 114 recommended Twitter profiles, my analysis identified 5631 unique accounts. Here are the (first twenty) age ranges of the most recently created accounts:

[Figure: age ranges of all suspicious Twitter accounts identified by my script]

It has been suggested (link in Finnish) that these accounts appeared when a popular game, Growtopia, asked its players to follow its Twitter account after a game outage, and those new accounts then started following recommended Twitter profiles (including those of Haavisto and Niinistö). To check whether this was the case, I collected a list of accounts following @growtopiagame and looked for accounts appearing both on that list and on the list of suspicious accounts collected in my previous step. The overlap was just three accounts, which suggests the accounts my analysis identified aren’t Growtopia players.
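
The overlap check itself is a simple set intersection. The account names below are made up for illustration; in the real analysis both lists came from the Twitter API.

```python
# Made-up account lists standing in for the API results.
suspicious_accounts = {"acct_a", "acct_b", "acct_c", "acct_d"}
growtopia_followers = {"acct_c", "acct_x", "acct_y"}

overlap = suspicious_accounts & growtopia_followers
print("Accounts on both lists: " + str(len(overlap)))
```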

Game of 72 Myth or Reality?

I can’t pretend that, in the mid 90s, I didn't pester my mum for a pair of Adidas poppers joggers. Or that I didn't, against my better judgement, strut around in platform sneakers in an attempt to fit in with the in crowd. But emulating popular fashion was as far as I got. I don’t remember ever doing stupid or dangerous dares to impress my classmates. Initially I thought maybe I was just a good kid, but a quick straw poll around Smoothwall Towers showed that my colleagues don’t recall hurting themselves or anyone else for a dare either. The closest examples of pranks we could come up with between us were knock and run, and egg and flour: hardly show-stopping news.
But now, teenagers seem to be taking daring games to a whole new level through social media, challenging each other to do weird and even dangerous things. Take the #cinnamonchallenge on Twitter (where you dare someone to swallow a mouthful of cinnamon powder in 60 seconds without water). A quick visual check for the hashtag shows it’s still a thing today, despite initially going viral in 2013, and despite doctors having warned teens about the serious health implications.

Now, apparently, there’s another craze doing the rounds. #Gameof72 dares teens to go missing for 72 hours without contacting their parents. The first suspected case was reported in a local French newspaper in April, when a French student disappeared for three days and later told police she had been playing Game of 72. Then, in a separate incident on 7 May, two schoolgirls from Essex went missing for a weekend in a suspected Game of 72 disappearance. Police later issued a statement to say the girls hadn't been playing the game.

So why then, despite small incident numbers and the absence of any actual evidence that Game of 72 is real, are parents and the authorities so panicked? Tricia Bailey from the Missing Children’s Society warned kids of the “immense and terrifying challenges they will face away from home.” And Stephen Fields, a communications coordinator at Windsor-Essex Catholic District School Board, said “it’s not cool”, and has warned students who participate that they could face suspension.

It’s completely feasible that Game of 72 is actually a myth, created by a school kid with the intention of worrying the adults. And it’s worked; social media has made it seem even worse, when in reality it’s probably not going to become an issue. I guess the truth is we’ll probably never know, unless a savvy web filtering company finds a way of making these Twitter-mobile games trackable at school, where peer pressure is often at its worst. Wait a minute... we already do that.
Smoothwall allows school admins to block specific words and phrases, including Twitter hashtags. Say, for instance, that students were discussing Game of 72, or any other challenge, by tweet, and that phrase had been added to the list of banned words or phrases; the school’s administrator would be alerted, and the students’ parents could be notified. Sure, it won’t stop kids getting involved in online challenges, because they could take the conversation to direct messages and we’d lose sight of it. But, I think you’ll probably agree, the ability to track what students are saying in tweets is definitely a step in the right direction.
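
As a toy illustration of that kind of phrase matching (Smoothwall’s real filter is of course far more sophisticated, and the banned list and tweets here are invented):

```python
# Invented banned-phrase list; a real filter would normalise text, handle
# word boundaries, and raise alerts rather than just returning matches.
banned_phrases = ["#gameof72", "#cinnamonchallenge"]

def matches_banned(text):
    lowered = text.lower()
    return [phrase for phrase in banned_phrases if phrase in lowered]

print(matches_banned("Anyone fancy trying #GameOf72 this weekend?"))
print(matches_banned("Nice weather today"))
```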

A new option to stem the tide of nefarious Twitter images…

Smoothwall's team of intrepid web-wranglers have recently noticed a change in Twitter's behaviour. Where once it was impossible to differentiate the resources loaded from twimg.com, Twitter now includes some handy sub-domains, so we can distinguish optional user-uploaded images from the CSS, buttons, etc.

This means it's possible to prevent Twitter loading user-content images without doing HTTPS inspection - something that's a bit of a broad brush, but given the fairly hefty amount of adult content swilling around Twitter, it's far from being the worst idea!

Smoothwall users: Twitter images are considered "unmoderated image hosting" - if you had previously made some changes to unblock CSS and JS from twimg, you can probably remove those now.
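
Here's a sketch of what such subdomain-based filtering looks like in principle. The split assumed here (user-uploaded media served from pbs.twimg.com, static assets from other twimg.com subdomains) reflects behaviour observed at the time of writing and may well change.

```python
from urllib.parse import urlparse

# Assumed host for user-uploaded media; static CSS/JS lives on other
# twimg.com subdomains.
BLOCKED_HOSTS = {"pbs.twimg.com"}

def allow_url(url):
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_HOSTS

print(allow_url("https://abs.twimg.com/a/123/css/twitter.css"))
print(allow_url("https://pbs.twimg.com/media/example_image.jpg"))
```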

Twitter – Den of Iniquity or Paragon of Virtue… or Someplace in Between?


Recently there's been some coverage of Twitter's propensity for porn. Some research has shown that one in every thousand tweets contains something pornographic. With 8662 tweets purportedly sent every second, that's quite a lot.
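
Taking those two figures at face value, the back-of-the-envelope arithmetic looks like this:

```python
tweets_per_second = 8662  # reported global tweet volume
tweets_per_day = tweets_per_second * 60 * 60 * 24

# One in every thousand tweets reportedly contains something pornographic.
porn_tweets_per_day = tweets_per_day // 1000

print("Tweets per day: " + str(tweets_per_day))
print("Pornographic tweets per day (estimate): " + str(porn_tweets_per_day))
```

That works out to roughly three quarters of a million pornographic tweets a day.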

Now, this is not something that has escaped our notice here at Smoothwall HQ. We like to help our customers keep the web clean and tidy for their users, and mostly that means free of porn. With Twitter that's particularly difficult. Their filtering isn't easy to enforce and, while we have had some reasonable results with a combination of search term filtering and stripping certain tweets based on content, it's still not optimal. Twitter does not enforce content marking and 140 characters is right on the cusp of being impossible to content filter.

That said - how porn riddled is Twitter? Is there really sex round every corner? Is that little blue bird a pervert? Well, what we've found is: it's all relative.

Twitter is certainly among the more gutter variety of social networks, with Tumblr giving it a decent run for boobs-per-square-inch, but the likes of Facebook are much cleaner — with even images of breastfeeding mothers causing some controversy.

Interestingly, however, our back-of-a-beermat research leads us to believe that about 40 in every 1000 websites are in some way linked to porn — these numbers come from checking a quarter of a million of the most popular sites through Smoothwall's web filter and seeing what gets tagged as porn. Meanwhile, the Huffington Post reports that 30% of all Internet traffic is porn - the biggest number thus far. However, given the tendency of porn toward video, I guess we shouldn't be shocked.

Twitter: a hard-to-filter, relatively porn-rich social network that is only doing its best to mirror the makeup of the Internet at large. As a school network admin, I would have it blocked for sure: Twitter themselves used to suggest a minimum age of 13, though this requirement quietly went away in a recent update to their terms of service.