This is a rather long blog post, so we’ve created a PDF for you to download, if you’d like to read it offline. You can download that from here.
This report explores Brexit-related Twitter activity occurring between December 4, 2018 and February 13, 2019. Using the standard Twitter API, researchers collected approximately 24 million tweets that matched the word “brexit” published by 1.65 million users.
A node-edge graph created from the collected data was used to delineate pro-leave and pro-remain communities active on Twitter during the collection period. Using these graphs, researchers were able to identify accounts on both sides of the debate that play influential roles in shaping the Brexit conversation on Twitter. A subsequent analysis revealed that while both communities exhibited inorganic activity, this activity was far more pronounced in the pro-leave group. Given the degree of abnormal activity observed, the researchers conclude that the pro-leave Twitter community is receiving support from far-right Twitter accounts based outside of the UK. Some of the exceptional behaviors exhibited by the pro-leave community included:
- The top two influencers in the pro-leave community received a disproportionate number of retweets, as compared to influencer patterns seen in the pro-remain community
- The pro-leave group relied on support from a handful of non-authoritative news sources
- A significant number of non-UK accounts were involved in pro-leave conversations and retweet activity
- Some pro-leave accounts tweeted a mixture of Brexit and non-Brexit issues (specifically #giletsjaunes, and #MAGA)
- Some pro-leave accounts participated in the agitation of French political issues (#franceprotests)
The scope of this report is too limited to conclusively determine whether or not there is a coordinated astroturfing campaign underway to manipulate the public or political climate surrounding Brexit. However, it does provide a solid foundation for more investigation into the matter.
Social networks have come under fire for their inability to prevent the manipulation of news and information by potentially malicious actors. These activities can expose users to a variety of threats. And recently, the spread of disinformation and factually inaccurate statements to socially engineer popular opinion has become a significant concern to the public. Of particular concern is the coordination of actions across multiple accounts in order to amplify specific content and fool underlying algorithms into falsely promoting amplified content to users in their news feeds, searches, and recommendations. Participants in these campaigns can include: fully automated accounts (“bots”), cyborgs (accounts that use a combination of manual and automated actions), full-time human operators, and users who inadvertently amplify content due to their beliefs or political affiliations. Architects of sophisticated social engineering campaigns, or astroturfing campaigns (fabricated social network interactions designed to deceive the observer into believing that the activity is part of a grass-roots campaign), sometimes create and operate convincing looking personas to assist in the propagation of content and messages relevant to their cause. It is extremely difficult to distinguish these “fake” personas from real accounts.
Identifying suspicious activities in social networks is becoming more and more difficult. Adversaries have learned from their past experiences, and are now using better tactics, building better automation, and creating much more human-like sock puppets. Social networks now employ more sophisticated algorithms for detecting suspicious activity, and this forces adversaries to develop new techniques aimed at evading those detection algorithms. Services that sell Twitter followers, Twitter retweets, YouTube views, YouTube subscribers, app store reviews, TripAdvisor reviews, Facebook likes, Instagram followers, Instagram likes, Facebook accounts, Twitter accounts, eBay reviews, Amazon ratings, and anything else you could possibly imagine (related to social networks) can be purchased cheaply online. These services can all be found with simple web searches. For the more tech-savvy, a plethora of tools exist for automating the control of multiple social media accounts, for automating the creation and publishing of text and video-based content, for scraping and copying web sites, and for automating search engine optimization tasks. As such, more complex analysis techniques, and much more in-depth study of the data obtainable from social networks is required than ever before.
Because of its open nature, and fully-featured API support, Twitter is an ideal platform for research into suspicious social network activity. By studying what happens on Twitter, we can gain insight into the techniques adversaries use to “game” the platform’s users and underlying algorithms. The findings from such research can help us build more robust recommendation mechanisms for both current and future social networking platforms.
Between December 4, 2018 and February 13, 2019, we used the standard Twitter API (from Python) to collect Twitter data against the search term “brexit”. The collected data was written to disk and then subsequently analyzed (using primarily Python and Jupyter notebooks) in order to search for suspicious activity such as disinformation campaigns, astroturfing, sentiment amplification, or other “meddling” operations.
At the time of writing, our dataset consisted of approximately 24 million tweets published by over 1.65 million users. 18 million of those were retweets published by 1.5 million users from tweets posted by 300,000 unique users. The dataset included 145,000 different hashtags, 412,000 different URLs and 700,000 unique tweets.
Suspicious activity (activity that appears inorganic or unnatural) can be difficult to separate from organic activity on a social network. For instance, a tweet from user with very few followers will normally fall on deaf ears. However, that user may once in a lifetime post something that ends up going “viral” because it was so catchy it got shared by other users, and eventually by influencers with many followers. Malicious actors can amplify a tweet to similar effect by instructing or coordinating a large number of accounts to share an equally unknown user’s tweet. This can be achieved via bots or manually operated accounts (such as what was achieved by Tweetdeckers in 2017 and 2018). Retweets can also be purchased online. Vendors that provide such services publish the purchased retweets from their own fleets of Twitter accounts which likely don’t participate in any other related conversations on the platform. Retweets purchased on this way are often published over a period of time (and not all at once, since that would arouse suspicion). Hence, detecting that a tweet has been amplified by such a service (and identifying the accounts that participated in the amplification) is only possible if those retweets are captured as they are published. Finding a small group of users that retweeted one account over several days, and that may have themselves appeared only once in a dataset containing over 20 million tweets and 300,000 users, is rather difficult.
Groups of accounts that heavily retweet similar content or users over and over can be indicative of automation or malicious behaviour, but finding such groups can sometimes be tricky. Nowadays sophisticated bot automation exists that can easily hide the usual tell-tale signs of artificial amplification. Automation can be used to queue a list of tweets to be published or retweeted, randomly select a portion of potentially thousands of slave accounts, and perform actions at random times, while specifically avoiding tweeting at certain times of the day to give the impression that real users are in control of those accounts. Real tweets and tweets from “share” buttons on news sites can be mixed with retweets to improve realism.
Another approach to finding suspicious behaviour on Twitter is to search for account activity patterns indicative of automation. In a vacuum, these patterns cannot be used to conclusively determine whether an account is automated, or designed to act as part of an astroturfing or disinformation campaign. However, identifying accounts with one or more suspicious traits can help lead researchers to other accounts, or suspicious phenomena, which may ultimately lead to finding evidence of foul-play. Here are some traits that may indicate suspiciousness:
- While it is entirely possible for a bored human to tweet hundreds of times per day (especially when most of the activity is pressing the retweet button), accounts with high tweet volumes can sometimes be indicative of automation. In fact, some of the accounts we found during this research that tweeted at high volume tended to publish hundreds of tweets at certain times during the day, whilst remaining dormant the rest of the time, or published tweets at a uniform volume, with no pauses for sleep.
- Accounts that are just a few days or weeks old tend to not have thousands of followers, unless they belong to a well-known celebrity who just joined the platform. New accounts that develop huge followings in a short period of time are suspicious, unless those followings can be explained by particular activity or some sort of pre-existing public status.
- Accounts with a similar number of followers and friends can occasionally be suspicious. For instance, accounts controlled by a bot herder are sometimes programmatically instructed to follow each other, and end up having similar follower/friends counts. However, mechanisms also exist that promote a “follow-back” culture on Twitter. These mechanisms are often present in isolated communities, such as the far-right Twittersphere. Commercial services also exist that automate follow-back actions for accounts that followed them. The fact that the list of accounts followed by a user is very similar to that user’s list of friends can, unfortunately, be indicative of any of the above.
- Accounts that follow thousands of other accounts, but are themselves followed by only a fraction of that number can occasionally be indicative of automation. Automated accounts that advertise adult services (such as porn, phone sex, “friend finders”, etc.) use this tactic to attract followers. However, there are also certain communities on Twitter that tend to reciprocate follows, and hence following a great deal of accounts (including “egg” accounts) is a way of “fishing” for follow-backs, and normal in those circles.
- While it is true that many users on Twitter tend to like and retweet content a lot more than they write their own tweets, accounts that retweet more than 99% of the time might be controlled by automation (especially since it’s a very easy thing to automate, and can be used to boost the engagement of specific content). A few accounts that we encountered during our research had one pinned tweet published by “Twitter Web Client” whilst the rest of the account’s tweets were retweets published by “Twitter for Android”. This sort of pattern raises suspicion, since it could indicate that the account was manually created (and seeded with a single hand-written tweet) by a user at a computer, and then subsequently automated.
- Accounts that publish tweets using apps created with the Twitter API, or from sources that are often associated with automation are not conclusively suspicious, but may warrant further examination. This is covered more extensively later in the article.
- Temporal analysis techniques (discussed later in this article) can reveal robot-like behaviour indicative of automation. Some accounts are automated by design (e.g. automated marketing accounts, news feeds). However, if an account behaves in an automated fashion, and publishes politically polarizing content, it may be cause for suspicion.
During the first few weeks of our research, we focused on building up an understanding of the trends and topology of conversations around the Brexit topic. We created a simple tool designed to collect counts and interactions from the previous 24 hours’ worth of data, and present the results in an easily readable format. This analysis included:
- counts of how many times each user tweeted
- counts of how many times each user retweeted another user (amplifiers)
- counts of how many times each user was retweeted by another user (influencers)
- counts of hashtags seen
- counts of URLs shared
- counts of words seen in tweet text
- a map of interactions between users
By mapping the interactions between users (users interact when they retweet, mention, or reply to each other), a node-edge graph representation of observed conversations can be built. Here’s a simple representation of what that looks like:
Lines connecting users in the diagram above represent interactions between those users. Communities are groups of nodes within a network that are more densely connected to one another than to other nodes, and can be discovered using community detection algorithms. To visualize the topology of conversation spaces, we used a graph analysis tool called Gephi, which uses the Louvain Method for community detection. For programmatic community detection purposes, we used the “multilevel” algorithm that is part of the python-igraph package (which is very similar to the algorithm used in Gephi). We often used graph analysis and visualization techniques during our research, since they were able to fairly accurately partition conversations between large numbers of accounts. As an example of the accuracy of these tools, the illustration below is a graph visualization created using about 24 hours’ worth of data collected around December 4, 2018.
Names with a larger font indicate Twitter accounts that are mentioned more often. It can be noted from the above illustration that that conversations related to pro-Brexit (leave) topics are clustered at the top (in orange) and conversations related to anti-Brexit (remain) topics are clustered at the bottom (in blue). The green cluster represents conversations related to Labour, and the purple cluster contains conversations about Scotland. People familiar with the Twitter users in this visualization will understand how accurately this methodology managed to separate out each political viewpoint. Visualizations like these illustrate that separate groups of users discuss opposing topics, with very little interaction between the two groups. Highly polarized issues, such as the Brexit debate (and many political topics around the world) usually generate graph visualizations that look like the above.
December 11: #franceprotests hashtag
On December 11, 2019, we observed the #franceprotests hashtag trending in our data (something we had not previously seen). Isolating all tweets from 24 hours’ worth of previously collected data, we found 56 separate tweets that included the #franceprotests hashtag. We mapped interactions between these tweets and the users that interacted with them, resulting in this visualization:
From the above visualization, we can clearly observe a large number of users interacting with a single tweet. This particular tweet (id: 1069955399917350912) was responsible for a majority of the occurrences of the #franceprotests hashtag on that day. This is the tweet:
The reason this tweet showed up in our data was because of the presence of the #BREXIT hashtag. From this 24 hours’ worth of collected data, we isolated a list of 1047 users that retweeted the above tweet. Interactions between these users from across the 24-hour period looked like this:
Of note in the above visualization are accounts such as @Keithbird59Bird (which retweeted pro-leave content at high volume across our entire dataset), @stephenhawes2 (a pro-leave account that exhibits potentially suspicious activity patterns), @SteveMcGill52 (an account that tweets pro-leave, anti-muslim, and US-related right wing content at high volume). The @lvnancy account that published the original tweet is a US-based alt-right account with over 50,000 followers.
At the time of writing, 23 of these 1047 accounts (2.2%) had been suspended by Twitter.
We performed a Twitter history search for “#franceprotests” in order to determine which accounts had been sharing this hashtag. The search captured roughly 5,800 tweets published by just over 3,617 accounts (retweets are not included in historical searches). Searching back historically allowed us to determine that the current wave of #franceprotests tweets started to pick up momentum around November 28, 2018. In addition to the #franceprotests hashtag, this group of users also published tweets with hashtags related to the yellow vests movement (#yellowvest, #yellowjackets, #giletsjaunes), and to US right-wing topics (#MAGA, #qanon, #wwg1wga). Interactions between the accounts found in that search look like this:
Some of the accounts in this group are quite suspicious looking. For instance, @tagaloglang is an account that claims to be themed towards learning the Tagalog language. The pinned tweet at the top of @tagaloglang’s timeline makes the account appear in-theme when the page loads:
However, scroll down, and you’ll notice that the account frequently publishes political content.
Another odd account is @HallyuWebsite – a Korean-themed account about Kpop. Here’s what the account looks like when you visit it:
Again, this is just a front. Scroll down and you will see plenty of political content.
Both @tagaloglang and @ HallyuWebsite look like accounts that might be owned by a “Twitter marketing” service that sells retweets.
The 5,800 tweets captured in this search had accumulated a total of 53,087 retweets by mid-February 2019. Here are a few of the tweets that received the most retweets:
At the time of writing, 66 of the 3,617 accounts (1.83%) identified as historically sharing this hashtag had been suspended.
Throughout our research, we observed many English-language accounts participating in activism related to the French protests, often in conjunction with UK, US, and other far-right themes. We would imagine that a separate research thread devoted to the study of far-right activism around the French protests would likely expose plenty of additional suspicious activity.
December 20: suspicious pro-leave amplification
During our time spent studying the day-to-day user interactions, we became familiar with the names of accounts that most often tweeted, and of those that were most often retweeted. On December 20, 2018 we noticed a few accounts that weren’t normally highly retweeted that made it onto our “top 50” list. We isolated the interactions between these accounts, and the accounts that retweeted them, and produced the following visualization:
As illustrated above, several separate groups of accounts participated in the amplification of a small number of tweets from brexiteer30, jackbmontgomery, unitynewsnet and stop_the_eu. Here is a visualization of tweets from those accounts, and the users who interacted with them:
5,876 accounts participated in the amplification captured on December 20, 2018. In order to discover what other accounts these 5,876 accounts were amplifying, we collected the last 200 tweets from each of the accounts, and mapped all interactions found, generating this graph:
Zooming in on this, we can see that the yellow cluster at the bottom contains US-based “alt-right” Twitter personalities (such as Education4Libs, and MrWyattEarpLA – an account that is now suspended), and US-based non-authoritative news accounts (such as INewsNet).
The large blue center cluster contains many EU-based right-wing accounts (such as Stop_The_EU, darrengrimes_, and BasedPoland), and non-authoritative news sources (such as UnityNewsNet, V_of_Europe). It also contains AmyMek, a radical racist US Twitter personality with over 200,000 followers.
The orange cluster at the top contains interactions with pro-remain accounts. Although we weren’t expecting to see any interactions of this nature, they were most likely introduced by accounts in the dataset that retweet content from both sides of the debate (such as Brexit-themed tweet aggregators).
Many of the 5,876 accounts that participated in the December 20, 2018 amplification contained #MAGA (Make America Great Again hashtag commonly used by the alt-right), had US cities and states set as their locations, or identified as American in one way or another. At the time of writing, 79 of these 5,876 accounts (1.34%) had been suspended by Twitter.
February 12: non-authoritative news accounts
The presence of interactions with a number of non-authoritative, pro-leave news sources that are supportive of far-right activist Tommy Robinson (such as UnityNewsNet, PoliticalUK, and PoliticaLite) in this data led us to explore the phenomenon a little further. We ran analysis over our entire collected dataset in order to discover which accounts were interacting with, and sharing links to these sources. The data reveals that some users retweeted the accounts associated with these news sources, others shared links directly, and some retweeted content that included those links. Using our collected data, we were able to build up a picture of how these links were being shared between early December and mid-February. The script we ran looked for interactions with the following accounts: “UnityNewsNet”, “AltNewsMedia”, “UK_ElectionNews”, “LivewireNewsUK”, “Newsflash_UK”, “PoliticsUK1”, “Politicaluk”, “politicalite”. It also performed string searches on any URLs embedded in tweets for the following: “unitynewsnet”, “politicalite”, “altnewsmedia”, “www-news”, “patriotnewsflash”, “puknews”.
Overall, we discovered that 7,233 accounts had either shared (or retweeted) links to these news sites, or retweeted their associated Twitter accounts. A total of 15,337 retweets were found from the dataset. The UnityNewsNet Twitter accounts was the most popular news source present in our dataset. It received 8,119 retweets by a total of 4,185 unique users. In second place was the UK_ElectionNews account with 1,293 retweets from 1,182 unique users, and in third place was politicalite with 494 retweets from 351 unique users.
A total of 9,193 Tweets were found in the dataset that shared URLS that matched the string searches mentioned above. Again, Unity News Network was the most popular – URLs that matched “unitynewsnet” were tweeted a total of 5542 times by 2928 unique users. Politicalite came in second – URLs that matched “politicalite” were tweeted a total of 3300 times by 2197 unique users. In third place was Newsflash_UK – URLs that matched “patriotnewsflash” were tweeted a total of 239 times by 65 unique users. Here is a graph visualization of all the activity that took place between the beginning of December 2018 and mid-February 2019:
Names that appear larger in the above visualization are account names that were retweeted more often. We can see more names here than the originally queried accounts because many links to these sites were shared by users retweeting other accounts that shared a link. Here’s a closer zoom-in:
At the time of writing, 130 of the 7,233 accounts (1.79%) identified to be sharing content related to these non-authoritative new sources had been suspended by Twitter.
The figures and illustrations shown above were obtained from a dataset of tweets that matched the term “brexit”. This particular analysis, unfortunately, didn’t give us full visibility into all activity around these “non-authoritative news” accounts and the websites associated with them that happened on Twitter between early December 2018 and mid-February 2019. In order to explore this phenomenon further, we performed historical Twitter searches for each of the account names in question (collecting data from between December 4, 2018 and February 12, 2019). This allowed us to examine tweets and interactions that weren’t captured using the search term “brexit”. Historical Twitter searched only return tweets from the accounts themselves, and tweets where the accounts were mentioned. Unfortunately, no retweets are returned by a search of this kind.
The combined dataset (over all 7 searches) included 30,846 tweets and 12,026 different users.
Combining the data from historical searches against all seven account names, we were able to map interactions between each news account and users that mentioned it. Here’s what it looked like:
Here’s a zoomed-in view of the graph around politicalite and altnewsmedia:
Note the presence of @prisonplanet (Paul Joseph Watson) and @jgoddard230616 (James Goddard, the star of several recent “yellow vests” harassment videos), amongst other highly-mentioned far-right personalities.
Also of interest is the set of users coloured in purple in the following visualization:
The users in the purple cluster were found from the data we collected using a Twitter search for “unitynewsnet”. With the exception of V_of_Europe, each of these accounts is mentioned exactly the same number of times (522 times) by other users in that dataset. This particular phenomenon appears to have been created by a rather long conversation between those 40ish accounts between January 14 and 16, 2019. The conversation started with a question about where to find “yellow vest”-related news. Since mentions are always inherited between replies, and both V_of_Europe and UnityNewsNet were mentioned in the first tweet in the thread, this explains why these tweets are present in this dataset. Using temporal analysis techniques (explained below), we were able to ascertain that a majority of the involved accounts pause, or tweet at reduced volume between 06:00 and 12:00 UK time, which is indicative of night time in US time zones. In fact, examining these accounts manually reveals that they are mostly US-based. V_of_Europe account is a non-authoritative news account (Voice of Europe) with over 200,000 followers.
This interesting finding illustrates the fact that sometimes a suspicious looking trend or spike may present itself when data is viewed from a certain angle. Further inspection of the phenomenon will then prove it to be largely benign.
At the time of writing, 159 of the 12.026 accounts (1.32%) discovered above had been suspended by Twitter.
Temporal analysis methods can be useful for determining whether a Twitter account might be publishing tweets using automation. This section describes techniques, the results of which are included later in this document. Here are some common temporal analysis methods:
- Gather a histogram of counts of time intervals between the publishing of tweets. Numerous high counts of similar time intervals between tweets can indicate robotic behaviour.
- A “heatmap” of the time of day, and day of the week that an account tweets can be gathered. The heatmap can then be examined (either by eye, or programmatically) for anomalous patterns. Using this technique, it is easy to identify accounts that tweet non-stop, with no breaks. If this is the case, it is possible that some (or all) of the tweets are being published via automation.
- A heatmap analysis may also illustrate that certain accounts publish tweets en-masse at specific times of the day, and remain dormant for many hours in between. This behavior can also be an indicator that an account is automated – for instance, this is somewhat common with marketing automation or news feeds.
Here are some interesting examples found from the dataset. Note that these examples are intended as illustration, and not as indications that the associated accounts are bots.
The stephanhawes2 account tweets in short bursts at specific times of the day, with no activity at any other time. The precise time windows during which this user tweets (18:00-20:59 and 00:00-01:59) looks odd. This account retweets a great deal of far-right content.
Here are the time deltas (in seconds) observed between the account’s last 3200 tweets. You’ll notice that a majority of the tweets are published between 5 and 15 seconds apart.
The JimNola42035005 account, which amplifies a lot of pro-leave content, pauses tweeting between 08:00 UTC and 13:00 UTC. This is indicative of a user not residing in the UK’s time zone.
The interarrival pattern for this account shows a strong tendency for multiple tweets to be published in rapid succession (5-30 seconds apart).
The tobytortoise1 account tweets at very high volume, and almost always shows up at or near the top of the most active users tweeting about Brexit. This is a pro-leave account. Here’s the heatmap for that account. Note the bursts of activity exceeding 100 tweets in an hour:
Here is the interarrival pattern for that account:
The walshr108 account, which publishes pro-leave content, appears to pause roughly around UK night-time hours. However, the interarrivals pattern of this account raises suspicion.
Over 350 of walshr108’s last 3200 tweets were published less than one second apart.
Unconventional source fields
Each published tweet includes a “source” field that is set by the agent that was used to publish that tweet. For instance, tweets published from the web interface have their source field set to “Twitter Web Client”. Tweets published from an iPhone have a source field set to “Twitter for iPhone”. And so on. Tweets can be published from a variety of sources, including services that allow tweets to be scheduled for publishing (for instance “IFTTT”), services that allow users to track follows and unfollows (such as “Unfollowspy”), apps within web pages, and social media aggregators. Twitter sources can be roughly grouped into:
- Sources associated with manual tweeting (such as “Twitter Web Client”, “Twitter for iPhone”)
- Sources associated with known automation services (such as “IFTTT”)
- Sources that don’t match either of the above
While services that allow the automation of tweeting (such as “IFTTT”) can be used for malicious purposes, they can also be used for legitimate purposes (such as brand marketing, news feeds, and aggregators). Malicious actors sometimes shy away from such services for two reasons:
- It is easy for researchers to identify tweet automation by examining source fields
- Sophisticated tools exist that allow bot herders to publish tweets from multiple accounts without the use of the API, and which can spoof their user agent to match legitimate sources (often “Twitter for Android”)
Despite the availability of professional bot tools, there are still some malicious actors that use Twitter’s API and attempt to disguise what they’re doing. One way to do this is to create an app who’s source field is a string similar to that of a known source (e.g. “Twitter for Android” <- note that this string has two spaces between the words “for” and “Android”). Another way is to replace ascii characters with non-ascii characters (e.g. “Оwly” <- the “O” in this string is non-ascii). API-based apps can also “hide” by using source strings that look like legitimate product names – there are a plethora of legitimate apps available that all have similar-looking names: Twuffer, Twibble, Twitterfy, Tweepsmap, Tweetsmap. It’s easy enough to create a similarly absurd, nonsensical word, and hide amongst all of these.
Over 6000 unique source strings were found in the dataset. There is no definitive list of “legitimate” Twitter sources available, and hence each and every one of the unique source strings found must be examined manually in order to build a list of acceptable versus unacceptable sources. This process involves either searching for the source string, locating a website, and reading it, or visiting the account that is using the unknown source string and manually checking the “legitimacy” of that account. At the time of writing, we had managed to hand-verify about 150 source strings that belonged to Twitter clients, known automation services, and custom apps used by legitimate services (such as news sites and aggregators). We found roughly 2 million tweets across the entire dataset that were published with source strings that we had yet to hand-verify. These tweets were published by just under 17,000 accounts.
As mentioned previously, since there are dozens of legitimate services that allow Twitter to be automated, it isn’t easy to programmatically identify whether these automation sources are being used for malicious purposes. Each use of such a service found in the dataset would need to be examined by hand (or by the use of custom filtering logic for each subset of examples). This is simply not feasible. As such, using Twitter’s source field to determine whether suspicious, malicious, or automated behaviour is occurring is a complex endeavour, and one that is outside of the scope of the research described in this document.
Comparison of remain and leave-centric communities
We collected retweet interactions over our entire dataset and created a large node-edge graph. The reason why we only captured retweet interactions in this case was based on the assumption that if users wish to extend the reach of a particular tweet, they’d more likely retweet it than reply to it, or simply mention an account name. While the process of “liking” a tweet also seems to amplify a tweet’s visibility (via Twitter’s underlying recommendation mechanisms), instances of users “liking” tweets are, unfortunately, not available via Twitter’s streaming API.
The graph of all retweet activity across the entire collection period contained 219,328 nodes (unique Twitter accounts) and 1,184,262 edges (each edge was one or more observed retweet). Using python-igraph’s multilevel community detection algorithm, we partitioned the graph into communities. A total of 8,881 communities were discovered during this process. We performed string searches on the members of each identified community for high-profile accounts we’d seen engaging in leave and remain conversations throughout the research period, and were able to discover a leave-centric community containing 39,961 users and a remain-centric community containing 52,205 users. We then separately queried our full dataset with each list of users to isolate all relevant data (tweets, interactions, hashtags, and urls). Below are the findings from that analysis work.
The leave-centric community comprised of 39,961 users. They published 1.1 million unique tweets, and a total of 4.3 million retweets across the dataset.
- 2779 (6.95%) accounts that were seen at least 100 times in the dataset retweeted 95% (or more) of the time.
- 278 (0.70%) accounts retweeted over 2000 times across the entire dataset, for a total of 880,620 retweets (20.5%). Of these, temporal analysis suggests that 33 of the accounts exhibited potentially suspicious behavior (11.9%) and 14 accounts tweeted during non-UK time schedules.
- At the time of writing, 133 of these accounts (0.33%) had been suspended by Twitter.
The remain-centric community comprised of 52,205 users. They published 1.7 million unique tweets, and a total of 6.2 million retweets across the dataset.
- 3413 (6.54%) accounts that were seen at least 100 times in the dataset retweeted 95% (or more) of the time.
- 436 (0.84%) accounts retweeted over 2000 times across the entire dataset, for a total of 1,471,515 retweets (23.7%). Of these, temporal analysis suggests that 41 of the accounts exhibited potentially suspicious behavior (9.4%) and 18 accounts tweeted during non-UK time schedules.
- At the time of writing, 54 of these accounts (0.10%) had been suspended by Twitter.
Although we were initially suspicious of high-volume retweeters, the presence of these in roughly equal proportions in both groups led us to believe that this sort of behaviour might be somewhat standard on Twitter. The top remain-centric group’s high-volume retweeters published more often than top leave-centric high-volume retweeters. I observed that many of the top retweeters from the remain-centric group tended to tweet a lot about Labour.
The top retweeted account in the leave-centric group received substantially more retweets than the next most highly retweeted account. This was not the case for the remain-centric group.
- Top hashtags used by the remain-centric group included: #peoplesvote, #stopbrexit, #eu, #fbpe, #remain, #labour, #finalsay, and #revokea50. All of the top-50 hashtags in this group were themed around anti-brexit sentiment, around politicians, or around political events that happened in the UK during the data collection period (#specialplaceinhell, #theresamay, #corbyn, #donaldtusk, #newsnight).
- Top hashtags used by the leave-centric group included: #eu, #nodeal, #standup4brexit, #ukip, #leavemeansleave, #projectfear and #leave. Notable other hashtags on the top-50 list for this group were other “no-deal” hashtags (#gowto, #wto, #letsgowto, #nodealnoproblem, #wtobrexit), hashtags referring to protests in France and the adoption of high-vis vest by far-right UK protesters (#giletsjaunes, #yellowvestsuk, #yellowvestuk) and the hashtag #trump.
- Both groups heavily advertised links to UK Parliament online petitions relevant to the outcome of brexit. The remain group advertised links to petitions requesting a second referendum, whilst the leave group advertised links to petitions demanding the UK leave the EU, regardless of the outcome of negotiations.
- From the users we identified as having retweeted more than 95% of the time, we found 62 accounts from the leave-centric group that were clearly American right-wing personas. These accounts associated with #Trump and #MAGA, amplified US political content, and interacted with US-based alt-right personalities (in addition to amplifying Brexit-related content). The description fields of these accounts usually included words such as “Patriot”, “Christian”, “NRA”, and “Israel”. Many of these accounts had their locations set to a state or city in the US. The most common locations set for these accounts were: Texas, Florida, California, New York, and North Carolina. We found no evidence of equivalent accounts in the remain-centric group.
- Following on from our previous discovery, using a simple string search, we found 1294 accounts in the leave-centric group and 12 accounts in the remain-centric group that had #MAGA in either their name or description fields. We manually visited a random selection of these accounts to verify that they were alt-right personas. A few of the #MAGA accounts identified in the remain group were not what we would consider alt-right – they showed up in the results due to the presence of negative comments about MAGA culture in their account description fields.
- As detailed earlier, some of the accounts in the leave-centric group interacted with non-authoritative, far-right “news” accounts, or shared links with sites associated with these accounts (such as UnityNewsNet, BreakingNLive, LibertyDefenders, INewsNet, Voice of Europe, ZNEWSNET, PoliticalUK, and PoliticaLite.) We didn’t find an analogous activity in the remain-centric group.
We created a few plots of the number of times a hashtag was observed during each hour of the day. For a baseline reference, here’s what that plot looks like for the #brexit hashtag:
You can clearly see a lull in activity during night-time hours in the UK. Compare the above baseline with the plot for the #yellowvestuk hashtag:
This clearly shows that the #yellowvest hashtag is most frequently used in the late evening UK time (mid-afternoon US time). Here is the plot for #yellowvestsuk:
Note that this hashtag follows a different pattern to #yellowvestuk, and is used most often around lunchtime in the UK. Both of these graphs show a lull in activity during night-time hours in the UK, indicating that the accounts pushing these hashtags most likely belong to people living in the UK, and that possibly different groups are promoting these two competing hashtags.
It is very difficult to determine whether a Twitter account is a bot, or acting as part of a coordinated astroturfing campaign, simply by performing queries with the standard API. Twitter’s programmatic interface imposes many limitations to what can be done when analyzing an account. For instance, by default, only the last 3200 tweets can be collected from any given account, and Twitter restricts how often such a query can be run. Most of the potentially suspicious accounts identified during this research have published tens, or even hundreds of thousands of tweets over their lifetimes, most of which are now inaccessible.
Since Twitter’s API doesn’t support observing when a user “likes” a tweet, and has limited support for querying which accounts retweeted a tweet, or replied to a tweet, it is impossible to track all actions that occur on the platform. Nowadays, a user’s Twitter timeline contains a series of recommendations (for instance, tweets that appear on user’s timeline may indicate that they are there because “user x that you follow liked this tweet”). The timeline is no longer just a sequential list of tweets published by accounts a user follows. Hence it is important to understand which actions might increase the likelihood that a tweet appears on a user’s timeline, is recommended to a user (via notifications) or appears in a curated list when performing a search.
We do know that Twitter’s systems track an internal representation of the quality of every account, and give more engagement weight to higher quality accounts. Although it is likely that many of the potentially suspicious accounts identified during our research have low quality scores, it is still possible that their collective actions may incorrectly modify the sentiment of certain viewpoints and opinions, or cause content to be shown to users when it otherwise shouldn’t have.
From analysis of the “leave” and “remain” communities obtained by graph analysis, it seems clear to us that the remain-centric group looks quite organic, whilst the leave-centric group are being bolstered by non-UK far-right Twitter accounts. Leave users also utilize a number of “non-authoritative” news sources to spread their messages. Given that we also observed a subset of leave accounts performing amplification of political content related to French and US politics, we wouldn’t be surprised if coordinated astroturfing activity is being used to amplify pro-Brexit sentiment. Finding such a phenomenon would require additional work – most of the tweets published by this group likely weren’t captured by our stream-search for the word “brexit”. It’s clear that an internationally-coordinated collective of far-right activists are promoting content on Twitter (and likely other social networks) in order to steer discussions and amplify sentiment and opinion towards their own goals, Brexit being one of them.
During the course of our research, we created over 90 separate Jupyter notebooks and custom analysis tools. We would approximate that 90% of the approaches we tried ended up in dead ends. Despite all of this analysis work, we didn’t find the “next big” political disinformation botnet. We did, however, find many phenomena that were both interesting and odd.