Category Archives: Twitter

Twitter CEO Jack Dorsey Says Biometrics May Defeat Bots

Trailrunner7 shares a report from Duo Security: From the beginning, Twitter's creators made the decision not to require real names on the service. It's a policy descended from older chat services, message boards and Usenet newsgroups, and it was designed to allow users to express themselves freely. Free expression is certainly one of the things that happens on Twitter, but that policy has had a number of unintended consequences, too. The service is flooded with bots: automated accounts deployed by a number of different types of users, some legitimate, others not so much. Many companies and organizations use automation in their Twitter accounts, especially for customer service. But a wide variety of malicious actors use bots, too, for a lot of different purposes. Governments have used bots to spread disinformation for influence campaigns, cybercrime groups employ bots as part of the command-and-control infrastructure for botnets, and bots are an integral part of the cryptocurrency scam ecosystem. This has been a problem for years on Twitter, but it only became a national and international issue after the 2016 presidential election.

Twitter CEO Jack Dorsey said this week that he sees potential in biometric authentication as a way to help combat manipulation and increase trust on the platform. "If we can utilize technologies like Face ID or Touch ID or some of the biometric things that we find on our devices today to verify that this is a real person, then we can start labeling that and give people more context for what they're interacting with and ideally that adds some more credibility to the equation. It is something we need to fix. We haven't had strong technology solutions in the past, but that's definitely changing with these supercomputers we have in our pockets now," Dorsey said.

Jordan Wright, an R&D engineer at Duo Labs, writes: "I think it's a step in the right direction in terms of making general authentication usable, depending on how it's implemented. But I'm not sure how much it will help the bot/automation issue. There will almost certainly need to be a fallback authentication method for users without an iOS device. Bot owners who want to do standard authentication will use whichever method is easiest for them, so if a password-based flow is still offered, they'd likely default to that." "The fallback is the tricky bit. If one exists, then Touch ID/Face ID might be helpful in identifying that there is a human behind an account, but not necessarily the reverse -- that a given account is not human because it doesn't use Touch ID," Wright adds.


Twitter bug exposed private tweets of Android users to public for years

By Carolina

A security bug in Twitter exposed private tweets of users to the public. The flaw only affected Android users of the Twitter app while iPhone users were not affected. According to Twitter, private tweets of users from November 3, 2014, to January 14, 2019, were exposed. Although the company did not say how many people were affected […]


Twitter Android App Bug Revealed Private Tweets Spanning Five Years

Social media giant Twitter has just announced a bug fix that has been affecting users of its Android App. However,


Twitter fixed a bug in its Android App that exposed Protected Tweets

A bug in the Twitter app for Android may have exposed protected tweets, the social media platform revealed on Thursday.

The bug in the Android Twitter app affects the “Protect your Tweets” option in the account’s “Privacy and safety” settings, which restricts the visibility of a user’s posts to approved followers.

People who used the Twitter app for Android may have had the protected tweets setting disabled after making certain changes to their account settings, for example after changing the email address associated with the profile.

“We’ve become aware of an issue in Twitter for Android that disabled the “Protect your Tweets” setting if certain account changes were made.” reads the security advisory published by the company.

“You may have been impacted by this issue if you had protected Tweets turned on in your settings, used Twitter for Android, and made certain changes to account settings such as changing the email address associated with your account between November 3, 2014, and January 14, 2019.”

The vulnerability was introduced on November 3, 2014, and was fixed on January 14, 2019; users of the iOS app or the web version were not impacted.

Twitter has notified impacted users and has turned “Protect your Tweets” back on for them if it was disabled.

“We are providing this broader notice through the Twitter Help Center since we can’t confirm every account that may have been impacted. We encourage you to review your privacy settings to ensure that your ‘Protect your Tweets’ setting reflects your preferences,” continues the advisory.

Twitter recently addressed a similar bug: in December, the researcher Terence Eden discovered that the permissions dialog shown when authorizing certain apps to access Twitter could expose direct messages to the third party.

In September 2018, the company announced that an issue in the Twitter Account Activity API had exposed some users’ direct messages (DMs) and protected tweets to the wrong developers.

Twitter is considered one of the most powerful social media platforms; it has been used in multiple cases by nation-state actors as a vector for disinformation and propaganda.

In December Twitter discovered a possible nation-state attack while it was investigating an information disclosure flaw affecting its platform.

Pierluigi Paganini

(SecurityAffairs – Twitter app, Android)


Old tweets reveal hidden secrets

Old tweets reveal more than you might think, according to a new report published this month. Tweets can reveal the places you visited and the things you did, even if you didn't mention them explicitly. Researchers from Greece's Foundation for Research and Technology and the University of Illinois discovered all this after […]

Smashing Security #110: What? You can get paid to leave Facebook?


Twitter and the not-so-ethical hacking of celebrity accounts, study discovers how you can pay someone to quit Facebook for a year, and the millions of dollars you can make from uncovering software vulnerabilities.

All this and much more is discussed in the latest edition of the award-winning “Smashing Security” podcast by computer security veterans Graham Cluley and Carole Theriault, joined this week by Maria Varmazis.

German Teen Confesses to Data Breach Affecting 1,000 Politicians, Journalists

2019 kicked off with a major security breach in Germany that compromised the personal data of some 1,000 politicians, journalists and celebrities, including Angela Merkel, Green party leader Robert Habeck and TV personality Jan Böhmermann, as well as rappers and members of the German parliament, writes the BBC. For now, there is no evidence suggesting members of the far-right AfD party were also targeted.

While authorities initially had no idea who was behind the cyberattack, they brought in a 20-year-old German man for questioning, says The Guardian. At first he denied the accusations but confirmed he knew who was behind the Twitter account used in the breach: @_0rbit, located in Hamburg, Germany.

In December, the Twitter account @_0rbit published the stolen data online, disguised as a daily advent calendar. The compromised data includes telephone numbers, credit card information, photos, addresses, private conversations and contacts, reported the BKA, the German federal criminal police. The account, which had over 17,000 followers, has been suspended.

Shortly after interrogation, the man, identified as Jan S., confessed to the attack, which he claims he carried out “alone and out of annoyance at statements made by the public figures he attacked.” On Twitter he also used the account name “G0d.” The BKA says that so far there is no evidence a third party was involved.

Interior Minister Seehofer told the BBC at the time that the data was accessed through “wrongful use of log-in information for cloud services, email accounts or social networks.” There is no evidence that government systems were hacked.

German newspaper Bild claims the compromised data dates back to October 2018, and possibly even earlier.

Jan S. was released on Monday “due to a lack of grounds for detention.”

A week in security (December 31, 2018 – January 6, 2019)

Last week on Labs, we looked back at 2018 as the year of data breaches, homed in on pre-installed malware on mobile devices, and profiled a malicious duo, Vidar and GandCrab.

Other cybersecurity news

  • 2019’s first data breach: It took less than 24 hours. An unauthorized third party downloaded the details of 30,000 public servants in Victoria, Australia. It was believed that a government employee was phished prior to the breach. (Source: CBR Online)
  • Dark Overlord hackers release alleged 9/11 lawsuit documents. The hacker group known as The Dark Overlord (TDO) targeted law firms and banks related to the 9/11 attack. TDO has a history of releasing stolen information after receiving payment for its extortions. (Source: Sophos’ Naked Security Blog)
  • Data of 2.4 million Blur password manager users left exposed online. 2.4 million users of the password manager Blur were affected by a data breach that happened in mid-December of last year and was publicly revealed on New Year’s Eve. No passwords stored in the manager were exposed. (Source: ZDNet)
  • Hacker leaked data on Angela Merkel and hundreds of German lawmakers. A hacker leaked sensitive information, which includes email addresses and phone numbers, of Angela Merkel, senior German lawmakers, and other political figures on Twitter. The account was suspended following this incident. (Source: TechCrunch)
  • Hackers seize dormant Twitter accounts to push terrorist propaganda. Dormant Twitter accounts are being hacked and used to further push terrorist propaganda via the platform. It’s easy for these hackers to guess the email addresses of these accounts since Twitter, by default, reveals partly-concealed addresses which clue them in. (Source: Engadget)
  • MobSTSPY spyware weaseled its way into Google Play. Another spyware app made its way into Google Play and onto the mobile devices of thousands of users. The malware steals SMS messages, call logs, contact lists, and other files. (Source: SC Magazine UK)
  • Apple phone phishing scams getting better. A new phone-based scam targeting iPhone users was perceived to likely fool many because the scammer’s fake call is lumped together with a record of legitimate calls from Apple Support. (Source: KrebsOnSecurity)
  • Staying relevant in an increasingly cyber world. Small- to medium-sized businesses may not have the upper hand when it comes to hiring people with talent in cybersecurity, but this shouldn’t be an organization’s main focus. Dr. Kevin Harris, program director of cybersecurity for the American Military University, advised that employers must focus on giving all their employees “cyber skills.” (Source: Federal News Network)
  • Adobe issues emergency patch following December miss. Adobe released an out-of-band patch to address critical vulnerabilities in Acrobat and Reader. (Source: Dark Reading)

Stay safe, everyone!


Hackers seize dormant Twitter accounts to push terrorist propaganda

As much progress as Twitter has made kicking terrorists off its platform, it still has a long way to go. TechCrunch has learned that ISIS supporters are hijacking long-dormant Twitter accounts to promote their ideology. Security researcher WauchulaGhost found that the extremists were using a years-old trick to get in. Many of these idle accounts used email addresses that either expired or never existed, often with names identical to their Twitter handles -- the social site didn't confirm email addresses for roughly a decade, making it possible to use the service without a valid inbox. As Twitter only partly masks those addresses, it's easy to create those missing addresses and reset those passwords.

Source: TechCrunch

Project Lakhta: Putin’s Chef spends $35M on social media influence

Project Lakhta is the name of a Russian project that was further documented by the Department of Justice last Friday in the form of sharing a Criminal Complaint against Elena Alekseevna Khusyaynova, said to be the accountant in charge of running a massive organization designed to inject distrust and division into the American elections and American society in general.

https://www.justice.gov/opa/press-release/file/1102316/download
In a fairly unusual step, the 39-page Criminal Complaint against Khusyaynova, filed just last month in Alexandria, Virginia, has already been unsealed, prior to any indictment or specific criminal charges being brought against her before a grand jury. US Attorney G. Zachary Terwilliger says, "The strategic goal of this alleged conspiracy, which continues to this day, is to sow discord in the U.S. political system and to undermine faith in our democratic institutions."

The data shared below, intended to summarize the 39-page criminal complaint, contains many direct quotes from the document, which has been shared by the DOJ. (Click for the full Criminal Complaint against Elena Khusyaynova.)

The complaint shows that since May 2014 the following organizations were used as cover to spread distrust toward candidates for political office and the political system in general.

Internet Research Agency LLC ("IRA")
Internet Research LLC
MediaSintez LLC
GlavSet LLC
MixInfo LLC
Azimut LLC
NovInfo LLC
Nevskiy News LLC ("NevNov")
Economy Today LLC
National News LLC
Federal News Agency LLC ("FAN")
International News Agency LLC ("MAN")

These entities employed hundreds of individuals in support of Project Lakhta's operations with an annual global budget of millions of US dollars.  Only some of their activity was directed at the United States.

Prigozhin and Concord 

Concord Management and Consulting LLC and Concord Catering (collectively referred to as "Concord") are related Russian entities with various Russian government contracts.  Concord was the primary source of funding for Project Lakhta, controlling funding, recommending personnel, and overseeing activities through reporting and interaction with the management of various Project Lakhta entities.

Yevgeniy Viktorovich Prigozhin is a Russian oligarch closely identified with Russian President Vladimir Putin.  He began his career in the food and restaurant business and is sometimes referred to as "Putin's Chef."  Concord has Russian government contracts to feed school children and the military.

Prigozhin was previously indicted, along with twelve others and three Russian companies, for committing federal crimes while seeking to interfere with US elections and the political process, including the 2016 presidential election.

Project Lakhta internally referred to their work as "information warfare against the United States of America" which was conducted through fictitious US personas on social media platforms and other Internet-based media.

Lakhta has a management group which organized the project into departments, including a design and graphics department, an analysts department, a search-engine optimization ("SEO") department, an IT department and a finance department.

Khusyaynova has been the chief accountant of Project Lakhta's finance department since April 2014, handling the budgets of most or all of the previously named organizations. She submitted hundreds of financial vouchers, budgets, and payment requests for the Project Lakhta entities. The money was managed through at least 14 bank accounts belonging to other Project Lakhta affiliates, including:

Glavnaya Liniya LLC
Merkuriy LLC
Obshchepit LLC
Potentsial LLC
RSP LLC
ASP LLC
MTTs LLC
Kompleksservis LLC
SPb Kulinariya LLC
Almira LLC
Pishchevik LLC
Galant LLC
Rayteks LLC
Standart LLC

Project Lakhta Spending 

Khusyaynova provided monthly spending reports to Concord covering at least the period from January 2016 through July 2018.

A document sent in January 2017 included the projected budget for February 2017 (60 million rubles, or roughly $1 million USD) and an accounting of spending for all of calendar year 2016 (720 million rubles, or roughly $12 million USD). Expenses included:

Registration of domain names
Purchasing proxy servers
Social media marketing expenses, including:
 - purchasing posts for social networks
 - advertisements on Facebook
 - advertisements on VKontakte
 - advertisements on Instagram
 - promoting posts on social networks

Other expenses were for Activists, Bloggers, and people who "developed accounts" on Twitter to promote online videos.

In January 2018, the "annual report" for 2017 showed 733 million Russian rubles of expenditure ($12.2M USD).

More recent expenses, between January 2018 and June 2018, included more than $60,000 in Facebook ads, and $6,000 in Instagram ads, as well as $18,000 for Bloggers and Twitter account developers.

Project Lakhta Messaging

From December 2016 through May 2018, Lakhta analysts and activists spread messages "to inflame passions on a wide variety of topics" including:
  • immigration
  • gun control and the Second Amendment 
  • the Confederate flag
  • race relations
  • LGBT issues 
  • the Women's March 
  • and the NFL national anthem debate.


Events in the United States were seized upon "to anchor their themes" including the Charleston church shootings, the Las Vegas concert shootings, the Charlottesville "Unite the Right" rally, police shootings of African-American men, and the personnel and policy decisions of the Trump administration.

Many of the graphics that were shared will be immediately recognizable to most social media users.

"Rachell Edison" Facebook profile
The graphic above was shared by a confirmed member of the conspiracy on December 5, 2016. "Rachell Edison" was a Facebook profile controlled by someone on the Project Lakhta payroll. Their comment read: "Whatever happens, blacks are innocent. Whatever happens, it's all guns and cops. Whatever happens, it's all racists and homophobes. Mainstream Media..."

The Rachell Edison account was created in September 2016 and controlled the Facebook page "Defend the 2nd".  Between December 2016 and May 2017, "while concealing its true identity, location, and purpose" this account was used to share over 700 inflammatory posts related to gun control and the Second Amendment.

Other accounts specialized in other themes. Another account, using the name "Bertha Malone", was created in June 2015, using fake information to claim that the account holder lived in New York City and attended a university in NYC. In January 2016, the account created a Facebook page called "Stop All Invaders" (StopAI), which shared over 400 hateful anti-immigration and anti-Islam memes, implying that all immigrants were either terrorists or criminals. Posts shared by this account reached 1.3 million individuals, and at least 130,851 people directly engaged with the content (for example, by liking, sharing, or commenting on materials that originated from this account).

Some examples of the hateful posts shared by "Bertha Malone" that were included in the DOJ criminal complaint:




The latter image was accompanied by the comment:

"Instead this stupid witch hunt on Trump, media should investigate this traitor and his plane to Islamize our country. If you are true enemy of America, take a good look at Barack Hussein Obama and Muslim government officials appointed by him."

Directions to Project Lakhta Team Members


The directions shared with the propaganda spreaders gave very specific examples of how to influence American thought, with guidance on which sources and techniques should be used to influence particular portions of our society. For example, to further drive wedges into the Republican party, Republicans who spoke out against Trump were attacked in social media
(all of these are marked in the Criminal Complaint as "preliminary translations of Russian text"):

"Brand McCain as an old geezer who has lost it and who long ago belonged in a home for the elderly. Emphasize that John McCain's pathological hatred towards Donald Trump and towards all his initiatives crosses all reasonable borders and limits.  State that dishonorable scoundrels, such as McCain, immediately aim to destroy all the conservative voters' hopes as soon as Trump tries to fulfill his election promises and tries to protect the American interests."

"Brand Paul Ryan a complete and absolute nobody incapable of any decisiveness.  Emphasize that while serving as Speaker, this two-faced loudmouth has not accomplished anything good for America or for American citizens.  State that the only way to get rid of Ryan from Congress, provided he wins in the 2018 primaries, is to vote in favor of Randy Brice, an American veteran and an iron worker and a Democrat."

Frequently the guidance related to a particular news headline, with directions on how to use the headline to spread their message of division. A couple of examples:

After a news story "Trump: No Welfare To Migrants for First 5 Years" was shared, the conspiracy was directed to twist the messaging like this:

"Fully support Donald Trump and express the hope that this time around Congress will be forced to act as the president says it should. Emphasize that if Congress continues to act like the Colonial British government did before the War of Independence, this will call for another revolution.  Summarize that Trump once again proved that he stands for protecting the interests of the United States of America."

In response to an article about scandals in the Robert Mueller investigation, the direction was to use this messaging:

"Special prosecutor Mueller is a puppet of the establishment. List scandals that took place when Mueller headed the FBI.  Direct attention to the listed examples. State the following: It is a fact that the Special Prosector who leads the investigation against Trump represents the establishment: a politician with proven connections to the U.S. Democratic Party who says things that should either remove him from his position or disband the entire investigation commission. Summarize with a statement that Mueller is a very dependent and highly politicized figure; therefore, there will be no honest and open results from his investigation. Emphasize that the work of this commission is damaging to the country and is aimed to declare impeachement of Trump. Emphasize that it cannot be allowed, no matter what."

Many more examples are given, some targeted at particular concepts, such as this direction regarding "Sanctuary Cities":

"Characterize the position of the Californian sanctuary cities along with the position of the entire California administration as absolutely and completely treacherous and disgusting. Stress that protecting an illegal rapist who raped an American child is the peak of wickedness and hypocrisy. Summarize in a statement that "sanctuary city" politicians should surrender their American citizenship, for they behave as true enemies of the United States of America"

Some more basic guidance shared by Project Lakhta was about how to target conservatives vs. liberals, such as "if you write posts in a liberal group, you must not use Breitbart titles.  On the contrary, if you write posts in a conservative group, do not use Washington Post or BuzzFeed's titles."

We see the "headline theft" implied by this in some of their memes.  For example, this Breitbart headline:


Became this Project Lakhta meme (shared by Stop All Immigrants):


Similarly this meme originally shared as a quote from the Heritage Foundation, was adopted and rebranded by Lakhta-funded "Stop All Immigrants": 



Twitter Messaging and Specific Political Races

Many Twitter accounts shown to be controlled by paid members of the conspiracy were making very specific posts in support of or in opposition to particular candidates for Congress or Senate.  Some examples listed in the Criminal Complaint include:

@CovfefeNationUS posting:

Tell us who you want to defeat!  Donate $1.00 to defeat @daveloebsack Donate $2.00 to defeat @SenatorBaldwin Donate $3.00 to defeat @clairecmc Donate $4.00 to defeat @NancyPelosi Donate $5.00 to defeat @RepMaxineWaters Donate $6.00 to defeat @SenWarren

Several of the Project Lakhta Twitter accounts got involved in the Alabama Senate race, but, underscoring that the objective of Lakhta is to CREATE DISSENT AND DISTRUST, they actually tweeted on opposite sides of the campaign:

One Project Lakhta Twitter account, @KaniJJackson, posted on December 12, 2017: 

"Dear Alabama, You have a choice today. Doug Jones put the KKK in prison for murdering 4 young black girls.  Roy Moore wants to sleep with your teenage daughters. This isn't hard. #AlabamaSenate"

while on the same day @JohnCopper16, also a confirmed Project Lakhta Twitter account, tweeted:

"People living in Alabama have different values than people living in NYC. They will vote for someone who represents them, for someone who they can trust. Not you.  Dear Alabama, vote for Roy Moore."

@KaniJJackson was a very active voice for Lakhta.  Here are some additional tweets for that account:

"If Trump fires Robert Mueller, we have to take to the streets in protest.  Our democracy is at stake." (December 16, 2017)

"Who ended DACA? Who put off funding CHIP for 4 months? Who rejected a deal to restore DACA? It's not #SchumerShutdown. It's #GOPShutdown." (January 19, 2018)

@JohnCopper16 also tweeted on that topic: 
"Anyone who believes that President Trump is responsible for #shutdown2018 is either an outright liar or horribly ignorant. #SchumerShutdown for illegals. #DemocratShutdown #DemocratLosers #DemocratsDefundMilitary #AlternativeFacts"   (January 20, 2018)

@KaniJJackson on Parkland, Florida and the 2018 Midterm election: 
"Reminder: the same GOP that is offering thoughts and prayers today are the same ones that voted to allow loosening gun laws for the mentally ill last February.  If you're outraged today, VOTE THEM OUT IN 2018. #guncontrol #Parkland"

They even tweet about themselves, as shown in this pair of tweets!

@JemiSHaaaZzz (February 16, 2018):
"Dear @realDonaldTrump: The DOJ indicted 13 Russian nationals at the Internet Research Agency for violating federal criminal law to help your campaign and hurt other campaigns. Still think this Russia thing is a hoax and a witch hunt? Because a lot of witches just got indicted."

@JohnCopper16 (February 16, 2018): 
"Russians indicted today: 13  Illegal immigrants crossing Mexican border indicted today: 0  Anyway, I hope all those Internet Research Agency f*ckers will be sent to gitmo." 

The Russians are also involved in "getting out the vote" - especially of those who hold strongly divisive views:

@JohnCopper16 (February 27, 2018):
"Dem2018 platform - We want women raped by the jihadists - We want children killed - We want higher gas prices - We want more illegal aliens - We want more Mexican drugs And they are wondering why @realDonaldTrump became the President"

@KaniJJackson (February 19, 2018): 
"Midterms are 261 days, use this time to: - Promote your candidate on social media - Volunteer for a campaign - Donate to a campaign - Register to vote - Help others register to vote - Spread the word We have only 261 days to guarantee survival of democracy. Get to work! 

More recent tweets have been on a wide variety of topics, with other accounts expressing strong views around racial tensions, and then speaking to the Midterm elections: 

@wokeluisa (another confirmed Project Lakhta account): 
"Just a reminder that: - Majority black Flint, Michigan still has drinking water that will give you brain damage if consumed - Republicans are still trying to keep black people from voting - A terrorist has been targeting black families for assassination in Austin, Texas" 

and then, also @wokeluisa: (March 19, 2018): 
"Make sure to pre-register to vote if you are 16 y.o. or older. Don't just sit back, do something about everything that's going on because November 6, 2018 is the date that 33 senate seats, 436 seats in the House of Representatives and 36 governorships will be up for re-election." 

And from @johncopper16 (March 22, 2018):
"Just a friendly reminder to get involved in the 2018 Midterms. They are motivated They hate you They hate your morals They hate your 1A and 2A rights They hate the Police They hate the Military They hate YOUR President" 

Some of the many additional Twitter accounts controlled by the conspiracy mentioned in the Criminal Complaint: 

@UsaUsafortrump, @USAForDTrump, @TrumpWithUSA, @TrumpMov, @POTUSADJT, @imdeplorable201, @swampdrainer659, @maga2017trump, @TXCowboysRawk, @covfefeNationUS, @wokeluisa (2,000 tweets and at least 55,000 followers), @JohnCopper16, @Amconvoice, @TheTrainGuy13, @KaniJJackson, @JemiSHaaaZzz 




Cyber Security Roundup for May 2018

I'm sure the release of the GDPR on 25th May hasn't escaped anyone's attention. After years of warnings about the EU parliament's intended tough stance on enforcing the human right to privacy in the digital realm, a real 'game changer' of a global privacy regulation has finally landed, one which impacts any organisation that touches EU citizens' personal data.

The GDPR's potentially hefty financial penalties for breaching its requirements are firmly on the radar of directors at large enterprises and small businesses alike, hence the massive barrage of emails we have all received in recent weeks about changes to company privacy statements and requests for consent, many of which I noted were not GDPR compliant in how they sought "explicit consent" from the data subject. So, based on what I've seen in my mailbox so far, there is a long way to go before many organisations reach a truly GDPR compliant state.

Cybercriminals have been quick to take advantage of the deluge of GDPR privacy emails, using the subject matter in their phishing attacks to trick victims into handing over access to their accounts.
On a positive GDPR note, also on 25th May, IBM developerWorks released a three-part guidance series I wrote, aimed at helping application developers build GDPR compliant applications.

Developing GDPR Compliant Applications Guidance

Overshadowed by the GDPR coming into force were the release of the new NHS Data Security and Protection Toolkit, aimed at the NHS and its service providers, and the European NIS Directive (for telecom providers); both went under the radar, but they are significant to those working in those industries.

Always make sure your broadband router/hub does not permit remote administrative access (over the internet) and is always kept up to date with the latest security patches; otherwise, it will be at serious risk of being hacked and remotely controlled by cyber-criminals. As evidenced this month: a DNS flaw in over 800,000 DrayTek routers allowed hackers to take them over, malware called VPNFilter infected 500,000 routers, and serious vulnerabilities were reported in TP-Link EAP controllers.

IBM made headlines after banning its workers from using USB sticks, which I think is a good and reasonable policy. Quite frankly, in any modern enterprise, whether large or small, with a decent IT infrastructure and cloud services, staff shouldn't need USB devices to move data either internally or externally with third parties. Banning all USB devices is a rather smart business and security move, as it forces staff to use the more secure and more efficient technology already made available.

As my @securityexpert Twitter account crossed the 10,000 follower threshold, Twitter advised 300 million users to reset their passwords after an internal error. Apparently, the passwords for those Twitter accounts were accidentally stored in a database as "plain text" values instead of as hashed values, as per best practice. I always strongly recommend that Twitter users take advantage of the multi-factor authentication Twitter provides, which reduces the risk of account hacking.

Breaches of note in May included a T-Mobile website bug that exposed personal customer data, an insider breach at Coca-Cola affecting 8,000 accounts, and BMW cars found to have over a dozen security vulnerabilities.

As always a busy month of new security patch releases, with Microsoft, Adobe, PHP, PGP, Google, Git, and Dell all releasing critical security updates to fix significant security flaws. Click the links for the full details.

Analysis of DDoS attacks at Cloudflare has revealed that while organisations in the UK have certainly upped their spending on DDoS mitigation, cyber-criminals are now responding by switching to Layer 7 based DDoS attacks.
There were also some interesting articles about the Welsh cyber security revolution and a review of the NHS a year on from the WannaCry outbreak.

Reports of interest this month include the Thales Data Threat Report, which found UK businesses to be the most breached in Europe; the LastPass Psychology of Passwords Report, which found 59% of people surveyed used the same passwords across multiple accounts, despite 91% of them knowing that using the same password for multiple accounts is a security risk; and the 2017 Cylance Report, which stated that the number of cyber-attacks on industries such as healthcare, manufacturing, professional services, and education rose by about 13.4% between 2016 and 2017.


Pr0nbots2: Revenge Of The Pr0nbots

A month and a half ago I posted an article in which I uncovered a series of Twitter accounts advertising adult dating (read: scam) websites. If you haven’t read it yet, I recommend taking a look at it before reading this article, since I’ll refer back to it occasionally.

To start with, let’s recap. In my previous research, I used a script to recursively query Twitter accounts for specific patterns, and found just over 22,000 Twitter bots using this process. This figure was based on the fact that I concluded my research (stopped my script) after querying only 3000 of the 22,000 discovered accounts. I have a suspicion that my script would have uncovered a lot more accounts, had I let it run longer.

This week, I decided to re-query all the Twitter IDs I found in March, to see if anything had changed. To my surprise, I was only able to query 2895 of the original 21964 accounts, indicating that Twitter has taken action on most of those accounts.

In order to find out whether the culled accounts were deleted or suspended, I wrote a small python script that utilized the requests module to directly query each account's URL. If the script encountered a 404 error, it indicated that the account had been removed or renamed; any other response indicated that the account was suspended. Of the 19069 culled accounts checked, 18932 were suspended, and 137 were deleted/renamed.
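
In case it's useful, here's a minimal sketch of that kind of check using requests. The handles in the example list are placeholders, and the 404-versus-anything-else logic simply mirrors the heuristic described above rather than any official Twitter API.

import requests

def classify_account(screen_name):
    # Per the heuristic above: a 404 means the account was removed or renamed,
    # any other response is treated as a suspended account.
    r = requests.get("https://twitter.com/" + screen_name, timeout=10)
    if r.status_code == 404:
        return "deleted_or_renamed"
    return "suspended"

suspended = 0
deleted = 0
for name in ["example_handle_1", "example_handle_2"]:  # placeholder handles
    if classify_account(name) == "suspended":
        suspended += 1
    else:
        deleted += 1

print("suspended:", suspended, "deleted/renamed:", deleted)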

I also checked the surviving accounts in a similar manner, using requests to identify which ones were “restricted” (by checking for specific strings in the html returned from the query). Of the 2895 surviving accounts, 47 were set to restricted and the other 2848 were not.

As noted in my previous article, the accounts identified during my research had creation dates ranging from a few days old to over a decade in age. I checked the creation dates of both the culled set and the survivor’s set (using my previously recorded data) for patterns, but I couldn’t find any. Here they are, for reference:

Based on the connectivity I recorded between the original bot accounts, I’ve created a new graph visualization depicting the surviving communities. Of the 2895 survivors, only 402 presumably still belong to the communities I observed back then. The rest of the accounts were likely orphaned. Here’s a representation of what the surviving communities might look like, if the entity controlling these accounts didn’t make any changes in the meantime.

By the way, I’m using Gephi to create these graph visualizations, in case you were wondering.

Erik Ellason (@slickrockweb) contacted me recently with some evidence that the bots I’d discovered might be re-tooling. He pointed me to a handful of accounts that contained the shortened URL in a pinned tweet (instead of in the account’s description). Here’s an example profile:

Fetching a user object using the Twitter API will also return the last tweet that account published, but I’m not sure it would necessarily return the pinned Tweet. In fact, I don’t think there’s a way of identifying a pinned Tweet using the standard API. Hence, searching for these accounts by their promotional URL would be time consuming and problematic (you’d have to iterate through their tweets).
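
To illustrate the point about user objects, here's a tiny sketch using Tweepy. It assumes an already authenticated tweepy.API object named api (as in my other scripts), and the screen name is purely illustrative; note that nothing in the returned object tells you whether the embedded tweet is the pinned one.

# Assumes `api` is an authenticated tweepy.API object; the handle is illustrative.
user = api.get_user(screen_name="some_account")

# The user object embeds that account's most recent tweet (when one exists)
# as `status`, but nothing indicates whether that tweet is pinned.
if hasattr(user, "status"):
    print(user.status.created_at, user.status.text[:80])
else:
    print("account has no tweets")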

Fortunately, automating discovery of Twitter profiles similar to those Erik showed me was fairly straightforward. Like the previous botnet, the accounts could be crawled due to the fact that they follow each other. Also, all of these new accounts had text in their descriptions that followed a predictable pattern. Here's an example of a few of those sentences:

look url in last post
go on link in top tweet
go at site in last post

It was trivial to construct a simple regular expression to find all such sentences:

desc_regex = "(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"
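
As a quick sanity check, the regular expression matches all three of the example description sentences quoted above:

import re

desc_regex = "(look|go on|go at|see|check|click) (url|link|site) in (top|last) (tweet|post)"
pattern = re.compile(desc_regex)

samples = [
    "look url in last post",
    "go on link in top tweet",
    "go at site in last post",
]
for s in samples:
    print(s, "->", bool(pattern.search(s)))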

I modified my previous script to include the above regular expression, seeded it with the handful of accounts that Erik had provided me, and let it run. After 24 hours, my new script had identified just over 20,000 accounts. Mapping the follower/following relationships between these accounts gave me the following graph:

As we zoom in, you'll notice that these accounts are way more connected than the older botnet. The 20,000 or so accounts identified at this point map to just over 100 separate communities. With roughly the same number of accounts, the previous botnet contained over 1000 communities.

Zooming in further shows the presence of “hubs” in each community, similar to in our previous botnet.

Given that this botnet showed a greater degree of connectivity than the previous one studied, I decided to continue my discovery script and collect more data. The discovery rate of new accounts slowed slightly after the first 24 hours, but remained steady for the rest of the time it was running. After 4 days, my script had found close to 44,000 accounts.

And eight days later, the total was just over 80,000.

Here’s another way of visualizing that data:


Here’s the size distribution of communities detected for the 80,000 node graph. Smaller community sizes may indicate places where my discovery script didn’t yet look. The largest communities contained over 1000 accounts. There may be a way of searching more efficiently for these accounts by prioritizing crawling within smaller communities, but this is something I’ve yet to explore.

I shut down my discovery script at this point, having queried just over 30,000 accounts. I’m fairly confident this rabbit hole goes a lot deeper, but it would have taken weeks to query the next 50,000 accounts, not to mention the countless more that would have been added to the list during that time.

As with the previous botnet, the creation dates of these accounts spanned over a decade.

Here’s the oldest account I found.

Using the same methodology I used to analyze the survivor accounts from the old botnet, I checked which of these new accounts were restricted by Twitter. There was an almost exactly even split between restricted and non-restricted accounts in this new set.

Given that these new bots show many similarities to the previously discovered botnet (similar avatar pictures, same URL shortening services, similar usage of the English language) we might speculate that this new set of accounts is being managed by the same entity as those older ones. If this is the case, a further hypothesis is that said entity is re-tooling based on Twitter’s action against their previous botnet (for instance, to evade automation).

Because these new accounts use a pinned Tweet to advertise their services, we can test this hypothesis by examining the creation dates of the most recent Tweet from each account. If the entity is indeed re-tooling, all of the accounts should have Tweeted fairly recently. However, a brief examination of last tweet dates for these accounts revealed a rather large distribution, tracing back as far as 2012. The distribution had a long tail, with a majority of the most recent Tweets having been published within the last year. Here’s the last year’s worth of data graphed.

Here’s the oldest Tweet I found:

This data, on its own, would refute the theory that the owner of this botnet has recently been retooling. However, a closer look at some of the discovered accounts reveals an interesting story. Here are a few examples.

This account took a 6 year break from Twitter, and switched language to English.

This account mentions a “url in last post” in its bio, but there isn’t one.

This account went from posting in Korean to posting in English, with a 3 year break in between. However, the newer Tweet mentions “url in bio”. Sounds vaguely familiar.

Examining the text contained in the last Tweets from these discovered accounts revealed around 76,000 unique Tweets. Searching these Tweets for links containing the URL shortening services used by the previous botnet revealed 8,200 unique Tweets. Here’s a graph of the creation dates of those particular Tweets.

As we can see, the Tweets containing shortened URLs date back only 21 days. Here’s a distribution of domains seen in those Tweets.

My current hypothesis is that the owner of the previous botnet has purchased a batch of Twitter accounts (of varying ages) and has been, at least for the last 21 days, repurposing those accounts to advertise adult dating sites using the new pinned-Tweet approach.

One final thing – I checked the 2895 survivor accounts from the previously discovered botnet to see if any had been reconfigured to use a pinned Tweet. At the time of checking, only one of those accounts had been changed.

If you’re interested in looking at the data I collected, I’ve uploaded names/ids of all discovered accounts, the follower/following mappings found between these accounts, the gephi save file for the 80,000 node graph, and a list of accounts queried by my script (in case someone would like to continue iterating through the unqueried accounts.) You can find all of that data in this github repo.

Marketing “Dirty Tinder” On Twitter

About a week ago, a Tweet I was mentioned in received a dozen or so “likes” over a very short time period (about two minutes). I happened to be on my computer at the time, and quickly took a look at the accounts that generated those likes. They all followed a similar pattern. Here’s an example of one of the accounts’ profiles:

This particular avatar was very commonly used as a profile picture in these accounts.

All of the accounts I checked contained similar phrases in their description fields. Here’s a list of common phrases I identified:

  • Check out
  • Check this
  • How do you like my site
  • How do you like me
  • You love it harshly
  • Do you like fast
  • Do you like it gently
  • Come to my site
  • Come in
  • Come on
  • Come to me
  • I want you
  • You want me
  • Your favorite
  • Waiting you
  • Waiting you at

All of the accounts also contained links to URLs in their description field that pointed to domains such as the following:

  • me2url.info
  • url4.pro
  • click2go.info
  • move2.pro
  • zen5go.pro
  • go9to.pro

It turns out these are all shortened URLs, and the service behind each of them has the exact same landing page:

“I will ban drugs, spam, porn, etc.” Yeah, right.

My colleague, Sean, checked a few of the links and found that they landed on “adult dating” sites. Using a VPN to change the browser’s exit node, he noticed that the landing pages varied slightly by region. In Finland, the links ended up on a site called “Dirty Tinder”.

Checking further, I noticed that some of the accounts either followed, or were being followed by, other accounts with similar traits, so I decided to write a script to programmatically "crawl" this network, in order to see how large it is.

The script I wrote was rather simple. It was seeded with the dozen or so accounts that I originally witnessed, and was designed to iterate friends and followers for each user, looking for other accounts displaying similar traits. Whenever a new account was discovered, it was added to the query list, and the process continued. Of course, due to Twitter API rate limit restrictions, the whole crawler loop was throttled so as to not perform more queries than the API allowed for, and hence crawling the network took quite some time.
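
A stripped-down sketch of that crawl loop is shown below, using Tweepy. The credential placeholders, the looks_like_bot() heuristic (here, one of the observed shortened-URL domains appearing in the profile description) and the max_accounts cap are all illustrative assumptions; the real script recorded far more detail. Tweepy's wait_on_rate_limit flag handles the API throttling mentioned above.

import re
from collections import deque

import tweepy

# Placeholder credentials - substitute your own application's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
# wait_on_rate_limit makes tweepy sleep whenever an API rate limit is hit,
# which throttles the crawl automatically.
api = tweepy.API(auth, wait_on_rate_limit=True)

# A crude "similar traits" check: one of the observed shortened-URL domains
# in the profile description. The domain list is illustrative.
SUSPICIOUS = re.compile(r"(me2url\.info|url4\.pro|click2go\.info|move2\.pro|zen5go\.pro|go9to\.pro)")

def looks_like_bot(user):
    return bool(SUSPICIOUS.search(user.description or ""))

def crawl(seed_ids, max_accounts=1000):
    queue = deque(seed_ids)
    seen = set(seed_ids)
    hits = set()
    while queue and len(hits) < max_accounts:
        uid = queue.popleft()
        try:
            # friends_ids/followers_ids each return up to 5000 ids per call.
            neighbours = api.friends_ids(user_id=uid) + api.followers_ids(user_id=uid)
        except tweepy.TweepError:
            continue
        new_ids = [n for n in neighbours if n not in seen]
        # lookup_users accepts batches of up to 100 ids.
        for i in range(0, len(new_ids), 100):
            for user in api.lookup_users(user_ids=new_ids[i:i + 100]):
                seen.add(user.id)
                if looks_like_bot(user):
                    hits.add(user.id)
                    queue.append(user.id)  # keep crawling from matching accounts
    return hits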

My script recorded a graph of which accounts were following/followed by which other accounts. After a few hours I checked the output and discovered an interesting pattern:

Graph of follower/following relationships between identified accounts after about a day of running the discovery script.

The discovered accounts seemed to be forming independent “clusters” (through follow/friend relationships). This is not what you’d expect from a normal social interaction graph.

After running for several days the script had queried about 3000 accounts, and discovered a little over 22,000 accounts with similar traits. I stopped it there. Here’s a graph of the resulting network.

Pretty much the same pattern I'd seen after one day of crawling still existed after one week. Just a few of the clusters weren't "flower" shaped. Here are a few zooms of the graph.

 

Since I’d originally noticed several of these accounts liking the same tweet over a short period of time, I decided to check if the accounts in these clusters had anything in common. I started by checking this one:

Oddly enough, there were absolutely no similarities between these accounts. They were all created at very different times and all Tweeted/liked different things at different times. I checked a few other clusters and obtained similar results.

One interesting thing I found was that the accounts were created over a very long time period. Some of the accounts discovered were over eight years old. Here’s a breakdown of the account ages:

As you can see, this group has fewer new accounts in it than older ones. That big spike in the middle of the chart represents accounts that are about six years old. One reason why there are fewer new accounts in this network is that Twitter's automation seems to be able to flag behaviors or patterns in fresh accounts and automatically restrict or suspend them. In fact, while my crawler was running, many of the accounts on the graphs above were restricted or suspended.

Here are a few more breakdowns – Tweets published, likes, followers and following.

Here’s a collage of some of the profile pictures found. I modified a python script to generate this – far better than using one of those “free” collage making tools available on the Internets. 🙂

So what are these accounts doing? For the most part, it seems they’re simply trying to advertise the “adult dating” sites linked in the account profiles. They do this by liking, retweeting, and following random Twitter accounts at random times, fishing for clicks. I did find one that had been helping to sell stuff:

Individually the accounts probably don’t break any of Twitter’s terms of service. However, all of these accounts are likely controlled by a single entity. This network of accounts seems quite benign, but in theory, it could be quickly repurposed for other tasks including “Twitter marketing” (paid services to pad an account’s followers or engagement), or to amplify specific messages.

If you’re interested, I’ve saved a list of both screen_name and id_str for each discovered account here. You can also find the scraps of code I used while performing this research in that same github repo.

How To Get Twitter Follower Data Using Python And Tweepy

In January 2018, I wrote a couple of blog posts outlining some analysis I’d performed on followers of popular Finnish Twitter profiles. A few people asked that I share the tools used to perform that research. Today, I’ll share a tool similar to the one I used to conduct that research, and at the same time, illustrate how to obtain data about a Twitter account’s followers.

This tool uses Tweepy to connect to the Twitter API. In order to enumerate a target account's followers, I like to start by using Tweepy's followers_ids() function to get a list of Twitter ids of accounts that are following the target account. This call completes in a single query, and gives us a list of Twitter ids that can be saved for later use (since both screen_name and name can be changed, but the account's id never changes). Once I've obtained a list of Twitter ids, I can use Tweepy's lookup_users(user_ids=batch) to obtain Twitter User objects for each Twitter id. As far as I know, this isn't exactly the documented way of obtaining this data, but it suits my needs. /shrug

Once a full set of Twitter User objects has been obtained, we can perform analysis on it. In the following tool, I chose to look at the account age and friends_count of each account returned, print a summary, and save a summarized form of each account’s details as json, for potential further processing. Here’s the full code:

from tweepy import OAuthHandler
from tweepy import API
from collections import Counter
from datetime import datetime
import sys
import json
import os
import io
import re
import time

# Helper functions to load and save intermediate steps
def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(json.dumps(variable, indent=4, ensure_ascii=False))

def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except:
            pass
    return ret

def try_load_or_process(filename, processor_fn, function_arg):
    # This tool only reads and writes json files, so use the json helpers directly
    load_fn = load_json
    save_fn = save_json
    if os.path.exists(filename):
        print("Loading " + filename)
        return load_fn(filename)
    else:
        ret = processor_fn(function_arg)
        print("Saving " + filename)
        save_fn(ret, filename)
        return ret

# Some helper functions to convert between different time formats and perform date calculations
def twitter_time_to_object(time_string):
    twitter_format = "%a %b %d %H:%M:%S %Y"
    match_expression = "^(.+)\s(\+[0-9][0-9][0-9][0-9])\s([0-9][0-9][0-9][0-9])$"
    match = re.search(match_expression, time_string)
    if match is not None:
        first_bit = match.group(1)
        second_bit = match.group(2)
        last_bit = match.group(3)
        new_string = first_bit + " " + last_bit
        date_object = datetime.strptime(new_string, twitter_format)
        return date_object

def time_object_to_unix(time_object):
    return int(time_object.strftime("%s"))

def twitter_time_to_unix(time_string):
    return time_object_to_unix(twitter_time_to_object(time_string))

def seconds_since_twitter_time(time_string):
    input_time_unix = int(twitter_time_to_unix(time_string))
    current_time_unix = int(get_utc_unix_time())
    return current_time_unix - input_time_unix

def get_utc_unix_time():
    dts = datetime.utcnow()
    return time.mktime(dts.timetuple())

# Get a list of follower ids for the target account
def get_follower_ids(target):
    return auth_api.followers_ids(target)

# Twitter API allows us to batch query 100 accounts at a time
# So we'll create batches of 100 follower ids and gather Twitter User objects for each batch
def get_user_objects(follower_ids):
    batch_len = 100
    num_batches = len(follower_ids) // batch_len
    batches = (follower_ids[i:i+batch_len] for i in range(0, len(follower_ids), batch_len))
    all_data = []
    for batch_count, batch in enumerate(batches):
        sys.stdout.write("\r")
        sys.stdout.flush()
        sys.stdout.write("Fetching batch: " + str(batch_count) + "/" + str(num_batches))
        sys.stdout.flush()
        users_list = auth_api.lookup_users(user_ids=batch)
        users_json = (map(lambda t: t._json, users_list))
        all_data += users_json
    return all_data

# Creates one week length ranges and finds items that fit into those range boundaries
def make_ranges(user_data, num_ranges=20):
    range_max = 604800 * num_ranges
    range_step = range_max // num_ranges

# We create ranges and labels first and then iterate these when going through the whole list
# of user data, to speed things up
    ranges = {}
    labels = {}
    for x in range(num_ranges):
        start_range = x * range_step
        end_range = x * range_step + range_step
        label = "%02d" % x + " - " + "%02d" % (x+1) + " weeks"
        labels[label] = []
        ranges[label] = {}
        ranges[label]["start"] = start_range
        ranges[label]["end"] = end_range
    for user in user_data:
        if "created_at" in user:
            account_age = seconds_since_twitter_time(user["created_at"])
            for label, timestamps in ranges.items():
                if account_age > timestamps["start"] and account_age < timestamps["end"]:
                    entry = {} 
                    id_str = user["id_str"] 
                    entry[id_str] = {} 
                    fields = ["screen_name", "name", "created_at", "friends_count", "followers_count", "favourites_count", "statuses_count"] 
                    for f in fields: 
                        if f in user: 
                            entry[id_str][f] = user[f] 
                    labels[label].append(entry) 
    return labels

if __name__ == "__main__": 
    account_list = [] 
    if (len(sys.argv) > 1):
        account_list = sys.argv[1:]

    if len(account_list) < 1:
        print("No parameters supplied. Exiting.")
        sys.exit(0)

    consumer_key=""
    consumer_secret=""
    access_token=""
    access_token_secret=""

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    auth_api = API(auth)

    for target in account_list:
        print("Processing target: " + target)

# Get a list of Twitter ids for followers of target account and save it
        filename = target + "_follower_ids.json"
        follower_ids = try_load_or_process(filename, get_follower_ids, target)

# Fetch Twitter User objects from each Twitter id found and save the data
        filename = target + "_followers.json"
        user_objects = try_load_or_process(filename, get_user_objects, follower_ids)
        total_objects = len(user_objects)

# Record a few details about each account that falls between specified age ranges
        ranges = make_ranges(user_objects)
        filename = target + "_ranges.json"
        save_json(ranges, filename)

# Print a few summaries
        print()
        print("\t\tFollower age ranges")
        print("\t\t===================")
        total = 0
        following_counter = Counter()
        for label, entries in sorted(ranges.items()):
            print("\t\t" + str(len(entries)) + " accounts were created within " + label)
            total += len(entries)
            for entry in entries:
                for id_str, values in entry.items():
                    if "friends_count" in values:
                        following_counter[values["friends_count"]] += 1
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print()
        print("\t\tMost common friends counts")
        print("\t\t==========================")
        total = 0
        for num, count in following_counter.most_common(20):
            total += count
            print("\t\t" + str(count) + " accounts are following " + str(num) + " accounts")
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print()
        print()

Let’s run this tool against a few accounts and see what results we get. First up: @realDonaldTrump

Age ranges of new accounts following @realDonaldTrump

As we can see, over 80% of @realDonaldTrump’s last 5000 followers are very new accounts (less than 20 weeks old), with a majority of those being under a week old. Here’s the top friends_count values of those accounts:

Most common friends_count values seen amongst the new accounts following @realDonaldTrump

No obvious pattern is present in this data.

Next up, an account I looked at in a previous blog post – @niinisto (the president of Finland).

Age ranges of new accounts following @niinisto

Many of @niinisto’s last 5000 followers are new Twitter accounts, although the proportion is not as large as in the @realDonaldTrump case. In both of the above cases, this is to be expected, since both accounts are recommended to new users of Twitter. Let’s look at the friends_count values for the above set.

Most common friends_count values seen amongst the new accounts following @niinisto

In some cases, clicking through the creation of a new Twitter account (next, next, next, finish) will create an account that follows 21 Twitter profiles. This can explain the high proportion of accounts in this list with a friends_count value of 21. However, we might expect to see the same (or an even stronger) pattern for the @realDonaldTrump account, and we don’t. I’m not sure why that is, but it could be that Twitter has some automation in place to auto-delete programmatically created accounts. If you look at the output of my script, you’ll see that between fetching the list of Twitter ids for the last 5000 followers of @realDonaldTrump and fetching the full Twitter User objects for those ids, 3 accounts “went missing” (hence the tool only collected data for 4997 accounts).

Finally, just for good measure, I ran the tool against my own account (@r0zetta).

Age ranges of new accounts following @r0zetta

Here you see a distribution that’s probably common for non-celebrity Twitter accounts. Not many of my followers have new accounts. What’s more, there’s absolutely no pattern in the friends_count values of these accounts:

Most common friends_count values seen amongst the new accounts following @r0zetta

Of course, there are plenty of other interesting analyses that can be performed on the data collected by this tool. Once the script has been run, all data is saved on disk as json files, so you can process it to your heart’s content without having to run additional queries against Twitter’s servers. As usual, have fun extending this tool to your own needs, and if you’re interested in reading some of my other guides or analyses, here’s a full list of those articles.
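For example, here’s a minimal sketch of that kind of follow-up processing, assuming the tool above was run against @realDonaldTrump (so the age range data ended up in realDonaldTrump_ranges.json). It reloads the saved ranges and counts how many accounts in each range follow exactly 21 others:

import io
import json

# Load the ranges file written by the tool above
filename = "realDonaldTrump_ranges.json"
with io.open(filename, "r", encoding="utf-8") as f:
    ranges = json.load(f)

# Each range label maps to a list of {id_str: {details}} entries
for label in sorted(ranges):
    count_21 = 0
    for entry in ranges[label]:
        for id_str, details in entry.items():
            if details.get("friends_count") == 21:
                count_21 += 1
    print(label + ": " + str(count_21) + " accounts following exactly 21 others")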

Searching Twitter With Twarc

Twarc makes it really easy to search Twitter via the API. Simply create a twarc object using your own API keys and then pass your search query into twarc’s search() function to get a stream of Tweet objects. Remember that, by default, the Twitter API will only return results from the last 7 days. However, this is useful enough if we’re looking for fresh information on a topic.
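In its simplest form, that workflow looks something like the following sketch (with placeholder API keys, and “pytorch” as an example query). Each item yielded by search() is a plain dict representing a Tweet, with the text in either the “text” or “full_text” field.

from twarc import Twarc

# Add your own API key values here
consumer_key=""
consumer_secret=""
access_token=""
access_token_secret=""

twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Iterate over Tweets matching the query (roughly the last 7 days of results)
for tweet in twarc.search("pytorch"):
  text = tweet.get("full_text", tweet.get("text", ""))
  print(text)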

Since this methodology is so simple, posting code for a tool that simply prints the resulting tweets to stdout would make for a boring blog post. Here I present a tool that collects a bunch of metadata from the returned Tweet objects. Here’s what it does:

  • records frequency distributions of URLs, hashtags, and users
  • records interactions between users and hashtags
  • outputs csv files that can be imported into Gephi for graphing
  • downloads all images found in Tweets
  • records each Tweet’s text along with the URL of the Tweet

The code doesn’t really need explanation, so here’s the whole thing.

from collections import Counter
from itertools import combinations
from twarc import Twarc
import requests
import sys
import os
import shutil
import io
import re
import json

# Helper functions for saving json, csv and formatted txt files
def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def save_csv(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    handle.write(u"Source,Target,Weight\n")
    for source, targets in sorted(data.items()):
      for target, count in sorted(targets.items()):
        if source != target and source is not None and target is not None:
          handle.write(source + u"," + target + u"," + unicode(count) + u"\n")

def save_text(data, filename):
  with io.open(filename, "w", encoding="utf-8") as handle:
    for item, count in data.most_common():
      handle.write(unicode(count) + "\t" + item + "\n")

# Returns the screen_name of the user retweeted, or None
def retweeted_user(status):
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        return user["screen_name"]

# Returns a list of screen_names that the user interacted with in this Tweet
def get_interactions(status):
  interactions = []
  if "in_reply_to_screen_name" in status:
    replied_to = status["in_reply_to_screen_name"]
    if replied_to is not None and replied_to not in interactions:
      interactions.append(replied_to)
  if "retweeted_status" in status:
    orig_tweet = status["retweeted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "quoted_status" in status:
    orig_tweet = status["quoted_status"]
    if "user" in orig_tweet and orig_tweet["user"] is not None:
      user = orig_tweet["user"]
      if "screen_name" in user and user["screen_name"] is not None:
        if user["screen_name"] not in interactions:
          interactions.append(user["screen_name"])
  if "entities" in status:
    entities = status["entities"]
    if "user_mentions" in entities:
      for item in entities["user_mentions"]:
        if item is not None and "screen_name" in item:
          mention = item['screen_name']
          if mention is not None and mention not in interactions:
            interactions.append(mention)
  return interactions

# Returns a list of hashtags found in the tweet
def get_hashtags(status):
  hashtags = []
  if "entities" in status:
    entities = status["entities"]
    if "hashtags" in entities:
      for item in entities["hashtags"]:
        if item is not None and "text" in item:
          hashtag = item['text']
          if hashtag is not None and hashtag not in hashtags:
            hashtags.append(hashtag)
  return hashtags

# Returns a list of URLs found in the Tweet
def get_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
      if "urls" in entities:
        for item in entities["urls"]:
          if item is not None and "expanded_url" in item:
            url = item['expanded_url']
            if url is not None and url not in urls:
              urls.append(url)
  return urls

# Returns the URLs to any images found in the Tweet
def get_image_urls(status):
  urls = []
  if "entities" in status:
    entities = status["entities"]
    if "media" in entities:
      for item in entities["media"]:
        if item is not None:
          if "media_url" in item:
            murl = item["media_url"]
            if murl not in urls:
              urls.append(murl)
  return urls

# Main starts here
if __name__ == '__main__':
# Add your own API key values here
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Check that search terms were provided at the command line
  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]
  else:
    print("No search terms provided. Exiting.")
    sys.exit(0)

  num_targets = len(target_list)
  for count, target in enumerate(target_list):
    print(str(count + 1) + "/" + str(num_targets) + " searching on target: " + target)
# Create a separate save directory for each search query
# Since search queries can be a whole sentence, we'll check the length
# and simply number it if the query is overly long
    save_dir = ""
    if len(target) < 30:
      save_dir = target.replace(" ", "_")
    else:
      save_dir = "target_" + str(count + 1)
    if not os.path.exists(save_dir):
      print("Creating directory: " + save_dir)
      os.makedirs(save_dir)
# Variables for capturing stuff
    tweets_captured = 0
    influencer_frequency_dist = Counter()
    mentioned_frequency_dist = Counter()
    hashtag_frequency_dist = Counter()
    url_frequency_dist = Counter()
    user_user_graph = {}
    user_hashtag_graph = {}
    hashtag_hashtag_graph = {}
    all_image_urls = []
    tweets = {}
    tweet_count = 0
# Start the search
    for status in twarc.search(target):
# Output some status as we go, so we know something is happening
      sys.stdout.write("\r")
      sys.stdout.flush()
      sys.stdout.write("Collected " + str(tweet_count) + " tweets.")
      sys.stdout.flush()
      tweet_count += 1
    
      screen_name = None
      if "user" in status:
        if "screen_name" in status["user"]:
          screen_name = status["user"]["screen_name"]

      retweeted = retweeted_user(status)
      if retweeted is not None:
        influencer_frequency_dist[retweeted] += 1
      else:
        influencer_frequency_dist[screen_name] += 1

# Tweet text can be in either "text" or "full_text" field...
      text = None
      if "full_text" in status:
        text = status["full_text"]
      elif "text" in status:
        text = status["text"]

      id_str = None
      if "id_str" in status:
        id_str = status["id_str"]

# Assemble the URL to the tweet we received...
      tweet_url = None
      if "id_str" is not None and "screen_name" is not None:
        tweet_url = "https://twitter.com/" + screen_name + "/status/" + id_str

# ...and capture it
      if tweet_url is not None and text is not None:
        tweets[tweet_url] = text

# Record mapping graph between users
      interactions = get_interactions(status)
      if interactions is not None:
        for user in interactions:
          mentioned_frequency_dist[user] += 1
          if screen_name not in user_user_graph:
            user_user_graph[screen_name] = {}
          if user not in user_user_graph[screen_name]:
            user_user_graph[screen_name][user] = 1
          else:
            user_user_graph[screen_name][user] += 1

# Record mapping graph between users and hashtags
      hashtags = get_hashtags(status)
      if hashtags is not None:
        if len(hashtags) > 1:
          hashtag_interactions = []
# This code creates pairs of hashtags in situations where multiple
# hashtags were found in a tweet
# This is used to create a graph of hashtag-hashtag interactions
          for comb in combinations(sorted(hashtags), 2):
            hashtag_interactions.append(comb)
          if len(hashtag_interactions) > 0:
            for inter in hashtag_interactions:
              item1, item2 = inter
              if item1 not in hashtag_hashtag_graph:
                hashtag_hashtag_graph[item1] = {}
              if item2 not in hashtag_hashtag_graph[item1]:
                hashtag_hashtag_graph[item1][item2] = 1
              else:
                hashtag_hashtag_graph[item1][item2] += 1
          for hashtag in hashtags:
            hashtag_frequency_dist[hashtag] += 1
            if screen_name not in user_hashtag_graph:
              user_hashtag_graph[screen_name] = {}
            if hashtag not in user_hashtag_graph[screen_name]:
              user_hashtag_graph[screen_name][hashtag] = 1
            else:
              user_hashtag_graph[screen_name][hashtag] += 1

      urls = get_urls(status)
      if urls is not None:
        for url in urls:
          url_frequency_dist[url] += 1

      image_urls = get_image_urls(status)
      if image_urls is not None:
        for url in image_urls:
          if url not in all_image_urls:
            all_image_urls.append(url)

# Once the search loop for this target has finished, iterate through the
# collected image URLs, fetching each image if we haven't already
    print
    print("Fetching images.")
    pictures_dir = os.path.join(save_dir, "images")
    if not os.path.exists(pictures_dir):
      print("Creating directory: " + pictures_dir)
      os.makedirs(pictures_dir)
    for url in all_image_urls:
      m = re.search("^http:\/\/pbs\.twimg\.com\/media\/(.+)$", url)
      if m is not None:
        filename = m.group(1)
        print("Getting picture from: " + url)
        save_path = os.path.join(pictures_dir, filename)
        if not os.path.exists(save_path):
          response = requests.get(url, stream=True)
          with open(save_path, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
          del response

# Output a bunch of files containing the data we just gathered
    print("Saving data.")
    json_outputs = {"tweets.json": tweets,
                    "urls.json": url_frequency_dist,
                    "hashtags.json": hashtag_frequency_dist,
                    "influencers.json": influencer_frequency_dist,
                    "mentioned.json": mentioned_frequency_dist,
                    "user_user_graph.json": user_user_graph,
                    "user_hashtag_graph.json": user_hashtag_graph,
                    "hashtag_hashtag_graph.json": hashtag_hashtag_graph}
    for name, dataset in json_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_json(dataset, filename)

# These files are created in a format that can be easily imported into Gephi
    csv_outputs = {"user_user_graph.csv": user_user_graph,
                   "user_hashtag_graph.csv": user_hashtag_graph,
                   "hashtag_hashtag_graph.csv": hashtag_hashtag_graph}
    for name, dataset in csv_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_csv(dataset, filename)

    text_outputs = {"hashtags.txt": hashtag_frequency_dist,
                    "influencers.txt": influencer_frequency_dist,
                    "mentioned.txt": mentioned_frequency_dist,
                    "urls.txt": url_frequency_dist}
    for name, dataset in text_outputs.iteritems():
      filename = os.path.join(save_dir, name)
      save_text(dataset, filename)

Running this tool will create a directory for each search term provided at the command-line. To search for a sentence, or to include multiple terms, enclose the argument in quotes. Due to Twitter’s rate limiting, your search may hit a limit and need to pause until the rate limit resets. Luckily, twarc takes care of that automatically. Once the search is finished, a bunch of files will be written to the previously created directory.
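Since everything is written out as json, you can also reload the results later without running more queries against Twitter. Here’s a small sketch, using the pytorch search term from the examples below as an illustration (so the output landed in a pytorch/ directory), that reloads the saved hashtag frequency distribution and prints the ten most common hashtags:

import io
import json
import os
from collections import Counter

save_dir = "pytorch"
filename = os.path.join(save_dir, "hashtags.json")
with io.open(filename, "r", encoding="utf-8") as f:
  hashtag_frequency_dist = Counter(json.load(f))

for hashtag, count in hashtag_frequency_dist.most_common(10):
  print(hashtag + "\t" + str(count))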

Since I use a Mac, I can use its Quick Look functionality from the Finder to browse the output files created. Because pytorch is gaining a lot of interest, I ran my script against that search term. Here are some examples of how I can quickly view the output files.

The preview pane is enough to get an overview of the recorded data.

Pressing spacebar opens the file in Quick Look, which is useful for data that doesn’t fit neatly into the preview pane.

Importing the user_user_graph.csv file into Gephi provided me with some neat visualizations about the pytorch community.

A full zoom out of the pytorch community

Here we can see who the main influencers are. It seems that Yann LeCun and François Chollet are Tweeting about pytorch, too.

Here’s a zoomed-in view of part of the network.

Zoomed in view of part of the Gephi graph generated.

If you enjoyed this post, check out the previous two articles I published on using the Twitter API here and here. I hope you have fun tailoring this script to your own needs!

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

During the two weeks leading up to the  January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).
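For reference, the corresponding pip command would look something like this (sklearn is provided by the scikit-learn package):

pip install tensorflow gensim six numpy matplotlib scikit-learn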

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once, and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored for the raw input data. Hence, you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if w is not None and len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          for s in stopwords:
            if token == s:
              token = None
        if token is not None:
          if re.search("^[0-9\.\-\s\/]+$", token):
            token = None
        if token is not None:
          cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq

Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the same directory where this script was run:

tensorboard --logdir=save_dir

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is to find similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns the num_similar most similar words. The returned value is a list containing the queried word, and a list of similar words. ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  handle.close()
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming, it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for the 40 most similar words, and plot both labelled graphs of each set, and a “big plot” of that set on the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.

NLP Analysis And Visualizations Of #presidentinvaalit2018

During the lead-up to the January 2018 Finnish presidential elections, I collected a dataset consisting of raw Tweets gathered from search words related to the election. I then performed a series of natural language processing experiments on this raw data. The methodology, including all the code used, can be found in an accompanying blog post. This article details the results of my experiments, and shows some of the visualizations generated.

I pre-processed the raw dataset, used it to train a word2vec model, and then used that model to perform analyses using word2vec.wv.most_similar(), T-SNE, and Tensorboard.

My first experiment involved creating scatterplots of words found to be similar to frequently encountered tokens within the Twitter data. I looked at the 50 most frequent tokens encountered in this way, and used T-SNE to reduce the dimensionality of the set of vectors generated in each case. Results were plotted using matplotlib. Here are a few examples of the output generated.

T-SNE scatterplot of the 40 most similar words to #laura2018

Here you can see that word2vec easily identified other hashtags related to the #laura2018 campaign, including #suomitakaisin, #suomitakas, #siksilaura and #siksips. Laura Huhtasaari was candidate number 5 on the voting slip, and that was also identified, along with other hashtags associated with her name.

T-SNE scatterplot of the 40 most similar words to #turpo

Here’s an analysis of the hashtag #turpo (short for turvallisuuspolitiikka – National Security). Here you can see that word2vec identified many references to NATO (one issue that was touched upon during election campaigning), jäsenyys (membership), #ulpo – ulkopolitiikka (Foreign Policy), and references to regions and countries (venäjä – Russia, ruotsi – Sweden, itämeri – Baltic).

T-SNE scatterplot of the 40 most similar words to venäjä

On a similar note, here’s a scatterplot of words similar to venäjä (Russia). As expected, word2vec identified NATO in close relationship. Names of countries are expected to register as similar in word2vec, and we see Ruotsi (Sweden), Ukraine, USA, Turkki (Turkey), Syria, Kiina (China). Word2vec also finds the word Putin to be similar, and interestingly, Neuvostoliito (USSR) was mentioned in the Twitter data.

T-SNE scatterplot of the 40 most similar words to presidentti

Above is a scatterplot based on the word “presidentti” (president). Note how word2vec identified Halonen, Urho, Kekkonen, Donald, and Trump.

Moving on, I took the names of the eight presidential candidates in Sunday’s election, and plotted them, along with the 40 most similar guesses from word2vec, on scatterplots of the entire vocabulary. Here are the results.

All candidates plotted against the full vocabulary. The blue dot is the target. Red dots are similar tokens.

As you can see above, all of the candidates occupied separate spaces on the graph, and there was very little overlap amongst words similar to each candidate’s name.

I created word embeddings using Tensorflow, and opened the resulting log files in Tensorboard in order to produce some visualizations with that tool. Here are some of the outputs.

Tensorboard visualization of words related to #haavisto2018 on a 2D representation of word embeddings, dimensionally reduced using T-SNE

The above shows word vectors in close proximity to #haavisto2018, based on the embeddings I created (from the word2vec model). Here you can find references to Tavastia, a club in Helsinki where Pekka Haavisto’s campaign hosted an event on 20th January 2018. Words clearly associated with this event include liput (tickets), ilta (evening), livenä (live), and biisejä (songs). The event was called “Siksipekka”. Here’s a view of that hashtag.

Again, we see similar words, including konsertti (concert). Another nearby word vector identified was #vihreät (the green party).

In my last experiment, I compiled lists of similar words for all of the top 50 most frequent words found in the Twitter data, and recorded associations between the lists generated. I imported this data into Gephi, and generated some graphs with it.

I got interested in Gephi after recently collaborating with Erin Gallagher (@3r1nG) to visualize the data I collected on some bots found to be following Finnish recommended Twitter accounts. I highly recommend that you check out some of her other blog posts, where you’ll see some amazing visualizations. Gephi is a powerful tool, but it takes quite some time to master. As you’ll see, my attempts at using it pale in comparison to what Erin can do.

A zoomed-out view of the mapping between the 40 most similar words to the 50 most frequent words in the Twitter data collected

The above is a graph of all the words found. Larger circles indicate that a word has more other words associated with it.

A zoomed-in view of some of the candidates

Here’s a zoom-in on some of the candidates. Note that I treated hashtags as unique words, which turned out to be useful for this analysis. For reference, here are a few translations: äänestää = vote, vaalit = elections, puhuu = to speak, presitenttiehdokas = presidential candidate.

Words related to foreign policy and national security

Here is a zoomed-in view of the words associated with foreign policy and national security.

Words associated with Suomi (Finland)

Finally, here are some words associated with #suomi (Finland). Note lots of references to nature (luonto), winter (talvi), and snow (lumi).

As you might have gathered, word2vec finds interesting and fairly accurate associations between words, even in messy data such as Tweets. I plan on delving further into this area in hopes of finding some techniques that might improve the Twitter research I’ve been doing. The dataset collected during the Finnish elections was fairly small (under 150,000 Tweets). Many of the other datasets I work with are orders of magnitude larger. Hence I’m particularly interested in figuring out if there’s a way to accurately cluster Twitter data using these techniques.

How To Get Tweets From A Twitter Account Using Python And Tweepy

In this blog post, I’ll explain how to obtain data from a specified Twitter account using tweepy and Python. Let’s jump straight into the code!

As usual, we’ll start off by importing dependencies. I’ll use the datetime and Counter modules later on to do some simple analysis tasks.

from tweepy import OAuthHandler
from tweepy import API
from tweepy import Cursor
from datetime import datetime, date, time, timedelta
from collections import Counter
import sys

The next bit creates a tweepy API object that we will use to query for data from Twitter. As usual, you’ll need to create a Twitter application in order to obtain the relevant authentication keys and fill in those empty strings. You can find a link to a guide about that in one of the previous articles in this series.

consumer_key=""
consumer_secret=""
access_token=""
access_token_secret=""

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
auth_api = API(auth)

Names of accounts to be queried will be passed in as command-line arguments. I’m going to exit the script if no args are passed, since there would be no reason to continue.

account_list = []
if (len(sys.argv) > 1):
  account_list = sys.argv[1:]
else:
  print("Please provide a list of usernames at the command line.")
  sys.exit(0)

Next, let’s iterate through the account names passed and use tweepy’s API.get_user() to obtain a few details about the queried account.

if len(account_list) > 0:
  for target in account_list:
    print("Getting data for " + target)
    item = auth_api.get_user(target)
    print("name: " + item.name)
    print("screen_name: " + item.screen_name)
    print("description: " + item.description)
    print("statuses_count: " + str(item.statuses_count))
    print("friends_count: " + str(item.friends_count))
    print("followers_count: " + str(item.followers_count))

Twitter User Objects contain a created_at field that holds the creation date of the account. We can use this to calculate the age of the account, and since we also know how many Tweets that account has published (statuses_count), we can calculate the average Tweets per day rate of that account. Tweepy provides time-related values as datetime objects, which make it easy to calculate things like time deltas.

    tweets = item.statuses_count
    account_created_date = item.created_at
    delta = datetime.utcnow() - account_created_date
    account_age_days = delta.days
    print("Account age (in days): " + str(account_age_days))
    if account_age_days > 0:
      print("Average tweets per day: " + "%.2f"%(float(tweets)/float(account_age_days)))

Next, let’s iterate through the user’s Tweets using tweepy’s API.user_timeline(). Tweepy’s Cursor allows us to stream data from the query without having to manually query for more data in batches. The Twitter API will return around 3200 Tweets using this method (which can take a while). To make things quicker, and to show another example of datetime usage, we’re going to break out of the loop once we hit Tweets that are more than 30 days old. While looping, we’ll collect lists of all hashtags and mentions seen in Tweets.

    hashtags = []
    mentions = []
    tweet_count = 0
    end_date = datetime.utcnow() - timedelta(days=30)
    for status in Cursor(auth_api.user_timeline, id=target).items():
      tweet_count += 1
      if hasattr(status, "entities"):
        entities = status.entities
        if "hashtags" in entities:
          for ent in entities["hashtags"]:
            if ent is not None:
              if "text" in ent:
                hashtag = ent["text"]
                if hashtag is not None:
                  hashtags.append(hashtag)
        if "user_mentions" in entities:
          for ent in entities["user_mentions"]:
            if ent is not None:
              if "screen_name" in ent:
                name = ent["screen_name"]
                if name is not None:
                  mentions.append(name)
      if status.created_at < end_date:
        break

Finally, we’ll use Counter.most_common() to print out the ten most used hashtags and mentions.

    print
    print("Most mentioned Twitter users:")
    for item, count in Counter(mentions).most_common(10):
      print(item + "\t" + str(count))

    print
    print("Most used hashtags:")
    for item, count in Counter(hashtags).most_common(10):
      print(item + "\t" + str(count))

    print
    print "All done. Processed " + str(tweet_count) + " tweets."
    print

And that’s it. A simple tool. But effective. And, of course, you can extend this code in any direction you like.

How To Get Streaming Data From Twitter

I occasionally receive requests to share my Twitter analysis tools. After a few recent requests, it finally occurred to me that it would make sense to create a series of articles that describe how to use Python and the Twitter API to perform basic analytical tasks. Teach a man to fish, and all that.

In this blog post, I’ll describe how to obtain streaming data using Python and the Twitter API.

I’m using twarc instead of tweepy to gather data from Twitter streams. I recently switched to using twarc because it has a simpler interface than tweepy, and it handles most network errors and Twitter errors automatically.

In this article, I’ll provide two examples. The first one covers the simplest way to get streaming data from Twitter. Let’s start by importing our dependencies.

from twarc import Twarc
import sys

Next, create a twarc session. For this, you’ll need to create a Twitter application in order to obtain the relevant authentication keys and fill in those empty strings. You can find many guides on the Internet for this. Here’s one.

if __name__ == '__main__':
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

For the sake of brevity, let’s assume search terms will be passed as a list on the command-line. We’ll simply accept that list without checking its validity. Your own implementation should probably do more.

  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]

Finally, we’ll check if we have any search targets. If we do, we’ll create a search query. If not, we’ll attach to the sample stream.

  if len(target_list) > 0:
    query = ",".join(target_list)
    print "Search: " + query
    for tweet in twarc.filter(track = query):
      print_tweet(tweet)
  else:
    print "Getting 1% sample."
    for tweet in twarc.sample():
      print_tweet(tweet)

Here’s a function to print the “text” field of each tweet we receive from the stream. (If you’re assembling these snippets into a single script, place this function above the main block so it’s defined before it gets called.)

def print_tweet(status):
  if "text" in status:
    print status["text"]

And that’s it. In just over 20 lines of code, you can attach to a Twitter stream, receive Tweets, and process (or in this case, print) them.

In my second example, incoming Tweet objects will be pushed onto a queue in the main thread, while a second processing thread will pull those objects off the queue and process them. The reason we would want to separate gathering and processing into separate threads is to prevent any blocking by the processing step. Although in this example, simply printing a Tweet’s text out is unlikely to block under normal circumstances, once your processing code becomes more complex, blocking is more likely to occur. By offloading processing to a separate thread, your script should be able to handle things such as heavy Tweet volume spikes, writing to disk, communicating over the network, using machine learning models, and working with large frequency distribution maps.

As before, we’ll start by importing dependencies. We’re including threading (for multithreading), Queue (to manage a queue), and time (for time.sleep).

from twarc import Twarc
import Queue
import threading
import sys
import time

The following two functions will run in our processing thread. One will process a Tweet object. In this case, we’ll do exactly the same as in our previous example, and simply print the Tweet’s text out.

# Processing thread
def process_tweet(status):
  if "text" in status:
    print status["text"]

The other function that will run in the context of the processing thread is a function to get items that were pushed into the queue. Here’s what it looks like.

def tweet_processing_thread():
  while True:
    item = tweet_queue.get()
    process_tweet(item)
    tweet_queue.task_done()

There are also two functions in our main thread. This one implements the same logic for attaching to a Twitter stream as in our first example. However, instead of calling process_tweet() directly, it pushes tweets onto the queue.

# Main thread
def get_tweet_stream(target_list, twarc):
  if len(target_list) > 0:
    query = ",".join(target_list)
    print("Search: " + query)
    for tweet in twarc.filter(track=query):
      tweet_queue.put(tweet)
  else:
    print("Getting 1% sample.")
    for tweet in twarc.sample():
      tweet_queue.put(tweet)

Now for our main function. We’ll start by creating a twarc object, and getting command-line args (as before):

if __name__ == '__main__':
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

  target_list = []
  if len(sys.argv) > 1:
    target_list = sys.argv[1:]

Next, let’s create the queue and start our processing thread.

  tweet_queue = queue.Queue()
  thread = threading.Thread(target=tweet_processing_thread)
  thread.daemon = True
  thread.start()

Since listening to a Twitter stream is essentially an endless loop, let’s add the ability to catch ctrl-c and clean up if needed.

  while True:
    try:
      get_tweet_stream(target_list, twarc)
    except KeyboardInterrupt:
      print("Keyboard interrupt...")
      # Handle cleanup (save data, etc.)
      sys.exit(0)
    except Exception:
      print("Error. Restarting...")
      time.sleep(5)

If you want to observe a queue buildup, add a sleep into the process_tweet() function, and attach to a stream with high enough volume (such as passing “trump” as a command-line parameter). Have fun listening to Twitter streams!
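
For example, a deliberately slowed-down process_tweet() along these lines (the one-second sleep and the qsize() report are purely for illustration) will let you watch the queue grow during a volume spike:

def process_tweet(status):
  # Artificial delay so the processing thread can't keep up with the stream.
  time.sleep(1)
  if "text" in status:
    print(status["text"])
  # Report how far behind the processing thread has fallen.
  print("Queue length: " + str(tweet_queue.qsize()))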

Further Analysis Of The Finnish Themed Twitter Botnet

In a blog post I published yesterday, I detailed the methodology I have been using to discover “Finnish themed” Twitter accounts that are most likely being programmatically created. In my previous post, I called them “bots”, but for the sake of clarity, let’s refer to them as “suspicious accounts”.

These suspicious accounts all follow a subset of the recommended profiles presented to new Twitter users. In many cases, these automatically created Twitter accounts follow exactly 21 users. I pursued this line of research because it resembled a phenomenon I’d seen in the US last year. Check this post for more details about that case.

In an attempt to estimate the number of accounts created by the automated process described in my previous post, I ran the same analysis tool against a list of 114 Twitter profiles recommended to new Finnish users. Here is the list.

juhasipila
TuomasEnbuske
alexstubb
hsfi
mikko
rikurantala
yleuutiset
jatkoaika
smliiga
Valavuori
SarasvuoJari
niinisto
iltasanomat
Tami2605
KauppalehtiFi
talouselama
TeemuSel8nne
nokia
HeikelaJussi
hjallisharkimo
Linnanahde
tapio_suominen
vrantanen
meteorologit
tikitalk10
yleurheilu
JaajoLinnonmaa
hirviniemi
pvesterbacka
taloussanomat
TuomasKyr
MTVUutiset
Haavisto
SuomenKuvalehti
MikaelJungner
paavoarhinmaki
KajKunnas
SamiHedberg
VilleNiinisto
HenkkaHypponen
SaskaSaarikoski
jhiitela
Finnair
TarjaHalonen
leijonat
JollaHQ
filsdeproust
makinenantti
lottabacklund
jyrkikasvi
JethroRostedt
Ulkoministerio
valtioneuvosto
Yleisradio
annaperho
liandersson
pekkasauri
neiltyson
villetolvanen
akiriihilahti
TampereenPoika
madventures
Vapaavuori
jkekalainen
AppelsinUlla
pakalupapito
rakelliekki
kyleturris
tanelitikka
SlushHQ
arcticstartup
lindaliukas
goodnewsfinland
docventures
jasondemers5
Retee27
H_Kovalainen
ipaananen
FrenzziiiBull
ylenews
digitoday
jraitamaa
marmai
MikaVayrynen
LKomarov
ovi8
paulavesala
OsmoSoininvaara
juuuso
JaanaPelkonen
saaraaalto
yletiede
TimoHaapala
Huuhkajat
ErvastiPekka
JussiPullinen
rsiilasmaa
moia
Palloliitto
teroterotero
ARaanta31
kirsipiha
JPohjanpalo
startupsauna
aaltoes
Villebla
MariaVeitola
merjaya
MikiKuusi
MTVSportfi
EHaula
svuorikoski
andrewickstroem
kokoomus

For each account, my script saved a list of accounts suspected of being automatically created. After completing the analysis of these 114 accounts, I iterated through all collected lists in order to identify all unique account names across those lists.
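
Assuming each per-account list was saved to its own text file with one screen name per line (the file naming here is illustrative, not the exact format my script uses), the deduplication step is just a set union:

import glob

unique_accounts = set()
for path in glob.glob("suspicious_*.txt"):
  with open(path) as f:
    # Each line is one suspected screen name; the set removes duplicates
    # flagged under more than one recommended profile.
    unique_accounts.update(line.strip() for line in f if line.strip())

print("Unique suspicious accounts: " + str(len(unique_accounts)))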

Across the 114 recommended Twitter profiles, my analysis identified 5631 unique accounts. Here are the first twenty age ranges of the most recently created accounts:

Age ranges of all suspicious Twitter accounts identified by my script

It has been suggested (link in Finnish) that these accounts appeared when a popular game, Growtopia, asked its players to follow its Twitter account after a game outage, and those new accounts then started following recommended Twitter profiles (including those of Haavisto and Niinistö). To check whether this was the case, I collected a list of accounts following @growtopiagame and looked for accounts that appear both on that list and on the list of suspicious accounts collected in my previous step. Only 3 accounts matched, which suggests the accounts my analysis identified are not Growtopia players.
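
The overlap check itself is a simple set intersection. Here’s a rough sketch, assuming both lists of screen names have already been collected into text files (the filenames are illustrative):

def load_names(path):
  # One screen name per line, lower-cased so the comparison is case-insensitive.
  with open(path) as f:
    return set(line.strip().lower() for line in f if line.strip())

growtopia_followers = load_names("growtopia_followers.txt")
suspicious_accounts = load_names("suspicious_accounts.txt")

overlap = growtopia_followers & suspicious_accounts
print("Accounts on both lists: " + str(len(overlap)))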

Someone Is Building A Finnish-Themed Twitter Botnet

Finland will hold a presidential election on 28th January 2018. Campaigning has just started, and candidates are being regularly interviewed by the press and on TV. In a recent interview, one of the presidential candidates, Pekka Haavisto, mentioned that both his Twitter account and that of the current Finnish president, Sauli Niinistö, had recently been followed by a number of bot accounts. I couldn’t resist investigating this myself.

I wrote a tool to analyze a Twitter account’s followers. The Twitter API only gives me access to the last 5000 accounts that have followed a queried account. However, this was enough for me to find some interesting data.
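
As a rough sketch of the collection step, assuming a twarc 1.x client (which provides a follower_ids() helper that pages through the followers/ids endpoint, returned most-recent-first by Twitter); hydrating the IDs into full user objects is left out here:

from itertools import islice
from twarc import Twarc

# Fill in credentials exactly as in the streaming examples above.
twarc = Twarc("", "", "", "")

# Cap the generator at 5000 IDs, roughly "the last 5000 accounts to follow".
recent_follower_ids = list(islice(twarc.follower_ids("Haavisto"), 5000))
print("Collected " + str(len(recent_follower_ids)) + " follower IDs")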

As I previously wrote, newly created bulk bot accounts often look very similar. I implemented some logic in my follower analysis tool that attempts to identify bots by looking for a combination of the following:

  • Is the account still an “egg” (default profile settings, default picture, etc.)?
  • Does the account follow exactly 21 other accounts?
  • Does the account follow very few accounts (fewer than 22)?
  • Does the account have a bot-like name (a string of random characters)?
  • Does the account have zero followers?
  • Has the account tweeted zero times?

Each of the above conditions contributes to a score. If the total exceeds an arbitrary threshold, I record the name of the account; a sketch of this kind of scoring is shown below.
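
As an illustration only (the exact checks, weights, and threshold my tool uses differ; the field names come from the standard Twitter API user object), scoring a hydrated user object might look something like this:

import re

def bot_score(user):
  # 'user' is a hydrated user object (dict) from the Twitter API.
  score = 0
  if user.get("default_profile") and user.get("default_profile_image"):
    score += 2   # still an "egg": default settings and default picture
  if user.get("friends_count") == 21:
    score += 2   # follows exactly 21 accounts
  elif user.get("friends_count", 0) < 22:
    score += 1   # follows very few accounts
  if re.fullmatch(r"[A-Za-z]+\d{6,}", user.get("screen_name", "")):
    score += 1   # one crude pattern for auto-generated, bot-like names
  if user.get("followers_count") == 0:
    score += 1   # zero followers
  if user.get("statuses_count") == 0:
    score += 1   # has never tweeted
  return score

SUSPICIOUS_THRESHOLD = 4   # arbitrary cut-off

def is_suspicious(user):
  return bot_score(user) >= SUSPICIOUS_THRESHOLD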

I ran this tool against the @Haavisto and @niinisto Twitter accounts and found the following:
Matches for @Haavisto account: 399
Matches for @niinisto account: 330

In both cases, the accounts in question were, by and large, under two months old.

Account age ranges for bots following @Haavisto

Account age ranges for bots following @niinisto

Interestingly, when I checked the intersection between these two groups of bots, only 49 accounts followed both @Haavisto and @niinisto.

Checking a handful of the flagged accounts manually using the Twitter web client, I quickly noticed that they all follow a similar selection of high-profile Finnish Twitter accounts, including accounts such as:

Tuomas Enbuske (@TuomasEnbuske) – a Finnish celebrity
Riku Rantala (@rikurantala) – host of Madventures
Sauli Niinistö (@niinisto) – Finland’s current president
Juha Sipilä (@juhasipila) – Finland’s prime minister
Alexander Stubb (@alexstubb) – Former prime minister of Finland
Pekka Haavisto (@Haavisto) – presidential candidate
YLE (@yleuutiset) – Finland’s equivalent of the BBC
Kauppalehti (@KauppalehtiFi) – a popular Finnish newspaper
Ilta Sanomat (@iltasanomat) – a popular Finnish newspaper
Talous Sanomat (@taloussanomat) – a prominent financial news source
Helsingin Sanomat (@hsfi) – Helsinki’s local newspaper
Ilmatieteen laitos (@meteorologit) – Finnish weather reporting source

All the bots were following similar popular Finnish Twitter accounts, such as these.

Running the same analysis tool against Riku Rantala’s account yielded similar results. In fact, Riku has gained 660 new bot followers (although, judging by the account ages, some of them were added in previous waves).

Account age ranges for bots following @rikurantala

I have no doubt that the other accounts listed above (and a few more) have recently been followed by several hundred of these bots.

For comparison, running the same analysis against the @realDonaldTrump account found only 220 new bots. I also ran the tool against @mikko, which yielded 103 bots, and against @rsiilasmaa, which yielded only 38.

It seems someone is busy building a Finnish-themed Twitter botnet. We don’t yet know what it will be used for.

Game of 72: Myth or Reality?

I can’t pretend that, in the mid-90s, I didn’t pester my mum for a pair of Adidas popper joggers. Or that I didn’t, against my better judgement, strut around in platform sneakers in an attempt to fit in with the in-crowd. But emulating popular fashion was as far as I got. I don’t remember ever doing stupid or dangerous dares to impress my classmates. Initially I thought maybe I was just a good kid, but a quick straw poll around Smoothwall Towers showed that my colleagues don’t recall hurting themselves or anyone else for a dare either. The closest examples of a prank we could come up with between us were knock-and-run and egg-and-flour, hardly show-stopping news.
But now, teenagers seem to be taking daring games to a whole new level through social media, challenging each other to do weird and even dangerous things. Take the #cinnamonchallenge on Twitter (where you dare someone to swallow a mouthful of cinnamon powder in 60 seconds, without water). A quick check of the hashtag shows it’s still a thing today, despite initially going viral in 2013 and doctors having warned teens about the serious health implications.

Now, apparently, there’s another craze doing the rounds. #Gameof72 dares teens to go missing for 72 hours without contacting their parents. The first suspected case was reported in a local French newspaper in April, when a French student disappeared for three days and later told police she had been playing Game of 72. Then, in a separate incident on 7 May, two schoolgirls from Essex went missing for a weekend in a suspected Game of 72 disappearance. Police later issued a statement to say the girls hadn’t been playing the game.

So why, despite the small number of incidents and the absence of any actual evidence that Game of 72 is real, are parents and the authorities so panicked? Tricia Bailey from the Missing Children’s Society warned kids of the “immense and terrifying challenges they will face away from home.” And Stephen Fields, a communications coordinator at Windsor-Essex Catholic District School Board, said “it’s not cool”, and has warned students who participate that they could face suspension.

It’s completely feasible that Game of 72 is actually a myth, created by a school kid with the intention of worrying the adults. And it’s worked; social media has made it seem even worse, when in reality it’s probably not going to become an issue. I guess the truth is we’ll probably never know, unless a savvy web filtering company finds a way of making these Twitter-based games trackable at school, where peer pressure is often at its worst.

Wait a minute... we already do that. Smoothwall allows school admins to block specific words and phrases, including Twitter hashtags. Say, for instance, that students were discussing Game of 72, or any other challenge, by tweet, and that phrase had been added to the list of banned words or phrases; the school’s administrator would be alerted, and the students’ parents could be notified. Sure, it won’t stop kids getting involved in online challenges, because they could take the conversation to direct messages and we’d lose it. But, I think you’ll probably agree, the ability to track what students are saying in tweets is definitely a step in the right direction.

A new option to stem the tide of nefarious Twitter images…

Smoothwall's team of intrepid web-wranglers recently noticed a change in Twitter's behaviour. Where once it was impossible to differentiate the resources loaded from twimg.com, Twitter now uses some handy sub-domains, so we can tell the optional user-uploaded images apart from the CSS, buttons, etc.

This means it's possible to prevent Twitter loading user-content images without doing HTTPS inspection - something of a broad brush, but given the fairly hefty amount of adult content swilling around Twitter, it's far from being the worst idea!

Smoothwall users: Twitter images are considered "unmoderated image hosting" - if you had previously made some changes to unblock CSS and JS from twimg, you can probably remove those now.

Twitter – Den of Iniquity or Paragon of Virtue… or Someplace in Between?


Recently there's been some coverage of Twitter's propensity for porn. Some research has shown that one in every thousand tweets contains something pornographic. With 8662 tweets purportedly sent every second, that's quite a lot.

Now, this is not something that has escaped our notice here at Smoothwall HQ. We like to help our customers keep the web clean and tidy for their users, and mostly that means free of porn. With Twitter, that's particularly difficult: filtering isn't easy to enforce and, while we have had some reasonable results with a combination of search-term filtering and stripping certain tweets based on content, it's still not optimal. Twitter does not enforce content marking, and 140 characters is right on the cusp of being impossible to content-filter.

That said - how porn riddled is Twitter? Is there really sex round every corner? Is that little blue bird a pervert? Well, what we've found is: it's all relative.

Twitter is certainly at the gutter end of the social networks, with Tumblr giving it a decent run for boobs-per-square-inch, but the likes of Facebook are much cleaner, where even images of breastfeeding mothers have caused controversy.

Interestingly, however, our back-of-a-beermat research leads us to believe that about 40 in every 1000 websites are in some way linked to porn; these numbers come from checking a quarter of a million of the most popular sites through Smoothwall's web filter and seeing what gets tagged as porn. Meanwhile, the Huffington Post reports that 30% of all Internet traffic is porn, the biggest number thus far. However, given the tendency of porn toward video, I guess we shouldn't be shocked.

Twitter: a hard-to-filter, relatively porn-rich social network that is only doing its best to mirror the makeup of the Internet at large. As a school network admin, I would have it blocked for sure. Twitter themselves used to suggest a minimum age of 13, though that requirement quietly went away in a recent update to their terms of service.