Monthly Archives: January 2018

HotSpot Shield review: It’s fast, beautiful, and definitely not for anonymity

HotSpot Shield in brief:

  • P2P allowed: Yes
  • Business location: United States
  • Number of servers: 2,000+
  • Number of country locations: 25
  • Cost: $71.88 per year

Update: This review was updated on February 2, 2018 to add mention of AnchorFree’s transparency report.

Before people got serious about VPN encryption post-Snowden, a top choice for encrypting your Wi-Fi connection was AnchorFree’s HotSpot Shield. It was free, easy to set up, and only required that you look at some ads injected into your browsing.


Archiving the alternative press threatened by wealthy buyers

Image: An archivist at work in the stacks. (Credit: The U.S. National Archives)

Freedom of the Press Foundation is launching an online archives collection in partnership with Archive-It, a service developed by the Internet Archive to help organizations preserve online content. Our collection, focusing on news outlets we deem to be especially vulnerable to the "billionaire problem," aims to preserve sites in their entirety before their archives can be taken down or manipulated.

Archive-It collections grab snapshots of specified Web sites at a moment in time. Some institutions use Archive-It to capture collections of sites connected to particular social movements or historical events. UCLA, for example, maintains a collection documenting sites pertaining to the Occupy Wall Street protests. Another collection consists of snapshots of news and primary source Web documents from the Ukraine conflict.

To start our collection, we used Archive-It to crawl the entirety of Gawker.com, a crawl we conducted amid speculation that its archives might be purchased by a hostile party. Reported suitors have included Peter Thiel—who bankrolled the legal campaign that ultimately crushed the site—and more recently Mike Cernovich, who the site once described as a “D-list right-winger.”

We also captured a copy of L.A. Weekly shortly after its new owners—the identity of whom was initially concealed, even from its employees—restructured the operation and eliminated most of the writing jobs. At the time, one former employee published a short article titled “Who Owns L.A. Weekly,” which has since been removed from the site—though you can still view the version we captured. Since our crawl of the site, former employees have reported that stories are being "republished," validating our concerns about the integrity of the archive.

In these cases, and with all future sites added to this collection, the crawls we initiate through Archive-It will not just appear on our collection page, but will also be fed into the Internet Archive's Wayback Machine. The Wayback Machine is often the first stop for researchers seeking content that is no longer available online, so ensuring these sites are available there is an important way to reinforce the notion that this material is not irretrievably gone.

There are larger structural issues that render news outlets vulnerable to the billionaire problem. Those issues may be beyond the scope of any single organization to address. Our earlier work in this area includes gotham-grabber, which aims to limit the professional harm a vindictive media owner could do to the careers of individual journalists. We continue to extend that tool to work with additional outlets, including this weekend to cover The Toast, after its former editor reported that its archives will be shuttered; if you are a journalist who needs PDF backups of your work from archives that may not stick around, please get in touch.

Those efforts help individual journalists. But another important thing we can do to reduce the effectiveness of this kind of attack on press freedom is to commit ourselves to the wholesale preservation of threatened sites.

In this case, we seek to reduce the "upside" for wealthy individuals and organizations who would eliminate embarrassing or unflattering coverage by purchasing outlets outright. In other words, we hope that sites that can't simply be made to disappear will show some immunity to the billionaire problem.

Firestarter: Architecting Your Cloud with Accounts

Posted under: Firestarter

We are taking over our own Firestarter and kicking off a new series of discussions on cloud security… from soup to nuts (whatever that means). Each week for the next few months we will cover, in order, how to build out your cloud security program. We are taking our assessment framework and converting it into a series of discussions talking about what we find and how to avoid issues. This week we start with architecting your account structures, after a brief discussion of the impact of the Meltdown and Spectre vulnerabilities since they impact cloud (at least for now) more than your local computer.

Watch or listen:


- Rich

2018 Industry Analyst Cybersecurity Predictions

Key insights from top industry analysts to help demystify the cybersecurity landscape and reinforce critical areas of focus for organizations worldwide.


Category:

Information Security
Risk Management
Leadership Insights

Unconstitutional “ag-gag” laws criminalize journalism and insulate factory farms from accountability

In 2013, animal rescue worker Amy Meyer filmed a forklift moving a sick cow at a Utah slaughterhouse. She was arrested and slapped with a misdemeanor charge of “agricultural operation interference,” and although her case was dropped after it attracted intense media attention, she became the first person in the United States to be prosecuted under laws that ban documenting farm conditions with film or video.

Several states, in recent years, have passed so-called “ag-gag” laws, which are meant to protect the animal agriculture industry from public scrutiny by, in many cases, explicitly attempting to criminalize journalists and whistleblowers who expose its operating conditions.

Many of the politicians who have drafted and sponsored such legislation have direct ties to the industry and a vested interest in outlawing investigations, such as Representative Annette Sweeney, a former director of the Iowa Angus Association, who sponsored the Iowa “ag-gag” law. Authors of a similar bill in Minnesota that ultimately did not move forward included farm owners and a past president of the Minnesota Pork Producers Association. Here’s how the 9th Circuit Court of Appeals recently described how Idaho’s ag-gag law was drafted:

The bill was drafted by the Idaho Dairymen’s Association, a trade organization representing Idaho’s dairy industry. When the Association’s lawyer addressed legislators, he stated that one goal of the bill was “to protect Idaho farmers from wrongful interference. . . . Idaho farmers live and work spread out across the land where they’re uniquely vulnerable to interference by wrongful conduct.” Another goal was to shield the agricultural industry from undercover investigators who expose the industry to the “court of public opinion,” which destroys farmers’ reputations, results in death threats, and causes loss of customers.

The law in question explicitly outlawed entering a production facility and making “audio or video recordings of the conduct of an agricultural production facility’s operations” without the owner’s consent. This is no accident—“ag-gag” laws intentionally aim to shield one of the country’s most secretive industries from accountability and public scrutiny by attempting to criminalize undercover journalism.

Thankfully, though, many courts are now ruling them unconstitutional. A federal judge ruled last year that the Utah law under which Meyer was charged violated the First Amendment. An appeals court ruling that recently struck down key parts of the Idaho law was a broad and robust defense of press freedom and undercover journalism.

The 9th Circuit said, “The act of recording is itself an inherently expressive activity; decisions about content, composition, lighting, volume, and angles, among others, are expressive in the same way as the written word or a musical score.” The judicial panel declared in its very first sentence, “Investigative journalism has long been a fixture in the American press, particularly with regard to food safety.”

Although courts have rightly struck down many “ag-gag” laws, documenting farm conditions is still criminalized in approximately seven states. Most of the states with “ag-gag” laws still in place are those in which the animal agriculture industry is especially powerful, like Arkansas and North Carolina. In some states, such as Missouri, anyone who captures evidence of animal abuse is required to turn it over to authorities within 24 hours.

Undercover documentation and investigations of farm conditions bring the disturbing animal cruelty of the commercial food system into public consciousness. Videos such as Meyer’s can spark backlash that can both motivate individuals to act and eat differently and pressure legislators to enact policies that protect animal welfare. Even in the case that sparked the challenge to the Idaho law, “The dairy farm owner responded to the video by firing the abusive employees who were caught on camera, instituting operational protocols, and conducting an animal welfare audit at the farm.”

There are countless examples of these types of videos impacting public opinion: A 2008 investigation of a California slaughterhouse that exposed sick cows being dragged by bulldozers onto trucks stopped sick animals from entering the food supply, and a disturbing video secretly recorded at a Foster Farms slaughterhouse led to a criminal investigation of the farm. Some of the most appalling practices of the animal agriculture industry, including the force-feeding of ducks and confinement of calves raised for veal, have been outlawed due to public pressure in states like Arizona and California.

Will Potter, a plaintiff in the Idaho case and an investigative journalist who has written extensively about “ag-gag” laws, notes that despite recent victories in court, this type of legislation is evolving to become more dangerous rather than more constitutional. He told the Freedom of the Press Foundation that states like Washington are considering broad “economic terrorism” legislation that does not just target animal agriculture or factory farm investigations. Instead, this type of legislation would apply to people who expose any industry without the consent of business owners, which would have vast chilling impacts on journalists and freedom of information.

Photographing and filming information that challenges business interests serves the public interest. “Ag-gag” laws and other types of legislation that criminalize information gathering are an egregious assault on the First Amendment and press freedom, and as long as they survive, journalists and whistleblowers face the risk of prosecution for bringing important information to light.


This Security Shit’s Hard and It Ain’t Gonna Get Any Easier

Posted under: Research and Analysis

In case you couldn’t tell from the title, this line is your official EXPLICIT tag. We writers sometimes need the full spectrum of language to make a point.

Yesterday Microsoft released a patch to roll back a patch that fixed the slightly-unpatchable Intel hardware bug, because that patch causes reboots and potential data loss. Specifically, Intel’s Spectre variant 2 microcode patch is buggy. Just when we were getting a decent handle on endpoint security with well-secured operating systems and six-figure-plus bug bounties, this shit happened. Plus, we probably can’t ever fully trust our silicon or operating systems in the first place.

Information security is hard. Information security is wonderful. Working in security is magical… if you have the proper state of mind.

I decided this year would be a good one for my mid-life crisis before I miss the boat and feel left out. The problem is that my life is actually pretty damn awesome, so I think I’m just screwing up my crisis pre-requisites. I like my wife, am already in pretty good physical shape, and don’t feel the need for a new car. Which appears to knock out pretty much all my options. The best I could come up with was to re-up my paramedic certification, expired for 20 years.

After working at the paramedic level again during my deployment to Puerto Rico it felt like time to go through the process and become official again. One of my first steps was to take a week off infosec and attend a paramedic refresher class.

A refresher class is an entirely different world than initial training. It’s a room full of experienced medics who are there to knock out the list of certifications they need to maintain every two years. Quite a few of the attendees in my class started working around the same time as me in the early 1990’s. Unlike me they stuck with it full-time and racked up 25 years or more of direct field experience.

There are no illusions among experienced medics (or firefighters or cops). If you go in thinking you are there to save lives you are usually out of the job in less than five years. You can’t possibly survive mentally if you think you are there to save the world, because once you actually meet the world, you realize it doesn’t want saving. The best you can usually do is offer someone a little comfort on the worst day of their life, and, maybe, sometimes help someone breathe a little longer.

You certainly aren’t going to change the string of bad life decisions that led you to their door. Bad diet, smoking, drugs, couch potatoitis, whatever. Not that everyone dials 911 as the result of seemingly irreversible decisions, but they do seem to take a disproportionate amount of our time. You either learn how to compartmentalize and survive, or process and survive, or you get another job. Even then it sometimes catches up to you and you eventually leave or kill yourself. Suicide is a very real occupational hazard.

Then there are new illnesses, antibiotic resistance, new ways of damaging the human body (vaping, exploding phones, airbags, hoverboards), the latest drug crisis, the latest drug shortage, ad infinitum. On the other side we have new drugs, new monitoring tools, new procedures, and new science.

For me this maps directly to the information security professional mindset.

As long as there are human beings and computer chips we will never win. There will never be an end. We face an endless stream of challenges and opportunities. Some years things are better. Other years things are worse. The challenge for us as professionals is to decide the role we want to play and how we want to play it.

There are EMS systems which still use proven bad techniques because someone in charge learned them, then decided they don’t want to change. Maybe due to sunk cost bias, maybe due to stubbornness. I know it was hard to learn that the technique I used to help the 14-year-old massive head injury patient 20+ years ago likely contributed to his permanent mental deficit. Not that I did anything wrong at the time, but because the science and our knowledge and understanding of the physiological mechanisms in play changed. I hurt that patient, while providing the best standard of care at the time.

Our password policies made sense at the time, but now we need to move past encoding unmemorable 8-character passwords rotated every 90 days into standards, and update our standards to reflect the widespread adoption of MFA and the latest password hashing mechanisms.
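
To make "latest password hashing mechanisms" concrete, here is a minimal sketch (an illustration, not a vetted policy) using the memory-hard scrypt KDF from Python's standard library; hashlib.scrypt requires Python 3.6+ built against OpenSSL 1.1 or newer, and the cost parameters shown are examples only.

import hashlib
import hmac
import os

def hash_password(password, salt=None):
  # Example cost parameters only; tune n/r/p for your own environment.
  salt = salt or os.urandom(16)
  digest = hashlib.scrypt(password.encode("utf-8"), salt=salt,
                          n=2**14, r=8, p=1, dklen=32)
  return salt, digest

def verify_password(password, salt, expected):
  # Recompute with the stored salt and compare in constant time.
  _, digest = hash_password(password, salt)
  return hmac.compare_digest(digest, expected)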

We need to accept that there is literally no need for a DMZ in the cloud; we just need to architect properly for the cloud.

We need to accept that Meltdown, Spectre, and whatever new hardware vulnerabilities appear are out of our control, but we still need to do our best to mitigate the risk.

The bad medics aren’t the new medics or the old medics, but the medics who can’t accept that people don’t really change, and everything else does. Security is no different. In both professions the best leaders are those who continue to push themselves and adapt without burning out permanently. This is especially true for security today, as we face the biggest technology shifts in the history of our profession, while nation-states and extremely well-funded criminals keep raising the stakes.

But there is one key difference between being a paramedic and being a security professional (beyond pay). As a paramedic I may help someone with pain during the worst 10 to 60 minutes of their life, then move on to the next call. As a security professional I can help millions, if not billions (hello Amazon, Facebook, Apple, and Google Security), at a time. I find this especially rewarding and exciting, especially as we build new products we think can have major impacts at scale – but even if that doesn’t work, I know that both my research and direct client work have touched at least tens of millions of people who will never know who I am. Maybe I only helped keep them a little safer, but a little is better than nothing.

It doesn’t end, we don’t get to relax, but now that all of society runs on technology, what we do matters, at scale – even if we can’t see it day-to-day. But we can only make this difference if we continue to learn, challenge ourselves, adapt to the ever-changing knowledge and technology around us, and avoid burnout.

As a paramedic I can help a person. As a security professional I can help a population. I hope you relish this opportunity as much as I do, for we are very fortunate to get it.

- Rich

New World, New Rules: Securing the Future State

I published an article today on the Oracle Cloud Security blog that takes a look at how approaches to information security must adapt to address the needs of the future state (of IT). For some organizations, it's really the current state. But, I like the term future state because it's inclusive of more than just cloud or hybrid cloud. It's the universe of Information Technology the way it will be in 5-10 years. It includes the changes in user behavior, infrastructure, IT buying, regulations, business evolution, consumerization, and many other factors that are all evolving simultaneously.

As we move toward that new world, our approach to security must adapt. Humans chasing down anomalies by searching through logs is an approach that will not scale and will not suffice. I included a reference in the article to a book called Afterlife. In it, the protagonist, FBI Agent Will Brody says "If you never change tactics, you lose the moment the enemy changes theirs." It's a fitting quote. Not only must we adapt to survive, we need to deploy IT on a platform that's designed for constant change, for massive scale, for deep analytics, and for autonomous security. New World, New Rules.

Here are a few excerpts:
Our environment is transforming rapidly. The assets we're protecting today look very different than they did just a few years ago. In addition to owned data centers, our workloads are being spread across multiple cloud platforms and services. Users are more mobile than ever. And we don’t have control over the networks, devices, or applications where our data is being accessed. It’s a vastly distributed environment where there’s no single, connected, and controlled network. Line-of-Business managers purchase compute power and SaaS applications with minimal initial investment and no oversight. And end-users access company data via consumer-oriented services from their personal devices. It's grown increasingly difficult to tell where company data resides, who is using it, and ultimately where new risks are emerging. This transformation is on-going and the threats we’re facing are morphing and evolving to take advantage of the inherent lack of visibility.
Here's the good news: The technologies that have exacerbated the problem can also be used to address it. On-premises SIEM solutions based on appliance technology may not have the reach required to address today's IT landscape. But, an integrated SIEM+UEBA designed from the ground up to run as a cloud service and to address the massively distributed hybrid cloud environment can leverage technologies like machine learning and threat intelligence to provide the visibility and intelligence that is so urgently needed.
Machine Learning (ML) mitigates the complexity of understanding what's actually happening and of sifting through massive amounts of activity that may otherwise appear to humans as normal. Modern attacks leverage distributed compute power and ML-based intelligence. So, countering those attacks requires a security solution with equal amounts of intelligence and compute power. As Larry Ellison recently said, "It can't be our people versus their computers. We're going to lose that war. It's got to be our computers versus their computers."
Click to read the full article: New World, New Rules: Securing the Future State.

NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”
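
As a toy illustration of that definition (separate from the pipeline described in the rest of this post, and assuming the same gensim 3.x-era API used below, where the vector size argument is still called size rather than vector_size):

import gensim.models.word2vec as w2v

# A tiny made-up corpus: each "sentence" is a list of tokens.
toy_corpus = [["cats", "chase", "mice"],
              ["dogs", "chase", "cats"],
              ["mice", "eat", "cheese"],
              ["dogs", "eat", "bones"]]

# Words that share contexts end up with nearby vectors. With a corpus this
# small the similarities are mostly noise; it only shows the API shape.
model = w2v.Word2Vec(toy_corpus, size=10, min_count=1, window=2, seed=1)
print(model.wv.most_similar("cats", topn=2))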

During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)
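
For reference, the dump step might look something like the sketch below. The input format here is assumed (one JSON object per line with a "text" field) rather than taken from the collection code, which isn't shown in this post; the output path matches the data/tweets.txt file the main routine expects.

import io
import json
import os

# Hypothetical sketch (not the collection code used for this analysis): it
# assumes the collected Tweets were saved as one JSON object per line in
# "tweets.jsonl" with a "text" field, and writes the raw text to the
# data/tweets.txt path expected by the main routine later in this post.
if not os.path.exists("data"):
  os.makedirs("data")
with io.open("tweets.jsonl", "r", encoding="utf-8") as infile, \
     io.open("data/tweets.txt", "w", encoding="utf-8") as outfile:
  for line in infile:
    line = line.strip()
    if not line:
      continue
    tweet = json.loads(line)
    text = tweet.get("text", u"").replace(u"\n", u" ").strip()
    if text:
      outfile.write(text + u"\n")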

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed the step once, and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ascii characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored for the raw input data. Hence, you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          for s in stopwords:
            if token == s:
              token = None
        if token is not None:
            if re.search("^[0-9\.\-\s\/]+$", token):
              token = None
        if token is not None:
            cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq
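
For example, on a tiny made-up corpus the returned structure looks like this:

# Illustrative only: most_common() returns (token, count) pairs,
# ordered with the most frequent token first.
example = get_word_frequencies([["vaalit", "presidentti"], ["vaalit"]])
print(example)
# [('vaalit', 2), ('presidentti', 1)]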

Now that we’re done with all preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the same directory where this script was run:

tensorboard --logdir=save_dir

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is to find similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns num_similar similar words. The returned value is a list containing the queried word, and a list of similar words. ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output
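
A hypothetical call looks like the following (the query word is just an example; the function relies on the word2vec model created in the main routine at the end of this post):

# Illustrative only: returns the queried word plus its most similar words.
result = most_similar(u"presidentti", 5)
# result has the shape [u"presidentti", [u"word1", ..., u"word5"]]
print(result)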

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary, and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query them for the 40 most similar words, and plot both labelled graphs of each set, and a “big plot” of that set on the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.

NLP Analysis And Visualizations Of #presidentinvaalit2018

During the lead-up to the January 2018 Finnish presidential elections, I collected a dataset consisting of raw Tweets gathered from search words related to the election. I then performed a series of natural language processing experiments on this raw data. The methodology, including all the code used, can be found in an accompanying blog post. This article details the results of my experiments, and shows some of the visualizations generated.

I pre-processed the raw dataset, used it to train a word2vec model, and then used that model to perform analyses using word2vec.wv.most_similar(), T-SNE, and Tensorboard.

My first experiment involved creating scatterplots of words found to be similar to frequently encountered tokens within the Twitter data. I looked at the 50 most frequent tokens encountered in this way, and used T-SNE to reduce the dimensionality of the set of vectors generated in each case. Results were plotted using matplotlib. Here are a few examples of the output generated.

T-SNE scatterplot of the 40 most similar words to #laura2018

Here you can see that word2vec easily identified other hashtags related to the #laura2018 campaign, including #suomitakaisin, #suomitakas, #siksilaura and #siksips. Laura Huhtasaari was candidate number 5 on the voting slip, and that was also identified, along with other hashtags associated with her name.

T-SNE scatterplot of the 40 most similar words to #turpo

Here’s an analysis of the hashtag #turpo (short for turvallisuuspolitiikka – national security policy). You can see that word2vec identified many references to NATO (one issue that was touched upon during election campaigning), jäsenyys (membership), #ulpo – ulkopolitiikka (foreign policy), and references to regions and countries (venäjä – Russia, ruotsi – Sweden, itämeri – the Baltic Sea).

T-SNE scatterplot of the 40 most similar words to venäjä

On a similar note, here’s a scatterplot of words similar to venäjä (Russia). As expected, word2vec identified NATO in close relationship. Names of countries are expected to register as similar in word2vec, and we see Ruotsi (Sweden), Ukraine, USA, Turkki (Turkey), Syria, Kiina (China). Word2vec also finds the word Putin to be similar, and interestingly, Neuvostoliito (USSR) was mentioned in the Twitter data.

T-SNE scatterplot of the 40 most similar words to presidentti

Above is a scatterplot based on the word “presidentti” (president). Note how word2vec identified Halonen, Urho, Kekkonen, Donald, and Trump.

Moving on, I took the names of the eight presidential candidates in Sunday’s election, and plotted them, along with the 40 most similar guesses from word2vec, on scatterplots of the entire vocabulary. Here are the results.

All candidates plotted against the full vocabulary. The blue dot is the target. Red dots are similar tokens.

As you can see above, all of the candidates occupied separate spaces on the graph, and there was very little overlap amongst words similar to each candidate’s name.

I created word embeddings using Tensorflow, and opened the resulting log files in Tensorboard in order to produce some visualizations with that tool. Here are some of the outputs.

Tensorboard visualization of words related to #haavisto2018 on a 2D representation of word embeddings, dimensionally reduced using T-SNE

The above shows word vectors in close proximity to #haavisto2018, based on the embeddings I created (from the word2vec model). Here you can find references to Tavastia, a club in Helsinki where Pekka Haavisto’s campaign hosted an event on 20th January 2018. Words clearly associated with this event include liput (tickets), ilta (evening), livenä (live), and biisejä (songs). The event was called “Siksipekka”. Here’s a view of that hashtag.

Again, we see similar words, including konsertti (concert). Another nearby word vector identified was #vihreät (the green party).

In my last experiment, I compiled lists of similar words for all of the top 50 most frequent words found in the Twitter data, and recorded associations between the lists generated. I imported this data into Gephi, and generated some graphs with it.

I got interested in Gephi after recently collaborating with Erin Gallagher (@3r1nG) to visualize the data I collected on some bots found to be following Finnish recommended Twitter accounts. I highly recommend that you check out some of her other blog posts, where you’ll see some amazing visualizations. Gephi is a powerful tool, but it takes quite some time to master. As you’ll see, my attempts at using it pale in comparison to what Erin can do.

A zoomed-out view of the mapping between the 40 most similar words to the 50 most frequent words in the Twitter data collected

The above is a graph of all the words found. Larger circles indicate that a word has more other words associated with it.

A zoomed-in view of some of the candidates

Here’s a zoom-in on some of the candidates. Note that I treated hashtags as unique words, which turned out to be useful for this analysis. For reference, here are a few translations: äänestää = vote, vaalit = elections, puhuu = to speak, presidenttiehdokas = presidential candidate.

Words related to foreign policy and national security

Here is a zoomed-in view of the words associated with foreign policy and national security.

Words associated with Suomi (Finland)

Finally, here are some words associated with #suomi (Finland). Note lots of references to nature (luonto), winter (talvi), and snow (lumi).

As you might have gathered, word2vec finds interesting and fairly accurate associations between words, even in messy data such as Tweets. I plan on delving further into this area in hopes of finding some techniques that might improve the Twitter research I’ve been doing. The dataset collected during the Finnish elections was fairly small (under 150,000 Tweets). Many of the other datasets I work with are orders of magnitude larger. Hence I’m particularly interested in figuring out if there’s a way to accurately cluster Twitter data using these techniques.

 

10 old-school security principles that (still) rule

If you're always scrambling to keep your IT infrastructure updated, you might think that newer is always better when it comes to security: new patches, new and more secure hardware, new crypto techniques, etc. But when it comes to fundamentals, some things are eternal. For instance, according to Jeff Williams, CTO and co-founder of Contrast Security, "The design principles from Saltzer and Schroeder's 1975 article 'The Protection of Information in Computer Systems' are still incredibly useful and often ignored."
