Category Archives: R

rud.is: Taking a Tour of the Pirate Ship ‘GitHub DMCA’ with R

Despite having sailed through the core components of this year’s Talk Like A Pirate Day R post time has been an enemy of late so this will be a short post that others can build off of, especially since there’s lots more knife work ground to cover from the data.

DMC-WhAt?

Since this is TLAPD, I’ll pilfer some of the explanation from GitHub itself:

The Digital Millennium Copyright Act (DMCA) <start_of_current_pilfer>“provides a safe harbor for service providers that host user-generated content. Since even a single claim of copyright infringement can carry statutory damages of up to $150,000, the possibility of being held liable for user-generated content could be very harmful for service providers. With potential damages multiplied across millions of users, cloud-computing and user-generated content sites like YouTube, Facebook, or GitHub probably never would have existed without the DMCA (or at least not without passing some of that cost downstream to their users).”

“The DMCA addresses this issue by creating a copyright liability safe harbor for internet service providers hosting allegedly infringing user-generated content. Essentially, so long as a service provider follows the DMCA’s notice-and-takedown rules, it won’t be liable for copyright infringement based on user-generated content. Because of this, it is important for GitHub to maintain its DMCA safe-harbor status.”</end_of_current_pilfer>

(I’ll save you from a long fact- and opinion-based diatribe on the DMCA, but suffice it to say it’s done far more harm than good IMO. Also, hopefully the “piracy” connection makes sense, now :-)

If your initial reaction was “What does the DMCA have to do with GitHub?” it likely (quickly) turned to “Oh…GitHub is really just a version-controlled file sharing service&hellip”. As such it has to have a robust takedown policy and process.

I don’t know if Microsoft is going to keep the practice of being open about DMCA requests now that they own GitHub nor do I know if they’ll use the same process on themselves (since, as we’ll see, they have issued DMCA requests to GitHub in the past). For now, we’ll assume they will, thus making the code from this post usable in the future to check on the status of DMCA requests over a longer period of time. But first we need the data.

Hunting for treasure in the data hoard

Unsurprisingly, GitHub stores DMCA data on GitHub. Ironically, they store it openly — in-part — to shine a light on what giant, global megacorps like Microsoft are doing. Feel free to use one of the many R packages to clone the repo, but a simple command-line git clone git@github.com:github/dmca.git is quick and efficient (not everything needs to be done from R).

The directory structure looks like this:

├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
├── 2017-02-01-RBoyApps-2.md
├── 2017-02-15-DeutscheBank.md
├── 2017-03-13-Jetbrains.md
├── 2017-06-26-Wipro-Counternotice.md
├── 2017-06-30-AdflyLink.md
├── 2017-07-28-Toontown-2.md
├── 2017-08-31-Tourzan.md
├── 2017-09-04-Random-House.md
├── 2017-09-05-RandomHouse-2.md
├── 2017-09-18-RandomHouse.md
├── 2017-09-19-Ragnarok.md
├── 2017-10-10-Broadcom.md
├── 2018
├── 2018-02-01-NihonAdSystems.md
├── 2018-03-03-TuneIn.md
├── 2018-03-16-Wabg.md
├── 2018-05-17-Packt.md
├── 2018-06-12-Suning.md
├── 2018-07-31-Pearson.md
├── CONTRIBUTING.md
├── data
└── README.md

Unfortunately, the data directory contains fools’ gold (it’s just high-level summary data).

We want DMCA filer names, repo names, file names and the DMCA notice text (though we’ll be leaving NLP projects up to the intrepid readers). For that, it will mean processing the directories of notices.

Notices are named (sadly, with some inconsistency) like this: 2018-03-15-Microsoft.md. Year, month, date and name of org. The contents are text-versions of correspondence (usually email text) that have some requirements in order to be processed. There’s also an online form one can fill out but it’s pretty much a free text field with some semblance of structure. It’s up to humans to follow that structure and — as such — there is inconsistency in the text as well. (Perhaps this is a great lesson that non-constrained inputs and human-originated filenames aren’t a great plan for curating data stores.)

You may have seen what look like takedown files in the top level of the repo. I have no idea if they are legit (since they aren’t in the structured directories) so we’ll be ignoring them.

When I took a look at the directories, some files end in .markdown but most end in .md. We’ll cover both instances (you’ll need to replace /data/github/dmca with the prefix where you stored the repo:

library(tools)
library(stringi)
library(hrbrthemes)
library(tidyverse)

list.files(
  path = sprintf("/data/github/dmca/%s", 2011:2018), 
  pattern = "\\.md$|\\.markdown$",
  full.names = TRUE
) -> dmca_files

As noted previously, we’re going to focus on DMCA views over time, look at organizations who filed DMCA notices and the notice content. It turns out the filenames also distinguish whether a notice is a takedown request or a counter-notice (i.e. an “oops…my bad…” by a takedown originator) or a retraction, so we’ll collect that metadata as well. Finally, we’ll slurp up the text along the way.

Again, I’ve taken a pass at this and found out the following:

  • Some dates are coded incorrectly (infrequently enough to be able to use some causal rules to fix)
  • Some org names are coded (often enough to skew counts, so we need to deal with it)
  • Counter-notice and retraction tags are inconsistent, so we need to deal with that as well

It’s an ugly pipeline, so I’ve annotated these initial steps to make what’s going on a bit clearer:

map_df(dmca_files, ~{
  
  file_path_sans_ext(.x) %>% # remove extension
    basename() %>% # get just the filename
    stri_match_all_regex(
      "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{1,2})-(.*)" # try to find the date and the org
    ) %>% 
    unlist() -> date_org
  
  if (is.na(date_org[2])) { # handle a special case where the date pattern above didn't work
    file_path_sans_ext(.x) %>% 
      basename() %>%
      stri_match_all_regex(
        "([[:digit:]]{4}-[[:digit:]]{2})-(.*)"
      ) %>% 
      unlist() -> date_org
  }
  
  # a few files are still broken so we'll deal with them as special cases
  
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    date_org <- c("", "2017-11-06", "1776")
  } else if (stri_detect_fixed(.x, "2017/2017-Offensive-Security-7.md")) {
    date_org <- c("", "2017-12-30", "Offensive-Security-7")
  } else if (stri_detect_fixed(.x, "2017/Offensive-Security-6.md")) {
    date_org <- c("", "2017-12-29", "Offensive-Security-6")
  }
  
  # we used a somewhat liberal regex to capture dates since some are 
  # still broken. We'll deal with those first, then turn them
  # into proper Date objects
  
  list(
    notice_day = case_when(
      date_org[2] == "2015-12-3"  ~ "2015-12-03",
      date_org[2] == "2015-12-7"  ~ "2015-12-07",
      date_org[2] == "2016-08"    ~ "2016-08-01",
      date_org[2] == "2016-10-7"  ~ "2016-10-07",
      date_org[2] == "2016-11-1"  ~ "2016-11-01",
      date_org[2] == "2016-11-3"  ~ "2016-11-03",
      date_org[2] == "2017-06"    ~ "2017-06-01",
      date_org[2] == "0107-05-22" ~ "2017-05-22",
      date_org[2] == "2017-11-1"  ~ "2017-11-01",
      TRUE ~ date_org[2]
    ) %>% 
      lubridate::ymd(),
    notice_org = date_org[3] %>% # somtimes the org name is messed up so we need to clean it up
      stri_replace_last_regex("[-]*[[:digit:]]+$", "") %>% 
      stri_replace_all_fixed("-", " "),
    notice_content = list(read_lines(.x)) # grab the content
  ) -> ret
  
  # and there are still some broken org names
  if (stri_detect_fixed(.x, "2017/2017-11-06-1776.md")) {
    ret$notice_org <- "1776"
  } 
  
  ret
  
}) -> dmca

dmca
## # A tibble: 4,460 x 3
##    notice_day notice_org                   notice_content
##                                         
##  1 2011-01-27 sony                             
##  2 2011-01-28 tera                             
##  3 2011-01-31 sony                             
##  4 2011-02-03 sony counternotice                
##  5 2011-02-03 sony                          
##  6 2011-03-24 oracle                            
##  7 2011-03-30 mentor graphics                  
##  8 2011-05-24 cpp virtual world operations     
##  9 2011-06-07 sony                             
## 10 2011-06-13 diablominer                      
## # ... with 4,450 more rows

Much better. We’ve got more deck-swabbing to do, now, to tag the counter-notice and retractions:

mutate(
  dmca,
  counter_notice = stri_detect_fixed(notice_org, "counternotice|counter notice"), # handle inconsistency
  retraction = stri_detect_fixed(notice_org, "retraction"), 
  notice_org = stri_trans_tolower(notice_org) %>% 
    stri_replace_first_regex("\ *(counternotice|counter notice)\ *", "") %>% # clean up org names with tags
    stri_replace_first_regex("\ *retraction\ *", "")
) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##                                              
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

I’ve lower-cased the org names to make it easier to wrangle them since we do, indeed, need to wrangle them.

I’m super-not-proud of the following code block, but I went into it thinking the org name corrections would be infrequent. But, as I worked with the supposedly-cleaned data, I kept adding correction rules and eventually created a monster:

mutate(
  dmca,
  notice_org = case_when(
    stri_detect_fixed(notice_org, "accenture")        ~ "accenture",
    stri_detect_fixed(notice_org, "adobe")            ~ "adobe",
    stri_detect_fixed(notice_org, "amazon")           ~ "amazon",
    stri_detect_fixed(notice_org, "ansible")          ~ "ansible",
    stri_detect_fixed(notice_org, "aspengrove")       ~ "aspengrove",
    stri_detect_fixed(notice_org, "apple")            ~ "apple",
    stri_detect_fixed(notice_org, "aws")              ~ "aws",
    stri_detect_fixed(notice_org, "blizzard")         ~ "blizzard",
    stri_detect_fixed(notice_org, "o reilly")         ~ "oreilly",
    stri_detect_fixed(notice_org, "random")           ~ "random house",
    stri_detect_fixed(notice_org, "casado")           ~ "casadocodigo",
    stri_detect_fixed(notice_org, "ccp")              ~ "ccp",
    stri_detect_fixed(notice_org, "cisco")            ~ "cisco",
    stri_detect_fixed(notice_org, "cloudsixteen")     ~ "cloud sixteen",
    stri_detect_fixed(notice_org, "collinsharper")    ~ "collins ’harper",
    stri_detect_fixed(notice_org, "contentanalytics") ~ "content analytics",
    stri_detect_fixed(notice_org, "packt")            ~ "packt",
    stri_detect_fixed(notice_org, "penguin")          ~ "penguin",
    stri_detect_fixed(notice_org, "wiley")            ~ "wiley",
    stri_detect_fixed(notice_org, "wind river")       ~ "windriver",
    stri_detect_fixed(notice_org, "windriver")        ~ "windriver",
    stri_detect_fixed(notice_org, "wireframe")        ~ "wireframe shader",
    stri_detect_fixed(notice_org, "listen")           ~ "listen",
    stri_detect_fixed(notice_org, "wpecommerce")      ~ "wpecommerce",
    stri_detect_fixed(notice_org, "yahoo")            ~ "yahoo",
    stri_detect_fixed(notice_org, "youtube")          ~ "youtube",
    stri_detect_fixed(notice_org, "x pressive")       ~ "xpressive",
    stri_detect_fixed(notice_org, "ximalaya")         ~ "ximalaya",
    stri_detect_fixed(notice_org, "pragmatic")        ~ "pragmatic",
    stri_detect_fixed(notice_org, "evadeee")          ~ "evadeee",
    stri_detect_fixed(notice_org, "iaai")             ~ "iaai",
    stri_detect_fixed(notice_org, "line corp")        ~ "line corporation",
    stri_detect_fixed(notice_org, "mediumrare")       ~ "medium rare",
    stri_detect_fixed(notice_org, "profittrailer")    ~ "profit trailer",
    stri_detect_fixed(notice_org, "smartadmin")       ~ "smart admin",
    stri_detect_fixed(notice_org, "microsoft")        ~ "microsoft",
    stri_detect_fixed(notice_org, "monotype")         ~ "monotype",
    stri_detect_fixed(notice_org, "qualcomm")         ~ "qualcomm",
    stri_detect_fixed(notice_org, "pearson")          ~ "pearson",
    stri_detect_fixed(notice_org, "sony")             ~ "sony",
    stri_detect_fixed(notice_org, "oxford")           ~ "oxford",
    stri_detect_fixed(notice_org, "oracle")           ~ "oracle",
    stri_detect_fixed(notice_org, "out fit")          ~ "outfit",
    stri_detect_fixed(notice_org, "nihon")            ~ "nihon",
    stri_detect_fixed(notice_org, "opencv")           ~ "opencv",
    stri_detect_fixed(notice_org, "newsis")           ~ "newsis",
    stri_detect_fixed(notice_org, "nostarch")         ~ "nostarch",
    stri_detect_fixed(notice_org, "stardog")          ~ "stardog",
    stri_detect_fixed(notice_org, "mswindows")        ~ "microsoft",
    stri_detect_fixed(notice_org, "moody")            ~ "moody",
    stri_detect_fixed(notice_org, "minecraft")        ~ "minecraft",
    stri_detect_fixed(notice_org, "medinasoftware")   ~ "medina software",
    stri_detect_fixed(notice_org, "linecorporation")  ~ "line corporation",
    stri_detect_fixed(notice_org, "steroarts")        ~ "stereoarts",
    stri_detect_fixed(notice_org, "mathworks")        ~ "mathworks",
    stri_detect_fixed(notice_org, "tmssoftware")      ~ "tmssoftware",
    stri_detect_fixed(notice_org, "toontown")         ~ "toontown",
    stri_detect_fixed(notice_org, "wahoo")            ~ "wahoo",
    stri_detect_fixed(notice_org, "webkul")           ~ "webkul",
    stri_detect_fixed(notice_org, "whmcs")            ~ "whmcs",
    stri_detect_fixed(notice_org, "viber")            ~ "viber",
    stri_detect_fixed(notice_org, "totalfree")        ~ "totalfreedom",
    stri_detect_fixed(notice_org, "successacademies") ~ "success academies",
    stri_detect_fixed(notice_org, "ecgwaves")         ~ "ecgwaves",
    stri_detect_fixed(notice_org, "synology")         ~ "synology",
    stri_detect_fixed(notice_org, "infistar")         ~ "infistar’",
    stri_detect_fixed(notice_org, "galleria")         ~ "galleria",
    stri_detect_fixed(notice_org, "jadoo")            ~ "jadoo",
    stri_detect_fixed(notice_org, "dofustouch")       ~ "dofus touch",
    stri_detect_fixed(notice_org, "gravityforms")     ~ "gravity forms",
    stri_detect_fixed(notice_org, "fujiannewland")    ~ "fujian newland",
    stri_detect_fixed(notice_org, "dk uk")            ~ "dk",
    stri_detect_fixed(notice_org, "dk us")            ~ "dk",
    stri_detect_fixed(notice_org, "dkuk")             ~ "dk",
    stri_detect_fixed(notice_org, "dkus")             ~ "dk",
    stri_detect_fixed(notice_org, "facet")            ~ "facet",
    stri_detect_fixed(notice_org, "fh admin")         ~ "fhadmin",
    stri_detect_fixed(notice_org, "electronicarts")   ~ "electronic arts",
    stri_detect_fixed(notice_org, "daikonforge")      ~ "daikon forge",
    stri_detect_fixed(notice_org, "corgiengine")      ~ "corgi engine",
    stri_detect_fixed(notice_org, "epicgames")        ~ "epic  games",
    stri_detect_fixed(notice_org, "essentialmode")    ~ "essentialmode",
    stri_detect_fixed(notice_org, "jetbrains")        ~ "jetbrains",
    stri_detect_fixed(notice_org, "foxy")             ~ "foxy themes",
    stri_detect_fixed(notice_org, "cambridgemobile")  ~ "cambridge mobile",
    stri_detect_fixed(notice_org, "offensive")        ~ "offensive security",
    stri_detect_fixed(notice_org, "outfit")           ~ "outfit",
    stri_detect_fixed(notice_org, "haihuan")          ~ "shanghai haihuan",
    stri_detect_fixed(notice_org, "schuster")         ~ "simon & schuster",
    stri_detect_fixed(notice_org, "silicon")          ~ "silicon labs",
    TRUE ~ notice_org
  )) %>% 
  arrange(notice_day) -> dmca

dmca
## # A tibble: 4,460 x 5
##    notice_day notice_org        notice_content counter_notice retraction
##                                              
##  1 2011-01-27 sony                   FALSE          FALSE     
##  2 2011-01-28 tera                   FALSE          FALSE     
##  3 2011-01-31 sony                   FALSE          FALSE     
##  4 2011-02-03 sony                    FALSE          FALSE     
##  5 2011-02-03 sony                FALSE          FALSE     
##  6 2011-03-24 oracle                  FALSE          FALSE     
##  7 2011-03-30 mentor graphics        FALSE          FALSE     
##  8 2011-05-24 cpp virtual worl…      FALSE          FALSE     
##  9 2011-06-07 sony                   FALSE          FALSE     
## 10 2011-06-13 diablominer            FALSE          FALSE     
## # ... with 4,450 more rows

You are heartily encouraged to create a translation table in place of that monstrosity.

But, we finally have usable data. You can avoid the above by downloading https://rud.is/dl/github-dmca.json.gz and using jsonlite::stream_in() or ndjson::stream_in() to get the above data frame.

Hoisting the mizzen sailplots

Let’s see what the notice submission frequency looks like over time:

# assuming you downloaded it as suggested
jsonlite::stream_in(gzfile("~/Data/github-dmca.json.gz")) %>% 
  tbl_df() %>% 
  mutate(notice_day = as.Date(notice_day)) -> dmca

filter(dmca, !retraction) %>% 
  mutate(
    notice_year = lubridate::year(notice_day),
    notice_ym = as.Date(format(notice_day, "%Y-%m-01"))
  ) %>% 
  dplyr::count(notice_ym) %>% 
  arrange(notice_ym) %>% 
  ggplot(aes(notice_ym, n)) +
  ggalt::stat_xspline(
    geom="area", fill=alpha(ft_cols$blue, 1/3), color=ft_cols$blue
  ) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Notices", 
    title = "GitHub DMCA Notices by Month Since 2011"
  ) +
  theme_ft_rc(grid="XY")

I’m not naive, but that growth was a bit of a shocker, which made want to jump in and see who the top-filers were:

count(dmca, notice_org, sort=TRUE)
## # A tibble: 1,948 x 2
##    notice_org             n
##                  
##  1 webkul                92
##  2 pearson               90
##  3 stereoarts            86
##  4 qualcomm              72
##  5 codility              71
##  6 random house          62
##  7 outfit                57
##  8 offensive security    49
##  9 sensetime             46
## 10 penguin               44
## # ... with 1,938 more rows

“Webkul” is an enterprise eCommerce (I kinda miss all the dashed “e-” prefixes we used to use back in the day) platform. I mention that since I didn’t know what it was either. There are some recognizable names there like “Pearson” and “Random House” and “Penguin” which make sense since it’s easy to share improperly share e-books (modern non-dashed idioms be darned).

Let’s see the top 15 orgs by year since 2015 (since that’s when DMCA filings really started picking up and because I like 2×2 grids). We’ll also leave out counter-notices and retractions and alpha-order it since I want to be able to scan the names more than I want to see rank:

filter(dmca, !retraction, !counter_notice, notice_day >= as.Date("2015-01-01")) %>%
  mutate(
    notice_year = lubridate::year(notice_day),
  ) %>% 
  dplyr::count(notice_year, notice_org) %>% 
  group_by(notice_year) %>% 
  top_n(15) %>% 
  slice(1:15) %>% 
  dplyr::ungroup() %>%
  mutate( # a-z order with "a" on top 
    notice_org = factor(notice_org, levels = unique(sort(notice_org, decreasing = TRUE)))
  ) %>% 
  ggplot(aes(n, notice_org, xend=0, yend=notice_org)) +
  geom_segment(size = 2, color = ft_cols$peach) +
  facet_wrap(~notice_year, scales = "free") +
  scale_x_comma(limits=c(0, 60)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 15 GitHub DMCA Filers by Year Since 2015"
  ) +
  theme_ft_rc(grid="X")

Let’s look at rogues’ gallery of the pirates themselves:

dmca %>% 
  mutate(
    ghusers = notice_content %>% 
      map(~{
        stri_match_all_regex(.x, "http[s]*://github.com/([^/]+)/.*") %>% 
          discard(~is.na(.x[,1])) %>% 
          map_chr(~.x[,2]) %>% 
          unique() %>% 
          discard(`==`, "github") %>% 
          discard(~grepl(" ", .x))
      })
  ) %>% 
  unnest(ghusers) %>% 
  dplyr::count(ghusers, sort=TRUE) %>% 
  print() -> offenders
## # A tibble: 18,396 x 2
##    ghusers           n
##             
##  1 RyanTech         16
##  2 sdgdsffdsfff     12
##  3 gamamaru6005     10
##  4 ranrolls         10
##  5 web-padawan      10
##  6 alexinfopruna     8
##  7 cyr2242           8
##  8 liveqmock         8
##  9 promosirupiah     8
## 10 RandyMcMillan     8
## # ... with 18,386 more rows

As you might expect, most users have only 1 or two complaints filed against them since it was likely an oversight more than malice on their part:

ggplot(offenders, aes(x="", n)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(breaks=1:16, limits=c(1,16)) +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of GitHub DMCA Complaints Received by a User"
  ) +
  theme_ft_rc(grid="X")

But, there are hundreds of digital buccaneers, and we can have a bit of fun with them especially since I noticed quite a few had default (generated) avatars with lots of white in them (presenting this with a pirate hat-tip to Maëlle & Lucy):

library(magick)

dir.create("gh-pirates")
dir.create("gh-pirates-jpeg")

# this kinda spoils the surprise; i should have renamed it
download.file("https://rud.is/dl/jolly-roger.jpeg", "jolly-roger.jpeg")

ghs <- safely(gh::gh) # no need to add cruft to our namespace for one function 

filter(offenders, n>2) %>% 
  pull(ghusers) %>% 
  { .pb <<- progress_estimated(length(.)); . } %>% # there are a few hundred of them
  walk(~{
    .pb$tick()$print()
    user <- ghs(sprintf("/users/%s", .x))$result # the get-user and then download avatar idiom shld help us not bust GH API rate limits
    if (!is.null(user)) {
      download.file(user$avatar_url, file.path("gh-pirates", .x), quiet=TRUE) # can't assume avatar file type
    }
  })

# we'll convert them all to jpeg and resize them at the same time plus make sure they aren't greyscale
dir.create("gh-pirates-jpeg")
list.files("gh-pirates", full.names = TRUE, recursive = FALSE) %>%
  walk(~{
    image_read(.x) %>% 
      image_scale("72x72") %>% 
      image_convert("jpeg", type = "TrueColor", colorspace = "rgb") %>% 
      image_write(
        path = file.path("gh-pirates-jpeg", sprintf("%s.jpeg", basename(.x))), 
        format = "jpeg"
      )
  })

set.seed(20180919) # seemed appropriate for TLAPD
RsimMosaic::composeMosaicFromImageRandomOptim( # this takes a bit
  originalImageFileName = "jolly-roger.jpeg",
  outputImageFileName = "gh-pirates-flag.jpeg",
  imagesToUseInMosaic = "gh-pirates-jpeg",
  removeTiles = TRUE,
  fracLibSizeThreshold = 0.1
)

Finally, we’ll look at the types of pilfered files. To do that, we’ll first naively look for github repo URLs (there are github.io ones in there too, though, which is an exercise left to ye corsairs):

mutate(
  dmca,
  files = notice_content %>% 
    map(~{
      paste0(.x, collapse = " ") %>% 
        stri_extract_all_regex(gh_url_pattern, omit_no_match=FALSE, opts_regex = stri_opts_regex(TRUE)) %>% 
        unlist() %>% 
        stri_replace_last_regex("[[:punct:]]+$", "")
    })
) -> dmca_with_files

Now, we can see just how many resources/repos/files are in a complaint:

filter(dmca_with_files, map_lgl(files, ~!is.na(.x[1]))) %>% 
  select(notice_day, notice_org, files) %>% 
  mutate(num_refs = lengths(files)) %>%
  arrange(desc(num_refs)) %>%  # take a peek at the heavy hitters
  print() -> files_with_counts
## # A tibble: 4,020 x 4
##    notice_day notice_org files         num_refs
##                          
##  1 2014-08-27 monotype        2504
##  2 2011-02-03 sony            1160
##  3 2016-06-08 monotype        1015
##  4 2018-04-05 hexrays            906
##  5 2016-06-15 ibo                877
##  6 2016-08-18 jetbrains          777
##  7 2017-10-14 cengage            611
##  8 2016-08-23 yahoo              556
##  9 2017-08-30 altis              529
## 10 2015-09-22 jetbrains          468
## # ... with 4,010 more rows

ggplot(files_with_counts, aes(x="", num_refs)) +
  ggbeeswarm::geom_quasirandom(
    color = ft_cols$white, fill = alpha(ft_cols$red, 1/10),
    shape = 21, size = 3, stroke = 0.125
  ) +
  scale_y_comma(trans="log10") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "Distribution of the Number of Files/Repos per-GitHub DMCA Complaint",
    caption = "Note: Log10 Scale"
  ) +
  theme_ft_rc(grid="X")

And, what are the most offensive file types (per-year):

mutate(
  files_with_counts, 
  extensions = map(files, ~tools::file_ext(.x) %>% 
    discard(`==` , "")
  )
) %>% 
  select(notice_day, notice_org, extensions) %>% 
  unnest(extensions) %>% 
  mutate(year = lubridate::year(notice_day)) -> file_types

count(file_types, year, extensions) %>% 
  filter(year >= 2014) %>% 
  group_by(year) %>% 
  top_n(10) %>% 
  slice(1:10) %>% 
  ungroup() %>% 
  ggplot(aes(year, n)) +
  ggrepel::geom_text_repel(
    aes(label = extensions, size=n), 
    color = ft_cols$green, family=font_ps, show.legend=FALSE
  ) +
  scale_size(range = c(3, 10)) +
  labs(
    x = NULL, y = NULL,
    title = "Top 10 File-type GitHub DMCA Takedowns Per-year"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y=element_blank())

It’s not all code (lots of fonts and books) but there are plenty of source code files in those annual lists.

FIN

That’s it for this year’s TLAPD post. You’ve got the data and some starter code so build away! There are plenty more insights left to find and if you do take a stab at finding your own treasure, definitely leave a note in the comments.



rud.is

toolsmith #133 – Anomaly Detection & Threat Hunting with Anomalize

When, in October and November's toolsmith posts, I redefined DFIR under the premise of Deeper Functionality for Investigators in R, I discovered a "tip of the iceberg" scenario. To that end, I'd like to revisit the concept with an additional discovery and opportunity. In reality, this is really a case of DFIR (Deeper Functionality for Investigators in R) within the general practice of the original and paramount DFIR (Digital Forensics/Incident Response).
As discussed here before, those of us in the DFIR practice, and Blue Teaming at large, are overwhelmed by data and scale. Success truly requires algorithmic methods. If you're not already invested here I have an immediately applicable case study for you in tidy anomaly detection with anomalize.
First, let me give credit where entirely due for the work that follows. Everything I discuss and provide is immediately derivative from Business Science (@bizScienc), specifically Matt Dancho (@mdancho84). He created anomalize, "a tidy anomaly detection algorithm that’s time-based (built on top of tibbletime) and scalable from one to many time series," when a client asked Business Science to build an open source anomaly detection algorithm that suited their needs. I'd say he responded beautifully, when his blogpost hit my radar via R-Bloggers it lived as an open tab in my browser for more than a month until generating this toolsmith. Please consider Matt's post a mandatory read as step one of the process here. I'll quote Matt specifically before shifting context: "Our client had a challenging problem: detecting anomalies in time series on daily or weekly data at scale. Anomalies indicate exceptional events, which could be increased web traffic in the marketing domain or a malfunctioning server in the IT domain. Regardless, it’s important to flag these unusual occurrences to ensure the business is running smoothly. One of the challenges was that the client deals with not one time series but thousands that need to be analyzed for these extreme events."
Key takeaway: Detecting anomalies in time series on daily or weekly data at scale. Anomalies indicate exceptional events.
Now shift context with me to security-specific events and incidents, as the pertain to security monitoring, incident response, and threat hunting. In my November 2017 post, recall that I discussed Time Series Regression with the Holt-Winters method and a focus on seasonality and trends. Unfortunately, I couldn't share the code for how we applied TSR, but pointed out alternate methods, including Seasonal and Trend Decomposition using Loess (STL):
  • Handles any type of seasonality ~ can change over time
  • Smoothness of the trend-cycle can also be controlled by the user
  • Robust to outliers
Here now, Matt has created a means to immediately apply the STL method, along with the Twitter method (reference page), as part of his time_decompose() function, one of three functions specific to the anomalize package. In addition to time_decompose(), which separates the time series into seasonal, trend, and remainder components, anomalize includes:
  • anomalize(): Applies anomaly detection methods to the remainder component.
  • time_recompose(): Calculates limits that separate the “normal” data from the anomalies
The methods used in anomalize(), including IQR and GESD are described in Matt's reference page. Matt ultimately set out to build a scalable adaptation of Twitter's AnomalyDetection package in order to address his client's challenges in dealing with not one time series but thousands needing to be analyzed for extreme events. You'll note that Matt describes anomalize using a dataset of the daily download counts of  the 15 tidyverse packages from CRAN, relevant as he leverages the tidyverse package. I initially toyed with tweaking Matt's demo to model downloads for security-specific R packages (yes, there are such things) from CRAN, including RAppArmor, net.security, securitytxt, and cymruservices, the latter two courtesy of Bob Rudis (@hrbrmstr) of our beloved Data-Driven Security: Analysis, Visualization and Dashboards. Alas, this was a mere rip and replace, and really didn't exhibit the use of anomalize in a deserving, varied, truly security-specific context. That said, I was able to generate immediate results doing so, as seen in Figure 1

Figure 1: Initial experiment
As an initial experiment you can replace packages names with those of your choosing in tidyverse_cran_downloads.R, run it in R Studio, then tweak variable names and labels in the code per Matt's README page.  
I wanted to run anomalize against a real security data scenario, so I went back to the dataset from the original DFIR articles where I'd utilized counts of 4624 Event IDs per day, per user, on a given set of servers. As utilized originally, I'd represented results specific to only one device and user, but herein is the beauty of anomalize. We can achieve quick results across multiple times series (multiple systems/users). This premise is but one of many where time series analysis and seasonality can be applied to security data.
I originally tried to write log data from log.csv straight to an anomalize.R script with logs = read_csv("log.csv") into a tibble (ready your troubles with tibbles jokes), which was not being parsed accurately, particularly time attributes. To correct this, from Matt's Github I grabbed tidyverse_cran_downloads.R, and modified it as follows:
This helped greatly thanks to the tibbletime package, which is "is an extension that allows for the creation of time aware tibbles. Some immediate advantages of this include: the ability to perform time based subsetting on tibbles, quickly summarising and aggregating results by time periods. Guess what, Matt wrote tibbletime too. :-)
I then followed Matt's sequence as he posted on Business Science, but with my logs defined as a function in Security_Access_Logs_Function.R. Following, I'll give you the code snippets, as revised from Matt's examples, followed by their respective results specific to processing my Event ID 4624 daily count log.
First, let's summarize daily login counts across three servers over four months.
The result is evident in Figure 2.

Figure 2: Server logon counts visualized
Next, let's determine which daily download logons are anomalous with Matt's three main functions, time_decompose(), anomalize(), and time_recompose(), along with the visualization function, plot_anomalies(), across the same three servers over four months.
The result is revealed in Figure 3.

Figure 3: Security event log anomalies
Following Matt's method using Twitter’s AnomalyDetection package, combining time_decompose(method = "twitter") with anomalize(method = "gesd"), while adjusting the trend = "4 months" to adjust median spans, we'll focus only on SERVER-549521.
In Figure 4, you'll note that there are anomalous logon counts on SERVER-549521 in June.
Figure 4: SERVER-549521 logon anomalies with Twitter & GESD methods
We can compare the Twitter (time_decompose) and GESD (anomalize) methods with the STL (time_decompose) and IQR (anomalize) methods, which use different decomposition and anomaly detection approaches.
Again, we note anomalies in June, as seen in Figure 5.
Figure 5: SERVER-549521 logon anomalies with STL & IQR methods
Obviously, the results are quite similar, as one would hope. Finally, let use Matt's plot_anomaly_decomposition() for visualizing the inner workings of how algorithm detects anomalies in the remainder for SERVER-549521.
The result is a four part visualization, including observed, season, trend, and remainder as seen in Figure 6.
Figure 6: Decomposition for SERVER-549521 Logins
I'm really looking forward to putting these methods to use at a much larger scale, across a far broader event log dataset. I firmly assert that blue teams are already way behind in combating automated adversary tactics and problems of sheer scale, so...much...data. It's only with tactics such as Matt's anomalize, and others of its ilk, that defenders can hope to succeed. Be sure the watch Matt's YouTube video on anomalize, Business Science is building a series of videos in addition, so keep an eye out there and on their GitHub for more great work that we can apply a blue team/defender's context to.
All the code snippets are in my GitHubGist here, and the sample log file, a single R script, and a Jupyter  Notebook are all available for you on my GitHub under toolsmith_r. I hope you find anomalize as exciting and useful as I have, great work by Matt, looking forward to see what's next from Business Science.
Cheers...until next time.

toolsmith #129 – DFIR Redefined: Deeper Functionality for Investigators with R – Part 2

You can have data without information, but you cannot have information without data. ~Daniel Keys Moran

Here we resume our discussion of DFIR Redefined: Deeper Functionality for Investigators with R as begun in Part 1.
First, now that my presentation season has wrapped up, I've posted the related material on the Github for this content. I've specifically posted the most recent version as presented at SecureWorld Seattle, which included Eric Kapfhammer's contributions and a bit of his forward thinking for next steps in this approach.
When we left off last month I parted company with you in the middle of an explanation of analysis of emotional valence, or the "the intrinsic attractiveness (positive valence) or averseness (negative valence) of an event, object, or situation", using R and the Twitter API. It's probably worth your time to go back and refresh with the end of Part 1. Our last discussion point was specific to the popularity of negative tweets versus positive tweets with a cluster of emotionally neutral retweets, two positive retweets, and a load of negative retweets. This type of analysis can quickly give us better understanding of an attacker collective's sentiment, particularly where the collective is vocal via social media. Teeing off the popularity of negative versus positive sentiment, we can assess the actual words fueling such sentiment analysis. It doesn't take us much R code to achieve our goal using the apply family of functions. The likes of apply, lapply, and sapply allow you to manipulate slices of data from matrices, arrays, lists and data frames in a repetitive way without having to use loops. We use code here directly from Michael Levy, Social Scientist, and his Playing with Twitter Data post.

polWordTables = 
  sapply(pol, function(p) {
    words = c(positiveWords = paste(p[[1]]$pos.words[[1]], collapse = ' '), 
              negativeWords = paste(p[[1]]$neg.words[[1]], collapse = ' '))
    gsub('-', '', words)  # Get rid of nothing found's "-"
  }) %>%
  apply(1, paste, collapse = ' ') %>% 
  stripWhitespace() %>% 
  strsplit(' ') %>%
  sapply(table)

par(mfrow = c(1, 2))
invisible(
  lapply(1:2, function(i) {
    dotchart(sort(polWordTables[[i]]), cex = .5)
    mtext(names(polWordTables)[i])
  }))

The result is a tidy visual representation of exactly what we learned at the end of Part 1, results as noted in Figure 1.

Figure 1: Positive vs negative words
Content including words such as killed, dangerous, infected, and attacks are definitely more interesting to readers than words such as good and clean. Sentiment like this could definitely be used to assess potential attacker outcomes and behaviors just prior, or in the midst of an attack, particularly in DDoS scenarios. Couple sentiment analysis with the ability to visualize networks of retweets and mentions, and you could zoom in on potential leaders or organizers. The larger the network node, the more retweets, as seen in Figure 2.

Figure 2: Who is retweeting who?
Remember our initial premise, as described in Part 1, was that attacker groups often use associated hashtags and handles, and the minions that want to be "part of" often retweet and use the hashtag(s). Individual attackers either freely give themselves away, or often become easily identifiable or associated, via Twitter. Note that our dominant retweets are for @joe4security, @HackRead,  @defendmalware (not actual attackers, but bloggers talking about attacks, used here for example's sake). Figure 3 shows us who is mentioning who.

Figure 3: Who is mentioning who?
Note that @defendmalware mentions @HackRead. If these were actual attackers it would not be unreasonable to imagine a possible relationship between Twitter accounts that are actively retweeting and mentioning each other before or during an attack. Now let's assume @HackRead might be a possible suspect and you'd like to learn a bit more about possible additional suspects. In reality @HackRead HQ is in Milan, Italy. Perhaps Milan then might be a location for other attackers. I can feed  in Twittter handles from my retweet and mentions network above, query the Twitter API with very specific geocode, and lock it within five miles of the center of Milan.
The results are immediate per Figure 4.

Figure 4: GeoLocation code and results
Obviously, as these Twitter accounts aren't actual attackers, their retweets aren't actually pertinent to our presumed attack scenario, but they definitely retweeted @computerweekly (seen in retweets and mentions) from within five miles of the center of Milan. If @HackRead were the leader of an organization, and we believed that associates were assumed to be within geographical proximity, geolocation via the Twitter API could be quite useful. Again, these are all used as thematic examples, no actual attacks should be related to any of these accounts in any way.

Fast Frugal Trees (decision trees) for prioritizing criticality

With the abundance of data, and often subjective or biased analysis, there are occasions where a quick, authoritative decision can be quite beneficial. Fast-and-frugal trees (FFTs) to the rescue. FFTs are simple algorithms that facilitate efficient and accurate decisions based on limited information.
Nathaniel D. Phillips, PhD created FFTrees for R to allow anyone to easily create, visualize and evaluate FFTs. Malcolm Gladwell has said that "we are suspicious of rapid cognition. We live in a world that assumes that the quality of a decision is directly related to the time and effort that went into making it.” FFTs, and decision trees at large, counter that premise and aid in the timely, efficient processing of data with the intent of a quick but sound decision. As with so much of information security, there is often a direct correlation with medical, psychological, and social sciences, and the use of FFTs is no different. Often, predictive analysis is conducted with logistic regression, used to "describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables." Would you prefer logistic regression or FFTs?

Figure 5: Thanks, I'll take FFTs
Here's a text book information security scenario, often rife with subjectivity and bias. After a breach, and subsequent third party risk assessment that generated a ton of CVSS data, make a fast decision about what treatments to apply first. Because everyone loves CVSS.

Figure 6: CVSS meh
Nothing like a massive table, scored by base, impact, exploitability, temporal, environmental, modified impact, and overall scores, all assessed by a third party assessor who may not fully understand the complexities or nuances of your environment. Let's say our esteemed assessor has decided that there are 683 total findings, of which 444 are non-critical and 239 are critical. Will FFTrees agree? Nay! First, a wee bit of R code.

library("FFTrees")
cvss <- c:="" coding="" csv="" p="" r="" read.csv="" rees="">cvss.fft <- data="cvss)</p" fftrees="" formula="critical">plot(cvss.fft, what = "cues")
plot(cvss.fft,
     main = "CVSS FFT",
     decision.names = c("Non-Critical", "Critical"))


Guess what, the model landed right on impact and exploitability as the most important inputs, and not just because it's logically so, but because of their position when assessed for where they fall in the area under the curve (AUC), where the specific curve is the receiver operating characteristic (ROC). The ROC is a "graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied." As for the AUC, accuracy is measured by the area under the ROC curve where an area of 1 represents a perfect test and an area of .5 represents a worthless test. Simply, the closer to 1, the better. For this model and data, impact and exploitability are the most accurate as seen in Figure 7.

Figure 7: Cue rankings prefer impact and exploitability
The fast and frugal tree made its decision where impact and exploitability with scores equal or less than 2 were non-critical and exploitability greater than 2 was labeled critical, as seen in Figure 8.

Figure 8: The FFT decides
Ah hah! Our FFT sees things differently than our assessor. With a 93% average for performance fitting (this is good), our tree, making decisions on impact and exploitability, decides that there are 444 non-critical findings and 222 critical findings, a 17 point differential from our assessor. Can we all agree that mitigating and remediating critical findings can be an expensive proposition? If you, with just a modicum of data science, can make an authoritative decision that saves you time and money without adversely impacting your security posture, would you count it as a win? Yes, that was rhetorical.

Note that the FFTrees function automatically builds several versions of the same general tree that make different error trade-offs with variations in performance fitting and false positives. This gives you the option to test variables and make potentially even more informed decisions within the construct of one model. Ultimately, fast frugal trees make very fast decisions on 1 to 5 pieces of information and ignore all other information. In other words, "FFTrees are noncompensatory, once they make a decision based on a few pieces of information, no additional information changes the decision."

Finally, let's take a look at monitoring user logon anomalies in high volume environments with Time Series Regression (TSR). Much of this work comes courtesy of Eric Kapfhammer, our lead data scientist on our Microsoft Windows and Devices Group Blue Team. The ideal Windows Event ID for such activity is clearly 4624: an account was successfully logged on. This event is typically one of the top 5 events in terms of volume in most environments, and has multiple type codes including Network, Service, and RemoteInteractive.
User accounts will begin to show patterns over time, in aggregate, including:
  • Seasonality: day of week, patch cycles, 
  • Trend: volume of logons increasing/decreasing over time
  • Noise: randomness
You could look at 4624 with a Z-score model, which sets a threshold based on the number of standard deviations away from an average count over a given period of time, but this is a fairly simple model. The higher the value, the greater the degree of “anomalousness”.
Preferably, via Time Series Regression (TSR), your feature set is more rich:
  • Statistical method for predicting a future response based on the response history (known as autoregressive dynamics) and the transfer of dynamics from relevant predictors
  • Understand and predict the behavior of dynamic systems from experimental or observational data
  • Commonly used for modeling and forecasting of economic, financial and biological systems
How to spot the anomaly in a sea of logon data?
Let's imagine our user, DARPA-549521, in the SUPERSECURE domain, with 90 days of aggregate 4624 Type 10 events by day.

Figure 9: User logon data
With 210 line of R, including comments, log read, file output, and graphing we can visualize and alert on DARPA-549521's data as seen in Figure 10

Figure 10: User behavior outside the confidence interval
We can detect when a user’s account exhibits  changes in their seasonality as it relates to a confidence interval established (learned) over time. In this case, on 27 AUG 2017, the user topped her threshold of 19 logons thus triggering an exception. Now imagine using this model to spot anomalous user behavior across all users and you get a good feel for the model's power.
Eric points out that there are, of course, additional options for modeling including:
  • Seasonal and Trend Decomposition using Loess (STL)
    • Handles any type of seasonality ~ can change over time
    • Smoothness of the trend-cycle can also be controlled by the user
    • Robust to outliers
  • Classification and Regression Trees (CART)
    • Supervised learning approach: teach trees to classify anomaly / non-anomaly
    • Unsupervised learning approach: focus on top-day hold-out and error check
  • Neural Networks
    • LSTM / Multiple time series in combination
These are powerful next steps in your capabilities, I want you to be brave, be creative, go forth and add elements of data science and visualization to your practice. R and Python are well supported and broadly used for this mission and can definitely help you detect attackers faster, contain incidents more rapidly, and enhance your in-house detection and remediation mechanisms.
All the code as I can share is here; sorry, I can only share the TSR example without the source.
All the best in your endeavors!
Cheers...until next time.

toolsmith #128 – DFIR Redefined: Deeper Functionality for Investigators with R – Part 1

“To competently perform rectifying security service, two critical incident response elements are necessary: information and organization.” ~ Robert E. Davis

I've been presenting DFIR Redefined: Deeper Functionality for Investigators with R across the country at various conference venues and thought it would helpful to provide details for readers.
The basic premise?
Incident responders and investigators need all the help they can get.
Let me lay just a few statistics on you, from Secure360.org's The Challenges of Incident Response, Nov 2016. Per their respondents in a survey of security professionals:
  • 38% reported an increase in the number of hours devoted to incident response
  • 42% reported an increase in the volume of incident response data collected
  • 39% indicated an increase in the volume of security alerts
In short, according to Nathan Burke, “It’s just not mathematically possible for companies to hire a large enough staff to investigate tens of thousands of alerts per month, nor would it make sense.”
The 2017 SANS Incident Response Survey, compiled by Matt Bromiley in June, reminds us that “2016 brought unprecedented events that impacted the cyber security industry, including a myriad of events that raised issues with multiple nation-state attackers, a tumultuous election and numerous government investigations.” Further, "seemingly continuous leaks and data dumps brought new concerns about malware, privacy and government overreach to the surface.”
Finally, the survey shows that IR teams are:
  • Detecting the attackers faster than before, with a drastic improvement in dwell time
  • Containing incidents more rapidly
  • Relying more on in-house detection and remediation mechanisms
To that end, what concepts and methods further enable handlers and investigators as they continue to strive for faster detection and containment? Data science and visualization sure can’t hurt. How can we be more creative to achieve “deeper functionality”? I propose a two-part series on Deeper Functionality for Investigators with R with the following DFIR Redefined scenarios:
  • Have you been pwned?
  • Visualization for malicious Windows Event Id sequences
  • How do your potential attackers feel, or can you identify an attacker via sentiment analysis?
  • Fast Frugal Trees (decision trees) for prioritizing criticality
R is “100% focused and built for statistical data analysis and visualization” and “makes it remarkably simple to run extensive statistical analysis on your data and then generate informative and appealing visualizations with just a few lines of code.”

With R you can interface with data via file ingestion, database connection, APIs and benefit from a wide range of packages and strong community investment.
From the Win-Vector Blog, per John Mount “not all R users consider themselves to be expert programmers (many are happy calling themselves analysts). R is often used in collaborative projects where there are varying levels of programming expertise.”
I propose that this represents the vast majority of us, we're not expert programmers, data scientists, or statisticians. More likely, we're security analysts re-using code for our own purposes, be it red team or blue team. With a very few lines of R investigators might be more quickly able to reach conclusions.
All the code described in the post can be found on my GitHub.

Have you been pwned?

This scenario I covered in an earlier post, I'll refer you to Toolsmith Release Advisory: Steph Locke's HIBPwned R package.

Visualization for malicious Windows Event Id sequences

Windows Events by Event ID present excellent sequenced visualization opportunities. A hypothetical scenario for this visualization might include multiple failed logon attempts (4625) followed by a successful logon (4624), then various malicious sequences. A fantastic reference paper built on these principle is Intrusion Detection Using Indicators of Compromise Based on Best Practices and Windows Event Logs. An additional opportunity for such sequence visualization includes Windows processes by parent/children. One R library particularly well suited to is TraMineR: Trajectory Miner for R. This package is for mining, describing and visualizing sequences of states or events, and more generally discrete sequence data. It's primary aim is the analysis of biographical longitudinal data in the social sciences, such as data describing careers or family trajectories, and a BUNCH of other categorical sequence data. Somehow though, the project page somehow fails to mention malicious Windows Event ID sequences. :-) Consider Figures 1 and 2 as retrieved from above mentioned paper. Figure 1 are text sequence descriptions, followed by their related Windows Event IDs in Figure 2.

Figure 1
Figure 2
Taking related log data, parsing and counting it for visualization with R would look something like Figure 3.

Figure 3
How much R code does it take to visualize this data with a beautiful, interactive sunburst visualization? Three lines, not counting white space and comments, as seen in the video below.