Thursday, 28 June 2018

Web scraping for cycling results - Here be pirates!

In the previous blog post, I went through the steps of downloading data from a number of sequential pages, and saving them on your local system as separate html files. 

The next step is to parse these files with the rvest package. I also use the tidyverse, plus the chron package because some of the columns are pure elapsed times.

Before writing any code, inspect the html to see how the data are laid out in the saved .html file. You can just open the file in RStudio, for example, and scroll down until you see some of the data.

In the case of this cycling data, the table rows ('tr') alternate between class 'Odd' and class 'Even'. Within each row there are a number of data elements ('td').

We mimic this structure with two calls to html_nodes(): the first picks out the rows by class (the leading . is the CSS selector syntax for a class), the second pulls out the data elements within them. Then html_text() extracts the text of each element, and trim = TRUE strips the surrounding whitespace.
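To see the two-step selection in isolation, here is a toy version on a made-up fragment of html with the same Odd/Even shape. The rider names and values are invented, not taken from the results page, so treat it as a sketch of the idea only.

```r
library(rvest)

# A made-up fragment shaped like the results table (not real race data)
html <- '<table>
  <tr class="Odd"><td>1</td><td> Rider A </td></tr>
  <tr class="Even"><td>2</td><td> Rider B </td></tr>
</table>'

page <- read_html(html)

cells <- page %>%
  html_nodes(".Odd") %>%     # rows with class "Odd"
  html_nodes("td") %>%       # data cells inside those rows
  html_text(trim = TRUE)     # text only, surrounding whitespace trimmed

cells
# "1" "Rider A"
```

Swapping ".Odd" for ".Even" returns the cells of the second row instead.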

 
library(tidyverse)
library(rvest) #for web-scraping
library(chron) #for times

allRes <- unlist(lapply(0:4, function(i){
  # read in the html
  res <- read_html(paste0("race",i,".html")) 
  # pick the data we want - this comes with the column headings
  # (all the Odd rows of a page first, then all its Even rows)
  unlist(lapply(c(".Odd",".Even"), 
                function(cls){res %>% 
                    html_nodes(cls) %>% 
                    html_nodes("td") %>% 
                    html_text(trim = TRUE)}))
}))
  

This is partly a hack in that I'm assuming I know what the files are, and that they have indices 0 to 4. But that's true here, if not elegant.

Finally, I shape these data into a data frame, going via a matrix for clarity. Everything arrives as strings, so I name the columns and convert them to more meaningful types.
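The reshaping step is easier to follow on a tiny example. This is a sketch in base R with six made-up values and three hypothetical columns, rather than the fourteen columns of the real table:

```r
# Six made-up cell values: two rows of three fields each
cells <- c("1", "Rider A", "38.2",
           "2", "Rider B", "37.9")

# byrow = TRUE fills one table row at a time, which is how the cells arrive
m <- matrix(cells, byrow = TRUE, ncol = 3)
colnames(m) <- c("Pos", "Name", "AveSpd")

# Everything is still character at this point, so convert column by column
df <- as.data.frame(m, stringsAsFactors = FALSE)
df$Pos <- as.integer(df$Pos)
df$AveSpd <- as.numeric(df$AveSpd)
str(df)
```

If ncol does not match the number of cells per row, the values end up in the wrong columns, so it is worth checking the first few rows by eye.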


#convert to dataframe
z <- matrix(allRes, byrow=TRUE, ncol=14)
colnames(z) <- c("Pos","ID","Sex","Name","Age","Nat","Team","Time","Tkm","AveSpd","CatPos","Cat","City","x")
results <- as_tibble(z) %>% 
  #reformat some columns
  mutate_at(.vars = vars(Pos, ID, Age, CatPos),
            .funs = as.integer
         ) %>% 
  #remove non-finishers
  filter(!is.na(Pos)) %>% 
  mutate(AveSpd = as.numeric(AveSpd),
         # make time always HMS
         Time = times(if_else(nchar(Time)<6, paste0("0:",Time), Time)),
         #complete sex
         Sex = if_else(Sex=="F",Sex,"M")) %>% 
  #we still get some NAs, which are the ones whose age is unknown, so their CatPos is also unknown
  drop_na() %>% 
  # manual ordering
  arrange(Cat, AveSpd)
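The Time step deserves a second look. Times under an hour arrive without an hours field (for example 45:10), which is why anything shorter than six characters gets 0: put in front before being handed to chron. A small sketch with made-up times:

```r
library(chron)

# Made-up elapsed times: the first two have no hours field
raw <- c("45:10", "59:59", "1:02:03")

# Prefix "0:" where the string is too short to contain hours
padded <- ifelse(nchar(raw) < 6, paste0("0:", raw), raw)
padded
# "0:45:10" "0:59:59" "1:02:03"

# chron stores these as fractions of a day, so they sort and subtract sensibly
elapsed <- times(padded)
elapsed
```

The length test is a shortcut that relies on the times being either MM:SS or H:MM:SS; a different layout would need a different rule.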

After that, it's hard to resist having a look at the data. I was keen to try out the pirate plot for the first time, from package yarrr. This is an easy way to view data in a number of categories: viewing all of the data points (as jittered points), a distribution density for each category, and some descriptive statistics. For the last of these, I chose the option 'iqr' for interquartile range, which gives me something looking like a box plot on top of the data, especially when coupled with the median function for the central line.

Here, the categories are competition age groups.


library(yarrr) #for pirate plot
#plot of speed
pp <- pirateplot(formula = AveSpd ~ Cat ,
    data = results,
    avg.line.fun = median,
    ylab = 'Average Speed (km/h)',
    xlab = "Competition Group",
    inf.method = 'iqr',
    theme = 2,
    point.o = 0.6
    ) 
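For a numerical cross-check of what the plot shows, the median and quartiles per group can be computed directly in base R. The data frame below is made up purely for illustration, and I am assuming that the 'iqr' band runs from the first to the third quartile:

```r
# Made-up speeds for two hypothetical groups (not the race data)
toy <- data.frame(
  Cat = rep(c("A", "B"), each = 4),
  AveSpd = c(30, 32, 34, 36, 25, 26, 27, 28)
)

# One column per group; rows are the 25%, 50% (median) and 75% quantiles
q <- sapply(split(toy$AveSpd, toy$Cat), quantile, probs = c(0.25, 0.5, 0.75))
q
```

Replacing toy with results gives the same summary for the real competition groups.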

The resulting plot is lovely:




Wednesday, 27 June 2018

Web scraping for cycling results

Perhaps it's a bit of overkill to web scrape data when the pages are essentially static, but if you need data that is spread over multiple pages on the web site - when you see '&page=1', '&page=2' etc in the web address - then an easy way to visit each one could be useful.

In the end it was quite quick to do with phantomJS and R, though it didn't work exactly as I had expected. phantomJS is out of support at the time of writing - thanks anyway to Ariya Hidayat and others for the work in setting it up in the first place!

Firstly you need to download phantomJS into the directory where you're working. Then you need to create a JavaScript file to tell phantomJS what to do. RStudio is happy with .js files.

This code worked for me - your url will be different. It expects to be passed the index number (0, 1, etc) of the page to be scraped. I also tried passing the full url, but that seems to be indigestible, and I couldn't tell whether it was phantomJS or the system call that choked on it.

// racescrape.js
// does not seem to like passing the entire web address as a parameter

// get the argument passed in the command line, this will be just an index number
var args = require('system').args;
var page = require('webpage').create();

// file system
var fs = require('fs');
var index = args[1];

// create the base url (incomplete)
var url = 'http://prod.chronorace.be/Classements/classement.aspx?eventId=1187957889363244&AdminMode=true&master=iframe&mode=large&IdClassement=17591&srch=&scope=All&page=';
// add the index number for the particular page this time
url += index;

// create an index-specific output file
var path = "race" + index + ".html";

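// scrape and save
page.open(url, function (status) {
  if (status !== 'success') {
    // page did not load: report it and save nothing
    console.log('Failed to load ' + url);
    phantom.exit(1);
    return;
  }
  fs.write(path, page.content, 'w');
  phantom.exit();
});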
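On the full-url problem: one possible culprit, which I have not verified, is the shell rather than phantomJS. The url contains & characters, and a shell treats an unquoted & as the end of a command. Base R's shQuote() wraps a string so that it survives as a single argument. A sketch with a stand-in url (example.com is not the real results site); the command is only built and printed here, not run:

```r
# Stand-in url with the same kind of query string (hypothetical, not the real page)
url <- "http://example.com/results.aspx?eventId=123&page=0"

# Quote the url so the shell passes it through as one argument
cmd <- paste("phantomjs-2.1.1-macosx/bin/phantomjs",
             "racescrape.js",
             shQuote(url))
cat(cmd, "\n")

# system(cmd)  # deliberately not run in this sketch
```

For this to work, racescrape.js would also have to use args[1] as the whole url instead of appending it to a base.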

In R, the code to scrape a number of pages and save them as separate files is then just this. It builds a system command in three parts: the location of your phantomJS installation, your .js file, and the index of the page.

In this case, I know from inspection that the pages are indexed 0 to 4. Out of politeness, having scraped them once, we can then work from the saved .html files, but that will be for a later blog post.

 
library(rvest) #for web-scraping

# load the web pages onto disk
# better do this only once
z <- lapply(0:4, function(x){
  system(paste("phantomjs-2.1.1-macosx/bin/phantomjs","racescrape.js", x))})
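One way to make "only once" automatic is to skip any page whose file is already on disk. This is a sketch only: pages_to_fetch is a hypothetical helper name, and the loop just prints the command it would run.

```r
# Hypothetical helper: which page indices do not yet have a saved file?
pages_to_fetch <- function(indices, dir = ".") {
  paths <- file.path(dir, paste0("race", indices, ".html"))
  indices[!file.exists(paths)]
}

for (i in pages_to_fetch(0:4)) {
  cmd <- paste("phantomjs-2.1.1-macosx/bin/phantomjs", "racescrape.js", i)
  message("Would run: ", cmd)  # swap for system(cmd) to fetch for real
}
```

Running it a second time after a successful scrape does nothing, which is the polite behaviour we are after.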