Thursday 28 June 2018

Web scraping for cycling results - Here be pirates!

In the previous blog post, I went through the steps of downloading data from a number of sequential pages, and saving them on your local system as separate html files. 

I use the tidyverse throughout, plus the chron package, since some of the data are pure elapsed times.

I then used the rvest package to parse these files. You need to inspect the html to see the structure of the data as now saved in the .html file. You can just open the file in RStudio, for example, and scroll down until you see some of the data.

In the case of this cycling data, the rows ('tr') alternate between class 'Odd' and class 'Even'. Within each row, there are a number of data elements, 'td'.

We mimic this structure with two calls to 'html_nodes()': the first picks out the rows by class (written with a leading '.', as in a CSS selector), and the second pulls out the data elements within those rows. Then we use html_text to keep just the text of each element, discarding the markup and, with trim = TRUE, any surrounding whitespace.

 
library(tidyverse)
library(rvest) #for web-scraping
library(chron) #for times

allRes <- unlist(lapply(0:4, function(x){
  # read in the html
  res <- read_html(paste0("race",x,".html")) 
  # pick the data we want - this comes with the column headings
  z <- unlist(lapply(c(".Odd",".Even"), 
                     function(x){res %>% 
                         html_nodes(x) %>% 
                         html_nodes("td") %>% 
                         html_text(trim = TRUE)}))                   
}))
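
To see the two-step selection on its own, here is a minimal sketch run against a made-up two-row table rather than one of the saved race files. The table contents and the name odd_cells are assumptions for illustration only; the real pages have 14 cells per row.

```r
library(rvest)

# A made-up fragment standing in for one saved results page (hypothetical data)
page <- read_html('
<table>
  <tr class="Odd"><td>1</td><td> A Rider </td><td>2:01:15</td></tr>
  <tr class="Even"><td>2</td><td>B Rider</td><td>2:03:40</td></tr>
</table>')

# First select rows of class "Odd", then the cells inside them,
# then keep only the text of each cell, trimmed of whitespace
odd_cells <- page %>%
  html_nodes(".Odd") %>%
  html_nodes("td") %>%
  html_text(trim = TRUE)

odd_cells
```

The result is a flat character vector, one entry per cell, which is why the rows have to be rebuilt afterwards.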
  

This is partly a hack in that I'm assuming I know what the files are, and that they have indices 0 to 4. But that's true here, if not elegant.

Finally, format all these data into a dataframe, via a matrix, for clarity. Everything arrives as strings, so I name the columns and convert them to more meaningful types.


#convert to dataframe
z <- matrix(allRes, byrow=TRUE, ncol=14)
colnames(z) <- c("Pos","ID","Sex","Name","Age","Nat","Team","Time","Tkm","AveSpd","CatPos","Cat","City","x")
results <- as_tibble(z) %>% 
  #reformat some columns
  # non-numeric entries (e.g. non-finishers) become NA here, with a warning
  mutate_at(.vars = vars(Pos, ID, Age, CatPos),
            .funs = as.integer
         ) %>% 
  #remove non-finishers
  filter(!is.na(Pos)) %>% 
  mutate(AveSpd = as.numeric(AveSpd),
         # make time always HMS
         Time = times(if_else(nchar(Time)<6, paste0("0:",Time), Time)),
         #complete sex
         Sex = if_else(Sex=="F",Sex,"M")) %>% 
  #we still get some NAs, which are the ones whose age is unknown, so their CatPos is also unknown
  drop_na() %>% 
  # manual ordering
  arrange(Cat, AveSpd)
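
The reshaping and the time fix are easier to follow on a tiny input. The sketch below uses a hypothetical flat vector of two riders with three fields each (the real data has 14 fields), and the names flat, m, padded and elapsed are made up for the example.

```r
library(chron)

# Hypothetical flat vector, as it might come out of the scraping step:
# two riders, three fields each
flat <- c("1", "A Rider", "58:30",
          "2", "B Rider", "1:02:10")

# byrow = TRUE fills one rider per row; ncol is the number of fields per rider
m <- matrix(flat, byrow = TRUE, ncol = 3)
colnames(m) <- c("Pos", "Name", "Time")

# Times under an hour arrive as "MM:SS"; prefix "0:" so every value is H:MM:SS
raw_time <- m[, "Time"]
padded <- ifelse(nchar(raw_time) < 6, paste0("0:", raw_time), raw_time)

# chron stores times as fractions of a day, so they sort and subtract sensibly
elapsed <- times(padded)
elapsed
```

If the number of scraped values is not a multiple of ncol, matrix() recycles values with a warning, which is a useful sign that a row was missed.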

After that, it's hard to resist having a look at the data. I was keen to try out the pirate plot for the first time, from package yarrr. This is an easy way to view data in a number of categories: viewing all of the data points (as jittered points), a distribution density for each category, and some descriptive statistics. For the last of these, I chose the option 'iqr' for interquartile range, which gives me something looking like a box plot on top of the data, especially when coupled with the median function for the central line.

Here, the categories are competition age groups.


library(yarrr) #for pirate plot
#plot of speed
pp <- pirateplot(formula = AveSpd ~ Cat ,
    data = results,
    avg.line.fun = median,
    ylab = 'Average Speed (km/h)',
    xlab = "Competition Group",
    inf.method = 'iqr',
    theme = 2,
    point.o = 0.6
    ) 
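
As a rough cross-check on what the median line and the interquartile band summarise, the same statistics can be computed directly in base R. This is a sketch on made-up speeds for two hypothetical groups, and it uses R's default quantile definition, which is an assumption and may not match exactly what yarrr draws.

```r
# Made-up speeds for two hypothetical competition groups
speeds <- data.frame(
  Cat    = c("A", "A", "A", "B", "B", "B"),
  AveSpd = c(30, 32, 34, 25, 27, 31)
)

# For each group: the median plus the 25% and 75% quantiles (the IQR bounds)
stats <- sapply(split(speeds$AveSpd, speeds$Cat), function(v) {
  c(median = median(v), quantile(v, probs = c(0.25, 0.75)))
})

stats
```

Each column of stats is one group, with rows for the median and the two quartiles.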

The resulting plot is lovely:



