I use the tidyverse, plus the chron package, since some of the data are pure elapsed times.
I then use the rvest package to parse these files. You need to inspect the HTML to see how the data are structured in the saved .html file. You can just open the file in RStudio, for example, and scroll down until you see some of the data.
In the case of this cycling data, the rows ('tr') alternate between class 'Odd' and class 'Even'. Within each row, there are a number of data elements ('td').
We mimic this structure with two calls to 'html_nodes()': the first picks out the rows by class (with an extra . in front, as in a CSS selector), the second pulls out the data elements within them. Then we use 'html_text()' to keep only the text of each element, dropping the markup, with trim = TRUE removing the surrounding whitespace.
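To make the two-step selection concrete, here is a minimal sketch on a tiny hand-written HTML string. The table contents are made up for illustration and are not taken from the race files; only the 'Odd'/'Even' row classes follow the structure described above.

```r
library(rvest)

# Toy HTML (hypothetical content) with rows of class "Odd" and "Even"
toy_html <- '<table>
  <tr class="Odd"><td> 1 </td><td>Rider A</td></tr>
  <tr class="Even"><td>2</td><td>Rider B</td></tr>
  <tr class="Odd"><td>3</td><td>Rider C</td></tr>
</table>'

page <- read_html(toy_html)

# First select rows by class, then the cells inside them, then keep the text
odd_cells <- page %>%
  html_nodes(".Odd") %>%
  html_nodes("td") %>%
  html_text(trim = TRUE)

even_cells <- page %>%
  html_nodes(".Even") %>%
  html_nodes("td") %>%
  html_text(trim = TRUE)
```

Selecting one class at a time returns all the 'Odd' cells first and then all the 'Even' cells, so the original row order is not preserved.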
library(tidyverse)
library(rvest) # for web-scraping
library(chron) # for times

allRes <- unlist(lapply(0:4, function(x) {
  # read in the html
  res <- read_html(paste0("race", x, ".html"))
  # pick the data we want - this comes with the column headings
  unlist(lapply(c(".Odd", ".Even"), function(cls) {
    res %>%
      html_nodes(cls) %>%
      html_nodes("td") %>%
      html_text(trim = TRUE)
  }))
}))
This is partly a hack in that I'm assuming I know what the files are, and that they have indices 0 to 4. But that's true here, if not elegant.
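If you would rather not hard-code the indices, one possible alternative is to discover the files by name. This is only a sketch: 'find_race_files()' is a hypothetical helper, and it assumes the files follow the race<number>.html naming used above. The demonstration creates empty placeholder files in a temporary directory.

```r
# Hypothetical helper: find every file named race<digits>.html in a directory
find_race_files <- function(dir = ".") {
  files <- list.files(dir, pattern = "^race[0-9]+\\.html$", full.names = TRUE)
  # order by the numeric index in the name, so race10 comes after race9
  idx <- as.integer(sub("^race([0-9]+)\\.html$", "\\1", basename(files)))
  files[order(idx)]
}

# Demonstration with empty placeholder files in a temporary directory
demo_dir <- file.path(tempdir(), "race_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir, c("race0.html", "race10.html", "race2.html", "notes.txt")
)))

race_files <- find_race_files(demo_dir)
basename(race_files)
```

The resulting vector of paths could then be passed to 'lapply()' in place of 0:4, calling 'read_html()' on each path directly.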
Finally, I format all these data into a dataframe, via a matrix, for clarity. Everything arrives as strings, so I name the columns and convert them to more meaningful types.
# convert to dataframe
z <- matrix(allRes, byrow = TRUE, ncol = 14)
colnames(z) <- c("Pos", "ID", "Sex", "Name", "Age", "Nat", "Team", "Time",
                 "Tkm", "AveSpd", "CatPos", "Cat", "City", "x")

results <- as_tibble(z) %>%
  # reformat some columns (non-numeric entries become NA)
  mutate(across(c(Pos, ID, Age, CatPos), as.integer)) %>%
  # remove non-finishers
  filter(!is.na(Pos)) %>%
  mutate(AveSpd = as.numeric(AveSpd),
         # make time always HMS
         Time = times(if_else(nchar(Time) < 6, paste0("0:", Time), Time)),
         # complete sex
         Sex = if_else(Sex == "F", Sex, "M")) %>%
  # we still get some NAs, which are the ones whose age is unknown,
  # so their CatPos is also unknown
  drop_na() %>%
  # manual ordering
  arrange(Cat, AveSpd)
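The reshaping and the time padding can be sketched on a toy vector in base R. The values below are made up for illustration, and the three-column layout is an assumption chosen to keep the example small; the real data have 14 columns.

```r
# Made-up flat vector of scraped strings: 2 rows x 3 columns, row by row
flat <- c("1", "Rider A", "1:02:03",
          "2", "Rider B", "59:30")

# byrow = TRUE fills the matrix one record at a time
m <- matrix(flat, byrow = TRUE, ncol = 3)
colnames(m) <- c("Pos", "Name", "Time")

toy <- as.data.frame(m, stringsAsFactors = FALSE)
toy$Pos <- as.integer(toy$Pos)

# Times under an hour have fewer than 6 characters; prefix "0:" to get H:MM:SS
toy$Time <- ifelse(nchar(toy$Time) < 6, paste0("0:", toy$Time), toy$Time)
toy
```

Once every entry has hours, minutes and seconds, the column is in a uniform form to hand to 'times()' from chron, as in the code above.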
After that, it's hard to resist having a look at the data. I was keen to try out the pirate plot for the first time, from package yarrr. This is an easy way to view data in a number of categories: viewing all of the data points (as jittered points), a distribution density for each category, and some descriptive statistics. For the last of these, I chose the option 'iqr' for interquartile range, which gives me something looking like a box plot on top of the data, especially when coupled with the median function for the central line.
Here, the categories are competition age groups.
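As a numeric counterpart to the median line and interquartile band, the same statistics can be computed per category in base R. This is a rough sketch on made-up speeds, not the race results, and the category labels "A" and "B" are placeholders.

```r
# Made-up speeds for two placeholder categories
toy <- data.frame(
  Cat    = rep(c("A", "B"), each = 5),
  AveSpd = c(20, 22, 24, 26, 28, 30, 31, 32, 33, 34)
)

# Lower quartile, median and upper quartile of speed within each category
summary_by_cat <- do.call(rbind, lapply(split(toy$AveSpd, toy$Cat), function(v) {
  c(q1     = unname(quantile(v, 0.25)),
    median = median(v),
    q3     = unname(quantile(v, 0.75)))
}))
summary_by_cat
```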
library(yarrr) # for pirate plot

# plot of speed
pp <- pirateplot(formula = AveSpd ~ Cat,
                 data = results,
                 avg.line.fun = median,
                 ylab = 'Average Speed (kmh)',
                 xlab = "Competition Group",
                 inf.method = 'iqr',
                 theme = 2,
                 point.o = 0.6)
The resulting plot is lovely: