Wednesday 27 June 2018

Web scraping for cycling results

Perhaps it's a bit of overkill to web scrape data when the pages are essentially static, but if the data you need is spread over multiple pages of the web site - when you see '&page=1', '&page=2' and so on in the web address - then an easy way to visit each page in turn is useful.

In the end it was quite quick to do with phantomJS and R, though it didn't work exactly as I had expected. phantomJS is out of support at the time of writing - thanks anyway to Ariya Hidayat and the other contributors for their work in setting it up in the first place!

Firstly you need to download phantomJS into the directory where you're working. Then you need to create a JavaScript file to tell phantomJS what to do. RStudio is happy with .js files.

This code worked for me - your url will be different. It expects to be passed an index number (0, 1, etc.) for the page to be scraped. I also tried passing the full url, but that seems to be indigestible, and I couldn't tell whether it was indigestible to phantomJS or to the system call.

// racescrape.js
// does not seem to like passing the entire web address as a parameter

// get the argument passed in the command line, this will be just an index number
var args = require('system').args;
var page = require('webpage').create();

// file system
var fs = require('fs');
var index = args[1];

// create the base url (incomplete)
var url = 'http://prod.chronorace.be/Classements/classement.aspx?eventId=1187957889363244&AdminMode=true&master=iframe&mode=large&IdClassement=17591&srch=&scope=All&page=';
// add the index number for the particular page this time
url += index;

// create an index-specific output file
var path = "race" + index + ".html";

// scrape and save
page.open(url, function (status) {
  var content = page.content;
  fs.write(path,content,'w');
  phantom.exit();
});
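
Before looping over all the pages from R, it's worth a single-page test. A minimal check, assuming phantomJS has been unpacked into the working directory as phantomjs-2.1.1-macosx (adjust the path to match your own download):

# quick single-page test from the R console
# the binary path is an assumption - point it at wherever phantomJS lives on your machine
system(paste("phantomjs-2.1.1-macosx/bin/phantomjs", "racescrape.js", 0))
# if the script ran, race0.html should now exist in the working directory
file.exists("race0.html")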

In R, the code to scrape a number of pages and save them as separate files is just this. It builds a system command: the first part is the location of your phantomJS installation, the second is your .js file, and the third is the index of the page.

In this case, I know from inspection that the pages are indexed 0 to 4. Out of politeness, having scraped them once, we can work from the saved .html files from then on - but that will be for a later blog post.

 
library(rvest) # for web scraping

# load the web pages onto disk - better do this only once
z <- lapply(0:4, function(x) {
  system(paste("phantomjs-2.1.1-macosx/bin/phantomjs", "racescrape.js", x))
})
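
Having saved the files, a quick sanity check that they can be read back is something like this - a minimal sketch only, assuming the results sit in an ordinary HTML table (the proper parsing is for that later post):

# read the first saved page back in and list any tables it contains
# (a sketch - the actual chronorace page structure may need more careful parsing)
page0 <- read_html("race0.html")
tables <- html_table(page0, fill = TRUE)
length(tables)     # how many tables rvest found
head(tables[[1]])  # a peek at the first one, if there is one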
  
