In the end it was quite quick to do with PhantomJS and R, though it didn't work exactly as I had expected. PhantomJS is no longer supported at the time of writing, but thanks anyway to Ariya Hidayat and others for the work in setting it up in the first place!
First, download PhantomJS into the directory where you're working. Then create a JavaScript file to tell PhantomJS what to do. RStudio is happy to edit .js files.
This code worked for me; your URL will be different. The script expects to be passed the index number (0, 1, etc.) of the page to be scraped. I also tried passing the full URL, but this seems to be indigestible, and I couldn't tell whether it was PhantomJS or the system call that choked on it.
// racescrape.js
// does not seem to like passing the entire web address as a parameter

// get the argument passed in the command line, this will be just an index number
var args = require('system').args;
var page = require('webpage').create();
// file system
var fs = require('fs');

var index = args[1];

// create the base url (incomplete)
var url = 'http://prod.chronorace.be/Classements/classement.aspx?eventId=1187957889363244&AdminMode=true&master=iframe&mode=large&IdClassement=17591&srch=&scope=All&page=';
// add the index number for the particular page this time
url += index;

// create an index-specific output file
var path = "race" + index + ".html";

// scrape and save
page.open(url, function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
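Since the script trusts whatever index it is handed, a small check before building the URL might save a wasted request. The following is only a sketch: `buildTarget` and `BASE_URL` are hypothetical names, not part of the original script, and the code is plain ES5 so it should run in PhantomJS or Node alike.

```javascript
// Base URL copied from racescrape.js; the page index is appended to it.
var BASE_URL = 'http://prod.chronorace.be/Classements/classement.aspx?eventId=1187957889363244&AdminMode=true&master=iframe&mode=large&IdClassement=17591&srch=&scope=All&page=';

// Hypothetical helper (not in the original script): turn the command-line
// argument into the URL to fetch and the name of the file to write.
function buildTarget(arg) {
  // Command-line arguments arrive as strings, so test the text itself.
  if (!/^\d+$/.test(String(arg))) {
    throw new Error('expected a non-negative page index, got: ' + arg);
  }
  return {
    url: BASE_URL + arg,          // full address of the page to scrape
    path: 'race' + arg + '.html'  // index-specific output file
  };
}

// Example: the target for page 3
var target = buildTarget('3');
console.log(target.path);
```

In racescrape.js this would take `args[1]` as its argument, with `target.url` and `target.path` standing in for `url` and `path`.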
In R, the code to scrape a number of pages and save them as separate files is then just this. It builds a system command in three parts: the location of your PhantomJS installation, your .js file, and the index of the page.
In this case, I know from inspection that the pages are indexed 0 to 4. Out of politeness, having scraped them once, we can then work from the saved .html files, but that will be for a later blog post.
library(rvest) # for web-scraping

# load the web pages onto disk
# better do this only once
z <- lapply(0:4, function(x) {
  system(paste("phantomjs-2.1.1-macosx/bin/phantomjs", "racescrape.js", x))
})