Exploring statistics: October 2016

Monday, 31 October 2016

Cycling into the wind

I've now merged the wind data that I was using for the thorn plots with the cycling data on the Tableau Public dashboard. It uses an average wind speed and direction based on the two METAR observations nearest in time to the middle of the ride.
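Averaging directions needs a little care, because 350° and 10° should average to north, not south. Here is a minimal sketch in R of one way to do it, by averaging the wind as vectors. The function name mean_wind is hypothetical, directions are assumed to be in degrees (the direction the wind blows from), and this is not necessarily how the dashboard figures were computed.

```r
# Hypothetical helper: vector-average a set of wind observations.
# dir_deg: direction the wind blows FROM, in degrees
# speed:   wind speed, in any consistent unit
mean_wind <- function(dir_deg, speed) {
  rad <- dir_deg * pi / 180
  # Resolve each observation into east (u) and north (v) components
  u <- mean(speed * sin(rad))
  v <- mean(speed * cos(rad))
  list(
    speed   = sqrt(u^2 + v^2),
    dir_deg = (atan2(u, v) * 180 / pi) %% 360
  )
}

# Two observations either side of north, both at 10 knots
w <- mean_wind(c(350, 10), c(10, 10))
```

Note that the vector-averaged speed is a little lower than the plain mean of the two speeds whenever the directions differ, so a plain mean of the speeds may be preferable if only the direction needs the vector treatment.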

I wondered how to show wind direction, but in the end a simple grouping by quadrant was easy to do and good enough to separate the data into meaningful groups.
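The grouping itself was done in Tableau, but the same idea can be sketched in R. This assumes the quadrants are NE, SE, SW and NW, each spanning 90° starting from north; the function name wind_quadrant is hypothetical.

```r
# Hypothetical helper: label a wind direction (degrees) by quadrant.
# Assumes 0-90 is NE, 90-180 is SE, 180-270 is SW and 270-360 is NW.
wind_quadrant <- function(dir_deg) {
  labels <- c("NE", "SE", "SW", "NW")
  labels[(dir_deg %% 360) %/% 90 + 1]
}

q <- wind_quadrant(c(10, 135, 225, 359))
```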

The effect of wind speed on the way home, when the wind is in the SW, is pretty clear, I think.

I just wish I could get a sensible 'date between start & end' calculated field into the title. I could not get this to work in Tableau.

Sunday, 9 October 2016

Routes of A350 and B787 - updated visualisation

The latest version of the routes map has been updated to use great circle links between airports. There seem to be 2 options for this - getting Tableau or R to do the work. Getting R to do the work was easier for me (just an application of gcIntermediate()) than getting to grips with interpolation functions in Tableau. The cost is that you pass a lot of extra data. I suppose I could have used 2 files and done a join in Tableau instead, but the volumes are not huge.
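To show what the extra data looks like, here is a base-R sketch of the idea behind great-circle interpolation. In practice gcIntermediate() from the geosphere package does this for you; the helper gc_points below is hypothetical, uses spherical linear interpolation on a unit sphere, and the airport coordinates are approximate.

```r
# Hypothetical base-R helper: n points along the great circle between two
# airports, by spherical linear interpolation (slerp) on a unit sphere.
gc_points <- function(lon1, lat1, lon2, lat2, n = 20) {
  to_xyz <- function(lon, lat) {
    lon <- lon * pi / 180
    lat <- lat * pi / 180
    c(cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))
  }
  a <- to_xyz(lon1, lat1)
  b <- to_xyz(lon2, lat2)
  omega <- acos(max(-1, min(1, sum(a * b))))  # angle between the two points
  f <- seq(0, 1, length.out = n)
  # Slerp; assumes the two points are neither identical nor antipodal
  p <- outer(sin((1 - f) * omega), a) + outer(sin(f * omega), b)
  p <- p / sin(omega)
  data.frame(
    order = seq_len(n),  # point order along the route, for drawing the path
    lon   = atan2(p[, 2], p[, 1]) * 180 / pi,
    lat   = asin(pmax(-1, pmin(1, p[, 3]))) * 180 / pi
  )
}

# Approximate coordinates, for illustration only: Delhi to Hong Kong
route <- gc_points(77.10, 28.57, 113.91, 22.31, n = 20)
```

Each route turns into n rows instead of one, which is the extra data volume mentioned above.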

There are Tableau workbooks around that do the work, but from Tableau Public it doesn't seem to be possible to open them and copy the calculations as suggested.

Saturday, 8 October 2016

Tableau visualisation of routes by new aircraft types: B787 and A350

The previous post gave a glimpse of the technical, R, side of processing some data compiled by spotters of where the new aircraft, B787 and A350, are flying. The R took a while to get right, but once the data were compiled, getting them into Tableau was a breeze.

There's a beta version here. Beta, mostly in the sense that there are still 1 or 2 glitches in the geo-coding of the data (so in R), but I thought it might be easier to see them given a Tableau map. I like the way the graph on the left acts as a filter for the map - for example comparing the routes of ANA and Japan Airlines.

And beta too, probably, because the viz could also be improved. Comments welcome.

Interpreting loosely formatted aviation text

This is a short, technical account of using the NLP package of R to interpret some 'loosely-formatted' text about services offered by airlines. In the text, airports are sometimes mentioned by IATA 3-letter code, sometimes by a text name, and not always the same name.

It took a while to get my head around how NLP (and openNLP) work. The manual entries are accurate, but they only made sense to me with hindsight. So I hope the following helps.

I built a named-entity labeller, but first had to work out how the entity annotator syntax works. Here is perhaps the simplest possible example, which labels everything as a single entity, 'all':

library(NLP)

# s is the character vector of word tokens in one sentence
identity_tokenizer <- function(s) {
  # log all words together as one single entity, type 'all'
  # the extent is measured here in words, not characters
  Annotation(1, "all", 1L, length(s))
}

identity_entity_annotator <- Simple_Entity_Annotator(identity_tokenizer)

For each line of text r, e.g. "Delhi DEL – HKG Hong Kong – KIX Osaka Kansai (Commenced)", I used the sledgehammer approach of openNLP

library(openNLP)  # provides the Maxent_* annotators

# r holds one line of text
AnnSentWord <- list(Maxent_Sent_Token_Annotator(),
                    Maxent_Word_Token_Annotator())
el <- NLP::annotate(as.String(r), AnnSentWord)

to get r annotated as sentences and words. Then I invoked my new annotator, which looks for airports (3-letter IATA codes, 4-letter ICAO codes, or names of 1 to n words) and labels these as 'airport', with the ICAO code as a 'feature':

el2 <- NLP::annotate(as.String(r), AP_entity_annotator, el)

For some reason, adding this to the 'pipe' of sentence and word annotators always failed, so it runs as a separate step. The annotator itself is fairly mechanical, using lookup environments (see list2env()), since the guidance online is that these are relatively fast. In my own tests they were about 30 times faster than simple subsetting of data frames.
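As an illustration of the lookup-environment idea, here is a minimal sketch. The names iata_to_icao and lookup_icao are hypothetical and the table has only three entries; the real annotator's tables are not shown in this post.

```r
# Hypothetical lookup table: IATA code -> ICAO code, held in an environment
iata_to_icao <- list2env(list(DEL = "VIDP", HKG = "VHHH", KIX = "RJBB"))

# Look up a vector of tokens; tokens that are not known codes give NA
lookup_icao <- function(code) {
  unlist(mget(code, envir = iata_to_icao, ifnotfound = NA_character_))
}

res <- lookup_icao(c("DEL", "XXX", "KIX"))
```

Because mget() does not search enclosing environments by default, an unknown token cannot accidentally match some other object of the same name.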

The answers are deterministic, not probabilistic, but are based on a likelihood order: 4-letter and 3-letter upper-case matches are unlikely to occur by chance; longer name matches (e.g. 'Santiago de Chile') take precedence over shorter ones ('Santiago'); and airports with more traffic are more likely than quieter ones. On average, I had about 4 names per airport.
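The 'longer names first' rule can be sketched as follows. This is only an illustration of that one rule, not the annotator itself: the table name_to_icao and the function match_longest are hypothetical, and the precedence for codes and for traffic is left out.

```r
# Hypothetical name table: lower-case airport name -> ICAO code
name_to_icao <- list2env(list(
  "santiago de compostela" = "LEST",
  "santiago"               = "SCEL",
  "hong kong"              = "VHHH"
))

# Try the longest candidate names first, so that 'Santiago de Compostela'
# is matched before the shorter 'Santiago' gets a chance.
match_longest <- function(words, max_words = 3) {
  if (length(words) == 0) return(NULL)
  words <- tolower(words)
  for (n in seq(min(max_words, length(words)), 1)) {
    for (start in seq_len(length(words) - n + 1)) {
      key <- paste(words[start:(start + n - 1)], collapse = " ")
      hit <- get0(key, envir = name_to_icao, inherits = FALSE)
      if (!is.null(hit)) {
        # extent is in words, as for the entity annotator above
        return(list(start = start, end = start + n - 1, icao = hit))
      }
    }
  }
  NULL
}

m1 <- match_longest(c("Flights", "to", "Santiago", "de", "Compostela"))
m2 <- match_longest(c("Santiago", "daily"))
```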

It's not the most sophisticated application of NLP, but you have to start somewhere!
