Exploring statistics: Interpreting loosely formatted aviation text

This is a short, technical account of using the NLP package of R to interpret some 'loosely-formatted' text about services offered by airlines. In the text, airports are sometimes mentioned by IATA 3-letter code, sometimes by a text name, and not always the same name.

It took a while to get my head around how NLP (and openNLP) work. The manual entries are accurate, but that is only clear with hindsight. So I hope the following helps.

I built a named entity labeller, but first had to work out how the entity annotator syntax works. Here is perhaps the simplest possible example, which labels everything as a single entity of type 'all'.

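library(NLP)  # provides Annotation() and Simple_Entity_Annotator()

identity_tokenizer <- function(s) {
  # s is a character vector holding the word tokens of one sentence
  # log all words together as one single entity, type 'all'
  # the extent (start, end) is measured here in words, not characters
  Annotation(1L, "all", 1L, length(s))
}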

identity_entity_annotator <- Simple_Entity_Annotator(identity_tokenizer)
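
To see what this annotator does without needing Java, here is a minimal sketch that runs it on one line using stand-in tokenisers built from the NLP package alone. The helpers line_as_sentence and split_on_space are my own assumptions for illustration (they treat the whole line as one sentence and split words on whitespace); they are not what openNLP does.

```r
library(NLP)

# Repeated from above so that this block runs on its own.
identity_tokenizer <- function(s) Annotation(1L, "all", 1L, length(s))
identity_entity_annotator <- Simple_Entity_Annotator(identity_tokenizer)

# Stand-in sentence tokeniser (assumption): the whole line is one sentence.
line_as_sentence <- function(s) Span(1L, nchar(s))

# Stand-in word tokeniser (assumption): words are runs of non-whitespace.
split_on_space <- function(s) {
  m <- gregexpr("[^[:space:]]+", s)[[1L]]
  Span(m, m + attr(m, "match.length") - 1L)
}

s <- as.String("Delhi DEL - HKG Hong Kong")

# Sentence and word annotations first, then the entity annotator on top.
a <- annotate(s, list(Simple_Sent_Token_Annotator(line_as_sentence),
                      Simple_Word_Token_Annotator(split_on_space)))
a2 <- annotate(s, identity_entity_annotator, a)

# The entity comes back with character offsets, although the
# tokenizer function itself only ever saw word positions.
ent <- a2[a2$type == "all"]
ent
s[ent]
```

The point to notice is the change of units: the function passed to Simple_Entity_Annotator() counts in words, and the annotator converts those positions to character offsets in the original string.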

For each line of text (held in r), e.g. "Delhi DEL – HKG Hong Kong – KIX Osaka Kansai (Commenced)", I used the sledgehammer approach of openNLP

library(openNLP)  # Maxent_* annotators; as.String() and annotate() come from NLP

AnnSentWord <- list(Maxent_Sent_Token_Annotator(),
                    Maxent_Word_Token_Annotator())
el <- NLP::annotate(as.String(r), AnnSentWord)

to get r annotated as sentences and words. Then I invoked my new annotator, which looks for airports (3-letter IATA codes, 4-letter ICAO codes, or names of one or more words) and labels these as 'airport', with the ICAO code as a 'feature'.

el2 <- NLP::annotate(as.String(r), AP_entity_annotator, el)
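
AP_entity_annotator itself is not shown here. As a rough sketch of the general shape only, the following detector handles just the 3-letter IATA case. The names iata_to_icao, code_detector and code_annotator are hypothetical, and the lookup table holds three sample airports rather than a real list.

```r
library(NLP)

# Hypothetical lookup table, IATA code -> ICAO code (three sample airports only).
iata_to_icao <- list2env(list(DEL = "VIDP", HKG = "VHHH", KIX = "RJBB"))

# Entity detector: takes the word tokens of one sentence and returns an
# Annotation whose start/end are word positions, not character offsets.
code_detector <- function(words) {
  known <- vapply(words, function(w) {
    # Only 3-letter upper-case words are looked up at all.
    grepl("^[A-Z]{3}$", w) && exists(w, envir = iata_to_icao, inherits = FALSE)
  }, logical(1), USE.NAMES = FALSE)
  idx <- which(known)
  Annotation(
    id = seq_along(idx),
    type = rep("airport", length(idx)),
    start = idx,
    end = idx,
    # One feature list per match, carrying the ICAO code.
    features = lapply(words[idx], function(w) {
      list(icao = get(w, envir = iata_to_icao, inherits = FALSE))
    })
  )
}

words <- c("Delhi", "DEL", "-", "HKG", "Hong", "Kong", "-", "KIX", "Osaka", "Kansai")
found <- code_detector(words)
found

# Wrapped in the same way as the identity example, for use after
# sentence and word annotation.
code_annotator <- Simple_Entity_Annotator(code_detector)
```

Note inherits = FALSE in the lookups: without it, a word that happens to match an object elsewhere on the search path could be reported as found.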

For some reason, adding this annotator to the 'pipe' of sentence and word annotators always failed, so I ran it as a separate step. The annotator itself is fairly mechanical. It uses lookup environments (see list2env()), since the guidance online is that these are relatively fast; in my own tests they were about 30 times faster than simple subsetting of data frames.
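
For anyone wanting to repeat that comparison, a minimal timing harness might look like the sketch below. The table is synthetic (5,000 made-up keys, not airport data), the function names are my own, and the ratio you see will depend on table size, R version and machine, so treat it as a way of measuring rather than as a result.

```r
# Synthetic lookup table: made-up keys K0001..K5000 mapped to integers.
n <- 5000L
tbl <- data.frame(key = sprintf("K%04d", seq_len(n)),
                  value = seq_len(n),
                  stringsAsFactors = FALSE)

# Build the environment from a named list of the values.
vals <- as.list(tbl$value)
names(vals) <- tbl$key
env <- list2env(vals)

# Two ways of asking the same question.
lookup_env <- function(k) get0(k, envir = env, inherits = FALSE)
lookup_df  <- function(k) tbl$value[tbl$key == k]

# Time a batch of random lookups with each method.
set.seed(1)
keys <- sample(tbl$key, 2000L)
t_env <- system.time(for (k in keys) lookup_env(k))
t_df  <- system.time(for (k in keys) lookup_df(k))

rbind(environment = t_env, data_frame = t_df)
```

One difference to keep in mind: for a missing key, get0() returns NULL here, whereas the data frame subset returns a zero-length vector.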

The answers are deterministic, not probabilistic, but they are based on a likelihood order: 4-letter and 3-letter upper-case matches are unlikely to occur by chance; longer name matches (such as 'Santiago de Chile') take precedence over shorter ones ('Santiago'); and airports with more traffic are preferred over quieter ones. On average, I had about 4 names per airport.
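
The name-matching part of that ordering can be sketched as a greedy longest-match over the word tokens. This is an illustration under assumptions, not the code I used: match_names and name_tbl are hypothetical, the table has three rows, and the traffic column is an invented relative rank.

```r
# Hypothetical name table. 'Santiago' is ambiguous; traffic is an
# invented rank used only to pick the busier airport.
name_tbl <- data.frame(
  name    = c("Santiago de Chile", "Santiago", "Santiago"),
  icao    = c("SCEL", "SCEL", "LEST"),
  traffic = c(3, 3, 1),
  stringsAsFactors = FALSE
)

match_names <- function(words, tbl, max_words = 3L) {
  # For each name keep only the airport with the most traffic.
  tbl <- tbl[order(-tbl$traffic), ]
  tbl <- tbl[!duplicated(tbl$name), ]
  vals <- as.list(tbl$icao)
  names(vals) <- tbl$name
  lookup <- list2env(vals)

  out <- list()
  i <- 1L
  while (i <= length(words)) {
    hit <- FALSE
    # Try the longest candidate starting at word i first, then shorter ones.
    for (n in seq(min(max_words, length(words) - i + 1L), 1L)) {
      cand <- paste(words[i:(i + n - 1L)], collapse = " ")
      icao <- get0(cand, envir = lookup, inherits = FALSE)
      if (!is.null(icao)) {
        out[[length(out) + 1L]] <- data.frame(start = i, end = i + n - 1L,
                                              name = cand, icao = icao,
                                              stringsAsFactors = FALSE)
        i <- i + n   # skip the words just consumed
        hit <- TRUE
        break
      }
    }
    if (!hit) i <- i + 1L
  }
  do.call(rbind, out)   # NULL when nothing matched
}

res <- match_names(c("Santiago", "de", "Chile", "-", "Santiago"), name_tbl)
res
```

Because the three-word candidate is tried first, the leading 'Santiago' is consumed as part of 'Santiago de Chile' and is not matched again on its own; the trailing bare 'Santiago' falls back to the busier of the two airports in the table.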

It's not the most sophisticated application of NLP, but you have to start somewhere!


Saturday, 8 October 2016
