Monday 2 January 2023

Cleaning images with Shiny and ImageMagick

I have a pile of old photos, both literally (printed, loose or in albums and slide boxes) and figuratively (image files). Many of these are quite dirty: dust, hairs and (particularly with scanned slides) black edges, even fingerprint damage in places.

I also have a bunch of tools for processing photos.

But the tools just don't seem to meet my need for bulk cleaning of images. Look up guidance on how to use them, and it seems to involve either

  • using a 'heal' brush on each bit of dust (not practical for a really dusty image),
  • applying a 'remove dust' filter, which (for the tools I have) seems to change too much of the image, leaving non-dusty and interesting bits of the photo blurred,
  • or using a modern scanner with some sort of infrared detection of dust and hairs (and re-scanning everything?).

So, time to see if I can do something that meets my need better.

Using Shiny to build a simple web application and ImageMagick to do the image processing, I think I've got quite close. The main thing I like about my approach is that you can see directly where the dust has been found, and only those pixels get changed. So you don't notice later that the faces, hair or clothes have lost focus.
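Roughly, the idea can be sketched with the magick package like this (an illustrative sketch, not the app's actual code; the radius and threshold values are things to tune per image):

```r
library(magick)  # R bindings to ImageMagick

# Sketch of the approach: detect only the pixels that look like dust,
# then patch just those, so the rest of the image is untouched.
clean_dust <- function(img, radius = 4, cutoff = "12%") {
  # A median-filtered copy: specks a few pixels wide disappear here
  smooth <- image_median(img, radius = radius)
  # Pixels where the original differs strongly from the smoothed copy
  diff <- image_composite(img, smooth, operator = "Difference")
  mask <- image_convert(diff, colorspace = "Gray")
  mask <- image_threshold(mask, type = "white", threshold = cutoff)
  mask <- image_threshold(mask, type = "black", threshold = cutoff)
  # Use the mask as an alpha channel: only the 'dusty' pixels take
  # their values from the smoothed copy
  patch <- image_composite(smooth, mask, operator = "CopyOpacity")
  image_composite(img, patch, operator = "Over")
}
```

On a uniform area nothing changes; where there's an isolated speck, the median of its neighbourhood replaces it.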

The code is up at GitHub, which also has links to a shinyapps.io instance (though you're probably better off running the code on your own machine).

Not perfect, but the workflow is quite quick for me now. Some of the scans done 15 or 20 years ago are jpegs, which isn't ideal. 

In future, a 'batch' version could be implemented pretty easily, but I think I need to experiment with more images before doing that.

Sunday 4 December 2022

For "tidy-select" read "column", for "data-masking" read "row"

The moment arrived for an update to himach. I thought I was solving a few issues created by newer versions of packages that himach depends on (probably due to how I'd used them in the first place). Instead, I was faced with a host of warnings:

use of .data$ in tidyselect expressions is deprecated

It seems that I've been using the tidyverse wrong since I first made the 2speed project that became himach, 3 years ago. I'd understood that, in a package, you need to be specific and use, say, select(.data$x) or mutate(y = .data$x + 1) so that the code can be sure you're referring to the variable x from the dataframe and not another variable. This was one of the tricky steps to get used to when moving from writing 'normal' open code to writing a package.


But it was more subtle than that, or at least it is now: sometimes that's true and sometimes not. The tidyverse of course improves and develops, and maybe it wasn't clear to the authors at the time either. So updating my code wasn't just a global search and replace; it depends on which function is being used.

There are two types of reference to variables. This seems to be a key reference.

  1. <tidy-select> functions. In my code these were unnest, select, rename, across, pull. For these I did indeed have to replace .data$x with "x". The warning message gives the right guidance. I had one case of .data[[var]]. For that, the blog says to use all_of(var). Instead I used {{ var }}, because this feels right for passing a variable name as a function parameter.
  2. <data-masking> functions. In my code these were mutate, group_by, filter, arrange, summarise and also (because they're within a mutate?) case_when and if_else. In these cases I had to leave .data$x as it was.

The logic wasn't entirely clear to me. Tidy selection is sort of manipulating columns of the dataframe, while with data-masking you're more interested in manipulating the contents. If that's the case, why does select use tidy selection, but group_by doesn't? Is it that group_by implicitly uses the contents? That has to be it.

In the end, I think it's easier to think of:
  1. <column> functions, not <tidy-select>. These all manipulate the dataframe columns without caring about the rows. (Though the tidyverse, I think, would like you to think of these as another sort of variable, not columns of a dataframe.)
  2. <row> functions, not <data-masking>. These functions do different things depending on the values in each row.
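A minimal illustration of the two cases (toy data; dplyr loaded):

```r
library(dplyr)

# <tidy-select> ("column") context: name columns directly, or via all_of()
pick <- function(df, var) {
  select(df, all_of(var))          # rather than select(df, .data[[var]])
}

# <data-masking> ("row") context: .data$ is still correct inside a package
double_y <- function(df) {
  mutate(df, y2 = .data$y * 2)     # works on the contents of each row
}

df <- data.frame(x = 1:3, y = 4:6)
names(pick(df, "x"))    # "x"
double_y(df)$y2         # 8 10 12
```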

Monday 4 January 2021

Detecting objects in Images - with R, yolo and ImageMagick

I wanted to identify moving objects in video (or still) images - preferably using R as that's my tool of choice. I thought this might be challenging, but it proved to be relatively easy. 

The basic steps are:

  1. Import the images from the video, discard the 'empty' ones and save the interesting ones. [R, with ImageMagick]
  2. Run the interesting images through a neural net to locate and identify objects. Save the locations for some later stage of the project. [R, yolo]
  3. Stitch the results back together. [R, ImageMagick]

There's an example here, applied to video from a 'camera trap' in the garden.
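Step 1 can be sketched with magick like this (the cutoff is a guess to tune per camera; the frames themselves might come from magick's image_read_video()):

```r
library(magick)

# Keep a frame only if it differs enough from the previous frame;
# near-identical consecutive frames are treated as 'empty'
keep_interesting <- function(frames, cutoff = 0.01) {
  keep <- rep(TRUE, length(frames))
  for (i in seq_along(frames)[-1]) {
    d <- attr(image_compare(frames[i], frames[i - 1], metric = "RMSE"),
              "distortion")
    keep[i] <- d > cutoff
  }
  frames[keep]
}
```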



The rest of the blog gives the full details on how this is done.

Saturday 9 March 2019

purrr magic!

Used purrr::pmap for the first time and it's brilliant! 

Run the same function multiple times changing a bunch of parameters, and combine all the results in a single data frame. All in a single call.

Keeping code and data separate is, of course, good practice. But it can be easy to slip into mixing your code up with metadata (in this case data on where to find data and what it is). pmap makes it really easy to keep metadata separate from the code, too.

In my case, I'm pulling sets of data out of spreadsheets, where the blocks start on different rows in each sheet (don't ask!). So I just create a csv file with the parameter names on the first line, and the parameter values on the second line.

sheet,startRow,book,line
1, 53,What Might,Classics
2, 70,Runagates,Classics
3, 43,Desire,Classics
4, 45,Vocations,Classics
etc.

Reading this metadata into a data frame and then running my 'readSales' loading function takes just two lines of code:

classicSets <- read.csv(file = "data/classicSets.csv", stringsAsFactors = FALSE)
classicSales <- pmap_dfr(classicSets, readSales)

'readSales' is just a wrapper around the excellent readxl::read_xls that cleans up my data and adds some identifying columns (book, line).
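The mechanics look like this, with a stub standing in for readSales (toy metadata; pmap_dfr calls the function once per row, passing each column as a named argument, and row-binds the results):

```r
library(purrr)

sets <- data.frame(sheet    = 1:3,
                   startRow = c(53, 70, 43),
                   book     = c("What Might", "Runagates", "Desire"),
                   stringsAsFactors = FALSE)

# Stub: the real readSales would call readxl::read_xls(..., sheet = sheet,
# skip = startRow - 1) and clean up; here we just echo the metadata
readStub <- function(sheet, startRow, book) {
  data.frame(book = book, rows_skipped = startRow - 1)
}

sales <- pmap_dfr(sets, readStub)
nrow(sales)   # one result block per metadata row, bound into one data frame
```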

And you're done.  Thanks to Hadley Wickham and Lionel Henry!

Thursday 17 January 2019

Sustained high speeds along the Warminster Rd, Bath

In the last few posts, I was analysing number plate reader (ANPR) data from B&NES Council and BathHacked, a data activist organisation, looking at how traffic patterns might change if a clean air zone were introduced.

I've gone off on a tangent, analysing data from just two ANPRs, a mile apart on the A36 Warminster Rd, on the outskirts of Bath. They give me average speeds for some 65,000 transits over two weeks.

This is a residential 30mph zone, with a straight section, blind bend and narrow hill, though perhaps you might not think that, judging from this box plot (a smaller sample of 3 days). Each dot is a vehicle along the road. The lowest speeds are where they stop or detour (I'll put a lower limit of 5mph on to eliminate these). November 8 & 9 saw a little queueing heading into town in the morning, but these were the only cases where the speeds are not clustered around 30mph. (That's "around" - little sign of a limit in these data.)

From the data, we can split vehicles into cars, light commercial (LCVs), heavy commercial (HCVs - this is a trunk route with large lorries), public-service vehicles (PSVs) and a rare few others that we'll ignore. The scary thing is that there's no sign that trucks are travelling any more slowly than cars. Only public-service vehicles are noticeably slower, given the 3 bus stops on this section. (All days of data are included here.)

In this large survey, at peak hours there were typically 40 vehicles per hour sustaining 33mph or more over the 1 mile route. The top sustained speed was over 60mph, and one van driver managed to get 3 times into the top 20, each time doing more than 50mph.
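The speed calculation itself is simple: two timestamped sightings a known mile apart (the column names and timestamps here are made up for illustration, not the real dataset's):

```r
# Average speed over the fixed 1-mile section between the two ANPRs
transit_speed <- function(t_in, t_out, distance_miles = 1) {
  hours <- as.numeric(difftime(t_out, t_in, units = "hours"))
  distance_miles / hours
}

t1 <- as.POSIXct("2018-11-08 08:00:00")
t2 <- as.POSIXct("2018-11-08 08:01:30")   # one mile in 90 seconds
transit_speed(t1, t2)                      # 40 mph

# Transits below 5 mph (stops or detours) would then be filtered out
```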


Definitely some room for improving safety here.


Code is available on github, data from BathHacked.org.



Tuesday 4 December 2018

How much traffic might switch to the Toll Bridge to avoid the Bath CAZ?

A high estimate, based on 2017 traffic counts of A36-A46 transits, is that up to 30 vehicles/hour on this route could avoid the Bath Clean Air Zone (CAZ) and join the evening peak traffic (currently 100/hour) on the Toll Bridge. This is a 'high estimate' because it (a) ignores 'retirements' of old cars over the next 3 years and (b) assumes that every vehicle which currently goes through Bath, is liable for the CAZ, and could switch to the Toll Bridge would definitely switch to avoid the CAZ.

In previous posts, I've explored the excellent Bath traffic data provided by BathHacked, looking at transits of Bath (which split into two groups, North-East and South-West), and transits using the entry/exit points on the by-pass at Swainswick and the A36 at Bathampton (Dry Arch).

Now, I finally zoom in to get a first answer to the question of how much more traffic might there be over the Toll Bridge when the CAZ starts. I'm using (just) 2 days of data 31 Oct 2017 and 1 Nov 2017.

From the analysis so far, the main flow likely to generate additional traffic over the Toll Bridge is A36-A46 transiting traffic. I've not shown it so far in the blogs, but I found little evidence in the ANPR data of significant traffic on local flows with a choice between BathwickRoad-LondonRoad and WarminsterRd-TollBridge. There must be Batheaston-Combe Down traffic, for example, but not large compared to the 1,000 per day each way on the A36-A46.

First, here are the A36-A46 flows. In these diagrams traffic, like time, is going from left to right: start at Swainswick heading South, and leave at Bathampton heading East; or vice versa Bathampton West then Swainswick North. The height represents the number of vehicles: the thin 'Swainswick_S' vertical green bar represents around 2,600 (south-bound) transits.

Under half of the vehicles on these transits are detected by the ANPRs on London Road, or Warminster Road. While a few vehicles' number plates might not have been read at either intermediate point, it seems a reasonable assumption that most of the remainder (the light blue) simply took the 'direct' route avoiding the intermediate ANPRs, over the Toll Bridge. 
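The assignment logic amounts to this (toy data; the real analysis joins the full ANPR sighting tables by number plate):

```r
transits <- data.frame(
  plate              = c("AA1", "BB2", "CC3", "DD4"),
  seen_london_rd     = c(TRUE,  FALSE, FALSE, TRUE),
  seen_warminster_rd = c(FALSE, TRUE,  FALSE, FALSE)
)
# Seen at an intermediate ANPR => went through Bath;
# never seen in between => assumed to have used the Toll Bridge
transits$route <- ifelse(transits$seen_london_rd | transits$seen_warminster_rd,
                         "through Bath", "Toll Bridge (assumed)")
table(transits$route)
```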



The key question is: how many of the vehicles that currently go through Bath rather than take the Toll Bridge might switch when the CAZ comes in?

To answer this, I first plot the traffic by hour of the day. We've seen that slightly more than half already take the bridge. The bar chart below shows this again, and shows in addition that the bridge traffic is fairly steady through the day (remember this is A36-A46 transits, so won't include school runs), while through Bath there's a more pronounced mid-day peak: less transit traffic in the morning and evening rush, for obvious reasons!


The chart shows 2,606 vehicles over the Toll Bridge and 1,983 going through Bath in total. The peak hours on these two days for transits using the bridge are in the evening, at around 100 vehicles per hour (total in the two directions). 

Additionally, the chart shows the split between petrol, diesel and hybrid or pure electric - though this difference isn't essential here.




Then I split the vehicles into:

  • heavy-commercials (HCVs), which nearly all go through Bath, because they have no choice;
  • those with high or unknown emissions and therefore likely to be subject to the CAZ charge;
  • those with low emissions and therefore definitely exempt.
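The three-way split can be sketched with a case_when (the field names and emission labels here are my stand-ins, not the dataset's):

```r
library(dplyr)

vehicles <- data.frame(
  type     = c("HCV", "car", "car", "LCV"),
  emission = c("Euro V diesel", "unknown", "Euro 6 petrol", "Euro 4 diesel")
)
# HCVs first (no route choice), then the likely-liable set, then exempt
vehicles <- vehicles %>%
  mutate(group = case_when(
    type == "HCV"                               ~ "HCV (no choice)",
    emission %in% c("unknown", "Euro 4 diesel") ~ "likely liable",
    TRUE                                        ~ "exempt"
  ))
table(vehicles$group)
```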




If the CAZ came in today, the vehicles that might change their behaviour are the middle set on the right: those that currently go through Bath and are likely to be liable to the charge. The proportion of these changing is unlikely to be 100%, because some might choose to go through Bath for other reasons (dropping someone off, say), or it might just be that we've classified the car as 'unknown' when they would not be liable, so an estimate here is an upper bound. Let's assume it's 100% for now.

The graph shows that in the evening peak there are up to 30 cars per hour that could be tempted to make the switch, if the CAZ happened today. There are slightly more than 30 in the middle of the day, but the bridge is slightly quieter then.

But these are 2017 data. The key is assumptions about 'retirement' of the vehicles between now and the start of the CAZ at the end of 2020. Unfortunately, for the moment I haven't found good data to estimate that retirement rate, though my recollection is that the BreAthe consultation suggested retirement rates of 70-80%. That would bring us to the magic number of 'around 10/hour', but I'm not currently in a position to explore that step further.




Saturday 1 December 2018

More evidence that most cars and light commercials transiting Bath (Bathampton-Swainswick) already use the toll bridge

In an earlier post, I estimated that nearly half of vehicles transiting Bath between the A36 Warminster Road and the By-pass at Swainswick already used the toll bridge.

In this post, I provide more supporting evidence for this: showing that the split by vehicle types makes sense.

Nearly all heavy commercial vehicles (HCVs) go into town (via Warminster Road and London Road, or vice versa) and then back out again. 

Recall that in these plots, a transit of Bath starts on the left and ends on the right. There's just the thinnest of light blue lines at the top of the diagram for HCVs which are not recorded by ANPRs between Swainswick (south-bound) and Bathampton (Dry Arch). 



The majority of cars and light commercial vehicles that are heading South (see the "Swainswick_S" node on the left) are next recorded at "Bathampton_E" (Dry Arch, leaving Bath). These, we are assuming, use the toll bridge, although a small percentage might have been missed by the 2 ANPRs on the way (previously we estimated this 'missed' rate at less than 5%).

Heading North, the proportions are slightly smaller, between 50% and 60%.