Exploring statistics: tidyverse

Sunday, 4 December 2022

For "tidy-select" read "column", for "data-masking" read "row"

The moment arrived for an update to himach. I thought I was solving a few issues created by newer versions of packages that himach depends on (probably due to how I'd used them in the first place). Instead, I was faced with a host of:

use of .data$ in tidyselect expressions is deprecated

warnings.

It seems that I've been using the tidyverse wrong since I first made the 2speed project that became himach, 3 years ago. I'd understood that, in a package, you need to be specific and use, say, select(.data$x) or mutate(y = .data$x + 1) so that the code can be sure you're referring to the variable x from the dataframe and not another variable. This was one of the tricky steps to get used to when moving from writing 'normal' open code, to writing a package.

But it was more subtle than that, or at least it is now: sometimes that's true and sometimes it isn't. The tidyverse of course improves and develops, and maybe the distinction wasn't clear to the authors at the time either. So updating my code wasn't just a global search and replace; it depends on which function is being used.

There are two types of reference to variables. This seems to be a key reference.

<tidy-select> functions. In my code these were unnest, select, rename, across, pull. For these I did indeed have to replace .data$x with "x". The warning message gives the right guidance. I had one case of .data[[var]]. For that the blog says use all_of(var). Instead I used {{ var }}, because this feels right for passing a variable name as a function parameter.
<data-masking> functions. In my code these were mutate, group_by, filter, arrange, summarise and also (because they're within a mutate?) case_when and if_else. These cases I had to leave as .data$x.
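
As a rough sketch of the two cases, assuming a toy data frame with columns x and z (the data and the function names keep_x, keep_named, keep_embraced and add_y are made up for illustration, not taken from himach):

```r
library(dplyr)

df <- tibble(x = 1:3, z = c("a", "b", "c"))

# Tidy-select: refer to the column by name (a string), not via .data$x
keep_x <- function(data) {
  select(data, "x")
}

# Tidy-select with the column name held in a character variable
keep_named <- function(data, var) {
  select(data, all_of(var))
}

# Tidy-select with the column supplied by the caller; {{ }} forwards it
keep_embraced <- function(data, var) {
  select(data, {{ var }})
}

# Data-masking: .data$x is still the way to refer to the column
add_y <- function(data) {
  mutate(data, y = .data$x + 1)
}
```

With these definitions, keep_named(df, "z") and keep_embraced(df, z) should both return just the z column, while add_y(df) adds a y column computed row by row from x.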

The logic wasn't entirely clear to me. Tidy selection is sort of manipulating columns of the dataframe, while with data-masking you're more interested in manipulating the contents. If that's the case, why does select use tidy selection, but group_by doesn't? Is it that group_by implicitly uses the contents? That has to be it.
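
One way to see that group_by works on contents: it accepts an expression computed from the row values, which select has no use for. A small sketch, with a made-up data frame and an arbitrary threshold of 7:

```r
library(dplyr)

df <- tibble(x = c(1, 5, 10, 20, 30), z = c("a", "b", "c", "d", "e"))

# group_by is data-masking: the grouping value is computed from each row
by_size <- df %>%
  group_by(big = .data$x > 7) %>%
  summarise(n = n())

# select is tidy-select: it picks columns, whatever the rows contain
only_x <- select(df, "x")
```

Here by_size has one row per value of big, so the groups depend entirely on what is in x, whereas only_x would have the same shape for any values of x.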

In the end, I think it's easier to think of:

<column> functions, not <tidy-select>. These all manipulate the dataframe columns without caring about the rows. (Though tidyverse, I think, would like you to think of these as another sort of variable, not columns of a dataframe.)
<row> functions, not <data-masking>. These functions do different things depending on the values in each row.

Saturday, 9 March 2019

purrr magic!

Used purrr::pmap for the first time and it's brilliant!

Run the same function multiple times changing a bunch of parameters, and combine all the results in a single data frame. All in a single call.

Keeping code and data separate is, of course, good practice. But it can be easy to slip into mixing your code up with metadata (in this case data on where to find data and what it is). pmap makes it really easy to keep metadata separate from the code, too.

In my case, I'm pulling sets of data out of spreadsheets, where the blocks start on different rows in each sheet (don't ask!). So I just create a csv file with the parameter names on the first line, and one line of parameter values for each call after that.

sheet,startRow,book,line
1, 53,What Might,Classics
2, 70,Runagates,Classics
3, 43,Desire,Classics
4, 45,Vocations,Classics

etc.

Reading this metadata into a data frame, and then running my 'readSales' loading function on it, takes just two lines of code:

classicSets <- read.csv(file = "data/classicSets.csv", stringsAsFactors = FALSE)

classicSales <- purrr::pmap_dfr(classicSets, readSales)

'readSales' is just a wrapper around the excellent readxl::read_xls that cleans up my data and adds some identifying columns (book, line).
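
To show how pmap matches things up without needing the spreadsheets, here is a minimal sketch in which the metadata is typed in directly and a hypothetical stand-in, fakeReadSales, replaces the real readSales. It reads no files and only echoes its arguments; the point is that each column name in the metadata matches an argument name of the function. It assumes dplyr is installed, since pmap_dfr uses it to bind the rows.

```r
library(purrr)

# Metadata: one row per call, column names match the function's arguments
classicSets <- data.frame(
  sheet = c(1, 2),
  startRow = c(53, 70),
  book = c("What Might", "Runagates"),
  line = c("Classics", "Classics"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for readSales: no spreadsheet is read here,
# it just returns a one-row data frame built from its arguments
fakeReadSales <- function(sheet, startRow, book, line) {
  data.frame(
    book = book,
    line = line,
    source = paste0("sheet ", sheet, ", from row ", startRow),
    stringsAsFactors = FALSE
  )
}

# pmap_dfr calls the function once per metadata row and row-binds the results
classicSales <- pmap_dfr(classicSets, fakeReadSales)
```

Adding another block of data is then a matter of adding a line to the metadata, with no change to the code.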

And you're done. Thanks to Hadley Wickham and Lionel Henry!
