Sunday 4 December 2022

For "tidy-select" read "column", for "data-masking" read "row"

The moment arrived for an update to himach. I thought I was solving a few issues created by newer versions of packages that himach depends on (probably due to how I'd used them in the first place). Instead, I was faced with a host of:

use of .data$ in tidyselect expressions is deprecated

warnings.

It seems that I've been using the tidyverse wrong since I first made the 2speed project that became himach, 3 years ago. I'd understood that, in a package, you need to be specific and use, say, select(.data$x) or mutate(y = .data$x + 1) so that the code can be sure you're referring to the variable x from the dataframe and not another variable. This was one of the tricky steps to get used to when moving from writing 'normal' open code, to writing a package.
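As a minimal sketch of that habit, this is roughly what such package code looked like. The data frame df and the helper old_style() are made up for illustration and are not taken from himach; the warning assumes a tidyselect version recent enough to deprecate this usage.

```r
library(dplyr)

# Toy data, purely illustrative
df <- tibble(x = 1:3, z = c("a", "b", "c"))

# Hypothetical package-style helper using .data$ everywhere
old_style <- function(df) {
  df %>%
    select(.data$x) %>%       # tidy-select argument: this is the call that now warns
    mutate(y = .data$x + 1)   # data-masking argument: .data$ is still fine here
}

old_style(df)
```

The code still runs; it is only the select() line that draws the deprecation warning.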


But it is more subtle than that, or at least it is now: sometimes that's true and sometimes not. The tidyverse of course improves and develops, and maybe the distinction wasn't clear to its authors at the time either. So updating my code wasn't just a global search and replace; it depends on which function is being used.

There are two types of reference to variables. This seems to be a key reference.

  1. <tidy-select> functions. In my code these were unnest, select, rename, across, pull. For these I did indeed have to replace .data$x with "x", as the warning message advises. I had one case of .data[[var]]. For that the blog says to use all_of(var). Instead I used {{ var }}, because this feels right for passing a variable name as a function parameter.
  2. <data-masking> functions. In my code these were mutate, group_by, filter, arrange, summarise and also (because they're called within a mutate?) case_when and if_else. These cases I had to leave as .data$x.
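The two cases can be sketched side by side. Everything here (df and the three helper functions) is hypothetical and only there to show the shape of the code after the update. One assumption worth flagging: {{ var }} suits a caller who passes an unquoted column name, while all_of(var) suits a caller who passes the name as a string.

```r
library(dplyr)

# Toy data, purely illustrative
df <- tibble(x = c(3, 1, 2), g = c("a", "b", "a"))

# Hypothetical helper mixing both kinds of function
summarise_big <- function(df) {
  df %>%
    select("x", "g") %>%                  # tidy-select: quoted name replaces .data$x
    rename(value = "x") %>%               # tidy-select: new name = "old name"
    filter(.data$value > 1) %>%           # data-masking: .data$ stays
    group_by(.data$g) %>%                 # data-masking: .data$ stays
    summarise(total = sum(.data$value), .groups = "drop")
}

# Column passed as an unquoted name: embrace it with {{ }}
pull_by_name <- function(df, var) {
  pull(df, {{ var }})
}

# Column passed as a string: wrap it in all_of()
select_by_string <- function(df, var) {
  select(df, all_of(var))
}

result <- summarise_big(df)
xs <- pull_by_name(df, x)
gs <- select_by_string(df, "g")
```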

The logic wasn't entirely clear to me. Tidy selection is sort of manipulating columns of the dataframe, while with data-masking you're more interested in manipulating the contents. If that's the case, why does select use tidy selection, but group_by doesn't? Is it that group_by implicitly uses the contents? That has to be it.

In the end, I think it's easier to think of:
  1. <column> functions, not <tidy-select>. These all manipulate the dataframe columns without caring about the rows. (Though tidyverse, I think, would like you to think of these as another sort of variable, not columns of a dataframe.)
  2. <row> functions, not <data-masking>. Since these functions do different things depending on the values in each row.
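One way to test this reading, as a rough sketch with made-up data: a <column> function gives the same answer on a data frame with its rows stripped out, while a <row> function cannot, because its answer depends on the values.

```r
library(dplyr)

# Toy data, purely illustrative
df <- tibble(x = c(1, 5, 10), label = c("a", "b", "c"))
empty <- df[0, ]   # same columns, no rows

# <column> function: the result is decided by column names alone
cols_full <- names(select(df, "label"))
cols_empty <- names(select(empty, "label"))

# <row> function: the result depends on the values in each row
rows_full <- nrow(filter(df, .data$x > 3))
rows_empty <- nrow(filter(empty, .data$x > 3))
```

Here cols_full and cols_empty are the same, while rows_full and rows_empty differ.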



I've also now got rename statements with mixed inverted commas: rename(y = "x"). That feels odd, but being lazy, I haven't added inverted commas to the left-hand side too (though that would work). Feels a bit like a backward step - I liked the minimal punctuation style of the earlier syntax (rename(y = x)), even if I always have to think twice to get the order right. (memo to self: order is like a function y = f(x), not before-then-after)
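A quick check of both spellings, on a made-up one-column data frame:

```r
library(dplyr)

# Toy data, purely illustrative
df <- tibble(x = 1:3)

# new name on the left, old name (quoted) on the right
mixed <- rename(df, y = "x")

# quoting the left-hand side as well also works
both_quoted <- rename(df, "y" = "x")

names(mixed)
names(both_quoted)
```

In both cases the column x ends up called y, which matches the y = f(x) way of remembering the order.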






