Closing in on understanding R closures
A common and useful pattern I use in R programming is currying to create closures for later computation. This pattern is the main abstraction of my geex package, for example. At NoviSci we use closures all the time in data pipelines. I think of a closure as a function f that returns another function g, where the returned g function (hopefully) has the data necessary to do its computation:
f <- function(oargs){
  odata <- do_stuff_with(oargs)
  function(iargs){ do_stuff_with(iargs, odata) }
}
g <- f(...)
# use g later on
This blog post addresses, in part, the “hopefully” in the last sentence. There are environment scoping details to consider when creating a closure. For one, when calling g(...) the necessary data needs to be in scope. However, depending on the environments enclosed with g, the closure can also carry more data than is necessary. This can lead to memory bloat, which can have dramatic performance costs.
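To make the "more data than necessary" point concrete, here is a minimal sketch in base R. The names make_mean_getter and big are hypothetical and exist only for this illustration; the idea is simply to look inside the environment that travels with the closure.

```r
# Hypothetical factory: only `avg` is needed later, but `x` is captured too.
make_mean_getter <- function(x) {
  avg <- mean(x)
  function() avg
}

big <- rnorm(1e5)
g <- make_mean_getter(big)

# The closure's enclosing environment lists every object it carries along.
captured <- ls(environment(g))
captured

# The full input vector is still reachable through the closure.
length(environment(g)$x)
```

Inspecting ls(environment(g)) is a cheap first check whenever a closure seems larger than it should be.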
I’ve been programming and developing in R for over ten years now, and I still find myself tripping up over R environments and scoping rules, especially when it comes to closures. Below I show a few ways to handle closure creation and look at their performance in terms of the following:
1. size of closure after creation
2. size of closure after being called for the first time
3. size of closure when written to disk as an RDS
4. items 1-3 after adding more objects to the Global environment
5. items 1 and 3 after tidying the Global environment, as well as whether the closures still work
My toy case is a function that takes a model object and simply returns its formula. Toying with a formula adds complexity, as formula objects include an environment. From the docs:
A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument. Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.
However, in many cases we don’t care about the attached environment; we care about the symbolic formula expression. The examples below demonstrate that we often need to think carefully when creating a closure, especially when we want to save the closure to disk so the file can be shared and reused in different sessions.
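As a small illustration of the quoted documentation, the sketch below creates a formula inside a function and then inspects the environment attached to it. The helper make_formula and the variable scratch are hypothetical names used only here.

```r
# Hypothetical helper: the formula is created with `~` inside the function,
# so its environment is the function's execution environment.
make_formula <- function() {
  scratch <- rnorm(1e4)  # unrelated data that happens to be in scope
  y ~ x
}

fm <- make_formula()
captured <- ls(environment(fm))
captured  # the unrelated object rides along with the formula

# If only the symbolic expression matters, the formula can be pointed at
# another environment instead.
environment(fm) <- globalenv()
```

Reassigning the environment is only safe when nothing downstream (model.frame, for example) needs variables from the original one.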
library(purrr)
# Create a couple of model objects to have in GlobalEnv
m1 <- glm(Sepal.Width ~ 1, data = iris)
m2 <- glm(Sepal.Length ~ 1, data = iris)
The first option, f0, implements perhaps the most straightforward approach: do all the work in the “inner” function (call it g0). As we’ll see, this is unsatisfactory. R evaluates function arguments lazily, so model is not looked up until g0 is first called; if the model passed to f0 has been removed by then, g0 will no longer work.
f0 <- function(model){
  function(){
    formula(model)
  }
}
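The failure mode of f0 can be reproduced in a few lines. This is a sketch that redefines f0 so the block is self-contained; the variable m and the tryCatch wrapper are additions for the illustration.

```r
f0 <- function(model) {
  function() formula(model)
}

m <- glm(Sepal.Width ~ 1, data = iris)
g0 <- f0(m)  # `model` is still an unevaluated promise that refers to `m`
rm(m)

# The promise is forced only now, after `m` is gone, so the lookup fails.
res <- tryCatch(g0(), error = function(e) conditionMessage(e))
res
```

Adding force(model) as the first line of f0 would avoid the error, at the cost of keeping the whole model inside the closure.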
Another approach moves the extraction of the formula from the model into the enclosing environment. This works, but as we’ll see the returned closure (call it g1) will carry along model in its enclosing environment, which for my purposes is unnecessary.
f1 <- function(model){
  fm <- formula(model)
  function(){
    fm
  }
}
The f2 option removes model from the enclosing environment when f2 exits.
f2 <- function(model){
  fm <- formula(model)
  on.exit({ rm(model) })
  function(){
    fm
  }
}
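A quick way to see the difference between f1 and f2 is to list what each returned closure keeps in its enclosing environment. The block below redefines both functions compactly so it runs on its own; the variable names env1 and env2 are mine.

```r
f1 <- function(model) {
  fm <- formula(model)
  function() fm
}

f2 <- function(model) {
  fm <- formula(model)
  on.exit(rm(model))  # drop the model once the formula is extracted
  function() fm
}

m <- glm(Sepal.Width ~ 1, data = iris)

env1 <- ls(environment(f1(m)))  # closure from f1 keeps the model
env2 <- ls(environment(f2(m)))  # closure from f2 keeps only the formula
env1
env2
```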
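The f3 and f4 options are just me seeing how the update function does or doesn’t affect environments. Note that a formula returned by update takes the environment of its first argument, which here is the . ~ . created inside f3 and f4: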
f3 <- function(model){
  fm <- update( . ~ ., formula(model))
  function(){
    fm
  }
}
f4 <- function(model){
  fm <- update( . ~ ., formula(model))
  on.exit({ rm(model) })
  function(){
    fm
  }
}
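The effect of update on environments can be checked directly. In this sketch, make_template and scratch are hypothetical stand-ins for f3 and its model argument; my understanding is that update.formula gives its result the environment of its first argument, and the comparison below tests exactly that.

```r
# Hypothetical stand-in for f3: the `. ~ .` template is created inside a
# function whose execution environment also holds other data.
make_template <- function() {
  scratch <- rnorm(1e4)  # plays the role of `model`
  . ~ .
}

tmpl <- make_template()
target <- Sepal.Width ~ 1  # created where this script runs

res <- update(tmpl, target)

# Compare the environment of the result with that of the template.
same_env <- identical(environment(res), environment(tmpl))
same_env
ls(environment(res))
```

If the environments match, the updated formula drags along everything in the function that created the template, which is consistent with the large result size reported for f3.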
I consider f6 and f7 poor practice: casting the formula to character and then recasting it. The f5 function I consider OK: it pulls out what I need (the expression) and throws away the rest.
f5 <- function(model){
  fm <- rlang::expr(!! formula(model))
  on.exit({ rm(model) })
  function(){
    as.formula(fm)
  }
}
f6 <- function(model){
  fm <- rlang::expr_text(formula(model))
  function(){
    as.formula(fm)
  }
}
f7 <- function(model){
  fm <- rlang::expr_text(formula(model))
  on.exit({ rm(model) })
  function(){
    as.formula(fm)
  }
}
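One reason the character round trip can surprise is that as.formula, when given text, attaches the environment it was called from unless env is supplied. The sketch below uses a hypothetical helper, from_text, to show both behaviours in base R.

```r
# Hypothetical helper: convert text to a formula, optionally naming the
# environment the formula should carry.
from_text <- function(txt, env = NULL) {
  scratch <- rnorm(1e4)  # unrelated data in the calling frame
  if (is.null(env)) as.formula(txt) else as.formula(txt, env = env)
}

fm_default <- from_text("y ~ x")
fm_global  <- from_text("y ~ x", env = globalenv())

ls(environment(fm_default))                  # includes the helper's locals
identical(environment(fm_global), globalenv())
```

Passing env explicitly keeps the recast formula from holding on to whatever frame happened to do the recasting.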
I got the idea for f8, wherein the environment for g is explicitly set, from this Stack Overflow post.
f8 <- function(model){
  fm <- formula(model)
  f <- function(){
    fm
  }
  environment(f) <- list2env(list(fm = fm), parent = globalenv())
  f
}
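To confirm that f8 keeps only what it was told to keep, one can inspect the closure it returns. The block below repeats f8 so it is self-contained; m and g8 are names chosen for the example.

```r
f8 <- function(model) {
  fm <- formula(model)
  f <- function() fm
  # Replace the enclosing environment with one holding only `fm`.
  environment(f) <- list2env(list(fm = fm), parent = globalenv())
  f
}

m <- glm(Sepal.Width ~ 1, data = iris)
g8 <- f8(m)
rm(m)

ls(environment(g8))                                  # contents of the new environment
identical(parent.env(environment(g8)), globalenv())  # its parent
g8()                                                 # still works without `m`
```

Setting parent = globalenv() means the closure resolves any other names through the usual search path rather than through the execution environment of f8.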
fs <- list(f0 = f0, f1 = f1, f2 = f2, f3 = f3, f4 = f4,
f5 = f5, f6 = f6, f7 = f7, f8 = f8)
Let’s look at memory characteristics of each function and resulting closure:
result1 <-
  tibble::tibble(
    f = names(fs),
    # Size of function
    fsize = map_dbl(fs, ~ pryr::object_size(.x)),
    # Size of closure
    csize = map_dbl(fs, ~ pryr::object_size(.x(m1))),
    # what happens if called again?
    csize2 = map_dbl(fs, ~ pryr::object_size(.x(m1))),
    # Size of result information
    rsize = map_dbl(fs, ~ pryr::object_size(.x(m1)())),
    # Cost of information
    loss = rsize/csize,
    # Size on disk
    dsize = map_dbl(
      .x = fs,
      .f = ~ {
        tmp <- tempfile()
        saveRDS(.x(m1), file = tmp)
        file.size(tmp)
      }
    )
  )
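For a quick check without extra packages or temporary files, the length of the serialized byte vector gives a rough proxy for how much a closure carries. This is a sketch, not the measurement used in this post: serialized_size is a hypothetical helper, and its numbers will not match pryr::object_size or the compressed size of an RDS file.

```r
# Hypothetical helper: number of bytes produced by serializing an object
# in memory (uncompressed, so larger than the RDS file on disk).
serialized_size <- function(x) length(serialize(x, connection = NULL))

f1 <- function(model) {
  fm <- formula(model)
  function() fm
}

f2 <- function(model) {
  fm <- formula(model)
  on.exit(rm(model))
  function() fm
}

m <- glm(Sepal.Width ~ 1, data = iris)

size_g1 <- serialized_size(f1(m))  # closure that still holds the model
size_g2 <- serialized_size(f2(m))  # closure with the model removed
c(g1 = size_g1, g2 = size_g2)
```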
And what happens after adding more objects to the Global environment:
m3 <- glm(mpg ~ 1, data = mtcars)
m4 <- glm(weight ~ 1, data = ChickWeight)
result2 <-
  tibble::tibble(
    f = names(fs),
    # Size of function
    fsize_2 = map_dbl(fs, ~ pryr::object_size(.x)),
    # Size of storage
    csize_2 = map_dbl(fs, ~ pryr::object_size(.x(m1))),
    # size increases when called again!
    csize2_2 = map_dbl(fs, ~ pryr::object_size(.x(m1))),
    # Size of result information
    rsize_2 = map_dbl(fs, ~ pryr::object_size(.x(m1)())),
    # Cost of information
    loss_ = rsize_2/csize_2,
    # Size on disk
    dsize_2 = map_dbl(
      .x = fs,
      .f = ~ {
        tmp <- tempfile()
        saveRDS(.x(m1), file = tmp)
        file.size(tmp)
      }
    )
  )
Now what happens if we remove the model objects and then try to evaluate the closure that each f created from m1:
gs <- map(
  .x = fs,
  .f = ~ .x(m1)
)
rm(m1, m2, m3, m4)
result3 <-
  tibble::tibble(
    f = names(gs),
    # Does it even work?
    error = map_lgl(gs, ~ is.null(safely(.x)()$result)),
    # Size of storage
    csize_3 = map_dbl(gs, ~ pryr::object_size(.x)),
    # Size on disk
    dsize_3 = map_dbl(
      .x = gs,
      .f = ~ {
        tmp <- tempfile()
        saveRDS(.x, file = tmp)
        file.size(tmp)
      }
    )
  )
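The scenario that motivates this check is saving a closure and reading it back after the original objects are gone. A minimal round trip, assuming the f2 pattern and using names of my own (path, g_restored), might look like this:

```r
f2 <- function(model) {
  fm <- formula(model)
  on.exit(rm(model))
  function() fm
}

m <- glm(Sepal.Width ~ 1, data = iris)
g <- f2(m)

path <- tempfile(fileext = ".rds")
saveRDS(g, file = path)

# Mimic a later session: the model and the original closure no longer exist.
rm(m, g)

g_restored <- readRDS(path)
g_restored()

unlink(path)  # remove the temporary file
```

This only mimics a fresh session; packages attached in the original session would still need to be available wherever the file is read.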
Now let’s look at the results (all sizes are in bytes):
f | fsize | csize | csize2 | rsize | loss | dsize | fsize_2 | csize_2 | csize2_2 | rsize_2 | loss_ | dsize_2 | error | csize_3 | dsize_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f0 | 9576 | 8128 | 9408 | 672 | 0.08 | 1933 | 11408 | 9408 | 9408 | 672 | 0.07 | 1909 | TRUE | 12976 | 1262 |
f1 | 10576 | 57944 | 58592 | 672 | 0.01 | 5390 | 12600 | 58592 | 58592 | 672 | 0.01 | 5390 | FALSE | 59096 | 5411 |
f2 | 13232 | 9680 | 10408 | 672 | 0.07 | 1230 | 15408 | 10408 | 10408 | 672 | 0.06 | 1230 | FALSE | 10408 | 1230 |
f3 | 15248 | 62168 | 62816 | 50368 | 0.81 | 6071 | 18264 | 62816 | 62816 | 50368 | 0.80 | 6071 | FALSE | 63320 | 6090 |
f4 | 16976 | 12920 | 13648 | 840 | 0.07 | 1753 | 20160 | 13648 | 13648 | 840 | 0.06 | 1753 | FALSE | 13648 | 1753 |
f5 | 20832 | 16552 | 17832 | 672 | 0.04 | 2398 | 24384 | 17832 | 17832 | 672 | 0.04 | 2398 | FALSE | 17832 | 2398 |
f6 | 18824 | 65528 | 66728 | 50544 | 0.77 | 6739 | 22208 | 66728 | 66728 | 50544 | 0.76 | 6739 | FALSE | 67232 | 6756 |
f7 | 20552 | 16112 | 17392 | 1016 | 0.06 | 2372 | 23992 | 17392 | 17392 | 1016 | 0.06 | 2372 | FALSE | 17392 | 2372 |
f8 | 18384 | 14816 | 14816 | 672 | 0.05 | 2008 | 22656 | 14816 | 14816 | 672 | 0.05 | 2008 | FALSE | 14816 | 2008 |
Comments
- f0 is out as a candidate since it is not guaranteed to work when retrieved from disk.
- My preference is f2 or f8, which are kind of the converse of each other. f2 states what information to throw out; f8 states what information to keep.
- The size of every closure except g8 increases after calling it for the first time. I’ve seen the opposite too. At work, I had a closure that reduced over 10x in size just by calling it once. I spent hours trying to figure out why.
- The cleanup in f2 could be made optional. Another option is cleaning up the environment by name.
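Both ideas can be sketched as follows. The function names f2_opt and f2_keep and the arguments cleanup and keep are hypothetical, introduced only for this illustration, and are not the code from the original post.

```r
# Assumption: an optional cleanup flag; the default removes the model.
f2_opt <- function(model, cleanup = TRUE) {
  fm <- formula(model)
  if (cleanup) on.exit(rm(model))
  function() fm
}

# Assumption: clean up by name, removing everything except the names in `keep`.
f2_keep <- function(model, keep = "fm") {
  fm <- formula(model)
  on.exit(rm(list = setdiff(ls(), keep)))
  function() fm
}

m <- glm(Sepal.Width ~ 1, data = iris)

kept_opt_on  <- ls(environment(f2_opt(m)))
kept_opt_off <- ls(environment(f2_opt(m, cleanup = FALSE)))
kept_by_name <- ls(environment(f2_keep(m)))

kept_opt_on
kept_opt_off
kept_by_name
```

Cleaning up by name is closer in spirit to f8: it states what to keep instead of what to discard, so objects added to the function later are dropped by default.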