Closing in on understanding R closures

2021-02-05 7-minute read

A common and useful pattern I use in R programming is currying to create closures for later computation. This pattern is the main abstraction of my geex package, for example. At NoviSci we use closures all the time in data pipelines. I think of a closure as a function f that returns another function g, where the returned g function (hopefully) has the data necessary to do its computation:

f <- function(oargs){
   odata <- do_stuff_with(oargs)
   function(iargs){ do_stuff_with(iargs, odata) }
   
}

g <- f(...)
# use g later on

This blog post addresses, in part, the “hopefully” in the last sentence. There are environment scoping details to consider to when creating a closure. For one, when calling g(...) the necessary data needs to be in scope. However, depending on the environments enclosed with g, the closure can also have more data than is necessary. This can lead to unnecessary memory bloat, which can have dramatic performance costs.

I’ve been programming and developing in R for over ten years now, and I still find myself tripping up over R environments and scoping rules, especially when it comes to closures. Below I show a few ways to handle closure creation and look at their performance in terms of the following:

size of closure after creation
size of closure after being called for the first time
size of closure when written to disk as an RDS

1-3 after adding more objects to the Global environment
1 & 3 after tidying the Global environment as well as whether the closures still work

My toy case is a function that takes a model object and simply returns its formula. Toying with a formula adds complexity as formula objects include a environment. From the docs:

A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument. Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.

However, in many cases, we don’t care about the attached environment, we care about the symbolic formula expression. The examples below demonstrate that we often need to think carefully when a closure especially when we want to save a closure to disk for sharing the file and reusing in different sessions.

library(purrr)

# Create a couple of model objects to have in GlobalEnv
m1 <- glm(Sepal.Width ~ 1, data = iris)
m2 <- glm(Sepal.Length ~ 1, data = iris)

The first option f0 implements perhaps the most straightforward approach: do all the work in the “inner” (call it g0) function. As we’ll see, this is unsatisfactory, as if the model passed to f0 is removed the g0 function will no longer work.

f0 <- function(model){
  function(){
    formula(model)
  }
}

Another approach moves the extraction of the formula from the model into the enclosing environment. This works, but as we’ll see g0 will carry along model in its environments, which for my purposes is unnecessary.

f1 <- function(model){
  fm <- formula(model)
  function(){
    fm
  }
}

The f2 option removes model when f2 exits.

f2 <- function(model){
  fm <- formula(model)
  on.exit({ rm(model)} )
  function(){
    fm
  }
}

The f3 and f4 are me just seeing how the update function does or doesn’t affect environments:

f3 <- function(model){
  fm <- update( . ~ ., formula(model))
  function(){
    fm
  }
}

f4 <- function(model){
  fm <- update( . ~ ., formula(model))
  on.exit({ rm(model)})
  function(){
    fm
  }
}

I consider f6 and f7 poor practice: casting the formula to character then recasting. The f5 function I consider OK: pulling out what I need (the expression) and throwing away the rest.

f5 <- function(model){
  fm <- rlang::expr(!! formula(model))
  on.exit({ rm(model)})
  function(){
    as.formula(fm)
  }
}

f6 <- function(model){
  fm <- rlang::expr_text(formula(model))
  function(){
    as.formula(fm)
  }
}

f7 <- function(model){
  fm <- rlang::expr_text(formula(model))
  on.exit({ rm(model)})
  function(){
    as.formula(fm)
  }
}

I got the idea for f8, wherein the environment for g is explicitly set, from this stackoverflow post.

f8 <- function(model){
  fm <- formula(model)
  f <- function(){
    fm
  }
  
  environment(f) <- list2env(list(fm = fm), parent = globalenv())
  f
}

fs <- list(f0 = f0, f1 = f1, f2 = f2, f3 = f3, f4 = f4,
           f5 = f5, f6 = f6, f7 = f7, f8 = f8)

Let’s look at memory characteristics of each function and resulting closure:

result1 <- 
tibble::tibble(
  
  f = names(fs),
  
  # Size of function
  fsize = map_dbl(fs, ~ pryr::object_size(.x)),
  
  # Size of closure
  csize = map_dbl(fs, ~ pryr::object_size(.x(m1))),
  
  # what happens if called again?
  csize2 = map_dbl(fs, ~ pryr::object_size(.x(m1))),
  
  # Size of result information
  rsize = map_dbl(fs, ~ pryr::object_size(.x(m1)())),
  
  # Cost of information
  loss = rsize/csize,
    
  # Size on disk
  dsize = map_dbl(
    .x = fs,
    .f = ~ {
      tmp <- tempfile()
      saveRDS(.x(m1), file = tmp)
      file.size(tmp)
    }
  )
)

And what happens after adding more objects to the Global environment:

m3 <- glm(mpg ~ 1, data = mtcars)
m4 <- glm(weight ~ 1, data = ChickWeight)

result2 <- 
tibble::tibble(
  
  f = names(fs),
  
  # Size of function
  fsize_2 = map_dbl(fs, ~ pryr::object_size(.x)),
  
  # Size of storage
  csize_2 = map_dbl(fs, ~ pryr::object_size(.x(m1))),
  
  # size increases when called again!
  csize2_2 = map_dbl(fs, ~ pryr::object_size(.x(m1))),
  
  # Size of result information
  rsize_2 = map_dbl(fs, ~ pryr::object_size(.x(m1)())),
  
  # Cost of information
  loss_ = rsize_2/csize_2,
  
  # Size on disk
  dsize_2 = map_dbl(
    .x = fs,
    .f = ~ {
      tmp <- tempfile()
      saveRDS(.x(m1), file = tmp)
      file.size(tmp)
    }
  )
)

Now what happens if we remove the model objects and try to evaluate the closure created by each f on m1:

gs <- map(
  .x = fs,
  .f = ~ .x(m1)
)

rm(m1, m2, m3, m4)

result3 <- 
tibble::tibble(
  
  f = names(gs),
  
  # Does it even work?
  error = map_lgl(gs, ~ is.null(safely(.x)()$result)),
  
  # Size of storage
  csize_3 = map_dbl(gs, ~ pryr::object_size(.x)),
  
  # Size on disk
  dsize_3 = map_dbl(
    .x = gs,
    .f = ~ {
      tmp <- tempfile()
      saveRDS(.x, file = tmp)
      file.size(tmp)
    }
  )
)

Now let’s look at the results:

f	fsize	csize	csize2	rsize	loss	dsize	fsize_2	csize_2	csize2_2	rsize_2	loss_	dsize_2	error	csize_3	dsize_3
f0	9576	8128	9408	672	0.08	1933	11408	9408	9408	672	0.07	1909	TRUE	12976	1262
f1	10576	57944	58592	672	0.01	5390	12600	58592	58592	672	0.01	5390	FALSE	59096	5411
f2	13232	9680	10408	672	0.07	1230	15408	10408	10408	672	0.06	1230	FALSE	10408	1230
f3	15248	62168	62816	50368	0.81	6071	18264	62816	62816	50368	0.80	6071	FALSE	63320	6090
f4	16976	12920	13648	840	0.07	1753	20160	13648	13648	840	0.06	1753	FALSE	13648	1753
f5	20832	16552	17832	672	0.04	2398	24384	17832	17832	672	0.04	2398	FALSE	17832	2398
f6	18824	65528	66728	50544	0.77	6739	22208	66728	66728	50544	0.76	6739	FALSE	67232	6756
f7	20552	16112	17392	1016	0.06	2372	23992	17392	17392	1016	0.06	2372	FALSE	17392	2372
f8	18384	14816	14816	672	0.05	2008	22656	14816	14816	672	0.05	2008	FALSE	14816	2008

Comments

It’s surprising (to me) how much variation there is in (memory) performance in these options.
f0 is out as a candidate since it is not guaranteed to work when retrieved from disk.
My choice would be f2 or f8 which kind of converse of each other. f2 states what information to throw out; f8 states what information to keep.
I have not figured out what the size of all closures except g8 increases after calling it for the first time. I’ve seen the opposite too. At work, I had a closure that reduced over 10x in size just by calling it once. I spent hours trying to figure out why.
It can be useful to keep the extra information around for debugging. A simple way to do this is to make clean up in f2 optional like:

f2_ <- function(model, cleanup = TRUE){
  fm <- formula(model)
  if (cleanup){ on.exit({ rm(model) }) }
  function() { fm }
}

Another option is cleaning up the environment by name.

f2__ <- function(model){
  inputs <- names(as.list(match.call())[-1])
  fm <- formula(model)
  on.exit({ rm(list = ls()[ls() %in% inputs]) })
  function(){ fm }
}