A small collection of tips to make base R code faster.
Lately I’ve spent quite some time on packages that require (almost) only base R:

- datawizard, a package belonging to the easystats ecosystem, whose goal is to provide tools for data wrangling and statistical transformations;
- poorman, whose goal is to reproduce tidyverse functions (with a strong focus on dplyr) using base R only.

I’ve used bench::mark() and profvis::profvis() a lot to improve code performance, and here are a few things I learnt. By default, bench::mark() checks that all expressions return the same output, so we can be confident that the alternatives I show in this post are truly equivalent.
Before we start, I want to make a few things clear.
First, these performance improvements are aimed at package developers. A casual user shouldn’t really care whether a function runs 200 milliseconds faster; a package developer, however, might find these tips interesting.
Second, if you find some ways to speed up my alternatives, feel free to comment.
I know that there are a bunch of packages whose reputation is built on being very fast (for example data.table and collapse); here I’m only showing base R alternatives.
Finally, here’s a small function that I use to make a classic dataset (like iris or mtcars) much bigger.
make_big <- function(data, nrep = 500000) {
  tmp <- vector("list", length = nrep)
  for (i in 1:nrep) {
    tmp[[i]] <- data
  }
  data.table::rbindlist(tmp) |>
    as.data.frame()
}
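As a quick usage sketch (big_iris is just an illustrative name), applying it to the 150-row iris dataset gives a data frame of 150 × 500,000 = 75 million rows:

big_iris <- make_big(iris)
nrow(big_iris)
[1] 75000000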
Let’s start with checking whether all values of a vector are identical. One easy way to do this is to run length(unique(x)) == 1, which means that we first collect all unique values and then count them. This can be quite inefficient: it would be enough to stop as soon as we find two different values.
What we can do instead is compare all values to the first value of the vector. Below is an example with large vectors (10 and 20 million values respectively): the first contains only 1, the second contains 1 and 2.
# Should be TRUE
test <- rep(1, 1e7)

bench::mark(
  length(unique(test)) == 1,
  all(test == test[1]),
  iterations = 10
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
1 length(unique(test)) =… 161.8ms 185.6ms 5.31 166.1MB 5.31
2 all(test == test[1]) 44.2ms 69.7ms 14.9 38.1MB 4.47
# Should be FALSE
test2 <- rep(c(1, 2), 1e7)

bench::mark(
  length(unique(test2)) == 1,
  all(test2 == test2[1]),
  iterations = 10
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
1 length(unique(test2)) … 342.2ms 390.6ms 2.46 332.3MB 2.46
2 all(test2 == test2[1]) 63.2ms 71.6ms 11.5 76.3MB 2.30
This is also faster for character vectors:
# Should be FALSE
test3 <- rep(c("a", "b"), 1e7)

bench::mark(
  length(unique(test3)) == 1,
  all(test3 == test3[1]),
  iterations = 10
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl>
1 length(unique(test3)) =… 287.8ms 326ms 3.00 332.3MB 3.00
2 all(test3 == test3[1]) 82.7ms 107ms 8.73 76.3MB 1.75
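One caveat the benchmarks above don’t exercise: the two expressions stop being equivalent when the vector contains NA, because == propagates missing values. A small sketch of the difference:

test_na <- c(1, 1, NA)
length(unique(test_na)) == 1       # FALSE: NA counts as an extra unique value
all(test_na == test_na[1])         # NA, not FALSE
isTRUE(all(test_na == test_na[1])) # one possible guard, returns FALSE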
Sometimes we need to concatenate columns, for example if we want to create a unique id from several grouping columns.
test <- data.frame(
  origin = c("A", "B", "C"),
  destination = c("Z", "Y", "X"),
  value = 1:3
)
test <- make_big(test)
One option is to combine paste() and apply(), using MARGIN = 1 to apply paste() to each row. However, a faster way is to use do.call() instead of apply():
bench::mark(
  apply = apply(test[, c("origin", "destination")], 1, paste, collapse = "_"),
  do.call = do.call(paste, c(test[, c("origin", "destination")], sep = "_"))
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 apply 7.36s 7.36s 0.136 80.1MB 5.71
2 do.call 128.14ms 139.29ms 7.08 11.4MB 0
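Why the gap? paste() is already vectorized over its arguments: do.call() builds a single call in which each column is passed as one long vector, whereas apply() first converts the selected columns to a character matrix and then calls paste() once per row. A minimal illustration of what the do.call() version expands to:

# c(list_of_columns, sep = "_") turns the columns into individual arguments,
# so this is one vectorized call rather than one call per row:
do.call(paste, c(list(c("A", "B"), c("Z", "Y")), sep = "_"))
[1] "A_Z" "B_Y"
paste(c("A", "B"), c("Z", "Y"), sep = "_")
[1] "A_Z" "B_Y"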
This one comes from a StackOverflow question and its answer. Manipulating a data frame can remove some of its attributes. For example, if I give an attribute foo to a large data frame:
orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
attr(orig, "foo")
[1] TRUE
If I reorder the columns, this attribute disappears. For example (a hypothetical reordering; any subsetting of a data frame with [ has the same effect):
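# hypothetical reordering: subsetting drops the non-standard attribute "foo"
new <- orig[, c("x2", "x1")]
attr(new, "foo")
NULL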
We can put it back with:
attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attr(new, "foo")
[1] TRUE
But this takes some time because we also copy the 10M row names of the dataset. Therefore, one option is to create a custom function that copies the attributes of orig while skipping the structural ones (names, class, and the expensive row.names). In this case, only the attribute foo is transferred:
replace_attrs <- function(obj, new_attrs) {
  # Copy every attribute except the structural data frame ones
  # (names, class, row.names), so the 10M-element row.names is never touched.
  for (nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
    attr(obj, which = nm) <- new_attrs[[nm]]
  }
  obj
}
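For reference, the setdiff() guard works because an empty data frame carries exactly the three structural attributes, so everything else (here, only foo) gets copied:

sort(names(attributes(data.frame())))
[1] "class"     "names"     "row.names"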
bench::mark(
  old = {
    attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
    head(new)
  },
  new = {
    new <- replace_attrs(new, attributes(orig))
    head(new)
  }
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 old 68ms 82ms 12.9 38.2MB 4.29
2 new 52.8µs 80.5µs 11188. 24.4KB 8.77
It can be useful to remove empty rows, meaning rows that contain only NA or "". We could once again use apply() with MARGIN = 1, but a faster way is to use rowSums(). First, we create a data frame full of TRUE/FALSE with is.na(test) | test == "", and then we count the number of TRUE values per row. If this number equals the number of columns, the row contains only NA or "".
test <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c("", NA, "", NA, ""),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5),
  e = c("", "", "", "", ""),
  f = factor(c("", "", "", "", "")),
  g = factor(c("", NA, "", NA, "")),
  stringsAsFactors = FALSE
)
test <- make_big(test, 100000)

bench::mark(
  apply = which(apply(test, 1, function(i) all(is.na(i) | i == ""))),
  rowSums = which(rowSums(is.na(test) | test == "") == ncol(test))
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 apply 2.08s 2.08s 0.480 112.9MB 3.84
2 rowSums 709.59ms 709.59ms 1.41 99.7MB 0
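The benchmark above only locates the empty rows; to actually drop them, negate the same logical test and use it as a row index (a short sketch, with empty and cleaned as illustrative names):

empty <- rowSums(is.na(test) | test == "") == ncol(test)
cleaned <- test[!empty, , drop = FALSE]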
These were just a few tips I discovered. Maybe there are ways to make them even faster in base R? Or maybe you know some weird/hidden tips? If so, feel free to comment below!