Using property-based testing in R

When you don’t want to write dozens of variations of the same unit test.

Etienne Bacher
2024-10-01

I had never heard of property-based testing until a few months ago when I started looking at some pull requests in polars (the Python implementation, not the R one) where they use hypothesis, for example polars#17992.

I have contributed to a fair number of R packages but I had never seen this type of test before: unit tests, plenty; snapshot tests, sometimes; but property-based tests? Never. At first I didn’t really see the point, but I’ve had a couple of situations recently where I thought it could help, so the aim of this post is to explain (briefly) what property-based testing is and to provide some examples where it can be useful.

What is property-based testing?

Most of the time, unit tests check that the function performs well on a few different inputs: does it give correct results? Nice error messages? What about this corner case?
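
For instance, with testthat we would hardcode a few inputs and their expected outputs. A minimal sketch:

library(testthat)

test_that("rev() works on a few hardcoded inputs", {
  expect_equal(rev(c(3, 1, 4)), c(4, 1, 3))
  expect_equal(rev("a"), "a")
  expect_equal(rev(character(0)), character(0))
})
Test passed 😀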

Property-based testing is a way of testing where we feed random inputs to the function we want to test and ensure that, no matter the inputs, the output respects some properties. For example, suppose we wrote a function that reverses its input, so if I pass 3, 1, 4, it should return 4, 1, 3. We could pass several inputs and check that each output is correctly reversed, but a more efficient way is to check that our function respects a basic property: reversing the input twice should return the original input:

rev(rev(c(3, 1, 4)))
[1] 3 1 4

Therefore, property-based testing doesn’t use hardcoded values to check the output but ensures that our function respects a list of properties.
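
The round-trip above is not the only property we could check for rev(); for example, it should also preserve the length and the values of its input:

x <- c(3, 1, 4)
length(rev(x)) == length(x)
[1] TRUE
identical(sort(rev(x)), sort(x))
[1] TRUE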

Property-based testing in R

To the best of my knowledge, there are two R packages to do property-based testing in R: hedgehog and quickcheck (which is based on hedgehog). If you already use testthat for testing, then integrating them in the test suite is not hard. Using the example above, we could do:

library(quickcheck)
library(testthat)

test_that("reversing twice returns the original input", {
  for_all(
    a = numeric_(any_na = TRUE),
    property = function(a) {
      expect_equal(rev(rev(a)), a)
    }
  )
})
Test passed 😀

This example generated 100 random inputs and checked that the expect_equal() expectation held for all of them. We can see that by adding a print() call (I reduced the number of tests to 5 to avoid too much clutter).

test_that("reversing twice returns the original input", {
  for_all(
    a = numeric_(any_na = TRUE),
    tests = 5,
    property = function(a) {
      print(a)
      expect_equal(rev(rev(a)), a)
    }
  )
})
[1]  9474  1816  5096 -5492  6089     0  7687    NA -1505
[1] -1906 -9064    47  9454 -5367  4836    NA
[1] 883085158        NA
[1]    NA     0  5771  3808 -5308 -3313 -6441  2668    NA
[1]          0          0  -36502851  316004917 -237072955
Test passed 🥇

As we can see, a lot of different inputs were generated: some are short while others are longer, some only contain negative values while others have a mix, some contain NAs, etc.

The example above checked numeric inputs only, but we could check any type of atomic vector using any_atomic():

test_that("reversing twice returns the original input", {
  for_all(
    a = any_atomic(any_na = TRUE),
    tests = 5,
    property = function(a) {
      print(a)
      expect_equal(rev(rev(a)), a)
    }
  )
})
[1] -52659396        NA 134735066
[1]         NA -595851358         NA  716078985
[1]         NA  774043668         NA -540014936         NA  109811027
 [1]         NA         NA -699860696 -112767336 -718748598  378867662
 [7]          0         NA          0          0
 [1] "541L1(9+3"  "R(H"        "=u"         "=Q]Zx&"     NA          
 [6] "Q"          NA           "o5"         "[^+zm$\""   "#;+y\"U@;|"
Test passed 🎊

Finally, if a particular input makes the test fail, quickcheck will first try to reduce the size of this input as much as possible (a process called “shrinking”). To illustrate that, let’s say we make a function to normalize a numeric vector to the [0, 1] interval:

normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

normalize(c(-1, 2, 0, -4))
[1] 0.5000000 1.0000000 0.6666667 0.0000000

One property of this function is that all output values should be in the interval [0, 1]. Does this function pass property-based tests?

test_that("output is in interval [0, 1]", {
  for_all(
    a = numeric_(any_na = TRUE),
    tests = 5,
    property = function(a) {
      res <- normalize(a)
      expect_true(all(res >= 0 & res <= 1))
    }
  )
})

── Failure: output is in interval [0, 1] ───────────────────────────────────────
Falsifiable after 1 tests, and 3 shrinks
<expectation_failure/expectation/error/condition>
all(res >= 0 & res <= 1) is not TRUE

`actual`:   <NA>
`expected`: TRUE
Backtrace:

  1. └─quickcheck::for_all(...)
      [TRUNCATED...]
Counterexample:
$a
[1] -4037

Backtrace:

 1. └─quickcheck::for_all(...)
 2.   └─hedgehog::forall(...)

Aha! Problem: what happens if the input is a single value? Then max(x) - min(x) is 0, so the division gives NaN.
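
Indeed, applying normalize() to the counterexample quickcheck found confirms this:

normalize(-4037)
[1] NaN

In the error message, we can see: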

Falsifiable after 1 tests, and 3 shrinks

Shrinking means reducing the input that makes the function fail to the smallest possible size. Having the smallest failing example is extremely useful when debugging.

Let’s fix the function and try again:

normalize <- function(x) {
  if (length(x) == 1) {
    return(0.5) # WARNING: this is for the sake of example, I don't 
                # guarantee this is the correct behavior
  }
  (x - min(x)) / (max(x) - min(x))
}

test_that("output is in interval [0, 1]", {
  for_all(
    a = numeric_(any_na = TRUE),
    tests = 5,
    property = function(a) {
      res <- normalize(a)
      expect_true(all(res >= 0 & res <= 1))
    }
  )
})

── Failure: output is in interval [0, 1] ───────────────────────────────────────
Falsifiable after 1 tests, and 8 shrinks
<expectation_failure/expectation/error/condition>
all(res >= 0 & res <= 1) is not TRUE

`actual`:   <NA>
`expected`: TRUE
Backtrace:

  1. └─quickcheck::for_all(...)
     [TRUNCATED...]
Counterexample:
$a
[1] -2413 -2413

Backtrace:

 1. └─quickcheck::for_all(...)
 2.   └─hedgehog::forall(...)

Dang it, now it fails when I pass two identical values! This happens for the same reason as above: max(x) - min(x) returns 0. I won’t spend more time on this example, you get the idea.
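
For completeness, here is a sketch of one possible next step, with the same caveat as before (mapping a constant vector to 0.5 is an arbitrary choice, not necessarily the correct behavior):

normalize <- function(x) {
  rng <- max(x) - min(x)
  # treat single values and constant vectors the same way
  # WARNING: arbitrary choice, for the sake of example only
  if (length(x) == 1 || isTRUE(rng == 0)) {
    return(rep(0.5, length(x)))
  }
  (x - min(x)) / rng
}

Note that inputs containing NA would still make the property fail, since the comparisons inside expect_true() would return NA. That is precisely the value of property-based testing: it keeps forcing us to pin down the intended behavior.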

Besides this basic example, where could this be useful?

Ensuring that a package doesn’t crash R

When working with compiled code (C++, Rust, etc.), a bug can make the R session crash (== segfault == “bomb icon” in RStudio). This is extremely annoying, as we lose all the data and computations that were stored in memory. When we work with compiled code, there’s one property that our code should always satisfy:

Calling a function should never lead to a segfault.

This happened to me a few months ago. I was investigating some code that used igraph::cluster_fast_greedy(). I know almost nothing about igraph; I was just playing around with arguments, and suddenly… crash. I reported this situation (igraph#2459), which was promptly fixed (thank you igraph devs!), but one sentence in the explanation caught my eye: “it is a rare use case to only want modularity but not membership, and avoiding membership calculation doesn’t have any advantages.”

I have no particular problem with this sentence or the rationale behind it; it makes sense to prioritize fixes that affect a larger audience. But it got me thinking: could we try all combinations of inputs to see if any of them crashes the session? We could use parameterized tests for this, but then we would need to hardcode at least some possible values for the parameters. We could start by testing only TRUE/FALSE values for all combinations of parameters, but what if the user passes a string?

I think this is a situation where property-based testing helps: we know that no matter the type, length, and value of each input, the session shouldn’t crash. Implementing this with quickcheck is fairly simple:

library(igraph, warn.conflicts = FALSE)

test_that("cluster_fast_greedy doesn't crash", {
  # set up a graph, from the examples of ?cluster_fast_greedy
  g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
  g <- add_edges(g, c(1, 6, 1, 11, 6, 11))

  for_all(
    merges = any_atomic(any_na = TRUE), 
    modularity = any_atomic(any_na = TRUE), 
    membership = any_atomic(any_na = TRUE), 
    weights = any_atomic(any_na = TRUE),
    property = function(merges, modularity, membership, weights) {
      suppressWarnings(
        try(
          cluster_fast_greedy(
            g,
            merges = merges,
            modularity = modularity,
            membership = membership,
            weights = weights
          ),
          silent = TRUE
        )
      )
      expect_true(TRUE)
    }
  )
})
Test passed 😀

I didn’t really know what expectation to use here: I don’t care whether the function errors or not, I just want it not to segfault. So I wrapped the call in try(silent = TRUE) and added a trivial expectation that always passes.

Ensuring that a package and its variants give the same results

I have spent some time working on tidypolars, a package that provides the same interface as the tidyverse but uses polars under the hood. This means there should be as few “surprises” as possible for the user: functions available in tidypolars should behave exactly like their tidyverse counterparts. Once again, this can be tedious to check. One example is the function stringr::str_sub(). We can start with basic examples, such as:

stringr::str_sub(string = "foo", start = 1, end = 2)
[1] "fo"

Easy enough to test. But what happens if string is missing? Or if start > end? Or if end is negative? Or if start is negative and end is NULL and the length of start is greater than the length of string? Manually adding tests for all of those is painful and increases the risk of forgetting a corner case.
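
A couple of these corner cases, just to show how non-obvious the expected behavior is:

stringr::str_sub(string = "foo", start = 3, end = 1)
[1] ""
stringr::str_sub(string = "foo", start = -2)
[1] "oo"
stringr::str_sub(string = NA_character_, start = 1, end = 2)
[1] NA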

It is better here to use property-based testing: we don’t need to check the value of the output of functions implemented in tidypolars, we only need to check that they match the output of functions in tidyverse.

Here, one additional difficulty is that sometimes throwing an error is the correct behavior. Therefore, we need to create a custom expectation that checks that the outputs of tidypolars and tidyverse are identical, or that both functions error (see the testthat vignette on creating custom expectations):

expect_equal_or_both_error <- function(object, other) {
  # evaluate the tidypolars expression, recording whether it errors
  polars_error <- FALSE
  polars_res <- tryCatch(
    object,
    error = function(e) polars_error <<- TRUE
  )

  # do the same for the tidyverse expression
  other_error <- FALSE
  other_res <- suppressWarnings(
    tryCatch(
      other,
      error = function(e) other_error <<- TRUE
    )
  )

  # either both must error, or both outputs must be equal
  if (isTRUE(polars_error)) {
    testthat::expect(isTRUE(other_error), "tidypolars errored but tidyverse didn't.")
  } else {
    testthat::expect_equal(polars_res, other_res)
  }

  invisible(NULL)
}
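
With this expectation in hand, a property test comparing the two implementations could look like the sketch below. This is hypothetical code, not the actual tidypolars test suite: I assume quickcheck’s character_() and integer_() generators, a polars DataFrame built with pl$DataFrame(), and the tidypolars-provided mutate() and pull() verbs; the real tests may be organized differently.

library(dplyr, warn.conflicts = FALSE)
library(stringr)
library(polars)
library(tidypolars)

test_that("str_sub() behaves like in stringr", {
  for_all(
    string = character_(any_na = TRUE),
    start = integer_(),
    end = integer_(),
    property = function(string, start, end) {
      # apply str_sub() through tidypolars on a polars DataFrame,
      # and compare with stringr applied to the raw vector
      df <- pl$DataFrame(x = string)
      expect_equal_or_both_error(
        df |> mutate(y = str_sub(x, start, end)) |> pull(y),
        str_sub(string, start, end)
      )
    }
  )
})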

Conclusion

Property-based testing will not replace all other kinds of tests, and it is not appropriate in every context. Still, it can help uncover bugs and segfaults, and it adds confidence in our code by randomly checking that it works even with implausible inputs.
