Membership Tests in Julia, R and Python

Thu, Mar 11, 2021
10-minute read

A membership test, that is, checking whether an element is in a collection, is a common operation in statistical programming. However, depending on the programming language, the interface and underlying assumptions can vary a lot. After a short introduction to how Python and R implement membership tests, we will take a deeper look into Julia, touching on unique features such as generic programming and multiple dispatch.

1. Membership test operations in Python

Usually, a membership test has its own reserved word or special operator in the language. For example, in Python one could write:

>>> 4 in [1, 2, 3]
False

>>> [1, 2] in [1, 2, 3]
False

>>> "a" in "abc"
True

>>> "ab" in "abc"
True

Note that in Python the in keyword is also used in for loops (for elem in iterable). When we test list_1 in list_2, Python checks whether list_1 as a whole is equal to one of the elements of list_2, not whether its elements are contained in list_2. When dealing with strings, in tests whether the first string is a substring of the second string.
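
To make the distinction concrete, here is a small sketch in plain Python. The variable names are made up for illustration; only built-in behavior is used.

```python
# A list on the left-hand side is treated as one candidate element.
haystack = [1, 2, 3]
nested = [[1, 2], 3]

whole_list_found = [1, 2] in haystack    # False: no element equals [1, 2]
whole_list_in_nested = [1, 2] in nested  # True: the first element equals [1, 2]

# To ask "are all of these values present?", test each value separately.
all_present = all(value in haystack for value in [1, 2])  # True

# For strings, `in` is a substring test, so even the empty string is found.
substring_found = "bc" in "abc"  # True
empty_found = "" in "abc"        # True

print(whole_list_found, whole_list_in_nested, all_present, substring_found, empty_found)
```

The same keyword therefore answers three slightly different questions, depending on the type of the right-hand side.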

If you are working with numpy arrays, the story is a bit different – as usual with numpy. We need the function numpy.isin for membership tests.

>>> import numpy as np
>>> a = np.array(1) # This array has zero dimensions!
>>> a
array(1)

>>> num_array = np.arange(1, 4)
>>> num_array
array([1, 2, 3])

>>> np.isin(a, num_array) # Same number of dimensions as variable a
array(True)

>>> np.isin(np.array([1]), num_array) # 1 dimension. The input dimensions are kept!
array([ True])

>>> np.isin(np.array([1, 2]), num_array) # Elementwise check
array([ True,  True])

>>> np.isin(np.array([1, 2]), num_array, invert=True) # set invert=True to get 'not in'
array([False, False])

First, we notice that the input dimensions are preserved in the membership test. Second, the membership test is done elementwise in numpy. This was not the case when we tested [1, 2] in [1, 2, 3], where [1, 2] as a whole was taken to be a single element. Third, we cannot simply write something like not 4 in [1, 2, 3] when dealing with numpy arrays: the result is an array with the same shape as the input, and Python’s not raises an error on an array with more than one element. To test non-membership, we set invert=True or negate the result elementwise with the ~ operator. The pandas library also has an isin method for Series and DataFrames, which works similarly but has no invert option, so there the result is negated with ~.
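
The following sketch shows the non-membership options side by side. It assumes numpy is installed; the variable names are only illustrative.

```python
import numpy as np

values = np.array([1, 4])
num_array = np.arange(1, 4)  # array([1, 2, 3])

is_member = np.isin(values, num_array)                # elementwise membership
not_member = np.isin(values, num_array, invert=True)  # elementwise non-membership
negated = ~is_member                                  # elementwise negation of the result

# Python's `not` needs a single truth value, so it fails on a
# boolean array with more than one element.
try:
    not is_member
    raised_value_error = False
except ValueError:
    raised_value_error = True

print(is_member, not_member, negated, raised_value_error)
```

Both invert=True and ~ keep the shape of the input, which is exactly why the plain not keyword cannot be used here.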

2. Membership test operations in R

In R, there is the %in% operator, which is just a more convenient interface to the match function (see the documentation with ?"%in%"). The operator is necessary because in R the keyword in can only be used in a for loop. There are some differences from base Python, but the results are very similar to np.isin.

R> 1 %in% c(1, 2, 3)
[1] TRUE

R> c(1, 2) %in% c(1, 2, 3)
[1] TRUE TRUE

R> 1 %in% list(first = 1, second = 2, third = 3)
[1] TRUE

R> "a" %in% "abc"
[1] FALSE

The first example just checks whether 1 is contained in the vector with the elements 1, 2 and 3. The remaining examples are more similar to the way numpy deals with membership tests. In the second example, R checks whether each of the elements in c(1, 2) is in the vector c(1, 2, 3). This differs from base Python, which takes the whole list [1, 2] to be the element to search for in the iterable and therefore returns False. In base Python, you would use a list comprehension to get the same kind of behavior as R or numpy: [i in (1, 2, 3) for i in (1, 2)].
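
The list comprehension can be wrapped in a small helper that mimics the elementwise behavior of %in%. This is only a sketch: isin_elementwise is a hypothetical name, not a standard-library function, and the set lookup assumes that all elements are hashable.

```python
def isin_elementwise(values, table):
    """Return one boolean per value, similar to R's `values %in% table`."""
    lookup = set(table)  # assumption: elements are hashable
    return [value in lookup for value in values]


result = isin_elementwise([1, 2, 5], [1, 2, 3])
print(result)  # [True, True, False]
```

Converting the table to a set once avoids scanning it again for every value, which matters when both inputs are long.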

The third example shows that a list in R behaves much like a vector here. The elements of the list can be named (here: first, second, third), but the names are not taken into account by %in%, not unlike the field names of a named tuple in Python. More surprisingly to Python users, the last example with a string returns FALSE. This is because a string in R is a single element of a character vector, not an iterable of characters (remember this is a language that in a way is almost as old as C!). Substring tests need special functions such as grepl in the base package or str_detect from the stringr package, which is part of the tidyverse.

3. Julia

Julia is a modern programming language that is mainly focused on scientific computation. It feels more natural for scientific computing than Python and it has a more modern design than R. Julia’s behavior for membership tests is more similar to base Python, but it differs in how it deals with strings.

`in`, ∈ and ∋ in Julia

The most basic operation is in which tests whether an item is in a given array. Going back to the first example with basic Python we see that the results are the same:

julia> 4 in [1, 2, 3]
false

julia> [1, 2] in [1, 2, 3]
false

However, in is not just a keyword, but a function. As Julia is built around generic programming and multiple dispatch, in has many methods: 33 in the Julia version used here. Like Python and R, Julia is dynamically typed, but arguments can carry optional type annotations, and each method has a signature that determines the argument types it applies to:

julia> in # type the function name to get the number of methods
in (generic function with 33 methods)
    
julia> in(4, [1, 2, 3]) # call in the usual function notation
false
    
julia> @which in(4, [1, 2, 3]) # signature of the method based on the types
in(x, itr) in Base at operators.jl:1055

So when calling in with an integer and an array, we get the method with the signature in(x, itr). Neither argument has a type annotation, so this is the generic fallback that accepts any value x and any iterable itr. We can check the code in operators.jl:

function in(x, itr)
    anymissing = false
    for y in itr
        v = (y == x)
        if ismissing(v)
            anymissing = true
        elseif v
            return true
        end
    end
    return anymissing ? missing : false
end

Although we are not stating it explicitly, we assume through duck typing that itr is an iterable, because we are using it in the for loop.
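
The three-valued logic of this function is easy to miss, so here is a rough Python transcription. It is only a sketch: MISSING is a hypothetical sentinel standing in for Julia’s missing, and equals and julia_like_in are made-up names.

```python
MISSING = object()  # hypothetical stand-in for Julia's `missing`


def equals(a, b):
    """Three-valued equality: any comparison with MISSING yields MISSING."""
    if a is MISSING or b is MISSING:
        return MISSING
    return a == b


def julia_like_in(x, itr):
    """Follow the control flow of the generic in(x, itr) shown above."""
    any_missing = False
    for y in itr:  # duck typing: itr only needs to be iterable
        v = equals(y, x)
        if v is MISSING:
            any_missing = True
        elif v:
            return True
    return MISSING if any_missing else False


found = julia_like_in(1, [MISSING, 1])    # a definite match wins
unknown = julia_like_in(4, [1, MISSING])  # no match, but one comparison is unknown
absent = julia_like_in(4, [1, 2, 3])      # no match and nothing unknown
```

The answer is false only when every comparison is known to be false. A single missing element without a match makes the whole result missing.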

Now, let’s see what happens when we try to do a similar membership test with strings:

julia> "a" in "abc"
ERROR: use occursin(x, y) for string containment
  
julia> 'a' in "abc"
true
  
julia> @which in('a', "abc") # get the method signature
in(c::AbstractChar, s::AbstractString) in Base at strings/search.jl:141

Surprisingly, "a" fails while there is no problem with 'a'. Unlike Python or R, Julia uses single quotation marks for characters and double quotation marks for strings. They are two different types, like in good old C. Julia is telling us that if we want to test whether "a" is a substring of "abc", we have to use the occursin function. However, we can test whether the character 'a' is in the string "abc" with the in function. This comparison uses a different method of in from Base, which can be found in strings/search.jl:

in(c::AbstractChar, s::AbstractString) = (findfirst(isequal(c),s)!==nothing)

With multiple dispatch, Julia selects the method to call based on the runtime types of all arguments. This method of in uses findfirst, which searches s for the first character equal to c, and then checks with !== nothing whether anything was found at all. (nothing is similar to None in Python.)
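
Python has no built-in multiple dispatch, but functools.singledispatch gives a rough feel for the idea. The sketch below is an approximation: singledispatch looks only at the type of the first argument, so the collection comes first here, and the function name contains is made up.

```python
from functools import singledispatch


@singledispatch
def contains(collection, item):
    """Generic fallback: works for any iterable."""
    return any(element == item for element in collection)


@contains.register
def _(collection: str, item):
    """Specialization for strings: only single characters may be tested."""
    if isinstance(item, str) and len(item) != 1:
        raise TypeError("use a substring test for string containment")
    return any(char == item for char in collection)


in_list = contains([1, 2, 3], 4)  # uses the generic fallback
in_string = contains("abc", "a")  # uses the str specialization

try:
    contains("abc", "ab")
    rejected = False
except TypeError:
    rejected = True
```

In Julia the choice depends on the types of both arguments, which is why in(c::AbstractChar, s::AbstractString) can coexist with the generic in(x, itr).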

There is also an alias for the in function, ∈, which can be typed by writing \in and then hitting the <tab> key. Its reverse is ∋ (\ni followed by <tab>), which just swaps the arguments. The following are equivalent:

julia> 4 ∈ [1, 2, 3] # as infix operator
false

julia> ∈(4, [1, 2, 3]) # as function
false

julia> [1, 2, 3] ∋ 4 
false

As a side note, arrays in Julia do not have to contain elements of the same type. Usually, the element type will be Any when combining types. However, note that in most cases this is less efficient than making sure all the elements are of the same type:

julia> [1, 'c']
2-element Array{Any,1}:
 1
  'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

Broadcasting in Julia

We still have not addressed the question of whether it is possible to do membership tests as in numpy or R. The first idea would be to try the broadcasting capabilities of Julia. This can be used by adding a dot before an infix operator or after a function name:

julia> [1,2] .∈ [1, 2, 3]
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")

That does not work as expected. Julia is trying to do an elementwise comparison, but the array dimensions do not match. Unlike R or numpy, Julia neither recycles or copies arrays nor tries to guess what you mean. You get what you type. Let’s look at a broadcasting example that does work:

julia> [1, 2] .∈ [[1, 4], [3, 0]]
2-element BitArray{1}:
 1
 0
 
julia> in.([1, 2], [[1, 4], [3, 0]]) # this alternative does the same

Julia tests whether 1 is in [1, 4] and whether 2 is in [3, 0]. The result is 1 (true) for the first test and 0 (false) for the second one. But this is not what we are trying to replicate. There is no dedicated function that replicates the behavior of numpy and R, although broadcasting can be made to work by wrapping the collection in Ref so that it is treated as a single value: [1, 2] .∈ Ref([1, 2, 3]). A more familiar alternative is a comprehension, which is also a feature of Python:

julia> [i in [1, 2, 3] for i in [1, 2]]
2-element Array{Bool,1}:
 1
 1

This is the same result we got from R and numpy. However, the type is now different: Array{Bool,1} instead of the BitArray{1} returned by broadcasting. This is not a major issue. Both are subtypes of AbstractArray{Bool} and are mostly interchangeable. They differ only in storage: a BitArray packs each value into a single bit, while an Array{Bool} uses one byte per value.
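
The two semantics that were mixed up above can be written out explicitly in plain Python. This is a sketch with illustrative variable names; zip stands in for pairwise broadcasting.

```python
values = [1, 2]

# Pairwise, like `[1, 2] .∈ [[1, 4], [3, 0]]` in Julia:
# the i-th value is tested against the i-th collection.
pairwise = [value in collection
            for value, collection in zip(values, [[1, 4], [3, 0]])]

# One collection for all values, like `c(1, 2) %in% c(1, 2, 3)` in R.
against_one = [value in [1, 2, 3] for value in values]

print(pairwise)     # [True, False]
print(against_one)  # [True, True]
```

One difference is worth keeping in mind: zip silently stops at the shorter input, whereas Julia raises a DimensionMismatch when the lengths differ.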

We now turn to the next major topic: how to test membership for strings in Julia.

Strings: `occursin` in Julia

Membership tests can be done with the in function for a wide array of types. However, strings have their own specialized function: occursin. There are three reasons for this. First, iterating over a string in Julia yields characters, not one-character strings as in Python, so a string can never be an element of another string. Second, occursin has functionality specific to strings. Third, testing for membership in a collection is not necessarily the same as testing for a substring or a regular expression match. Let’s take a look at the method signature:

occursin(needle::Union{AbstractString,Regex,AbstractChar}, haystack::AbstractString)

Basically, occursin finds a ‘needle’ in the ‘haystack’. The needle can be a subtype of AbstractString or AbstractChar, or a regular expression.

julia> occursin("a", "abc") # is the needle a substring of the haystack?
true

julia> occursin("ab", "abc") # the needle string can be arbitrary
true

julia> occursin('a', "abc") # when needle is a character it is the same as in('a', "abc")
true

julia> occursin(r"a.c", "abc") # using a regular expression
true
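
For comparison, a similar interface can be sketched in Python with the standard re module. The function below is a hypothetical analogue, not an existing Python function: it accepts either a plain string or a compiled pattern as the needle.

```python
import re


def occursin(needle, haystack):
    """Hypothetical Python analogue of Julia's occursin."""
    if isinstance(needle, re.Pattern):
        # Regular expression needle: search anywhere in the haystack.
        return needle.search(haystack) is not None
    # String needle: Python's `in` is already a substring test.
    return needle in haystack


substring_hit = occursin("ab", "abc")
regex_hit = occursin(re.compile(r"a.c"), "abc")
regex_miss = occursin(re.compile(r"a.c"), "ac")
print(substring_hit, regex_hit, regex_miss)  # True True False
```

Python has no separate character type, so the case occursin('a', "abc") collapses into the substring branch.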

It is very neat that there is a built-in function for doing substring tests that also allows you to check regular expressions without using a regular expressions library. Moreover, we can use both in and occursin to test character membership in a string. However, we can check with the @code_native macro that the implementations of both functions are different.

julia> @code_native in('b', "abc")
...
julia> @code_native occursin('b', "abc")
...

The output is omitted, but one key difference is that occursin does not use the findfirst function that we showed earlier for the in function.

4. Conclusion

In this post, we looked at membership tests in Julia, Python and R. This is a very common operation when programming. However, even with simple functions it is worth looking at the details, as they reveal a lot about the design choices of a programming language. Personally, I think that Julia has a much more consistent and modern feel to it. Python is great, but the differences between Python, numpy, pandas, torch, etc. can be frustrating. As for R, it is a bag of surprises. The language is much older and does not always match the intuition you develop with a more modern programming language. More often than not you have to try something out and see if it works as expected!

Follow me on Twitter @mexiamorelli
