# Membership Tests in Julia, R and Python

Membership tests, that is, checking whether an element is in a collection, are a common operation in statistical programming. However, depending on the programming language, the interface and underlying assumptions can vary a lot. After a short introduction to how Python and R implement membership tests, we will take a deeper look into Julia, touching on unique features such as **generic programming** and **multiple dispatch**.

## 1. Membership test operations in Python

Usually, a membership test has its own reserved word or special operation in the language. For example, in Python one could write

```
>>> 4 in [1, 2, 3]
False
>>> [1, 2] in [1, 2, 3]
False
>>> "a" in "abc"
True
>>> "ab" in "abc"
True
```

Note that in Python the `in` keyword is also used inside the for loop (`for elem in iterable`). If we are testing `list_1 in list_2`, we test whether the first list as a whole is an element of the other list. When dealing with strings, `in` tests whether the first string is a substring of the second string.
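The difference between "is an element of" and "all of these are elements of" is easy to miss, so here is a minimal sketch in plain Python. The variable names are only illustrative.

```python
haystack = [1, 2, 3]

# `in` on a list asks: is the left operand equal to one of the elements?
whole_list = [1, 2] in haystack       # no element of haystack equals [1, 2]
nested_hit = [1, 2] in [[1, 2], 3]    # here [1, 2] really is an element

# "Are all of these values present?" needs an explicit loop or a set.
all_present = all(x in haystack for x in [1, 2])
is_subset = {1, 2} <= set(haystack)   # requires hashable values

# Strings are the exception: `in` tests for a substring.
substring = "ab" in "abc"

print(whole_list, nested_hit, all_present, is_subset, substring)
```

Only the first test is false; the other four are true.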

If you are working with `numpy` arrays, the story is a bit different, as usual with `numpy`. For elementwise membership tests we need the function `numpy.isin`.

```
>>> import numpy as np
>>> a = np.array(1) # This array has zero dimensions!
>>> a
array(1)
>>> num_array = np.arange(1, 4)
>>> num_array
array([1, 2, 3])
>>> np.isin(a, num_array) # Same number of dimensions as variable a
array(True)
>>> np.isin(np.array([1]), num_array) # 1 dimension. The input dimensions are kept!
array([ True])
>>> np.isin(np.array([1, 2]), num_array) # Elementwise check
array([ True,  True])
>>> np.isin(np.array([1, 2]), num_array, invert=True) # set invert=True to get 'not in'
array([False, False])
```

First, we notice that the input dimensions are preserved in the membership test. Second, the membership test is done elementwise in `numpy`. This was not the case when we tested `[1, 2] in [1, 2, 3]`, where `[1, 2]` as a whole was taken to be a single element. Third, we cannot simply write something like `not 4 in [1, 2, 3]` when dealing with `numpy` arrays: the result is itself an array, and `not` raises an error as soon as that array has more than one element. To test non-membership, we set `invert=True` or negate the result elementwise with `~`. The `pandas` library also has an `isin` method for `Series` and `DataFrame`, which works similarly but has no `invert` option; there, `~` does the negation.
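The three observations can be checked with a short script. This is only a sketch: it assumes `numpy` is installed, and the variable names are made up for illustration.

```python
import numpy as np

num_array = np.arange(1, 4)      # the values 1, 2, 3
candidates = np.array([1, 5])

# Elementwise membership: one boolean per element of `candidates`.
is_member = np.isin(candidates, num_array)
not_member = np.isin(candidates, num_array, invert=True)
negated = ~is_member             # elementwise negation, same as invert=True

# A single yes/no answer has to be requested explicitly.
any_member = bool(is_member.any())
all_member = bool(is_member.all())

# `not` on an array with several elements is ambiguous and raises ValueError.
try:
    not is_member
    not_raised_error = False
except ValueError:
    not_raised_error = True

print(is_member, not_member, any_member, all_member, not_raised_error)
```

Reducing with `.any()` or `.all()` is the usual way to get back to a plain Python boolean.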

## 2. Membership test operations in R

In R, there is the `%in%` operator, which is just another way of using the `match` function (check the documentation with `?"%in%"`). This operator is necessary because in R `in` can only be used in a `for` loop. There are some differences with base Python, but the results are very similar to `np.isin`.

```
R> 1 %in% c(1, 2, 3)
[1] TRUE
R> c(1, 2) %in% c(1, 2, 3)
[1] TRUE TRUE
R> 1 %in% list(first = 1, second = 2, third = 3)
[1] TRUE
R> "a" %in% "abc"
[1] FALSE
```

The first example is just checking whether 1 is contained within the vector with the elements 1, 2 and 3. The last three examples, however, are more similar to the way `numpy` deals with membership tests. In the second example, R checks whether each one of the elements in `c(1, 2)` is in the vector `c(1, 2, 3)`. This differs from the base Python version, which takes the whole list `[1, 2]` to be the element to be searched in the iterable, therefore returning `False`. In base Python, you would rather use a list comprehension to get the same kind of behavior as R or `numpy`: `[i in (1, 2, 3) for i in (1, 2)]`.

The third example shows that a list in R behaves pretty much like a vector here. The single elements of the list can be named (here: `first`, `second`, `third`), but the names themselves are not taken into account by `%in%`, not unlike a named tuple. More surprisingly to Python users, the last example with a string returns `FALSE`. This is because strings are *not* **iterables** in R (remember this is a language that in a way is almost as old as C!): `"abc"` is a character vector with a single element, and that element is not equal to `"a"`. For substring tests there are special functions such as `grepl` in the `base` package and `str_detect` from the package `stringr`, which is also included in the `tidyverse`.
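To make the contrast with base Python concrete, the following sketch mimics the behavior of `%in%` for flat inputs. The helper `r_in` is hypothetical and is not part of any library; it only covers the cases shown above.

```python
def r_in(x, table):
    """Hypothetical helper mimicking R's `x %in% table` for flat inputs."""

    def as_vector(value):
        # R has no scalars: a single number or string is a length-one vector.
        # Strings are kept whole, unlike Python's character-wise iteration.
        if isinstance(value, (list, tuple)):
            return list(value)
        return [value]

    table_values = as_vector(table)
    # One boolean per element of x, as in R.
    return [element in table_values for element in as_vector(x)]


single = r_in(1, [1, 2, 3])
elementwise = r_in([1, 2], [1, 2, 3])
string_case = r_in("a", "abc")          # "abc" is one element, not three
string_vector = r_in("a", ["a", "b"])

print(single, elementwise, string_case, string_vector)
```

The third call returns a single false value, which matches the R result that surprises Python users.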

## 3. Julia

Julia is a modern programming language that is mainly focused on scientific computation. It feels more natural for scientific computing than Python and it has a more modern design than R. Julia’s behavior for membership tests is more similar to base Python, but it differs in how it deals with strings.

### `in`, \(\in\), \(\ni\) in Julia

The most basic operation is `in`, which tests whether an item is in a given array. Going back to the first example with basic Python, we see that the results are the same:

```
julia> 4 in [1, 2, 3]
false
julia> [1, 2] in [1, 2, 3]
false
```

However, `in` is not just a keyword, but a function. As Julia is based on generic programming and multiple dispatch, it actually has 33 different versions (methods) of the `in` function in the Julia version used here. Like Python and R, Julia is dynamically typed, but function arguments can carry type annotations. Each method has a **different signature** that depends on the types of the parameters:

```
julia> in # type the function name to get the number of methods
in (generic function with 33 methods)
julia> in(4, [1, 2, 3]) # call in the usual function notation
false
julia> @which in(4, [1, 2, 3]) # signature of the method based on the types
in(x, itr) in Base at operators.jl:1055
```

So when calling `in` with an integer and an array we get the signature `in(x, itr)`. We have **two generic arguments** without any type annotation: an item `x` and an iterable `itr`. We can check the code in `operators.jl`:

```
function in(x, itr)
    anymissing = false
    for y in itr
        v = (y == x)
        if ismissing(v)
            anymissing = true
        elseif v
            return true
        end
    end
    return anymissing ? missing : false
end
```

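Although we are not stating it explicitly, we assume through duck typing that `itr` is an iterable, because we are using it in the `for` loop. The `anymissing` flag implements three-valued logic: if no element equals `x` but at least one comparison returned `missing`, the result is `missing` rather than `false`.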
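For readers who are more at home in Python, the same logic can be ported almost line by line. This is a sketch under one assumption: Python has no `missing`, so the `MISSING` sentinel and the `generic_in` function below are made up for illustration.

```python
class _Missing:
    """Hypothetical stand-in for Julia's `missing`: comparisons stay undecided."""

    def __eq__(self, other):
        return self

    def __repr__(self):
        return "MISSING"


MISSING = _Missing()


def generic_in(x, itr):
    """Port of the generic Julia method shown above."""
    any_missing = False
    for y in itr:                 # duck typing: anything iterable works
        v = (y == x)
        if v is MISSING:
            any_missing = True    # remember that one comparison was undecided
        elif v:
            return True           # a definite match wins immediately
    return MISSING if any_missing else False


found = generic_in(1, [MISSING, 1])
not_found = generic_in(4, [1, 2, 3])
undecided = generic_in(4, [1, MISSING, 3])
from_generator = generic_in(2, (i for i in range(3)))

print(found, not_found, undecided, from_generator)
```

As in the Julia version, nothing in the function mentions a concrete collection type, so lists, tuples and generators all work.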

Now, let’s see what happens when we try to do a similar membership test with strings:

```
julia> "a" in "abc"
ERROR: use occursin(x, y) for string containment
julia> 'a' in "abc"
true
julia> @which in('a', "abc") # get the method signature
in(c::AbstractChar, s::AbstractString) in Base at strings/search.jl:141
```

Surprisingly, `"a"` fails while there is no problem with `'a'`. Unlike Python or R, Julia uses single quotation marks for characters and double quotation marks for strings. They are two different types, like in good old C. Julia is telling us that if we want to test whether `"a"` is a substring of `"abc"`, we have to use the `occursin` function. However, we can test whether the character `'a'` is in the `"abc"` string with the `in` function. For this comparison, we are using a different method from `Base`, which can be found in `strings/search.jl`:

```
in(c::AbstractChar, s::AbstractString) = (findfirst(isequal(c),s)!==nothing)
```

With **multiple dispatch**, Julia determines which method to use at runtime depending on the types of all the arguments. This method of the `in` function in particular uses the function `findfirst`, which searches `s` for the first character equal to `c`, and then checks whether something was found at all with `!==nothing`. (`nothing` is similar to `None` in Python.)
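Python has no built-in multiple dispatch, but the idea can be sketched with a small hand-written registry. Everything below is hypothetical and only meant as an illustration: the `Char` class stands in for Julia's character type, and `contains` picks the first registered method that matches both argument types, whereas Julia selects the most specific method automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """Hypothetical stand-in for Julia's Char; Python has no character type."""
    value: str


# Registry of (item type, collection type, implementation), most specific first.
_METHODS = []


def method(item_type, collection_type):
    """Register an implementation for one combination of argument types."""
    def decorator(func):
        _METHODS.append((item_type, collection_type, func))
        return func
    return decorator


@method(Char, str)
def _char_in_string(c, s):
    # Specialized search, analogous to the findfirst-based Julia method.
    return s.find(c.value) != -1


@method(str, str)
def _string_in_string(x, s):
    # Mirror Julia's refusal to treat string containment as membership.
    raise TypeError("use a substring test for string containment")


@method(object, object)
def _generic(x, itr):
    # Generic fallback: compare x with every element.
    return any(y == x for y in itr)


def contains(x, collection):
    """Dispatch on the types of both arguments, not just the first one."""
    for item_type, collection_type, func in _METHODS:
        if isinstance(x, item_type) and isinstance(collection, collection_type):
            return func(x, collection)
    raise TypeError("no matching method")


char_result = contains(Char("a"), "abc")
generic_result = contains(4, [1, 2, 3])
try:
    contains("a", "abc")
    string_rejected = False
except TypeError:
    string_rejected = True

print(char_result, generic_result, string_rejected)
```

The point of the sketch is that the call site always looks the same, while the implementation that runs depends on the types of both arguments.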

There is also an alias to the `in` function, \(\in\), which can be typed by writing `\in` and then hitting the `<tab>` key. Its reverse is \(\ni\), which just swaps the arguments. The following are equivalent:

```
julia> 4 ∈ [1, 2, 3] # as infix operator
false
julia> ∈(4, [1, 2, 3]) # as function
false
julia> [1, 2, 3] ∋ 4
false
```

As a side note, arrays in Julia do not have to contain elements of the same type. Usually, the element type will be `Any` when combining types. However, note that in most cases this is less efficient than making sure all the elements are of the same type:

```
julia> [1, 'c']
2-element Array{Any,1}:
1
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
```

### Broadcasting in Julia

We still have not addressed the question of whether it is possible to do membership tests as in `numpy` or R. The first idea would be to try the broadcasting capabilities of Julia. This can be used by adding a dot before an infix operator or after a function name:

```
julia> [1,2] .∈ [1, 2, 3]
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
```

That does not work as expected. Julia is trying to do an elementwise comparison, but the array **dimensions do not match**. Unlike R's `%in%` or `np.isin`, broadcasting pairs the elements up position by position; Julia does not recycle arrays, nor does it try to interpret what you mean. You get what you type. Let’s look at a broadcasting example that does work:

```
julia> [1, 2] .∈ [[1, 4], [3, 0]]
2-element BitArray{1}:
1
0
julia> in.([1, 2], [[1, 4], [3, 0]]) # this alternative does the same
```

Julia tests whether `1` is in `[1, 4]` and whether `2` is in `[3, 0]`. The result is `1` (true) for the first test and `0` (false) for the second one. But this is not what we are trying to replicate. It turns out that there is no dedicated built-in function to replicate the behavior of `numpy` and R. (Broadcasting can be made to work by wrapping the collection in `Ref`, as in `[1, 2] .∈ Ref([1, 2, 3])`, so that it is treated as a single value.) Otherwise we need, again, a **list comprehension**, which is also a familiar feature of Python:

```
julia> [i in [1, 2, 3] for i in [1, 2]]
2-element Array{Bool,1}:
1
1
```

This is the exact result we got from R and `numpy`. However, the type is now different: `Array{Bool,1}` instead of `BitArray{1}`. This is not a major issue. They are both subtypes of `AbstractArray{Bool}` and are mostly interchangeable. They only differ in how they store the boolean values: a `BitArray` packs each value into a single bit, while an `Array{Bool}` uses one byte per value.
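The two behaviors discussed in this section, pairwise tests and testing each item against one whole collection, can be put side by side in plain Python. The function names are hypothetical, and the explicit length check is only an imitation of Julia's `DimensionMismatch` error.

```python
def pairwise_in(items, collections):
    """Test the i-th item against the i-th collection, like a broadcast `.∈`."""
    if len(items) != len(collections):
        # Imitates the DimensionMismatch error: no recycling, no guessing.
        raise ValueError("lengths differ: %d and %d" % (len(items), len(collections)))
    return [item in coll for item, coll in zip(items, collections)]


def each_in(items, collection):
    """Test every item against one and the same collection, like `%in%`."""
    return [item in collection for item in items]


pairwise = pairwise_in([1, 2], [[1, 4], [3, 0]])
against_whole = each_in([1, 2], [1, 2, 3])

try:
    pairwise_in([1, 2], [1, 2, 3])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(pairwise, against_whole, mismatch_detected)
```

The pairwise version corresponds to the working broadcast example, and `each_in` corresponds to the comprehension.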

We now turn to the next major topic: how to test membership for strings in Julia.

### Strings: `occursin` in Julia

Membership tests can be done with the `in` function for a wide array of types. However, strings have their own specialized function: `occursin`. There are three reasons for this. First, iterating over a string in Julia yields characters (`Char`), not one-character strings as in Python, so `in` can only look for single characters. Second, `occursin` has functionality specific to strings. Third, testing for membership in a collection is not necessarily the same as testing for a substring or a regular expression. Let’s take a look at the method signature:

```
occursin(needle::Union{AbstractString,Regex,AbstractChar}, haystack::AbstractString)
```

Basically, `occursin` finds a ‘needle’ in the ‘haystack’. The needle can be either a subtype of `AbstractString` or `AbstractChar`, but *also* a regular expression.

```
julia> occursin("a", "abc") # is the needle a substring of the haystack?
true
julia> occursin("ab", "abc") # the needle string can be arbitrary
true
julia> occursin('a', "abc") # when needle is a character it is the same as in('a', "abc")
true
julia> occursin(r"a.c", "abc") # using a regular expression
true
```
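For comparison with the Python side of this post, a rough analogue of this interface might look as follows. The function `occurs_in` is hypothetical; it simply combines Python's substring `in` with the standard `re` module and is not a faithful reimplementation of `occursin`.

```python
import re


def occurs_in(needle, haystack):
    """Hypothetical Python analogue of Julia's occursin(needle, haystack)."""
    if isinstance(needle, re.Pattern):
        # Regular expression needle: search anywhere in the haystack.
        return needle.search(haystack) is not None
    if isinstance(needle, str):
        # String needle: plain substring test (covers single characters too).
        return needle in haystack
    raise TypeError("needle must be a string or a compiled regular expression")


substring_hit = occurs_in("ab", "abc")
substring_miss = occurs_in("d", "abc")
regex_hit = occurs_in(re.compile(r"a.c"), "abc")

print(substring_hit, substring_miss, regex_hit)
```

In Python the regular expression has to be compiled through a library call, whereas Julia offers the `r"..."` literal directly.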

It is very neat that there is a built-in function for doing substring tests that also allows you to check **regular expressions** without using a regular expressions library. Moreover, we can use both `in` and `occursin` to test character membership in a string. However, we can check with the `@code_native` macro that the implementations of both functions are different.

```
julia> @code_native in('b', "abc")
...
julia> @code_native occursin('b', "abc")
...
```

The output is omitted, but one key difference is that `occursin` does not use the `findfirst` function that we showed earlier for the `in` function.

## 4. Conclusion

In this post, we looked at membership tests in Julia, Python and R. This is a very common operation when programming. However, even with simple functions it is worth looking at the details, as they reveal a lot about the design choices of a programming language. Personally, I think that Julia has a much more consistent and modern feel to it. Python is great, but the differences between Python, `numpy`, `pandas`, `torch`, etc. can be frustrating. As for R, it is a bag of surprises. The language is much older and does not always match the intuition you develop with a more modern programming language. More often than not you have to try something out, and see if it works as expected!

Follow me on Twitter @mexiamorelli