Cluster comparisons with ARI

A bit of background

A common need for researchers that rely on clustering algorithms, such as the organization of networks into cohesive node communities, is to evaluate the similarity of the partitions produced. In my case this problem takes the form of comparing the distribution of brain networks across individuals. While many tools have been developed to tackle the challenge (see Fortunato & Hric, 2016 for an initial survey), here I’ll give a superficial view on the adjusted rand index (ARI), hoping to better understand its behavior and ideal case usage. Since I’m more familiar with the usage of this measure in networks, I’ll base my terminology on this framework.

The Rand Index

The original Rand Index (RI) was introduced by William Rand in 1975, and it aimed to determine whether two clustering algorithms grouped every pair of nodes in a similar fashion. For example, it could be that nodes 3 and 25 from your network were grouped together in partitions from algorithms A and B, but separately by method C (you can probably already think of uses for this index in regards to examining the effects of experimental network perturbations in their organization). This agreement is computed by the following simple formula:

\[\frac{a + b}{a + b + c + d}\]

Where \(a\) is the number of pairs of nodes that were grouped together in both partitions, \(b\) is the number that were grouped separately, and \(c\) and \(d\) denote the number grouped together (separately) in one partition, but separately (together) in the other. In short, the Rand Index just gives the proportion of nodes equally paired in both partitions.

At this point you might have already thought: what is the advantage of doing this instead of directly checking if the node labels are identical across partitions? (i.e. partitionA == partitionB) For some of us the answer is not quite evident, which is what motivated me to write this post. To begin exploring this, let’s look at a toy example of two hypothetical binary clustering algorithms (A and B) applied to a small network (I will mainly focus on this type of classification for simplicity):

# Toy clusterings
table <- data.frame(A = c(1,1,1,0,0),
                    B = c(0,0,0,1,1))

# Output table
table %>% 
  knitr::kable("html") %>%
  kable_styling(full_width = FALSE)
1 0
1 0
1 0
0 1
0 1
# Compute agreement (arandi comes from the package 'mcclust')
EQ <- mean(table$A == table$B)
RI <- arandi(table$A, table$B, adjust = F)

In this case, a simple equivalence (EQ) would of course yield a value of 0, as none of the values match. On the other hand, RI yields a value of 1. This is because the relationship among all nodes remains unchanged across partitions. This portrays the first advantage of metrics like RI: they evaluate the underlying relationship among your clustered nodes while being agnostic to the labeling system.

Now let’s tweak the clusters a bit, so that there is one rogue node that switches clusters.

0 0
0 1
0 0
1 1
1 1

In this case, checking the equivalence of the vectors yields 0.8, while the Rand Index gives 0.6. This is because RI doubles the dimensionality by considering the underlying interconnectedness of the nodes, such that the total number of pairwise combinations becomes factorial(5) / (factorial(2) * factorial(5-2)), or 10. By changing the allegiance of a single node, we basically change the interaction of that node with the other four members, giving us an agreement level between partitions of 6/10 (i.e., the RI value we got). To clarify this idea, let’s now suppose that cluster labels can be 0, 1, or 2, with the rogue node now being part of 2:

0 0
0 2
0 0
1 1
1 1

In this case, both the vector equivalence and RI give the same value (0.8). This is because the rogue node was originally tied to two others in cluster 0, whereas its unrelatedness to cluster 1 remains unchanged (unlike before, where it became part of it). This now gives a ratio of similarities of 8/10. Hopefully this provides a better sense of the advantages of RI over simple equivalence, especially once you increase both the number of nodes and clusters.

The Adjusted Rand Index

So far so good, but as researchers we always want to control for the possibility that our results are defined by chance! That’s where the adjustment to RI introduced by Hubert & Arabie (1985) comes in. The Adjusted Rand Index controls for the expected distribution of labels given by chance, and compares the original RI to this random model–while still guaranteeing that perfect clustering matches get a value of 1. Others have already done a great job in explaining the mathematical nuances of the ARI, so I will instead focus on understanding how the three metrics (i.e. EQ, RI, and ARI) behave under similar circumstances.

Let’s say we have an original vector of 100 nodes classified as part of either cluster 1 or 0. I will progressively switch the allegiance of each node, and on each case I will estimate the agreement between the new vector and the original one. In other words, I will check the values given by these metrics as the modified vector slowly becomes more disimilar from the original one. The plot summarizes the results.

# Produce a random vector with 100 binary classifications
n <- 100
v1 <- rbinom(n = n, size = 1, prob = 0.5)

# Data frame to store evaluations
evals <- data.frame(Diffs = seq(n),
                    ARI = rep(0,n),
                    RI = rep(0,n),
                    EQ = rep(0,n))

# Check the degree of agreement per metric for each incremental switch in affiliation
for (i in seq(n)) {
  # Switch affiliation up to the ith node
  ifelse(i < n, v2temp <- c((1 - v1[seq(i)]), v1[(i+1):n]), v2temp <- v1)
  # Evaluate
  evals$ARI[i] <- arandi(v1,v2temp)
  evals$RI[i] <- arandi(v1,v2temp, adjust = F)
  evals$EQ[i] <- mean(v1 == v2temp)

# Make the data.frame long instead of wide for ggplot
evals <- melt(evals[seq(n-1),], id.vars = "Diffs")

# And plot
ggplot(data = evals, aes(Diffs, value, group = variable, color = variable)) + 
  geom_line() +
  scale_color_discrete(name = "Method") +

So, progressively switching the affiliations of the original vector has the expected negative linear trend on EQ, as the vectors become increasingly dissimilar. On the other hand, RI heavily penalises initial clustering disagreements, but notice that once the differences pass 50% the vectors are deemed increasingly similar. This parabolic function is a result of node pairs once again being grouped in the same/different clusters once direct differences pass chance levels (think back to the first example, and how inverting the values still maintains the underlying relationships). Finally, we can see that ARI drops to 0 when the vector differences reach this chance level. In accord with its definition, positive ARI values thus denote agreements above those expected by chance1.


As I said in the introduction, I use the ARI to check how similar the distribution of functional brain networks is across individuals. However, you could imagine using this metric as a way to evaluate the accuracy of neural nets as they are being trained (provided that you have the ground truth of what’s being clustered). It is also worth mentioning that the ARI is not the ultimate way to evaluate clusterings. As Fortunato & Hric (2016) discuss, metrics based on mutual information tend to be more robust (i.e. variation of information), and are also included in statistical packages (like mcclust for R, which I use here).

Regardless, I hope this can eventually be as useful for someone to read as it was for me to write it.

  1. If you are interested in getting the statistical significance of this deviation from chance, see Fortunato & Hric (2016) who discuss the computation of Z-scores based on permutations. I might expand this post in the future to show how this works.