!pip install -r requirements.txt
Heatmaps done right
A note on clustering distance matrices before plotting. Think sns.clustermap using altair.
Given a distance matrix, i.e. a symmetric matrix of distances between observations, if the indices of the observations are arbitrary, or more generally there is no variable by which we want to order the observations, then plotting it as a heatmap using the default ordering of the indices is often not very useful: we get a heatmap whose rows and columns are ordered arbitrarily, and the plot itself may be hard to interpret. In many cases, a better approach is to cluster the observations and use the resulting ordering. This leads to a heatmap in which similar observations are grouped together, which is what sns.clustermap offers, as seen in the docs.
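For reference, the seaborn route is essentially a one-liner. The snippet below is only an illustration: dist_df is a placeholder name for a square DataFrame of pairwise distances, not something defined in this post.

import seaborn as sns

# dist_df: a hypothetical square DataFrame of pairwise distances
# clustermap clusters rows and columns, then draws the reordered heatmap in one call
sns.clustermap(dist_df)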
Question: Is there another, better way to re-order the indices? This probably depends on the context and the meaning of the heatmap, but anything purely algebraic would probably not work, as the before and after objects are ultimately different matrices.
Note that the data itself is not the focus here; we just need a distance matrix in order to show how it can be reordered. We will use the USA presidential inaugural speeches and consider distances between texts, walk through the clustering, and then do the plotting using altair for interactivity. As much as we like seaborn (and its next-generation API looks very cool), interactivity is great! Then again, judging from the source code, a lot of thought has been put into clustermap, so there might be other reasons to use it.
Set up
Imports
import altair as alt
import nltk
from nltk.corpus import inaugural
import pandas as pd
# from scipy.spatial.distance import pdist
from scipy.cluster import hierarchy
Data
Let us quickly get the presidential addresses from nltk, and then compute the pairwise Jaccard similarity on the 5-shingles¹.
1 Shingles are substrings of a certain length, created by passing a moving window over the text; for example, the 5-shingles of “heatmap” are “heatm”, “eatma” and “atmap”. This choice of representation is common when working with web data, and can deal with misspellings and the sequential nature of text (to some extent). See e.g. (Schütze, Manning, and Raghavan 2008, sec. 19.6).
First we download and import the data.
"inaugural") nltk.download(
= inaugural.fileids()
ids = [
data
{"id": i,
"year": id.split("-")[0],
"president": (id.split("-")[1]).split(".")[0],
"text": inaugural.raw(fileids=id).lower(),
}for i, id in enumerate(ids)
]= pd.DataFrame(data) df
Then we shingle the text.
def get_shingles(x, size=5):
    x = x + (size * " ")
    shingles = [x[i : i + size] for i in range(0, len(x) - size)]
    return shingles


df["shingles"] = df["text"].apply(get_shingles)
And finally we can compute the Jaccard similarity.
def get_similarity(x, y, precision=3):
    a = set(x)
    b = set(y)
    return round(len(a.intersection(b)) / len(a.union(b)), precision)


df_pairs = df.copy()
df_pairs["key"] = 0
df_pairs = df_pairs.merge(df_pairs, on="key").drop(columns=["key"])

df_pairs["similarity"] = df_pairs.apply(
    lambda row: get_similarity(row["shingles_x"], row["shingles_y"]), axis=1
)

df_pairs.drop(columns=["text_x", "text_y", "shingles_x", "shingles_y"], inplace=True)
df_pairs.head()
|   | id_x | year_x | president_x | id_y | year_y | president_y | similarity |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1789 | Washington | 0 | 1789 | Washington | 1.000 |
| 1 | 0 | 1789 | Washington | 1 | 1793 | Washington | 0.071 |
| 2 | 0 | 1789 | Washington | 2 | 1797 | Adams | 0.231 |
| 3 | 0 | 1789 | Washington | 3 | 1801 | Jefferson | 0.219 |
| 4 | 0 | 1789 | Washington | 4 | 1805 | Jefferson | 0.218 |
Clustering
Clustering is straightforward. We are after the “optimal” ordering, i.e. the reordering that places similar observations close to each other.
mat_pairs = df_pairs.pivot(index="id_x", columns="id_y", values="similarity").to_numpy()

Z = hierarchy.linkage(mat_pairs, optimal_ordering=True)

reordering = hierarchy.leaves_list(Z)
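As a quick sanity check (not part of the original walkthrough), the leaf order can be mapped back to the addresses: since the ids coincide with the row positions of mat_pairs, indexing df with reordering shows which speeches end up next to each other.

# inspect the addresses in leaf (cluster) order; purely illustrative
df.loc[reordering, ["year", "president"]].head(10)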
Plotting
By date
Note that our indices are actually ordered by date. This might be an interesting dimension in itself, and plotting the similarities by date might reveal some insight from the data.
w = 500
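The chart code itself is not shown above, so here is a minimal sketch of how the by-date heatmap could be built with altair; the name heatmap_by_date and the exact encodings are assumptions, with w used as the chart size.

# a sketch, assuming the columns produced above (id_x, id_y, similarity, ...)
heatmap_by_date = (
    alt.Chart(df_pairs)
    .mark_rect()
    .encode(
        x=alt.X("id_x:O", title="address"),
        y=alt.Y("id_y:O", title="address"),
        color=alt.Color("similarity:Q"),
        tooltip=["president_x", "year_x", "president_y", "year_y", "similarity"],
    )
    .properties(width=w, height=w)
)
heatmap_by_date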
Indeed, it seems like there are some interesting groups of two or three consecutive terms/presidents with similar inaugural speeches. One could probably have fun checking political party and second terms in office.
Often, however, indices are randomly assigned, in which case there is never a good reason to order by the original indices when plotting.
By distance
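The original chart for this section is likewise not included above, so the following is again a sketch rather than the post's actual code: the same heatmap, but with both axes sorted by the leaf order from the clustering (order and heatmap_by_distance are made-up names).

# reuse the leaf order as a custom sort for both axes; purely illustrative
order = [int(i) for i in reordering]

heatmap_by_distance = (
    alt.Chart(df_pairs)
    .mark_rect()
    .encode(
        x=alt.X("id_x:O", sort=order, title="address"),
        y=alt.Y("id_y:O", sort=order, title="address"),
        color=alt.Color("similarity:Q"),
        tooltip=["president_x", "year_x", "president_y", "year_y", "similarity"],
    )
    .properties(width=w, height=w)
)
heatmap_by_distance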
Unfortunately, with our arbitrary choice of data, the reordering doesn’t seem to add much. However, we can still see a single main cluster of similar values that might deserve further inspection and reveal something, as well as at least one other smaller and less homogeneous cluster.