!pip install -r requirements.txt
Heatmaps done right
A note on clustering distance matrices before plotting. Think sns.clustermap using altair.
Given a distance matrix, i.e. a symmetric matrix of distances between observations, if the indices of the observations are arbitrary, or more generally there is no variable by which we want to order the observations, then plotting it as a heatmap using the default ordering of the indices is often not very useful: we get a heatmap whose rows and columns are ordered arbitrarily, and the plot itself may be hard to interpret. In many cases, a better approach is to cluster the observations and use the resulting ordering. This leads to a heatmap in which similar observations are grouped together, which is what sns.clustermap offers, as seen in the docs.
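For reference, the seaborn route is essentially a one-liner. The snippet below is only an illustration: dist_df is a placeholder name for a square DataFrame of pairwise distances, not something defined in this post.

import seaborn as sns

# dist_df: a hypothetical square DataFrame of pairwise distances
# clustermap clusters rows and columns, then draws the reordered heatmap in one call
sns.clustermap(dist_df)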
Question: Is there another, better way to re-order the indices? This probably depends on the context and the meaning of the heatmap, but anything purely algebraic would probably not work, as the before and after objects are ultimately different matrices.
Note that the data itself is not the focus here; we just need a distance matrix in order to show how it can be reordered. We will use the USA presidential inaugural speeches and consider distances between texts, walk through the clustering, and then do the plotting using altair for interactivity. As much as we like seaborn (and its next-generation API looks very cool), interactivity is great! Then again, judging from the source code, a lot of thought has been put into clustermap, so there might be other reasons to use it.
Set up
Imports
import altair as alt
import nltk
from nltk.corpus import inaugural
import pandas as pd
# from scipy.spatial.distance import pdist
from scipy.cluster import hierarchy
Data
Let us quickly get the presidential addresses from nltk, and then compute the pairwise Jaccard similarity on the 5-shingles¹.
1 Shingles are substrings of a certain length, created by passing a moving window over the text; for example, the 5-shingles of “heatmap” are “heatm”, “eatma” and “atmap”. This choice of representation is common when working with web data, and can deal with misspellings and the sequential nature of text (to some extent). See e.g. (Schütze, Manning, and Raghavan 2008, sec. 19.6).
First we download and import the data.
"inaugural") nltk.download(
= inaugural.fileids()
ids = [
data
{"id": i,
"year": id.split("-")[0],
"president": (id.split("-")[1]).split(".")[0],
"text": inaugural.raw(fileids=id).lower(),
}for i, id in enumerate(ids)
]= pd.DataFrame(data) df
Then we shingle the text.
def get_shingles(x, size=5):
    x = x + (size * " ")
    shingles = [x[i : i + size] for i in range(0, len(x) - size)]
    return shingles


df["shingles"] = df["text"].apply(get_shingles)
And finally we can compute the Jaccard similarity.
def get_similarity(x, y, precision=3):
    a = set(x)
    b = set(y)
    return round(len(a.intersection(b)) / len(a.union(b)), precision)


df_pairs = df.copy()
df_pairs["key"] = 0
df_pairs = df_pairs.merge(df_pairs, on="key").drop(columns=["key"])

df_pairs["similarity"] = df_pairs.apply(
    lambda row: get_similarity(row["shingles_x"], row["shingles_y"]), axis=1
)

df_pairs.drop(columns=["text_x", "text_y", "shingles_x", "shingles_y"], inplace=True)
df_pairs.head()
|   | id_x | year_x | president_x | id_y | year_y | president_y | similarity |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1789 | Washington | 0 | 1789 | Washington | 1.000 |
| 1 | 0 | 1789 | Washington | 1 | 1793 | Washington | 0.071 |
| 2 | 0 | 1789 | Washington | 2 | 1797 | Adams | 0.231 |
| 3 | 0 | 1789 | Washington | 3 | 1801 | Jefferson | 0.219 |
| 4 | 0 | 1789 | Washington | 4 | 1805 | Jefferson | 0.218 |
Clustering
Clustering is straightforward. We are after the “optimal” ordering, i.e. the reordering that places similar observations close to each other.
mat_pairs = df_pairs.pivot(index="id_x", columns="id_y", values="similarity").to_numpy()

Z = hierarchy.linkage(mat_pairs, optimal_ordering=True)

reordering = hierarchy.leaves_list(Z)
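As a quick sanity check (not part of the original walkthrough), the leaf order can be mapped back to the addresses: since the ids coincide with the row positions of mat_pairs, indexing df with reordering shows which speeches end up next to each other.

# inspect the addresses in leaf (cluster) order; purely illustrative
df.loc[reordering, ["year", "president"]].head(10)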
Plotting
By date
Note that our indices are actually ordered by date. This might be an interesting dimension in itself, and plotting the similarities by date might reveal some insight from the data.
w = 500
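The chart code itself is not shown above, so here is a minimal sketch of how the by-date heatmap could be built with altair; the name heatmap_by_date and the exact encodings are assumptions, with w used as the chart size.

# a sketch, assuming the columns produced above (id_x, id_y, similarity, ...)
heatmap_by_date = (
    alt.Chart(df_pairs)
    .mark_rect()
    .encode(
        x=alt.X("id_x:O", title="address"),
        y=alt.Y("id_y:O", title="address"),
        color=alt.Color("similarity:Q"),
        tooltip=["president_x", "year_x", "president_y", "year_y", "similarity"],
    )
    .properties(width=w, height=w)
)
heatmap_by_date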
Indeed, it seems like there are some interesting groups of two or three consecutive terms/presidents with similar inaugural speeches. One could probably have fun checking political party and second terms in office.
Often, however, indices are randomly assigned, in which case there is never a good reason to order by the original indices when plotting.
By distance
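The original chart for this section is likewise not included above, so the following is again a sketch rather than the post's actual code: the same heatmap, but with both axes sorted by the leaf order from the clustering (order and heatmap_by_distance are made-up names).

# reuse the leaf order as a custom sort for both axes; purely illustrative
order = [int(i) for i in reordering]

heatmap_by_distance = (
    alt.Chart(df_pairs)
    .mark_rect()
    .encode(
        x=alt.X("id_x:O", sort=order, title="address"),
        y=alt.Y("id_y:O", sort=order, title="address"),
        color=alt.Color("similarity:Q"),
        tooltip=["president_x", "year_x", "president_y", "year_y", "similarity"],
    )
    .properties(width=w, height=w)
)
heatmap_by_distance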
Unfortunately, with our arbitrary choice of data, the reordering doesn’t seem to add much. However, we can still see a single main cluster of similar values that might deserve further inspection and reveal something, as well as at least one other smaller and less homogeneous cluster.