Topology and You: What the Future of NLP has to do with Algebraic Topology
Updated: Jan 3, 2020
The size of NLP datasets has grown exponentially over the last few years. It was only a handful of years ago that 100-dimension (non-contextualized) word2vec [Mikolov et al.] embeddings were cutting edge.
As sizes continue to exponentially increase, it would be beneficial to be able to know what new information we are actually capturing rather than relying on "Oh well it improves results downstream." This interest is strongly validated by the recent work into DistilBERT [Sanh et al.]. Models cannot continue to grow at the exponential rate they currently are, so it seems advantageous to use feature engineering from classical statistics to enrich and enhance our modern approaches.
Traditionally, when datasets got too large or too wide, it is useful to utilize their high level feature set rather than every dimension. Techniques to tackle this problem range from PCA, to learning filters like those seen in CNNs.
A more experimental technique, although one that has been around for decades now, is persistent homology. As shown in [Michel et al.] enrichment via topological features do not aid in downstream performance improvements directly; but, as shown in [Zhu13] they can not only provide a set of constraints to existing word embedding approaches (potentially allowing in a dimension reduction), they can shown previously unseen patterns.
Consider the quote "For instance, a good essay may have a conclusion paragraph that “ties back” to the introduction paragraph. Thus the starting point and the ending point of the curve may be close in the space. If we further connect all points within some small diameter, the curve may become a loop with a hole in the middle."
Later on within the paper, the author conjectures that adolescent essay writing is significantly more structured than that of children. It would then follow that this "structure" would be directly observable in the topological features. As directly stated in the paper "... We thus create a third “adolescent truncated” data set, where we keep the first 11 sentences in each adolescent essay to match child writing. This perhaps removed many later tie-backs in the essays. The third column in Table 1, however, still shows some differences compared to child writing: more truncated adolescent essays contain holes (p = 0.03, Fisher’s test)."
From there, the author concludes that great classification results of a child's essay vs an adult's essay does not require most of the complex machinery seen in modern deep learning. Sufficiently adequate feature engineering is still an amazing tool for dealing with very high dimension data.
While the above article is a fascinating case study, the main issue is that it only applies first order homology. As we increase or homological order (e.g. we seek high dimension holes), the amount of computation grows exponentially. Realistically, it is only practical to compute 1st and 2nd order holes. That sucks!
How much does this suck? High order holes tell us insane amounts about the data we're dealing with! See [Reimann et al.]. In this paper, researchers aimed to correlate the topological features of a biological neural network to its function.
I quote "We found a remarkably high number and variety of high-dimensional directed cliques and cavities, which had not been seen before in neural networks, either biological or artificial, and in far greater numbers than those found in various null models of directed networks .... In fact, we found a hierarchy of correlated activity between neurons even within a single directed clique. During activity, many more high-dimensional directed cliques formed than would be expected from the number of active connections, further suggesting that correlated activity tends to bind neurons into high-dimensional active cliques."
As it stands right now, there is no easy way to integrate persistent homology with modern NLP. Computing higher order homology groups of word embeddings can often take days. Almost all implementations of filtration, the technique utilized in computing homology groups, lack integration with PyTorch and CUDA. Opening topological data analysis (TDA) to modern GPGPU techniques, as well as TPUs would usher a new era of data visualization and perhaps let us get a glimpse inside the blackbox that is DL-NLP.
Recently, PyTorch released "Tensor comprehensions" which aims to reduce the complexity of performing scientific computations in PyTorch. This is a promising step in the right direction.
We might find structures previously entirely unknown to the scientific community just sitting under our noses in BERT.
What does this have to do with narratives?
Everything! Consider some form of embedding where we take every plot point within a book, and embed some latent vector describing where the narrative can progress to next (given the narrative prior/scaffold). Similar embeddings to this have been seen in [Tambwekar et al.].
Following an identical technique as presented in [Zhu13], one can observe that first order holes correlate to a plot referring back to earlier parts of the story, namely expanding on prior developments (whether that is a plot twist or character developments is unknown as of right now).
The long term aim is that high order holes directly correlate with the complexity of the narrative, or namely how "intricate" the plot is. This metric is useful as a threshold when performing Monte-Carlo integration over a narrative to produce an incoherency score Or more precisely, a narrative is to be marked "incoherent" if its "incoherency score" is above the threshold that is a function of the narrative complexity.
I hope to prove this point over the coming months. Stay tuned for updates.
Aside: Rebuttals are more than welcome! I'll happily write a follow up piece to answer questions :)
Written by Louis C. Twitter
Co-authored by Arthur R. Twitter