Joint Representations of Connectionism vs. Symbolism via Attractor Networks and Self Attention
Updated: Apr 10, 2020
In most modern fields of cognitive computing that pertain to natural language processing, both symbolism and connectionism have their own benefits. Symbolism is the idea that most cognitive processes can be computed as the manipulation of symbols. In connectionism, by contrast, computation is done by a metaphorical “black box” in which representation and computation are spread over many, potentially millions of, nodes in a highly complex network. In this essay, I present the hypothesis that large-scale connectionism lends itself well to abstract symbolism. That is to say, highly connected networks are in some regard performing symbolic computation, though perhaps not in a form we are familiar with. I will review work by the Blue Brain Project, an EPFL initiative launched in collaboration with IBM, in hopes of showing that modern deep learning is indeed creating some notion of symbolism via its architectures: it is the topology of the connectome that encodes some notion of symbols, rather than the network working with symbols directly. This idea is highly prevalent in neural coding and is often discussed in relation to Professor Eliasmith’s work with CNRG. I will conclude by showing that these notions of symbolism and connectionism lead to identical representations.
Brief Discussion on Topology
Discussing connectionism without a formal definition of connections makes little sense. As such, I will give a brief summary of the related concepts from graph theory, as well as what it means to discuss the structure of a graph as it pertains to an information flow model, like that seen in biological and artificial neural networks.
A few basics
Neural Networks and Topology
From this point on, W_V will be referred to as the “activation space” and W_E will be referred to as the “parameter space.” These names follow directly from their meanings within the realm of information flow models. The simplest non-trivial neural network is a multi-layer perceptron, or MLP for short. In an MLP, every neuron of a layer connects to all neurons of the next layer, and this relationship holds for every neuron in the network. The connections of such a graph are not really of interest in most symbolic cases, as vanilla MLPs contain no cliques larger than a single edge. Moreover, due to the lack of recurrent connections, a vanilla MLP cannot effectively store short-term information. This follows from the fact that, per chaos theory, a recurrent parameter must be present for an attractor point to exist, and it is consistent with Eliasmith’s own work on attractor networks, where a feed-forward attractor network requires a bidirectional connection. [Eliasmith2005] However, a sufficiently large MLP does have short-term memory if one effectively partitions the activation space, as seen in techniques akin to multi-head attention. This, in the context of a network called a transformer, will be the focus of our discussions. [Vaswani2017]
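The structural claim above can be checked on a toy graph. The following is a minimal sketch, assuming the MLP is viewed as an undirected graph with one vertex per neuron; the helper names (`mlp_edges`, `clique_edges`, `count_triangles`) and the layer sizes are hypothetical and chosen only for illustration. It counts 3-cliques (triangles), the smallest cliques beyond a single edge.

```python
from itertools import combinations

def mlp_edges(layer_sizes):
    """Undirected edges of a vanilla MLP: each neuron connects to every
    neuron in the next layer and to nothing else."""
    edges = set()
    offset = 0
    for size, next_size in zip(layer_sizes, layer_sizes[1:]):
        for i in range(size):
            for j in range(next_size):
                edges.add((offset + i, offset + size + j))
        offset += size
    return edges

def clique_edges(n):
    """Edges of an n-clique (all-to-all connectivity)."""
    return set(combinations(range(n), 2))

def count_triangles(n_vertices, edges):
    """Count 3-cliques by brute force; fine for toy graphs."""
    adjacent = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    return sum(
        1
        for u, v, w in combinations(range(n_vertices), 3)
        if v in adjacent[u] and w in adjacent[u] and w in adjacent[v]
    )

# Hypothetical layer sizes: a 3-4-2 MLP has 9 neurons.
mlp_triangles = count_triangles(9, mlp_edges([3, 4, 2]))
clique_triangles = count_triangles(4, clique_edges(4))
```

Because edges only join adjacent layers, any three neurons include at least one unconnected pair, so the MLP has no triangles, while even a 4-clique already contains four.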
For all practical intents and purposes, an attractor network is a weighted n-clique, where n determines the number of bits our attractor network stores. Assuming that our parameter space is kept constant, an attractor network stores a plethora of attractor points, which can be reached by changing the activation space and waiting for convergence, assuming perfect efficiency of neurons.
A key idea with attractor networks, and the reason they are of great use, has to do with the geometry of their activation space. Similar to the cell state of a recurrent neural network, the geometry of such a space consists of many nontrivial manifolds that intersect at various points. After the first iteration, an attractor network more or less chooses which sub-manifold it will rest on. Each sub-manifold has some number of attractor points that the attractor network then slowly converges towards.
The main take-away is that the sub-manifold is selected as a function of the initialized activation values. In that regard, the attractor network is performing a rule based reduction as it slowly approaches a discrete steady state of that sub-manifold.
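The convergence behaviour described above can be sketched with a Hopfield network, one classical attractor network. This is an illustrative assumption: the essay does not name a specific model, and the Hebbian storage rule, the asynchronous sign update, and the helper names `train_hopfield` and `recall` are choices made for this sketch only. The weights (parameter space) stay fixed, and only the initial activations decide which stored pattern the network settles into.

```python
def train_hopfield(patterns):
    """Hebbian weights for an n-clique with no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights

def recall(weights, state, max_sweeps=20):
    """Asynchronously update each unit until no unit changes."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break  # reached a fixed point (an attractor)
    return state

# Two orthogonal 8-bit patterns stored in the same fixed weights.
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([pattern_a, pattern_b])

# Flip the first bit of each pattern to create noisy initial activations.
noisy_a = [-pattern_a[0]] + pattern_a[1:]
noisy_b = [-pattern_b[0]] + pattern_b[1:]
recovered_a = recall(weights, noisy_a)
recovered_b = recall(weights, noisy_b)
```

Each noisy start falls back to the stored pattern it was closest to, which is the sense in which the initialization selects the region of activation space the network rests in.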
Before DNNs, some of the best natural language artificial intelligence systems relied on heavy knowledge engineering and rule-based systems. We will consider the case of SysTran, a rule-based translation model from the early 1970s constructed to translate German to English (and vice versa). In SysTran, an expert would engineer transformation rules between German and English. Given a large enough set of rules, one could “effectively” translate between the two languages. Because of the symbolic nature of language, this does indeed hold true; in practice, however, getting near-flawless translations, a claim only recent language models can make, would require a near-infinite number of rules. Language, having a discrete structure, lends itself best to two possible optimization tasks: rule-based optimization and integer programming. Hence, it would make sense if translation often included rule-based learning or symbolic manipulation. So, why have modern language models entirely forgone rule-based learning?
Rules in DNNs.
Recall Paul Thagard’s definition of spreading activation concept maps, which we shall abbreviate as SACM. A SACM is a weighted graph, (V, E, W_E, W_V), where every vertex has a symbol that represents a concept. [Thagard2005] Edges in a SACM represent some relationship between concepts; whether it is correlation, causality, or similarity is up to the implementation. Our activation space is either binary values or scalars in the range [0,1], depending on our logic system. An alternative representation of such a model comes from drawing analogies to Bayesian networks. Professor Judea Pearl likened SACMs to Bayesian cognitive models in his book "Probabilistic reasoning in intelligent systems" [Pearl1998]. Rather than vertices directly representing binary symbols or scalars, vertices are functions with a range over random variables. That is to say, every vertex is a random variable that is conditioned on its input. Thus, assuming the parameter space of a SACM is constant, we can express this as
Where v is, WLOG, our selected vertex, RV is a set of random variables, S is some state space of our selected vertex, and f(.) is a function that takes a vertex and determines its state. That is to say, depending on the weights of the neighbouring vertices, our vertex conditions its associated random variable on said prior. However, NNs do not operate directly over random variables. For the sake of brevity, rather than constructing lifts directly, I denote the following.
This actually correlates quite well with Linderman’s 2017 paper on recurrent switching dynamics [Slinderman2017]: given some neuron and its respective input, the neuron expresses its output as a Markov chain, which must be sampled an adequate number of times to find the true post-synaptic value. If one considers that a neuron can perhaps only recall or associate at most countably many Markov chains, an assumption that lends itself well to Bayesian learning via softmax or stick-breaking, then these chains define a relationship between the pre-synaptic prior and some discrete state of the neuron. This latent discrete state arising from a smooth process is what we need for this formalization. A similar latent discrete state exists in attractor networks, as we will soon see.
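To make the SACM recalled at the start of this section more concrete, here is a minimal sketch of spreading activation over a small weighted concept graph with scalar activations in [0,1]. The update rule (decay plus weighted incoming activation, clipped to [0,1]), the function name `spread`, and the example concepts and weights are all assumptions made for illustration; they are not taken from [Thagard2005].

```python
def spread(edge_weights, activation, steps=5, decay=0.5):
    """Spread activation along directed, weighted edges.

    edge_weights maps (source, target) to a weight (the parameter space);
    activation maps each concept to a value in [0, 1] (the activation space).
    """
    current = dict(activation)
    for _ in range(steps):
        updated = {}
        for vertex in current:
            incoming = sum(
                weight * current[source]
                for (source, target), weight in edge_weights.items()
                if target == vertex
            )
            # Assumed rule: decay the old value, add input, clip to [0, 1].
            updated[vertex] = min(1.0, max(0.0, decay * current[vertex] + incoming))
        current = updated
    return current

# Hypothetical concept map: only "dog" is active initially.
edge_weights = {
    ("dog", "animal"): 0.8,
    ("animal", "dog"): 0.3,
    ("dog", "bark"): 0.6,
    ("cat", "animal"): 0.8,
}
start = {"dog": 1.0, "cat": 0.0, "animal": 0.0, "bark": 0.0}
final = spread(edge_weights, start)
```

With the edge weights held constant, the final activations depend only on which concepts were switched on at the start: activation reaches "animal" and "bark" through "dog", while "cat", which has no incoming edges, stays inactive.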
As discussed earlier, self-attention mechanisms are attractor networks. That is to say, they are an n-clique, where n is the number of tokens. This attractor network is initialized to the word embeddings that we provide it. If one were to consider word embeddings as lying on a SACM, then the edges would refer to similarity and the edge weight would refer to the distance between two word vectors. Since such a SACM is all-to-all connected, our attractor network is a subgraph of this SACM, but only for the first layer of our neural network. Our attractor network takes all of the tokens we have provided it with and, using some rule that it has previously learned, provides a new set of symbols. This is repeated as many times as deemed necessary, until we take said set of tokens and convert it back to text. This behaviour can be seen in the generalized form of attention mechanisms. Namely, given q, K, V (query, key, value), the post-attention values of some tokens are
where the key defines some precondition, the query defines a prior on our rule, and the value defines the postcondition. Every layer has its own associated SACM, and as such can be thought of as doing its own respective step of symbolic reduction/manipulation. The edge weights of these SACMs are the attention weights between tokens.
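The attention step described above can be sketched in a few lines. This assumes the standard scaled dot-product form from [Vaswani2017], softmax(q K^T / sqrt(d_k)) V, for a single query; the toy vectors and the function names are hypothetical, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # How well each key (the precondition) matches the query.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)  # attention weights: the SACM edge weights
    dim = len(values[0])
    # The output mixes the values (the postconditions) by those weights.
    output = [
        sum(weight * value[c] for weight, value in zip(weights, values))
        for c in range(dim)
    ]
    return weights, output

# Toy example with three tokens in two dimensions.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, output = attention(query, keys, values)
```

The weights are positive and sum to one, so the output is a weighted average of the value vectors, with the largest weight going to the key most aligned with the query.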
After computing the attention scores, the network then moves every token (WLOG, considered for the ith token) embedding somewhere on the constrained cone defined by
where v_i refers to the ith token embedding on our current layer, n is the number of tokens, and the coordinate vector, x, lies in the unit simplex. This runs parallel to attractor networks, where every bit follows some trajectory in its respective sub-manifold and the query is defined as the (reduced) condition of its neighbours. Perhaps an alternative analogy is that the query value defines the sub-manifold(s) that the attractor currently lies in, while the key and value vectors define how to move around said sub-manifold(s). These sub-manifolds and attractor points are the discrete states discussed in the prior section. What is currently unknown is the relationship between attractors and self-attention, given the restriction of not allowing continuous-time simulation in ANNs. There are a few possible explanations of this:
Selecting the sub-manifold is enough to infer a discrete latent state, given that movement is restricted to unit convex combinations.
Due to dropout applied to attention mechanisms, adjacent attention layers act [relatively] inclusively on prior embeddings. Evidence for this can be seen in [Kovaleva2019], which notes the high cosine similarity of attention scores between layers.
Due to the noise reduction introduced by applying softmax rather than leaving connection weights unconstrained, convergence is much faster to reach as the oscillations seen in attractor networks are significantly dampened.
Attention mechanisms are polysemantic, meaning that they act equivariantly over all embeddings in an orbit. Evidence for this can be seen by how attention scores are defined relative to all input embeddings rather than a single embedding. Namely, the orbit is defined as the set of all vectors that one embedding can be exchanged for such that the attention scores remain relatively constant. The exploitation of this relationship might significantly aid in the convergence of self attention.
That being said, while self-attention networks are not repeated/recurrent/tied, large models like BERT have enough layers that this isn’t inherently a problem.
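The restriction to unit convex combinations and the damping argument above can be illustrated with a toy experiment. The sketch below makes strong simplifying assumptions that are not in the essay: a single head, identity query/key/value projections, and the same (tied) update applied repeatedly. Since every updated token is a convex combination of the current tokens, the largest pairwise distance between tokens cannot grow from one step to the next.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def self_attention_step(tokens):
    """One tied self-attention update with identity projections (assumed)."""
    d = len(tokens[0])
    updated = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        weights = softmax(scores)  # a point in the unit simplex
        updated.append([
            sum(weight * token[c] for weight, token in zip(weights, tokens))
            for c in range(d)
        ])
    return updated

def diameter(tokens):
    """Largest pairwise Euclidean distance between token vectors."""
    return max(math.dist(a, b) for a in tokens for b in tokens)

rng = random.Random(0)  # fixed seed so the run is reproducible
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
diameters = [diameter(tokens)]
for _ in range(10):
    tokens = self_attention_step(tokens)
    diameters.append(diameter(tokens))
```

The recorded diameters never increase, which is consistent with the claim that softmax-constrained weights suppress the oscillations seen in unconstrained attractor networks. It is a toy observation and not evidence about trained models such as BERT.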
I conjecture that self-attention mechanisms induce a pseudo-rule-based structure which is used to perform token reduction in modern language models, explaining part of their potency with symbolic manipulation like that seen in translation. Given Table 7 in "What does Attention in Neural Machine Translation Pay Attention to?" [Ghader2017], it is evident that at some level of abstraction there are rules being employed. Namely, nouns and verbs have significantly different attention vectors, sharing almost no major attention percentage attribute. Similarly, conjunctions point towards punctuation almost 30 percent of the time, not a proportion to scoff at. Furthermore, work on more traditional symbolic tasks, such as "Analyzing Mathematical Reasoning Abilities of Neural Models" [Saxton2019], suggests that while single-headed non-self-attention performs significantly worse than most vanilla models, self-attention performs significantly above the baseline, with over a 13 percent improvement. This implies that the pseudo-recurrent nature of self-attention, induced by in-place manipulation, might significantly aid in symbolic reduction.
Appendix: Biological Influences.
It has been known for quite a while [Riemann2017] that cliques of neurons are highly prevalent in the brain. [Riemann2017] takes slices of the neocortex, varying in layer depth from 4 to 6 layers, and shows that later layers have more n-order cavities than earlier layers for most large values of n. This differs drastically from randomly constructed graphs, which for the most part show a linearly decreasing relationship between layer depth and n-order cavities. A logical conclusion is that as the neural network lifts to more abstract representations, it can more easily perform complex logical operations by applying attractor networks of increasing complexity. Evidence that artificial neural networks are performing the same task can be seen in [Roy2020], where introducing sparsity into earlier self-attention mechanisms improves performance significantly, allowing the network to reach more complex abstractions before applying complex attractor networks.
Special thanks to:
Professor Chris Eliasmith @ uWaterloo, as this essay was written originally for his course.
Dr. Bastian Rieck @ ETH, for assisting in conjecturing why large oscillations are not seen in self-attention networks, unlike in biologically feasible attractor networks.
And many others.