Paper, Tags: #nlp
Overview of the history of computational linguistics, the language problems and the development of neural networks.
In this paper we identify contributions that the nature of language has made to the development of neural network architectures. We focus on the importance of variable binding and its instantiation in attention-based models, and argue that Transformer is not a sequence model but an induced-structure model.
Attention-based models, which replace hand-coded structures with induced structures, represent language with multiple levels of structured representations. We identify remaining challenges in learning language from data, and its possible limitations.
The ability to learn powerful vector-space representations from data led man yconnectionist to argue that the complex discrete structured representations o fcomputational linguistics were neither necessary nor desirable. Learning from data made linguistic theories irrelevant.
Although researchers in computational linguistics did not want to abandon their representations, they did recognise the importance of learning from data. The first successes in this direction came from learning rules with statistical methods, such as part-of-speech tagging with hidden Markov models. For syntactic parsing, the development of the Penn Treebank led to many statistical models whic hlearned the rules of grammar. These statistical methods still required linguistically-motivated designs to work well, in particular feature engineering.
Results demonstrate the incredible effectiveness of inducing vector-space representations with neural networks, relieving us from the need to do feature engineering. But neural networks do not relieve us of the need to understand the nature of language when designing our models. Instead of feature engineering, these results show that the best accuracy is achieved by engineering the inductive bias of deep learning models through their model structure.
With attention-based models, the model structure can now be learned. BERT representations also encode syntactically-relevant relations in pairs of token embeddings.
The great success of neural networks in NLP has not been because they are radically different from pre-neural computational theories of language, but because they have succeeded in replacing hand-coded components of those models with learned components which are specifically designed to capture the same generalisations.
We predict that there is at least one more hand-coded aspect of these models which can be learned from data, but question whether they all can be. Transformer can learn representations of entities and their relations, but current work (to the best of our knowledge) all assumes that the set of entities is a predefined function of the text. Given a sentence, a Transformer does not learn how many vectors it should use to represent it. The number of positions in the input sequence is given, and the number of token embeddings is the same as the number of input positions.
There are models which induce the set of entities in an input text, but these are (to the best of our knowledge) not learned jointly with a downstream deep learning model. Common examples include BPE and unigram LM. The resulting subwords become the entities for a DL model, but they do not explicitly optimize the performance of this downstream model.
Successful deep learning architectures for natural language currently still have many handcoded aspects. The levels of representation are hand-coded, based on linguistic theory or available resources. Often deep learning models only address one level at a time, whereas a full model would involve levels ranging from the perceptual input to logical reasoning. Even within a given level, the set of entities is a pre-defined function of the text.
This analysis suggests that an important next step in deep learning architectures for natural language understanding will be the induction of entities.