Compositionality Decomposed: How do Neural Networks Generalise? (Extended Abstract)

Despite a multitude of empirical studies, little consensus exists on whether neural networks are able to generalise compositionally. As a response to this controversy, we present a set of tests that provide a bridge between, on the one hand, the vast amount of linguistic and philosophical theory about compositionality of language and, on the other, the successful neural models of language. We collect different interpretations of compositionality and translate them into five theoretically grounded tests for models that are formulated on a task-independent level. To demonstrate the usefulness of this evaluation paradigm, we instantiate these five tests on a highly compositional data set which we dub PCFG SET, apply the resulting tests to three popular sequence-to-sequence models, and provide an in-depth analysis of the results.


Introduction
Most current models of natural language processing use artificial neural networks (ANNs). The architectural design of such models is not motivated by knowledge about linguistics or human processing, but they are nevertheless more successful than earlier-generation (sub)symbolic models on a variety of natural language processing tasks. However, it remains difficult to assess whether the composition functions that ANNs implement are truly appropriate for natural language and, importantly, to what extent they are in line with the vast amount of knowledge and theories about semantic composition from formal semantics and (psycho)linguistics.
One particular question that has recently attracted the attention of several researchers is whether ANNs are capable of learning compositional solutions. Despite a multitude of empirical studies on this topic, little consensus exists. One issue standing in the way of more clarity on this matter is that different researchers have different interpretations of what exactly it means to say that a model is or is not compositional, a point exemplified by the vast number of different tests that exist for compositionality [Lake and Baroni, 2018; Hupkes et al., 2018; Johnson et al., 2017; Bahdanau et al., 2018; Saxton et al., 2019; Loula et al., 2018; Dessì and Baroni, 2019; Liška et al., 2018; Bowman et al., 2015; Mul and Zuidema, 2019]. We argue that to empirically test models for compositionality, it is necessary to first establish what is to be considered compositional behaviour.
With this work, we aim to contribute to clarity on this point, by presenting a study in which we collect different aspects of and intuitions about compositionality of language from linguistics and philosophy. We translate them into concrete tests that provide insight into the composition functions learned by neural models trained end-to-end on a downstream task.

Testing Compositionality
We individuate five aspects of compositionality that are explicitly motivated by theoretical literature on this topic.

Systematicity. The first property we test for is systematicity. The term was introduced by Fodor and Pylyshyn, who used it to denote that "[t]he ability to produce/understand some sentences is intrinsically connected to the ability to produce/understand certain others" [Fodor and Pylyshyn, 1988]. This ability concerns the recombination of known parts and rules: anyone who understands a number of complex expressions also understands other complex expressions that can be built up from the constituents and syntactical rules employed in the familiar expressions.
Here, we ask not only if a model infers a systematic solution, but also whether the rules and constituents the model uses are in line with what we believe to be the actual rules and constituents underlying a particular data set or language.We test for systematicity by testing if a model can recombine constituents that have not been seen together during training.
In particular, we focus on combinations of words a and b that meet two requirements: the model has been familiarised with a only in contexts excluding b and vice versa, yet the combination a b is plausible given the rest of the corpus.

Productivity. A notion closely related to systematicity is productivity, which concerns the open-ended nature of natural language: language appears to be infinite, but has to be stored with finite capacity. Hence, there must be some productive way to generate new sentences from this finite storage [Chomsky, 1956; von Humboldt, 1836].

Figure 2: The context-free grammar (non-terminal rules, unary functions F_U, and binary functions F_B) that describes the entire space of grammatical input sequences in PCFG SET (left) and the interpretation functions describing how the meaning of PCFG SET input sequences is formed (right).
Both systematicity and productivity rely on the recombination of known constituents into larger compounds. To separate systematicity from productivity, in our productivity test we specifically focus on the aspect of unboundedness. We test whether a model can understand sentences that are longer than the ones encountered during training.

Substitutivity. A principle closely related to the principle of compositionality (here, we consider the version of [Partee, 1995]) is the principle of substitutivity, which states that if an expression is altered by replacing one of its constituents with another constituent with the same meaning, this does not affect the meaning of the expression [Pagin, 2003].
We test for substitutivity by probing under which conditions a model considers two atomic units to be synonymous. To do so, we artificially introduce synonyms and consider how the prediction of a model changes when an atomic unit in an expression is replaced by its synonym. We consider two different cases. Firstly, we analyse the case in which synonymous words occur equally often and in comparable contexts. Secondly, we consider pairs of words in which one of the words occurs only in very short sentences, which we call primitive contexts.

Localism. The principle of compositionality does not impose any restrictions on how different elements should be combined. As a consequence, the interpretation of the principle of compositionality depends on the type of constraints that are put on the semantic and syntactic theories involved (see e.g. [Janssen, 1983; Zadrozny, 1994]). In global or weak compositionality, the meaning of an expression follows from its global structure and the meanings of its atomic parts. In this interpretation, a compound can have a different meaning depending on the larger expression that it is a part of (for some examples, see [Carnap, 1947]). Under local compositionality, by contrast, the meaning of a compound is computed from its immediate parts only and does not change depending on the context in which it appears.
We test if a model's composition operations are local or global by comparing the meanings the model assigns to standalone sequences to those it assigns to the same sequences when they are part of a larger compound. More specifically, we compare a model's output when it is given a composed sequence X, built up from two parts A and B, with the output the same model gives when it is forced to first separately process A and B in a local fashion.
Overgeneralisation. Lastly, we also include a notion that concerns the acquisition of the language by a model: we consider if models exhibit overgeneralisation when faced with non-compositional phenomena. Overgeneralisation is a language acquisition term, which refers to the scenario in which a language learner applies a general rule in a case that forms an exception to this rule. The relation of overgeneralisation with compositionality comes from the supposed evidence that overgeneralisation errors provide for the presence of symbolic rules in the human language system (see e.g. [Penke, 2012]). We follow this line of reasoning and take the application of a rule in a case where this is contradicted by the data as evidence that the model has in fact internalised this rule.
We propose an experimental setup in which a model's tendency to overgeneralise is evaluated by monitoring its behaviour on exceptions. We identify samples in the training data that do not adhere to the rules underlying the data distribution (exceptions) and assess a model's tendency to overgeneralise by observing how it responds to these exceptions during training.

Data
We consider an artificial task, which we dub PCFG SET.
Input sequences: syntax. The input alphabet of PCFG SET contains three types of words: words for unary and binary functions that represent string edit operations, elements to form the string sequences that these functions can be applied to, and a separator to separate the arguments of a binary function. The input sequences formed with this alphabet describe how a series of such operations are to be applied to a string argument. We generate input sequences with a PCFG, shown in Figure 2 (production probabilities are omitted).
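As a rough illustration of how such a generator works, the sketch below samples token sequences from a toy probabilistic grammar; the function names, production probabilities, and alphabet are illustrative placeholders, not the actual PCFG SET grammar of Figure 2.

```python
import random

# Toy probabilistic grammar in the spirit of the PCFG SET input grammar.
# Each non-terminal expands to one of several right-hand sides with a given
# probability; all names and probabilities here are illustrative placeholders.
GRAMMAR = {
    "S": [(["UNARY", "S"], 0.3), (["BINARY", "S", ",", "S"], 0.3), (["STR"], 0.4)],
    "UNARY": [(["echo"], 0.5), (["reverse"], 0.5)],
    "BINARY": [(["append"], 0.5), (["prepend"], 0.5)],
}
ALPHABET = [chr(c) for c in range(ord("A"), ord("K"))]  # toy string alphabet

def sample(symbol="S", max_arg_len=5):
    """Recursively expand a symbol into a flat sequence of input tokens."""
    if symbol == "STR":  # a string argument of bounded length
        return random.sample(ALPHABET, random.randint(1, max_arg_len))
    if symbol not in GRAMMAR:  # terminal token: a function name or ','
        return [symbol]
    rules, weights = zip(*GRAMMAR[symbol])
    rhs = random.choices(rules, weights=weights)[0]
    return [tok for sym in rhs for tok in sample(sym, max_arg_len)]

print(" ".join(sample()))  # e.g. "append echo D A , B C"
```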
Output sequences: semantics. The meaning of a PCFG SET input sequence is constructed by recursively applying the string edit operations specified in the sequence to their string arguments, following the interpretation functions shown in Figure 2.

Data construction. We use the probabilistic nature of the PCFG SET input grammar to enforce a distribution of lengths and parse tree depths that matches that of a more natural data set (WMT2017, [Bojar et al., 2017]). We set the size of the string alphabet to 520 and create a base corpus of around 100 thousand distinct input-output pairs, limiting the length of the string arguments given to the functions to 5. We use 85% of this corpus for training, 5% for validation and 10% for testing.
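To illustrate how such meanings are computed, below is a minimal sketch of a few possible interpretation functions; echo, append and prepend appear in the example of Figure 4, but their exact definitions here are assumptions made for illustration.

```python
# Minimal sketch of interpretation functions for a few PCFG SET operations.
# The definitions are assumptions made for illustration: echo repeats the
# last symbol, append and prepend concatenate their two string arguments in
# opposite orders; the full set of operations is listed in Figure 2.
def echo(x):       # echo A B C     -> A B C C
    return x + [x[-1]]

def reverse(x):    # reverse A B C  -> C B A
    return x[::-1]

def append(x, y):  # append A B , C -> A B C
    return x + y

def prepend(x, y): # prepend A B , C -> C A B
    return y + x

# The meaning of an input sequence is obtained by applying the interpretation
# functions recursively, e.g. for the sequence "echo append C , prepend B , A":
meaning = echo(append(["C"], prepend(["B"], ["A"])))
print(" ".join(meaning))  # -> "C A B B" under the assumed definitions
```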

Experiments and Results
We compare three currently popular neural architectures for sequence-to-sequence language processing tasks: a recurrent architecture (LSTMS2S) [Sutskever et al., 2014], a convolution-based architecture (ConvS2S) [Gehring et al., 2017] and a transformer model (Transformer) [Vaswani et al., 2017]. For every architecture, we train three models. All data, code and models are available online at https://github.com/i-machine-think/am-i-compositional. A summary of the results is shown in Table 1.

Systematicity
The task success results for PCFG SET (Table 1, row 1) already reflect whether models can recombine functions and input strings that were not seen together during training. In the systematicity test, we focus explicitly on models' ability to interpret pairs of functions that were never seen together during training. We select four pairs of functions to evaluate and redistribute the training and test data such that the training data does not contain any input sequences including these four pairs, while all sequences in the test data contain at least one of them.
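A minimal sketch of this redistribution, assuming samples are (input tokens, output tokens) pairs and using hypothetical held-out function pairs:

```python
# Sketch of the systematicity redistribution: every sequence containing both
# members of a held-out function pair is moved to the test set; all other
# sequences remain available for training. The pairs are hypothetical examples.
HELD_OUT_PAIRS = [("swap", "repeat"), ("append", "remove_second")]

def contains_pair(input_tokens, pair):
    return pair[0] in input_tokens and pair[1] in input_tokens

def redistribute(samples):
    """`samples` is a list of (input tokens, output tokens) pairs."""
    train, test = [], []
    for inp, out in samples:
        if any(contains_pair(inp, pair) for pair in HELD_OUT_PAIRS):
            test.append((inp, out))
        else:
            train.append((inp, out))
    return train, test
```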
Results. In line with the overall task accuracies, Transformer also obtains higher systematicity scores than both LSTMS2S and ConvS2S. Intriguingly, the systematicity scores of all models are substantially lower than their overall task accuracies. This large difference is surprising, since PCFG SET is constructed such that a high task accuracy requires systematic recombination. As such, these results serve as a reminder that models may find unexpected solutions, even when the data is very carefully constructed.

Productivity
Longer sequences are more difficult for all models, even if their length falls within the range of lengths observed during training (see Figure 3, red lines). With our productivity test, we test if this is due to an inherent difficulty of longer sequences or is related to models' inability to extrapolate to unseen lengths. We redistribute the train and test data such that there is no evidence at all for longer sequences in the training set: sequences containing up to eight functions are collected in the training set, while input sequences containing at least nine functions are used for evaluation.

Results. All models have great difficulty extrapolating to sequences that are longer than those seen during training. Figure 3 depicts the performance of the three models in relation to the length of the input sequences (blue lines), compared with the task accuracy on the standard PCFG SET test data for the same lengths. For all models, the productivity scores are lower for almost all sequence lengths. With the inherent difficulty of longer sequences factored out, we can conclude that this decrease in performance is solely caused by the decrease in evidence for such sequences and that models in fact struggle to productively generalise to longer sequences.
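Curves like those in Figure 3 can be obtained by grouping sequence accuracy by the number of functions in the input; a minimal sketch, assuming tokenised samples and a hypothetical predict function:

```python
from collections import defaultdict

FUNCTION_TOKENS = {"echo", "reverse", "append", "prepend"}  # illustrative subset

def accuracy_per_length(samples, predict):
    """Sequence accuracy grouped by the number of functions in the input.
    `predict` is a hypothetical callable from input tokens to output tokens."""
    correct, total = defaultdict(int), defaultdict(int)
    for inp, target in samples:
        n_functions = sum(tok in FUNCTION_TOKENS for tok in inp)
        total[n_functions] += 1
        correct[n_functions] += int(predict(inp) == target)
    return {n: correct[n] / total[n] for n in sorted(total)}
```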

Substitutivity
To test for substitutivity, we select two binary and two unary functions, for which we artificially introduce synonyms (F_syn) during training. The introduced synonyms have the same interpretation functions as the terms they substitute, and are thus semantically equivalent to their counterparts. We consider two different conditions that differ in the syntactic distribution of the synonyms in the training data.
In the first condition, we randomly replace half of the occurrences of the chosen functions F with F_syn, keeping the target constant. In this test, F and F_syn are distributionally similar. In the second, more difficult condition, we introduce F_syn only in primitive contexts, where F is the only function call in the input sequence. In this primitive condition, the function F and its synonymous counterpart F_syn are distributionally not equivalent. We evaluate models on how robust they are to the meaning-invariant synonym substitutions in the input sequence. We quantify this with a consistency score, which expresses the pairwise equality between the model's output before and after the synonym substitution.
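A minimal sketch of this consistency score, assuming a hypothetical predict function and a hypothetical mapping from functions to their introduced synonyms:

```python
SYNONYMS = {"echo": "echo_syn", "append": "append_syn"}  # hypothetical names

def substitute(tokens, synonyms=SYNONYMS):
    """Replace every occurrence of a function by its synonym."""
    return [synonyms.get(tok, tok) for tok in tokens]

def consistency_score(inputs, predict):
    """Fraction of inputs whose model output is unchanged when function tokens
    are replaced by their artificially introduced synonyms; `predict` is a
    hypothetical callable from input tokens to output tokens."""
    matches = [predict(inp) == predict(substitute(inp)) for inp in inputs]
    return sum(matches) / len(matches)
```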
Results. For the substitutivity experiment where words and synonyms are equally distributed, the scores of Transformer and ConvS2S are nearly on par. Furthermore, both architectures put words and their synonyms closely together in the embedding space (not shown). Surprisingly, even in this relatively simple condition where the words are distributionally identical, words and synonyms are at very distinct positions in the LSTMS2S embedding space.
In the primitive substitutivity test, all scores decrease substantially, although all models do still pick up that there is a similarity between a word and its synonym. This is reflected not only in the consistency scores but is also evident from the distances between words and their synonyms, which are substantially lower than the average distances to other function embeddings (not shown here). For LSTMS2S, the average distance is very comparable to the average distance observed in the equally distributed setup. Its consistency score, however, goes down substantially, indicating that word distances (computed between embeddings) give an incomplete picture of how well models can account for synonymity when there is a distributional imbalance.
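The embedding-space comparison referred to above amounts to measuring the distance between the learned embedding of a function and that of its synonym; a minimal sketch, assuming a hypothetical embedding lookup and cosine distance as the (assumed) metric:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def synonym_distances(embed, synonyms):
    """`embed` is a hypothetical lookup from a token to its embedding vector;
    `synonyms` maps each function to its artificially introduced synonym."""
    return {w: cosine_distance(embed(w), embed(s)) for w, s in synonyms.items()}
```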

Localism
We test for localism by considering models' behaviour when a subsequence in an input sequence is replaced with its meaning. More specifically, we compare the output sequence that a model generates for an input sequence with the output sequence that the same model generates when we explicitly unroll the processing of the input sequence (for an example, see Figure 4).

Results. None of the evaluated architectures obtains a high consistency score for this experiment. In a small manual analysis, we observe that the most common mistakes involve unrolled samples that contain function applications to string inputs with more than five characters.
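A minimal sketch of the unrolled evaluation described above, assuming inputs are available as nested (function, arguments) tuples and a hypothetical predict function that maps a flat token sequence to the model's output:

```python
def unrolled_predict(tree, predict):
    """Process an input bottom-up, querying the model separately for every
    function application. `tree` is a nested (function, argument, ...) tuple
    with plain token lists at the leaves; `predict` is a hypothetical callable
    from a flat input token sequence to the model's output token sequence."""
    if isinstance(tree, list):  # a literal string argument: already a meaning
        return tree
    function, *arguments = tree
    evaluated = [unrolled_predict(arg, predict) for arg in arguments]
    flat = [function]
    for i, arg in enumerate(evaluated):
        flat += ([","] if i > 0 else []) + arg
    return predict(flat)

# The localism (consistency) check compares predict(full_sequence) with the
# unrolled output, e.g. for ("echo", ("append", ["C"], ("prepend", ["B"], ["A"]))).
```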

Overgeneralisation
To test for overgeneralisation, we manually add exceptions to the data set. We select four pairs of functions that are assigned a new meaning when they appear together in an input sequence. We monitor the accuracy on both the original and the exception targets during training, and compare how often a model predicts the original, rule-based target instead of the exception target it was trained on.

Results. We test overgeneralisation for several different exception percentages, which indicate the fraction of a function's occurrences that is replaced by an exception. The results indicate that all architectures have a tendency to overgeneralise, but the degree of overgeneralisation depends strongly on the number of exceptions present in the data. All architectures show overgeneralisation behaviour for exception percentages lower than 0.5%, but hardly any overgeneralisation is observed when 0.5% of a function's occurrences are exceptions. When the percentage of exceptions becomes too low, all models have difficulties memorising them at all. LSTMS2S, in general, appears to find it difficult to accommodate both rules and exceptions at the same time.
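A minimal sketch of this monitoring step, assuming each exception sample stores both the exception target and the target the rule would have produced, and a hypothetical predict function:

```python
def overgeneralisation_rate(exceptions, predict):
    """`exceptions` is a list of (input, exception_target, rule_target) triples;
    `predict` is a hypothetical callable from input tokens to output tokens.
    The rates express how often the model memorises the exception versus how
    often it overgeneralises by producing the rule-based target instead."""
    memorised = overgeneralised = 0
    for inp, exception_target, rule_target in exceptions:
        prediction = predict(inp)
        memorised += int(prediction == exception_target)
        overgeneralised += int(prediction == rule_target)
    n = len(exceptions)
    return {"memorised": memorised / n, "overgeneralised": overgeneralised / n}
```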

Conclusion
We proposed an evaluation framework containing a series of tests that translate theoretical concepts related to compositionality of language into behavioural tests for models of language. Our evaluation framework contains five independent tests that consider complementary aspects of compositionality that are frequently mentioned in the literature. We instantiated the five tests on a compositional artificial data set we dub PCFG SET. This data set is designed such that modelling it adequately should require a compositional solution, and it is generated such that its length and depth distributions match those of a natural corpus of English. We compared three popular sequence-to-sequence architectures: an LSTM-based, a convolution-based and an all-attention model. For each test, we conducted a number of auxiliary tests that can be used to further increase the understanding of how this aspect is treated by a particular architecture.
While the overall accuracy on PCFG SET was relatively high for all models, a more detailed picture is given by the five compositionality tests, which indicate that, despite our careful data design, high scores still do not necessarily imply that the trained models fully represent the true underlying generative system. These results illustrate well that to test for compositionality in neural networks, it does not suffice to consider an accuracy score on a single downstream task, even if this task is designed to be highly compositional. As such, the results themselves demonstrate the need for the more extensive set of evaluation criteria that we aim to provide.

Figure 1: A schematic depiction of our five compositionality tests.

Figure 3: Task accuracy (in red) and productivity scores (in blue) of the three architectures as a function of the length of the input sequence.

Figure 4: An example of the unrolled computation of the meaning of the sequence echo append C , prepend B , A for the localism test.

Table 1: General task accuracy and performance per test for PCFG SET, averaged over three runs. Two performance measures are used: sequence accuracy, indicated by *, and consistency score, indicated by †.