Can Neural Nets do Meaning?
The pandemic has been hard on many of us. It has been a long time since I traveled to an in-person conference, or blogged about my experiences. The plan is to create a blog post for each of the three days, but let's see; I am a little out of practice. Today I concentrate on the invited panel on computational semantics. There were other talks in the main session today, but they will have to wait for another blog post.
The day started with a panel discussion on computational semantics. See the listing on the programme here. The three invited speakers, it turned out, had different research goals, which was interesting, and I wonder how representative that is of the field. The question I posed to the panel after their (all very interesting) talks was whether they considered themselves to be pursuing the goal of making computers perform better on language-related tasks because it would lead to better functionality in various applications, or whether they were interested in modeling meaning tasks computationally in order to understand the human mind better. Marie-Catherine de Marneffe said she was unequivocally in the former camp, Aaron White in the latter, while Ellie Pavlick was somewhere in transition: she started off being more interested in the former kinds of problems but was getting increasingly interested in the latter.
De Marneffe was interested in getting computers to perform in a human-like way with respect to judgements about speaker commitment to the truth of certain embedded propositions. As is well known, the new deep learning systems, trained on mountains of data (available for languages like English), end up doing stunningly well on standard benchmarks for performance. The speaker commitment judgement is no different: performance is strikingly good. The neural network is given simple parametric information about the lexical embedding verb (whether it is factive, or whether it lexically entails speaker commitment in principle), but is also exposed to the distributional data, since linguistic context such as the presence of negation and other embeddings is necessary to make the judgements in question. It turns out that these kinds of neural networks perform extremely well on, for example, neg-raising contexts, generating human-equivalent judgements for sentences like
I don’t think he is coming.
However, there are a few kinds of sentence where the neural networks fail spectacularly. These are instructive. Two examples from the talk are given below, with the clause for which speaker commitment judgement fails shown as underlined.
(1) I have made many staff plans in my life and I do not believe I am being boastful if I say that very few of them needed amendment.
(2) I was convinced that they would fetch up at the house, but it appears that I was mistaken.
De Marneffe pointed out these examples and speculated that the problem for the neural nets is pragmatics and/or real-world knowledge. (2) is striking because even the smallest, most ignorant child would get this one right, so it seems to show that whatever the neural net is doing, it really is not doing anything remotely human-like. Maybe having a real embodied life and connections to truth in the world is necessary to fix (2). But the problem with (1) seems to me to be not so much about pragmatics as about embedding and hierarchical structure, which the neural net simply is not tracking or using as part of its calculation. Personally, I think the ‘problem’ with pragmatics, in terms of inferential strategies, is overstated. I am pretty sure you can teach neural nets some inferential algorithms, but compositional structure and real grounding for meaning both seem to be the real sticking points. We only see this, though, in cases where the linear-distance co-occurrence data is uninformative about the actual meaning. It is sobering to notice how seldom those cases actually come up, and how often the simplistic heuristic delivers as a proxy for the more complex reality. How worried you are about the existence of these examples really depends on which of the two goals outlined above you are pursuing: better performance in applications, or a better understanding of the human mind.
With regard to Being in the World, Ellie Pavlick presented her work on trying to teach meaning grounding to neural nets, as a way of probing whether training on the physical properties of events denoted by motion verbs would help in acquiring the right behaviours and underlying representations. The evidence seems to be that modest gains in performance are indeed possible in certain domains based on this kind of training. But here one wonders whether we can follow up those gains in all other domains without fully recreating the learning environment of the child in all its gory and glorious detail. The reductio of this approach would be a situation where you require so much data and nuance that it would be impossible to construct short of birthing your own small human and nurturing it in the world for five years. As Ellie rightly pointed out in discussion, however, the great advantage and excitement of being able to program and manipulate these neural nets lies in the controlled experiments you can do on the information you feed them, and in how you can selectively interrogate the representations of a successful model to try to come up with a decomposition of a complex effect, which might in the end be relevant to understanding the cognitive decomposition of the effect in humans.
Aaron White’s talk was on an experiment in training a neural net to match acceptability ratings, leading to the induction of a type structure for different constructions. The basic model was a combinatory categorial grammar with standard basic types and modes of combination. The intermediate interchange format was vector space representations, which are flexible and don’t require prejudging the syntax or the compositional semantics. The point of the training is to see what gets induced when you try to create a system that best predicts the behavioural data. The test case presented was clausal embedding, and peering under the hood afterwards, we can ask what kinds of ‘types’ were assigned to clausal complements of different varieties, and with different embedding verbs. The types induced for clausal complements were very varied and not always comprehensible. Some seemed to make sense if you were thinking in Inquisitive Semantics terms, but others were harder to motivate. All in all, it seems like the job of interpreting why the model came up with what it did is as hard as the original problem, and moreover bears an ill-understood and equally complicated relationship to the original problem of how humans ‘do’ meaning composition. There are a lot of details that I clearly do not understand here.
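For readers who have not met combinatory categorial grammar before, here is a minimal sketch of what "basic types and modes of combination" means in the standard symbolic setting. This is not the model from the talk, which as I understood it works with vector space representations and induces its types from data. The class names, the two application rules shown, and the textbook-style category (S\NP)/S for a clause-embedding verb are my own illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """A basic type such as S (sentence) or NP (noun phrase)."""
    name: str


@dataclass(frozen=True)
class Slash:
    """A function type: result/arg looks right for arg, result\\arg looks left."""
    result: "Cat"
    arg: "Cat"
    direction: str  # "/" or "\\"


Cat = Union[Atom, Slash]


def forward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward application: X/Y followed by Y yields X."""
    if isinstance(left, Slash) and left.direction == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Backward application: Y followed by X\\Y yields X."""
    if isinstance(right, Slash) and right.direction == "\\" and right.arg == left:
        return right.result
    return None


S, NP = Atom("S"), Atom("NP")

# Hypothetical lexical entry (an assumption for illustration only):
# a clause-embedding verb like "think" typed as (S\NP)/S.
THINK = Slash(Slash(S, NP, "\\"), S, "/")

vp = forward_apply(THINK, S)        # verb + embedded clause gives S\NP
sentence = backward_apply(NP, vp)   # subject + verb phrase gives S
```

In a hand-built grammar like this the analyst decides in advance that the complement of "think" has type S. The interest of the experiment described above, as far as I can tell, is that the type of the clausal complement is left open and read off the trained model afterwards.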
All in all, it was a fascinating panel raising a lot of big-picture issues in my own mind. But I come away with the suspicion that while BERT and his descendants are getting better and better at performing, their success is the equivalent of getting the answer 42 to the meaning of Life, the Universe and Everything. It still does not help if we don’t know what exactly their version of the question was.