Language Comprehension

Alan Garnham. Handbook of Cognition. Editors: Koen Lamberts & Robert L. Goldstone. Sage Publications, 2005.

Language permeates our lives. For most people, on most days, language comprehension and language production are central activities, not only because of the time they occupy, but also because of the role they play in people’s lives. For the most part we take our linguistic abilities for granted, though they may come to our attention when we encounter a word whose meaning we do not know or, more seriously, when we hear, or attempt to learn, a language with which we are not familiar. Indeed, people often show a remarkable lack of curiosity about the mental representations, mechanisms and processes that underlie their linguistic abilities. This chapter details some of the mental capacities on which the ability to understand language is based. For more detailed information about specific aspects of comprehension, see the following chapters by McQueen and by Pollatsek and Rayner.

For ordinary language users, comprehension is either listening or reading. Listening is the more basic form of comprehension. It is both phylogenetically and ontogenetically prior to reading, and it does not have to be taught. There are many special questions about reading, which are addressed in Chapter 12 by Pollatsek and Rayner. However, many of the processes of comprehension are assumed to be common to listening and reading, and much of the work reported in this chapter has used visual materials.

Comprehension is a process that unfolds in time. Speech is inherently temporal. Written and printed material is read in a sequential manner (see Pollatsek & Rayner, Chapter 12, this volume). Not only that, but as we shall see, there is no simple single process of comprehension that is applied to each part of a text or discourse, but rather a series of processes that themselves unfold through time and become less specifically tied to individual parts of the text.

Traditionally (e.g. Forster, 1979) these processes are divided into three groups: lexical, syntactic and discourse (message) level. The listener or reader has to identify the words being heard or read, to group them into phrases and clauses, and to use those groupings, together with other information, to determine their meaning. These processes are not necessarily wholly sequential when we consider the processing of a single stretch of a discourse. Indeed, the temporal (and logical) connections between these different processes are an empirical and unresolved question.

Word Identification

In order to characterize the process of language comprehension, we first need to consider the nature of the incoming stimulus. McQueen’s Chapter 11 deals with this problem for the case of speech. Important obstacles to identifying individual words in speech are the largely continuous nature of the speech signal (there are no breaks between most words) and the non-canonical pronunciation of words in context (e.g. ‘green’ might be pronounced ‘greem’ in a context such as ‘green mitten’). Printed materials do not present such problems, typos excepted. Spaces between words are clear, and canonical spelling is the norm. Handwriting is more complex, but the identification of handwritten words has been the subject of comparatively little research. Manso de Zuniga, Humphreys, and Evett (1991) suggest that the differences in the recognition of handwriting and printed text lie primarily in very early processes that ‘normalize’ the representation of handwritten letters, by eliminating differences from one occurrence of a letter to the next.

The identification of what words are present in a particular utterance or written sentence depends on the use of a mental store of knowledge about the words in the language one knows–the mental lexicon. This knowledge, as well as more general knowledge about the sound structure of words in the language, is also used to segment the continuous speech signal into words, as McQueen’s Chapter 11 explains. At a broad level, the mental lexicon contains the same kinds of information as an ordinary (printed) lexicon or dictionary: information about how a word is spelled, about its canonical pronunciation, about its part of speech and about its meaning. Exactly how this information is represented, and what the overall organization of the mental lexicon is, are, however, matters of debate. Is the pronunciation of a word encoded as a sequence of phonemes, for example? And what, if anything, corresponds to the alphabetic organization of entries in printed lexicons, which allows entries to be located systematically?

Finding words in printed dictionaries is usually a case of systematic search in a small part of the dictionary, found by using knowledge of the general principles of alphabetic organization, and page headers. Broadly similar search mechanisms have been proposed for locating items in the mental lexicon (Forster, 1976). However, they have generally been rejected in favour of detector models, in which detectors for individual words accrue evidence that their word has been presented until one is deemed to have won the ‘race’ among them: the detection of the word. Morton’s (1969) logogen model was an early model of this kind, in which the detectors were independent of each other, and were effectively watched over by a device that noted which one reported first that its word had been detected. More recently, interactive activation models (McClelland & Rumelhart, 1981) have been influential. In such models, detectors are linked together, with detectors of the same type (e.g. word detectors) inhibiting each other. The logic behind this mechanism is that any evidence that a particular word is such-and-such (say, ‘table’) can be thought of as indirect evidence that it is not any other word.
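For readers who find a computational sketch helpful, the competition between detectors can be rendered as a toy simulation. The words, parameters and update rule below are invented for illustration; this is a sketch of the general idea of lateral inhibition, not an implementation of Morton’s or McClelland and Rumelhart’s published models.

```python
# Toy sketch of competing word detectors with lateral inhibition.
# Each detector accrues bottom-up evidence, decays, and is
# inhibited in proportion to its competitors' total activation.

def step(acts, evidence, decay=0.1, inhibition=0.1):
    """One update cycle over all detectors."""
    total = sum(acts.values())
    return {w: max(0.0, a + evidence.get(w, 0.0)
                        - decay * a
                        - inhibition * (total - a))
            for w, a in acts.items()}

def recognize(evidence, words, threshold=1.0, max_steps=100):
    """Run the 'race' until one detector crosses threshold."""
    acts = {w: 0.0 for w in words}
    for _ in range(max_steps):
        acts = step(acts, evidence)
        winners = [w for w, a in acts.items() if a >= threshold]
        if winners:
            return max(winners, key=acts.get)
    return None  # no detector reported in time

# Ambiguous input favouring 'table' slightly over its neighbour:
print(recognize({'table': 0.2, 'cable': 0.15},
                ['table', 'cable', 'fable']))   # -> table
```

The key property of the sketch is the one described above: evidence for ‘table’ indirectly counts against ‘cable’ and ‘fable’, because their detectors are suppressed as ‘table’ gains activation.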

Interactive activation models are typically localist in their representations. They retain discrete detectors for individual words. Connectionist models with distributed representations became popular when learning procedures, such as ‘back propagation’, were developed for them (see Lamberts, Chapter 18, this volume). They have been proposed for word identification (e.g. Seidenberg & McClelland, 1989). However, after learning has taken place, the operation of such models can be difficult to understand, and it has been argued that, for this reason, they are unsatisfactory as models of cognitive processes (see, e.g., Grainger & Jacobs, 1998, for the particular case of word recognition).

Over the last thirty years a very large number of experiments on the recognition of individual words, and in particular individual printed words, has been reported. Early findings included the effects of length, frequency, degradation and context on identification. Short words are easier to recognize, as are common ones, and ones that are easy to see. Words are recognized more quickly in lexical semantic contexts (e.g. the word ‘butter’ is recognized faster if it is preceded by the related word ‘bread’ than if it is preceded by the unrelated word ‘brake’: Meyer & Schvaneveldt, 1971) and in discourse contexts (Schuberth & Eimas, 1977). Different models made different predictions about how these basic effects might interact with one another (Garnham, 1985). In the late 1970s and early 1980s, before the interactive activation model became dominant, findings related to these effects led to a series of proposals that were variants on or hybrids of search and (logogen-type) detector models (e.g. Becker, 1980; Norris, 1986).

English orthography bears a complex relation to sound, and another early discovery was that regular words (i.e. ones that follow the most common spelling-to-sound patterns) are easier to recognize than irregular words, other things being equal. A further finding (Glushko, 1979) was that consistency also has an effect. A regular spelling-to-sound pattern may also be consistent, if all words with that pattern are pronounced in the same way. Alternatively it may be inconsistent, if there is a minority of words with a different pronunciation (e.g. the ending ‘-int’ is regularly pronounced with a short ‘i’ as in ‘mint’, ‘hint’, ‘dint’, etc., but is not consistently pronounced that way, because of the inconsistent ‘pint’). This finding led to a more general consideration of the kind of neighbourhood in which a word found itself (orthographic and/or phonological) and of how that affected its recognition (e.g. Grainger, 1990). (See Pollatsek & Rayner, Chapter 12, this volume, for a more detailed discussion of the role of sound in the recognition of printed words.)

As findings on word identification became ever more complex, attention began to focus on the tasks used to study this process, and how they were related to it. Perhaps the most important, and at the same time the most controversial, task is the lexical decision task. In this task, subjects have to decide whether a letter string is a word of the language they know. Logically, the task can be performed simply by detecting whether the string has an entry in the mental lexicon. Whether this is actually how the task is performed is another question, and a cause of much of the controversy about the interpretation of results from lexical decision tasks. In particular, it is thought that the results of lexical decision experiments might be contaminated by factors that affect decision-making rather than lexical access.

Other important issues in word identification are the role of morphology and the processing of ambiguous words. Many English words have prefixes and suffixes, and the question is whether such words are identified as wholes or by decomposing them into their parts. A further complication is that some affixes, such as plural endings on nouns and endings on verbs indicating person or tense, have no effect or entirely predictable effects on meaning. They are referred to as inflectional. Others (e.g. ‘-er’, ‘-ness’, ‘un-’) have only partly systematic effects, and are referred to as derivational. Inflectional endings are best treated as part of the grammar, but derivational endings, being idiosyncratic, are lexical. Taft and Forster (1975) provided evidence for morphological decomposition by showing that pseudo-prefixes (like ‘re-’ in ‘rejuvenate’) can affect word recognition. However, later research has suggested a more complex picture, with results depending on the semantic and phonological transparency of the complex words (Marslen-Wilson, Tyler, Waksler, & Older, 1994). Definitive research remains to be carried out in this area.

The question about ambiguous words is what happens when an ambiguous word form (such as ‘bank’, which can mean a riverside or a financial institution) is identified. Intuitively, since people typically do not notice ambiguity, it might be thought that only the intended meaning is accessed. However, it is not clear in general how such selective access could occur. An alternative view is that all meanings are initially accessed and the one that fits with the context is retained (Swinney, 1979). A third idea is that the most frequent meaning is considered first, and only if it does not fit with the context are other meanings considered (Hogaboam & Perfetti, 1975). More recent accounts focus on context occurring before the ambiguous word and look at its effects on access of dominant and non-dominant meanings. Rayner, Pacht, and Duffy’s (1994) reordered access model and Vu, Kellas, and Paul’s (1998) context-sensitive model take different views of the nature of the interaction between context and dominance. However, it is difficult to decide between these models, because it is difficult to distinguish empirically between a meaning being very weakly activated (reordered access model) and not being activated at all (context-sensitive model).
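The contrast between ordered and reordered access can be sketched computationally. The meaning frequencies and the size of the context boost below are invented; the sketch shows only the logical point that prior context can raise the availability of a non-dominant meaning.

```python
# Toy sketch of meaning access for an ambiguous word.
# Frequencies and the context boost are illustrative inventions.

MEANINGS = {'bank': {'financial institution': 0.8, 'riverside': 0.2}}

def access_order(word, context_boost=None):
    """Rank a word's meanings by frequency (dominance), letting
    supportive prior context raise a meaning's availability."""
    scores = dict(MEANINGS[word])
    for meaning, boost in (context_boost or {}).items():
        scores[meaning] += boost
    return sorted(scores, key=scores.get, reverse=True)

print(access_order('bank'))
# -> ['financial institution', 'riverside']
print(access_order('bank', {'riverside': 0.7}))
# -> ['riverside', 'financial institution']
```

The empirical difficulty noted above is visible here: a model in which the boosted meaning merely overtakes a weakly active dominant meaning and one in which the dominant meaning is never accessed at all can produce the same ranking.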

Syntactic Processing

As words are identified, they have to be linked to other words that have just been identified. The dominant Chomskian approach to linguistics of the past forty years claims that words within sentences are grouped into phrases and clauses (and then sentences) according to rules that can be specified independently of what the groups mean, but which are then important in determining their meaning (syntactic rules). On this view, once words have been recognized, they have to be grouped syntactically with what has just gone before.

The psycholinguistic study of this aspect of language comprehension has a rather strange history. Early versions of Chomsky’s syntactic theory (1957, 1965) held that the surface syntactic structure of a sentence was derived (in an analytic sense) from an underlying structure or structures by a series of operations called transformations. These operations, for example, created questions from assertions, or passive sentences from actives (and, in early versions of the theory, complex sentences from simple sentences). The derivational theory of complexity, formulated by George Miller, claimed that the difficulty of understanding a sentence depended on the complexity of the derivation of its surface structure from its underlying structure, but did not deal directly with how its surface structure was computed. In fact, in comprehension, the transformations would have to be reversed, so that the underlying structure could be computed from the surface structure. By 1965, Chomsky had explicitly proposed a version of his theory in which all the information relevant to determining the meaning of a sentence was contained, in a simple form, in its underlying structure. This idea lent support to the notion that to understand a sentence was to derive its underlying structure.

Purported arguments against the derivational theory of complexity (see Fodor, Bever, & Garrett, 1974, for a summary, but see Garnham, 1983) led to the suggestion that meaning had to be computed more directly from surface structure. For example, Bever (1970) suggested the use of heuristics such as that the first noun…verb…noun sequence in a sentence introduced the actor, the main action, and the person or object acted upon.

The first clear proposal for how detailed information about surface structure might be computed was put forward by Kimball (1973). This account and most subsequent accounts have the following basic form. They assume that the syntactic category (part of speech), or the set of possible syntactic categories, of a word is retrieved from the mental lexicon and used together with information about possible surface structure configurations for the language to determine the surface structure of the current sentence. As Kimball realized, such information alone is not enough, and additional parsing principles are required. Such principles may be syntactic in nature or, as we shall see later, may be semantically based.

Kimball’s piecemeal proposal for seven principles was replaced by the more systematic proposals of Frazier and Fodor’s (1978) sausage machine, which later transformed itself (Frazier, 1979) into the hugely influential garden path theory with its two main syntactic principles of minimal attachment and late closure. Empirical research focused on preferences between two structures at points of ambiguity in the analysis of sentences. The garden path theory claimed: (1) that such ambiguities would always be resolved, at least initially, in the same way; and (2) that the way the ambiguities were resolved could be explained by the two principles, along with the general claim that structures were built using information about syntactic categories and syntactic rules.

For example, the word ‘that’ can, among other things, introduce both complement clauses and relative clauses. A that complement to the verb ‘tell’, for example, presents the content of what was told, as in (1).

• (1). The fireman told the woman that he had rescued many people in similar fires.

A relative clause, on the other hand, provides further information, maybe identifying information, about a person or thing, as in (2).

• (2). The fireman told the woman that he was worried about to install a smoke detector.

The garden path theory’s first principle, minimal attachment, favours simpler structures over more complex ones. It predicts that the complement clause analysis will be favoured over the relative clause analysis at the point at which ‘that’ is analysed and incorporated into the structure constructed so far. Thus, at the point at which the two sentences above diverge (‘many people’ versus ‘to install’), readers should have difficulty in the ‘to install’ version. This result is indeed found (Frazier & Rayner, 1982; Altmann, Garnham, & Dennis, 1992).

The second principle, late closure, favours incorporation of the current word into the local phrase, and operates when minimal attachment fails to establish a preference. For example, it explains why ‘yesterday’ in (3) is taken to qualify ‘left’ and not ‘said’.

• (3). John said Bill left yesterday.

The garden path theory was highly influential for about fifteen years, but it came under increasing attack on two fronts. First, there were claims that parsing preferences could be overridden by non-syntactic information, in a way that the theory did not allow. Second, there were claims that the principle of late closure was not universal: although some languages, including English, favour late closure, others, including Spanish, favour the opposite, early closure. Crain and Steedman (1985) suggested that the preference for complement clause over relative clause readings of ‘that’ clauses might arise because, with no context, the relative clause was unexpected–no additional information was required about the woman, so why not just say (4)?

• (4). The fireman told the woman to install a smoke detector.

However, in a context in which there were two women, one of whom the fireman was worried about and one whom he was not, the relative clause is justified. A series of studies suggested that, given the right context, the difficulty of the relative clause reading could be completely eliminated (see especially Altmann et al., 1992). This finding suggests that referential information influences parsing decisions directly.

Another type of information that appears to have an early effect on parsing decisions incompatible with the garden path theory is information associated with verbs and the sentence frames in which they prefer to occur. Many verbs can appear in more than one type of construction. The most common alternation in English is probably that between transitive and intransitive uses of the same verb (‘John is eating his breakfast’ vs. ‘John is eating’). Some verbs that show this alternation are more commonly used transitively and others are more commonly used intransitively. This kind of frequency information has been shown to influence parsing decisions (e.g. Clifton, Frazier, & Connine, 1984).

A generalization from the results just discussed would be that many types of information (e.g. syntactic, referential, lexical) could work together to produce a syntactic analysis of a sentence. Such a view leads to so-called constraint-satisfaction theories (e.g. MacDonald, Pearlmutter, & Seidenberg, 1994; McRae, Spivey-Knowlton, & Tanenhaus, 1998), which claim that different types of information can act as concurrent constraints on the final analysis.
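The core claim of constraint-satisfaction theories can be sketched as a weighted combination of votes. The constraint values and weights below are invented for illustration; no published model uses exactly this scheme, but the sketch captures the idea that several information sources jointly determine support for competing analyses.

```python
# Toy sketch of constraint satisfaction in parsing: several
# information sources vote, with weights, for one of two
# competing analyses of a 'that'-clause. All numbers invented.

def settle(constraints, weights):
    """Combine weighted votes into normalized support scores."""
    support = {}
    for source, votes in constraints.items():
        for analysis, v in votes.items():
            support[analysis] = support.get(analysis, 0.0) + weights[source] * v
    total = sum(support.values())
    return {a: s / total for a, s in support.items()}

constraints = {
    'syntactic_bias':      {'complement': 0.8, 'relative': 0.2},
    'referential_context': {'complement': 0.2, 'relative': 0.8},
    'verb_preference':     {'complement': 0.6, 'relative': 0.4},
}
weights = {'syntactic_bias': 1.0, 'referential_context': 1.5, 'verb_preference': 0.5}

result = settle(constraints, weights)
print(result)
```

With the referential constraint weighted heavily, as in the two-women contexts described above, the relative-clause analysis wins; weaken that constraint and the preference reverts to the complement clause.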

The second line of evidence against the garden path theory began with a study by Cuetos and Mitchell (1988) in which they examined the interpretation of Spanish sentences such as (5).

• (5). Someone shot the servant of the actress who was on the balcony.

Late closure suggests that the relative clause should be taken as qualifying ‘actress’. However, Spanish speakers took it to qualify ‘servant’.

This result, and a whole series of studies on a variety of languages that followed from it (for a summary, see Mitchell, Brysbaert, Grondelaers, & Swanepoel, 2000: 494-6), led the principal proponents of the garden path theory to formulate a new account of syntactic analysis called construal (Frazier & Clifton, 1996). On this account, the attachment of so-called non-primary phrases, such as relative clauses, can be determined by a variety of considerations, which need not be purely syntactic.

Mitchell, on the other hand, suggested that parsing decisions were affected by previous knowledge of which structures were more common in a particular language, the so-called tuning hypothesis (e.g. Mitchell, 1994), another generalization of the results on verb preferences. Mitchell argued that different languages might show different statistical patterns with respect to, say, early versus late closure. The precise predictions of this hypothesis depend on the level of detail (or ‘grain’) at which statistical information is accumulated. For example, Mitchell concluded that the hypothesis is not supported for Dutch (Mitchell & Brysbaert, 1998), but a more recent analysis with a finer grain size has questioned this conclusion (Desmet, Brysbaert, & De Baecke, 2002).
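The grain-size problem can be made concrete with a small sketch. The ‘corpus’ of attachment outcomes below is invented, as is the finer-grained feature; the point is only that pooling the counts coarsely and splitting them by a finer feature can yield opposite predictions, which is why grain size matters for testing the tuning hypothesis.

```python
# Toy sketch of the tuning hypothesis and the grain-size problem.
# Each record pairs a (hypothetical) finer-grained feature with
# the attachment observed in a (hypothetical) corpus.
from collections import Counter

corpus = [
    ('animate_host', 'high'), ('animate_host', 'high'), ('animate_host', 'low'),
    ('inanimate_host', 'low'), ('inanimate_host', 'low'),
]

def predict(corpus, feature=None):
    """Predict the more frequent attachment, either pooled over
    the whole corpus (coarse grain) or restricted to one feature
    value (finer grain)."""
    outcomes = [att for f, att in corpus if feature in (None, f)]
    return Counter(outcomes).most_common(1)[0][0]

print(predict(corpus))                   # coarse grain -> 'low'
print(predict(corpus, 'animate_host'))   # finer grain  -> 'high'
```

Pooled coarsely, low attachment dominates (3 of 5 cases); restricted to the animate subclass, the prediction reverses, illustrating how a reanalysis at a finer grain can overturn a conclusion drawn at a coarser one.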

We have already seen that information associated with particular verbs and the structures in which they occur can influence parsing decisions. Related views (e.g. Abney, 1989) suggest that, once a verb has been encountered, the parser tries to fill thematic roles (such as theme or instrument) associated with it, and that it prefers analyses in which these roles are filled. ‘Put’, for example, has three such roles: an agent, a thing moved and a destination. So, ‘the table’ in ‘put the block in the basket on the table’ is preferentially interpreted as the place where the block is put, not the original location of the block (as in ‘put the block in the basket on the table into the box on the floor’).

A somewhat neglected issue in syntactic processing (but see papers in Fodor & Ferreira, 1998) is the type of reanalysis that must take place if the initial analysis is found to be incorrect, as it will reasonably often be, according to the garden path theory and other accounts. However, Pickering and colleagues have recently proposed a model, the unrestricted race model, in which reanalysis is the principal determinant of parsing difficulty (van Gompel, Pickering, & Traxler, 2000).

Finally, it should be mentioned that readers and listeners might not always perform a full syntactic analysis of what they are hearing or reading (see Ferreira, Ferraro, & Bailey, 2002, for a recent version of this view). Caramazza and Zurif (1976) argued that agrammatic aphasics could understand many sentences simply by identifying the main content words in them and constructing a plausible message. Such processes probably also occur in comprehension by normal subjects. The misunderstanding of phrases such as ‘fills a much needed gap’ and of questions such as ‘After an air crash on the border between France and Spain, where should the survivors be buried?’ (Barton & Sanford, 1993) may be at least partly explained by the lack of a complete syntactic analysis. It seems unlikely, however, that such obvious constituents as simple noun phrases (e.g. ‘the red book’) are not recognized in normal comprehension. Thus, a notion of minimal commitment to syntactic structure, or underspecification of syntactic structure, has often been suggested (e.g. Weinberg, 1994). Unfortunately, it is difficult to provide a principled and empirically satisfactory account of the conditions under which underspecification occurs.

Discourse-Level Processing

The notions of word identification and syntactic processing in language comprehension are relatively constrained. The notion of discourse-level processing is much less clearly defined. Syntax is generally regarded as a sentence-level phenomenon. Thus, syntactic analysis groups together words within sentences ready for interpretation. Interpretation is a much more complex phenomenon and is not restricted to processes operating on single sentences.

Some basic aspects of meaning are determined by within-sentence syntactic structure. Each clause in a sentence usually presents one eventuality (event, action, state or process). The finite verb in the clause specifies what kind of eventuality and the other parts of the clause present the participants in the eventuality, so that the clause as a whole conveys local ‘who did what to whom’ information. One view, popular among formal semanticists, is that this kind of information follows from syntactic structure, with each rule of syntactic combination paired with a rule of semantic interpretation (the so-called rule-to-rule hypothesis of Bach, 1976). So, a very simple sentence might be formed by having a subject noun phrase (itself comprising a proper name) followed by an intransitive verb (e.g. ‘John walks’). The interpretation is that the person denoted by the proper name performs the action described by the verb. Note that the syntactic and semantic rules, though paired, are very different in content: one is about structure, the other about meaning conveyed by structure.
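The pairing of syntactic and semantic rules can be sketched for the ‘John walks’ example. The miniature ‘world’ below is invented; in the standard formal-semantic treatment it stands in for the denotations that a model supplies, with an intransitive verb denoting the set of individuals it is true of.

```python
# Toy sketch of the rule-to-rule hypothesis for 'John walks'.
# The mini-world is an illustrative invention.

WORLD = {'walks': {'John', 'Mary'}, 'sleeps': {'Mary'}}

def interpret_np(name):
    """Semantic rule for NP -> ProperName: denote an individual."""
    return name

def interpret_vp(verb):
    """Semantic rule for VP -> IntransVerb: denote a set."""
    return WORLD[verb]

def interpret_s(np, vp):
    """Semantic rule paired with the syntactic rule S -> NP VP:
    the sentence is true iff the NP's individual is in the VP's set."""
    return interpret_np(np) in interpret_vp(vp)

print(interpret_s('John', 'walks'))    # -> True
print(interpret_s('John', 'sleeps'))   # -> False
```

The sketch makes the pairing visible: each syntactic rule (how constituents combine) has a distinct semantic partner (how their denotations combine), exactly the division of labour noted above.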

The rule-to-rule hypothesis implies compositionality of semantics. However, it deals only with the literal meaning of sentences, and not with other aspects of their meaning, those that depend on context, for example, or non-literal aspects of meaning. Some cross-sentence aspects of interpretation have been analysed in the same kind of formal semantics framework as within-clause interpretation, in theories such as discourse representation theory (Kamp, 1981; Kamp & Reyle, 1993). In particular, this approach has been applied to the interpretation of anaphoric expressions (e.g. pronouns and ellipses, which take their meaning from a previous part of the text).

Formal, compositional, semantics characterizes both word meanings and combinatorial operations in a highly abstract way. Because of its roots in philosophical logic it recognizes the distinction between sense (roughly speaking, meaning as provided by definitions) and denotation (the thing or things in the world that a word stands for). Hence, it indirectly recognizes what has been called in cognitive science the grounding problem (Harnad, 1990). The grounding problem is the problem of explaining how language links with the world and hence how understanding language is understanding information about the world. This problem is particularly acute for certain types of approach to both language and language understanding, those that treat language as merely a formal symbolic system and model language understanding as the manipulation of symbols. In formal semantics, the denotation of a common noun, such as ‘table’, might be modelled as a set of tables, or rather as a function from possible worlds to sets of tables. Thus, ‘table’ means (in the sense of denotes) all those things that are tables in every possible state of affairs. However, formal semanticists are not primarily concerned with word meaning, so that the abstractness of this definition is not a problem for them. Also, since sets of tables are sets of things in the world, this type of analysis does link language and the world, and ‘solves’ the grounding problem, at least in principle.

However, a growing number of researchers feel that a more concrete solution to the grounding problem is needed. They suggest that the meaning of ‘table’, for example, has to be spelled out in terms of the way human bodies interact with parts of the world to perform certain functions, relatively formal acts of eating, for example (see, e.g., Glenberg, 1997). This approach takes a very specific view of what concepts are. However, it is not necessarily incompatible with a formal semantic approach. A table still has to be something that can be sat at, for example, and the meaning of ‘sat at the table’ must derive in some way from the meanings of, among other things, ‘sit’ (which will also be characterized in terms of human bodies and how they interact with objects for specific functions) and ‘table’. Proponents of this view, which is part of the embodied cognition movement, often extend these ideas from the interpretation of individual concepts (such as table) to the interpretation of constructions such as TRANSITIVE VERB + OBJECT. However, it is not clear what all examples of such structures have in common from the embodied cognition perspective, and hence what corresponds in this framework to the formal semanticists’ claim that there is a uniform account of the semantics of those structures.

We have already seen that there have been attempts to formalize intersentential relations, such as the relations of coreference indicated by certain anaphoric expressions, in particular definite pronouns (in English, ‘he’, ‘she’, ‘it’, ‘they’ and their variants, such as ‘him’). There is also a long-standing tradition of identifying discourse relations as holding between the propositions (or similar parts) of a text (e.g. Hobbs, 1983; Mann & Thompson, 1986). Unfortunately there is no precise agreement about the set of discourse relations that exist, how they are to be identified and how they are to be interpreted. It is clear, however, that texts depict temporal and causal relations between events, elaborate on previous descriptions, and present arguments whose parts bear various relations to one another; these are among the kinds of relation that a theory of discourse relations must describe.

A Theoretical Framework for Studying Discourse Processing: Mental Models Theory

Discourse and text play many roles in our lives. Conversation, in particular, has many social functions. These social functions are studied primarily, though not exclusively, by social psychologists rather than cognitive scientists. Whether this division of labour is a sensible one is a moot point. Cognitive scientists who study discourse processing think of language primarily as a system for conveying information. The information that is conveyed is primarily about some aspect of a world, whether it be the real world (as in newscasts), a fictional world (as in novels) or an abstract world (as in many academic texts–those describing theories of text comprehension, for example). This information is conveyed using a complex system of signs (a language), which itself has a structure that can be described (e.g. in academic texts about linguistics). Early psycholinguistic theories of comprehension failed to make a clear distinction between information being conveyed by language and information about the linguistic structure conveying that information. In the 1960s, Chomsky’s ideas about language were beginning to have a major impact, and psychologists such as George Miller correctly recognized that Chomsky’s ideas were of crucial importance for theories of language processing. However, the proposal, implicit in many of the early psycholinguistic theories, that comprehension was basically the extraction of syntactic deep structure, failed to recognize the difference between the linguistic structures and the information conveyed.

Chomsky’s ideas, as imported into psychology by Miller and others, were developed and modified by Fodor et al. (1974) and their colleagues. However, by 1970 a number of authors, most influentially John Bransford, were suggesting a different perspective on language comprehension, one that was eventually to develop into modern theories based on the notion of situation models or mental models.

Bransford (Bransford, Barclay, & Franks, 1972; Bransford & Franks, 1971) made three main claims about language comprehension. The first is based on the distinction between information conveyed and structures in the system used to convey it. This claim is that the mental representation of the content of a text (what a person had in their mind when they had understood the text) is not a representation of any of the text’s linguistic structures. Rather, it is a representation of the situation in the world that the text is about. Bransford illustrated this idea in an experiment (Bransford et al., 1972) in which people confused two sentences, (6) and (7), which probably describe the same situation.

• (6). Three turtles sat on a floating log and a fish swam beneath it.

• (7). Three turtles sat on a floating log and a fish swam beneath them.

However, they did not confuse two other sentences, (8) and (9), which differ linguistically in exactly the same way, but which probably do not describe the same situation.

• (8). Three turtles sat beside a floating log and a fish swam beneath it.

• (9). Three turtles sat beside a floating log and a fish swam beneath them.

Bransford’s other two ideas were also put forward to contrast with theories of comprehension based on the idea of extraction of syntactic deep structure. They were that comprehension is (a) an integrative process and (b) a constructive process. By saying that comprehension is an integrative process, Bransford was stressing that comprehension requires the combination (or integration) of pieces of information from different parts of a text. Because syntax has typically been assumed to be a sentence-level phenomenon, theories of comprehension based on syntax tend to focus on the comprehension of individual sentences, and thus fail to give consideration to the complexities of how mental representations of different sentences might be combined.

By saying that comprehension is a constructive process, Bransford was pointing out that comprehension requires the combination of explicitly presented information with relevant background information. This idea is illustrated in the turtles, logs and fish sentences above. It is only by using our mundane knowledge that we can work out what situations are probably being described. And it is this mundane knowledge, about the likely (relative) sizes of logs and turtles and about how logs float on water, for example, that allows us to come to the conclusion that the first two sentences probably describe much the same situation.

Construction and integration often work hand in hand. That is to say, links between different parts of a text are often made, or at least made concrete, by background knowledge. Consider, for example, a very brief text such as (10a) followed by (10b):

• (10a). John’s car left the road at high speed and hit a tree.

• (10b). The nearside front tyre had burst.

The use of a definite noun phrase (‘the nearside front tyre’) in the second sentence suggests a reference to an object whose existence is already established in some way. In other words the linguistic form of the text suggests one way in which information in the two sentences of the text might be integrated. However, the first sentence does not mention a tyre (but rather John, a car, a road and a tree). It is our background knowledge that cars have tyres, or more specifically that each car usually has one nearside front tyre, that allows us to make specific the link between the information in the two sentences. The full understanding of this text also further illustrates the use of mundane knowledge (about the link between tyres bursting and drivers losing control of cars) to compute a full interpretation of a text.

Although construction and integration often work hand in hand in this way, they do not always do so. In some cases integration is based only on linguistic information. For example in (11), the link between the first sentence and the second is established by the fact that the English pronoun ‘he’ refers to a single male person. The previous sentence, which is the only other sentence of the discourse, introduced one male person (a man), one object (a room) and a relation between them (entering). Thus ‘he’ in the second sentence must refer to the man in the first sentence:

• (11). The man entered the room. He sat down.
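The purely linguistic kind of integration at work in (11) can be conveyed with a minimal sketch (the names, the feature inventory and the resolution procedure here are illustrative assumptions, not a model from the literature): a pronoun is resolved by matching its number and gender features against the entities already introduced into the discourse representation.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """An entity introduced into the discourse representation."""
    name: str
    gender: str   # "male", "female" or "neuter"
    number: str   # "singular" or "plural"

# Feature requirements carried by some English pronouns.
PRONOUN_FEATURES = {
    "he":   ("male",   "singular"),
    "she":  ("female", "singular"),
    "it":   ("neuter", "singular"),
    "they": (None,     "plural"),   # gender unspecified
}

def resolve(pronoun, discourse_entities):
    """Return the entities compatible with the pronoun's features.

    Integration succeeds straightforwardly only when exactly one entity
    matches, as with 'he' in 'The man entered the room. He sat down.'
    """
    gender, number = PRONOUN_FEATURES[pronoun]
    return [e for e in discourse_entities
            if e.number == number and (gender is None or e.gender == gender)]

# Example (11): the first sentence introduces one man and one room.
entities = [Entity("the man", "male", "singular"),
            Entity("the room", "neuter", "singular")]
print([e.name for e in resolve("he", entities)])   # only 'the man' matches
```

Because only one entity in the model is singular and male, ‘he’ resolves without any appeal to background knowledge; with two matching candidates, as in the ‘John confessed to Bill…’ examples below, feature matching alone would not suffice.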

Conversely, if we are told that a delicate glass pitcher fell on to a hard floor (Johnson, Bransford, & Solomon, 1973), we can infer, using our background knowledge, that the pitcher (most probably) broke. However, that inference is not needed, at the point when the pitcher is described as falling, to link bits of information in different sentences of the text. At that point it merely elaborates or expands on one piece of information that is explicit in the text.

A crucial question about text comprehension is, therefore, how much integration and construction occurs and when. Obviously if we do not integrate all the information in a text we will not be able to create a fully coherent interpretation of the text. It does not follow that such integration always does occur: discourse and text are not always fully understood. However, such integration should occur if comprehension is to be achieved, and integration, whether based on purely textual features (as in the ‘… the man… he…’ example) or on constructive processes as well (as in the ‘… the car… the nearside front tyre…’ example), should be demonstrable experimentally. Indeed, many experimental studies, from Haviland and Clark (1974) onwards, have provided evidence that is consistent with this conclusion, at least for cases where the integration is of information in adjacent sentences or clauses. A strong view is that inferences necessary to establish a coherent interpretation of a text (necessary inferences for short) are always made (e.g. Garnham, 1989). However, it is possible that such inferences are not made, or are less likely to be made, if they link distant pieces of information in a text (McKoon & Ratcliff, 1992).

A related issue is whether inferences that are not necessary for integration, sometimes called merely elaborative inferences, are made in the normal course of comprehension. Bransford claimed that many of them were. For example, when people heard about a delicate glass pitcher falling on a hard floor, they later responded positively when asked if a sentence about the pitcher breaking had appeared in the text. However, such responses to later questions do not necessarily show that the inference was made during reading. And indeed, the way these questions were asked effectively made them leading questions.

A strong view (e.g. Thorndyke, 1976) is that no merely elaborative inferences are made as texts are understood. Two main ideas underlie this claim. The first is that any text supports many possible inferences. Since most inferences of this kind are not certain (maybe the pitcher did not break) and it is difficult to set a threshold for how likely the conclusion of the inference must be, given the information in the text, it is hard to justify making them. Second, inferences presumably require cognitive effort, and that effort should only be expended if it has a payoff. The payoff for a necessary inference is a coherent interpretation of a text. But the payoff for a merely elaborative inference may never come. Maybe it is better to wait and see if the inference is needed. What would be an elaborative inference at one point in a text could become a necessary inference later on. For example, when reading (12), it could be inferred, elaboratively, that the car had a nearside front tyre.

• (12). John’s car left the road at high speed and hit a tree.

Indeed, it could be inferred that the car had any of the very many normal parts of an automobile. The same inference becomes necessary when the passage continues, as in the example discussed earlier, with (13):

• (13). The nearside front tyre had burst.

Are no elaborative inferences made, then? Perhaps some require so little effort or have such certain conclusions that it is worth making them. McKoon and Ratcliff (1992) have made a claim of this kind. They suggest that inferences based on readily available knowledge are made in the normal course of comprehension (automatically, in their terms, although there are problems with the use of this term: see, e.g., Graesser, Singer, & Trabasso, 1994). If this claim is to have scientific content, there has to be a way of defining what knowledge is readily available that is independent of any test for whether an inference has been made. Intuition is not a good guide here. For example, it might seem that information about typical instruments for everyday actions would be readily available. Everyone knows that coffee is usually stirred with a spoon, so that if a text merely mentions coffee being stirred, it could be inferred that a spoon was (in all probability) used. However, there is evidence that instrument inferences are not routinely made in comprehension (Corbett & Dosher, 1978; Singer, 1979). It may nevertheless be true that inferences based on readily available information are made, but we do not yet have a good definition of readily available information.

A different suggestion concerns a class of inferences that might be made even when they are not necessary to integrate information in different parts of a text: inferences based on the presence of a single word in the text. So, if a text mentions dressing, it can be inferred that there are clothes, or if a text mentions a nurse it can be inferred that that person is (probably) female. The reason why such inferences might be made is that in constructing a mental model of the information in a text, information about the meanings, in the broadest sense, of the words in the text must be used. That information is accessed from long-term memory (mental lexicon/semantic memory) in the course of comprehension. To the extent that lexical entries are complex, information may be included from those entries in the mental model (the mental representation of the content of the text), and that information may be inferential.

The two examples given above appear to be different in kind. The fact that ‘to dress’ means ‘to put clothes on’ implies that ‘clothes’ is part of the definition of ‘dress’, in the strict sense. However, it is not part of the definition of ‘nurse’ that a nurse has to be, or is likely to be, female. That is a fact about how our society is organized. Nevertheless, there is evidence that both kinds of inference are made. Garrod and Sanford (1981) showed that a sentence such as (14) was no more difficult to understand following (15) than following (16).

• (14). The clothes were made of pink wool.

• (15). Mary dressed the baby.

• (16). Mary put the clothes on the baby.

This result contrasted with the standard Haviland and Clark (1974) result that (17) did take longer to read after (18) than after (19).

• (17). The beer was warm.

• (18). We checked the picnic supplies.

• (19). We got some beer out of the trunk.

Similarly, in a variety of published and unpublished studies we have provided a range of evidence to suggest that inferences from stereotyped role names to the genders of the people they refer to are made as the role names are read (see Garnham, 2001: chap. 10 for a summary).

However, it cannot be assumed that because some classes of inference based on the presence of single words in a text are made, they all are. One class for which we could not find any evidence is that based on the implicit causality of verbs. For some verbs, such as ‘confess to’, a simple sentence such as (20) suggests an action that tells us something about the subject of the sentence, John. He in all probability instigated the confession, perhaps because he was feeling guilty.

• (20). John confessed to Bill.

For other verbs, such as ‘blame’, the object is implicated. In (21) it is likely (but not necessary) that Bill has done something wrong.

• (21). John blamed Bill.

This implicit indication of causality may either be supported, as in (22), or countered, as in (23), by the explicit statement of a cause, so that the explicit cause may be congruent with the implicit cause or incongruent.

• (22). John confessed to Bill because he wanted a reduced sentence. (congruent)

• (23). John confessed to Bill because he offered a reduced sentence. (incongruent)

Incongruent endings are typically understood more slowly than congruent endings (e.g. Garnham & Oakhill, 1985), but this finding in itself does not show what happens when the verb is read. It could arise as part of the process of integrating the information in the two clauses. Garnham, Traxler, Oakhill, and Gernsbacher (1996) suggested that if a causality inference was made as the verb was read, it should place the implicit cause in focus and hence speed responding to the name (presented as a probe immediately after the first clause) of that cause. However, in a series of probe word recognition studies they found no evidence for this idea. Instead the effect was confined to the processing of the subordinate clause.

Although Bransford stressed the role of integration in text comprehension, he did not consider in detail the ways in which the need for integration is signalled in a text and the ways in which it is achieved. Indeed, his best-known demonstration of integration is the somewhat bizarre ‘ants in the kitchen’ experiment (Bransford & Franks, 1971), which gave rise to the much-studied Bransford and Franks linear effect. Bransford and Franks constructed a set of complex sentences each expressing four ‘ideas’, for example (24), which expresses the four ideas stated in (25)-(28).

• (24). The ants in the kitchen ate the sweet jelly that was on the table.

• (25). The ants were in the kitchen.

• (26). The ants ate the jelly.

• (27). The jelly was sweet.

• (28). The jelly was on the table.

In addition, they constructed sentences containing two or three ideas from each set, for example (29) and (30).

• (29). The ants ate the sweet jelly. (TWO)

• (30). The ants in the kitchen ate the jelly that was on the table. (THREE)

They then presented a mixed-up collection of sentences expressing either one, two or three (but never four) ideas from many sets and later asked people whether they remembered sentences containing one, two, three or four ideas. The results were that people were most confident that they had seen sentences with all four ideas from a set, even though they never had. Furthermore, both the probability of recognizing a sentence and the confidence with which it was recognized increased in a roughly linear manner with the number of ideas. Thus, Bransford and Franks claimed, the four ideas in a set were integrated in memory, even though they were never presented together.

In normal text comprehension, integration does not usually comprise the reassembling of sets of ideas that are randomly interspersed with other sets of ideas. Rather, it requires the use of textual cues, and often (as we have seen) relevant background knowledge, to link the different parts of a text. At a local, or fairly local, level, a very important set of links are those signalled by anaphoric expressions, such as pronouns and verbal ellipses of various kinds. Anaphoric expressions themselves have little, or in the case of some ellipses, no semantic content, which makes it somewhat surprising that they succeed in referring to specific people, things and eventualities. For example, the semantic content of the English pronoun ‘she’ specifies merely that it should refer to a single female entity. However, in the context of a text or a discourse it will usually refer to one specific person. Bransford did not consider this aspect of integration, but it is addressed in detail in mental models theory. According to that theory, the model that has been constructed when the anaphoric expression is encountered must be structured in such a way that there is either only one relevant item for the interpretation of the anaphor, or a very few items, among which a choice can readily be made. For example, in (31) there is only one appropriate referent for ‘he’, whereas in (32) there are two.

• (31). The man entered the room. He sat down.

• (32). John confessed to Bill because he wanted a reduced sentence.

However, background knowledge about confessing and about whether someone who is confessing is likely to be given a prison sentence or in a position to sentence someone else allows a choice between John and Bill to be made in favour of John.

The typically local link that an anaphoric device creates is just one of the ways in which integration occurs. Another type of link has been studied under the heading of coherence relations. The pieces of information in (again, usually adjacent) parts of a text bear various relations to one another, such as event and sub-event, cause and effect, premise of argument and conclusion. For example, Mann and Thompson (1986), who in their rhetorical structure theory used the term ‘relational predicates’ for coherence relations, identified the following types of relation that could hold: solutionhood, evidence, justification, motivation, reason, sequence, enablement, elaboration, restatement, condition, circumstance, cause, concession, background, thesis-antithesis. One notable aspect of this list is that, despite the tendency of psycholinguists to focus on narrative texts, comparatively few of the relations are narrative relations, and many more are relations between parts of arguments.

The kinds of relations that hold between parts of a text are signalled by, among other things, the order in which information is presented, the tense and aspect of finite verbs, and explicit connectives, such as ‘and’, ‘and then’, ‘but’, ‘so’ and ‘however’. For example, two adjacent past tense sentences, as in (33), often describe two events in sequence, whereas an imperfect followed by a past, as in (34), describes a state that holds while an event takes place (Kamp, 1979; Kamp & Rohrer, 1983).

• (33). The man came into the room. He sat down.

• (34). The man was feeling exhausted. He sat down.

However, past-past can have other interpretations, such as event and sub-event, as in (35).

• (35). John drove to London. His car broke down.

It may be that background knowledge plays a role in determining the particular relation that holds (Caenepeel, 1989).

Where explicit markers or other properties of text indicate relations, it is relatively easy to see how they are extracted. One complication is that the details of the relation may be underpinned by knowledge about the world, and that knowledge may or may not be available. For example, ‘because’ often signals a causal relation, but in an unfamiliar domain the basis of the cause may be unknown. So, someone reading (36) can compute that there is a causal relation between there being little wind and using Kevlar sails, but may not know what property of Kevlar makes it appropriate for conditions in which there is little wind.

• (36). Connors used Kevlar sails because he expected little wind.

Noordman and Vonk (1992) showed that people typically do not compute these relations while reading. Providing explicit information about the relation between wind and Kevlar sails had no effect on the time taken to read the ‘because’ clause, but it did affect subsequent verification of this relation.

Texts are also structured at a more global level and this structure should be an aid to comprehension. However, how this structure is computed and how it contributes to comprehension is not well understood. Some types of text have highly regimented structures, which may be explicitly flagged. Reports of experiments in cognitive psychology journals would be a typical example. Other texts have less systematic shifts from topic to topic or subtopic. Sometimes these shifts can be explicitly flagged, by phrases such as ‘another aspect of…’ or by paragraph structure. Sometimes detecting the change in topic may depend on background knowledge.

As the topic of a text shifts, the reader or listener must follow those shifts and use them to aid comprehension. It is usual to distinguish between local and global focusing effects in text. Global focus changes from one ‘discourse segment’ to another (Grosz & Sidner, 1986), and the relations between these segments are determined by the intentions of the person producing the discourse. These relations could be those listed in Mann and Thompson’s rhetorical structure theory (see above), though Grosz and Sidner themselves resist any attempt to provide a definite list of relations. Local focusing effects operate within discourse segments and determine which of the people or things just mentioned is likely to be mentioned again. Thus, over and above the constraints imposed by the mental model for a particular text, local focusing effects further narrow the range of possible referents for referring expressions with little semantic content, such as simple noun phrases (e.g. ‘the tree’) and pronouns (e.g. ‘it’). The main account of local focus for about fifteen years has been centering theory (see Grosz, Joshi, & Weinstein, 1995, for a recent version). Centering theory recognizes a set of forward-looking ‘centers’ for each utterance, primarily the things actually mentioned in that utterance. These centers are ranked according to various criteria, such as subjecthood, and the highest ranked is the most likely to be mentioned again and is the primary candidate for pronominal reference in a subsequent utterance. The single backward-looking center of each utterance must be identified with one of the forward-looking centers of the previous utterance. Backward-looking centers sometimes remain the same from one utterance to the next, though shifting is permitted so that the topic can move on.
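The bookkeeping at the core of centering theory can be sketched as follows (a deliberately toy version: real centering uses a richer ranking of grammatical roles and a fuller set of transition rules, and the function names here are illustrative, not from the theory):

```python
def forward_centers(mentions):
    """mentions: list of (entity, grammatical_role) pairs for one utterance.
    Rank subject > object > other; Python's stable sort preserves the
    original order of mention within each rank."""
    rank = {"subject": 0, "object": 1}
    return [entity for entity, role
            in sorted(mentions, key=lambda m: rank.get(m[1], 2))]

def preferred_referent(previous_mentions):
    """The backward-looking center of the next utterance is preferentially
    identified with the highest-ranked forward-looking center of the
    previous one, so that entity is the primary candidate for a pronoun."""
    cf = forward_centers(previous_mentions)
    return cf[0] if cf else None

# 'John gave Bill the book.' -> a following 'He ...' prefers John (subject).
prev = [("John", "subject"), ("Bill", "object"), ("the book", "other")]
print(preferred_referent(prev))   # John
```

On this scheme, keeping the same backward-looking center across utterances corresponds to a continuing topic, while picking a lower-ranked forward-looking center corresponds to the permitted topic shift mentioned above.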

A recent alternative to centering theory is Almor’s (1999) informational load hypothesis, which is more closely grounded in such psychological concepts as short-term memory. According to this hypothesis, the conceptual distance between words such as ‘robin’ and ‘bird’ or ‘robin’ and ‘it’ determines how difficult it is to establish that they refer to the same thing. However, this hypothesis is, if anything, even less precise than centering theory in its account of how focusing actually works.

Alternatives to Mental Models Theory?

The basic principles of mental models theory appear incontrovertible, and Garnham (1996) has argued that they arise from a logical (‘task’) analysis of comprehension and are, therefore, part of the computational theory of comprehension, in the sense of Marr (1982; see Hayward & Tarr, Chapter 2, this volume). However, partly for this reason, they do not specify in detail the psychological processes that underlie comprehension. Thus, some theories that might at first appear to be alternatives to mental models might be better interpreted as addressing somewhat different issues. For example, Kintsch’s (1988) construction-integration theory postulates two main stages to the process of comprehension. In the first stage, construction, information from the text and from relevant background knowledge is extracted and linked into a network. This stage is largely ‘bottom-up’: any apparently relevant information is used (e.g. all meanings of an ambiguous word) without much in the way of a ‘top-down’ attempt to limit interpretation. In the second stage, integration, related pieces of information support each other in the overall interpretation, whereas isolated bits of information (e.g. the unintended meaning of an ambiguous word, which is unlikely to be related to much else in the text) are suppressed. This account does not address directly the questions that are central to the mental models theory, such as how coreference relations are established, but it does address legitimate processing questions about which the mental models theory itself is agnostic.
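The flavour of construction-integration can be conveyed with a schematic sketch (the nodes, connection strengths and update rule here are illustrative assumptions, not Kintsch’s actual equations): richly interconnected pieces of information reinforce one another over repeated updates, while an isolated node, such as the unintended sense of an ambiguous word, is driven towards zero.

```python
# Schematic construction-integration settling. Nodes: two text propositions
# plus two senses of the ambiguous word 'bank'; only the intended sense is
# linked to the rest of the network by the construction stage.
nodes = ["prop1", "prop2", "bank=riverside", "bank=institution"]
W = [  # symmetric connection strengths (illustrative values)
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # the unintended sense is isolated
]

def integrate(W, steps=20):
    """Repeatedly set each node's activation to the weighted sum of its
    neighbours' activations (plus a small self-term), renormalizing so
    that the most active node stays at 1.0."""
    n = len(W)
    a = [1.0] * n            # construction: all candidates start active
    for _ in range(steps):
        a = [sum(W[i][j] * a[j] for j in range(n)) + 0.1 * a[i]
             for i in range(n)]
        peak = max(a)
        a = [x / peak for x in a]
    return a

activations = integrate(W)
for name, act in zip(nodes, activations):
    print(f"{name}: {act:.2f}")  # the isolated sense settles near zero
```

The point of the sketch is the division of labour: the construction stage is promiscuously bottom-up (both senses of ‘bank’ enter the network), and only the settling process of the integration stage suppresses the sense that nothing else in the text supports.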

Much the same can be said about the recently proposed ‘memory-based’ theories of comprehension (see papers in O’Brien, Lorch, & Myers, 1998). These theories claim that the integrative aspects of comprehension depend on the incoming piece of text sending a passive signal to all of memory, including text memory, which will resonate with relevant information that is needed to understand the text. Again, such an account does not deal with processes such as reference resolution, but it does begin to address the difficult problem of how stored knowledge is used rapidly (and usually correctly) to interpret text.

Pragmatic Interpretation

Mental models theory tends to focus on the literal meaning of text. It emphasizes the construction of mental representations of the information conveyed by the text. However, there are other aspects to text and discourse meaning. One aspect of language use that is easy to overlook within this perspective is that language is used to do things (e.g. ask questions, make promises, tie [marital] knots), not just to describe them. The ordinary language philosopher J.L. Austin emphasized this idea in his book How to do things with words (1962), and it was elaborated into the notion of speech acts by another philosopher, John Searle (1969). Some speech acts are direct (‘I bet you $5 that the Yankees will win the World Series’) but some are indirect. Understanding that someone who says ‘I feel cold’ actually wants you to close the window is an important aspect of comprehension.

One idea about how conversations are understood is that people follow a general principle of co-operation (Grice, 1975). Apparent deviations from this principle are resolved by conversational implicatures, which are a kind of inference. In the window example, it should be obvious to both people that the window is open, and it cannot be useful to mention that fact to someone who already knows it. So the other person looks for a different interpretation of the description–as an indirect request to close the window. However, other researchers, in particular Sperber and Wilson (1986), argue that an assumption of co-operation is not necessary and that people will determine the most relevant interpretation of an utterance without any presumption of co-operation.

Summary

Language comprehension depends on the interplay of processes at three main levels: lexical, syntactic and message. Lexical processes identify individual words in the speech stream or on the printed page and determine their meanings. They use a store of knowledge about words called the mental lexicon and information about how complex words are built up from simpler words (strictly, morphemes) and affixes. Syntactic processes group the words into larger units. These units are based on recurrent patterns of word types, and can therefore be recognized before the meaning of a particular pattern has been determined. If they could not be, comprehension would not be possible unless the message had already been understood! Message-level processes not only compute the meanings of these groups of words, but determine other, broader, aspects of meaning that depend on, among other things, relations to context and background knowledge and conventions about how language is used to achieve things in the world.

Despite great progress in the last thirty or forty years, none of these processes is fully understood. Indeed, many fundamental issues about the overall structure of the language comprehension system and that of its parts remain to be definitively answered.