Of Minds and Language
Edited by Massimo Piattelli-Palmarini, Juan Uriagereka, and Pello Salaburu
Where a lattice-based learner clearly excels over an enumeration learner is that, although it considers grammars in the right sequence to satisfy SP, it is not otherwise constrained by a rigid pre-determined ordering of all the grammars. For any input sentence, the learner must postulate a smallest language, but it has a free choice of which smallest language to postulate. Its choice could be made by trial and error, if that is all that is available. But a learner with decoding capabilities could do it much more effectively, because the input guides a decoding learner towards a viable hypothesis. And happily, for this purpose full decoding is not essential. Once decoding is used just to speed up learning, not for the application of the EM, partial decoding is good enough, because a lattice-based learner doesn't need knowledge of all the grammars that could license a sentence in order to be able to choose one that is free of subsets; instead, the lattice offers the learner only grammars that are free of subsets. This is the heart of the lattice solution to the problem of applying EM. The evaluation metric is inherent in the representation of the language domain, so the question of which of a collection of grammars best satisfies EM doesn't need to be resolved by means of online computations, as had originally seemed to be the case. The whole cumbersome grammar-comparison process can be dispensed with, because EM's preferred languages are now pre-identified. The Gold-type enumeration, despised though it may have been on grounds of psychological implausibility, has thus taught us a valuable lesson: that evaluation of the relative merits of competing hypotheses does not inevitably require that they be compared.
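To make this division of labor concrete, here is a minimal Python sketch, not the CUNY implementation: the class name GrammarLattice, the supersets table, and the licenses callback are all invented for exposition. The point it illustrates is the one just made: because the subset-free "active edge" is pre-identified, the learner never compares grammars online; it simply picks any edge grammar that can license the current input.

```python
# Minimal sketch (not the CUNY implementation): the evaluation metric lives
# in the representation itself. The lattice hands the learner only
# subset-free grammars, so no grammars are ever compared online.

class GrammarLattice:
    """Toy lattice over grammars. supersets[g] is the set of grammars whose
    languages properly include g's language (assumed given in advance,
    with an entry, possibly empty, for every grammar)."""

    def __init__(self, grammars, supersets):
        self.grammars = set(grammars)
        self.supersets = supersets
        # The "active edge": grammars that are supersets of no other
        # surviving grammar, i.e. the smallest languages currently available.
        self.active_edge = {
            g for g in self.grammars
            if not any(g in supersets[h] for h in self.grammars if h != g)
        }

    def candidates(self, sentence, licenses):
        """Return the smallest languages compatible with the input.
        licenses(g, s) stands in for the parser's judgment, possibly via
        partial decoding, that grammar g can parse sentence s."""
        return [g for g in self.active_edge if licenses(g, sentence)]
```

Any member of the returned list satisfies SP by construction, so trial and error or decoding can choose among them without risking a superset error.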
We seem to be on the brink of having a learning model that is feasible in all departments: learners' hypotheses are input-guided by parametric decoding but only as much as the parsing mechanism can cope with; SP applies strictly but not over-strictly; neither online computation nor memory is overtaxed. But there are two final points that I should flag here as deserving further thought.
First, the appeal of the lattice representation in contrast to a classic enumeration is that it permits constructive grammar selection procedures, like decoding, to step in wherever rigid ordering of grammars is not enforced by EM. But I want to post a warning on this. We are in the process of running simulation tests to make sure that this ideal plan doesn't spring nasty leaks when actually put to work. The most important thing to check is that we can integrate the two parts of the idea: using the lattice to identify the smallest languages, and using partial decoding to choose among them. We think this is going to work out, but there's an empirical question mark still hovering over it at the moment.[6]
Finally, there's that nagging question of whether it is plausible to suppose that we are all born with a grammar lattice inside our heads. There's much to be said about this and about the whole issue of what could or couldn't be innate. It would be very exciting to be able to claim that the lattice is just physics and perfectly plausible as such, but I don't think we're there yet. In lieu of that, we would gladly settle for a rationalization that removes this huge unwieldy mental object from our account of the essential underpinnings of human language. If the lattice could be projected in a principled way, it would not have to be wired into the infant brain. It might be dispensed with entirely, if the vertical relations in the lattice could be generated as needed rather than stored. To do its job, the learning mechanism needs only (a) access to the set of smallest languages at the active edge of the lattice, and (b) some means of renewing this set when a member of it is erased and languages that were above it take its place. We are examining ways in which the lattice might be projected, holding out our greatest hopes for the system of default parameter values proposed by Manzini and Wexler (1987). But at least in our CoLAG language domain, which is artificial and limited but as much like the natural language domain as we could achieve despite necessary simplifications, we have found exceptions, thousands of exceptions, to the regular patterning of subset relations that would be predicted on the assumption that each parameter has a default value which (when other parameters are held constant) yields a subset of the language licensed by the non-default value. Many subset relations between languages arise instead from unruly “conspiracies” between two or more parameters, and they can even run completely counter to the default values.[7]
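Continuing the toy sketch above, and only under its invented representation, requirement (b) might look like the erase method below: when an edge grammar is ruled out, its supersets are promoted if no smaller surviving grammar still sits beneath them. The second function states the Manzini-and-Wexler-style prediction just described; the CoLAG "conspiracy" exceptions are precisely the cases where that check fails.

```python
# Sketch of requirement (b): renew the active edge on the fly when a grammar
# is erased, so the full set of vertical relations need not be stored.
# Continues the toy GrammarLattice above; all names are invented.

class ProjectableLattice(GrammarLattice):
    def erase(self, g):
        """Discard grammar g (its language has been ruled out) and promote
        any of its supersets that no longer dominate a surviving grammar."""
        self.grammars.discard(g)
        self.active_edge.discard(g)
        for h in self.supersets[g]:
            if h in self.grammars and not any(
                h in self.supersets[k] for k in self.grammars if k != h
            ):
                self.active_edge.add(h)

def default_prediction_holds(lang_with_default, lang_with_nondefault):
    """The regular patterning referred to in the text: with other parameters
    held constant, the default value of a parameter should license a subset
    of what the non-default value licenses. The thousands of exceptions are
    cases where this check fails, or where a subset relation arises from a
    conspiracy of several parameters that no single default predicts."""
    return set(lang_with_default) <= set(lang_with_nondefault)
```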
If these exceptions prove to be irreducible, it will have to be concluded that as-needed projection of the lattice is not possible and that the lattice must indeed be biologically inscribed in the infant brain. We hold out hope that some refinement of the principles that define the defaults may eventually bring the exceptions under control. What encourages this prospect is the realization that the languages that linguists are aware of may be a more or less haphazard sampling from a much larger domain that is more orderly. SP concerns relations between languages, which do not closely map relations between grammars. So the innate grammar domain may be highly systematic even if the language domain is pitted by gaps. Gaps would arise wherever the innately given lattice contains a superset-generating grammar lower than a subset-generating grammar. The subset grammar would be UG-permitted but unlearnable because its position in the lattice happens to violate SP (or some other aspect of EM). Such grammars would be invisible to us as linguists, whose grasp of what is innate is shaped by observation of the languages that human communities do acquire. In that case, the priority relations among grammars in the innate domain may be much better-behaved than they seem at present, and may after all be projectable by learners on a principled basis. And there would be no need to suppose that the grammar lattice was intricately shaped by natural selection to capture just exactly the subset relations between languages.
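The gap scenario just described can be stated as a simple check over a priority ordering. The following toy function is my own illustration, not anything from the CoLAG work: a grammar is flagged as a potential gap if some grammar reachable earlier in the order generates a proper superset of its language, since a learner obeying that order would settle on the superset and never retreat.

```python
# Toy illustration of how gaps in the language domain could arise: a
# subset-generating grammar placed above (tried after) a superset-generating
# grammar can never be acquired, so linguists would never observe it.

def unlearnable_grammars(priority_order, language_of):
    """priority_order: grammars listed from lowest (tried first) upward.
    language_of(g): the set of sentences grammar g licenses.
    Returns grammars whose languages are proper subsets of a language
    reachable earlier, and which are therefore invisible in practice."""
    gaps = []
    for i, later in enumerate(priority_order):
        if any(language_of(later) < language_of(earlier)
               for earlier in priority_order[:i]):
            gaps.append(later)
    return gaps
```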
CHOMSKY: When the child has learned topicalization and set the topicalization parameter, why can that knowledge not be retained?
FODOR: The culprit is the ambiguity of triggers. Because the triggers are ambiguous, any parameter setting the learner adopts on the basis of them could be wrong. So the learner has to be always on the alert that sentences she projected on the basis of some past parameter setting may not in fact be in the target language. But you are right that there was a missing premise in the argument I presented. It assumed that the learner has no way to tell which triggers are ambiguous and which are not. That's important, because clearly the learner could hold onto her current setting for the topicalization parameter if she knew she had adopted it on the basis of a completely unambiguous trigger. In most current models the learner cannot know this, even if it were the case. This is because the model parses each sentence with just one new grammar (when the current grammar has failed to parse it). But parametric ambiguity can be detected only by testing more than one grammar; and non-ambiguity can be detected only by testing all possible grammars. A learner capable of full decoding would be able to recognize a sentence as parametrically unambiguous. The more psychologically plausible Structural Triggers Learners that do partial decoding can also recognize unambiguity, if they register every time they encounter a choice point in the parse. Even though the serial parser is unable to follow up every potential analysis of the sentence, it can tell when there are multiple possibilities. If such a learner were to set a parameter indelibly when its trigger was unambiguous, could it avoid the retrenchment problem? The data from our language domain suggest that there are so few unambiguous triggers that this would not make a big dent in the problem (e.g., 74 percent of languages have one or more parameter values that lack an unambiguous trigger). However, we are currently testing a cumulative version in which parameters that are set unambiguously can then help to disambiguate the triggers for other parameters, and this may be more successful.
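Here is a rough sketch of the choice-point bookkeeping just described, under invented interfaces (the real Structural Triggers Learner's parser is of course far richer): the serial parser follows a single analysis but notes whenever more than one parametric continuation was available, and only parameter values supported by a choice-point-free parse are frozen.

```python
# Rough sketch of "register every choice point": a serial parser follows one
# analysis but records whether an alternative parametric analysis was ever
# available. Only values from a choice-point-free parse are set indelibly.
# The analyses_at callback and all names are invented for exposition.

def parse_with_choice_tracking(sentence, analyses_at):
    """analyses_at(state) returns the list of (parameter, value, next_state)
    continuations available at the current parse state; an empty list or a
    next_state of None ends the parse."""
    used_values, saw_choice = set(), False
    state = ("start", tuple(sentence))
    while state is not None:
        options = analyses_at(state)
        if not options:
            break
        if len(options) > 1:
            saw_choice = True                   # ambiguity noticed, though not pursued
        parameter, value, state = options[0]    # serial parser takes one path
        used_values.add((parameter, value))
    return used_values, saw_choice

def maybe_freeze(frozen, used_values, saw_choice):
    """Indelible setting: freeze values only when no choice point was met;
    frozen values are exempt from later retrenchment."""
    if not saw_choice:
        frozen.update(used_values)
    return frozen
```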
PARTICIPANT: I was wondering whether any statistical measures would come in, because I think Robin Clark has suggested something of this kind in his earlier work: entropy measures, for example.[8] Also David LeBlanc at Tilburg tried to build parameter setting into a connectionist network: there was a statistical measure before a parameter was set.
FODOR: Yes, the Structural Triggers learning model that we have developed at CUNY is actually a family of models with slight variations. The one we like best is one that has some statistics built into it.[9] What we have discovered, though, is the importance of using statistics over linguistically authentic properties. Statistical learning over raw data such as word strings without structure assigned to them has not been successful, so far anyway. Even very powerful connectionist networks haven't been proved to be capable of acquiring certain syntactic generalizations, despite early reports of success (Kam 2007). In our model (and Charles Yang's model has a similar feature) we do the statistical counting over the parameter values. A parameter value in a grammar that parses an input sentence has its activation level increased. This gives it a slight edge in the future. Each time the learner needs to postulate a new grammar, it can pick the one with the highest activation level, that is, the one that has had the most success in the past. In the lattice model we have extended this strategy by projecting the activation boost up through the lattice, so that all the supersets of a successful grammar are incremented too, which is appropriate since they can license every sentence the lower grammar can license. Then, if a grammar has been quite successful but is eventually knocked out, all of its supersets are well activated and are good candidates to try next. Preliminary results (see footnote 6 above) indicate that this does speed acquisition.
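Under a toy representation of grammars as frozensets of (parameter, value) pairs, my own and not the CUNY code, the counting scheme and its lattice extension might be sketched as follows.

```python
# Sketch of the statistical component: parameter values in a grammar that
# parses the input gain activation, and the boost is projected upward so
# that every superset grammar in the lattice gets the same credit.
# Grammars are modeled as frozensets of (parameter, value) pairs.

from collections import defaultdict

value_activation = defaultdict(float)     # (parameter, value) -> activation

def reward_parse(parsing_grammar, lattice, boost=1.0):
    """Boost the values of the grammar that just parsed the input, and of
    every superset grammar, since supersets license everything it does."""
    beneficiaries = {parsing_grammar} | set(lattice.supersets[parsing_grammar])
    for grammar in beneficiaries:
        for pv in grammar:
            value_activation[pv] += boost

def best_candidate(candidates):
    """When a new grammar must be postulated, prefer the candidate whose
    parameter values have been most successful in the past."""
    return max(candidates, key=lambda g: sum(value_activation[pv] for pv in g))
```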
BOECKX: I am interested in knowing what the main differences are between the model that you sketched and the model that Charles Yang has been pursuing.[10] One of the things that Charles has been trying to make sense of is the ambiguity of triggers. In particular, it was obvious from the very beginning of the principles and parameters approach that if triggers were completely unambiguous, acquisition of syntax would be extremely fast. It wouldn't take three years, but three minutes, basically. That is, if all the switches are there and everything is unambiguous, it would be done almost instantaneously. We know that while it is actually fairly fast, it does take a couple of years, so one of the things that Charles has been trying to do is play on this ambiguity of triggers and the fact that there will be some sentences that will be largely irrelevant to setting the switches, so that the learner has to keep track of the complex evidence that he or she has. Therefore, the model uses the ambiguity or complexity of triggers as an advantage, to explain a basic fact, namely that it takes time to acquire syntax. Could you comment on that?
FODOR: First of all, I don't think it is true that if there were unambiguous triggers learning should be instantaneous, because there is so much else the learner has to do. At CUNY we assume that children don't learn the syntax from a sentence in which they don't know all the words; that would be too risky. So the child has to have built up some vocabulary, and as Lila Gleitman says, this can be quite slow. So that takes time, and then there is also the interaction problem: that is, the learner might not be able to recognize a trigger for one parameter until she has set some other parameter. So I doubt that parameter setting could be instantaneous anyway. However, I agree with you that it is interesting to explore the impact of the ambiguity of triggers, and this is what we have been doing for some years. My first approach to this (J. D. Fodor 1998) was to say that in order to model parameter-setting so that it really is the neat, effective, deterministic process that Noam envisaged, there must be unambiguous triggers; and we have got to build a model of the learner that is capable of finding the unambiguous triggers within the input stream. As I mentioned in my paper here, a learner would have to parse all the analyses of a sentence in order to detect the ambiguities in it; but one can detect that it is ambiguous just by noting the presence of a choice of analysis at some point in the parse. Then the learner could say, “I see there are two potential ways of analyzing this sentence. It is ambiguous with respect to which parameter to reset, so I will throw it away. I will learn only from fully unambiguous, trustworthy triggers.” We have modeled that strategy, and we have found, disappointingly, that it doesn't always work. It is very fast, as you imply, when it does work, but it often fails (Sakas and Fodor 2003). The reason is that there just isn't enough unambiguous information in the natural language domain. As far as we can tell (of course we haven't modeled the whole language world, only 3,000 or so languages), natural language sentences aren't parametrically unambiguous enough to facilitate a strategy of insisting on precise information. I think this is a puzzle. I mean, why isn't the natural language domain such that it provides unambiguous information for acquisition? Is there some reason why it couldn't be? Or is it just testament to the robustness of the learning mechanism that it can get by without that assistance? In any case, it suggests that Charles Yang is right to model parameter setting as a nondeterministic process, as we do too in our current models.
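To make the failure mode concrete, here is a toy rendering, with invented interfaces, of the "trustworthy triggers only" strategy: every parametrically ambiguous sentence is discarded, so if some target parameter value has no unambiguous trigger in the input at all, that value is simply never set.

```python
# Toy rendering of the unambiguous-triggers-only strategy: ambiguous
# sentences are thrown away, so the learner converges only if every target
# parameter value has at least one unambiguous trigger in the input.

def unambiguous_only_learner(input_stream, unambiguous_trigger_for):
    """unambiguous_trigger_for(sentence) returns the single (parameter,
    value) pair that the sentence unambiguously triggers, or None if the
    sentence is parametrically ambiguous."""
    settings = {}
    for sentence in input_stream:
        trigger = unambiguous_trigger_for(sentence)
        if trigger is None:
            continue                      # ambiguous: discarded, never learned from
        parameter, value = trigger
        settings[parameter] = value       # safe, and never retrenched
    return settings                       # incomplete if some value lacked a trigger
```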
Now to your other point, about how our model relates to Charles's. We have worked quite closely with Charles and we do have very similar interests. However, in comparative tests on the CoLAG domain we have found that Charles's Variational Model runs about 50–100 times slower than ours. We measure learning time in terms of the number of input sentences the learner consumes before converging on the target grammar. The Variational Model is really very slow. In fact, our estimates of its efficiency are more positive than his own. In his book (Yang 2002) he says that in a simulation of the setting of ten parameters the model took 600,000 input sentences to converge. Though he doesn't describe the details of the experiment, this does seem excessive, showing signs of the exponential explosion problem that is a constant danger in parameter setting. I think the reason is that Charles was building on a seriously weak model that he inherited from the work of Gibson and Wexler (1994). The Gibson and Wexler model is a trial-and-error system. It guesses a grammar without guidance from the input as to which grammar to guess; the input serves only to provide positive feedback when a lucky guess has been made. In creating his model, Charles grafted statistical processing and the notion of grammar competition onto this inherently weak model. By contrast, when we add a statistical component to our Structural Triggers models, it enhances the parametric decoding that the model engages in to identify a grammar that fits the novel input sentence. This has the property that triggering has always been thought to have, which is that the input sentence guides the learner's choice of a next grammar hypothesis. Charles has drawn interesting theoretical consequences from the statistical aspect of his model, showing how it predicts gradual learning rather than instant setting of parameters, and variable performance by children and also by adults. This is all very interesting, but we believe it deserves to be implemented in a basic learning mechanism that is closer to what Noam had in mind in proposing triggering as opposed to trial-and-error exploration through the array of all possible grammars.
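To make the contrast concrete, here is a deliberately simplified caricature, my own and not Yang's actual update rule nor the Structural Triggers code: in the first loop the input only rewards a guess that was made without its guidance; in the second, the candidate set itself is derived from the input and the lattice.

```python
# Simplified caricature of the two architectures being contrasted: reward-
# only trial and error (the Gibson-and-Wexler lineage) versus input-guided
# choice among subset-free candidates supplied by decoding and the lattice.

import random

def trial_and_error_step(weights, sentence, parses):
    """Guess a grammar in proportion to its current weight; the sentence
    only supplies feedback after the guess has been made."""
    grammars = list(weights)
    guess = random.choices(grammars, weights=[weights[g] for g in grammars])[0]
    if parses(guess, sentence):
        weights[guess] += 1.0             # a lucky guess is rewarded
    return guess

def decoding_guided_step(lattice, sentence, decode_candidates):
    """The input does the work up front: (partial) decoding proposes only
    grammars that could license the sentence, and the lattice keeps only
    the subset-free ones."""
    viable = [g for g in decode_candidates(sentence) if g in lattice.active_edge]
    return viable[0] if viable else None
```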