Thursday, October 31, 2013

Lifted from the comments on 'Mother knows best.'

There has been an intense discussion of the APL paper in the comments to my very negative review of this paper (here). I would like to highlight one interchange that I think points to ways that UG-like approaches have combined with (distributional) learners to provide explicit analyses of child language data. In other words, there exist concrete proposals dealing with specific cases that fruitfully combine UG with explicit learning models. APL deals with none of this kind of material, despite its obvious relevance to its central claims and despite citing papers that deal with such work in its bibliography. It makes one (e.g. me) think that maybe the authors didn't read (or understand) the papers APL cites.

One more point: Jeff observes that "it is right to bring to the fore the question of how UG makes contact with data to drive learning," and that APL are right to raise this question. I would agree that the question is worth raising and worth investigating. What I would deny is that APL contributes to advancing this question in any way whatsoever. There is a tendency in academia to think that all work should be treated with respect and politesse. I disagree. All researchers should be so treated, not their work. Junk exists (APL is my existence proof, were one needed) and identifying junk as junk is an important part of the critical/evaluative process. Trying to find the grain of trivial truth in a morass of bad argument and incoherent thinking retards progress. Perhaps the only positive end that APL might serve is to be a useful compendium of junk work that neophytes can use to practice their critical skills. I plan to use it with my students for just this purpose in the future.

At any rate, here is a remark by Alex C that generated this informative reply (i.e. citations for relevant work) by Jeff Lidz.

Alex Clark:

So consider this quote (not from the paper under discussion):

"It is standardly held that having a highly restricted hypothesis space makes
it possible for such a learning mechanism to successfully acquire a grammar that is compatible with the learner’s experience and that without such restrictions, learning would be impossible (Chomsky 1975, Pinker 1984, Jackendoff 2002). In many respects, however, it has remained a promissory note to show how having a well-defined initial hypothesis space makes grammar induction possible in a way that not having an initial hypothesis space does not (see Wexler 1990 and Hyams 1994 for highly relevant discussion).
The failure to cash in this promissory note has led, in my view, to broad
skepticism outside of generative linguistics of the benefit of a constrained initial hypothesis space."

This seems a reasonable point to me, and more or less the same one that is made in this paper: namely, that the proposed UG doesn't actually solve the learnability problem.

Jeff Lidz:

Alex C (Oct 25, 3am [i.e. above; NH]) gives a quote from a different paper to say that APL have identified a real problem and that UG doesn't solve learnability problems.

The odd thing, however, is that this quote comes from a paper that attempts to cash in that promissory note, showing in specific cases what the benefit of UG would be. Here are some relevant examples.

Sneed's 2007 dissertation examines the acquisition of bare plurals in English. Bare plural subjects in English are ambiguous between a generic and an existential interpretation. However, in speech to children they are uniformly generic. Nonetheless, Sneed shows that by age 4, English learners can access both interpretations. She argues that if something like Diesing's analysis of how these interpretations arise is both true and innate, then the learner's task is simply to identify which DPs are Heim-style indefinites and the rest will follow. She then provides a distributional analysis of speech to children that does just that. The critical thing is that the link between the distributional evidence that a DP is indefinite and the availability of existential interpretations in subject position can be established only if there is an innate link between these two facts. The data themselves simply do not provide that link. Hence, this work successfully combines a UG theory with distributional analysis to show how learners acquire properties of their language that are not evident in their environment.

Viau and Lidz (2011, which appeared in Language and oddly enough is cited by APL for something else) argue that UG provides two types of ditransitive construction, but that the surface evidence for which is which is highly variable cross-linguistically. Consequently, there is no simple surface trigger which can tell the learner which strings go with which structures. Moreover, they show that 4-year-olds have knowledge of complex binding facts which follow from this analysis, despite the relevant sentences never occurring in their input. However, they also show what kind of distributional analysis would allow learners to assign strings to the appropriate category, from which the binding facts would follow. Here again, there is a UG account of children's knowledge paired with an analysis of how UG makes the input informative.

Takahashi's 2008 UMd dissertation shows that 18-month-old infants can use surface distributional cues to phrase structure to acquire basic constituent structure in an artificial language. She also shows that, having learned this constituent structure, the infants know that constituents can move but nonconstituents cannot, even if there was no movement in the familiarization language. Hence, if one consequence of UG is that only constituents can move, these facts are explained. Distributional analysis by itself can't do this.

Misha Becker has a series of papers on the acquisition of raising/control, showing that a distributional analysis over the kinds of subjects that can occur with verbs taking infinitival complements could successfully partition the verbs into two classes. However, the full range of facts that distinguish raising from control does not follow from the mere existence of two classes. For this, you need UG to provide a distinction.

In all of these cases, UG makes the input informative by allowing the learner to know what evidence to look for in trying to identify abstract structure. In all of the cases mentioned here, the distributional evidence is informative only insofar as it is paired with a theory of what that evidence is informative about. Without that, the evidence could not license the complex knowledge that children have.

It is true that APL is a piece of shoddy scholarship and shoddy linguistics. But it is right to bring to the fore the question of how UG makes contact with data to drive learning. And you don't have to hate UG to think that this is a valuable question to ask.


  1. Just to clarify: that quote is from a paper *by* Jeff Lidz, called "Language Learning and Language Universals" in Biolinguistics, which is freely available and much of which I agree with.

  2. Maybe it would be worth discussing some of these examples. I think Misha Becker's work is maybe the best to talk about, for example Mitchener and Becker (2011) (DOI 10.1007/s11168-011-9073-6). This actually appeared in a special issue that I edited, so obviously I think it is a good paper, and I have discussed these issues with Misha, but I don't want to claim that she agrees with me.

    So this paper concerns the acquisition of the raising/control distinction; broadly speaking, the acquisition of verb subcategorisation frames. And it is a modeling paper, which is what we need in this case.
    The paper looks at various learning algorithms that could be used to acquire this distinction from certain inputs. In particular, the learner already knows what verbs and nouns are and has some other syntactic and semantic knowledge.
    In the context of this discussion the questions then are:
    a) what are the learning approaches that acquire that other syntactic knowledge? i.e. how do they learn which words are nouns and which are verbs?
    b) are those learning mechanisms capable of learning the raising/control distinction?

    Now neither of these issues is really discussed in the paper, if I recall correctly.
    So this paper is vulnerable to the APL critique as well; or at least it certainly doesn't offer any counterarguments.

    The Takahashi paper is an AGL paper, not a modelling paper, so I don't think that is relevant either (I have only read Takahashi and Lidz 2007, so apologies if this is very different from the thesis).

  3. I don't see which criticism of APL applies to the Mitchener & Becker paper. The thing that I think you are missing is that what that paper shows is that there are learning mechanisms that can identify the two classes. What it doesn't show is that anything follows from those classifications, which is the UG contribution. As Norbert said in an earlier post, in order to identify the UG contribution you have to see what the distinction is used for. In the case of raising/control, there is a host of properties related to the distinction, yielding the following contrasts (which you could find in any decent intro syntax textbook):

    1) Idiom chunk asymmetries:
    The shit is likely to hit the fan
    *The shit is trying to hit the fan

    2) "there-insertion" asymmetries
    There is likely to be a riot
    *There is trying to be a riot

    3) Synonymy under passive asymmetries
    CBS is likely to interview John = John is likely to be interviewed by CBS
    CBS is trying to interview John ≠ John is trying to be interviewed by CBS

    4) Use of expletive 'it'
    John is likely to leave.
    It is likely that John will leave
    John is trying to leave
    *It is trying that John will leave

    5) +/- selectional restrictions on surface subject
    The rock is likely to roll down the hill
    *The rock is trying to roll down the hill

    The fact that a learning mechanism can identify that there are two classes does not guarantee that this set of properties is diagnostic of the classes. For that, you need a syntactic theory that connects all the facts up, which is what UG provides. On the UG story that (I think) Becker is offering, the learner is equipped with two classes from which these facts follow. What the distributional evidence does is help the learner identify which verbs fall into which classes. This is precisely the sort of thing that a theory of UG is supposed to be good for, i.e., (a) making facts that aren't in the experience of the learner fall out from facts that are in the experience of the learner, and (b) making clusters of facts that wouldn't necessarily have to covary, covary.

  4. And Takahashi's thesis also contains a connectionist model that fails to learn what the 18-month-old infants learned.