Sunday, January 27, 2013

Reply to Alex (on parameters)

Commenting on my post, Alex asked "what is the alternative? Even if there is no alternative theory that you like, what is the alternative 'research paradigm'? What do *you* think researchers, who think like us that the central problem is language acquisition, should work on? What is the right direction, in your opinion?"
I had to start a new post, because I could not find a way to insert images in the 'reply to comment' option. You'll see I needed 2 images.

So, what's the alternative? I don't have a grand theory to offer, but here are a few things that have helped me in my research on this topic. I hope this can be of help to others as well.

I think the first thing to do is to get rid of the (tacit) belief that the problem is easy ('essentially solved', I'm told), and to take seriously the possibility that there won't be an adequate single-level theory for Plato's problem. Here's what I mean: In "What Darwin got wrong", Fodor and Piattelli-Palmarini rightly point out (end of the book) that unlike single-level theories in physics, single-level theories in biology don't work very well. Biology is just too messy. Theories that assume it's all selection (or it's all parameter fixation) are just not the right kind of theory. We've got to be pluralistic and open-minded. Biologists made progress when they realized that mapping the genotype to the phenotype was not as easy as the modern synthesis had it. Bean bag genetics is no good. Bean bag linguistics is not good either.

I once attended a talk by Lila Gleitman (who, like Chomsky, is almost always right) where she said something that generative grammarians (and, of course, all the others, to the extent they care about Plato's problem) ought to remember at all times: learning language is easy (it's easy because you don't learn it, it's 'innate'), but learning a language is really hard and you (the child) throw everything you can at that problem. I agree with Lila: you throw everything you can. So for Plato's problem, you resort to all the mechanisms you have available. (The prize will be given to those who figure out the right proportions.)
For us generativists, this means learning from the other guys: I personally have learned a lot from Tenenbaum and colleagues on hierarchical Bayesian networks and from Jacob Feldman's work on human concept learning. I think the work of Simon Kirby and colleagues is also very useful. Culbertson's thesis from Hopkins is also a must-read.
All of these guys provide interesting biases that could add structure to the minimal UG some of us entertain.
Add to that the sort of pattern detection mechanisms explored by Jacques Mehler, Gervain, Endress, and others to help us understand what the child uses as cues.
None of this is specifically linguistic, but we just have to learn to lose our fear about this. If UG is minimal, we've got to find structure somewhere else. Specificity, modularity, ... they'll have to be rethought.

The second thing to do is to try to figure out the right kind of grammatical priors to get these biases to work in the right way. Figure out the points of underspecification in (minimal) UG: what are the things about which UG does not say anything? (For example, syntax does not impose linear order; something else does.) Since a growing number of people bet on variation being an 'externalization' issue (no parameter on the semantic side of the grammar), it would be great to have a fully worked out theory of the morphophonological component in the sense of Distributed Morphology (what are the operations there, what's the structure of that component of the grammar?).
Halle and Bromberger said syntax and phonology are different (Idsardi and Heinz have nice work on this too). Would be nice to be clear about where the differences lie. (Bromberger and Halle put their fingers on rules (yes for phonology, no for syntax). I think they were right about that difference. Curiously enough, when Newmeyer talks about rules, those who defend parameters go crazy, saying rules are no good to capture variation, but no one went crazy at Halle when he talked about phonological rules, and boy does phonology exhibit [constrained] variation ...)

The third thing is to take lessons from biology seriously. Drop the idealization that language acquisition is 'instantaneous' and (like biologists recognized the limit of geno-centrism --- in many ways, the same limits we find with parameters) take development seriously ("evodevo"). There is good work by a few linguists in this area (see the work by Guillermo Lorenzo and Victor Longa), but it's pretty marginalized in the field. We should also pay a lot of attention to simulations of the sort Baronchelli, Chater et al. did (2012, PLOS) (btw, the latter bears on Neil Smith's suggestions on the blog.)

The fourth thing (and this is why I could not use the 'reply to comment' option) is to develop better data structures. Not better data (I think we always have too many data points), but better data *structures*. Here's what I mean. Too many of us continue to believe that points of variation (parameters, if you want) will relate to one another along the lines of Baker's hierarchy. Nice binary branching trees, no crossing lines, no multi-dominance, like this (sorry for the resolution) [Taken from M. Baker's work]

Such representations are plausible with toy parameters. E.g., "pro-drop": Does your language allow pro? No, then your language is English. Yes, then next question: does it allow pro only in subject position? No, then your language is Chinese. Yes, then your language is Italian.
We all know this is too simplistic, but this is ALWAYS the illustration people use. It's fine to do so (like Baker did) in popular books, but it's far from what we all know is the case.
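To make the toy example concrete, here is a minimal Python sketch of the decision procedure just described. The names (`classify_toy`, `allows_pro`, `pro_only_in_subject`) are hypothetical labels chosen for illustration, and the three outcomes are simply the ones from the toy illustration, not a serious claim about these languages.

```python
def classify_toy(allows_pro: bool, pro_only_in_subject: bool = False) -> str:
    """Walk the toy binary hierarchy: one yes/no question per level."""
    if not allows_pro:
        # On this branch the second question is never asked.
        return "English"
    # The second question only makes sense once the first is answered 'yes'.
    return "Italian" if pro_only_in_subject else "Chinese"


print(classify_toy(False))
print(classify_toy(True, False))
print(classify_toy(True, True))
```

What the sketch shows is how little structure a strict hierarchy needs: each question depends on exactly one earlier answer, and nothing else.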
But if it's not as simple, how complex is it?
As far as I know, only one guy bothered to do the work. For many years, my friend Pino Longobardi has worked on variation in the nominal domain. He's come up with a list of some 50+ parameters. Not like my toy 'pro-drop' parameter. Are there more than 50? You bet, but this is better than 2 or 3. More realistic. Well, look what he found: when he examined how parameters relate to one another (setting P1 influences setting P2, etc.), what you get is nothing like Baker's hierarchy, but something far more complex (the subway map in my previous post) [Taken from G. Longobardi's work]

As they say, an image is worth a thousand words.
But the problem is that we only have this detailed structure for one domain of the grammar (my student Evelina Leivada is working hard on other domains, as I write, so stay tuned). Although we have learned an awful lot about variation in the GB days, when it comes to talking about parameter connectivity, we somehow refuse to exploit that knowledge (and organize it like Longobardi did), and we go back to the toy examples of LGB (pro-drop, wh-in-situ). This is an idealization we have to drop, I think, because when we get our hands dirty, as Longobardi did, and as Leivada is doing, we get data structures that don't resemble what we may have been led to expect from P&P. This dramatically affects the nature of the learning problem.
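One way to see the difference in data structure is to write the dependencies down as a graph and ask whether it is still a tree. The sketch below is only an illustration under stated assumptions: the parameter labels and edges are made up (they are not Longobardi's parameters), and the check looks only for multi-dominance, i.e. a parameter whose setting depends on more than one other parameter. It does not look for cycles or crossing lines.

```python
from collections import defaultdict


def parents_of(edges):
    """Map each parameter to the set of parameters its setting depends on."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return parents


def is_strict_hierarchy(edges):
    """True if no parameter depends on more than one other parameter
    (no multi-dominance), as in a Baker-style binary tree."""
    return all(len(p) <= 1 for p in parents_of(edges).values())


# Hypothetical edges: (A, B) means "setting A influences setting B".
tree_like = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]
network_like = [
    ("P1", "P2"), ("P1", "P3"), ("P2", "P4"),
    ("P3", "P4"), ("P2", "P5"), ("P4", "P5"),
]

print(is_strict_hierarchy(tree_like))     # True
print(is_strict_hierarchy(network_like))  # False: P4 and P5 each have two parents
```

Once a parameter has several parents, a learner can no longer settle it by walking down a single path of yes/no questions, which is one concrete sense in which the data structure changes the learning problem.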

The fifth thing to do (this is related to the point just made) is to stop doing 'bad' typology. The big categories (Sapir's "genius of a language") like 'analytic, synthetic, etc' are not the right things to anticipate: there is no ergative language, no analytic language, or whatever. So let's stop pretending there are parameters corresponding to these. (I once heard a talk about "[high analyticity] parameter" ... If you say 'no' to that parameter, do you speak a [less analytic] language? Is this a yes/no or more-or-less issue?) These categories don't have the right granularity, as my friend David Poeppel would say.

Most importantly, we should be clear about whether we want to do linguistics or languistics. Do we care about Plato's problem, or Greenberg's problem? These are not the same thing. IMHO, one of the great features of minimalism, compared to GB, is that it forces you to choose between the language faculty or languages. Lots of people still care about getting the grammar of English right (sometimes, they even say, I-English), but how about getting UG right? It's time we worry about the biological 'implementation' of (I-)language, as Paul (Pietroski) would say.

To conclude, step 0 of the alternative boils down to recognizing we have been wrong (that's the best thing we can aspire to, Popper would say, so why not admit it?).
Alex, I hope this answers your question.


  1. So how would this be truly different from the approaches to grammar that Ivan Sag calls 'FS' (formal syntax, as opposed to Universal Grammar, and Typology)? FS people have some idea for an invariant architecture (eg the levels of c-structure, f-structure and some kind of semantics and other stuff in LFG), and some principles concerning their relationships, such that every c-structure node has a unique f-structure correspondent, and then come up with notations for expressing the language particular restrictions (assuming that rule notations with an evaluation metric are a decent approximation to what is learned, salvaged in principle at least by the Bayesians).

    The languistics/linguistics distinction strikes me as unnecessary and in fact dangerous, since it can encourage people to ignore inconvenient phenomena in the pursuit of attractive UG-ish visions. Baker (2008) ignoring the literature on case stacking in his account of concord is perhaps a recent example of this happening.

  2. So I don't see the relevance of evo-devo to language acquisition; other than as a source of metaphors and analogies. And I don't see where the parameters that you discuss come from as I thought you had abandoned them- or are these parameters rather than Parameters?

    But other than that this seems very reasonable -- indeed it more or less summarises what non Chomskyan linguists and cognitive scientists have been trying to do for the last 30 years. It also seems to summarise what Chomsky has been arguing against quite vociferously for the last 30 years. He is not a big fan of Bayesian learning, for example. So I don't see what if anything survives of the classic model here.

    And one final trivial nit-pick. I have been getting told off over and over again by you, by Paul P, for using the word 'learning' -- it's not learning, it's acquisition or development or growth or whatever.
    And now you are happily using 'learning a language' 'learning problem' etc.
    So is it still taboo for dangerous empiricists like myself? Or can anyone use the term ?

    1. Sorry, but most of the non-Chomskyan work has not been trying to do this, for it has abstracted away from the core problems. I keep returning to this but it's because I think it's critical: empiricists can play so long as they try to explain the facts. The facts are those roughly described by GB and its cousins. I have yet to see an "empiricist" derivation of the principles of the binding theory, island effects, ECP, fixed subject constraint, crossover effects, X'-theory, etc. Until this is provided, empiricist aspirations are at best sociologically (maybe psychologically as well) interesting. The main difference between what empiricists are trying to do and what minimalists are trying to do is that the latter are trying to derive non-trivial generalizations (again as described by GB) and show how they follow on more natural assumptions. To date, this is not what I have seen from your favorite empiricists. But until they do, there is nothing to talk about really.

      Last point: what Chomsky dislikes about Bayesians is their general disregard for the structure of the hypothesis space. They seem to "abstract away" from this in their discussions, worrying about how to navigate around the space without worrying much about what it looks like. However, all the discussion of how one gets around it cannot tell us what it looks like. But this is what linguists care about, and Bayesians should too. Remember, the only real question at the end of the day is: what does this tell us about UG? Right now, nothing Bayesians have said really tells us much. That's why Chomsky (and anyone else) should consider their work of dubious value. However, once one adopts a reasonable proposal for UG, then the question of how Gs arise given this UG becomes a reasonable project. All agree that it involves using PLD + UG to get a G and all agree that this COULD involve elaborate statistical massaging. What we want is not obeisance to these truisms, but some actual nontrivial results that demonstrate that such truisms are indeed true.

    2. sorry Alex, reply will have to be brief, otherwise it would have to wait too long.
      1. RE Evo-Devo, I think it's more than a source of metaphors and analogies. Remember that the things I listed were things that *I* found useful. It may not work for everyone. But do take a look at the Lorenzo/Longa papers (they have nice pieces in Linguistics and Lingua from a few years ago).
      2. Re parameters, yes I meant lower case p, I should have written "points of underspecification" (like: if merge does not care about linear order, but the speech channel does, a decision will have to be made by the child to resolve the things left open by UG [not encoded in UG])
      3. I'm glad you found most of the items on my list reasonable, and I agree many of them are part of other traditions. As I tried to suggest, we have a lot to learn from them. Note that I was speaking for myself, I'm not saying this is true for all Chomskyans. Nor was I suggesting Chomsky agrees.
      4. Having said this (point 3), I should also say that just like we have to learn from the other traditions, they also should learn a lot from us. Even if the P&P picture is not correct, I don't want to suggest it's useless. Far from it. I wish the Bayesians took the pain to learn more about UG: they would get better priors from doing so! It would make their analyses more compelling (Yang and others, who btw Chomsky often cites with approval, make room for plenty of statistical learning but show that it gets nowhere without UG)
      5. RE learning, should have watched out for purists like you. Sorry. But you know I meant 'growth/development' (or perhaps Pietroski-learning ;)

    3. There is one thing about some of the Bayesians (in particular, the MDL people) which strikes me as especially discordant with Chomsky, which is the idea that you don't need any specific structure for the hypothesis space at all, because, given enough data, they're all equivalent. Dumbed-down version for linguists in Chater & Vitanyi (2007) Journal of Mathematical Psychology 51:135–163, and much more on the pages of those two authors and Anne Chu, and John Goldsmith seems to have been into this for some time.

      Unfortunately, they appear to have no estimates of how many universes would probably have to die before a child with a randomly determined 'universal prior' would accumulate enough data to determine the grammar of English. But the idea that there are universal priors might explain their apparent lack of interest in the structure of the hypothesis space.

      As far as I can see, it is still possible to not buy into the universal prior idea at all, but use their general approach as an upgrade of the classic evaluation metric (which LFG at least I think has continued to use implicitly without any serious discussion of the underlying ideas).

    4. Yes, the methods can be used. Jeff Lidz is exploring this in the context of a more informed structure of the hypothesis space. In fact, Aspects essentially outlined a Bayesian model, the main conceptual problem being what the evaluation metric looked like. So, I completely agree; the technology is fine, it's the silly associationism that gets tacked on that is objectionable. For reasons I simply cannot understand, the statistically inclined seem to assume that what gets counted and how somehow just emerges from the data and so there is no reason to specify native structure. Add to this a firm set of convictions about what can and cannot be "innate" and you get very shallow theories. As Cedric points out, nobody that I know is against stats. How could you be? The question is what role they play. Berwick, Niyogi, Yang, a.o. have all combined probabilities with reasonable views of UG to great effect. So, it's not the stats that are the problem, it's the associationism which is always lurking in the background.

  3. Avery,
    for me the main point of the Chater and Vitanyi results (and the Horning results which it develops in some sense) is that it shows that you don't need negative evidence. I don't take it as being a theory of language acquisition but just as an argument against the claims that you can't control overgeneralisation without negative evidence -- the 'logical problem' of language acquisition -- and related ideas like the Subset principle.

    *I* think it does that quite well, but opinions may differ.

    I agree about complexity problems -- these are asymptotic results and don't tell you anything much about finite sample behaviour which is what is important, and neglect the computational issues, but that latter point may not be convincing to the crowd here who seem sceptical about computation.

    1. Who's skeptical about computation? Which crowd? Feasibility has always been interesting and Berwick and Niyogi's discussion of sample complexity always seemed right on target, though it has been hard to do this with realistic models with many parameters. However, it is certainly the right kind of thing to consider, rather than the asymptotic results.

      BTW, my crowd has always been comfortable with using indirect negative evidence. What we have insisted on is that this makes sense only if you already have some expectations, i.e. some prior knowledge about the shape of the hypothesis space. With this, the absence of evidence can be evidence of absence. Howard Lasnik made this point decades ago and so far as I can tell we all accepted this. What we did not accept is that this makes sense without articulated expectations. My reading of the Tenenbaum et al. stuff is that they agree. You get subset results given the structure of the space of alternatives in some special circumstances. The problem is that with realistic parameters things don't often fit nicely into sub and super sets. Rather they overlap and then it's not that clear how to proceed. This was Dresher and Kaye's point and not one that there are trivial ways of finessing.

    2. @Alex that's the part I too want to pay attention to, but I think the universal prior component needs to be discussed and dissected away by people with the right mathematical competence (not me, yet, and perhaps never).

      In particular, to my way of thinking (and, I conjecture, to Norbert's), it would be fine to construct the prior a.k.a. evaluation metric any way that seemed empirically to give sensible results, rather than make a big fuss about doing it in the statistically 'correct' way, as, for example Mike Dowman does in his paper. (tho the question of what it has to be is different and presumably harder to answer than the question of what it could be).

      Plus, of course, the thing I keep grumbling about: to apply those ideas you need to have a generation theory that assigns probability of occurrences to the utterances, which seems to be more than current syntactic theories are capable of providing for descriptively serious grammars that capture the generalizations that learners are almost certainly acquiring.

      I conjecture that this problem can be temporarily evaded by thinking about conditional probabilities of utterances given their meanings, so that, given a meaning M, the probability of the utterance is estimated at 1/n where n is the number of different ways the grammar provides of expressing M, but this idea has major limitations, so the evasion, if it is sensible at all, is only short-term.

      But there could well be something important that I haven't read or thought up that vitiates this idea.

    3. On the generation theory point, there have been some attempts -- e.g. Clark and Curran's probabilistic CCG work, but to be frank they cheat a little and convert it into a PCFG. The problem is not, in my opinion, a problem with the models or the estimation techniques but one of the data. You need a large amount of annotated data to train the models on, and this has to be the data that you expect to see (i.e. it has to be naturally occurring data).
      The richer the model the slower the annotation. People have now realised that the Penn treebank is too small to train lexicalised PCFGs on, so to train a descriptively adequate model with all of the features needed would need a lot of data and that is out of the question.

      Plus getting anyone to agree on what the annotations should be is a problem. If there actually was a lasting consensus on the syntactic structures of sentences in English or any other language then it would be a lot easier to get some large scale annotation off the ground. There are some projects of this type -- e.g. Redwoods for HPSG.

    4. This comment has been removed by the author.

    5. [typo fix of deleted above]

      I found something on generation by Clark & Zhang, but not yet Clark & Curran (but plenty on parsing).

      I think I would buy the line that mathematically defining the concept 'best grammar for a body of data' is logically prior to working out how to find it, with the result that requiring unavailably large datasets and non-tractable algorithms isn't really an issue for me, in this quest.

    6. I was thinking of the Clark and Curran work on parsing -- in order to do statistical parsing, one way is to define a generative model that assigns a probability distribution over parses and then you pick the most likely parse given the surface string or yield. So that was what I thought you were looking for --
      Generation in NLP parlance as you probably know refers to the task of going from a semantic representation to a surface string, which is probably not relevant though it may well use a generative model too.

      (I am simplifying a bit as modern statistical parsers are often not generative in this sense but discriminative. )

    7. I think the parsing problem is in good hands, but also that the generation problem for descriptively serious grammatical formalisms (ie not PCFGs) is more fundamental scientifically, and to the focus of this blog, since, for example, you can't use a Bayesian method to find the best grammar for data if you can't calculate P(D|G). Therefore there seems to be no statistically based alternative to the unsatisfactory parameters game, which therefore remains the only game in town (without oracles). The problem also seems somewhat neglected, perhaps due to excessive difficulty or industrial irrelevance. (or perhaps I'm deeply confused about something.)

  4. Norbert, I am not sure I understand your point about super/subsets.
    If things don't fit neatly into super/subset relations then there is no learnability problem (of this 'logical' sort) because you will see positive evidence that will distinguish A from B and B from A. I thought the problem here was when A is a neat subset of B, and then if your hypothesis is B, then you will never see a positive example which will explicitly disconfirm B, and so you are doomed.
    That is the (bad) argument that the statistical learners are meant to overcome. I think the statistical learners can also handle the non-neat case, just like any learner can.

    1. I see I was unclear. The sub/super problem goes away with indirect negative evidence. Given a theory of UG, indirect neg evidence is not a real problem, as Lasnik indicated. It's only an issue if one has no idea what to expect, which is not the case assuming UG. The Dresher-Kaye problem is that without independent parameters how to change settings is unclear and there is no trivial fix to this problem. Sadly, most of the parameter learning problem seems to be of the second kind.

    2. I completely agree about the Dresher-Kaye problem being the real one -- that is a very general problem for all learning approaches not just parametric ones.

      What do you mean about it being an issue if you have no idea what to expect ? All approaches have some bias, some expectations, even if it is just a universal prior -- even that counts as UG doesn't it?

    3. Indirect negative evidence is not a problem if you have some idea of what is grammatically possible. Why? Well, a reasonable enough acquisition strategy would be to wait and see if what you are "expecting" materializes within some finite time period. If not, assume that it's not possible. Presto: indirect negative evidence. This was Lasnik's point many years ago and it seems right to me. One, of course, has to be somewhat careful here. For example, there are not many sentences with 4 levels of embedding produced, yet they are grammatical. However, in general, the point seems right and so there is no problem with indirect negative evidence once one has a UG. Without one, however, this is a problem, and not just in linguistics. Gallistel has a nice new piece reviewing the classical learning theories and reemphasizes the problems with correlations with the absence of a stimulus. The main problems are conceptual and they have never been adequately solved. This is analogous to indirect neg evidence. These are solvable, however, when a richer set of assumptions is provided. That's what I had in mind.
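
The wait-and-see strategy in the last comment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a model anyone in the thread proposed: the construction labels, the `expected` set (standing in for what a UG makes available) and the `window` size are all hypothetical.

```python
def rule_out_unattested(expected, observations, window):
    """Return the expected constructions that never show up in the first
    `window` observations; the learner treats these as not possible."""
    seen = set(observations[:window])
    return {c for c in expected if c not in seen}


# Hypothetical labels; `expected` stands in for what the learner is "expecting".
expected = {"null_subject", "wh_fronting", "embedding_depth_4"}
observations = ["wh_fronting", "overt_subject", "wh_fronting", "overt_subject"]

ruled_out = rule_out_unattested(expected, observations, window=4)
print(sorted(ruled_out))  # ['embedding_depth_4', 'null_subject']
```

Note that the sketch also rules out `embedding_depth_4`, which is exactly the caveat about rare but grammatical sentences: the strategy only works if the expectations and the waiting period are chosen with some care. Without an `expected` set there is nothing to compare the input against, so the absence of a construction carries no information.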