Tuesday, November 17, 2015

Never thought I would say this

Never thought I would say this, but I found that I resonated positively with a recent small comment by Chris Manning on Deep Learning (DL) that Aaron White sent my way (here). It seems that DL has computational linguistics (CL) of the Manning variety in its sights. Some DLers apparently believe that CL is just nano-moments away from extinction. Here’s a great quote from one of the DL doyens:

NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.

DL wise men like Geoff Hinton have already announced that they expect that machines will soon be able to watch videos and “tell a story about what happened” and be downsized onto an in-your-ear chip that can translate into English on the fly. Great things are clearly expected. Personally, I am skeptical as I’ve heard such hyperbole before. We have been five years away from this sort of stuff for a very long time.

Moreover, I am not alone. If I read Manning correctly, he is skeptical (though very politely so) as well.[1] But, like me, he sees an opportunity here, one I noted before (here and here). Of course we likely disagree about what kind of linguistics will be most useful for advancing these technological ends,[2] but when it comes to engineering projects I am very catholic in my tastes.

What does the opportunity consist in? It relies on a bet: that generic machine learning (even of the DL variety) will not be able to solve the “domain problem.” The latter is the belief that how a domain of knowledge is structured matters a lot even if one’s aim is to solve an engineering problem.

An aside: shouldn’t those who think that the domain problem is a serious engineering hurdle also think that modularity is a good biological design feature? And shouldn’t these people therefore think that the domain specificity of FoL is a no-brainer? In other words, shouldn’t the idea that humans have domain-specific knowledge that allows them to “solve” language problems (and supports their facile acquisition and use) be the default position? Chris? What think you? Dump general learning approaches and embrace domain specificity?

Back to the main point: The bet. So, if you think that using word contexts can only get you so far (and not interestingly far either), then you are ready to bet that knowing something about language will be useful in solving these engineering problems. And that provides linguists with an opportunity to ply their trade. In fact, Manning points to a couple of projects aimed at developing “a common syntactic dependency representation and POS (‘part of speech,’ NH) and feature label sets which can be used with reasonable linguistic fidelity and human usability across all human languages” (3).[3] He also advocates developing analogous representations for “Abstract Meaning.” This looks like the kind of thing that GGers could usefully contribute to. In other words, what we do directly fits into the Manning project.

Another aside: do not confuse this with investigating the structure of FL.  What matters for this project is a reasonable set of Greenberg “Universals.” Indeed, being too abstract might not be that useful practically, and being truly universal is not that important (what is important is finding those categories that best fit the particular languages of interest). This is not a bad thing. Engineering is not to be disparaged. It’s just not the same project as the one that GG has scientifically set for itself. Of course, should the Chomsky version of GG succeed, it is possible that it will contribute to the engineering problem. But then again, it might not. As I understand it, General Relativity has yet to make a big impact on land surveying. It really all depends (to fix ideas, think birds and planes or fish and submarines. Last time I looked, plane wings don’t flap and sub bodies don’t undulate).

Manning makes lots of useful comments about DL, many of which I didn’t understand. He makes some, however, that I did. For example, there is the observation that DL has mainly proved useful in signal-processing contexts (2) (i.e. where the problem is to extract the generalization that is in the data, the pattern from (noisy) patternings). The language problem, as I’ve argued, is different from this (see here), so the limits of brute-force DL will, I predict, become evident when the new wise men turn their attention to it. In fact, I make a more refined prediction: to “solve” this problem DLers will either (i) ignore it, (ii) restrict the domain of interest to finesse it, or (iii) promise repeatedly that the solution is but 5 years away. This has happened before and will happen again unless the intricate structural constraints that characterize language are recognized and incorporated.

Manning also makes several points that I would take issue with. For example, IMO he (like many others) confuses squishy data with squishy underlying categories. See, in particular, Manning’s discussion of gerunds on p. 4. That the data does not exhibit sharp boundaries does not imply that the underlying structures are not sharp. In fact, at some level they must be, for under every probabilistic theory there is a categorical algebra.  I leave it to you out there to come up with an alternative analysis of Manning’s observed data set. I give you a 30-second time limit to make it challenging.

At any rate, you will not be surprised to find out that I disagree with many of Manning’s comments. What might surprise you is that I think he is right in his reaction to DL hubris and he is right that there is an opportunity for what GGers know to be of practical value. There is no reason for DL (or Bayes or stats) to be inimical to GG. It’s just technology. What makes its practice often anathema is the hard-core empiricism gratuitously adopted by its practitioners. But this is not inherent to the technology; it is only a bias of the technologists. And there are some, like Jordan and Manning and Reisinger, who seem to get this. It looks like an opportunity for GGers to make a contribution. One, incidentally, that can have positive repercussions for the standing of GG. Scientific success does not require technological application. But having technological relevance does not hurt either.

[1] I confess to a touch of schadenfreude, given that this is the kind of thing that Manning and Co like to say about my kind of linguistics wrt their CL approaches.
[2] Though I am not confident about this. I am pretty confident about what kind of linguistics one needs to advance the cognitive project. I am far less sure about what one needs to advance the engineering one. In fact, I suspect that a more “surfacy” syntax will fit the latter’s design requirements better than a more abstract one given its NLPish practical aims. See below for a little more discussion.
[3] I have it from a reliable source that this project is being funded by Google to the tune of millions. I have no idea how many millions, but given that billions are rounding errors to these guys, I suspect that there is real gold in them thar hills.


  1. I'm not particularly surprised to hear this from Manning; he is one of only a handful in NLP who

    1) have a firm grasp of the linguistic literature,
    2) have good taste for what constitutes an interesting problem,
    3) realize that a hypertrophic focus on incremental performance improvements is always an intellectual dead end that inhibits progress in the long run.

    Manning's brief one-line remark on the ACL also shows that his views are not reflective of the NLP community at large. Maybe this will change if deep learning does end up obsoleting the simple models that currently dominate much of NLP.

    One more remark regarding your conclusion: [A contribution], incidentally, that can have positive repercussions for the standing of GG. Scientific success does not require technological application. But having technological relevance does not hurt either.

    This seems to focus on the institutional boons that come with applications, i.e. more prestige, money, jobs in other fields, and so on. But applications are also important sources for new empirical questions. For example, designing a wide-coverage grammar requires analyses for phenomena that are usually considered boring or part of the periphery. But these questions can turn out to be much more interesting than initially thought. Very little work in Minimalism has looked at the syntax of if-then in comparison to auxiliary inversion, but it's actually far from obvious what is going on in those constructions. Science has often profited from technology pushing new questions to the forefront, and in an ideal world that's what NLP should be doing for linguistics.

  2. This comment has been removed by the author.

  3. By the way, Manning did a straight syntax thesis (LFG) on ergativity, if my memory serves me right.

    1. It does. He's also written on various formal syntax issues from an LFG-ish perspective.

  4. There seems to be a general belief in the deep learning community that FoL is "nothing special," in the sense that it's "okay" to put some known domain structure in our models, so long as it's of the type that they approve of. (Somewhat circular, yes.) But in particular when pushed on linguistic domain knowledge, that's typically seen as too specific and not broad enough. From this I conclude that the underlying theory is basically that the language faculty is an artefact of other general problem-solving skills. Whether you agree with that or not is a reasonable question. (I believe I can guess where Norbert falls, and you can perhaps guess from my tone where I fall, though perhaps not as far and for slightly different reasons.)

    That said, I cannot complain too vehemently because when we lowly NLPers who happen to do some DL on the side integrate "linguistic knowledge" into our models, it's perhaps right to say that it's too specific because it typically is. We're not yet at a point where we really even use things like Greenbergian properties adequately, and I don't think the pudding there will bear much proof until we have far more than 30 languages we're looking at. But to go all the way to what Norbert might find appropriate (which I think hardcore DL folks would still consider too "FoL specific") is far from what we can do.

    1. Unless the basic properties of FoL, which include but are not limited to narrating a YouTube clip, can be discovered by some ML system under the conditions of language learning--i.e., no labeled data, 30 million rather than 30 billion words, etc.--what makes one feel entitled to offer any kind of opinion on FoL (special or otherwise)?

    2. Yes, it is easy to guess where I stand on this question. However, we should distinguish two different things: the views that DLers have and the view that DL requires one to have. The first is what Hal and Charles are reacting to. So far as I know there is nothing specific to the ideas behind DL that requires that domain-specific knowledge (even VERY VERY specific domain-particular knowledge) be excluded from the models. Thus, if it is excluded, it is not on any principled DL grounds but for other reasons. There are two that I can think of: (i) empiricist prejudice and (ii) technological ambitions. The first is old hat, but it is worth noting how susceptible CSers are to this cognitive failure. The second, however, might be just as pressing here. If your aim is to develop systems that can easily generalize, then building them on very specific cognitive grounding might be unattractive. This, of course, has nothing whatsoever to do with the scientific questions. But immediate Google dollars are often far more persuasive than future immortal fame.

      At any rate, whatever the DL community does, we should keep in mind that we can steal their stuff for our ends should this prove useful. The real question right now is whether it is.

      Last point: if language is going to be a focus, I suspect that seeding systems with some generic linguistic knowledge will prove useful. We already know that even for running basic stats over inputs, knowing that language has headed phrases can be really useful. I cannot imagine that DL would eschew this info should it prove useful. So, does it? Even technologically?

    3. Bilingual acquisition seems to work pretty well, so perhaps corpus size could be lowered to 15m (what about trilingual, I wonder: could it go down to 10m, Steven Bird's target for his project of self-documentation of very small languages by their native speakers?). But I doubt that there's zero annotation: since words referring to kinds of 'Spelke Objects' always fall into one part-of-speech category, and some of these are learned prior to significant syntax, a bit of (partial, perhaps sparse) labelling seems reasonable.

      Interestingly to me, nothing comparable to the Spelke Object -> Noun generalization holds for actions; they can be expressed by all kinds of things, and the verb category is sometimes closed (an interesting discovery of typological/descriptive approaches, I suggest).

    4. I would not be so excited about a possible contribution of GG to DL. First, there is the possibility of a Granularity Issue there: unless GG and DL talk about the same concepts and units, a contribution from GG is not possible. Second, DL also has the Black Box problem that neural networks in general have: we do not know what is going on in the learning process. Even if GG makes some contribution, it is hard to tell what it contributes and how that contribution works.

    5. I'm not sure I would call the issue one of granularity, but I think the general point is well taken. In fact, I think the Granularity Issue and the Black Box Problem are inextricably linked. One of DL's promises is supposed to be that one can learn higher and higher order abstractions at each layer of the network. What makes many DL systems black boxish is that (i) the layers don't (generally) naturally map to objects of interest or (ii) they do map back but no one has taken advantage of that mapping.

      I say generally here because there are definitely cases where the networks---e.g. recursive neural nets (RNNs)---are isomorphic to objects of interest---e.g., trees, such as in Iyyer et al. 2014, which Hal is on, Socher's dissertation under Manning and Ng, Bowman's work with Potts and Manning, etc. For my money, one of the important parts of this work is to recognize that it's not the nodes and connections in the neural net diagram that are interesting, it's the entire layers (mapped to, e.g., nodes in a tree) and the nature of the relationships between layers. With regard to the second kind of object, I think there could be legitimate interest in formal semantics in how these mappings---e.g., the tensors in recursive neural tensor networks---instantiate particular combinators---an investigation which could be guided by a wealth of already extant research in abstract algebra and category theory.

      More generally, it seems like one way at least semanticists can benefit from the technologies being developed by DL is to take their category theoretic representations and use results from Representation Theory to instantiate those categories as vector spaces and modules which DL methods can churn over. I think one place this might be beneficial is in approaching phenomena like S-selection and C-selection and constraints thereon. (Linking may be similarly approachable in this way.) This way of thinking puts the theory front and center in the sense that the networks at hand are just instantiations of theoretical objects and their relationships, and reconstructing our theoretical architectures in these sorts of models can augment our ability to do traditional distributional analysis at scale.

      I should say that models constructed in this way may not necessarily answer learning problems, so the L in DL may not be particularly accurate. And it may not be the case---in fact, it's probably not the case---that instantiating theoretical architecture would actually improve any particular system, but I think it's worth a look from theorists.
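      To make the layer-to-tree-node mapping concrete, here is a minimal sketch of recursive composition over a binary parse tree, in the spirit of the recursive-net work cited above. The dimensionality, the random weights, and the toy sentence are all illustrative assumptions, not any particular published model:

```python
import math
import random

random.seed(0)
D = 4  # dimensionality of node vectors (toy size, an assumption)

# One shared composition matrix: parent = tanh(W @ [left; right]).
# Each application of `compose` is one "layer," and each layer's
# activation vector corresponds to one node in the tree.
W = [[random.uniform(-0.1, 0.1) for _ in range(2 * D)] for _ in range(D)]

def compose(left, right):
    """Combine two child vectors into a parent vector."""
    child = left + right  # concatenation of the two D-dim children
    return [math.tanh(sum(w * x for w, x in zip(row, child))) for row in W]

# Toy word vectors standing in for learned embeddings.
the, cat, sleeps = ([random.uniform(-1, 1) for _ in range(D)] for _ in range(3))

# Compose bottom-up over the tree [[the cat] sleeps]:
np_vec = compose(the, cat)       # vector at the NP node
s_vec = compose(np_vec, sleeps)  # vector at the S node

print(len(s_vec))  # 4 -- each tree node is one D-dimensional activation
```

      The point of interest, as noted above, is not the individual units but the whole layer vectors and the (here, single shared) map between them; in published models that map is learned, and in tensor variants it is a bilinear rather than a linear-plus-tanh operation.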

    6. @Aaron Steven White Could you give the details of those references?

    7. Compositional semantics is now a very large area in DL, so this is only a small sample with big names not represented. Others that know this area (and adjacent ones dealing with distributed representations) might want to chime in with further references, but these are the ones I was thinking of in this context. (Disclaimer: most of these are not making theoretical claims. I think this is something we as theorists would need to employ these tools for ourselves.)

      Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher and Hal Daumé III (2014) A Neural Network for Factoid Question Answering over Paragraphs. Conference on Empirical Methods in Natural Language Processing (EMNLP).

      Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. (2013). Reasoning With Neural Tensor Networks For Knowledge Base Completion. In NIPS.

      Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

      Bowman, Samuel R., Christopher Potts, and Christopher D. Manning. "Recursive neural networks can learn logical semantics." ACL-IJCNLP 2015 (2015): 12.

      Bowman, Samuel R., Christopher D. Manning, and Christopher Potts. (2015). "Tree-structured composition in neural networks without tree-structured architectures." in NIPS.

      Similar ideas with explicit connections to category theory can be found in:

      Clark, S., Coecke, B., & Sadrzadeh, M. (2008). A compositional distributional model of meaning. In P. Bruza, W. Lawless, K. van Rijsbergen, D. Sofge, B. Coecke, & S. Clark (Eds.), Proceedings of the Second Symposium on Quantum Interaction (pp. 133–140). Oxford, England: College Publications.

      Clark, S., & Pulman, S. (2007). Combining symbolic and distributional models of meaning. In P. Bruza, W. Lawless, K. van Rijsbergen, D. Sofge, B. Coecke, & S. Clark (Eds.), Proceedings of the AAAI Spring Symposium on Quantum Interaction (pp. 52–55). Stanford, CA: AAAI Press.

    8. I second Aaron's recommendations. I'd also add the charmingly named Frege in Space: A Program for Compositional Distributional Semantics by Baroni, Bernardi and Zamparelli, an introduction that might be more accessible to linguists than some of the technical papers.

      This blog post by Ewan Dunbar discusses a paper by Nagamine, Seltzer and Mesgarani that explored the phonetic representations learned by a deep neural network that was trained as an acoustic model. He also discusses some interesting challenges in doing this kind of work.

  5. This comment has been removed by the author.

  6. If anything, the deep learning community is much more averse than the "old-fashioned" statistical NLP community to knowledge that's hard-coded into the model by the architect of the system. It's essentially an article of faith for deep learning: the whole pitch is that the system is supposed to learn the best representations for the data (and the task) automatically.

    From the empirical point of view, there's a debate on whether even having hierarchical structure as part of the architecture of the model is useful, or whether you can get away with modeling language as a linear phenomenon with a slightly fancier neural network that can keep track of long-distance dependencies. My reading of that debate is that it's not at all clear that hierarchy helps you perform many of the standard tasks, at least when evaluated using standard metrics. So I find it hard to imagine a system that explicitly incorporates specific insights from generative grammar, like the ECP or Principle A or subjacency, outperforming a massive "dumb" deep neural network.
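    For a toy illustration of where the linear and hierarchical views can come apart, consider subject-verb agreement across an intervening "attractor" noun. The micro-lexicon, the bracketing, and both heuristics below are illustrative assumptions, not a serious model of either camp:

```python
# Subject-verb agreement with an attractor: "the dogs near the cat run".
NUMBER = {"dogs": "plural", "cat": "singular"}  # toy lexicon (assumption)

sentence = ["the", "dogs", "near", "the", "cat", "run"]

def linear_guess(words, verb_idx):
    """A purely linear heuristic: agree with the closest preceding noun."""
    for w in reversed(words[:verb_idx]):
        if w in NUMBER:
            return NUMBER[w]

# The same string with the PP tucked inside the subject NP:
# [S [NP dogs [PP near [NP cat]]] [VP run]]
tree = ("S", ("NP", "dogs", ("PP", "near", ("NP", "cat"))), ("VP", "run"))

def hierarchical_guess(s_node):
    """Agree with the head of the subject NP, ignoring embedded material."""
    subject_np = s_node[1]
    return NUMBER[subject_np[1]]

print(linear_guess(sentence, 5))   # "singular" -- misled by the attractor
print(hierarchical_guess(tree))    # "plural"  -- reads off the NP head
```

    The nearest-noun strategy is misled by "cat," while reading agreement off the head of the subject NP gets "run" right; the empirical question in the debate is whether generic sequence models need that structure built into the architecture or can learn to ignore attractors on their own.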

    Finally, for what it's worth, I read Manning to be suggesting not that generative grammar can be used to improve deep learning, but that we should use neural networks to explain certain linguistic phenomena.