Wednesday, October 30, 2013

Some AI history

This recent piece (by James Somers) on Douglas Hofstadter (DH) has a brief review of AI from its heady days (when the aim was to understand human intelligence) to its lucrative days (when the goal shifted to cashing out big time). I have not personally been a big fan of DH's early stuff, for I thought (and wrote here) that early AI, the one with cognitive ambitions, had problems identifying the right problems for analysis and that it massively oversold what it could do. However, in retrospect, I am sorry that it faded from the scene, for though there was a lot of hype, the ambitions were commendable and scientifically interesting. Indeed, lots of good work came out of this tradition. Marr and Ullman were members of the AI lab at MIT, as were Marcus and Berwick. At any rate, Somers gives a short history of the decline of this tradition.

The big drop in prestige occurred, Somers notes, in about the early 1980s. By then "AI…started to…mutate…into a subfield of software engineering, driven by applications…[the] mainstream had embraced a new imperative: to make machines perform in any way possible, with little regard for psychological plausibility (p. 3)." The turn from cognition was ensconced in the conviction that "AI started working when it ditched humans as a model, because it ditched them (p. 4)." Machine translation became the poster child for how AI should be conducted. Somers gives a fascinating thumbnail sketch of the early system (called 'Candide' and developed by IBM) whose claim to fame was that it found a way to "avoid grappling with the brain's complexity" when it came to translation. The secret sauce, according to Somers? Machine learning! This process deliberately avoids worrying about anything like the structures that languages deploy or the competence that humans must have to deploy them. It builds on the discovery that a system that "almost doesn't work: a machine…that randomly spits out French words for English words" can be tweaked "using millions of pairs of sentences…[to] gradually calibrate your machine, to the point where you'll be able to enter a sentence whose translation you don't know and get a reasonable result…[all without] ever need[ing] to know why the knobs should be twisted this way or that (pp. 10-11)."

For this all to work requires "data, data, data" (as Norvig is quoted as saying). Take "simple machine learning algorithms" plus 10 billion training examples, "[and] it all starts to work. Data trumps everything," as Josh Estelle at Google is quoted as noting (p. 11).
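The "knob twisting" Somers describes can be made concrete with a toy version of the kind of word-alignment model behind systems like Candide. Below is a minimal sketch in the style of IBM Model 1, trained with EM; the three-sentence corpus and all numbers are invented for illustration (real systems use millions of sentence pairs), and none of this is a claim about how any production system actually works:

```python
# Toy sketch of IBM Model 1-style EM training: the "knobs" are the
# translation probabilities t(f|e), calibrated from parallel sentences
# with no grammatical knowledge whatsoever.
from collections import defaultdict

corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

# Uniform initialization of t(f|e) over the French vocabulary.
f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalize over possible alignments
            for e in es:
                c = t[(f, e)] / z  # expected count of e aligning to f
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():  # M-step: re-estimate the knobs
        t[(f, e)] = c / total[e]

# Co-occurrence statistics alone pick out "livre" as the best
# rendering of "book", despite the uniform start.
best = max(f_vocab, key=lambda f: t[(f, "book")])
print(best)  # -> livre
```

The point of the sketch is Somers's point: nothing in the loop knows what a noun is, or why "livre" is right; the counts alone twist the knobs.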

According to Somers, these machine-learning techniques are valued precisely because they allow serviceable applications to be built by abstracting away from the hard problems of human cognition and neuro-computation. Moreover, the practitioners of the art know this. These are not taken to be theories of thinking or cognition. And, if this is so, there is little reason to criticize the approach. Engineering is a worthy endeavor, and if we can make life easier for ourselves in this way, who could object? What is odd is that these same techniques are now often recommended for their potential insight into human cognition. In other words, a technique that was adopted precisely because it could abstract from cognitive details is now being heralded as a way of gaining insight into how minds and brains function. However, the techniques described here will seem insightful only if you take minds/brains to gain their structure largely via environmental contact. Thinking, from this perspective, is just "data, data, data" plus the simple systems that process it.

As you may have guessed, I very much doubt that this will get us anywhere. Empiricism is the problem, not the solution. Interestingly, if Somers is right, AI's pioneers, the people who moved away from its initial goals and deliberately moved it in a more lucrative engineering direction, knew this very well. It seems that it has taken a few generations to lose this insight.


  1. The article's discussion of machine translation is just intellectually lazy. (Confession: I skipped over the cult-of-personality stuff, which I find unbelievably boring.) The gap between the Candide system (hardly the "straightforward" system the article describes; the IBM team never fully described it in print, and there are still ongoing debates about exactly how to make the "models 1-5" work) and Google Translate is enormous. This gap is not the result of "m0ar data" or a bigger machine learning hammer battering the nails of translation, but rather the use of representations from formal linguistics. State-of-the-art systems depend on morphological decomposition and perform translation using synchronous mildly context-sensitive grammars. And any linguist in 1990 could have told you that you'd need both. The same is true in speech synthesis and recognition, where despite the (apocryphal) quote about firing linguists to make the recognizer better, everything is phonemes, allophones, features, and rewrite rules.

    _Cartesian Linguistics_ has some appropriately harsh words for those who think "analogy" is an insightful term.

    1. I'm not really following the SMT literature very closely, so this might be an obvious (and slightly embarrassing) question, but what exactly do you have in mind when you say

      "state-of-the-art systems [...] perform translation using synchronous mildly-context sensitive grammars."

      I was under the impression that as of now, Phrase-Based translation is still considered state of the art, and that all the work using Synchronous _Context-Free_ Grammars is, well, somewhat academic at the moment, and already poses non-trivial issues for scalability. So which paper should I read to not embarrass myself at the next ACL when casually chatting to some SMT people?

    2. We might just have different ideas of what "state-of-the-art" means. What I have in mind (and, I believe, this is the normal understanding of the term) is something like the highest degree of technical perfection yet achieved. Scalability is orthogonal (and I agree that MCSGs have obvious scaling problems compared to flatter models). Hierarchical phrase-based systems (still linguistics-inspired and sufficient to prove my point) are probably most widely used "in production" (I don't have inside information about what's in APIs like Google Translate).

      Synchronous CFGs (not MCSGs) were the "academic" systems half a decade ago. ACL reviewing is very conservative and you can hardly publish unless you show improvements over the previous system; from this you can infer that they showed improvements over the phrase-based systems that were their predecessors (e.g., David Chiang's 2005 paper, which shows improvements over an earlier, non-hierarchical phrase-based system, and so on). DeNeefe & Knight (2009, EMNLP) already show improvements for using synchronous MCSGs. While I haven't been following this as closely since then, I assume there have been further improvements.

    3. The scaling problems of MCSGs are probably so severe that the EMNLP paper you cited doesn't really use them --- the title is somewhat misleading, in that they end up using Synchronous TIGs, which are essentially context-free (and make parsing feasible).

      I googled a bit, and the closest to really using MCSGs for MT I could find is this, at not WMT but the Workshop on Syntax, Semantics and Structure in Statistical Translation:

      It ends with "It remains to determine how such more accurate and more expressive models relate to translation quality."

      My point wasn't really to contest what you're saying, it just surprised me and I was wondering whether I had missed any important recent development(s).

      Perhaps somebody with more inside knowledge could add some comments regarding syntax and machine translation?

  2. Thanks. This is very useful. Any thoughts as to why the perception still exists that when you fire a linguist your machine translation scores become better?

    1. There is a general bias among hard scientists to assume that they can hack major problems in softer sciences without using any of the domain-specific knowledge of said softer science. This is perhaps just hubris. Some recent examples include the Harvard physicists who solved language change back in 2009 (Nature, of course) or the Harvard machine-learning specialists who just solved autism diagnosis (and published the results in a Nature-subjournal).

      More specifically, in the ACL world, it is very hard to publish something that is not an incremental improvement on a previous system using the same task definition and data. And there is a bias to assume that an incremental improvement is due to changes in machine-learning methods rather than to things like how you split the data into test and training subsets, or how you deal with capitalization (which are considered unimportant, despite considerable evidence to the contrary). These latter things are almost never described in publications, making most of the field non-replicable without the help of cooperative authors. There have recently been some papers complaining about this, so I'm hoping it will get better soon.
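As a footnote to the synchronous-grammar exchange in the comment thread above: the basic idea can be illustrated with a toy synchronous CFG, in which a single derivation builds the source and target strings in parallel, and reordering (e.g., French noun-adjective order versus English adjective-noun order) falls out of co-indexed nonterminals. The four-rule grammar below is invented purely for illustration and bears no resemblance to a production MT system:

```python
# Toy synchronous CFG: each rule pairs a source right-hand side with a
# target right-hand side; tuples like ("N", 0) are nonterminal slots,
# and slots sharing an index must expand with the same subderivation.
RULES = {
    "S":  ((("NP", 0),),                    (("NP", 0),)),
    "NP": (("la", ("N", 0), ("A", 1)),      ("the", ("A", 1), ("N", 0))),
    "N":  (("maison",),                     ("house",)),
    "A":  (("bleue",),                      ("blue",)),
}

def derive(sym):
    """Expand one nonterminal, returning (source_words, target_words)."""
    src_rhs, tgt_rhs = RULES[sym]
    subs = {}  # index -> subderivation, shared across both sides

    def expand(rhs, side):
        out = []
        for item in rhs:
            if isinstance(item, tuple):          # a co-indexed nonterminal
                nt, idx = item
                if idx not in subs:
                    subs[idx] = derive(nt)       # derive it once
                out.extend(subs[idx][side])      # reuse it on both sides
            else:                                # a terminal word
                out.append(item)
        return out

    return expand(src_rhs, 0), expand(tgt_rhs, 1)

src, tgt = derive("S")
print(" ".join(src))  # -> la maison bleue
print(" ".join(tgt))  # -> the blue house
```

The NP rule is the whole point: the noun and adjective slots appear in opposite orders on the two sides, so the grammar itself encodes the reordering that a flat phrase-based model would have to memorize phrase by phrase.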