Monday, April 18, 2016

Two interviews with Chomsky on linguistics (and politics)

Here is a recent interview with Chomsky (thx to Elika for the link) where he talks about things like Big Data in linguistics, Experimental Syntax, islands, superiority and other things. The interview (Jeff Runner doing the ling questioning) is short but interesting.

He makes at least three important points.

First, that there is a difference between data collection and scientific experimentation. The idea, implicit in most of the big data PR, is that one can collect data quite a-theoretically and expect to gain scientific insight. As Chomsky notes, this runs against the accumulated wisdom of the last 500 years of scientific research. As he compactly put it:
...theory-driven experimental investigation has been the nature of the sciences for the last 500 years.
Quite right. Experiments are not just looking. They are looking with an attitude and the tude is a function of theory.

Second, much of what linguists study has NO relevant data in any conceivable corpus. He cites the ECP, but this is just the tip of a very large iceberg. If there is no relevant data, then big data collection is beside the point:
In linguistics we all know that the kind of phenomena that we inquire about are often exotic. They are phenomena that almost never occur. In fact, those are the most interesting phenomena, because they lead you directly to fundamental principles. You could look at data forever, and you’d never figure out the laws, the rules, that are structure dependent. Let alone figure out why. And somehow that’s missed by the Silicon Valley approach of just studying masses of data and hoping something will come out. It doesn’t work in the sciences, and it doesn’t work here.
Let me underline one point Chomsky makes: it's the manufactured experimental data that is important to gaining insight. As in the other sciences, linguists create data not found in the wild and use this factitious data to understand what is happening. Real life data is often (IMO, generally) useless because it is too complex. The aim of good data is to reduce irrelevant interference effects that arise from the interaction of many component causes. Real life data is just that: too complex. In linguistics, of particular importance is negative data: data showing that some structure is unacceptable or cannot have a specific meaning. This is not the kind of data that Big Data can get, because it is data that is missing from everyday usage of language. And yes, PoS arguments are built from this kind of data, and that is why they are so useful.

Third, I am still not sure what Chomsky's take on island effects is. One of the interesting debates in the Sprouse and Hornstein volume revolved around whether these were reducible to simple complexity effects. My read on this is that Sprouse and Wagers and Phillips got the better of the discussion and that reducing islands to complexity just wasn't going to fly. I'd be interested to know what others think.

At any rate, take a quick look, as it is short and interesting.

Chomsky's recent Sophia Lectures are another excellent source of Chomsky syntax speculation. The lectures (plus an excellent interview by Naoki Fukui and Mihoko Zushi) are contained in volume 64 of Sophia Linguistica. I have no online link, unfortunately. But I recommend getting hold of the volume and reading it. Interesting stuff.


  1. apparently the lectures are recorded somewhere at this link, but my Japanese is nonexistent, so I have no idea where!

  2. update: this being a very impressive university, they have automatic translation and the lectures are at this link

  3. With regard to big data, I had separate cause today to look at Eldredge & Gould's paper on punctuated equilibria (a fun read), which in the opening gives a quote from Darwin that I'm now determined to trot out on occasion: "about thirty years ago there was much talk that geologists ought only to observe and not theorize; and I well remember someone saying that at this rate a man might as well go into a gravel-pit and count the pebbles and describe the colours. How odd it is that anyone should not see that all observation must be for or against some view if it is to be of any service."

  4. Well yes, but geology (and taxonomic biology) are nevertheless very heavily based on observation, while requiring underlying ideas and also, sometimes, calling for experiments. And many of the interesting things in syntax, such as apparent oblique subjects in Icelandic, are also extremely common in texts; then one wants to do experiments to find out more about what they really are. I think we need to make a distinction between Big Data (irrelevant for theoretical syntax, I'd agree) and Little Data, i.e. corpora of up to, say, 20 million words, which are highly relevant because they are the primary data for acquisition.