Sunday, June 22, 2014

Comments on lecture 2; part deux

In the first post (here), I discussed Chomsky’s version of Merge and the logic behind it. The main idea is that Merge, the conceptually simplest conception of recursion, has just the properties needed to explain why NL Gs generate unboundedly hierarchical structures, why NLs allow displacement and show reconstruction effects, and why rules of G are structure dependent. Not bad for any story. Really good (especially for DP concerns) if we get all of this from a very simple (nay, simplest) conception. In what follows I turn to a discussion of the last three properties Chomsky identified and see how he aims to account for them. I repeat them here for convenience.

(v)           its operations apply cyclically
(vi)          it can have lots of morphology
(vii)        in externalization only a single “copy” is pronounced

In contrast to the first four properties, the last three do not follow simply from the properties of the conceptually “simplest” combination operation. Rather Chomsky argues that they reflect principles of computational efficiency. Let’s see how.

With respect to (vii), Chomsky assumes that externalization (i.e. “vocalizing” the structures) is computationally costly. In other words, actually saying the structures out loud is hard. How costly? Well, it must be more costly than copy deletion at Transfer is. Here’s why. Given the copy theory as a consequence of Merge, FL must contain a procedure to choose which copy/occurrence is pronounced (note: this is not a conceptual observation but an inference based on the fact that typically only one copy is pronounced). This decision/choice, I assume, requires some computation. I further assume that choosing which copies/occurrences to externalize requires some computation that would not be required were all copies/occurrences pronounced. Chomsky’s assumption is that the cost of choosing is less than the cost of externalizing.  Thus, FL’s choice lowers overall computational cost.

Furthermore, we must also assume that the cost of pronunciation exceeds the computational cost of being misunderstood, for otherwise it would make sense for FL to facilitate parsing by pronouncing all the copies, or at least those that would facilitate a hearer’s parsing of our sentences. None of these assumptions are self-evidently true or false. Plus, the supposition that copy deletion is more computationally efficient than pronouncing all the copies would be does not follow simply from considerations of conceptual simplicity, at least as far as I can tell. It involves substantive assumptions about actual computational costs, for which, so far as I can tell, we have little independent evidence.

One more point: If copy deletion exists in Transfer to the CI interface (as Chomsky argued in his original 1993 paper, an assumption that underlies standard accounts of reconstruction effects and that, so far as I know, is still part of current theory) then in the normal case only a single copy/occurrence makes it to either interface, though the copy interpreted at CI can be different from the copy spoken at AP (and this is typically how displacement is theoretically described). But if this is correct, then it suggests that Chomsky’s argument here might need some rethinking. Why? If deletion is part of Transfer to CI then copy deletion cannot be simply a fact about the computational cost of externalization, as it applies to the mapping of linguistic objects to the internal thought system as well. It seems that copies per se are the problem, not just copies that must be pronounced.
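To make the externalization argument concrete, here is a toy sketch (my own illustration, not machinery from the lecture): given a chain of occurrences, Transfer hands only one of them (here, by stipulation, the highest) to externalization. The chain representation and the "pronounce the highest copy" rule are simplifying assumptions.

```python
# Toy model of copy deletion at Transfer (my illustration; the chain
# representation and the "highest copy wins" choice are assumptions,
# not claims about the actual theory).

def transfer_to_sm(chain):
    """Given a chain of occurrences (highest first), mark one copy
    for pronunciation and silence the rest."""
    head, *tail = chain
    return {"pronounce": head, "silent": tail}

# 'What did you buy <what>?' -- two occurrences of 'what', one chain;
# only the higher occurrence is externalized.
wh_chain = ["what@SpecCP", "what@ComplV"]
print(transfer_to_sm(wh_chain))
# {'pronounce': 'what@SpecCP', 'silent': ['what@ComplV']}
```

The point in the text is that this choice itself costs something, but (by assumption) less than pronouncing every member of the chain would.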

Before moving on to (v) and (vi) it is worth pausing to note that Chomsky’s discussion here resonates with pretty standard conceptions of computational efficiency (viz. he is making claims about how hard it is to do something). This moves away from the purely conceptual matters that motivated the discussion of the first four features of FL. There is a very interesting hypothesis that might link the two: that the simplest computational operation will necessarily be embedded in a computationally efficient system. This is along the lines of how I interpreted the SMT in earlier posts (linked to in the first part of this post).  However, whether or not you think this is feasible, it appears, at least to me, that there are two different kinds of arguments being deployed to SMT ends: a purely conceptual one and a more conventional “resource” argument.

Ok, let’s return to (v) and (vi). Chomsky suggests that considerations of computational efficiency also account for these properties of FL. In particular, they follow from something like the strict cycle as embodied in phase theory.  So the question is: what’s the relation between the strict cycle and efficient computation?

Chomsky supposes that the strict cycle, or something like it, is what we would expect from a computationally well-designed system. There are times when (to me) Chomsky sounds as if he is assuming that the conceptually simplest system will necessarily be computationally efficient.[1] I don’t see why. In particular, if I understand the lecture correctly, Chomsky is suggesting that the link between conceptual simplicity and computational efficiency should follow as a matter of natural law. Even if correct, it is clear that this line of reasoning goes considerably beyond considerations of conceptual simplicity. What I mean is that even if one grants that the simplest computational operation will be something like Merge, it does not follow that the simplest system that includes Merge will also incorporate the strict cycle.  Phases, then (Chomsky’s mechanism for realizing the strict cycle), are motivated not on grounds of conceptual simplicity alone but on grounds of efficiency (i.e. a well/optimally designed system will incorporate something like the strict cycle). So far as I can tell Chomsky does not explain the relation (if any) between conceptual simplicity and computational efficiency, though to be fair, I may be over-interpreting his intent here.

This said, how does the strict cycle bear on computational efficiency? It allows computational decisions to be made locally and incrementally. This is a generically nice feature for computational systems to have, for it simplifies computations.[2] Chomsky notes that it also simplifies the process of distinguishing two selections of the same expression from the lexicon from two occurrences of the same expression. How does it simplify it? By making the decision a bounded one. Distinguishing them, he claims, requires recalling whether a given occurrence/copy is a product of E- or I-Merge. If such decisions are made strictly cyclically (at every phase) then phases reduce memory demand: because phases are bounded, you need not retain information in memory regarding the provenance of a valued occurrence beyond the phase where an expression’s features are valued.[3] So phases ease the memory burdens that computations impose. Let me note again, without further comment, that if this is indeed a motivation for phases, then it presupposes some conception of performance, for only in this kind of context do resource issues (viz. memory concerns) arise. God has no need for bounding computation.

Now I have a confession to make.  I could not come up with a concrete example where this logic is realized involving DP copies, given standard views.  It’s easy enough to come up with a relevant case if e.g. reflexivization is a product of movement.[4] If reflexives involve A-chains with two thematically marked “links” then we need to distinguish copies from originals (e.g. Everyone loves himself differs from everyone loves everyone in that the first involves one selection of everyone from the lexicon (and so one chain with two occurrences of everyone) while the second involves two selections of everyone from the lexicon and so two different chains). However, if you don’t assume this, I personally had a hard time finding an example of what’s worrying Chomsky, at least with copies. This might mean that Chomsky is finally coming to his senses and appreciating the beauty of movement theories of Control and Binding OR it might mean that I am a bear of little brain and just couldn’t come up with a relevant case. I know which option I would bet on, even given my little brain, and it’s not the first. So, anyone with a nice illustration is invited to put it in the comments section or send it to me and I will post it. Thanks.
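For what it’s worth, the copies-vs-repetitions distinction itself is easy to state in token/type terms. Here is a toy sketch of my own (the class and the list-as-chain representation are illustrative assumptions, and the reflexive case presupposes the movement analysis just mentioned):

```python
# Toy illustration of copies vs. repetitions (my own sketch). Each
# selection from the lexicon is a fresh token; I-Merge re-uses a token.

class LexItem:
    """A lexical token: one selection from the lexicon."""
    def __init__(self, form):
        self.form = form

# 'everyone loves everyone': two selections -- same type, distinct tokens,
# hence two separate chains.
e1, e2 = LexItem("everyone"), LexItem("everyone")
assert e1.form == e2.form          # same lexical type (repetitions)
assert e1 is not e2                # but different tokens

# 'everyone loves himself' (on a movement analysis): one selection,
# internally merged -- two occurrences of the SAME token (copies).
e3 = LexItem("everyone")
chain = [e3, e3]                   # toy "chain" with two occurrences
assert chain[0] is chain[1]        # one chain, one token
```

Chomsky’s worry, as I read it, is about how much memory the grammar needs to keep this token-identity information around; phases would bound that memory.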

It is not hard to come up with cases that do not involve DPs, but the problem then is not distinguishing copies from originals. Take the standard case of Subject-Predicate agreement for example. Here the unvalued features of T are valued by the inherently valued features of the subject DP.  Once valued, the features on T and D are indistinguishable qua features. However, there is assumed to be an important difference between the two, one relevant to the interpretation at the CI interface. Those on D are meaning relevant but those on T are uninterpretable. What, after all, could it mean to say that the past tense is first person and plural?[5] If one assumes that all features at the interfaces must be interpretable at those interfaces if they make it there, then the valued features on T must disappear at Transfer to CI. But if (by assumption) they are indistinguishable from the interpretable ones on D, the computational system must remember how the features got onto T (i.e. by valuation rather than inherently). The ones that get there by valuation in the grammar must be removed or the derivation will not converge. Thus, Gs need to know how features get onto the expressions they sit on, and it would be very nice memory-wise if this were a bounded decision.
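Here is a toy reconstruction of the bookkeeping just described (my own sketch, not theory-internal machinery; the dict representation and function names are illustrative assumptions): valued-but-uninterpretable phi-features on T must be deleted before CI, which requires remembering how they got there, but only until the containing phase is transferred.

```python
# Toy model of feature valuation and phase-bounded deletion (my sketch).

def value_features(probe, goal):
    """Copy the goal's inherent phi-features onto the probe, recording
    (temporarily) that they arrived by valuation, not inherently."""
    probe["phi"] = dict(goal["phi"])
    probe["valued_in_syntax"] = True   # provenance: what must be remembered

def transfer_phase(items):
    """At the phase boundary, delete features that arrived by valuation
    (uninterpretable at CI) and forget all provenance information."""
    for item in items:
        if item.pop("valued_in_syntax", False):
            item.pop("phi")
    return items

T = {"cat": "T"}
D = {"cat": "D", "phi": {"person": 3, "number": "pl"}}
value_features(T, D)      # T's phi is now indistinguishable from D's,
                          # except for the provenance record
transfer_phase([T, D])    # D keeps its interpretable phi; T loses its copy
```

The memory-saving point is that the `valued_in_syntax` record never has to survive past `transfer_phase`, so the decision is bounded by the phase.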

Before moving on, it’s worth noting that even this version of the argument is hardly straightforward. It assumes that phi-features on T are uninterpretable and that these cause derivations to crash (rather than, for example, converge as gibberish) (also see note 5). It also requires that deletion not be optional, for otherwise there would be derivations where all the good features remained on all of the right objects and all of the uninterpretable ones freely deleted. Nor does it allow Transfer (which, after all, straddles the syntax and CI) to peek at the meaning of T during Transfer, thereby determining which features are interpretable on which items and so which should be deleted and which retained. Note that such a peek-a-boo decision to delete during Transfer would be very local, relying just on the meaning of T and the meaning of phi-features. Were this possible, we could delay Transfer indefinitely. So, to make Chomsky’s argument we must assume that Transfer is completely “blind” to the interpretation of the syntactic objects at every point in the syntactic computation, including the one that interfaces with CI. This amounts to a very strong version of the autonomy of syntax thesis, one in which no part of the syntax, even the rules that directly interface with the interpretive interfaces, can see any information that the interfaces contain.[6]

Let’s return to the main point. Must the simplest system imaginable be computationally efficient? It’s not clear. One might imagine that the conceptually “simplest” system would not worry about computational efficiency at all (damn memory considerations!). The simplest system might just do whatever it can and produce whatever structured products it can without complicating FL with considerations of resource demands like memory burdens. True, this might render some products of FL unusable or hard to use (and so we would probably perceive them as unacceptable) but then we just wouldn’t use them (sort of like what we say about self-embedded clauses).  So, for example, we would tend not to use sentences with multiple occurrences of the same expressions where this made life computationally difficult (e.g. you would not talk about two Norberts in the same sentence). Or without phases we might leave to context the determination of whether an expression is a copy or a lexical primitive, or we might allow Transfer to see if features on an expression were kosher or not. At any rate, it seems to me that all of these options are as conceptually “simple” as adding phases to FL unless, of course, phases come for free as a matter of “natural law.”  I confess to being skeptical about this supposition. Phases come with a lot of conceptual baggage, which I personally find quite cumbersome (reminds me of Barriers actually, not one of the aesthetic high points in GG (ugh!)). That said, let’s accept that the “simplest” theory comes with phases.

As Chomsky notes, phases themselves come with complex properties.  For example, phases bring with them a novel operation, feature lowering, which now must be added to the inventory of FL operations. However, feature lowering does not seem to be either a conceptually simple or a cognitively/computationally generic kind of operation. Indeed, it seems (at least to me) quite linguistically parochial. This, of course, is not a good thing if one’s sights are set on answering Darwin’s problem.  If so, phases don’t fit snugly with the SMT. This does not mean that there are no phases. It just means that they complicate matters conceptually and pull against Chomsky’s first conceptual argument wrt Merge.

Again, let’s put this all aside and assume that strict cyclicity is a desirable property to have and that phases are an optimal way of realizing it. Chomsky then asks: how do we identify phases? He argues that we can identify phases by their heads, as phase heads are where unvalued features live. Thus a phase is the minimal domain of a phase head with unvalued features.[7] A possible virtue of this way of looking at things is that it might provide a way of explaining why languages contain so much morphology. Morphemes are the adventitious by-products of identifying the units/domains of the optimal computational system.  Chomsky notes that what he means by morphology is abstract (a la Vergnaud), so a little more has to be said, especially given that externalization is costly, but it’s an idea in an area where we don’t have many (see here).[8]

One remark: on this reconstruction of Chomsky’s arguments, unvalued features play a very big role. They identify phases, which implement strict cyclicity, and they are the source of overt morphology.  I confess to being wary here. Chomsky originally introduced unvalued features to replace uninterpretable ones. Now he assumes that features are both +/- valued and +/- interpretable. As unvalued features are always uninterpretable, this seems like an unwanted redundancy in the feature system.  At any rate, as Chomsky notes, uninterpretable features really do look sort of strange in a perfect system. Why have them only to get rid of them?  Chomsky’s big idea is that they exist to make FL computationally efficient. Color me very unconvinced.

So this is the main lay of the land. I should mention that, as others have pointed out (especially Dennis O), part of Chomsky’s SMT argument here (i.e. the one linked to conceptual simplicity concerns) is different from the interpretation of the SMT that I advanced in other posts (here, here, here).  Thus, my version is definitely NOT the one that Chomsky elaborates when considering these. However, there is a clear second strand dealing with pretty standard efficiency concerns, and here my speculations and his might find some common ground. That said, Chomsky’s proposals rest heavily on certain assumptions about conceptual simplicity, and of a very strong kind. In particular, Chomsky’s argument rests on a very aggressive use of Occam’s razor.  Here’s what I mean. The argument he offers is not that we should adopt Merge because all other notions are too complex to be biologically plausible units of genetic novelty. Rather, he argues that in the absence of information to the contrary, Occamite considerations should rule: choose the simplest (not just a simple) starting point and see where you get. Given that we don’t know much about how operations that describe the phenotype (the computational properties of FL) relate to the underlying biological substrate that is the thing that actually evolved, it is not clear (at least to me) how to weight such strong Occamite considerations. They are not without power, but, to me at least, we don’t really know how to assess whether all things are indeed equal or how seriously to take this very strong demand for simplicity.

Let me end by fleshing this out a bit.  I confess to not being moved by Chomsky’s conceptual simplicity arguments. There are lots of simple starting points (even if some may be simpler than others). Ordered pairs are not that much more conceptually complex than sets. Symmetric operations are not obviously simpler than asymmetric ones, especially given that it appears that syntax abhors symmetry (see Moro and Chomsky). So, the claim that we need to start with the conceptually simplest conception of “combination,” and that this means an operation that creates sets of expressions, seems based on weak considerations. IMO, we should be looking for basic concepts that are simple enough to address DP (and there may be many) and evaluate them in terms of how well they succeed in unifying the various apparently disparate properties of FL. Chomsky does some of this here, and it’s great. But we should not stop here. Let me give an example.

One of the properties that modern minimalist theory has had trouble accounting for is the fact that the unit of syntactic movement/interpretation/deletion is the phrase. We may move heads, but we typically move/delete phrases. Why? Right now standard minimalist accounts have no explanation on hand. We occasionally hear about “pied piping” but more as an exercise in hand waving than in explanation. Now, this feature of FL is not exactly difficult to find in NL Gs. That constituency matters is one of the obvious facts about how displacement/deletion/binding operates. There is a simple story about this that labels and headedness can be used to deliver.[9] If this means that we need a slightly less conceptually simple starting point than sets, then so be it.

More generally: the problem that motivates the minimalist program is DP. To address DP we need to factor out most of the linguistically specific structure of FL and attribute it to more cognitively generic operations (and/or, if Chomsky is right, natural laws).  What’s simple in a DP context is not what is conceptually most basic, but what is simple given what our ancestors had available cognitively about 100k years ago. We need a simple addition to this, not something that is conceptually simple tout court.[10]  In this context it’s not clear to me that adding a set construction operation (which is what Merge amounts to) is the simplest evolutionary alternative. Imagine, for example, that our forebears already had an iterative concatenation operation.[11]  Might not some addition to this be just as simple as adding Merge in its entirety? Or imagine that our ancestors could combine lexical atoms together into arbitrarily big unstructured sets; might not an addition that allowed that operation to yield structured sets be just as simple in the DP context as adding Merge? Indeed, it might be simpler, depending on what was cognitively available in the mental life of our ancestors.  And while we are at it, how “simple” is an operation that forms arbitrary sets from atoms and other sets?  Sets may be simple objects with just the properties we need, but I am not sure that operations that construct them are particularly simple.[12]
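The contrast between an inherited iterative operation and a structure-building addition can be sketched as follows. This is a toy of my own ("labeling" as the novelty follows the suggestion in note 11, and the list/tuple representations are illustrative assumptions):

```python
# Toy contrast (my sketch) between flat iteration (birdsong-style
# concatenation) and labeled hierarchical combination.

def iterate(song, syllable):
    """Flat concatenation: output is arbitrarily long but has no
    internal hierarchy."""
    return song + [syllable]

def merge_labeled(head, other):
    """Hierarchical combination: output is a constituent labeled by
    its head (a toy convention)."""
    return (head, other)

# Iteration: unbounded, but flat.
song = []
for s in ["la", "ti", "do", "la"]:
    song = iterate(song, s)
assert song == ["la", "ti", "do", "la"]

# Recursion with labels: unbounded AND hierarchical.
vp = merge_labeled("loves", "Mary")
tp = merge_labeled("T", merge_labeled("John", vp))
assert tp == ("T", ("John", ("loves", "Mary")))
```

The DP-style question is then whether the step from `iterate` to `merge_labeled` is a smaller evolutionary addition than adding set-forming Merge wholesale.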

Ok, let me end this much too long second post. And moreover, let me end on a very positive note. In the second lecture Chomsky does what we all should be doing when we are doing minimalist syntax. He is interested in finding simple computational systems that derive the basic properties of FL. He concentrates on some very interesting key features: unbounded hierarchy, displacement, reconstruction, etc., and makes concrete proposals (i.e. he offers a minimalist theory) that seem plausible. Whether he is right in detail is less important IMO than that his ambitions and methods are worth copying. He identifies non-trivial properties of FL that GG has discovered over the last 60 years and he tries to explain why they should exist.  This is exactly the right kind of thing MPers should be doing. Is he right? Well, let’s just say that I don’t entirely agree with him (yet!). Does lecture 2 provide a nice example of what MP research should look like? You bet. It identifies real deep properties of FL and sees how to derive them from more general principles and operations. If we are ever to solve Darwin’s problem, we will need simple systems that do just what Chomsky is proposing.

[1] Note, we want the “necessarily” here. That a system is both simple and efficient does not explain why it need be efficient if simple.
[2] It is also a necessary condition for incrementality in the use systems (e.g. parsing), as Bill Idsardi pointed out to me.  I know that the SMT does not care about use systems according to some (Dennis and William this is a shout-out to you), but this is a curious and interesting fact nonetheless.  Moreover, if I am right that the last three properties do not follow (at least not obviously) from conceptual considerations, it seems that Chomsky might be pursuing a dual route strategy for explaining the properties of FL.
[3] Note that this assumes that there is no syntactic difference between inherent features and features valued in the course of the derivation.
[4] And even this requires a special version of the theory, one like Idsardi and Lidz’s rather than Zwart’s.
[5] However, if v raised to T before Transfer then one might try and link these features to the thematic argument that v licenses. And then it might make lots of sense to say that phi-features are interpretable on T. They would say that the variable of the predicate bound by the subject must have such and such an interpretation. This information might be redundant, but it is not obviously uninterpretable.
[6] The ‘autonomy of syntax’ thesis refers to more than one claim. The simplest one is that syntactic primitives/operations are not reducible to phonetic or semantic ones. This is not the version adverted to above, which is a more specific version of the thesis, one that requires a complete separation between syntactic and semantic information in the course of a derivation. Note that the idea that one can add EPP/edge features only if doing so affects interpretation (the Reinhart-Fox view that Chomsky has at times endorsed) violates this strong version of the autonomy thesis.
[7] Note, we still need to define ‘domain’ here.
[8] Note, incidentally, that Chomsky assumes both that features are +/- valued and that they are +/- interpretable. At one time, the former was considered a substitute for the latter. Now, it seems, both are theoretically required. As -valued features seem always to be -interpretable, this seems like an unwanted redundancy.
[9] I provide a story here based on labels and minimality.
[10] A question: we can define ordered pairs set theoretically. I assume the argument against labels is that ordered sets are conceptually more complex than unordered sets. So {a,b} is conceptually simpler than {a,{a,b}}.  If this is the argument, it is very very subtle. I find it hard to believe that whereas the former is simple enough to be biologically added, the latter is not. Or even that the relative simplicity of the two could possibly matter. Ditto for other operations like concatenation in place of Merge as the simplest operation.  Given how long this post is already, I will refrain from elaborating these points here.
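The footnote’s comparison can be made concrete (my own illustration): {a, {a, b}} behaves like an ordered pair built from nothing but set formation (it is a variant of Kuratowski’s {{a}, {a, b}} coding), whereas {a, b} is order-free. The function names below are mine.

```python
# Toy comparison (mine) of unordered Merge output {a, b} with the
# "labeled" object {a, {a, b}} discussed in the footnote.

def merge(x, y):
    """Simplest Merge: unordered set formation."""
    return frozenset([x, y])

def labeled_merge(label, other):
    """{label, {label, other}}: a set-theoretic coding of the ordered
    pair (label, other), i.e. an asymmetric, labeled object."""
    return frozenset([label, merge(label, other)])

ab = merge("a", "b")
a_ab = labeled_merge("a", "b")
assert ab == merge("b", "a")              # symmetric: order-free
assert a_ab != labeled_merge("b", "a")    # asymmetric: order encoded
assert ab in a_ab                         # built from the same parts
```

Both objects are one set-formation step apart, which is the footnote’s point: it is hard to believe their relative simplicity could matter biologically.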
[11] Birds (and mice and other animals) can string “syllables” together (put them together in a left/right order) to make songs. From what I can tell, there is no hard upper bound on how many syllables can be so combined.  These do not display hierarchy, but they may be recursive in the sense that the combination operation can iterate. Might it not be possible that what we find in FL builds on this iteration operation? That the recursion we find in FL is iteration plus something novel (I have suggested labeling is the novelty)? My point here is not that this is correct, but that the question of simplicity in a DP context need not just be a matter of conceptual simplicity.  
[12] How are sets formed? How computationally simple is the comprehension axiom in set theory, for example? It is actually logically quite involved (see here). I ask because Merge is a set forming operation, so the relevant question is how cognitively complex it is to form arbitrary sets. We have been assuming that this is conceptually simple and hence cognitively easy. However, it is worth considering just how easy. The Wikipedia entry suggests that it is not a particularly simple operation. Sets are funny things and what mental powers go into being able to construct them is not all that clear.


  1. Depending on one's assumptions regarding case and regarding how internal arguments of nouns become possessors, the following may be the example you're looking for:

    (1) John[1]'s arrest t[1]

    (2) John[1]'s arrest of [John][2]

    So, to be (only very slightly) more concrete: if the "of" in (2) is just the case morphology given to DPs that have not A-moved out of [Compl,N], then syntactically, (1) and (2) are distinguished only by the fact that the two "John"s in (1) are copies of the same object and the two "John"s in (2) are not.

  2. This is a question that has always puzzled me but that probably has a straightforward answer: In what sense can notions such as computational efficiency be applied to things which are not to be interpreted as corresponding to "actual processes" in the (vague) sense of performance?

    To elaborate a bit, I feel as if Norbert shares part of my puzzlement when saying "it presupposes some conception of performance for only in this kind of context do resource issues (viz. memory concerns) arise", though in a slightly more limited context than I think is appropriate. It's not just memory; even search seems, to me at least, to only make sense when thinking about performance -- "God has no need for bounding search". Or does he?

    1. @BB
      As you say, I sort of agree with you (though I'm not sure what you mean by "actual" above. Real? The ones we use? If so, yes). Search, memory load, etc. only make sense to me in the context of some system that uses the G. That's why I tried to suggest in earlier posts that we understand the SMT as committing hostages to the kinds of issues that Berwick, Wexler, Weinberg, DeMarcken etc. discussed so fruitfully. So, I agree.

      What I did not fully appreciate is that Chomsky wants to get a lot of mileage out of conceptual simplicity concerns. Of the 7 properties he discusses, he believes 4 follow directly from the "simplest" conception of the combine operation. Say we agree; then these aspects of FL have little to do with resource issues. They are, as it were, purely facts about the data structures and the kinds of info they code. My own view is that even wrt these we can peek at their performance implications (the NTC in particular). However, one need not. When it comes to phases and copy deletion, however, I think that even Chomsky is thinking in a performance-y manner, albeit at a very abstract level. I personally don't think that 'search' is the right way to put things. But I do think that bounding computation is a good idea for finite minds like ours. If resources are infinite (God?) then computational cost is irrelevant. But then if minds are infinite do we even need to recursively specify anything? I can see why only the brave indulge in metaphysics!

    2. One issue, not entirely clear to me, is how the deletion operation is implemented. This is not part of narrow syntax, right? That consists only of Merge. So the deletion operation is properly part of the interface(s), correct? I don't know about deletion occurring at CI, but if deletion occurs at SM, then presumably this interface can appropriately take into account the externalization system's computational costs. So, I suppose we have notions of computational efficiency/optimality that occur at multiple levels, both at narrow syntax (Merge) and at the interfaces, each with different notions of efficiency at play.

    3. @ William
      That's a good point. It's not clear what to make of deletion processes. One option is that there IS a deletion operation that cleans the syntactic phrase marker up. So something like FULL INTERPRETATION, read as saying that the interface is passive and reads all that it gets, suggests that there is some pre-interface process that cleans the relevant representations up. This is assuredly NOT Merge. But then, feature lowering is not Merge either, yet it is part of the syntactic computation, so the syntax contains more than Merge. Ditto with Probing and Agreeing. So, Merge may be the newbie on the block but it is not the only thing that FL does.

      I think that I agree that there are various notions of "complexity" that Chomsky is playing with. And they may respond to different concerns. A big hypothetical way of putting these all together would be to argue that systems with the conceptually simplest rules are necessarily embedded in systems with optimal computational properties. Maybe. It's logically possible. But we would need an argument.