Friday, December 2, 2016

What's a minimalist analysis

The proliferation of handbooks on linguistics identifies a gap in the field. There are so many now that there is an obvious need for a handbook of handbooks consisting of papers that are summaries of the various handbook summaries. And once we take this first tiny recursive step, as you all know, the sky’s the limit.

You may be wondering why this thought crossed my mind. Well, it’s because I’ve been reading some handbook papers recently and many of those that take a historical trajectory through the material often have a penultimate section (before a rousing summary conclusion) with the latest minimalist take on the relevant subject matter. So, we go through the Standard Theory version of X, the Extended Standard Theory version, the GB version and finally an early minimalist and late minimalist version of X. This has naturally led me to think about the following question: what makes an analysis minimalist? When is an analysis minimalist and when not? And why should one care?

Before starting let me immediately caveat this. Being true is the greatest virtue an analysis can have. And being minimalist does not imply that an analysis is true. So not being minimalist is not in itself necessarily a criticism of any given proposal. Or at least not a decisive one. However, it is, IMO, a legit question to ask of a given proposal whether and how it is minimalist. Why? Well because I believe that Darwin’s Problem (and the simplicity metrics it favors) is well-posed (albeit fuzzy in places) and therefore that proposals dressed in assumptions that successfully address it gain empirical credibility. So, being minimalist is a virtue and suggestive of truth, even if not its guarantor.[1]

Perhaps I should add that I don’t think that anything guarantees truth in the empirical sciences and that I also tend to think that truth is the kind of virtue that one only gains slantwise. What I mean by this is that it is the kind of goal one attains indirectly rather than head on. True accounts are ones that economically cover reasonable data in interesting ways, shed light on fundamental questions and open up new avenues for further research.[2] If a story does all of that pretty well then we conclude it is true (or well on its way to it). In this way truth is to theory what happiness is to life plans. If you aim for it directly, you are unlikely to get it. Sort of like trying to fall asleep. As insomniacs will tell you, that doesn’t work.

That out of the way, what are the signs of a minimalist analysis (MA)? We can identify various grades of minimalist commitment.

The shallowest is technological minimalism. On this conception an MA is minimalist because it expresses its findings in terms of ‘I-merge’ rather than ‘move,’ ‘phases’ rather than ‘bounding nodes’/‘barriers,’ or ‘Agree’ rather than ‘binding.’ There is nothing wrong with this. But depending on the details there need not be much that is distinctively minimalist here. So, for example, there are versions of phase theory (so far as I can tell, most versions) that are isomorphic to previous GB theories of subjacency, modulo the addition of v as a bounding node (though see Barriers). The second version of the PIC (i.e. where Spell Out is delayed to the next phase) is virtually identical to 1-subjacency and the number of available phase edges is identical to the specification of “escape hatches.”

Similarly for many Agree based theories of anaphora and/or control. In place of local coindexing we express the identical dependency in terms of Agree in probe/goal configurations (antecedents as probes, anaphors as goals)[3] subject to some conception of locality. There are differences, of course, but largely the analyses inter-translate and the novel nomenclature serves to mask the continuity of the proposed account with prior analyses. In other words, what makes such analyses minimalist is less a grounding in basic features of the minimalist program than a technical isomorphism between current and earlier technology. Or, to put this another way, when successful, such stories tell us that our earlier GB accounts were no less minimalist than our contemporary ones. Or, to put this yet another way, our current understanding is no less adequate than our earlier understanding (i.e. we’ve lost nothing by going minimalist). This is nice to know, but given that we thought that GB left Darwin’s Problem (DP) relatively intact (this being the main original motivation for going minimalist, i.e. beyond explanatory adequacy), analyses that are effectively the same as earlier GB analyses likely leave DP in the same opaque state. Does this mean that translating earlier proposals into current idiom is useless? No. But such translations often make only a modest contribution to the program as a whole given the suppleness of current technology.

There is a second more interesting kind of MA. It starts from one of the main research projects that minimalism motivates. Let’s call this “reductive” or “unificational minimalism” (UM). Here’s what I mean.

The minimalist program (MP) starts from the observation that FL is a fairly recent cognitive novelty and thus what is linguistically proprietary is likely to be quite meager. This suggests that most of FL is cognitively or computationally general, with only a small linguistically specific residue. This suggests a research program given a GB backdrop (see here for discussion). Take the GB theory of FL/UG to provide a decent effective theory (i.e. descriptively pretty good but not fundamental) and try to find a more fundamental one that has these GB principles as consequences.[4] This conception provides a two pronged research program: (i) eliminate the internal modularity of GB (i.e. show that the various GB modules are all instances of the same principles and operations (see here)) and (ii) show that of the operations and principles that are required to effect the unification in (i), all save one are cognitively and/or computationally generic. If we can successfully realize this research project then we have a potential answer to DP: FL arose with the adventitious addition of the linguistically proprietary operation/principle to the cognitive/computational apparatus the species antecedently had.

Those are the main contours of the research program. UM concentrates on (i) and aims to reduce the different principles and operations within FL to the absolute minimum. It does this by proposing to unify domains that appear disparate on the surface and by reducing G options to an absolute minimum.[5] A reasonable heuristic for this kind of MA is the idea that Gs never do things in more than one way (e.g. there are not two ways (viz. via matching or raising) to form relative clauses). This is not to deny that different surface patterns obtain, only that they are not the products of distinctive operations.

Let me put this another way: UM takes the GB disavowal of constructions to the limit. GB eschewed constructions in that it eliminated rules like Relativization and Topicalization, seeing both as instances of movement. However, it did not fully eliminate constructions for it proposed very different basic operations for (apparently) different kinds of dependencies. Thus, GB distinguishes movement from construal and binding from control and case assignment from theta checking. In fact, each of the modules is defined in terms of proprietary primitives, operations and constraints. This is to treat the modules as constructions. One way of understanding UM is that it is radically anti-constructivist and recognizes that all G dependencies are effected in the same way. There is, grammatically speaking, only ever one road to Rome.

Some of the central results of MP are of this ilk. So, for example, Chomsky’s conception of Merge unifies phrase structure theory and movement theory. The theory of case assignment in the Black Book unifies case theory and movement theory (the former being just a specific reflex of movement) in much the way that move alpha unifies question formation, relativization, topicalization etc. The movement theory of control and binding unifies both modules with movement. The overall picture then is one in which binding, structure building, case licensing, movement, and control “reduce” to a single computational basis. There aren’t movement rules versus phrase structure rules versus binding rules versus control rules versus case assignment rules. Rather these are all different reflexes of a single Merge effected dependency with different features being licensed via the same operation. It is the logic of On wh movement writ large.

There are other examples of the same “less is more” logic: the elimination of D-structure and S-structure in the Black Book, Sportiche’s recent proposals to unify promotion and matching analyses of relativization, unifying reconstruction and movement via the copy theory of movement (in turn based on a set theoretic conception of Merge), Nunes’ theory of parasitic gaps, and Sportiche’s proposed elimination of late merger, to name five. All of these are MAs in the specific sense that they aim to show that rich empirical coverage is compatible with a reduced inventory of basic operations and principles and that the architecture of FL as envisioned in GB can be simplified and unified, thereby advancing the idea that a (one!) small change to the cognitive economy of our ancestors could have led to the emergence of an FL like the one that we have good (GB) evidence to think is ours. Thus, MAs of the UM variety clearly provide potential answers to the core minimalist DP question and hence deserve their ‘minimalist’ modifier.

The minimalist ambitions can be greater still. MAs have two related yet distinct goals. The first is to show that svelter Gs do no worse than the more complex ones that they replace (or at least don’t do much worse).[6] The second is to show that they do better. Chomsky contrasted these in chapter three of the Black Book and provided examples illustrating how doing more with less might be possible. I would like to mention a few by way of illustration, after a brief running start.

Chomsky made two methodological observations. First, if a svelter account does (nearly) empirically as well as a grosser one then it “wins” given MP desiderata. We noted why this was so above regarding DP, but really nobody considers Chomsky’s scoring controversial given that it is a lead footed application of Ockham. Fewer assumptions are always better than more for the simple reason that for a given empirical payoff K an explanation based on N assumptions leaves each assumption with greater empirical justification than one based on N+1 assumptions. Of course, things are hardly ever this clean, but often they are clean enough and the principle is not really contestable.[7]
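One crude way to see the arithmetic here, assuming (an idealization that is mine, not Chomsky’s) that the empirical payoff K is credited evenly across the assumptions that jointly deliver it:

```latex
\frac{K}{N} \;>\; \frac{K}{N+1} \qquad \text{for any payoff } K > 0 \text{ and any } N \ge 1
```

So each of the N assumptions earns a larger share of the support than each of the N+1 would. Real assumptions are of course not equally weighted, which is part of why things are hardly ever this clean.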

However, Chomsky’s point extends this reasoning beyond simple assumption counting. For MP it’s not only the number of assumptions that matters but their pedigree. Here’s what I mean. Let’s distinguish FL from UG. Let ‘FL’ designate whatever allows the LAD to acquire a particular G_L based on PLD_L. Let ‘UG’ designate those features of FL that are linguistically proprietary (i.e. not reflexes of more generic cognitive or computational operations). An MA aims to reduce the UG part of FL. In the best case, it contains a single linguistically specific novelty.[8] So, it is not just a matter of counting assumptions. Rather what matters is counting UG (i.e. linguistically proprietary) assumptions. We prefer those FLs with minimal UGs and minimal language specific assumptions.

An example of this is Chomsky’s arguments against D-structure and S-structure as internal levels. Chomsky does not deny that Gs interface with interpretive interfaces, rather he objects to treating these as having linguistically special properties.[9] Of course, Gs interface with sound and meaning. That’s obvious (i.e. “conceptually necessary”). But this assumption does not imply that there need be anything linguistically special about the G levels that do the interfacing beyond the fact that they must be readable by these interfaces. So, any assumption that goes beyond this (e.g. the theta criterion) needs defending because it requires encumbering FL with UG strictures that specify the extras required. 

All of this is old hat, and, IMO, perfectly straightforward and reasonable. But it points to another kind of MA: one that does not reduce the number of assumptions required for a particular analysis, but that reapportions the assumptions between UGish ones and generic cognitive-computational ones. Again, Chomsky’s discussions in chapter 3 of the Black Book provide nice examples of this kind of reasoning, as does the computational motivation for phases and Spell Out.

Let me add one more (and this will involve some self referentiality). One argument against PRO based conceptions of (obligatory) control is that they require a linguistically “special” account of the properties of PRO. After all, to get the trains to run on time PRO must be packed with features which force it to be subject to the G constraints it is subject to (PRO needs to be locally minimally bound, occurs largely in non-finite subject positions, and has very distinctive interpretive properties). In other words, PRO is a G internal formative with special G sensitive features (often of the possibly unspecified phi-variety) that force it into G relations. Thus, it is MP problematic.[10] Thus a proposal that eschews PRO is prima facie an MA story of control for it dispenses with the requirement that there exists a G internal formative with linguistically specific requirements.[11] I would like to add, precisely because I have had skin in this game, that this does not imply that PRO-less accounts of control are correct or even superior to PRO based conceptions. No! But it does mean that eschewing PRO has minimalist advantages over accounts that adopt PRO, as PRO-less accounts minimize the UG aspects of FL when it comes to control.

Ok, enough self-promotion. Back to the main point. The point is not merely to count assumptions but to minimize UGish ones. In this sense, MAs aim to satisfy Darwin more than Ockham. A good MA minimizes UG assumptions and does (about) as well empirically as more UG encumbered alternatives. A good sign that a paper is providing an MA of this sort is a manifest concern to minimize the UG nature of the principles assumed.

Let’s now turn to (and end with) the last, most ambitious MA: it is one that not merely does (almost) as well as more UG encumbered accounts, but does better. How can one do better? Recall that we should expect MAs to be more empirically brittle than less minimalist alternatives given that MP assumptions generally restrict an account’s descriptive apparatus.[12] So, how can a svelter account do better? It does so by having more explanatory oomph (see here). Here’s what I mean.

Again, the Black Book provides some examples.[13] Recall Chomsky’s discussion of examples like (1) with structures like (2):

(1)  John wonders how many pictures of himself Frank took
(2)  John wonders [[how many pictures of himself] Frank took [how many pictures of himself]]

The observation is that (1) has an idiomatic reading just in case Frank is the antecedent of the reflexive.[14] This can be explained if we assume that there is no D-structure level or S-structure level. Without these, binding and idiom interpretation must be defined over that G level that is input to the CI interface. In other words, idiom interpretation and binding are computed over the same representation and we thus expect that the requirements of each will affect the possibilities of the other.

More concretely, getting the idiomatic reading of take pictures requires using the lower copy of the wh phrase. Getting John as a potential antecedent of the reflexive requires using the higher copy. If we assume that only a single copy can be retained on the mapping to CI, this implies that if take pictures of himself is understood idiomatically, Frank is the only available local antecedent of the reflexive. The prediction relies on the assumption that idiom interpretation and binding exploit the same representation. Thus, by eliminating D-structure, the theory can no longer make D-structure the locus of idiom interpretation and by eliminating S-structure, the theory cannot make it the locus of binding. Thus by eliminating both levels the proposal predicts a correlation between idiomaticity and reflexive antecedence.

It is important to note that a GBish theory where idioms are licensed at D-structure and reflexives are licensed at S-structure (or later) is compatible with Chomsky’s reported data, but does not predict it. The relevant data can be tracked in a theory with the two internal levels. What is missing is the prediction that they must swing together. In other words, the MP story explains what the non-MP story must stipulate. Hence, the explanatory oomph. One gets more explanation with less G internal apparatus.
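The difference between predicting the correlation and merely being compatible with it can be made concrete with a toy enumeration. What follows is a minimal sketch, not anyone’s actual analysis: the function names, the two-copy encoding and the assumption that the literal reading is available with either copy are all mine, introduced only for illustration.

```python
from itertools import product

# Toy encoding (all names hypothetical): the wh-phrase has two copies.
COPIES = ("higher", "lower")


def idiom_readings(copy):
    """Readings of 'take pictures' licensed by a given copy.

    Assumption: the idiomatic reading needs the lower copy; the literal
    reading is available either way.
    """
    return {"literal", "idiomatic"} if copy == "lower" else {"literal"}


def antecedent(copy):
    """Local antecedent of 'himself' given the copy used for binding."""
    return "Frank" if copy == "lower" else "John"


def single_level():
    """One retained copy feeds both idiom interpretation and binding."""
    return {
        (reading, antecedent(copy))
        for copy in COPIES
        for reading in idiom_readings(copy)
    }


def two_levels():
    """Idioms and binding consult separate levels, so the copy relevant
    to each is chosen independently."""
    return {
        (reading, antecedent(binding_copy))
        for idiom_copy, binding_copy in product(COPIES, repeat=2)
        for reading in idiom_readings(idiom_copy)
    }


if __name__ == "__main__":
    print("single level:", sorted(single_level()))
    print("two levels:  ", sorted(two_levels()))
    print("extra pairing:", two_levels() - single_level())
```

On this encoding the two-level theory generates one pairing that the single-level theory cannot: the idiomatic reading with John as antecedent. To match the reported data the two-level theory has to rule that pairing out by stipulation, whereas the single-level theory never generates it.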

There are other examples of this kind of reasoning, but not that many.  One of the reasons I have always liked Nunes’ theory of parasitic gaps is that it explains why they are licensed only in overt syntax. One of the reasons that I like the Movement Theory of Control is that it explains why one finds (OC) PRO in the subject position of non-finite clauses. No stipulations necessary, no ad hoc assumptions concerning flavors of case, no simple (but honest) stipulations restricting PRO to such positions. These are minimalist in a strong sense.

Let’s end here. I have tried to identify three kinds of MAs. What makes proposals minimalist is that they either answer or serve as steps towards answering the big minimalist question: why do we have the FL we have? How did FL arise in the species? That’s the question of interest. It’s not the only question of interest, but it is an important one. Precisely because the question is interesting it is worth identifying whether and in what respects a given proposal might be minimalist. Wouldn’t it be nice if papers in minimalist syntax regularly identified their minimalist assumptions so that we could not only appreciate their empirical virtuosity, but could also evaluate their contributions to the programmatic goals?

[1] If pressed (even slightly) I might go further and admit that being minimalist is a necessary condition of being true. This follows if you agree that the minimalist characterization of DP in the domain of language is roughly accurate. If so, then true proposals will be minimalist for only such proposals will be compatible with the facts concerning the emergence of FL. That’s what I would argue, if pressed.
[2] And if this is so, then the way one arrives at truth in linguistics will plausibly go hand in hand with providing answers to fundamental problems like DP. Thus, proposals that are minimalist may thereby have a leg up on truth. But, again, I wouldn’t say this unless pressed.
[3] The Agree dependency here established is accompanied by a specific rule of interpretation whereby agreement signals co-valuation of some sort. This, btw, is not a trivial extra.
[4] This parallels the logic of On wh movement wrt islands and bounding theory. See here for discussion.
[5] Sportiche (here) describes this as eliminating extrinsic theoretical “enrichments” (i.e. theoretical additions motivated entirely by empirical demands).
[6] Note that a priori one expects simpler proposals to be empirically less agile than more complex ones and therefore to cover less data. Thus, if a cut down account gets roughly the same coverage this is a big win for the more modest proposal.
[7] Indeed, it is often hard to individuate assumptions, especially given different theoretical starting points. However (IMO surprisingly), this is often doable in practice so I won’t dwell on it here.
[8] I personally don’t believe that it can contain less for it would make the fact that nothing does language like humans do a complete mystery. This fact strongly implies (IMO) that there is something UGishly special about FL. MP reasoning implies that this UG part is very small, though not null. I assume this here.
[9] That’s how I understand the proposal to eliminate G internal levels.
[10] It is worth noting that this is why PRO in earlier theories was not a lexical formative at all, but the residue of the operation of the grammar. This is discussed in the last chapter here if you are interested in the details.
[11] One more observation: this holds even if the proposed properties of PRO are universal, i.e. part of UG. The problem is not variability but linguistic specificity.
[12] Observe that empirical brittleness is the flip side of theoretical tightness. We want empirically brittle theories.
[13] The distinction between these two kinds of MAs is not original with me but clearly traces to the discussion in the Black Book.
[14] I report the argument. I confess that I do not personally get the judgments described. However, this does not matter for purposes of illustration of the logic.

Wednesday, November 23, 2016

Some material to point to when the uninformed say that GG is dead, which it isn't

Here are several pieces by our own estimable Jeff Lidz that fight the good fight against the forces of darkness and ignorance.  We need much more of this. We need to get stuff into popular venues defending the work that we have done.[1]

The most important is this piece in Scientific American rebutting the profoundly ignorant and pernicious piece by Ibbotson and Tomasello (I&T). (See here and here and here for longer discussion.) Jeff does an excellent job of pointing out the issues and debunking the “arguments” that I&T advance. It is amazing, IMO, that T’s views on these issues still garner any attention. That attention no doubt arises from the fact that he has done good work on non-linguistic topics. However, his criticisms of GG are both of long standing and of very low quality, and have been so for as long as they have been standing. So, it is good to see that Sci Am has finally opened up its pages to those willing to call junk junk. Read it and pass it around widely in your intellectual community.

Here are two other pieces (here and here). The latter is a response to this. This all appears in PNAS. The articles are co-authored with Chung-hye Han and Julien Musolino. The discussion is an accessible entry into the big issues GG broaches for the scientifically literate non GGer. As such, it is excellent for publicity purposes.

So, read and disseminate widely. It is important to call out the idiocy out there. It is even fun.

[1] I have a piece in Current Affairs with Nathan Robinson that I will link to when it is available on the web.

Monday, November 21, 2016

Two things to read

Here are a pair of easy pieces to look at.

The first (here) is a review by Steven Mithen (SM) of a new book on human brain size.  The received wisdom has been that human brains are large compared to our body size. The SM review argues that this is false. The book by Suzana Herculano-Houzel, a neuroscientist from Brazil, makes two important points (and I quote):

(i) What is perhaps more astounding than that number itself, one that is actually less than the often assumed 100 billion neurons, is that 86 billion makes us an entirely typical primate for our size, with nothing special about our brain at all, so far as overall numbers are concerned. When one draws a correlation between body mass and brain mass for living primates and extinct species of Homo, it is not humans—whose brains are three times larger than those of chimpanzees, their closest primate relative—that are an outlier. Instead, it is the great apes—gorillas and the orangutan—with brains far smaller than would be expected in relation to their body mass. We are the new normal in evolution while the great apes are the evolutionary oddity that requires explanation.
(ii) But we remain special in another way. Our 86 billion neurons need so much energy that if we shared a way of life with other primates we couldn’t possibly survive: there would be insufficient hours in the day to feed our hungry brain. It needs 500 calories a day to function, which is 25 percent of what our entire body requires. That sounds like a lot, but a single cupful of glucose can fuel the brain for an entire day, with just over a teaspoon being required per hour. Nevertheless, the brains of almost all other vertebrates are responsible for a mere 10 percent of their overall metabolic needs. We evolved and learned a clever trick in our evolutionary past in order to find the time to feed our neuron-packed brains: we began to cook our food. By so doing, more energy could be extracted from the same quantity of plant stuffs or meat than from eating them raw. 
What solved the energy problem? Cooking. So, the human brain size to body mass ratio is normal but the energy the brain uses is off the charts. Cooking, then, becomes part of the great leap forward.

The review (and the book) sound interesting. For the minimalistically inclined the last paragraph is particularly useful. It seems that the idea that language emerged very recently is part of the common physical anthro world view. Here's SM's prose:
If a new neuronal scaling rule gave us the primate advantage at 65 million years ago, and learning to cook provided the human advantage at 1.5 million years ago, what, one might ask, gave us the “Homo sapiens advantage” sometime around 70,000 years ago? That was when our ancestors dispersed from Africa, to ultimately replace all other humans and reach the farthest corners and most extreme environments of the earth. It wasn’t brain size, because the Neanderthals’ matched Homo sapiens. My guess is that it may have been another invention: perhaps symbolic art that could extend the power of those 86 billion neurons or maybe new forms of connectivity that provided the capacity for language.
So about 70kya something happened that gave humans, with their new big energy consuming brains, another leg up. This adventitious change was momentous. What was it? Who knows. The aim of the Minimalist Program is to abstractly characterize what this could have been. It had to be small given the short time span. This line of reasoning seems to be less and less controversial. Of course what the right characterization of the change is at any level of abstraction is still unclear. But it's nice to know the problem is well posed.

Here's a "humorous" piece by Rolf Zwaan by way of Andrew Gelman. It's a sure fire recipe for getting things into the top journals. It focuses on results in "social priming" but I bet clever types can make the required adaptations for their particular areas of interest. My only amendment would be regarding the garnish in point 2. I believe that Greek Philosophers really are best.

Have a nice Thanksgiving (if you are in the USA). I will be off for at least a week until the turkey festivities end.

Sunday, November 20, 2016

Revisiting Gallistel's conjecture

I recently received two papers that explore Gallistel’s conjecture (see here for one discussion) concerning the locus of neuronal computation. The first (here) is a short paper that summarizes Randy’s arguments and suggests a novel view of synaptic plasticity. The second (here)[1] accepts Randy’s primary criticism of neural nets and couples a neural net architecture with a pretty standard external memory system. Let me say a word about each.

The first paper is by Patrick Trettenbrein (PT) and it appears in Frontiers in Systems Neuroscience. It does three things.

First, it reviews the evidence against the idea that brains store information in their “connectivity profiles” (2). This is the classical assumption that inter-neural connection strengths are the locus of information storage. The neurophysiological mechanisms for this are long term potentiation (LTP) and long term depression (LTD). LTP/D are the technical terms for whatever strengthens or weakens interneuron connections/linkages. I’ve discussed Gallistel and Matzel’s (G&M) critique of the LTP/D mechanisms before (see here). PT reviews these again and emphasizes G&M’s point that there is an intimate connection between this Hebbian “fire together wire together” LTP/D based conception of memory and associationist psychology. As PT puts it: “Crucially, it is only against this background of association learning that LTP and LTD seem to provide a neurobiologically as well as psychologically plausible mechanism for learning and memory” (88). This is why if you reject associationism and endorse “classical cognitive science” and its “information processing approach to the study of the mind/brain” you will be inclined to find contemporary connectionist conceptions of the brain wanting (3).

Second, there is recent evidence that connection strength cannot be the whole story. PT reviews the main evidence. It revolves around retaining memory traces despite very significant alterations in connectivity profiles. So, for example, “memories appear to persist in cell bodies and can be restored after synapses have been eliminated” (3), which would be odd if memories lived in the synaptic connections. Similarly it has recently been shown that “changes in synaptic strength are not directly related to storage of new information in memory” (3). Finally, and I like this one the best (PT describes it as “the most challenging to the idea that the synapse is the locus of memory in the brain”), PT quotes a 2015 paper by Bizzi and Ajemian which makes the following point:

If we believe that memories are made of patterns of synaptic connections sculpted by experience, and if we know, behaviorally, that motor memories last a lifetime, then how can we explain the fact that individual synaptic spines are constantly turning over and that aggregate synaptic strengths are constantly fluctuating?

Third, PT offers a reconceptualization of the role of these neural connections. Here’s an extended quote (5):

…it occurs to me that we should seriously consider the possibility that the observable changes in synaptic weights and connectivity might not so much constitute the very basis of learning as they are the result of learning.

This is to say that once we accept the conjecture of Gallistel and collaborators that the study of learning can and should be separated from the study of memory to a certain extent, we can reinterpret synaptic plasticity as the brain's way of ensuring a connectivity and activity pattern that is efficient and appropriate to environmental and internal requirements within physical and developmental constraints. Consequently, synaptic plasticity might be understood as a means of regulating behavior (i.e., activity and connectivity patterns) only after learning has already occurred. In other words, synaptic weights and connections are altered after relevant information has already been extracted from the environment and stored in memory.

This leaves a place for connectivity, but not as the mechanism of memory but as what allows memories to be efficiently exploited.[2] Memories live within the cell but putting these to good use requires connections to other parts of the brain where other cells store other memories. That’s the basic idea. Or as PT puts it (6):

The role of synaptic plasticity thus changes from providing the fundamental memory mechanism to providing the brain’s way of ensuring that its wiring diagram enables it to operate efficiently…

As PT notes, the Gallistel conjecture and his tentative proposal are speculative, as theories of the relevant cell internal mechanisms don’t currently exist. That said, neurophysiological (and computational, see below) evidence against the classical Hebbian view is mounting, and the serious problems for storing memories in usable form in connection strengths (the basis of Gallistel’s critique) are becoming more and more widely recognized.

This brings us to the second Nature paper noted above. It endorses the Gallistel critique of neural nets and recognizes that neural net architectures are poor ways of encoding memories. It adds a conventional RAM to a neural net and this combination allows the machine to “represent and manipulate complex data structures.”

Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data.
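To make the phrase “external memory matrix” a bit more concrete, here is a toy sketch of one ingredient of this kind of design: content-based soft addressing, where a key is compared with every memory row and reads and writes are spread over rows in proportion to similarity. This is an illustration under stated assumptions, not the DNC as published. The function names (`cosine`, `address`, `read`, `write`), the `sharpness` parameter and the 3×3 memory are all hypothetical, nothing here is learned, and the actual model has further machinery that is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def address(memory, key, sharpness=10.0):
    """Soft addressing: a softmax over the similarity of each memory row to the key."""
    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, weights):
    """The read vector is the weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]

def write(memory, weights, erase, add):
    """Erase-then-add update; each row changes in proportion to its weight."""
    return [
        [m * (1 - w * e) + w * a for m, e, a in zip(row, erase, add)]
        for w, row in zip(weights, memory)
    ]

# Hypothetical 3-slot memory, each slot holding a 3-dimensional vector.
memory = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

weights = address(memory, key=[0.0, 1.0, 0.0])   # row 1 matches the key best
recalled = read(memory, weights)                 # close to the contents of row 1
updated = write(memory, weights, erase=[1.0, 1.0, 1.0], add=[0.0, 0.5, 0.5])
```

The point of the sketch is only that the stored vectors sit in addressable slots that can be retrieved and overwritten, rather than being smeared across connection weights. In the sketch the key and the erase/add vectors are supplied by hand; in a trained system of the kind the abstract describes they would be produced by the network itself, which is where "learn to do so from data" comes in.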

Note that the system is still “associationist” in that learning is largely data driven (and as such will necessarily run into PoS problems when applied to any interesting cognitive domain like language), but it at least recognizes that neural nets are not good for storing information. This latter is Randy’s point. The paper is significant because it comes from Google’s DeepMind project, and this means that Randy’s general observations are making intellectual inroads with important groups. Good.

However, this said, these models are not cognitively realistic for they still don’t make room for the domain specific knowledge that we know characterizes (and structures) different domains. The main problem remains the associationism that the Google model puts at the center of the system. As we know that associationism is wrong and that real brains characterize knowledge independently of the “input,” we can be sure that this hybrid model will need serious revision if intended as a good cog-neuro model.

Let me put this another way. Classical cog sci rests on the assumption that representations are central to understanding cognition. Fodor and Pylyshyn and Marcus long ago argued convincingly that connectionism did not successfully accommodate representations (and, recall, connectionists agreed that their theories dumped representations) and that this was a serious problem for connectionist/neural net architectures. Gallistel further argued that neural nets were poor models of the brain (i.e. not only of the mind) because they embody a wrong conception of memory, one that makes it hard to read/write/retrieve complex information (data structures) in usable form. This, Gallistel noted, starkly contrasts with more classical architectures. The combined Fodor-Pylyshyn-Marcus-Gallistel critique, then, is that connectionist/neural net theories were a wrong turn because they effectively eschewed representations, and that this is a problem from both the cognitive and the neuro perspective. The Google Nature paper effectively concedes this point, recognizes that representations (i.e. “complex data structures”) are critical, and resolves the problem by adding a classical RAM to a connectionist front end.

However, there is a second feature of most connectionist approaches that is also wrong. Most such architectures are associationist. They embody the idea that brains are entirely structured by the properties of the inputs to the system. As PT puts it (2):

Associationism has come in different flavors since the days of Skinner, but they all share the fundamental aversion toward internally adding structure to contingencies in the world (Gallistel and Matzel 2013).

Yes! Connectionists are weirdly attracted to associationism as well as to rejecting representations. This is probably not that surprising. Once one thinks in terms of representations, it quickly becomes clear that many of their properties are not reducible to statistical properties of the inputs. Representations have formal properties above and beyond what one finds in the input, which, once you look, are found to be causally efficacious. However, strictly speaking, associationism and anti-representationalism are independent dimensions. What makes Behaviorists distinctive among Empiricists is their rejection of representations. What unifies all Empiricists is their endorsement of associationism. Seen from this perspective, Gallistel and Fodor and Pylyshyn and Marcus have been arguing that representations are critical. The Google paper agrees. This still leaves associationism, however, a position the Googlers embrace.[3]

So is this a step forward? Yes. It would be a big step forward if the information processing/representational model of the mind/brain became the accepted view of things, especially in the brain sciences. We could then concentrate (yet again) all of our fire on the pernicious Empiricism that so many Cog-neuro types embrace.[4] But, little steps my friends, little steps. This is a victory of sorts. Better to be arguing against Locke and Hume than Skinner![5]

That’s it. Take a look.

[1] Thx to Chris Dyer for bringing the paper to my attention. I put the URL up rather than linking to the paper directly as the linking did not seem to work. Sorry.
[2] Redolent of a competence/performance distinction, isn’t it?  The physiological bases of memory should not be confused with the physical bases for the deployment of memory.
[3] I should add that it is not clear that the Googlers care much about the cog-neuro issues. Their concerns are largely technological, it seems to me. They live in a Big Data world, not one where PoS problems (are thought to) abound. IMO, even in a uuuuuuge data environment, PoS issues will arise, though finding them will take more cleverness. At any rate, my remarks apply to the Google model as if intended as a cog-neuro one.
[4] And remember, as Gallistel notes (and PT emphasizes) much of the connectionism one sees in the brain sciences rests on thinking that the physiology has a natural associationist interpretation psychologically. So, if we knock out one strut, the other may be easier to dislodge as well (I know that this is wishful thinking btw).
[5] As usual, my thinking on these issues was provoked by some comments by Bob Berwick. Thx.