Tuesday, June 16, 2015

Science "fraud"

Here's another piece on this topic in the venerable (i.e. almost always behind the curve) NYT. The point worth noting is the sources of the "crimes and misdemeanors," what drives retraction and lack of replicability in so many domains. Curiously, outright fraud is not the great plague. Rather (surprise surprise), the main problems come from data manipulation (i.e. abuse and misuse of stats), plagiarism and (I cannot fathom why this is a real problem) publishing the same results in more than one venue. Outright fraud comes in at number four, and the paper does not actually quantify how many such papers there are. So, if you want to make the data better, then beware of statistical methods! They are very open to abuse, whether as data trimming or fishing for results. This does not mean that stats are useless. Of course they aren't. But they are tools open to easy abuse and misunderstanding. This is worth keeping in mind when the stats inclined rail against our informal methods. It's actually easy to control for bad data in linguistics. To repeat, I am all in favor of using stat tools if useful (e.g. see the last post), but as is evident, data is not without its own problems just because it is "statistically" represented.

Last point: the NYT reports some dissenters' opinions regarding how serious this problem really is. People delight in being super concerned about these sorts of problems. As I have stated before, I am still not convinced that this is a real problem. The main problem is not bad data but very bad theories. When you know nothing, bad data matters. When you do, much much less. Good theory (even at the MLG level) purifies data. The problem is less bad data than the idea that data is the ultimate standard. There is one version of this which is unarguable (that facts matter). But there is another, the strong data-first version, that is pernicious (every data point is sacrosanct). The idea behind the NYT article seems to be that if we are only careful and honest, all the data we produce will be good. This is bunk. There is no substitute for thinking, no matter how much data one gets. And stats hygiene will not make this fact go away.


  1. What do you mean when you write that you cannot fathom why publishing the same results in more than one venue is a problem? Do you mean that you can't fathom how this could be one of the main problems in science? Or do you mean that you don't think it's a problem to publish the same thing in multiple venues?

    1. It's not the same kind of problem as the other three. Fake data and cooked data and stolen data seem different in kind from repeated data.

    2. Okay, I see what you're getting at. But given the strong incentives to publish (or perish), and given that there is a premium on pages in (good) journals, I can see how this problem would fit in with the others. It's a kind of faking of reputation rather than of data. Although it doesn't mislead like faked or cooked data, it does, like stolen data, take up publication space in a problematic way.

    3. Yes, it's not a good thing to do, but put this into the hands of the fraud police? Well, the world is getting nuts.

    4. (a) The world is going nuts. (b) It might actually be a good idea to provide multiple presentations of basically the same stuff for audiences with different backgrounds. The problem would not appear anywhere near as serious were it not for idiotic reputation measurement methods (i.e. stats abuse).