Nosek on Truth, Science, and Academic Incentives

Sep 10 2012

Brian Nosek of the University of Virginia talks with EconTalk host Russ Roberts about how incentives in academic life create a tension between truth-seeking and professional advancement. Nosek argues that these incentives create a subconscious bias toward making research decisions in favor of novel results that may not be true, particularly in empirical and experimental work in the social sciences. In the second half of the conversation, Nosek details some practical innovations occurring in the field of psychology, to replicate established results and to publicize unpublished results that are not sufficiently exciting to merit publication but that nevertheless advance understanding and knowledge. These include the Open Science Framework and PsychFileDrawer.

LISTEN NOW:

Comment

●

READ TRANSCRIPT

●

DELVE DEEPER

DOWNLOAD

Time	Podcast Episode Highlights
0:33	Intro. [Recording date: September 4, 2012.] Russ: Our topic today is the reliability of research findings in the social sciences, particularly psychology; though I'm sure we'll be talking about economics as well. And our discussion will be based on a recent paper you wrote with Jeffrey Spies, and Matt Motyl, "Scientific Utopia, Part II: Restructuring Incentives and Practices to Promote Truth over Publishability," prepared for a special issue of Perspectives on Psychological Science. I'd like you to start by telling the story you tell at the beginning of your paper on your research finding where you looked at the ability of political extremists on the Left and the Right to detect shades of gray. Literal shades of gray, using their physical vision. How did that experiment work? What were you trying to do? Guest: We were interested in a very popular area of research in psychology right now, which is in embodiment--the sense that many of our social concepts can actually have a physical basis in our everyday activity and our physiology. And so we have an interest in political ideology, and so we recruited participants from the political Left, Center, and Right and had them do a very simple task. The task is to look at a word that was printed in a shade of gray and then match the shade of gray on a slider bar, from very dark to very light. And when the person thought that it was the shade they had selected, matched the shade of the word, then they would Enter it. And what we calculated was the accuracy of the person's perception of the shade of gray. And what we found in our initial study was that political moderates were more accurate in estimating the shades of gray than people on the political Left or Right. And so we interpret it, as you introduced, that political extremists see the world, literally, as more black and white than moderates do. Not just figuratively. That this has a physiological basis in some way. Russ: An incredible finding. And you are going to be famous as a result. That's extraordinary. Guest: Yeah. We planned our career banquets and the awards that we were going to receive based on finding this amazing result. We were stunned by it. Russ: But something happened on the way to the gravy train. What was that? Guest: Yeah. Well, in all of social sciences a recent topic of conversation is: How reproducible is the science? Can we replicate results that we've found? And there are many reasons that results might not replicate so easily. And we are in a fortunate position in our laboratory that we have very easy access to data collection. We run websites where lots and lots of people come to visit and try out our studies and do different things. And so it's very easy for us to do a replication. And so we thought what we should do; and even though we got a clear result the first time, we'll just run it again just to make sure that we get the same result again before we submit our results to a publication. And so we did. We ran the same study over again using a very slightly different sample, but otherwise very similar; using plenty of power and plenty of participants to detect the result of the effect's size. And it did not replicate. We didn't get the result again. Russ: You got no effect whatsoever. No difference in the ability of moderates compared to extremists. Guest: Right. A very ordinary result that would not change our careers one bit. Russ: And so you threw those results out. Because obviously those results weren't interesting or important, and you just published the ones you found in the first run-through. Correct? Guest: I wish we had done that. We didn't. Because we knew we had collected those results and our labmates knew we had collected those results. Russ: Darn! You shouldn't have told them. Guest: I know. It was a big, big mistake. It was a mixed bag. Our initial result could still be true. It could be the case that political extremists view the world in black and white to a greater extent than moderates do. But this null results provides some pause. And we don't have an obvious alternative explanation for why we didn't get it the second time. Because we did the same thing. It was on the same kind of infrastructure; we had the same procedure. So many things were in common between the two studies. Really, the only difference was that we had run it again with a new sample of participants. But we ran plenty of participants. So, at minimum it provides some caution in taking the first result as truth. That perhaps that was a false positive. Perhaps it occurred by chance. And so, it might be true. But we don't know it's true. And so it's going to be much harder to publish now with both of these together. Because a reviewer would reasonably say: Well, hang on a second now; maybe it's not true. Try to replicate it again. Russ: So, you said at a minimum, it would have called it into question. At a maximum, what would you conclude? I don't know if maximum is the right word; but the other extreme would be that that first result, something was just wrong. It wasn't accurate, it wasn't reliable, it wasn't true. Guest: Right. Right. And even if we'd done everything right--even if our analysis, our data collection was good and sound, our procedures were solid, it could still have just occurred by chance. And that's part of what statistical inference is about: we are looking probabilistically at the likelihood of these things happening. So we always have that possibility, that it had occurred by chance. But more so are the possibilities that we did little things in our initial analysis that made it more likely to get a result that looked good for us. When we were deciding which participants to exclude, we might have in our data analysis have seen: Well, when we exclude these participants, the effect is a little stronger. And so we might have felt: Well, that's probably a little more justified to exclude those participants. When we were deciding about covariates, we may have used some degrees of freedom of using some covariates, not using other covariates, and then found more compelling those analyses where the results looked better for us. Russ: What do you mean by covariates? Guest: Once we had had that particular analysis strategy, those covariates, these other things, then when we replicate, we presumably would have to use all of those same things again, and so we are much more constrained in our reanalysis of our new data set. And so we don't have the same opportunity to take advantage of chance. Because when we can make all of these different decisions of what should I exclude and what other things should I do, we are leveraging chance. We are taking our opportunities to get results that look good for us. Russ: That's if you are aware of it, at least. When you say covariates, do you mean other variables that might affect the results? Guest: Yeah. So, whenever someone does an analysis of a data set, it isn't always perfectly clear how the analysis should be done. We start with a data set, and then we have to make some decisions. Sometimes a lot of decisions, about what's the appropriate way to analyze these data. And so the researcher has many opportunities to do things that could increase the chances of getting a result that looks good for the researcher. That helps the researcher. And even though my colleagues and I were very genuinely trying to analyze this data reasonably and accurately and everything else, it's quite possible that without our awareness we were influenced by what was coming up as we were doing our data analysis. And sort of pushing the data in the direction of finding something that helps us.
8:59	Russ: So, we are ruling out here, and we are not going to be talking about fraud. Obviously you could have changed some of the scores when you realized that the results--if you'd done it the first time and it hadn't come out, as we'll be talking about--very few journals want to publish an article that finds no relationship between two things that might not be related. And so, you'd realize: We were wasting our time here. And so you could have fudged the data as they say, and fraudulently entered into your statistical spreadsheet or software package different findings than were actually done. That's fraud. That's conscious-- Guest: Right. If I was willing to commit fraud then I'm quite sure that this paper would be published by now. And our careers would be skyrocketing. Russ: At least in the short run. Guest: Hope that it's enough with our principles and values that we are trying to do a good job, genuinely trying to do good science. But that doesn't mean that we are not vulnerable to some of the reasoning and rationalizations that leverage the incentives for what it means to be successful in science. Russ: But as you point out in your paper, fraud is risky. You can get caught. And so even if you are not a good human being, you still might not want to do fraud. But even good human beings have to struggle with the incentives that we are going to talk about. I'm going to actually read a quote from the beginning of the paper. You say: The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest. The conflict may increase the likelihood of design, analysis, and Incentives for reporting decisions that inflate the proportion of false results in the published literature. And then one more, which is my favorite. You say: ...publishing is also the basis of a conflict of interest between personal interests and the objective of knowledge accumulation. The reason? Published and true are not synonyms. To the extent that publishing itself is rewarded, then it is in scientists' personal interests to publish, regardless of whether the published findings are true.... And my favorite line in there is the sentence, "Published and true are not synonyms." I think that would make a good t-shirt for EconTalk; and many of our listeners would sympathize with that statement. But for those who are not normal listeners to EconTalk, or who are skeptical, this is a slightly depressing idea. I think a lot of people have this image of scientists and professors, researchers, as truth seekers. And you are suggesting here that truth seeking can be derailed by the personal incentives that the researcher faces. Guest: Yeah, that's right. And the additional challenge is that what's ultimately truth isn't determined by any single contribution. Because we are dealing with probabalistic inference, we are trying to accumulate evidence for a particular claim, we will have lots of inferences of claims that don't hold up after repetition. And that is entirely ordinary. That's how it works. People can find it disconcerting to one year find out that eating this kind of food will extend your life and then two years later find out that eating that food will actually shorten your life. And think: Oh, my gosh, science is broken; they can't make up their minds which one it is. But really that's just reflecting what happens in science. We find some evidence here; we find some different evidence there; and then we converge toward the accurate solutions. The added challenge to that is what we are talking about in terms of the incentives in these papers. Which is: certain kinds of results are valued more than others. And because of that, the day-to-day decisions that I make in my laboratory are going to be influenced subtly, without my intention, in order to help me have the best career outcome I can. And that can also get in the way of getting to truth faster.
13:19	Russ: And we have to--for those of us who have spent time in the kitchen--that would be me and you--in the kitchen of statistical analysis, and we know what kind of things get put back on the plate; and others get pushed under the rug: I want to disagree with one thing you said a minute ago about science advances through these findings that turn out to be true or not true, and then they get confirmed or not confirmed. I think a lot of what gets reported in the newspaper in psychology results, in economic results, and in epidemiology--and I want to use epidemiology because it is not your field or mine--which is the one you just alluded to, where you find out that caffeine is bad for you, then it's good for you; fat is bad for you, now it's good for you. I really don't see those as science. Because we don't really understand the biology of those relationships. We really don't understand the chemical makeup of caffeine and the human body's functioning to understand that relationship. So, it's a fishing expedition, often, in the data. And the fact that sometimes it's true and sometimes it's not true, all of these psychological confirmation biases come into play. People get published by finding that something kills you or doesn't. And I just don't put much stock in a lot of it. Because I know, as you said, so many decisions had to be made in how the data were analyzed. Forget the science. There's just not enough science there. Guest: Yeah. It is a real challenge because of the underlying complexities. So, both of the things can be true. Caffeine can be good for you and bad for you. Probably both are true, given certain constraints. And part of the scientific process is identifying what those constraints are. So, we have two different challenges. One is that we can get contradictory findings that are both true. We just don't know why they are both true. There are these moderating or mediating variables underlying it that we haven't yet identified for when they will occur this way versus when they will occur that way. And that's a theoretical challenge, right? It's a challenge of explanation for identifying the circumstances for one or the other. The other challenge that's confronted by these different things is what you are referring to I think in the factors that elicit the results in the first place. Which is: I just need to get results that I can publish. And those aren't about figuring out what the underlying explanations are. Those are identifying what's the appropriate way to pursue the analysis, and are we actually pushing the analysis in one direction through the everyday decisions of the scientist rather than the underlying phenomena that we are trying to figure out. Russ: And you list, I think it's 10, 9, I'm sure there's more, but you list 9 ways that people can practice, as you call them, that are justifiable sometimes, but they also run the risk of increasing the proportion of published false results. And again, this is where you've got the data, but the data don't speak. You've got to torture them, or at last whack them, or give them a bang, a hit. And here are 9 things. You have many citations in the paper that discuss each of these: 1. leverage chance by running many low-powered studies rather than a few high-powered ones; 2. uncritically dismiss failed studies as pilots or due to methodological flaws but uncritically accept successful studies as methodologically sound; 3. selectively report studies with positive results and not studies with negative results every time; 4. stop data collection as soon as a reliable effect is obtained; 5. continue data collection until a reliable effect is obtained; 6. include multiple independent or dependent variables and report the subset that worked; 7. Maintain flexibility in design and analytical models, including the attempt of a variety of data exclusion or transformation methods for to subset; 8. Report a discovery as if it had been the result of an exploratory test and 9. once a reliable test is obtained, do not do a direct replication; shame on them. So these are--you know, we had Ed Leamer on here, econometrician, who has made the same critique of economic findings, typically that classical statistical tests of significance don't hold when you are constantly doing data drudging and looking around and trying 97 different specifications. Guest: Right. Very rapidly become irrelevant, because you are just, you are adding up all of these different chances that you have to find an effect, and then still using the same criterion: p less than .05 just on the one final one you report.
18:14	Russ: So, what do we do about this? One obvious question is: What's wrong with these journals that publish these unreliable results? Why do they publish them? Shouldn't they reject them? Guest: Yeah. Well, it's a tough problem. And part of the reason it's a tough problem is that journals have their own sets of incentives that encourage the publication, that foster this situation. And the challenge is the journals face is that they want prestige. They want attention. They want to be at the forefront of innovation. And innovation is really what drives science. Right? The exciting part of science is pushing at the boundaries of knowledge. Of seeking out things that are challenging to our current conceptions of how things work. And whenever we are pushing at the boundaries of knowledge, whenever we are pursuing innovation, we are by definition pursuing risk. There are risks that we are going to be wrong a lot when we are trying to find out stuff that's new. But at the same time, that's what we are trying to do. That's what science really is in the business of doing. So, there is a discovery or innovation component. And then the other side of it, that would be the fix, is kind of boring. And that is confirmation or verification. Of taking an idea that someone has claimed and repeating it to see if it holds up. Trying it in a slightly different way, that it should still hold, and it should happen there, too. And that does not have the same excitement value, even though it's just as important. Right? Finding out a new idea versus figuring out whether it's true--in sort of the abstract I would say both of those things are pretty important. But we are strongly tilted in scientific discipline to value the first one, the innovation part, at the expense of the verification part. It would be nonsensical to completely reverse it, to only do verification. Then we wouldn't actually do anything new. But the challenge is in trying to rebalance it enough so that those things that are having some importance, that are getting into the journals, because of their primary incentive, to have some mechanisms to encourage verification of those more interesting, more challenging, more provocative findings. Russ: So, you raised the question in the paper of self-correction--isn't science self-correcting? And you say: No. Why isn't it? Why isn't it that eventually we'll find out which results are solid and which results are not? Guest: I would agree with the notion that eventually we'll find out. But I think when we use the common trope of "science is self-correcting," when we talk about eventually we mean a very, very long time. And that's not okay with me. I think we waste a lot of resources and time and energy believing things that we could get out of the literature much more quickly. Or at least clarify much more quickly the conditions under which they are true. And so I think the self-correction can happen a lot faster on important stuff if we just make a few tweaks to the incentives; try to make it so that some degree of verification is part of the ordinary practice of science. And can get into the journals more easily. Russ: So, one suggestion is to have a journal of replication, right? A journal where results get confirmed. And you are not very optimistic about that solution. Guest: No. And it's been tried many times. Because this is not a new problem. We've known about the challenges of replication and null-results for a long time. It's been part of the research methodology for the last 40 years. So, I'm certainly not saying anything new. And many have tried--by introducing a journal of null results or a journal of replications. And they struggle to succeed, primarily because they are journals that are defined by the fact that they are publishing things that no other journals would publish. Which means: We are a crummy journal. Russ: Fascinating. Guest: And that's not a strong incentive for any individual scientist to bother writing up a result for that journal. So I don't think it makes sense to define a journal based on its publishing things that others won't. Instead I think the solutions need to integrate better into the existing publication structures so that the individual scientists have a strong incentive to publish the results. Russ: So, before we talk about some of the ways you think can make it better, how big a problem do you think this is in psychology? Because I think it's an enormous problem in economics, to the point where I've argued that very, very few, if any economists are convinced by multivariate regression results, statistical analyses of complex phenomena. We can't settle how many jobs are created by the stimulus; we can't settle what the multiplier is; we can't even agree on whether the minimum wage reduces unemployment. And when you argue about that, you produce a statistical study and the other side has got their statistical study, too. And you are stuck saying: Mine's better than yours. And that's bizarre in economics. What's the problem in psychology, do you think? Guest: Well, I think it's a big problem there, too. But the surprising fact is that we don't really know. There isn't much empirical evidence in economics or in psychology or any other discipline, for that matter, that really gives a good estimate of how reproducible the science is. And that is, to me, the most surprising gap of all of this: that we have lots of worries, lots of reasons to have worries--the list that you generated from the paper before--all of these are good reasons to be concerned that we are biasing our research literature. But we just really have no clue how biased the research literature is. And so one of the projects that I'm involved with is called the Reproducibility Project. And it is a research effort to try to estimate the reproducibility of psychological science. So, in that project we have a sample of journals from 2008 in psychology, three different important journals in psychology, and a team of researchers, which right now numbers 72 different researchers from 41 different institutions. Each are working in small groups or teams to replicate one or more of the studies from those journals. And so from that we'll get at least one initial estimate of how reproducible the science is. And hopefully that will spur many other investigations like that, so we can understand better whether there is a big problem, as you and I both intuit; and the extent of it. Because it could be the case that really a lot of this isn't something to be worried about: that the peer review system just works so beautifully that it manages to screen out all of these problems before they get in the literature. Not particularly plausible. But it's possible. Russ: I stifled a wild whoop of laughter.
25:45	Russ: Let me mention a recent issue that came up in psychology that we talked about on this program with Ed Yong, a science journalist. And I raise it because your idea of going back to 2008, although grand, has some challenges and I'm curious how you are going to handle them. So, in this study--I don't remember the name of the researcher; maybe you will--they wanted to find out whether if you used words related to old age in the experiment people would leave the experiment more slowly, kind of shuffling out. And they found out that that was indeed what happened. The experiment wasn't about old age; it was about something else; but in the course of it they subtly injected words about senior citizens or old age or something--I don't know exactly how it was done. And they found that people left the room more slowly. For me, those kind of results don't even pass the sniff test to start with. I have all the skepticism we've been talking about. But it was a big deal--this guy's a well-known, established researcher in psychology; I think it's the most cited paper in psychology in the last x years. Except when people tried to replicate it, they found it didn't replicate. Now, the original scholar said: Well, you didn't do it right. So, the question is: it's really different from combining hydrogen and oxygen and seeing if you get water. Maybe. Is it? How do you "replicate" these kind of psychological studies? Guest: Yeah. It's a great point. And that paper you are referring to, John Bargh, is the primary author of that; and then some of his collaborators were involved in the original result. And that is a very important paper for my subdiscipline in psychology of implicit social cognition and automaticity. So that is a paper, and related ones, that have been a very important basis of my substantive research interests. And it is a good example of raising some of the important issues for reproducibility that are different than false positives. So, one possibility is that original result is a false positive. Although we all note that in the original paper there is a replication. So, they got it twice in the original paper. Now in subsequent research, others have had trouble getting the result. And that could mean that those original ones are false positives, even though they did it twice. It could also mean that there's really a lot of subtlety in how one conducts the procedure, or other conditions of that particular setting, how the materials are delivered, how the timing is done, how people interact with the participants, that are very important for obtaining the result. And those are important things to consider in the context of reproducibility. Failing to replicate doesn't mean that the original result is false. It could mean that, but it doesn't unambiguously mean that. And so what one has to do in really systematically looking to build a cumulative science and to gain confidence in the reproducibility of science is also look at the other factors that may be influencing reproducibility. And one of them could be expertise and attention to the nuance and details that one doesn't necessarily know are relevant. Or wouldn't know just by reading the methods. And so one of the tasks of the reproducibility project is to try to identify predictors of reproducibility like these that are separate from the original results being false. And another thing that the project does is engage with the original researchers as much as possible before conducting the data collection to make sure that it is a fair test of the original design. And that is a good way to try to get rid of many of these sort of misunderstandings of what the original design even was. Russ: Correct. Guest: If the original researcher can look at it and say: Gee, you forgot this, or oh my God, you don't have x in there; it's 11 in the methods section. Not everything gets into the methods section. How they map back and forth can be very useful. Russ: When the guy leaves the room and he's going really quickly and he's gotten the old senior citizen stuff you have to know to grab him by the sleeve to slow him down. You might not know that. Guest: Yeah, right. There could be other factors like that, although it would be a different set of questions. Russ: We hope not. But I've suggested--and this shows the challenge of this in the social sciences--you just said something very interesting. I forget how you said it: The methodology, it's complicated; there's a lot of factors involved in how you actually implement the experiment. What's the right lingo for that? Guest: Yeah; there might be lots of nuance in the particular implementation of the procedures. Russ: But in economics we are usually not working with human beings. We are working just with the data that someone else has collected from a government survey. And I've suggested that you have to videotape what you do as you analyze the data; just do an ongoing screencast of all the different regressions you ran and all the different statistical analyses that you did so that people could go back and see how many times: That result doesn't count because that variable--I need to change it a little bit. Or: that's an outlier. It's one thing to say: We threw out some outliers; it didn't change the results--which is the way it's often phrased. But of course no one wants to watch a 73 hour video of your research experience. Guest: Yeah. And so documenting the work flow is of real value and even if no one else watches it, the fact that someone else could watch it could be something that changes people's behavior. Russ: That's true. Guest: Realizing that someone could find out that I've been doing this for 12 hours in order to get that single result. But the other thing that it raises, because there's many occasions in psychology, just like you are describing, too. It's not the nuance, the setting, and the procedures themselves. It's really at the analysis phase where a lot of this comes up. There are a couple of things that can address it. One is: if you have a strong confirmatory stance--you have a strong theory with a strong expectation, is you lay out your analysis plan before actually having any access or looking at the data itself. Russ: Right. Guest: So write the whole analysis up and then register it. We have this website, Open Science Framework, where people can register their hypotheses, their analysis scripts, before they conduct their analysis. And in strong confirmatory cases that can be an appropriate thing to do. There are many research applications that are exploratory, where you don't actually have a strong confirmatory test to do; so it wouldn't make sense to register it. But if there is a strong one, then registering in advance reduces your degrees of freedom dramatically. And then another one which is an interesting variation is where two camps or people that have different perspectives; have them work through an analysis script together that they can agree on. If they can agree on. And if they can't agree on, then it is very interesting to go through the process to identify what control variables, what data exclusion variables. What things are they differing on. Because that's really where the meat of the disagreement is. It may not actually be in the outcomes. It may be in how you get to the outcomes. And then that's really where the substance of the debate needs to be. Russ: Yeah. We should be doing that in economics for about 47 different issues. I think that would be helpful. I think it's less so in psychology.
33:30	Russ: One last question on this issue of self-correction. It's one thing when a result hasn't been established. Let's take your example. You posit that maybe there's this physiological relationship that relates to the ideological view, and you don't find anything. Okay. So, you have a bias toward finding something and eventually maybe you convince yourself that that second study that you did--ehhh, it was a Wednesday; it was raining; it wasn't reliable; the light was bad on the computer screens of America. So you ignore that and you publish it; and Brian, Mr. Spies, and Mr. Motyl become very famous for this finding. But then doesn't somebody get fame and glory for shooting it down? Can't you publish that piece, after that piece is out there with this big claim, can't you come along and say: I've refuted it? Isn't that part of the incentive system? Guest: There is some of that, but it's really there for things that get really famous. It's not there for stuff that's just kind of influential. The walking study being a good example of that. The folks that have evidence that doesn't confirm the original result are able to publish and get some attention for that because it's such an important result in the field. But for the many other results in the priming literature--and the walking example is just one of hundreds of studies that show effects of priming some concepts on subsequent behavior--it's not worth people's time to go back and confirm that, given the existing incentive structures. Because they won't get the same attention. It's like: That's not the walking study; that's these other things; who cares about those? We already know that priming is true, so why do you need to confirm that particular result? So it depends on where the result is on the continuum. If it's not impacting anybody, it's definitely not worth replicating. And it probably actually isn't worth replicating in general either, because it's not influencing anyone. Russ: So, you are part of two projects. You've got the Replication Study, with the 72 folks, right? Trying to go back to 2008. Guest: Yep. Russ: Is that part of the Open Science Framework, or is the Open Science Framework something separate? Guest: The Open Science Framework is a more general infrastructure that the Reproducibility Project is just making use of. So, the Open Science Framework generally is a system for documenting, archiving, and sharing and collaborating research results. And so on that system--it's actually any of the sciences, but most of the users so far are social scientists--you can document your workflow. And it has collaborative tools to help you with sharing your materials internally in your lab or with your collaborators. And when you are ready to make those materials or data public, with a mouseclick they become public and available to others as well. It offers the opportunity to register your hypotheses or your materials in advance if you choose to do so, for those occasions where it's appropriate. So, it's intended to sort of help expose a lot of the workflow to address many of the problems we know are contributors to these issues. Like the file drawer effect: we do lots of studies in my lab that never get published, but are actually providing some information that might be useful: failures to replicate our own designs or others' designs. And others might be able to make use of those data. So, if we have a common system where everyone can post their materials, post their data, share that when they want to share it, then we'll be able to have, I think, a more rapid accumulation of information that will help to address some of these problems. Russ: What else is going on in your field to try to fight these issues? Guest: Another approach that we are developing is to try to shift the incentive structures within individual journals a little bit, for replication. And as we've already discussed, we can't replicate everything. And we don't want to do that. We'd spend all of our time on verification and no innovation. But there are things that are important enough that they should be replicated. We should get more confidence in the accuracy of the results. But we still don't have good mechanisms for publishing them. And so one possibility that will be for the Journal of Social Psychology--I'll be a Guest Editor in an upcoming issue of that, is to publish important results, replications of important results in social psychology. So, it will be a special issue and it will send out a call for people to do replications of results that are having a high impact on the field that aren't getting a lot of replication yet. And so that would be an incentive for the researcher to actually do a replication, knowing that they could get it published in this journal. It also has some incentives for the journal, which is a sort of mid-tier journal: if they publish results, things that are high impact, replications, then the replications will get relatively strong citations, too, because they are publishing results that are already known to be important results. So it makes it for the journal something that is of value. So this is a little nudge to make replication something that is more normative and to reward those for places that are important to do it. Without overwhelming the system and focus everybody on replication all of the time. Russ: What's the Psych File Drawer Project? Guest: Psych File Drawer was started by Hal Pashler and his colleagues at U.C. San Diego, and they take on a complementary approach with the Open Science Framework for trying to expose the file drawer. And at Psych File Drawer, if you have results that are sitting in your file drawer and you don't seem to have the motivation, time, or energy to try to write them up and push them through the normal publication process, you can write up a very short summary, takes 15 minutes, to post to Psych File Drawer what you found in this replication project. And it is just a repository for you to be able to share results in a relatively rapid way, relatively low cost to you, about things that might be of interest to others. And they also have some social interaction functions where if you have found a result or are interested in a particular problem and are wondering if other people might be interested in that particular finding or result, too, you can register your name there, and if you get connected with other people, you can share information. So, it's a promising approach. Russ: So, it's really an online Journal of Unexciting Results. But sometimes those are useful. Because you can find out whether somebody else has already looked at something and didn't find anything. Guest: That's right. And it's very lightweight, in the sense that you don't have to put in tons of hours to try to get this result into the atmosphere as it were. You just have to put a little bit of time in. You can share what you have found or not found with other people. Russ: And you don't have to spend a lot of time writing the actual article, where you shape and pretend that you've found something important.
41:14	Russ: So, I've mentioned Ed Leamer before; he's been a pioneer in trying to get economists to try to be self-aware of these problems. And I think by his own admission he hasn't been very successful. Somewhat successful but it hasn't started a revolution. What would you say the general reaction is in the psychology profession to this concern over published versus true? Guest: Well, I haven't spoken with everyone, but I have spoken with a lot of people and observed conversations in the field among our leadership and others. And there is a lot of interest and concern about the issues of replicability and false positives and other things. It's particularly true now because there are a lot of high profile cases, but it is issues that have been of concern for many years. A lot of them are not very new. And the issue that's faced is not just engagement with these, because scientists are in the field because they want to find out things that are true. They don't want to waste their time on results that aren't true or be finding things that are just going to disappear years later. So, if there is a problem they are motivated to understand it. The real challenge is translating that interest into actual actions. How is it that we address it? So, I think the main barrier that science is confronted is not whether people are interested and willing to try to do something. It's actually figuring out what to do. And doing it. And so, instead of hand-wringing, the focus of our laboratory and many of my collaborators is on actually building tools that can address the problem, rather than continuing to worry about what the problem is. And so the Open Science Framework is one of those efforts; having special issues devoted to replications is a small contribution on replication, is another. The Reproducibility Project, start to do some replications, is another. And then there are other kinds of efforts that could develop that really start to give people ways in which to shift their own personal incentives, but also to really start to find new solutions in the field. Another example of this is journals like PLOS One (Public Library of Research). PLOS One is a generalist journal; you can publish in any field of science there. It's an open access journal. But what really makes it distinct for the purposes of replication and file drawer issues is their review process. They have standard editors and reviewers like anything else. But the review process is explicitly not about the importance of the research. It isn't the journal saying: is this an exciting, innovative enough result for us to publish? The review process is solely about the soundness of the research: are the findings and interpretations justified based on the design; is the design well conducted; is the research well done? Without any consideration of importance. So from that regard, getting published in PLOS One can be done if the research is well done. Doesn't matter if it's a replication or not. Doesn't matter if it's a null result or not. And so, to the extent that those journals are successful--now note they are not defining themselves as replication or no-result journals; lots of novel research gets published in PLOS One. It's now the biggest journal in the world, by far. But it opened the door to publishing those things. And so, once there are outlets, people are doing some of this stuff, they'll be able to get it into the field. And just having the opportunity will start to shift these incentives. Russ: So, just to clarify things: PLOS One is spelled P-L-O-S space One, correct? Guest: Right. Public Library of Science is the publisher. It's a nonprofit, open access publisher. And PLOS One is this one that has this kind of review process. And Public Library of Science also has some very high-bar journals--PLOS Medicine, PLOS Genetics. They have very strict review standards and push for innovation just like standard journals do. Russ: And you mentioned the File Drawer problem--we've mentioned that a couple of times. Describe what that is for those folks not used to that term. What do you mean by that? Guest: The file drawer problem is also called the gray literature, and it's essentially just describing the fact that almost every research laboratory and researcher does more research than they publish. And so all of that stuff that is done but not published is sitting in the proverbial file drawer. Now there is no actual physical file drawer--it's a hard drive. But it's sitting there unpublished, and only that researcher, that lab, that small group knows about it. And because there are certain kinds of things that get published--positive results are more likely to get published than negative results, innovative results more likely to get published than confirmatory results or replication results--then the kind of things that end up in the file drawer are a different kind of thing. It's a biased representation in the published literature than what is in the unpublished literature. And so knowing what's in both of the literatures is very useful to understanding what's going on. Russ: Are you optimistic? Do you think you are going to make a dent? Guest: I'm always optimistic, and that's my undoing and my re-doing, I guess. If I were pessimistic I probably wouldn't have an impact and probably wouldn't bother trying. Like any innovation, there's high probability of failure for any of the things we are doing because they are challenging to how it is that the field works now. But if they work, they could make a real difference. And so we're excited to try and see what we can learn about reproducibility in general and whether we can change our own actions to try to realign more closely to what our scientific values are: openness, transparency, sharing. If we can do that then we'll have made some progress even if it's only on ourselves.
47:41	Russ: Now the study going back to 2008 trying to replicate studies in three different journals, when that work is completed, how will it be made public? Guest: It's already public, so anyone can track how the project is going and look at the results and look the reports and the designs. All of it is posted on the Open Science Framework, which is just openscienceframework.org. And we will, once the data collections are done, write a summary report and submit that, presumably, to a traditional journal as well as making it available online with the rest of the materials. Russ: There was a recent attempt to do this in medicine. I'm not sure how reliable it was, but they took something like the 50-something most important cancer studies. And they were only able to replicate, I think a handful. One of the authors allegedly confessed that yes, he had run it six times and it had only worked once. They thought that was the interesting result, so that's the one they published. That's obviously a huge problem that we've been talking about. But it will be interesting to see how it turns out in psychology. Guest: Yes. Russ: What's your guess? Based on what's out there so far? Guest: Well, it's hard to guess accurately because I see many reasons to guess one way or the other in different circumstances. The two that have been done in medicine were both done by industrial laboratories, Bayer and Amgen, replicating results as a first step towards translating the results into clinical applications--therapies, drugs, pharmaceuticals, whatever. And so they have, the industrial laboratories have, a high level of replication, because they don't want to waste their money developing a new drug that has no effect or new impact. So, they want to know right up front: Is the research coming out of the academic laboratory sound for us to be able to use to make money? And as you said, their results were very dispiriting. One of the studies, it was 11% of the results replicated; another it was about 25%. And those are stunningly low. I hope that we do better than that. And I think there are a few reasons that we will, in psychology. But at the same time, if it was that low, well, geeze, all of these fields, biology, chemistry, economics, psychology--we are all confronting many of the same incentives. Individual scientists have to get published in order to advance their careers. And so we may have a system that's skewed more than we'd like it to be skewed. But I'd much rather know that it's skewed so that we can do something about it rather than keep my head in the sand and just hope it isn't.
50:36	Russ: So, let me ask you a very depressing question. One argument would be that none of this really matters very much. For a lot of us--not all of us--what we spend our time doing, experimental and statistical analysis about, are things that only a few people care about, the 73 people who read the journal anyway. So, nobody really cares. Whether priming exists, whether using words like old and senior and all that, isn't really that practical except for people in the field. There are a lot of things in economics that are like that; but there are a lot of things that aren't. A lot of thins where billions of dollars are spent on whether Keynesianism is true, and if Keynesianism is true the money is well-spent; otherwise it's wasted. So there, there's a big return for finding out whether things are true or not. What do you feel about that in psychology? What's important? Forget what's influential, because it's not the same thing, always. Guest: Right. It's an important point because a lot of times basic science, where there isn't a clear direct application, the value of it is not known in advance. So, you don't know if this is something that is going to turn into something that has large implications for human health or human prosperity or anything else. But my general response is: If it's false, then there's no way it's going to have an impact. So, really we want it to be true; and then the importance of it can be determined later. That's the promise of basic science: you don't really know, need to know, what impact it's going to have. Just the fact that you now know it is an opportunity to build more knowledge on top of it, that starts to address some of the problems that we really care about. But if it's false then there's no point in it at all. Russ: But isn't it worse than that? If it's false, it's dangerous. Guest: Right. Russ: If it's false, you are going to stop drinking coffee, or start drinking coffee, or whatever is the analogy. You are going to raise your kids a certain way because you believe some psychological study. I have a guest coming up soon who talks about the attempts to understand parenting outcomes related to children's success, and I think we know very little about that. He's more confident about it than I am, but that is important, right? If you are implementing a false theory of child-raising and you warp your kids as a result, that would be very depressing. Guest: Yeah, right. Russ: Well, any further thoughts about academic life? One thing we haven't really talked about is the incentives, where they come from. They come from the fact that if you want to be a tenured professor you've got to really publish a lot of stuff. Maybe instead of fixing the journals we ought to be trying to fix the universities. Have you thought about that? Guest: It's a good idea. And it's a very challenging one, because universities have many of the same incentives. And so if a university says: Okay, well, publishing isn't really important for us--then they are really defining themselves as not being a research university. So they confront many of the same challenges for shifting incentives, because of the way that universities gain prestige, particularly as research intensive universities. And so it is--all of the different stake holders have a lot of these incentives as they are confronted with themselves and things to wrestle with. My preference is to start at the low end of the individual daily activities of scientists and try and figure out how we can realign their incentives so that their everyday practice ends up contributing to a cumulative knowledge base that we can be confident in. And I'd have to say that the main thing I take from this--we've had a conversation that many might call dispiriting about the state of science. I actually don't feel that way. Like I said, I'm congenitally optimistic. And I feel like this is a great time for science. Science is one of the only ways of knowing that is consistently and actively self-critical. And to the extent that it is working the way it should, it will look at its own practices, identify problems in those practices, and come up with new solutions to improve on them. So the fact that we have a lot of people enthusiastic about really looking critically at how we do things on a day-to-day basis and figure out a better way to do it, to me is very exciting. Not depressing at all.

Time

Podcast Episode Highlights

0:33

Intro. [Recording date: September 4, 2012.] Russ: Our topic today is the reliability of research findings in the social sciences, particularly psychology; though I'm sure we'll be talking about economics as well. And our discussion will be based on a recent paper you wrote with Jeffrey Spies, and Matt Motyl, "Scientific Utopia, Part II: Restructuring Incentives and Practices to Promote Truth over Publishability," prepared for a special issue of Perspectives on Psychological Science. I'd like you to start by telling the story you tell at the beginning of your paper on your research finding where you looked at the ability of political extremists on the Left and the Right to detect shades of gray. Literal shades of gray, using their physical vision. How did that experiment work? What were you trying to do? Guest: We were interested in a very popular area of research in psychology right now, which is in embodiment--the sense that many of our social concepts can actually have a physical basis in our everyday activity and our physiology. And so we have an interest in political ideology, and so we recruited participants from the political Left, Center, and Right and had them do a very simple task. The task is to look at a word that was printed in a shade of gray and then match the shade of gray on a slider bar, from very dark to very light. And when the person thought that it was the shade they had selected, matched the shade of the word, then they would Enter it. And what we calculated was the accuracy of the person's perception of the shade of gray. And what we found in our initial study was that political moderates were more accurate in estimating the shades of gray than people on the political Left or Right. And so we interpret it, as you introduced, that political extremists see the world, literally, as more black and white than moderates do. Not just figuratively. That this has a physiological basis in some way. Russ: An incredible finding. And you are going to be famous as a result. That's extraordinary. Guest: Yeah. We planned our career banquets and the awards that we were going to receive based on finding this amazing result. We were stunned by it. Russ: But something happened on the way to the gravy train. What was that? Guest: Yeah. Well, in all of social sciences a recent topic of conversation is: How reproducible is the science? Can we replicate results that we've found? And there are many reasons that results might not replicate so easily. And we are in a fortunate position in our laboratory that we have very easy access to data collection. We run websites where lots and lots of people come to visit and try out our studies and do different things. And so it's very easy for us to do a replication. And so we thought what we should do; and even though we got a clear result the first time, we'll just run it again just to make sure that we get the same result again before we submit our results to a publication. And so we did. We ran the same study over again using a very slightly different sample, but otherwise very similar; using plenty of power and plenty of participants to detect the result of the effect's size. And it did not replicate. We didn't get the result again. Russ: You got no effect whatsoever. No difference in the ability of moderates compared to extremists. Guest: Right. A very ordinary result that would not change our careers one bit. Russ: And so you threw those results out. Because obviously those results weren't interesting or important, and you just published the ones you found in the first run-through. Correct? Guest: I wish we had done that. We didn't. Because we knew we had collected those results and our labmates knew we had collected those results. Russ: Darn! You shouldn't have told them. Guest: I know. It was a big, big mistake. It was a mixed bag. Our initial result could still be true. It could be the case that political extremists view the world in black and white to a greater extent than moderates do. But this null results provides some pause. And we don't have an obvious alternative explanation for why we didn't get it the second time. Because we did the same thing. It was on the same kind of infrastructure; we had the same procedure. So many things were in common between the two studies. Really, the only difference was that we had run it again with a new sample of participants. But we ran plenty of participants. So, at minimum it provides some caution in taking the first result as truth. That perhaps that was a false positive. Perhaps it occurred by chance. And so, it might be true. But we don't know it's true. And so it's going to be much harder to publish now with both of these together. Because a reviewer would reasonably say: Well, hang on a second now; maybe it's not true. Try to replicate it again. Russ: So, you said at a minimum, it would have called it into question. At a maximum, what would you conclude? I don't know if maximum is the right word; but the other extreme would be that that first result, something was just wrong. It wasn't accurate, it wasn't reliable, it wasn't true. Guest: Right. Right. And even if we'd done everything right--even if our analysis, our data collection was good and sound, our procedures were solid, it could still have just occurred by chance. And that's part of what statistical inference is about: we are looking probabilistically at the likelihood of these things happening. So we always have that possibility, that it had occurred by chance. But more so are the possibilities that we did little things in our initial analysis that made it more likely to get a result that looked good for us. When we were deciding which participants to exclude, we might have in our data analysis have seen: Well, when we exclude these participants, the effect is a little stronger. And so we might have felt: Well, that's probably a little more justified to exclude those participants. When we were deciding about covariates, we may have used some degrees of freedom of using some covariates, not using other covariates, and then found more compelling those analyses where the results looked better for us. Russ: What do you mean by covariates? Guest: Once we had had that particular analysis strategy, those covariates, these other things, then when we replicate, we presumably would have to use all of those same things again, and so we are much more constrained in our reanalysis of our new data set. And so we don't have the same opportunity to take advantage of chance. Because when we can make all of these different decisions of what should I exclude and what other things should I do, we are leveraging chance. We are taking our opportunities to get results that look good for us. Russ: That's if you are aware of it, at least. When you say covariates, do you mean other variables that might affect the results? Guest: Yeah. So, whenever someone does an analysis of a data set, it isn't always perfectly clear how the analysis should be done. We start with a data set, and then we have to make some decisions. Sometimes a lot of decisions, about what's the appropriate way to analyze these data. And so the researcher has many opportunities to do things that could increase the chances of getting a result that looks good for the researcher. That helps the researcher. And even though my colleagues and I were very genuinely trying to analyze this data reasonably and accurately and everything else, it's quite possible that without our awareness we were influenced by what was coming up as we were doing our data analysis. And sort of pushing the data in the direction of finding something that helps us.

8:59

Russ: So, we are ruling out here, and we are not going to be talking about fraud. Obviously you could have changed some of the scores when you realized that the results--if you'd done it the first time and it hadn't come out, as we'll be talking about--very few journals want to publish an article that finds no relationship between two things that might not be related. And so, you'd realize: We were wasting our time here. And so you could have fudged the data as they say, and fraudulently entered into your statistical spreadsheet or software package different findings than were actually done. That's fraud. That's conscious-- Guest: Right. If I was willing to commit fraud then I'm quite sure that this paper would be published by now. And our careers would be skyrocketing. Russ: At least in the short run. Guest: Hope that it's enough with our principles and values that we are trying to do a good job, genuinely trying to do good science. But that doesn't mean that we are not vulnerable to some of the reasoning and rationalizations that leverage the incentives for what it means to be successful in science. Russ: But as you point out in your paper, fraud is risky. You can get caught. And so even if you are not a good human being, you still might not want to do fraud. But even good human beings have to struggle with the incentives that we are going to talk about. I'm going to actually read a quote from the beginning of the paper. You say:

The real problem is that the incentives for publishable results can be at odds with the incentives for accurate results. This produces a conflict of interest. The conflict may increase the likelihood of design, analysis, and Incentives for reporting decisions that inflate the proportion of false results in the published literature.

And then one more, which is my favorite. You say:

...publishing is also the basis of a conflict of interest between personal interests and the objective of knowledge accumulation. The reason? Published and true are not synonyms. To the extent that publishing itself is rewarded, then it is in scientists' personal interests to publish, regardless of whether the published findings are true....

And my favorite line in there is the sentence, "Published and true are not synonyms." I think that would make a good t-shirt for EconTalk; and many of our listeners would sympathize with that statement. But for those who are not normal listeners to EconTalk, or who are skeptical, this is a slightly depressing idea. I think a lot of people have this image of scientists and professors, researchers, as truth seekers. And you are suggesting here that truth seeking can be derailed by the personal incentives that the researcher faces. Guest: Yeah, that's right. And the additional challenge is that what's ultimately truth isn't determined by any single contribution. Because we are dealing with probabalistic inference, we are trying to accumulate evidence for a particular claim, we will have lots of inferences of claims that don't hold up after repetition. And that is entirely ordinary. That's how it works. People can find it disconcerting to one year find out that eating this kind of food will extend your life and then two years later find out that eating that food will actually shorten your life. And think: Oh, my gosh, science is broken; they can't make up their minds which one it is. But really that's just reflecting what happens in science. We find some evidence here; we find some different evidence there; and then we converge toward the accurate solutions. The added challenge to that is what we are talking about in terms of the incentives in these papers. Which is: certain kinds of results are valued more than others. And because of that, the day-to-day decisions that I make in my laboratory are going to be influenced subtly, without my intention, in order to help me have the best career outcome I can. And that can also get in the way of getting to truth faster.