06 February 2018

The latest Cornell Food and Brand Lab correction: Some inconsistencies and strange data patterns

[Update 2018-05-12 20:40 UTC: The study discussed below has now been retracted.]

The Cornell Food and Brand Lab has a new correction. Tim van der Zee already tweeted a bit about it.

"Extremely odd that it isn't a retraction"? Let's take a closer look.

Here is the article that was corrected:
Wansink, B., Just, D. R., Payne, C. R., & Klinger, M. Z. (2012). Attractive names sustain increased vegetable intake in schools. Preventive Medicine, 55, 330–332. http://dx.doi.org/10.1016/j.ypmed.2012.07.012

This is the second article from this lab in which data were reported as having been collected from elementary school children aged 8–11, but it turned out that they were in fact collected from children aged 3–5 in daycares.  You can read the lab's explanation for this error at the link to the correction above (there's no paywall at present), and decide how convincing you find it.

Just as a reminder, the first article, published in JAMA Pediatrics, was initially corrected (via JAMA's "Retract and replace" mechanism) in September 2017. Then, after it emerged that the children were in fact in daycare, and that there were a number of other problems in the dataset that I blogged about, the article was definitively retracted in October 2017.

I'm going to concentrate on Study 1 of the recently-corrected article here, because the corrected errors in this study are more egregious than those in Study 2, and also because there are still some very substantial problems remaining.  If you have access to SPSS, I also encourage you to download the dataset for Study 1, along with the replication syntax and annotated output file, from here.

By the way, in what follows, you will see a lot of discussion about the amount of "carrots" eaten.  There has been some discussion about this, because the original article just discussed "carrots" with no qualification. The corrected article tells us that the carrots were "matchstick carrots", which are about 1/4 the size of a baby carrot. Presumably there is a U.S. Standard Baby Carrot kept in a science museum somewhere for calibration purposes.

So, what are the differences between the original article and the correction? Well, there are quite a few. For one thing, the numbers in Table 1 now finally make sense, in that the number of carrots considered to have been "eaten" is now equal to the number of carrots "taken" (i.e., served to the children) minus the number of carrots "uneaten" (i.e., counted when their plates came back after lunch).  In the original article, these numbers did not add up; that is, "taken" minus "uneaten" did not equal "eaten".  This is important because, when asked by Alison McCook of Retraction Watch why this was the case, Dr. Brian Wansink (the head of the Cornell Food and Brand Lab) implied that it must have been due to some carrots being lost (e.g., dropped on the floor, or thrown in food fights). But this makes no sense for two reasons. First, in the original article, the number of carrots "eaten" was larger than the difference between "taken" and "uneaten", which would imply that, rather than being dropped on the floor or thrown, some extra carrots had appeared from somewhere.  Second, and more fundamentally, the definition of the number of carrots eaten is (the number taken) minus (the number left uneaten).  Whether the kids ate, threw, dropped, or made sculptures out of the carrots doesn't matter; any that didn't come back were classed as "eaten". There was no monitoring of each child's oesophagus to count the carrots slipping down.

When we look in the dataset, we can see that there are separate variables for "taken" (e.g., "@1CarTaken" for Monday, "@2CarTaken" for Tuesday, etc.), "uneaten" (e.g., "@1CarEnd", where "End" presumably corresponds to "left at the end"), and "eaten" (e.g., "@1CarEaten").  In almost all cases, the formula ("eaten" equals "taken" minus "uneaten") holds, except for a few missing values and two participants (#42 and #152) whose numbers for Monday seem to have been entered in the wrong order; for both of these participants, "eaten" equals "taken" plus "uneaten". That's slightly concerning because it suggests that, instead of just entering "taken" and "uneaten" (the quantities that were capable of being measured) and letting their computer calculate "eaten", the researchers calculated "eaten" by hand and typed in all three numbers, doing so in the wrong order for these two participants in the process.
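
For anyone replicating this outside SPSS, here is a minimal sketch of the consistency check in Python with pandas. The filename is a placeholder, and the "@3" and "@4" prefixes for Thursday and Friday are my extrapolation from the naming pattern just described:

    import pandas as pd

    # Read the Study 1 SPSS file (pandas needs the pyreadstat package for this).
    df = pd.read_spss("study1.sav")  # placeholder filename

    # By definition, "eaten" should equal "taken" minus "uneaten" on every day.
    for day in ["@1", "@2", "@3", "@4"]:
        taken = df[day + "CarTaken"]
        end = df[day + "CarEnd"]
        eaten = df[day + "CarEaten"]
        complete = taken.notna() & end.notna() & eaten.notna()
        mismatch = complete & ((taken - end) != eaten)
        print(day, "violations of taken - uneaten == eaten:", df.index[mismatch].tolist())

    # Participants #42 and #152 should instead satisfy eaten == taken + uneaten,
    # consistent with the numbers having been typed in in the wrong order.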

Another major change is that whereas in the original article the study was run on three days, in the correction there are reports of data from four days.  In the original, Monday was a control day, the between-subjects manipulation of the carrot labels was done on Tuesday, and Thursday was a second control day, to see if the effect persisted. In the correction, Thursday is now a second experimental day, with a different experiment that carries over to Friday: the carrots were labelled on Thursday under one of two conditions ("X-ray Vision Carrots" or "Food of the Day"; there was no "no label" condition), and the dependent variable was the number of carrots eaten on the next day (Friday).

OK, so those are the differences between the two articles. But arguably the most interesting discoveries are in the dataset, so let's look at that next.

Randomisation #fail


As Tim van der Zee noted in the Twitter thread that I linked to at the top of this post, the number of participants in Study 1 in the corrected article has mysteriously increased since the original publication. Specifically, the number of children in the "Food of the Day" condition has gone from 38 to 48, an increase of 10, and the number of children in the "no label" condition has gone from 45 to 64, an increase of 19.  You might already be thinking that a randomisation process that leads to only 22.2% (32 of 144) of participants being in the experimental condition might not be an especially felicitous one, but as we will see shortly, that is by no means the largest problem here.  (The original article does not actually discuss randomisation, and the corrected version only mentions it in the context of the choice of two labels in the part of the experiment that was conducted on the Thursday, but I think it's reasonable to assume that children were meant to be randomised to one of the carrot labelling conditions on the Tuesday.)
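
As a back-of-the-envelope check (my calculation, not anything from the article): if each of the 144 children had independently had a one-in-three chance of being assigned to the experimental condition, ending up with 32 or fewer of them there would be roughly a 1-in-300 event.

    from scipy.stats import binom

    # Probability of 32 or fewer of 144 children landing in the experimental
    # condition, assuming independent assignment with probability 1/3.
    # Equal allocation is my assumption; neither version of the article
    # describes the randomisation procedure.
    print(binom.cdf(32, 144, 1/3))  # about 0.003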

The participants were split across seven daycare centres and/or school facilities (I'll just go with the authors' term "schools" from now on).  Here is the split of children per condition and per school:


Oh dear. It looks like the randomisation didn't so much fail here, as not take place at all, in almost all of the schools.

Only two schools (#1 and #4) had a non-zero number of children in each of the three conditions. Three schools had zero children in the experimental condition. Schools #3, #5, #6, and #7 only had children in one of the three conditions. The justification for the authors' model in the corrected version of the article ("a Generalized Estimated Equation model using a negative binominal distribution and log link method with the location variable as a repeated factor"), versus the simple ANOVA that they performed in the original, was to be able to take into account the possible effect of the school. But I'm not sure that any amount of correction for the effect of the school is going to help you when the data are as unbalanced as this.  It seems quite likely that the teachers or researchers in most of the schools were not following the protocol very carefully.
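
For what it's worth, the quoted model can be roughly reconstructed in Python with statsmodels. This is a sketch under assumptions: "eaten", "condition", and "school" are placeholder column names, and the exchangeable working correlation is my guess at what treating the location variable as "a repeated factor" amounts to. No choice of working correlation, however, can recover a within-school treatment contrast from schools that only ever saw one condition.

    import statsmodels.api as sm

    # Rough reconstruction of the correction's GEE: negative binomial family
    # (log link by default in statsmodels), with school as the cluster variable.
    model = sm.GEE.from_formula(
        "eaten ~ C(condition)",
        groups="school",          # school as the clustering/repeated factor
        data=df,                  # df as read in the first sketch above
        family=sm.families.NegativeBinomial(),
        cov_struct=sm.cov_struct.Exchangeable(),
    )
    print(model.fit().summary())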

At school #1, thou shalt eat carrots


Something very strange must have been happening in school #1.  Here is the table of the numbers of children taking each number of carrots in schools #2–#7 combined:

I think that's pretty much what one might expect.  About a quarter of the kids took no carrots at all, most of the rest took a few, and there were a couple of major carrot fans.  Now let's look at the distribution from school #1:


Whoa, that's very different. No child in school #1 had a lunch plate with zero carrots. In fact, all of the children took a minimum of 10 carrots, which is more than 44 (41.1%) of the 107 children in the other schools took.  Even more curiously, almost all of the children in school #1 apparently took an exact multiple of 10 carrots - either 10 or 20. And if we break these numbers down by condition, it gets even stranger:

So 17 out of 21 children in the control condition ("no label", which in the case of daycare children who are not expected to be able to read labels anyway presumably means "no teacher describing the carrots") in school #1 chose exactly 10 carrots. Meanwhile, every single child (12 out of 12) in the "Food of the Day" condition selected exactly 20 carrots.

I don't think it's necessary to run any statistical tests here to see that there is no way that this happened by chance. Maybe the teachers were trying extra hard to help the researchers get the numbers they wanted by encouraging the children to take more carrots than they otherwise would (remember, from schools #2-#7, we could expect a quarter of the kids to take zero carrots). But then, did they count out these matchstick carrots individually, 1, 2, 3, up to 10 or 20? Or did they serve one or two spoonfuls and think, screw it, I can't be bothered to count them, let's call it 10 per spoon?  Participants #59 (10 carrots), #64 (10), #70 (22), and #71 (10) have the comment "pre-served" recorded in their data for this day; does this mean that for these children (and perhaps others with no comment recorded), the teachers chose how many carrots to give them, thus making a mockery of the idea that the experiment was trying to determine how the labelling would affect the kids' choices?  (I presume it's just a coincidence that the number of kids with 20 carrots in the "Food of the Day" condition, and the number with 10 carrots in the "no label" condition, are very similar to the number of extra kids in these respective conditions between the original and corrected versions of the article.)
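
The oddity is easy to reproduce from the dataset without any statistics. Something like the following, with "school" and "condition" as placeholder column names and "@2CarTaken" assumed to be the Tuesday serving variable, generates the two distributions and the breakdown by condition discussed above:

    import pandas as pd  # df as read in the first sketch above

    # Carrots taken on Tuesday: school #1 versus the other six schools.
    school1 = df[df["school"] == 1]
    others = df[df["school"] != 1]

    print(others["@2CarTaken"].value_counts().sort_index())   # many zeros, smooth tail
    print(school1["@2CarTaken"].value_counts().sort_index())  # spikes at exactly 10 and 20

    # Break school #1 down by labelling condition (17 of 21 "no label"
    # children at exactly 10 carrots; 12 of 12 "Food of the Day" at 20).
    print(pd.crosstab(school1["condition"], school1["@2CarTaken"]))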

The tomatoes... and the USDA project report


Another interesting thing to emerge from an examination of the dataset is that not one but two foods, with and without "cool names", were tested during the study.  As well as "X-ray Vision Carrots", children were also offered tomatoes. On at least one day, these were described as "Tomato Blasts". The dataset contains variables for each day recording what appears to be the order in which each child was served with the tomatoes or carrots.  Yet, there are no variables recording how many tomatoes each child took, ate, or left uneaten on each day. This is interesting, because we know that these quantities were measured. How? Because it's described in this project report by the Cornell Food and Brand Lab on the USDA website:

"... once exposed to the x-ray vision carrots kids ate more of the carrots even when labeled food of the day. No such strong relationship was observed for tomatoes, which could mean that the label used (tomato blasts) might not be particularly meaningful for children in this age group."

This appears to mean that the authors tested two dependent variables, but only reported the one that gave a statistically significant result. Does that sound like readers of the Preventive Medicine article (either the original or the corrected version) are being provided with an accurate representation of the research record? What other variables might have been removed from the dataset?

It's also worth noting that the USDA project report that I linked to above states explicitly that both the carrots-and-tomatoes study and the "Elmo"/stickers-on-apples study (later retracted by JAMA Pediatrics) were conducted in daycare facilities, with children aged 3–5.  It appears that the Food and Brand Lab probably sent that report to the USDA in 2009. So how was it that by March 2012 (the date on this draft version of the original "carrots" article) everybody involved in writing "Attractive Names Sustain Increased Vegetable Intake in Schools" had apparently forgotten about it, and was happy to report that the participants were elementary school students?  And yet, when Dr. Wansink cited the JAMA Pediatrics article in 2013 and 2015, he referred to the participants as "daycare kids" and "daycare children", respectively; so his incorrect citation of his own work actually turns out to have been a correct statement of what had happened.  And in the original version of that same "Elmo" article, published in 2012, the authors referred to the children (who were meant to be aged 8–11) as "preliterate". So even if everyone had forgotten about the ages of the participants at a conscious level, this knowledge seems to have been floating around subliminally. This sounds like a very interesting case study for psychologists.

Another interesting thing about the March 2012 draft that I mentioned in the previous paragraph is that it describes data being collected on four days (i.e., the same number of days as in the corrected article), rather than the three days that were mentioned in the original published version of the article, which appeared just four months after the date of the draft:


Extract from the March 2012 draft manuscript, showing the description of the data collection period, with the PDF header information (from File/Properties) superposed.

So apparently at some point between drafting the original article and submitting it, one of the days was dropped, with the second control day being moved up from Friday to Thursday. Again, some people might feel that at least one version of this article might not be an accurate representation of the research record.

Miscellaneous stuff


Some other minor peculiarities in the dataset, for completeness:

- On Tuesday (the day of the experiment, after a "control" day), participants #194, #198, and #206 were recorded as commenting about "cool carrots"; it is unclear whether this was a reference to the name that was given to the carrots on Monday or on Tuesday.  But on Monday, a "control" day, the carrots should presumably have had no name, and on Tuesday they should have been described as "X-ray Vision Carrots".

- On Monday and Friday, all of the carrots should have been served with no label. But the dataset records that five participants (#199, #200, #203, #205, and #208) were in the "X-ray Vision Carrots" condition on Monday, and one participant (#12) was in the "Food of the Day" condition on Friday. Similarly, on Thursday, according to the correction, all of the carrots were labelled as "Food of the Day" or "X-ray Vision Carrots". But two of the cases (participants #6 and #70) have the value that corresponds to "no label" here.

These are, again, minor issues, but they shouldn't be happening. In fact there shouldn't even be a variable in the dataset for the labelling condition on Monday and Friday, because those were control-only days.
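
For completeness, here is the sort of check that surfaces these cases. "@1Label", "@4Label", and the "no label" code are entirely hypothetical stand-ins for however the labelling condition is actually recorded in the dataset:

    # Hypothetical variable names and codes; adjust to the real dataset.
    NO_LABEL = 0  # assumed numeric code for the "no label" condition

    for prefix, day in [("@1", "Monday"), ("@4", "Friday")]:
        labels = df[prefix + "Label"]
        bad = df[labels.notna() & (labels != NO_LABEL)]
        print(day, "cases with a non-control label:", bad.index.tolist())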

Conclusion


What can we take away from this story?  Well, the correction at least makes one thing clear: absolutely nothing about the report of Study 1 in the original published article makes any sense. If the correction is indeed correct, the original article got almost everything wrong: the ages and school status of the participants, the number of days on which the study was run, the number of participants, and the number of outcome measures. We have an explanation of sorts for the first of these problems, but not the others.  I find it very hard to imagine how the authors managed to get so much about Study 1 wrong the first time they wrote it up. The data for the four days and the different conditions are all clearly present in the dataset.  Getting the number of days wrong, and incorrectly describing the nature of the experiment that was run on Thursday, is not something that can be explained by a simple typo when copying the numbers from SPSS into a Word document (especially since, as I noted above, the draft version of the original article mentions four days of data collection).

In summary: I don't know what happened here, and I guess we may never know. What I am certain of is that the data in Study 1 of this article, corrected or not, cannot be the basis of any sort of scientific conclusion about whether changing the labels on vegetables makes children want to eat more of them.

I haven't addressed the corrections to Study 2 in the same article, although these would be fairly substantial on their own if they weren't overshadowed by the ongoing dumpster fire of Study 1.  It does seem, however, that the spin that is now being put on the story is that Study 1 was a nice but perhaps "slightly flawed" proof-of-concept, and that there is really nothing to see there and we should all look at Study 2 instead.  I'm afraid that I find this very unconvincing.  If the authors have real confidence in their results, I think they should retract the article and resubmit Study 2 for review on its own. It would be sad for Matthew Z. Klinger, the then high-school student who apparently did a lot of the grunt work for Study 2, to lose a publication like this, but if he is interested in pursuing an academic career, I think it would be a lot better for him not to have his name on the corrected article in its present form.

7 comments:

  1. Thanks for this. This is not good. It is crazy that there are this many problems in the data set and that it has been used for this correction. Have you made Prev Med or the authors aware of this post? Eric

    ReplyDelete
    Replies
    1. I only finished the post a few hours ago, around midnight. I plan to write to the editor of the journal in the next couple of days.

      Delete
  2. Thanks for this great posting. You wrote: "First, in the original article, the number of carrots "eaten" was larger than the difference between "taken" and "uneaten", which would imply that, rather than being dropped on the floor or thrown, some extra carrots had appeared from somewhere."

    Note that kids of all ages can easily break carrots, large or small, into one or more pieces, in particular during fights. Fights were reported by Brian Wansink.

    I am looking forward to comments from Brian Wansink et al. and from the editor of this journal.

    ReplyDelete
    Replies
    1. Ha! Good point. But it would need a lot of carrots to be split into two and remain on the plate (to be counted when the meal was cleared away) to compensate for the other fight-related losses. And of course, as my other point showed, the whole thing is meaningless because the number "eaten" was defined as the number "taken" minus the number that came back on the plate.

      Although, I suppose that if there was a fight with carrots and very sharp knives (in a daycare, ouch), you could end up with *more* carrots being returned than were taken. Then the title of the article could have been "Attractive names cause carrots to spontaneously reproduce".

      Delete
  3. Hi Nick
    I do not care, at the moment, about the retraction, BUT what bugs me mightily is the title of that paper!! What kind of BS is this? Correlation of names with veggie intake? WTF is wrong with that hypothesis?!? How can such drivel even be considered as being science?
    Cheers oliver

    ReplyDelete
    Replies
    1. FWIW, my biggest problem with the title is the word "sustains". OK, they show a small effect in the consumption on Friday after using the cool names on Thursday. But who knows what happened a week later? For changes in eating patterns to be "sustained", in my book, means that the kids are eating more carrots six months later, even if they aren't given a goofy name.

      Delete
  4. The contents of the retraction note at https://www.sciencedirect.com/science/article/pii/S0091743512003222 and the views of EiC Eduardo Franco at https://retractionwatch.com/2018/02/27/after-considerable-intellectual-agony-journal-retracts-wansink-paper/ are highly remarkable.

    For example https://retractionwatch.com/2012/11/30/poignancy-in-physics-retraction-for-fatal-error-that-couldnt-be-patched/ provides some insight into the way physicists can interact with each other when there are fatal errors in a paper. "Sometime after the correction ran, Pavičić heard from another student in China, about another error."

    So the very long correction note ends with: "We thank Yu-Bo Sheng (Tsinghua University and Beijing Normal University, Beijing, China) for bringing this error to our attention."

    The retraction note at https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.109.079902 is very short: "I hereby retract my paper [1] due to a fatal error I explained in [2]. All my attempts to patch the error have failed. I thank Shi-Lei Su, a student from Yan Bian University, Ji Lin Province, China for bringing the error to my attention."

    So student Shi-Lei Su is able to add to his CV that he detected a fatal error in a paper by professor Pavičić of Harvard University, and that the paper of Pavičić therefore needed to be retracted, and Mladen Pavičić will tell the whole world that Shi-Lei Su is a very clever student.

    So how come EiC Eduardo Franco does not refer to the contents of this blog, and how come Franco even states in this note that Brian Wansink et al. are given the opportunity to resubmit a revised version?

    ReplyDelete