Wednesday 22 August 2012

What's Wrong with What Works?

Sexual intercourse began
In nineteen sixty-three
(which was rather late for me) -
Between the end of the Chatterley ban
And the Beatles’ first LP.
(from Annus Mirabilis by Philip Larkin)

Evidence-based practice has become all but a cliché in educational discourse. Perhaps finally tiring of talking about ‘learnings’, ‘privileging’ and verbing any other noun they can get their hands on, educationists have decided to “sing from the same songsheet” of evidence-based practice. That’s got to be a good thing, right? Well, yes, it would be if they were singing the same tune and the same words. Unfortunately, evidence-based practice means different things to different people. This is why I personally prefer the term scientific evidence-based practice. But how are we to know what constitutes (scientific) evidence-based practice?
The Education Minister for New South Wales has recently (August 2012) launched the Centre for Education Statistics and Evaluation which “undertakes in depth analysis of education programs and outcomes across early childhood, school, training and higher education to inform whole-of-government, evidence based decision making.” (See http://tinyurl.com/d53f2y2 and http://tinyurl.com/c6uh3y4). Moreover, we are told,
“The Centre turns data into knowledge, providing information about the effectiveness of different programs and strategies in different contexts – put simply, it seeks to find out what works”.
Ah, ‘what works’, that rings a bell. It is too early to tell whether this new centre will deliver on its promises but what about the original ‘What Works Clearinghouse’ (WWC), the US based repository of reports on educational program efficacy that originally promised so much?
As Daniel Willingham has pointed out:
“The U.S. Department of Education has, in the past, tried to bring some scientific rigor to teaching. The What Works Clearinghouse, created in 2002 by the DOE's Institute of Education Sciences, evaluates classroom curricula, programs and materials, but its standards of evidence are overly stringent, and teachers play no role in the vetting process.” (See http://tinyurl.com/bn8mvdt)
My colleagues and I have also been critical of WWC. And not just for being too stringent. Far from being too rigorous, the WWC boffins frequently make, to us, egregious mistakes; mistakes that, far too often for comfort, seem to support a particular approach to teaching and learning.
I first became a little wary of WWC when I found that our own truly experimental study on the efficacy of Reading Recovery (RR) had been omitted from their analyses underlying their report on RR. Too bad, you might think, that’s just sour grapes. But, according to Google Scholar, the article has been cited 160 times since publication in 1995 and was described by eminent American reading researchers Shanahan and Barr as one of the “more sophisticated studies”. Interestingly enough, it is frequently cited by proponents of RR (we did find it to be effective) as well as by its critics (but effective only for one in three children who received it). So why was it not included by WWC? It was considered for inclusion but was rejected on the following grounds:
“Incomparable groups: this study was a quasi-experimental design that used achievement pre-tests but it did not establish that the comparison group was comparable to the treatment group prior to the start of the intervention.”
You can read the details of why this is just plain wrong, as well as other criticisms of WWC, in Carter and Wheldall (2008) (http://tinyurl.com/c6jcknl). Suffice to say that participants were randomly allocated to treatment groups and that we did establish that the control group (as well as the comparison group) was comparable to the (experimental) treatment group who received RR prior to the start of the intervention. This example also highlights another problem with WWC’s approach. Because they are supposedly so ‘rigorous’, they discard the vast majority of studies from the research literature on any given topic as not meeting their criteria for inclusion or ‘evidence standards’. In the case of RR, 78 studies of RR were considered and all but five were excluded from further consideration. Our many other criticisms of what we regard as a seriously flawed WWC evaluation report on RR are detailed in Reynolds, Wheldall, and Madelaine (2009) (http://tinyurl.com/cuj8sqm).
Advocates of Direct Instruction (DI) seem to have been particularly ill-served by the methodological ‘rigour’ of WWC. Not only are most of the more recent studies of the efficacy of DI programs excluded because they do not meet the WWC evidence standards, but WWC also impose a blanket ban on including any study (regardless of technical adequacy) published before 1985; an interesting if somewhat idiosyncratic approach to science. Philip Larkin told us that sex only began in 1963 but who would have thought that there was no educational research worth considering before 1985? (Insert your own favourite examples here of important scientific research in other areas that would fall foul of this criterion. Relativity anyone? Gravity?) Zig Engelmann, the godfather of DI, has written scathingly about the practices of WWC (http://tinyurl.com/c5pjm9d and http://tinyurl.com/85t2vpt), concluding:
“I consider WWC a very dangerous organization. It is not fulfilling its role of providing the field with honest information about what works, but rather seems bent on finding evidence for programs it would like to believe are effective (like Reading Recovery and Everyday Mathematics).”
Engelmann can be forgiven for having his doubts given that for the 2008 WWC evaluation report on the DI program Reading Mastery (RM) (http://tinyurl.com/d8kawf7), WWC could not find a single study that met their evidence standards out of the 61 studies they were able to retrieve. (Engelmann claims that there were over 90 such studies, mostly peer reviewed.)
The most recent WWC report on RM in 2012 (http://tinyurl.com/7bdobxv), specifically concerned with its efficacy for students with learning disabilities, determined that only two of the 17 studies it identified as relevant met evidence standards and concluded:
“Reading Mastery was found to have no discernible effects on reading comprehension and potentially negative effects on alphabetics, reading fluency, and writing for students with learning disabilities.”
In response to this judgement, the Institute for Direct Instruction pointed out, not unreasonably, that, of the two studies considered:
“One actually showed that students studying with RM had significantly greater gains than students in national and state norming populations. Because the gains were equal to students in Horizons (another DI program), the WWC concluded that RM had no effect. The other study involved giving an extra 45 minutes of phonics related instruction to students studying RM. The WWC interpreted the better results of the students with the extra time as indicating potentially negative effects of RM.” (http://tinyurl.com/9oewdlo)
In other words, when Reading Mastery was compared in each case with another very similar DI program, and the results for that program were no different from or slightly better than those for the standard Reading Mastery program, it was concluded that Reading Mastery was therefore ineffective for students with learning disabilities and possibly even detrimental to their progress. It is conclusions such as these that have led some experts in the field to wonder whether this is the result of incompetence or bias: cock up or conspiracy.
If we needed any further proof of the unreliability of WWC reports, we now have their August 2012 report on whether Open Court Reading© improves adolescent literacy (http://tinyurl.com/9nzv5wj). True to form, they discarded 57 out of 58 studies as not meeting evidence standards. On the basis of this one study they concluded that Open Court “was found to have potentially positive effects on comprehension for adolescent readers”. There are at least three problems with this conclusion. First, this is a bold claim based on the results of just one study, the large sample size and their ‘potentially positive’ caveat notwithstanding. Second, the effect size was trivial at 0.16, not even ‘small’, and well below WWC’s own usual threshold of 0.25. Third, and most important of all, this study was not even carried out with adolescents! The study sample comprised “more than 900 first-grade though fifth-grade who attended five schools across the United States”. As Private Eye magazine would have it, “shorely shome mishtake” …
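To put effect sizes of this order in perspective, here is a minimal sketch that converts a standardized effect size into the percentile-point gain expected for an average child. It assumes normally distributed scores; the function name `percentile_gain` is hypothetical, and this is an illustration rather than WWC's own procedure.

```python
from statistics import NormalDist


def percentile_gain(effect_size: float) -> float:
    """Percentile-point gain for a child starting at the 50th percentile.

    Assumes outcome scores are normally distributed, so an effect size d
    moves the average child from the 50th percentile to the percentile
    given by the standard normal CDF evaluated at d.
    """
    return (NormalDist().cdf(effect_size) - 0.5) * 100


if __name__ == "__main__":
    # 0.16 is the Open Court effect size; 0.25 is the threshold cited above.
    for d in (0.16, 0.25):
        print(f"effect size {d:.2f}: about {percentile_gain(d):.1f} percentile points")
```

On this assumption, an effect size of 0.16 moves the average child by roughly six percentile points, against roughly ten for an effect size of 0.25.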
There is, then, good reason for serious concern regarding the reliability of the judgments offered by WWC. The egregious errors noted above apart, there is the more pressing problem that truly experimental trials are still relatively rare in educational research and those that have been carried out may often be methodologically flawed. In its early years, What Works was renamed ‘Nothing Works’ by some because there was little or no acceptable evidence available on many programs. Clearly, teachers cannot just stop using almost all programs and interventions until there are sufficient randomized control trials (RCTs) testifying to their efficacy to warrant adopting them. Hattie, for example, in his seminal 2009 work ‘Visible Learning’, has synthesized over 800 meta-analyses relating to achievement in order to be able to offer evidence-based advice to teachers (http://tinyurl.com/3h9jssl). (Very few of the studies on which the meta-analyses were based were randomized control trials, however, as Hattie makes clear.)
Until we have a large evidence base of methodologically sound randomized control trials on a wide variety of educational programs, methods and procedures, we need a more sophisticated and pragmatic analysis of the evidence we currently have available. It is not a question of accepting any evidence in the absence of good evidence, but rather of assessing the existing research findings and carefully explaining the limitations and caveats.
As I have attempted to show, the spurious rigour of WWC, whereby the vast majority of studies on any topic are simply discarded as being too old or too weak methodologically, coupled with their unfortunate habit of making alarming mistakes, makes it hard to trust their judgments. If the suggestions of bias regarding their pedagogical preferences have any substance, we have even more cause for concern. As it stands, What Works simply won’t wash.

Postscript November 15, 2013 
Further to my original blog post ‘What’s wrong with What Works?’, WWC have released new reports on reading interventions that confirm that at WWC it is business as usual. My colleague, Mark Carter, alerted me to problems associated with the WWC evaluation (March 2013) of the efficacy of FastForWord® (FFW):
“I just can't believe WWC. They found "positive" effects of FFW for alphabetics but the ES was 0.15 – trivial, below their own 0.25 (low) standard for educational significance and moving the average child a whole 6 percentile ranks. They got exactly the same results for fluency and comprehension and reached different conclusions for each. Three identical effect sizes - three different conclusions. 
They say in the text that the effects are below 0.25 and are "indeterminate" but then give it a positive rating. They seem to be vote counting the number of significant outcomes in a given area. This is conceptually antithetical to the whole idea of meta-analysis. The problem with examining significance is that it is substantially a function of sample size - that is why we use effect sizes to aggregate findings across studies. And, to boot, they are ignoring their own criteria for educational significance. I really can't believe it.” 
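The point about significance being a function of sample size can be illustrated with a rough sketch. It assumes two equal-sized groups and a large-sample normal approximation in which the standard error of the effect size is about sqrt(2/n); the function name `two_sided_p` and the sample sizes are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for a standardized mean difference.

    Assumes two groups of equal size and a large-sample normal
    approximation: SE(d) is taken as sqrt(2 / n_per_group).
    """
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    d = 0.15  # the same effect size in every hypothetical study
    for n in (50, 200, 1000):
        p = two_sided_p(d, n)
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"n per group = {n:4d}: p = {p:.4f} ({verdict} at 0.05)")
```

With the effect size held at 0.15, the result is not significant with 50 or 200 children per group but is significant with 1,000 per group. Counting significant outcomes can therefore reflect sample size as much as the size of the effect.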

It is also worth noting that WWC based their efficacy evaluation on just nine studies out of the 342 studies they originally identified that looked at the efficacy of FFW on early reading skills; seven studies that met their evidence standards without reservations and two studies that met their standards with reservations.