Systematic reviews are held to be the gold standard of research synthesis. Grounded in a rigorous and transparent methodological protocol, systematic review findings are regarded as authoritative and replicable. These core principles of systematic reviews – rigour, transparency, and replicability – have distinguished systematic reviews from the more common traditional literature reviews, and established the former as an accepted research methodology. Such is systematic reviews’ reputation as a research method that, for example, in the UK the National Institute for Health and Care Excellence (NICE) considers systematic reviews the preferred form of evidence when making decisions on the approval of new drugs.
But what would happen if NICE conducted two independent systematic reviews of the effectiveness of the same new drug, and the two reviews came to different recommendations? Such a hypothetical case would certainly have serious implications for the assumed rigour of systematic review as a research methodology. A recent World Bank blog post claims to present just such a case. In the post, which is based on a policy research paper assessing what works in improving learning outcomes in developing countries, Evans and Popova state that they conducted a ‘systematic review of systematic reviews’ to investigate commonalities in review findings. The authors identify six systematic reviews and, in comparing their findings, fail to discover much overlap between the reviews’ conclusions. In further analysis, they discover that this divergence is mainly driven by differences in the included primary evidence and in the formulation of intervention categories. Given the contradictions in the findings of what are perceived to be similar systematic reviews, the authors rightfully ask ‘how definitive are these systematic reviews really?’ and caution that the community should ‘take systematic reviews with a grain of salt’.
So what is going on? Have the authors posed a serious challenge to the methodology of systematic review and the wider science of research synthesis? In this response, we argue that they have not, and that the authors risk constructing a straw man, since neither their own review nor the majority of the reviews they examined can be considered systematic reviews. Our argument is based on a rigorous assessment of both the authors’ review of reviews and the systematic reviews it included. Our assessment used a structured critical appraisal tool applied by three independent reviewers, which is explained in more detail below.
1. Critical appraisal methodology
There are many organisations that aim to establish best practice and minimum standards for systematic reviews across a range of different disciplines. Modelled on the Cochrane Collaboration, which has coordinated systematic reviews in health for over 20 years, the Campbell Collaboration and the Collaboration for Environmental Evidence have assumed the same function in the social and environmental sciences respectively. Several other organisations also coordinate the publishing of systematic reviews, including the European Food Safety Authority, the EPPI-Centre, and the Centre for Reviews and Dissemination. These organisations have strict minimum requirements that must be assessed through external peer-review and met before they will publish a systematic review (or review protocol). These minimum requirements are not an officially accepted definition of what constitutes a systematic review, but their stringency maximises the reliability of the reviews published under them.
Three broad minimum standards, however, are common to all of the above organisations’ systematic reviews: i) systematic review methods should be described in sufficient detail to allow full repeatability and traceability; ii) reviews must include a systematic approach to identifying and screening relevant academic and grey literature; and iii) they should include critical appraisal of the validity (quality and generalisability) of included studies so that greater weight can be given to more reliable studies. We have used these minimum standards to produce a schema to aid critical appraisal of the included reviews (Table 1).
We stress again that we do not claim here to provide a strict definition of a systematic review. Using this critical appraisal tool to describe the reviews analysed in the World Bank report, however, it emerges that most of them do not qualify as systematic reviews. Our analysis of the included studies is available in a spreadsheet appended to this review.
Table 1. Schema for critical appraisal of reviews
Domain | Questions | Explanation |
The review | Does the document refer to itself as a “systematic review”? How does the document refer to itself? Is the review published in an academic journal? Was the review subject to peer-review? If so, how (external/internal)? | Reviews may not claim to be systematic. Peer-review (whether formal or informal) is a cornerstone of scientific research and indicates some form of community appraisal. |
The protocol | Was a protocol produced (as mentioned in the review)? Was this protocol externally peer-reviewed? Was the protocol published? | A review protocol sets out the methods for the review and allows expert and public input to fine-tune the sources and strategies for identifying and including the best available evidence. |
The search | Were multiple academic sources searched? Was a search string established and used in all resources? Were searches documented (at minimum the date, search terms, and numbers of results)? Were attempts to search for grey literature included? | Systematic reviews should be as comprehensive as possible, searching multiple databases and making efforts to search for grey literature in addition to published research. Search activities should be documented in sufficient detail to allow the review to be repeated. |
The screening | Are clear inclusion criteria reported? Was screening undertaken by multiple reviewers to any extent? Was consistency between reviewers tested? Are the results of screening (numbers) reported for titles and abstracts? Are the results of screening (numbers and exclusion reasons) reported for full texts? Are the numbers of unobtainable/untranslated articles reported? | Transparent criteria for the inclusion of articles in the review are vital to demonstrate objectivity and allow repeatability in a review. Screening should be carried out by more than one reviewer to demonstrate objectivity, and consistency should be assessed between reviewers. Screening activities should be clearly documented for traceability, with the numbers of articles making it through each stage recorded. Ideally, reasons for excluding articles at full text should be provided. |
Critical appraisal | Are included studies appraised for internal and external validity? Are the criteria for critical appraisal provided in detail? Are the results of critical appraisal reported in detail? | Critical appraisal of study internal (quality) and external (generalisability) validity is a vital stage in every systematic review. Synthesis should be based on reliable evidence. Critical appraisal allows unreliable evidence to be excluded or down-weighted in analyses. Activities should be documented and justified in detail. |
Data extraction | Is the method for data extraction reported in detail? Are the extracted data reported? Is any data manipulation reported in detail? | Data (quantitative and qualitative) should be extracted in a transparent way to ensure that objectivity and consistency across studies are maintained. Any manipulation of data (e.g. calculation of effect sizes) should be documented in detail. Ideally, all extracted data should be available in some form. |
Synthesis | Is a narrative/qualitative/quantitative synthesis present? Is there any evidence of vote-counting? Is publication bias assessed or discussed? Are synthesis methods provided in detail? Are synthesis outputs reported in detail? | Narrative synthesis is the discussion of the evidence base as a whole. Qualitative synthesis involves an established means of combining results in a qualitative way. Quantitative synthesis (meta-analysis) involves the statistical combination of study results. Vote-counting (tallying studies by the direction or significance of their findings) should always be avoided, since it ignores effect sizes and study precision and can mask patterns that only become apparent when results are combined across studies. Publication bias should always be investigated to assess whether the findings of the review may be affected by the absence of studies with certain findings (e.g. non-significant or contradictory results). All synthesis activities should be documented in detail. |
Other | Are potential conflicts of interest discussed? Is it obvious who funded the review? | Potential conflicts of interest should be dealt with in an acknowledgements section and through documenting author affiliations. Sources of funding should be included in the acknowledgements. |
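To make the appraisal process concrete, the sketch below shows one way the Table 1 schema could be encoded and the judgements of three independent reviewers combined into a per-domain consensus. It is a minimal illustration in Python: the domain names follow Table 1, but the three-level answer scale, the majority rule, and the example judgements are our own hypothetical assumptions, not part of any published tool.

```python
from collections import Counter

# Appraisal domains following Table 1
DOMAINS = [
    "review", "protocol", "search", "screening",
    "critical_appraisal", "data_extraction", "synthesis", "other",
]

# Each reviewer answers every domain with "yes", "partial", or "no".
# These example judgements are hypothetical, not our assessment data.
reviewer_judgements = {
    "reviewer_1": {d: "no" for d in DOMAINS} | {"search": "partial"},
    "reviewer_2": {d: "no" for d in DOMAINS} | {"search": "partial", "synthesis": "yes"},
    "reviewer_3": {d: "no" for d in DOMAINS},
}

def consensus(judgements: dict[str, dict[str, str]]) -> dict[str, str]:
    """Take the majority answer per domain; flag non-majorities for discussion."""
    result = {}
    for domain in DOMAINS:
        votes = Counter(j[domain] for j in judgements.values())
        answer, count = votes.most_common(1)[0]
        result[domain] = answer if count > len(judgements) / 2 else "discuss"
    return result

if __name__ == "__main__":
    for domain, verdict in consensus(reviewer_judgements).items():
        print(f"{domain:18s} {verdict}")
```

Any domain flagged “discuss” would then be resolved through reviewer discussion, mirroring how disagreements are typically handled in double screening.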
2. How systematic are the six included ‘systematic reviews’?
None of the included studies claim to be systematic reviews, although Krishnaratne et al. (2013) is based on a systematic review with additional studies included from 2009 onwards. Conn (2014) claims to include a systematic literature search as part of the review process. Two of the reviews purport to be meta-analyses, whilst the rest claim to be ‘reviews’. Two of the reviews were not peer-reviewed: this is not in itself an explicit indication of poor quality, but it does indicate a higher potential risk of unreliability or bias.
Only one review included a description of a protocol, but that was for the systematic review upon which it was based (Krishnaratne et al. 2013). Only one review (Conn 2014) reported its methods in any detail at all. The remaining five reviews gave no details of the searches used or the numbers of results; Murnane and Ganimian (2014) gave no description of their methods whatsoever. At least one review failed to include searches for grey literature (Murnane and Ganimian 2014), and no details on grey literature searching are given in Kremer et al. (2013). The numbers of included articles were reported by only two reviews (Conn 2014; Glewwe et al. 2014), and then only at abstract level. Four reviews failed to assess the reliability of the studies that they included, and where reliability was investigated this was only a basic assessment of experimental design (Glewwe et al. 2014). Two reviews (Conn 2014; McEwan 2014) provided details of how data were extracted and/or coded. One paper provided a link to a website where further information on data manipulation was provided (Kremer et al. 2013). Three reviews included some form of vote-counting (Glewwe et al. 2014; Kremer et al. 2013; Murnane and Ganimian 2014). Only two reviews report their synthesis activities in detail (Conn 2014; McEwan 2014).
In summary, two of the reviews included in the World Bank paper can by no interpretation of the term be regarded as systematic reviews (Kremer et al. 2013; Murnane and Ganimian 2014). One review (Glewwe et al. 2014) approaches a systematic design, but the lack of methodological detail and of reported results means that it is of highly questionable reliability. A further review is an update of a published 3ie systematic review (Krishnaratne et al. 2013), but the update was not carried out in a systematic way, with the lack of a systematic search and of methodological detail reducing its reliability. McEwan’s review (2014) is a sophisticated meta-analysis with a partially systematic search and could arguably be referred to as a systematic review, although it includes no critical appraisal. Finally, Conn (2014) is a high-quality review that fails only on its lack of methodological detail and documentation of activities.
3. How systematic is the authors’ systematic review?
The blog post ends with a tongue-in-cheek call for someone to “do a systematic review of our systematic review of the systematic reviews”. On a serious note, umbrella systematic reviews are a common feature of clinical medicine: thousands of systematic reviews are now published, and overlap or replication between them is common. Tertiary systematic reviews of multiple overlapping systematic reviews are also undertaken, and the same rigorous methodological standards are maintained in these reviews. The review described in the blog post and detailed in the paper, however, is also not a systematic review.
The evidence obtained for the review is not the result of a systematic search; indeed, the methods are sparsely described. Based on the blog and the report, it appears that the 3ie database of systematic reviews was searched for reviews from 2013 and 2014. No details of the search strategy are given, and no other resources were searched for evidence. Fairly detailed inclusion criteria are given, although the means by which these criteria were applied are not reported at all. No critical appraisal was conducted, which explains the inclusion of several reviews that are not systematic reviews. Nor are methods provided for how data were extracted and synthesised, and it is unclear whether the data reported are all that were available from each study. The synthesis itself is essentially vote-counting: no formalised qualitative or quantitative synthesis is undertaken.
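Why does vote-counting matter so much? The toy example below, with invented numbers, illustrates the problem flagged in Table 1: counting studies by statistical significance can report ‘no significant findings’ even where a simple fixed-effect inverse-variance pool of the same studies shows a clearly positive combined effect. The effect sizes and standard errors are hypothetical, chosen purely for illustration.

```python
import math

# (effect size, standard error) for five hypothetical studies, each with a
# small positive effect that is not individually significant
studies = [(0.15, 0.10), (0.12, 0.09), (0.18, 0.11), (0.10, 0.08), (0.14, 0.10)]

# Vote-counting: tally studies whose 95% confidence interval excludes zero
significant = sum(1 for d, se in studies if abs(d) > 1.96 * se)
print(f"Vote count: {significant} of {len(studies)} studies significant")

# Fixed-effect inverse-variance meta-analysis of the same studies
weights = [1 / se**2 for _, se in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(f"Pooled effect: {pooled:.3f} (95% CI {pooled - 1.96 * pooled_se:.3f} "
      f"to {pooled + 1.96 * pooled_se:.3f})")
```

Run on these numbers, the vote count is 0 of 5, while the pooled estimate is around 0.13 with a confidence interval that clearly excludes zero: exactly the kind of pattern that vote-counting masks.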
We do not claim to have undertaken a full systematic review of Evans and Popova’s report. Rather, we have used a structured approach to assess the rigour of the review and to comment on potential sources of bias. From a gold-standard systematic review perspective, the evidence included in the review is not a highly reliable set of studies, and the way in which the review of reviews was undertaken is not as rigorous, transparent, and replicable as one would expect a systematic review of reviews to be.
In sum, Evans and Popova have conducted a helpful review of evidence that holds much promise to inform thematic debates regarding education in developing countries. However, since neither the majority of the identified reviews nor the resulting review of reviews can be regarded as following systematic review methodology, the paper carries few practical implications for the science of research synthesis. Having said that, we feel that as reviewers we can nevertheless draw a number of lessons from this exchange:
4. We need to get better at communicating the system behind systematic reviews
There seems to be an underlying belief throughout Evans and Popova’s report and blog post that any review which follows some form of methodological system should automatically count as a systematic review. This perception is mistaken. The counting of narrative reviews as systematic reviews in their paper illustrates this point: while a systematic review might rely on a narrative synthesis where qualitative or quantitative synthesis is not feasible, there is no such product as a fully narrative systematic review. A literature review that follows a narrative thread informed by expert opinion, rather than a set of a priori, established systematic methods, necessarily remains a literature review. Having a system for searching for studies or for defining inclusion criteria does not mean that a review can be classified as a systematic review.
Systematic reviews must provide sufficient detail to ensure repeatability and traceability. They must include a systematic approach to identifying and screening relevant academic and grey literature. They must include a critical appraisal of included study validity to give greater weight to more reliable primary research. These three minimum requirements are common to all true systematic reviews, and they ensure rigour, transparency, and replicability. To safeguard these standards, systematic reviews are typically undertaken according to a published and peer-reviewed protocol. Taken together, this ensures that systematic reviews do not just follow whatever system individual authors deem fit, but rather that they follow one established method. While ‘systematic review’ might be an accepted trademark in the medical literature, we need to get better at communicating systematic review as a distinctive research methodology in the social and environmental sciences.
5. We need more reviews to be registered with accredited systematic review bodies
As outlined above, the majority of the reviews assessed by Evans and Popova do not meet the accepted standards for conducting and reporting systematic reviews. Had these reviews been registered with a systematic review body, their authors would have had access to a range of methodological experts and resources, including detailed methodological guidelines and expert advice through peer-review. There would also have been an associated requirement to make the review methodology public a priori, subjecting the search strategy, inclusion criteria, and synthesis methods to peer-review. This would have exposed fundamental flaws in the review products.
Systematic review coordinating organisations are an invaluable asset to the field of research synthesis. They help to prevent – to cite one of Evans and Popova’s concerns – ‘reviews [from varying] in how systematically they define the strategy used to identify the papers reviewed’ by using a uniform system to report search results and inclusion criteria (that is, the PRISMA diagram and the PICO criteria). Systematic review bodies aim to prevent exactly the kind of issues Evans and Popova identified in their reviews. 3ie-funded systematic reviews of international development interventions already require registration of the review with the Campbell Collaboration, a practice that should be more widely used (and extended to other review bodies, for example the Cochrane Collaboration, CEE, and the EPPI-Centre). In the long run, the term systematic review in development should be associated with certain minimum standards such as those set out by these coordinating bodies. In this way, the reliability of a review product will be clearly guaranteed.
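As a concrete illustration of the uniform reporting a PRISMA diagram demands, the short sketch below tracks article counts through each stage of a hypothetical review. All counts and exclusion reasons here are invented for illustration; in a real review they would come from the reviewers’ screening records.

```python
# Bookkeeping behind a PRISMA-style flow diagram (all numbers hypothetical)
records_identified = 4812            # hits across all databases searched
duplicates_removed = 1203
titles_abstracts_screened = records_identified - duplicates_removed
excluded_at_title_abstract = 3401
full_texts_assessed = titles_abstracts_screened - excluded_at_title_abstract
full_text_exclusions = {             # exclusion reasons reported per article
    "wrong population": 102,
    "wrong intervention": 57,
    "no outcome data": 31,
}
included = full_texts_assessed - sum(full_text_exclusions.values())

print(f"Records identified: {records_identified}")
print(f"Screened after de-duplication: {titles_abstracts_screened}")
print(f"Full texts assessed: {full_texts_assessed}")
for reason, n in full_text_exclusions.items():
    print(f"  excluded ({reason}): {n}")
print(f"Included in synthesis: {included}")
```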
6. The risks of relying on conventional literature review findings are as pressing as ever, and we need more, not fewer, systematic reviews, including meta-reviews
Dean Karlan made the following comparison: ‘Literature reviews are like sausages… I don’t eat sausages as I don’t know what goes into them’. In essence, at least two of the reviews included by Evans and Popova are simply literature reviews. Should we then be surprised that their findings differ? Murnane and Ganimian (2014) state ‘in this paper, we describe our interpretation of the evidence from well-designed evaluations of interventions’. Kremer and colleagues (2013) also add their interpretation and may well define ‘well-designed evaluation’ differently. Are we then surprised that the studies yield different conclusions?
In addition, Evans and Popova engage in the same practice at a meta level. They do not consider all of the available evidence on education in developing countries (see, for example, DFID’s review series). Nor do they critically appraise the reviews they include. Tertiary systematic reviews are not exempt from the need to follow rigorous systematic review methodology. The most important methodological insight we gain from this meta-literature review of literature reviews is therefore: let’s do fewer literature reviews, at whatever level of abstraction.
Evans and Popova’s call for future work might therefore need to be revised, to encourage ‘someone to please first do an actual systematic review of real systematic reviews’. And what next? A quaternary systematic review of their revised systematic review of real systematic reviews! What a marvel indeed!