How Science Takes Stock
How Science Takes Stock The Story of Meta~Analysis
Morton Hunt
Russell Sage Foundation
New York
The Russell Sage Foundation

The Russell Sage Foundation, one of the oldest of America's general purpose foundations, was established in 1907 by Mrs. Margaret Olivia Sage for "the improvement of social and living conditions in the United States." The Foundation seeks to fulfill this mandate by fostering the development and dissemination of knowledge about the country's political, social, and economic problems. While the Foundation endeavors to assure the accuracy and objectivity of each book it publishes, the conclusions and interpretations in Russell Sage Foundation publications are those of the authors and not of the Foundation, its Trustees, or its staff. Publication by Russell Sage, therefore, does not imply Foundation endorsement.

BOARD OF TRUSTEES
Peggy C. Davis, Chair
Alan S. Blinder
Anne Pitts Carter
Joel E. Cohen
Phoebe C. Ellsworth
Timothy A. Hultquist
Ira Katznelson
Ellen Condliffe Lagemann
Howard Raiffa
John S. Reed
Neil J. Smelser
Eugene Smolensky
Harold Tanner
Marta Tienda
Eric Wanner
William Julius Wilson
Library of Congress Cataloging-in-Publication Data

Hunt, Morton, 1920-
How science takes stock: the story of meta-analysis / Morton Hunt.
p. cm.
Includes bibliographical references and index.
ISBN 0-87154-389-3 (cloth)
ISBN 0-87154-398-2 (paper)
1. Science-Methodology. 2. Science-Miscellanea. 3. Meta-analysis. I. Title.
Q175.H94 1997
001.4'22-dc21    97-964 CIP

Copyright © 1997 by Russell Sage Foundation. First paper cover edition 1999. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Reproduction by the United States Government in whole or in part is permitted for any purpose.

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences-Permanence of Paper for Printed Library Materials. ANSI Z39.48-1992.

RUSSELL SAGE FOUNDATION
112 East 64th Street, New York, New York 10021

10 9 8 7 6 5 4 3 2
to
Bernice, meta-mate
"Chaos was the law of nature; Order was the dream of man." --Henry Adams, paraphrasing Karl Pearson
Contents

Acknowledgments

Chapter 1  Making Order of Scientific Chaos
  The Explosion of Contemporary Science
  The Classic-and Inadequate-Solution
  A Radical New Approach
  A Sampler of Meta-Analytic Achievements
  The Value of Meta-Analysis

Chapter 2  Settling Doubts About Psychotherapy
  Attack, Counterattack, and Stalemate
  Vote-Counting-A Plausible but Unreliable Way to Sum Up Research
  Genesis of Meta-Analysis: Part I
  Mining and Smelting the Data
  Genesis of Meta-Analysis: Part II
  IN BRIEF ...
    Does Marital and Family Therapy Work?
    Second-Level Meta-Analysis
    Trying to Stay Sick in Order to Get Well
    Between Treatment and Effect: A Rube Goldberg View

Chapter 3  Clarifying Murky Issues in Education
  Is It True That "Throwing Money at Schools" Does No Good?
  Of Universes and Their Boundaries
  "Dollars and Sense: Reassessing Hanushek"
  Apples and Oranges
  Going Public Against Hanushek
  IN BRIEF ...
    What Good Is Homework?
    New Knowledge
    Sex and Science Achievement
    The Bugaboo of Causality

Chapter 4  "Who Shall Decide, When Doctors Disagree?"
  Too Many Studies, Too Few Answers
  Second and Other Opinions-and Combined Opinions
  The Streptokinase Case
  Measures of Outcome
  Time and Lives Lost That Could Have Been Saved
  Are Big Trials Necessary?
  IN BRIEF ...
    The Surprising Finding of a Large Breast Cancer Meta-Analysis
    Meta-Analytic Epiphanies
    The Case of the Does-and-Doesn't-Work Vaccine
    Central Question

Chapter 5  Firming Up the Shaky Social Sciences
  Judging a Stranger in Thirty Seconds
  How Widely Is It True?
  Thin Slices, Fat Findings
  The File-Drawer Problem
  IN BRIEF ...
    The Great Expectancy Effect Controversy
    Checking for Leaks
    Rehabilitating Juvenile Delinquents: Is It True That Nothing Works?
    The Noise-to-Signal Ratio

Chapter 6  Lighting the Way for Makers of Social Policy
  A Tough Customer Makes a Tough Request
  Doing Meta-Analyses for Congress: A Challenging Task
  A Meta-Analysis Changes a Senator's Mind
  Meta-Analyses and Congressional Policy-Making
  IN BRIEF ...
    Lumpectomy Versus Mastectomy
    Into the Black Hole
    Smoke Gets in Your Eyes-and Lungs, and That's Nothing to Sing About
    "More Ways than One to Skin a Cat"-Mark Twain

Chapter 7  Epilogue: The Future of Meta-Analysis
  Elixir of Forecast: Take Minimal Dose
  A Few Specific Prophecies
  A Parting Word

Appendix by Harris Cooper
Notes
References
Index
ABOUT THE AUTHOR

Morton Hunt writes about the social and behavioral sciences. He attended Temple University and the University of Pennsylvania, worked briefly on the staffs of two magazines, and since 1949 has been a freelance writer. He has written eighteen books and some 450 articles. He has won a number of prestigious science-writing awards.
ALSO BY MORTON HUNT
The Natural History of Love
Her Infinite Variety: The American Woman as Lover, Mate and Rival
Mental Hospital
The Talking Cure (with Rena Corman and Louis R. Ormont)
The World of the Formerly Married
The Affair: A Portrait of Extra-Marital Love in Contemporary America
The Mugging
Sexual Behavior in the 1970s
Prime Time: A Guide to the Pleasures and Opportunities of the New Middle Age (with Bernice Hunt)
The Divorce Experience (with Bernice Hunt)
The Universe Within: A New Science Explores the Human Mind
Profiles of Social Research: The Scientific Study of Human Interactions
The Compassionate Beast: What Science Is Discovering About the Humane Side of Humankind
The Story of Psychology
Acknowledgments
The Russell Sage Foundation has supported research into the application of meta-analysis to the social sciences since 1987; many of the findings that have emanated from that support lend themselves to meta-analysis in other fields of science as well. The publications resulting from the Foundation's grants have been largely technical, intended for scientists and policy analysts interested in learning about and applying meta-analytic methodology. But in 1994 Eric Wanner, president of the Foundation, felt it was time to tell a larger public about meta-analysis-a public comprising any and all persons with a general intellectual interest in science, plus one special audience: members of Congress, state legislators, agency administrators, and their staffs, all of whom might find meta-analyses a swift and effective way to acquire the data needed to buttress decisions about social programs and legislation. This book is the product of a grant proposed by Mr. Wanner and affirmed by the Russell Sage Foundation's board of directors. I am deeply grateful to Foundation Scholar Robert K. Merton for recommending me to Mr. Wanner, and to Mr. Wanner and the board for entrusting me with the task of authorship.

My special thanks go to Harris Cooper, who read the manuscript and corrected a number of technical errors; any that remain are my responsibility. I am also grateful to all those people who gladly suffered-at least I hope it was gladly-to be interviewed at length and to those others whom I did not interview but who generously furnished me with reprints and other documents: Nalini Ambady, Alexia A. Antczak-Bouckoms, Theodore X. Barber, Betsy Becker, Iain Chalmers, Thomas Chalmers, Eleanor Chelimsky, Graham Colditz, Rory Collins, George Comstock, Thomas Cook, Harris Cooper, David S. Cordray, Elizabeth C. Devine, Judith Droitcour, Daniel Druckman, Alice Eagly, H. J. Eysenck, Chris Fossett, Gene V Glass, Roger Greenberg, Joel Greenhouse, Rob Greenwald, Judith Hall, Eric Hanushek, Larry V. Hedges, Bruce Kupelnick, Richard D. Laine, Joseph Lau, Richard J. Light, Mark Lipsey, Norman Miller, Frederick Mosteller, Ingram Olkin, Michele Orza, Robert Rosenthal, Donald Rubin, Christopher Schmid, William Shadish, Varda Shoham, Mary Lee Smith, William Stock, Miron Straf, Kenneth Wachter, Paul M. Wortman, Robert York, and Salim Yusuf.

I thank the following for permission to reprint copyrighted materials:

-The American Psychological Association and William R. Shadish: figures 1 and 2 from Shadish and Sweeney (1991). Adapted with permission.
-The Lancet: Figure 2 from the Early Breast Cancer Trialists' Collaborative Group (1992).
-The New England Journal of Medicine and the Massachusetts Medical Society: Figure 1 from Lau and others (1992). Reprinted by permission; all rights reserved.
-The Russell Sage Foundation: Figures 6.6 and 6.7 from Betsy Jane Becker's chapter in Cook and others (1992).
-U.S. General Accounting Office: Figure titled "Quality of Studies and Credibility of Available Information" from U.S. General Accounting Office (1984).
-John Wiley & Sons and Mark Lipsey: Figures 2 and 3 from Lipsey (1995).

MORTON HUNT
Chapter 1
Making Order of Scientific Chaos
The Explosion of Contemporary Science
"If I have seen further it is by standing on the shoulders of giants," Isaac Newton wrote to Robert Hooke in 1675/6. In assuming this modest pose, he alluded to a fundamental assumption that our culture makes about science, namely, that it is progressive and cumulative, a corollary of which is that forays into the unknown by any researcher, however brilliant, are merely extensions of the knowledge amassed up to that time. For centuries it has been an article of faith that scientists base their research on existing information, add a modicum of new and better data to it, and thereby advance toward an ever more profound, complete, and accurate explanation of reality.

But today we are experiencing a crisis of faith; many of us no longer feel sure that science, though growing explosively, is moving inexorably toward the truth. Indeed, "growing explosively" is an ominous oxymoron: "growing" implies orderly development, but "explosively" denotes disorder and fragmentation. Virtually every field of science is now pervaded by a relentless cross fire in which the findings of new studies not only differ from previously established truths but disagree with one another, often vehemently. Our faith that scientists are cooperatively and steadily enlarging their understanding of the world is giving way to doubt as, time and again, new research assaults existing knowledge.

In recent years, however, methodologists in a number of scientific disciplines have been developing an antidote to the increasingly chaotic output of contemporary research. Known as meta-analysis, it is a means of combining the numerical results of studies with disparate, even conflicting, research methods and findings; it enables researchers to discover the consistencies in a set of seemingly inconsistent findings and to arrive at conclusions more accurate and credible than those presented in any one of the primary studies. More than that, meta-analysis makes it possible to pinpoint how and why studies come up with different results, and so determine which treatments-circumstances or interventions-are most effective and why they succeed.*

To appreciate how anarchic contemporary research has become and how needed this new methodology is, one has only to read the daily papers. Here, for instance, are two typical recent news stories:

NEW STUDY FINDS VITAMINS ARE NOT CANCER PREVENTERS
A new study [reported in The New England Journal of Medicine] has failed to find evidence that vitamin supplements protect against the development of precancerous growths in the colon.... Many [previous] studies had found that people who eat large amounts of fruits and vegetables had lower cancer rates, and fruits and vegetables are known for providing vitamins C and E.1

STUDY SAYS EXERCISE MUST BE STRENUOUS TO STRETCH LIFETIME
Moderate exercise may well be the route to a healthier life, but if living longer is your goal, you will have to sweat. A new Harvard study that followed the fates of 17,300 middle-aged men for more than 20 years has found that only vigorous and not nonvigorous activities reduced their risk of dying during the study period. 2
In a follow-up, the writer adds: "The new finding ... has surprised leading researchers in the field. They are striving to reconcile it with many other studies that point to a life-saving benefit from moderate exercise, and they are perplexed that the Harvard study failed to find the expected benefit."3

Some other instances of seeming disarray in recent scientific findings:

• Ten studies determine how much the risk of ischemic heart disease (blockage of heart arteries) is reduced when serum cholesterol is lowered by roughly one-tenth of the average levels in Western countries. All ten studies conclude that it does reduce the risk, but the reported reduction ranges from nearly 40 percent in one study to as little as 15 percent in another.4

• Twenty-one studies of the use of fluorouracil against advanced colon cancer all find it beneficial, but findings of its effectiveness vary so widely-from a high of 85 percent to a low of 8 percent-as to be meaningless and useless to clinicians.5
* The method has many other names, among them research synthesis, evaluation synthesis, overview, systematic review, pooling, and structured review. In general, I use the term meta-analysis, since it is the one most often used in journal titles, indexes, and data bases.
• In 1994 a study published by the National Task Force on the Prevention and Treatment of Obesity reported that "yo-yo dieting" (the repeated losing and regaining of weight) poses no significant health risks-a direct contradiction of the findings of previous studies that off-and-on dieting can disrupt the body's metabolism, increase body fat, lead to heart problems, and heighten the risks of suffering other health problems.6

• A recent major study of the effects of exposure to the electromagnetic fields that surround power lines and electrical equipment shows a stronger link between electromagnetic fields and brain cancer than any previous study-but also contradicts earlier studies by finding no evidence of increased risk of leukemia.7

Such cases are legion not only in medical and biological research but also in behavioral and social science research:*

• Many studies of the treatment of aphasia (loss of speech due to brain damage) by speech therapy find it valueless, while others find it distinctly effective.8

• A number of studies of the effect of coaching on Scholastic Aptitude Test scores have shown that it raises them significantly, others that it raises them only trivially.9

• A generation ago, the Department of Health, Education, and Welfare asked Richard J. Light, a statistician at Harvard University, to determine whether the Head Start program worked. Light found a wealth of research data in thirteen studies that had already evaluated the program. The first twelve all showed modest positive effects, but the thirteenth, far larger than any of the others, disconcertingly showed no effect.10 "I had no idea what to do," Light recently told a reporter for Science; his bewilderment eventually motivated him to develop a way of combining disparate research results, a precursor of meta-analysis.11

• Some studies find school desegregation to improve the academic achievement of black students significantly; others find only modest gains; and still others observe hardly any improvement. Even more confusing, some social scientists present credible evidence that desegregation improves achievement, but others offer equally credible evidence that it diminishes achievement.12
* For brevity, the behavioral and social sciences are referred to hereafter as the social sciences.
• Do women in management have a different leadership style from men? The question has long been hotly debated: some management experts and social scientists claim the evidence shows they do differ, others that the data yield no clear pattern of differences in supervisory style.13

• A massive and influential review of the scientific literature on sex differences assembled during the heyday of feminism by two respected women psychologists found little evidence of such differences in any area of social behavior except aggression. But later studies have furnished experimental and observational evidence of sex differences in many kinds of social behavior, including helping, sending and receiving nonverbal messages, and conforming to group values.14

• The Department of Health and Human Services recently ordered a review of studies of the prevalence of alcohol, drug, and mental disorders among the homeless, expecting the information to help in developing sensible policies for reducing homelessness. A reviewer located eighty studies containing an abundance of data-but no answers. The estimates in the studies differed so widely as to be useless: alcohol problems in the homeless population, from 4 percent to 86 percent; mental health problems, from 1 percent to 70 percent; and drug problems, from 2 percent to 90 percent.15

And so on.
Why have the sciences apparently degenerated into an intellectual free-for-all in our time? In truth, there never was a golden era of pure harmony and cooperation among scientists. There have always been and are always bound to be competition, disagreement, and conflict among those pursuing research in any given area, since no two researchers think, perceive, or conduct a study in exactly the same fashion, nor are any two laboratory experiments or field observations exactly alike. Even when two researchers use the same methods to study a phenomenon, normal sampling errors (akin to the chance variations in the sum of two identical dice thrown repeatedly), minor differences in the persons they are studying, and other random factors make it unlikely that they will get the same or even very similar results. Accordingly, comparable and even replicate studies of any subject almost never yield identical findings.

In the social sciences the possibility of disparity is far greater than in the physical and biomedical sciences.16 So many interacting variables influence human behavior that no two groups of human subjects are identical, even if the groups are large and carefully equated. Moreover, human subjects, unlike cells in vitro or bacteria in a patient's body, often react to
experimental situations according to their own volitions and past experiences, thereby adding unique influences to the effects of the variables being examined by the researchers and supposedly under their control.

While discrepancies and contradictions have always existed in scientific research, today they are more numerous, well publicized, and disturbing than before. One obvious reason for the increase is the mushroom-like expansion of the sciences in the last half century. In medicine, for instance, during a single recent year the New England Journal of Medicine and the British Medical Journal alone devoted some 4,400 pages to 1,100 articles, and currently, throughout the world, over two million medical articles are published each year.17 With such voluminous output and so many new areas of investigation, it is inevitable there should be more disagreement than ever.

The reward system of science greatly intensifies the potential for disagreement. Career success is contingent on publication, and publishers are most interested in those studies that present news-findings that challenge the previously accepted wisdom. Although the primary motive of researchers is new knowledge, they are bound to hope that the results of their investigations will correct, conflict with, or disprove the results of earlier studies and those of concurrent research by colleague-competitors; such hopes, as a wealth of experimental evidence has shown, often unconsciously affect the researchers' performance in ways that tend to produce the desired result.

The resulting intellectual melee does serious injury to science and society. For one thing, it impedes scientific progress: As the volume of research grows, so does the diversity of results, making it all but impossible for scientists to make sense of the divergent reports even within their own narrow specialties. In consequence, their research tends to be based less on the accumulated knowledge of their field than on their limited and biased view of it. For another, when legislators and public administrators seek, through hearings and staff research, to study a pressing issue, they can rarely make sense of the hodge-podge of findings offered them. In the recent congressional debates about smoking in public places, members of Congress were told by antismoking advocates that many studies, including a report by the surgeon general, found "passive smoking" (inhalation of others' smoke) to be a cause of lung cancer. Tobacco-state colleagues and tobacco-industry lobbyists told them, however, of other studies by qualified researchers showing little evidence of such a connection, and even, remarkably, of two small 1984 studies and a larger, more recent one-in the authoritative New England Journal of Medicine-that found less lung cancer in people exposed to others' smoke in the home than in
people not so exposed.18 Who could blame a legislator for not relying on research to help him or her decide how to vote on the issue?

Lastly, the prevalence of scientific disparities is eroding public belief in and support for research. Many intellectuals see the conflicting findings as justifying "constructivism," a view now popular among the "postmodernist" academic Left that scientific discoveries are not objective truths but only cultural artifacts, not representations of reality but self-serving products of the system.19 At a different intellectual level, many of the uninformed and gullible see the contradictory outcomes of current research as grounds for broadly rejecting scientific knowledge in favor of simpler and more coherent beliefs-or "faiths"-in the power of prayer, guardian angels, miracles, astrology, past-lives regression, channeling, back-from-death experiences, and assorted New Age psychic phenomena.20
The Classic-and Inadequate-Solution

In most fields of science the standard way of dealing with the multiplicity of studies and divergent findings has long been the "literature review" or "research review." Scientific reports customarily begin with a brief resume of previous work on the problem being considered; in that tradition, for some decades journal editors have published occasional articles summarizing and evaluating recent studies in actively researched areas of their discipline, and nearly every field has a type of annual review journal consisting entirely of such resumes. Review articles, according to Howard White of the College of Information Studies, Drexel University, "are generally admired as a force for cumulation in science."21

From the vantage point of meta-analysis, however, that tribute seems unwarranted. It is true that a good review article can marshal and summarize recent work on a particular topic; it is true, too, that one can only admire those who perform the heroic task of reading scores of often dense, technical, and tedious studies and summing up each in a sentence or two. But anyone conversant with meta-analysis will question whether reading the desiccated summaries in such articles-not unlike chewing a mouthful of dry bran-yields a genuine integration of the new knowledge. Consider, for example, the following brief excerpt from a recent review article on individual psychotherapy:

Some comparisons of psychotherapy and drug treatment have suggested that combined treatment may present definite advantages over either treatment alone (Frank & Kupfer 1987; Weissman et al 1987; Hollon et al 1988), others have shown no differences between psychotherapy and psychotherapy plus medication at termination (Beck et al 1985), and still others have shown advantages at follow-up for patients who received cognitive-behavior therapy (Simons et al 1986). In the comparison of a cognitive-behavioral (prescriptive) therapy and a dynamic-experiential (exploratory) treatment of depression and anxiety, Shapiro & Firth (1987) found a slight advantage for the prescriptive approach, especially on symptom reduction.22

This specimen exemplifies the typical achievement and typical failure of the research review article: Although it offers a handy list of items in a particular area of research, it does little to integrate or cumulate them. Some reviews do offer more combinatory conclusions, but not methodically or rigorously; a recent critique of fifty medical review articles said that most summarized the pertinent findings in an unsystematic, subjective, and "armchair" fashion.23 In an even harsher appraisal of medical review articles, two leading medical meta-analysts, Thomas Chalmers and Joseph Lau, write,

Too often, authors of traditional review articles decide what they would like to establish as the truth either before starting the review process or after reading a few persuasive articles. Then they proceed to defend their conclusions by citing all the evidence they can find. The opportunity for a biased presentation is enormous, and its readers are vulnerable because they have no opportunity to examine the possibilities of biases in the review.24

Such criticisms apply to review articles in other fields of science. In Summing Up, a handbook of meta-analysis, Richard Light and David Pillemer characterize traditional review articles as not only subjective and scientifically unsound but "an inefficient way to extract useful information" because they lack any systematic method of integrating the relationships among the variables in the different studies and of reconciling differences in the results.25 Most review articles do not subject the studies they examine to the relatively simple statistical tests that would estimate how likely it is they mistook chance results-chiefly, sampling error-for meaningful ones (a false positive conclusion) or used too small a sample so that chance factors concealed the important results (a false negative conclusion). Review articles, in short, offer knowledge without measurement, the worth of which was famously expressed long ago by Lord Kelvin:

When you can measure what you are speaking about and express it in numbers you know something about it, but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.
A Radical New Approach

In 1904 the British mathematician Karl Pearson invented a statistical method for combining divergent findings. At that time, the effectiveness of inoculation against typhoid fever was still unclear; not only did the results of different trials vary but the samples were so small that the results of any one trial might be partly or largely due to chance factors. Pearson's simple but creative idea was to compute the correlation within each sample between inoculation and mortality (a correlation is a statistic showing how closely one variable is related to another variable) and then average the correlations of all the samples; the result, balancing out the chance factors and idiosyncrasies of the individual studies, would be a datum more trustworthy than any of the individual statistics that went into computing it.26

Seven decades later, when meta-analysis finally caught on, its practitioners developed an array of more complicated and precise computations than Pearson's, but to this day the basic concept behind combining and reconciling studies remains as simple and radical as in 1904. Rather than reaching vague conclusions like those of review articles-"The majority of studies show that ..." or "While some studies find the treatment effective, most fail to reach statistical significance"-the new approach asks, "How can we produce a precise, quantitative finding representing what the studies show when synthesized into one superstudy?"

Current meta-analytic techniques involve subtle and discriminating procedures, of which Pearson's averaging is one of the simplest. We will look at them later (in verbal, not mathematical, terms) when we peer over the shoulders of scientists as they conduct meta-analyses that resolve the ambiguities in bodies of important research data. For now, however, let us carry out a hypothetical experiment that will give us a first glimpse at how meta-analysts combine data, the central process in meta-analysis.

You are one of a hundred physicians taking part in the testing of a new fever-reducing medication, antipyron. You are to give the same specified dose of the drug to the next eight patients you treat for influenza, and to record their temperatures on taking the dose and again four hours later. Your first case is a young man who has a fever of 104° F; four hours after taking the drug his temperature has plummeted to 98°, and you, naturally, are delighted and enthusiastic. Your second flu patient is a middle-aged man with a fever of 102°; disappointingly, in four hours the antipyron lowers his temperature only to 100°. You give it to six more patients-with varying results.
What can you conclude about the overall effectiveness of the drug? Can you average the data of your eight cases and arrive at a typical figure for fever reduction for the dosage given? Certainly. Indeed, the researchers with whom you are collaborating might then take your average and mathematically combine it with the averages turned in by the other ninety-nine physicians taking part in the study to arrive at an overall drug effect. They might discover, say, that antipyron reduced fever in flu patients by an average of 2.5° F in four hours, across one hundred medical trials and eight hundred patients. This is a rudimentary meta-analysis, indicating how effective the drug is on average.

But being a well-trained doctor, you do not accept your own finding-or the finding of all one hundred trials-as an adequate guide to treatment, since you know that all patients are not alike. For one thing, the young man who was your first case weighs 150 pounds, the older man who was the second case, 220 pounds, and, typically, the greater the body weight, the less effect a given dose will have, since its effect is more diffused in a larger system. For another, the men also differ in age; perhaps the drug works more swiftly in a young body, with its higher metabolic rate, than in an older one.

Of course, the other ninety-nine physicians know this too. So they, like you, not only recorded the change in each patient's temperature but also their weight, age, and possibly some other data. The project researchers might have compiled your data, arranging the patients in the order of their weight:

Patient   Age   Weight   Temp. change
   1       20    150        -6°
   5       50    160        -5°
   4       35    170        -4°
   6       60    190        -4°
   3       45    180        -3°
   7       30    200        -3°
   8       25    210        -2°
   2       55    220        -2°
With the data arranged in this fashion, you would easily see a clear relationship between the weight of the patient and the effect of the drug; with one exception, patient 6, as the patient's weight goes up the effect gets smaller. It is less clear that age affects the outcome. Using your data, it is possible to calculate the correlations between the variables of weight and temperature change and between age and temperature change. A correlation will be +1.00 if the relation between two variables is perfect and positive (that is, as one variable goes up, so does the other), -1.00 if it is perfect but negative (as one goes up the other goes down), and zero if no relation exists at all. When you receive the researchers' report, you learn that in your own set of eight patients the correlation between weight and the drug's effectiveness is -.93 and between age and effectiveness, -.17. But when the researchers meta-analyzed-combined-the correlations for all one hundred data sets, they found that the average correlation between weight and temperature change is -.85, a strong (but not perfect) negative relation. The average correlation between age and effectiveness is only -.13, a weak association. Based on these meta-analytic findings, they conclude that higher doses of the drug may be necessary when treating heavier patients but that age, being largely irrelevant, does not need to be taken into account.

This imaginary experiment is a simplified version of only one process central to meta-analysis, namely, combining the findings of different studies. But a second and equally central problem exists: reconciling differences among studies. Suppose that nearly all one hundred physicians reported average reductions in their patients' fever of 3° but that one doctor reported an average reduction of only 1° and another an average reduction of 5.5°. How do you reconcile these departures from the more general finding? A little detective work might reveal that the small reduction came from a clinic that treated obese patients while the large reduction involved athletes. The meta-analyst might then suggest that differences among the patients are the clue to reconciling the disparate outcomes.
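For readers who like to see the arithmetic spelled out, the short Python sketch below reproduces the calculation just described: it computes the two within-trial correlations from the table of your eight patients and then averages correlations across trials, in the spirit of Pearson's procedure. It is only an illustration; the correlations credited to the other physicians are invented for the purpose.

```python
# Illustration of the antipyron example: within-trial correlations for the
# eight patients in the table, then Pearson-style averaging across trials.
from math import sqrt

def correlation(xs, ys):
    """Plain Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Your eight patients, from the table: weight (pounds), age (years), and
# effectiveness measured as the drop in temperature (degrees F).
weights   = [150, 160, 170, 190, 180, 200, 210, 220]
ages      = [20, 50, 35, 60, 45, 30, 25, 55]
reduction = [6, 5, 4, 4, 3, 3, 2, 2]

print(round(correlation(weights, reduction), 2))   # -0.93: heavier patients, smaller effect
print(round(correlation(ages, reduction), 2))      # -0.17: age matters far less

# Averaging correlations across trials, as Pearson did in 1904.  The two
# extra values stand in for other physicians' reports and are invented
# purely to give the averaging step something to work on.
weight_correlations = [-0.93, -0.81, -0.78]
print(round(sum(weight_correlations) / len(weight_correlations), 2))   # about -0.84
```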
With that brief rundown behind us, let's return to Pearson's first effort at combining the data of a set of studies. In 1904 the mainstream scientific community expressed little interest in Pearson's techniques. The time was not ripe, and the idea of synthesizing studies mostly languished for the next six decades. Only a handful of avant-garde scientists pursued Pearson's ideas in the first half of the century, sensing the impending need to synthesize studies.

In the first third of the century, for example, agricultural scientists conducted numerous experiments in farming techniques but were unable to draw general conclusions from them since the tests almost always involved differences in soil, agricultural practices, climatic conditions, and so on. The problem was how to reach any useful generalizations from these dissimilar studies. A statistician named Leonard H. C. Tippett found an answer. From each experiment, Tippett obtained three pieces of data: the size of the sample, the size of the difference in crop yield between different farming techniques, and the amount of variation in yield that occurred by chance within any specific technique (for instance, in experiments using the same kind and amount of fertilizer, how much variation in yield occurred without any known cause?). With this information, Tippett was then able, adjusting for sample size, to compare the difference between techniques to the difference within techniques, the latter being a measure of how much variation might occur by chance. This enabled him to calculate the likelihood that the difference in yield between farming techniques was due to chance-and conversely, the likelihood that the technique was causing real improvements in yield. He might, for instance, discover that, given a particular sample size and a certain amount of within-technique variation, only twelve times in one hundred would a between-technique difference as large as the one observed in that set of studies occur solely by chance. (Today we call this number the "probability of a chance finding.") Then Tippett made a notable leap: He worked out a statistical method of combining the probability values of the several studies. This statistic, bypassing all the differences among the studies, showed how likely it was that the results of the whole set of studies were due to chance and, conversely, how likely that the results arose from the new farming technique.27
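The text describes this combining step only in words, so the sketch below should be read as an illustration of the general idea rather than a reconstruction of Tippett's own formula; it uses Fisher's closely related and widely taught rule for combining independent probability values, and the per-experiment p-values are invented.

```python
# Illustration only: combining per-study "probability of a chance finding"
# values with Fisher's rule (-2 * sum of ln p, referred to a chi-square
# distribution with 2k degrees of freedom).
from math import exp, log, factorial

def combined_p(p_values):
    """Fisher's method for combining independent p-values."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    half = statistic / 2.0
    # The chi-square survival function has a closed form when df = 2k.
    return exp(-half) * sum(half ** i / factorial(i) for i in range(k))

# Four hypothetical field experiments, each only weakly suggestive on its
# own (compare the twelve-in-one-hundred chance mentioned in the text).
p_values = [0.12, 0.08, 0.20, 0.15]
print(round(combined_p(p_values), 3))   # about 0.04: jointly, chance looks unlikely
```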
Although a handful of other researchers working with agricultural studies soon used Tippett's method or their own variants of it, scientists in other disciplines did not adopt the approach. Nor did they accept the other methods of combining probabilities constructed by a few avant-garde statisticians working on research in education, psychology, and the social sciences. In 1937 an imaginative biostatistician, William G. Cochran, went off in a different direction and worked out a way of combining the sizes of the effects reported in studies rather than the correlations between treatment and effect; although this approach would become a key technique of meta-analysis, it also attracted little attention.28

But by the 1950s the sciences were growing explosively, and scientists increasingly needed to sum up the proliferating studies in their fields and reconcile their differences. A small but growing cadre of researchers began developing methods to combine the results of studies within medicine, psychology, and sociology.29 By the early 1970s, others were designing methods for aggregating studies of teaching methods, television instruction, and computer-assisted instruction. Robert Rosenthal, a social psychologist at Harvard, was developing a technique for combining the effect sizes of psychological studies at the very same time that Gene V Glass, a professor of education at the University of Colorado, was working out a remarkably similar method of combining studies of the effects of psychotherapy, though neither knew of the other's work.30

Finally came the event generally cited as the beginning of the meta-analysis movement. In April 1976, Glass, then president of the American Educational Research Association, delivered his presidential address at the annual meeting, held that year at the St. Francis Hotel in San Francisco. For this important event, he chose to highlight a new and higher level of scientific analysis to which he gave the name "meta-analysis." Glass, then in his mid-thirties and fully aware of the topic's potential importance, had labored and agonized over the paper for two years, during which, he recently said, "I was a basket case."* As the day of his address neared, he was desperately afraid that the audience of a thousand would either drift away, doze off, or deride his ideas. But Glass, who describes himself as a highly competitive person, stepped cockily to the podium and with seeming self-assurance gave a lucid, witty, and persuasive talk. The audience, recalls psychologist Mary Lee Smith (who was then his wife and had worked with him on the meta-analysis paper), was "blown away by it. There was tremendous excitement about it; people were awestruck." His address, published later that year in The Educational Researcher, was judged by many who read it to be a breakthrough applicable to all sciences.31

The meta-analytic process, briefly sketched in Glass's paper and later spelled out by him and two collaborators, as well as by others, has five basic phases that parallel the phases of conducting a new study. They can be summarized in a few phrases, though the details fill books:32

1. Formulating the problem: Deciding what questions the meta-analyst hopes to answer and what kinds of evidence to examine.
2. Collecting the data: Searching for all studies on the problem by every feasible means.
3. Evaluating the data: Deciding which of the gathered evidence is valid and usable; eliminating studies that do not meet these standards.
4. Synthesizing the data: Using statistical methods, such as the combining of probabilities and the combining of effect sizes, to reconcile and aggregate disparate studies (a minimal example of combining effect sizes is sketched below).
5. Presenting the findings: Reporting the resulting "analysis of analyses" (Glass's phrase) to the wider research community, providing details, data, and methods used.

*Unpublished quotations throughout the text are from personal interviews conducted by the author.
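To make phase 4 concrete, here is a minimal sketch of one standard synthesis calculation: pooling standardized mean differences with a fixed-effect, inverse-variance weighted average. The three studies are hypothetical, and real meta-analyses layer many refinements-random-effects models, quality weighting, checks for publication bias-on top of this core step.

```python
# Illustration of combining effect sizes: a fixed-effect, inverse-variance
# weighted average of standardized mean differences from three invented studies.
from math import sqrt

def d_variance(d, n_treatment, n_control):
    """Approximate sampling variance of a standardized mean difference d."""
    total = n_treatment + n_control
    return total / (n_treatment * n_control) + d * d / (2 * total)

studies = [          # (effect size d, treated n, control n) -- hypothetical numbers
    (0.60, 25, 25),
    (0.30, 40, 42),
    (0.45, 15, 18),
]

weights = [1.0 / d_variance(d, nt, nc) for d, nt, nc in studies]
pooled  = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
std_err = sqrt(1.0 / sum(weights))

print(round(pooled, 2), round(std_err, 2))   # combined effect size and its standard error
```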
That sounds simple; in fact, the process is almost always complicated, tedious, and problematic. "The magnitude of the task cannot be overemphasized in my view," writes Nan Laird of the Harvard School of Public Health.33 Daniel Druckman, an eminent political scientist who devoted three years of his evenings and weekends to single-handedly conducting a meta-analysis, says, "It drove me crazy. Never again!"

Nonetheless, from Glass's 1976 presentation to the present the prestige and practice of meta-analysis has grown steadily-at first slowly, then with increasing speed and spreading from one discipline to another. But not without initially encountering much scornful and even hostile opposition: In an article in American Psychologist in 1978, the distinguished English psychologist H. J. Eysenck contemptuously dismissed Glass's work as "an exercise in mega-silliness"; in 1979 a peer reviewer of a meta-analysis by Harris Cooper, which looked at studies of sex differences in conformity, wrote, "I simply cannot imagine any great contribution to knowledge from combining statistical results"; and when social psychologist Judith Hall gave a seminar on meta-analysis at the Harvard School of Public Health in 1980, to an audience made up primarily of natural scientists, she encountered such acerbic criticism and derision that, she recently said, "If they'd had any rotten tomatoes to throw at me, they would have."34

None of which prevented meta-analysis, an idea whose time had come, from winning converts in first one discipline, then another. Those who carry out meta-analyses, when asked what draws them to such work, offer a variety of motivations. One sees it as the "cutting edge" of science; another says he is "a compulsive data analyst" who has to solve the puzzles presented by disparate findings; a third calls herself a "neatnik" who likes "to bring order out of chaos, to tidy things up"; and nearly all share a desire to discover patterns in the seemingly hopeless jumble of dissimilar findings.

And so meta-analysis gained currency, with books on its methodology appearing in the 1980s and a few top-notch statisticians, among them Frederick Mosteller of Harvard, Ingram Olkin of Stanford, and Larry Hedges of the University of Chicago, refining the techniques. The best indicator of its success may be the frequency of its appearance in scientific journals. At first, editors, unsure that the method was either scientifically legitimate or a true contribution to knowledge, were reluctant to publish meta-analyses. But each year they published more than the year before. In 1977, five major data bases, ERIC (education), PsycINFO, Scisearch, SOCIAL SCISEARCH, and MEDLINE, had zero listings of meta-analyses, but in 1984 they had a total of 108, in 1987, 191, and in 1994, 347, by which time the grand total in those five data bases was 3,444.35 Recently, some journal editors have even plaintively asked contributors to ease up on submissions of meta-analyses.
A Sampler of Meta-Analytic Achievements

What has this flood of meta-analyses yielded? Not every effort has produced important findings; indeed, many have added little to the world's inventory of scientific knowledge. But a fair share have made important contributions and resolved long-standing uncertainties. In a number of cases, the validity of meta-analytic conclusions has been confirmed by later massive studies or clinical trials-which suggests that the latter were more or less unnecessary. Some of the more noteworthy cases are presented in chapters 2 through 6, but here is a handful of others in capsule form:

• Coronary artery bypass surgery to treat ischemic heart disease has been practiced for twenty-five years, but clinical studies of varied design have yielded widely divergent conclusions about its ability to reduce mortality as compared to medical treatment. There has even been some question as to whether bypass surgery, though it improves quality of life, has any life-extending value. A 1994 meta-analysis, collaboratively conducted by a dozen institutions in five countries, combined seven major trials that compared bypass surgery with medical therapy. It found that five years after treatment, the mortality rate of bypass patients was 10.2 percent while that of medically treated patients was 15.8 percent-half again as high-and that there was a comparable advantage for bypass patients at seven and ten years.36

• In the 1980s, calcium channel blockers were among the most commonly used drugs for acute myocardial infarction (heart attack), unstable angina, and certain other cardiovascular conditions. The National Heart, Lung, and Blood Institute and the Bowman-Gray School of Medicine cosponsored a meta-analysis of twenty-eight disparate studies. Its surprising conclusion, published in 1987, was that "calcium channel blockers do not reduce the risk of initial or recurrent infarction or death when given routinely to patients with acute myocardial infarction or unstable angina."37

• Between 1965 and 1980 at least fifty clinical trials sought to determine whether there was any benefit in giving preventive antibiotics to patients about to undergo colon surgery, rather than merely requiring the standard bowel-cleansing procedures. The clinical trials reported confusingly discrepant infection and mortality rates; a few even indicated better results for patients not treated with antibiotics. A meta-analysis conducted by a team at New York's Mount Sinai School of Medicine and published in 1981 clarified the issue: Combining the results of twenty-six trials that met the standards for meta-analysis, it showed
that antibiotic therapy reduced infection rates from 36 percent to 22 percent, and death rates from 11.2 percent to 4.5 percent.38

• In 1977, cimetidine, an H2 blocker, dramatically changed the preferred treatment of peptic ulcers from surgery to pharmaceutical therapy, and over the next fifteen years two other H2 blockers, famotidine and ranitidine, were introduced. Clinical studies differed as to which drug yielded the best results. A 1993 meta-analysis of sixteen trials directly comparing the drugs showed that famotidine taken at bedtime had significantly higher healing rates than either cimetidine or ranitidine and that the three did not differ significantly in terms of adverse reactions.39

• For many years researchers have debated whether chlorination of drinking water, which prevents many infectious diseases, is carcinogenic. Studies provided contradictory findings and the issue long remained unsettled. In 1992, however, a team at the Medical College of Wisconsin in Milwaukee meta-analyzed ten studies and reported that chlorination is correlated with slightly higher rates of rectal and bladder cancer but that "the potential health risks of microbial contamination of drinking water greatly exceed the risks" of the two cancers. The team also pointed out that the ten meta-analyzed studies were conducted in the 1970s; since then, federal standards have lowered the permissible level of chlorination, and the risk of the two cancers may now be lower.40

• Is intelligence related to the innate quickness of the individual's brain when reacting to external stimuli? The answer would cast light on the long-debated issue of the extent to which intelligence is determined by heredity rather than by experience and social influences. A good index of innate, unlearned mental speed is "inspection time" (IT), commonly measured by flashing two lines of different lengths on a screen immediately followed by a pattern to overcome the brief residual image in memory. An individual's IT is defined as the minimum time of exposure he or she needs to reliably discriminate between the two lines. Dozens of studies have yielded a mish-mash of answers, but a recent meta-analysis by psychologists John Kranzler and Arthur Jensen of the University of California-Berkeley found a strong negative correlation of about -.54 between IT and adult general IQ; that is, the longer the IT, the lower the individual's IQ.41

• Does alcohol cause aggressive behavior? Most studies have provided only correlational evidence-that is, alcohol and aggression tend to co-occur-but while correlation suggests some kind of link between the two, it does not prove causality; something else may cause the correlation.
The amount of time spent watching television, for instance, correlates with ill health, but TV watching does not itself impair health; sicker people watch more because they are less able to do other things.42 Or, back to our example, it might be the case that people who lack the ability to control their impulses are more likely both to drink and to behave aggressively. Experimental studies have yielded evidence for several competing causal explanations of the alcohol-aggression correlation. Among them: alcohol causes cognitive and emotional changes resulting in aggression; drinkers deliberately use alcohol so as to have an excuse for aggressive behavior; alcohol psychologically disinhibits persons who are predisposed to aggression; alcohol directly and physiologically causes aggression by anesthetizing the part of the brain that normally prevents such responses. Brad Bushman and Harris Cooper of the University of Missouri meta-analyzed thirty experimental studies in which the behavior of different kinds of drinkers was observed either after drinking alcohol, after drinking a placebo they thought was alcohol, or after drinking nothing. The meta-analysis revealed little or no support for any single causal theory but, synthesizing the results of the studies, Bushman and Cooper did conclude that alcohol definitely causes aggression, possibly through a combination of some of the hypothesized causal factors.43

• The extent to which violence on television stimulates aggressive, antisocial, or delinquent behavior has been a matter of controversy for over three decades. More than two hundred studies have yielded an array of answers; over the years, that lack of agreement has undoubtedly strengthened the hand of television programmers and weakened that of government regulators. A meta-analysis recently conducted for the National Research Council finally furnished a definitive answer: Viewers are more apt to commit aggressive or antisocial acts after seeing violence on TV (particularly violent erotica), the most common kind being physical violence against another person.44

• As mentioned earlier, the findings of studies of the prevalence of alcohol, drug, and mental health problems among the homeless have been extraordinarily inconsistent, ranging from 4 percent to 86 percent for alcohol problems, 2 percent to 90 percent for drug problems, and 1 percent to 70 percent for mental health problems. A meta-analysis by Anthony Lehman of the University of Maryland and David Cordray of Vanderbilt University made sense of these vast discrepancies, their synthesized findings being that 28 percent of the homeless have
alcohol abuse problems, 10 percent have drug abuse problems, anywhere from 23 to 49 percent have mental disorders (depending on the category of disorder), and 11 percent have various combinations of the three. The figures represent the current prevalence of these problems among the homeless; far larger numbers have had such problems at some time in the past and presumably could have them again.45

• Fluoxetine (Prozac) came on the market in 1987, swiftly became the most prescribed antidepressant in the United States, and was hailed by the media as a wonder drug. Scores of studies said that it was far more effective than the tricyclics, the previous standard antidepressants. A team composed of researchers from the State University of New York at Syracuse and several other institutions carried out two meta-analyses of studies comparing the effects of Prozac with those of tricyclics and placebos under double-blind conditions (that is, with neither patient nor physician knowing what the patient was being given). The meta-analytic findings were illuminating: In many ostensibly double-blind trials, Prozac has less noticeable side-effects than tricyclics, which tend to cause dry mouth, blurry vision, and other obvious symptoms; as a result, patients and doctors were often able to guess correctly when Prozac was being administered and wishfully overevaluated its antidepressant effect. Correcting for this error and aggregating the results, the team found that Prozac was only a half to a quarter as effective as previously reported and no better than the tricyclics, except for the diminished side-effects.46

• A staggering amount of research is churned out annually on agricultural issues-far more than growers or agricultural administrators can master. An example: In 1992 alone, the National Agricultural Library added 363 new research items about strawberries. To demonstrate that meta-analysis can present both growers and administrators with easily comprehensible summary findings, Douglas Shaw, a professor of agriculture at the University of California-Davis, and statistician Ingram Olkin of Stanford University meta-analyzed a group of studies of chemical control and a group of studies of biological control that focused on the important strawberry pest Tetranychus urticae (a spider mite). The original studies in each group differed in their findings, but the meta-analysis clarified the matter: Although biological controls had a statistically significant effect, chemical controls were nearly four times as effective in terms of increased strawberry yield. The meta-analysis did not look at other benefits and harms that might be associated with yield.47
The Value of Meta-Analysis

Despite the track record of meta-analysis, a number of scientists still scorn and belittle it. Some deride it as "garbage in, garbage out," arguing that combining studies, even using fancy statistics, merges trashy research with sound research and therefore degrades the whole exercise. Others say meta-analysis "crowds out wisdom," since assistants, who lack the senior researchers' knowledge and mature judgment, usually do the tedious work of evaluating and compiling the data; the senior researchers then meta-analyze the data as confidently as if they had compiled it themselves. Still others see meta-analysis as a fancy set of techniques for achieving ever greater precision in answering questions of possibly dubious merit.

David Sohn, a psychologist at the University of North Carolina, is even more caustic and rejecting. Asserting in a recent issue of American Psychologist that primary research is the only valid method of making discoveries, he ridicules the claim that meta-analysis is a superior mechanism of discovery:

It is not reasonable to suppose that the truth about nature can be discovered by a study of the research literature.... Meta-analytic writers have created the impression, with a farcical portrayal of the scientific process, that the process of arriving at truth is mediated by a literature review.... After some critical mass of findings has been gathered, someone decides to see what all of the findings mean by doing a literature review, and thereby knowledge is finally established.48
Such is the minority view, however. The majority, as already documented, see meta-analysis as an important, even historic, advance in science. Below are a few of the major benefits that meta-analysis is widely agreed to yield:

• Physicians can now make decisions as to the use of therapies or diagnostic procedures on the basis of a single article that synthesizes the findings of tens, scores, or hundreds of clinical studies.

• Scientists in every field can similarly gain a coherent view of the central reality behind the multifarious and often discordant findings of research in their areas.

• Meta-analysis of a series of small clinical trials of a new therapy often yields a finding on the basis of which physicians can confidently begin using it without waiting long years for a massive trial to be conducted.

• In every science, meta-analysis can generally synthesize differing results, but when it cannot, it can often identify the moderator and mediator variables (about which more later) that account for the irreconcilable differences. By so doing, the meta-analysis identifies the precise areas in which future research is needed, a function of considerable value to science.

• On the pragmatic level, meta-analyses of a wide range of social problems have profound implications for social policy; their findings about such issues as the value of job training for the unemployed, the effects of drinking-age laws, and the rehabilitation of juvenile delinquents offer policy-makers easily assimilated syntheses of bodies of research they have neither the time nor the training to evaluate on their own.
Whether one sees meta-analysis as a set of recondite techniques for getting precise answers to irrelevant questions or as an epochal advance in scientific methodology, it unquestionably has come to occupy a major place in contemporary scientific research. Yet it is not itself a science and does not embody any scientific theory. It is, rather, a method or group of methods by means of which scientists can recognize order in what had looked like disorder. As Ingram Olkin writes, "I like to think of the meta-analytic process as similar to being in a helicopter. On the ground individual trees are visible with high resolution. This resolution diminishes as the helicopter rises, and in its place we begin to see patterns not visible from the ground."49 To use a different image, meta-analysis is a tool used by scientists. Far from being a belittling characterization, this is high praise, for as the illustrious physicist Freeman Dyson, a member of the Institute for Advanced Study at Princeton, recently commented, "The great advances in science usually result from new tools rather than from new doctrines."50
Chapter 2
Settling Doubts About Psychotherapy

Attack, Counterattack, and Stalemate
By 1952, the "talking cure"-psychoanalysis and related forms of psychotherapy-was nearly six decades old, highly esteemed by the avant-garde and intelligentsia, and rapidly growing in popularity.1 Its healing power had been proclaimed by such distinguished intellectuals as Andre Breton, Thomas Mann, and Arthur Koestler and popularized by Moss Hart in Lady in the Dark, a success on Broadway in 1941 and later on the screen. New forms of dynamic psychotherapy, related to psychoanalysis but briefer and less costly, appeared almost yearly, and, mirabile dictu, nearly all were said by their practitioners to benefit a high proportion of patients-usually two-thirds, but sometimes more than nine-tenths.2 Psychotherapy, which the medical establishment long resisted, seemed to have broken through and overrun America.

But in 1952 there came a sudden violent counterattack. An article in the American Psychological Association's Journal of Consulting and Clinical Psychology compared the results of nineteen studies of neurotic patients treated by psychotherapy with those of two samples of neurotics who received no treatment and delivered a shocking verdict: "Roughly two-thirds of a group of neurotic patients will recover or improve to a marked extent within about two years of the onset of their illness, whether they are treated by means of psychotherapy or not. The figures fail to show any favorable effects of psychotherapy."3

Who would so boldly attack the cherished belief in an established healing art-and do so in an official organ of the psychological establishment? He was H. J. Eysenck, 36, a German-born British psychologist on the staff of the Institute of Psychiatry, University of London.4 He had early developed an antipathy toward Freudian psychotherapy and its offshoots, as had some other British psychologists, but where they were soft-spoken about their reservations, Eysenck was, by nature, the very opposite. A tall, well-built, strong-featured young man, affable and mannerly in person, he was-and still is-intellectually a brawler and
dissident; indeed, he titled a recent autobiographical essay "Maverick Psychologist" and a full-length autobiography Rebel with a Cause. As a schoolboy in Germany, Eysenck had been headstrong, flippant, and outspoken, infuriating a key teacher by describing the esteemed Frederick the Great as "an autocratic, warmongering homosexual." In his teens he held socialist views and, although not Jewish, so detested Hitler and Nazism that at age eighteen, when Hitler became chancellor, he chose exile, leaving his parental home and settling in England, where he studied psychology at the University of London. At the time there were two main schools of psychology in England, the psychometrists (measurers of individual differences) and the experimentalists; Eysenck found each of them too narrow and, always blunt and forthright, made himself persona non grata with the leaders of both. Nonetheless, thanks to his considerable research abilities, he was beginning to advance up the academic ladder when he suddenly became famous (or infamous) as a result of his 1952 denigration of psychotherapy.

"The sky fell in," Eysenck has written of the reaction to his article. "I immediately made enemies of Freudians, of psychotherapists, and of the great majority of clinical psychologists and their students."5 Within a short time, many replies ranging from the icily analytical to the hotly abusive appeared in professional journals. Of the numerous flaws their authors found in Eysenck's article, two were predominant. First, his recovery figures for untreated patients were based on the release rate of psychoneurotic patients from state mental hospitals in the United States and on Equitable Life Assurance disability claims filed by neurotics who had been treated only by general practitioners; neither of these, Eysenck's critics said, could legitimately be compared with the patients in the nineteen studies of treated neurotics. Second, although the nineteen dealt with many dissimilar patient groups and treatment methods, Eysenck had lumped all 7,293 patients together and offered a single overall figure (64 percent) as the cured-or-improved rate.6

Eysenck, of course, counterattacked; the other side replied; and for the next twenty-five years the defenders and critics of psychotherapy carried on a polemic and often nasty debate. The latter included not only Eysenck and certain sympathizers but the behavioral therapists, a new breed who, from 1958 on, derided the psychodynamic therapies and said they were ineffective compared with "desensitization" methods, derived from Pavlovian experiments in conditioning and deconditioning. Although the attacks on Eysenck's 1952 study pointed out its defects, Eysenck had put the burden on therapists to prove that the talking cure and its variants did work. In the fifteen years following the publication
of his paper, hundreds of "outcome studies" were conducted, some of them rigorously controlled. (In a controlled study, researchers compare treated patients with "controls"-untreated persons who suffer from the same kind of ailment and are drawn from the same population, such as the same hospital or clinic.) By the early 1970s, a few researchers began collecting some of this mass of material and making statements about the overall weight of evidence. One review summarized the findings of twenty-three controlled studies and concluded that, in general, therapy was effective. Another and more comprehensive review reported that for both individual and group therapy, about 80 percent of controlled studies showed mainly positive results.7 Other reviews came to similar conclusions. That might seem like enough to settle the matter, but these analyses also failed to resolve the conflict because they relied on a common faulty premise, asking, What do the majority of studies say? If most found that psychotherapy worked, it worked; if not, it failed. This method, known to statisticians as "vote-counting," may seem eminently reasonable but statistically is about as water-tight as a sieve.
Vote-Counting-A Plausible but Unreliable Way to Sum Up Research
A 1971 article in the Harvard Educational Review on methods of "accumulating evidence" said that of four general approaches to combining studies, vote-counting was "the best and most systematic."8 Better yet, it was quick and simple: In effect, the researcher had only to divide a heap of studies of some treatment or experimental condition into two piles, those showing that it worked and those showing that it did not, the bigger pile being the winner. The method is intuitively so appealing that it is still used: a recent review of forty-three studies of yo-yo dieting in the Journal of the American Medical Association concluded, "The majority of studies do not support an adverse effect of weight cycling on metabolism ... [or] find a higher prevalence of unfavorable fat distribution among weight cyclers."9

But by the 1970s, researchers who wanted to combine the evidence of diverse studies, especially of psychotherapeutic outcomes, were becoming more knowledgeable about statistics and recognized that this simple method would not do. "It doesn't seem to be all wrong," says Frederick Mosteller, "and it seems to be unbiased, but it's inefficient; it doesn't use the data well."
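In code, the vote-count amounts to nothing more than sorting studies into two piles and comparing their sizes. A minimal sketch, with invented study verdicts, makes both the simplicity and the poverty of the method plain:

```python
# Vote-counting in miniature: each (hypothetical) study is reduced to a single
# verdict -- did the treatment "work" or not -- and the bigger pile wins.
# Sample sizes and the strength of each result are ignored entirely.
study_outcomes = [True, True, False, True, False, False, True, False, False]  # invented

positives = sum(study_outcomes)
negatives = len(study_outcomes) - positives

verdict = "treatment works" if positives > negatives else "treatment does not work"
print(f"{positives} positive vs. {negatives} negative studies: {verdict}")
```

The flaws of this procedure, discussed next, all stem from how much information is thrown away in that single true-or-false reduction.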
For one thing, in vote-counting every study counts as much as every other, even though one might be based on twenty cases, another on two thousand. Common sense as well as elementary statistical theory tells us that we cannot have as much confidence in the findings of a small sample as in those of a large one, since the likelihood of sampling error is higher for small samples. A simple thought experiment will make this clear. Imagine that you have a large sack of marbles, some red and some white but in an unknown proportion. You reach in without looking and bring out a small handful of six marbles; four are red, two are white. You reach in again, this time with both hands, and scoop up a batch of thirty marbles; sixteen are red, fourteen are white. Which ratio is more likely to be correct or nearly so? It takes no mathematical expertise to sense that 16:14 is closer to the mark than 4:2.

A second major flaw is that vote-counting does not take into account the varying strengths of results across different studies. One study might show that twenty-six patients benefited more from a new kind of therapy than they would have been expected to from traditional therapy, while twenty-four did not; that would put it in the positive pile. Another might show that twenty benefited more, thirty did not; that would put it in the negative pile. The problem, obviously, is that the second study reveals a more strongly negative effect than the first study does a positive one, but the vote-count overlooks this fact.

This last flaw had long been correctable, however. Since early in the century, statisticians had known how to evaluate the findings in terms of "statistical significance." Very small differences are nonsignificant (that is, sampling error or chance cannot be ruled out as the cause), while larger differences are significant (that is, it is implausible that all of the result is due to chance). Researchers can rather easily calculate the probability that the results of an experiment occurred only by chance, a probability designated as "p". The element of chance is related to the number of trials or tests involved. For example, suppose a magician claims he can mentally control how a flipped coin will land and in a simple trial produces heads each time in a series of three flips. Is it impossible that the result was due to chance? Not at all; as everyone knows, there is a one-in-two chance that any flip will land heads, a one-in-four chance that two successive flips will both be heads, and a one-in-eight chance that three successive flips will all be heads. Flipping heads three times consecutively is not significant because it will happen by chance once every eight attempts (p < .13); flipping five heads consecutively is, by contrast, a much rarer occurrence because it
happens by chance only one in thirty-two times (p < .032; that is, there is only about a 3 percent likelihood that any such series of heads happened by chance). Among scientific researchers, the usual boundary between significance and nonsignificance is p < .05, meaning that there is less than a 5 percent possibility that the result is due to chance.* This limit is based on tradition rather than on any statistical arcanum; in 1926 it simply seemed a handy number to the leading statistician Ronald Fisher, who wrote, "It is convenient to draw the line [of significance] at about the level at which we can say: 'Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.'"10 Researchers are content when their findings have a p < .05, better pleased when they reach a p < .01, and delighted by a p < .001, signifying that their results would occur by chance only once in a thousand times.

By means of significance testing, researchers can refine vote-counting: Rather than saying merely how many studies have positive or negative findings, they can say how many revealed statistically significant positive findings, how many were not statistically significant (and therefore do not count either way), and how many revealed significant negative findings.

Significance testing represented a great improvement: if p < .05, the researcher knows that there is less than 1 chance in 20 that the observed difference between the experimental and control groups occurred by chance. Another way to think about it is that the true underlying means of the two groups do differ. But estimating the exact value of the difference between two means, or for that matter any population value based on a sample, is always just that, an estimate. Any such "point estimate" may, due to sampling error, be somewhat larger or smaller than the actual difference; in fact, the point estimate is the meta-analyst's best guess of the value at which the chances are even that the real value is either higher or lower. How much higher or lower? The range of possibilities is known as the "confidence interval." The smaller the sample, the wider the confidence interval. With a larger sample, the confidence interval
* Statisticians will find this explanation oversimplified. A more precise one (which readers can skip without harm): The test of significance in research starts with a "null hypothesis" that the results are exactly equal across all conditions-a hypothesis the researcher hopes to prove false. P is the probability of obtaining results at least as extreme as those observed if the null hypothesis is true; if p is .05 or less, the researcher can reject it and attribute the results to other causes. See discussion by Becker in Cooper and Hedges (1994), p. 217.
is narrower and researchers can be more confident that the true value is closer to their finding. Thus, even when a vote-count is based on statistical significance, if some of the studies are small it yields only a fuzzy, inconclusive verdict.

Nor is vote-counting statistically "powerful." If a treatment produces only a small positive effect, vote-counting will fail to spot it if most of the studies have small samples; the effects are lost in the uncertainties of their wide confidence intervals. The cumulated results, therefore, may fail to reveal that most of the studies show a positive effect; the overview sees only the few large positive results and accordingly reaches the false negative conclusion that the treatment generally does not work.11

Finally, vote-counting does not measure the size of the effect in the studies. If sample sizes are large, it may correctly conclude that taken together the studies reveal a statistically significant positive effect but fail to show how great the average effect is; the benefit of the treatment might be trivial. As one authoritative manual of meta-analysis puts it, "To know that televised instruction beats traditional classroom instruction in twenty-five of thirty studies-if, in fact, it does-is not to know whether television wins by a nose or in a walkaway."12

All of which became academic in 1976, when Gene Glass, in his presentation of the concept of meta-analysis, sketched a far better method of combining the results of studies.
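The coin-flip arithmetic behind these p values is easy to check directly. The sketch below is a worked illustration only, not drawn from any of the studies discussed; it computes the chance of a run of heads and compares it with the conventional .05 threshold:

```python
# Probability that a fair coin lands heads k times in a row, and whether that
# run would count as "significant" at the conventional p < .05 boundary.
for k in (3, 5):
    p = 0.5 ** k  # one-in-eight for three heads, one-in-thirty-two for five
    print(f"{k} heads in a row: p = {p:.3f} -> significant at .05? {p < 0.05}")
```

Three heads (p = .125) falls short of the boundary; five heads (p = .031) crosses it, just as described above.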
Genesis of Meta-Analysis: Part I

Anger at H. J. Eysenck rather than dispassionate scientific interest drove Gene Glass to devise the basic methodology of meta-analysis. "I demonized Eysenck in my fantasies," Glass told me when I visited him not long ago. "I revelled in the thought of really destroying him."13 He says such things with a chuckle, a twinkle in his blue eyes, and a broad smile that produces a dimple in his round, cherubic face. Glass laughs easily, talks like a whirlwind, uses elegant technical language but tosses in an occasional Bronx cheer or an expletive like "Nuts!" to show disdain. He appears genuinely good-natured although he says he is "extremely competitive," "self-promoting," and "a Type-A personality." He is definitely not a good person to anger.

Glass came across Eysenck's work in his early thirties, when he was training to practice psychotherapy. He himself had been in therapy for a decade, though he is utterly unlike the Woody Allen image of the neurotic New York intellectual in perpetual therapy. Born in Lincoln, Nebraska, in 1940, Glass had a Midwestern upbringing and in his teens was
a high-school jock addicted to sports; he claims that he never read a book and graduated from high school at the bottom of his class. In view of his subsequent academic record, this must be hyperbole: He was graduated cum laude from the University of Nebraska, earned a Ph.D. in psychometrics and statistics at the University of Wisconsin in a mere three years, was immediately appointed an assistant professor at the University of Illinois, and within five years, at age thirty, was a full professor of educational psychology, specializing in statistical methods, at the University of Colorado. Clearly, a driven man, but in pain. "At twenty-four," he says, "as I was getting my doctorate, the torments and conflicts I was in spurred me into psychotherapy for the first time." It helped him so much that he stayed in treatment year after year; later, in his early thirties when his first marriage broke up, he returned to it and, he says, "it was especially important to me and got me through tough times."

Like many people who have benefited from psychotherapy, he began to think of practicing it himself. A psychologist on the faculty gave him some training and directed his reading. Among the books he studied was The Evaluation of Psychotherapeutic Outcomes, edited by Allen Bergin and Sol Garfield, and although much of the material took a positive view of psychotherapy, the book included a chapter by Eysenck that considerably expanded upon his 1952 therapy-bashing paper. Glass vividly remembers his reactions of more than twenty years ago:

I was trained as a methodologist and statistician, and I was just outraged. It had no serious substance or content to it, nothing but rhetorical games and Jesuitical spinning out of specious arguments. He did unbelievable things with the data. My training was in statistics but my real interest was in psychotherapy and psychoanalysis, and I had to respond. Just about then, in 1974, I became president-elect of the American Educational Research Association (AERA) and I knew that in two years I would have to give a presidential address. I put these two things together and came up with the meta-analysis stuff out of my interest in refuting Eysenck, and decided it would be the topic of my address.
The basic method, though not how it would be implemented, was obvious to him from the start: It would be "research of research"-a translation into higher-level terms "of exactly what you'd do in a single research project where humans are the subject." Specifically, (1) formulate the hypothesis (already done in this case), (2) define the "population" being studied (research studies of psychotherapy outcomes), (3) sample it, (4) evaluate the items in the sample and discard what was un-
trustworthy, (5) statistically combine the findings of the qualifying items, and (6) report the method and results.

Glass had recently remarried, and his wife, Mary Lee Smith, was as professionally motivated as he to see Eysenck debunked. A tall, handsome blonde, as reserved and cool as Glass is exuberant and intense, she had been trained in counseling psychology and done student counseling-essentially, psychotherapy-at the University of Colorado in Denver for four years and later at the Boulder campus. "The analyses by Eysenck and others of that viewpoint were very distasteful and unfair," she says. "I didn't know how Gene's project would come out, but if there were positive evidence about the benefits of psychotherapy, that would settle a lot of disputes and make a big contribution. Even if the evidence were negative and Eysenck seemed to be right, it wouldn't change my personal underlying belief in the effectiveness of psychotherapy-I'd been through it, I practiced it, and I knew it works. Still, I'd keep a very detached and objective stance toward the outcomes evidence and we'd report whatever we found." Originally meaning only to assist Glass, she ended up sharing the work, later co-authored a longer treatment of the subject with him, and still later co-authored the first manual of meta-analysis.

The first order of business for Glass and Smith was to locate and collect the studies to be combined. Glass's guiding principle was to cast a wide net, sampling all sorts of studies on the outcomes of psychotherapy and counseling, since any advance decision about what kind to collect would bias the findings. He would winnow out the unsuitable ones later. Glass and Smith spent countless hours in the library, painstakingly poring through many annual volumes of two ponderous, small-print indexes, Psychological Abstracts and Dissertation Abstracts, in search of promising titles. Each time they found one, they copied the citation by hand (in 1974 the indexes had not yet been converted to searchable computer data bases). Next, they prowled the stacks to find the bound volumes containing the articles and scrutinized their bibliographies for potentially useful titles not listed in the two indexes; this is often called the "ancestry method" of adding to a collection of studies. The whole process sounds dreadfully tiresome, but as every scholar and researcher knows, the tedium is offset by a burst of exhilaration at every find; it is somewhat like panning for gold and every now and then coming up with a nugget. Glass actually found the library search exciting, he says, because with his intensely competitive nature and his goal of bashing Eysenck, "It had the thrill of battle and competition."
Eventually, he and Mary Lee Smith amassed a thousand titles-a sample of a wholly different magnitude from any review of psychotherapy up to that point. As they constructed the sample, they also undertook the monumental chore of obtaining physical copies of the articles. They photocopied scores of articles in the library, a dreary, fatiguing job, wrote hundreds of letters to authors for reprints and to library services for microfilm copies of unpublished dissertations, and hunted for and bought books. The documents, books, and 150 boxes of microfilms and microfiches all wound up in crammed file cabinets and in heaps in a large office in the home they had just built in Boulder. The third phase, evaluating the materials and extracting the data, went on concurrently with the continuing search. Even though every title Glass and Smith jotted down sounded promising, when they actually got hold of the documents and read them it was disheartening to find that about half of them were useless for meta-analytic purposes because of the lack of any control or comparison group. The same was true of dissertations. Smith: "We'd have a box of ten microfilms come in-for which we'd paid good money-and, always hoping to find something wonderful, I'd put each one in the microfilm reader only to find that three or four, contrary to what the abstract said, weren't comparative studies at all and another one or two didn't have any quantitative data. So all those would be out. I developed a very jaundiced view of the state of research reporting." They then carefully read every study that did have a control or comparison group and weeded out all those that lacked usable data. Many studies reported only a p value, or, even worse, offered a purely qualitative (and hence useless) statement such as, "Although treatment effects did not reach the level of significance, the majority, compared with the control group, showed a tendency toward positive effects." Other studies gave a variety of statistical data-such as ratios of one group to the other-but no actual numbers; these, however, Glass could transform into numbers that could be analyzed. "It was hammer-and-tongs statistical work," he says. I dug up, invented, or reinvented about every technique you can think of to turn some of their esoteric reporting into usable measures of the impact of therapy. But sometimes there was just no way to use their results. Carl Rogers, whose work I admire, published a whole book on his client-centered therapy, but there was not one quantitative report in it except that results were significant at the 5 percent level. It killed me! I couldn't use it, because the differences between the treated group and the control group, though they were significant-they hadn't occurred by chance-could have been niggling and of no importance, and I had no way to tell.
Next, the information in the 375 studies that survived the screening to this point had to be numerically coded so that the computer could understand the information and analyze it. Every piece of information that would play a part in the meta-analysis had to be entered by hand, in the form of a number, on a three-page form; later, employees of a computer service would punch holes in IBM cards to represent those codes-a method that sounds antediluvian today-and an IBM machine would then mechanically read the holes in the cards and transfer the data onto a tape that the computer could read. A number from 0 to 5, for instance, signified the amount of training the therapist had had (0 for a lay counselor, 3 for a Ph.D. candidate in psychology, 5 for a well-known Ph.D. or psychiatrist, and so on); 1 to 5 designated the kind of therapy used, or none if none was used; 1 to 8 specified the outcome measured (anxiety level, self-concept, school or work achievement, blood pressure or other physiological indicators, and so on).

Both Glass and Smith did the coding, and since many of the choices in assigning numbers were somewhat subjective, they took time at several points to check that they were coding alike: Each coded the same subsample of studies, and they then compared their codings. This, a common procedure in many kinds of psychological and social research, is known as checking on "intercoder agreement" or "interrater reliability." Glass and Smith found their codings gratifyingly similar-perhaps because they were both therapists, or perhaps because, as husband and wife, they shared so much of their thinking and daily experience. Shared, indeed, to an extreme-they worked much of the time at the same large table in their home office and discussed at breakfast, lunch, and dinner, and almost any other time when they were together, the incoming material and all the manifold problems of working out the first full-dress meta-analysis in history. Such an intense working partnership has wrecked some marriages, but that was not the case here. "Far from causing any friction," Glass says, "it was sort of a high, really." The marriage did dissolve some years later, but for reasons unrelated to their working together.

The crucial step in all this was evaluating the outcomes in the different studies in order to have a basis for combining them. Glass, a proficient and advanced statistician since his graduate school days, regarded the method of evaluation used in almost all psychological research-the measure of statistical significance-as worth little, especially when trying to synthesize studies. "It's ridiculous, this business of combining the p values of studies," he says. "Statistical significance is the least interesting thing about the results. You should describe the results in terms
of measures of magnitude-not just, does a treatment affect people, but how much does it affect them? That's what we needed to know." Their sample did have a wealth of data on how greatly treatment had affected clients; in fact, Smith and Glass found a total of 833 effect sizes (measures of change) in the 375 studies. Some have been listed above; others included the degree of neuroticism before and after treatment as measured by the MMPI (the Minnesota Multiphasic Personality Inventory, a standard test), palmar sweating, the clients' own appraisals of how they felt before and after therapy, the therapists' appraisals of their clients' condition before and after therapy, and many more. But does it make sense to try to combine such different measures of change? Is it legitimate to combine palmar sweating and success at work? "Mixing different outcomes together is defensible," Glass and Smith wrote later. "All outcome measures are more or less related to 'well-being,' and so at a general level are comparable."14 Yes, but how can one combine such diverse and differently measured results and arrive at a combined measure of change? What possible statistical legerdemain could combine a $20 raise on the job, a ten-point decrease in blood pressure, a client's own statement that he felt much more at ease in social situations, and any of the hundreds of other outcomes cited, into a single, synthesized measure of effect?

Glass's solution was to "standardize" the many different measures of effect: transform them into common coin, or the same statistical terms, so that they could be added, averaged, divided, or otherwise manipulated. He did so by converting the measures of effect from their reported form to a common form, namely, how many "standard deviation units" separated the treated group from the control group.

Standard deviation (sd) is not a difficult concept. Within any group, say, all first-grade boys in a city school system, there is a distribution of IQs: These scores, plotted on a graph, take the form of the familiar bell curve, the high center being the most common IQ, 100, with the curve sloping down on both sides. Similarly, the heights of the boys would also take the form of a bell curve, though it might have a taller and narrower shape than the boys' IQ curve, since heights at any age do not vary as widely as IQs. The sd of any group is simply a measure of the spread on either side of the mean (central) score in the distribution: It is a measure of how far from the mean one must go to encompass about a third of all the scores. Some curves are tall and narrow, like that for boys' heights, and have a small sd, while others are more spread out, like that of IQs, and have a larger sd. But in any case, sd units, like percentages, are a standardized and combinable "metric."
Glass's breakthrough was to turn his hundreds of incompatible measures of effect into compatible and combinable sd units. To understand his method, consider the following three hypothetical studies testing whether psychotherapy improves "well-being." In each there is a treated group and a control group, but the studies measure the effect of therapy in different ways: Study A presents the psychologist's assessment of the client's functioning; study B uses the client's own rating of life satisfaction; and study C uses the client's job performance as rated by a supervisor. Figure 2-1 depicts the distribution of scores for the treated clients and the controls in each study. The vertical lines mark the centers (means) of each distribution.

Can we average the results from the three studies to get an overall sense of the efficacy of therapy? Not yet. Nor can we say that therapy had more effect on patients' self-ratings than on psychologists' ratings, because the psychologists' ratings were made on a one-hundred-point scale and the self-ratings on a seven-point scale. Thus, while a fifteen-point difference on the psychologists' scale might look bigger than a two-point difference on the self-rating scale, it may actually be smaller.

By turning the different units of measurement into a common metric (sd units), we can overcome problems of incomparability. Suppose that in study A, about one-third of the controls had scores ranging from forty to sixty; that difference of twenty points represents roughly one sd. Next, suppose that the psychologists' mean rating for the control group is forty and for the treated group is forty-eight; to express that difference in sd terms, we divide eight by twenty-the treated group is 8/20ths of an sd better than the control group; in other words, therapy resulted in an average difference of .40 sd. In study B, suppose that about a third of the control subjects have scores between three and five, meaning that the sd is two. Suppose, too, that the average self-rating of the controls is 3.00 (on a scale of zero to seven) and of the treated group is 4.70. The groups are 1.70 scale units apart; in terms of sd units-1.70 divided by 2.00-the difference is .85 sd. In study C, there is no difference between the control group and the treated group; the difference in sd units is zero.

Now the results of all three studies have been translated into a common metric. Glass called this metric the "effect size" because it is a way of expressing the effects that treatment had on scores; it is represented in statistics as "d". Using his method, it is now possible to compare the effect sizes of the three studies (in study A, d = .40; in study B, d = .85; and in study C, d = 0). To combine the results of the three studies, take the average of the effect sizes across the studies-add the effect sizes, then divide by three-which is d = .417. In other words, the well-being rating of the average treated client was about .42 sd higher than that of the average untreated control.
[Figure 2-1. Three Hypothetical Studies of Psychotherapy. Study A (d = .40): distributions of psychologist ratings on a 0-100 scale for the therapy and untreated groups, with means of 48 and 40. Study B (d = .85): self-ratings on a 0-7 scale, with means of 4.7 and 3. Study C (d = 0): supervisor ratings, therapy and untreated groups identical. Note: The letter d stands for effect size.]
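For readers who like to see the arithmetic spelled out, here is a minimal sketch of the standardization just described, using the three hypothetical studies above. The means and standard deviations are those given in the text, except for Study C, where the text specifies only that the two groups were identical, so its numbers are invented:

```python
# Effect size as Glass defined it: the difference between the treated-group
# and control-group means, expressed in control-group standard deviation units.
def effect_size(mean_treated, mean_control, sd_control):
    return (mean_treated - mean_control) / sd_control

# The three hypothetical studies described above.
studies = [
    effect_size(48.0, 40.0, 20.0),   # Study A: psychologist ratings  -> d = 0.40
    effect_size(4.70, 3.00, 2.00),   # Study B: client self-ratings   -> d = 0.85
    effect_size(5.00, 5.00, 1.00),   # Study C: identical groups      -> d = 0.00 (invented scale)
]

# Simple unweighted average, as in Glass's early work.
combined = sum(studies) / len(studies)
print(f"effect sizes: {[round(d, 2) for d in studies]}, combined d = {combined:.3f}")
```

Later refinements, described later in this chapter, weight each study's effect size by its sample size rather than averaging all studies equally, but the basic logic is the one shown here.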
As logical, and even obvious, as all this sounds now, it was not clear to Glass, in 1976, that his colleagues would see his meta-analytic approach that way. "I thought maybe I'd get laughed at for suggesting such an idea," he recalls. Smith, too, because she would be mentioned as Glass's coworker, recalls having "a lot of fear. At the time, this was the most out-there-on-the-edge kind of analytic activity for anybody to be in. I was afraid it would be badly criticized, or just rejected as insane. I think Gene's real contribution was the courage to do something so completely different from what was then standard accepted practice."

Many meta-analysts have said that the concept of effect size is Glass's deserved claim to fame and his central methodological contribution to meta-analysis. Glass, ever the nonconformist, snorted when I quoted a distinguished statistician to this effect. "That kills me!" he said. "It's a nothing idea, an obvious thing. I'd known about it since graduate school." (Indeed, the psychologist Jacob Cohen and several statisticians had already written about effect size analysis.) "The overall idea of meta-analysis, the whole process, first to last-that was my contribution."

Both Smith and Glass calculated the effect sizes as part of their coding. Doing so was no routine matter: the 375 studies reported their findings in an exasperating variety of ways-sometimes as correlations, sometimes as t-ratios, sometimes as regression coefficients and other arcana-and it was often necessary to turn these into standard differences so Glass could combine effect sizes, the major goal of the project. He minimizes the difficulties of this part of the work-"mostly it was like solving a little algebra problem"-but in actuality, in addition to using standard algebraic procedures, he had to invent at least a dozen different new formulas for solving the different data problems.

After nearly two years, and with the AERA convention looming ever closer, Glass and Smith were finally ready to feed their data to the mainframe computer at the university and see what they had found. Actually, they did not expect to be greatly surprised, only enlightened as to the specifics. "I already had some idea," Glass says, "from working with all the studies and sort of tallying it up in my mind." Such a tally was, of course, only a subjective and possibly very distorted impression; as psychologist George Miller had much earlier shown, the human mind can pay attention to only about seven things at any one time, so that an effort to see scores of results in the mind is impossible. The results would become clear to Glass only when the mainframe computer processed the tape in response to the complex instructions he had given it about how
to combine the basic data and what cross-tabulations to make. (The latter are computations of the relation between any one feature of a study, such as some client characteristic, and another feature, such as the outcome of therapy.) The process of inputting the data took place in batches, with Glass and Smith taking material down to the computer center as it was ready. There they would load the reel of tape in the machine, type in a stack of specific instructions, and push the button. Then came the nail-biting wait, usually late at night, as the big reel, locked into place, spun back and forth and a high-speed printer spat out sheet after sheet of results. It took months before they had all the meta-analyzed numbers in hand.

Neither of them recalls any moments of epiphany, although Smith says that in a way everything was a surprise. "You can't look at that many points of data and get any sense of the whole. That's why you do the data analysis." Glass has a rather different recollection; he felt sure that the result would firmly establish the effectiveness of psychotherapy. And it did so to an extent they might have hoped for but not really envisioned. In simplest terms, psychotherapy in their 375 studies, comprising about forty thousand treated and untreated subjects, had a combined effect size of .68-over two-thirds of a standard deviation. This meant, as Glass wrote for his AERA address, that "[on average], slightly under twenty hours of therapy, by therapists with two-and-a-half years' experience ... can be expected to move the typical client from the fiftieth to the seventy-fifth percentile of the untreated population." In lay terms: While the median treated client (at the middle of the curve) was as mentally ill before therapy as the median control individual-healthier than half of them, unhealthier than the other half-after therapy, the treated client was healthier than three-quarters of the untreated group. In the social sciences, so large an effect of any intervention-such as a new educational program, or one to rehabilitate criminals, or one to retrain the unemployed-is almost unheard of. Glass and Smith were deeply fulfilled; the two years of effort, the laborious details, the countless frustrations had been worth it.

Among the many other findings of their meta-analysis, one was particularly gratifying. Eysenck had long since become an ardent advocate of behavioral therapy, which he, like other behavioral therapists, claimed was far more effective than other forms of therapy, especially the psychodynamic. The meta-analysis, which included some studies directly comparing behavior therapy with other kinds, showed otherwise. As Glass would say in his AERA address: "The findings are startling. There is only a trivial .07 [standard deviation] superiority of the behavioral
over the nonbehavioral therapies.... The available evidence shows essentially no difference in the average impact of each class of therapy."
Mining and Smelting the Data

Selecting the data in the collected studies to be meta-analyzed-the fourth major phase Glass outlined in his AERA address-is somewhat like the mining of stubborn raw materials; extracting and combining their data, the fifth major phase, is analogous to smelting, or freeing the valuable material from the often recalcitrant ore.

The first step in extracting the data from a group of studies is deciding which items of information to code and which to bypass. A study's conventional introductory material and literature review does not affect its results and can be ignored, but a great many other kinds of information are relevant. Among them are the "independent variable" (the type of intervention or condition being tested), the characteristics of the subjects and those of the researchers (even their sex, since male and female researchers sometimes get different results)-and the "dependent variable" (the outcome or result). The task of coding consists of reading each study word by word, searching for each item of information called for by the code sheets (or code book, since some meta-analyses call for fifty or more pages of coded data), and writing on the coding form the number that represents that item. To code a simple study may take two or three hours of intense and unflagging attention, a complicated one as long as two days. John Hunter and Frank Schmidt, a team of meta-analytic statisticians, estimate that coding can be 90 to 95 percent of the work of synthesizing data.15

Researchers and their assistants are apt to sigh or groan when asked about coding. Judith Hall, a social psychologist at Northeastern University who has coded a number of her own meta-analyses, says, "It's the most laborious task. Every time you get halfway through you say, 'Why am I doing this?' It really wears you out." But that is true of almost all scientific research. In a movie, it may be presented as a white-knuckle race to discover some remedy before a lethal plague does away with humankind; in a book it may appear as a sweaty, sleepless effort to be first with a great breakthrough, as in James Watson's autobiographical The Double Helix. But the reality is closer to what the scholar Richard Altick has written about literary research: "The researcher pays for every exultant discovery with a hundred hours of monotonous, eye-searing labor."16

MINING
What is most fatiguing about coding is not its tedium but the frequent need to make Solomonic decisions. An example: A study tested the effectiveness of a particular form of psychotherapy used with one group against two other groups, one receiving a placebo and the second receiving no treatment. Several participants in the psychotherapy group missed a number of sessions, a few missed as many as four in ten. The coder's problem: Should such participants be included among those having received treatment or would doing so unfairly dilute the effect size of the treatment? If they should not, what should be the cutoff for participation in the study? Another example: Three forms of therapy to reduce fear of snakes were tested and their results compared, but the research on one form was conducted by a psychologist who had previously published articles on one of the other therapies being tested. The coder's problem: Is that researcher's conflict of interest likely to have distorted his results, and if so, by how much should his or her data be adjusted?17

Next comes what is often a very difficult but potentially the most exciting step of coding, the part that sometimes yields what Altick called an "exultant discovery." This is the calculation of effect sizes, particularly when the results were not given in terms of standard differences but as medians, correlations, F-ratios, and so on. Many of these data can be transformed into effect sizes by formulas that statisticians have devised in the years since Glass's initial presentation. But for some kinds of data no ready-made procedures exist: they are puzzles the coder must solve by finding some crafty way of converting them into effect sizes. "What's fun about data analysis," Robert Rosenthal, a long-time meta-analyst and the Edgar Pierce Professor of Psychology at Harvard, told me, "what I really enjoy about it, is playing this detective game. There are moments of sheer delight when I invent a one-time procedure-I may never see this particular constellation of givens again-and come up with a sensible solution to this particular problem. That's exciting!"

Less exciting but equally problematic is deciding what to do when a study yields more than one effect size-when, for instance, several different treatments or different groups of subjects have been used and the study consequently has a number of outcomes, each of which can yield a valid effect size. How valid is it to combine results when one study is the source of half a dozen effect sizes and another is the source of only one?18 Some meta-analysts use every effect size when combining findings even though several come from a single study; others, when a study has multiple effect sizes, aggregate them and use the average so that every study in the meta-analysis has an equal input. Each approach has its hazards and costs, the details of which are beyond the scope of this book.
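As one illustration of the kind of conversion formula mentioned above, a reported correlation r between treatment and outcome can be turned into an effect size d by a standard textbook transformation. The sketch below is offered only as an example of the genre, not as one of Glass's own formulas:

```python
import math

# One common conversion from a reported correlation r (between group
# membership and outcome) to a standardized mean difference d.
# Offered as an illustration of the kind of transformation coders perform.
def r_to_d(r):
    return 2 * r / math.sqrt(1 - r ** 2)

print(f"r = 0.30 corresponds to d = {r_to_d(0.30):.2f}")   # about 0.63
```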
With so many decisions to be made in extracting the data, it is of vital importance that the coders reach similar judgments about the raw data; the more they agree, the less error gets built into the data. Glass and Smith, as already mentioned, compared their figures to make sure their coding was compatible; without some test of intercoder agreement, readers cannot be sure how reliable the conclusions of a meta-analysis are. The most widely used and simplest measure of compatibility is the agreement rate (AR), or percent of agreement, the formula for which is as follows:

AR = (number of observations agreed upon) / (total number of observations)
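In code, the agreement rate is a one-line computation; the sketch below uses invented codings by two raters for ten studies:

```python
# Percent agreement between two coders over the same set of coded items.
def agreement_rate(coder_a, coder_b):
    agreed = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return agreed / len(coder_a)

# Hypothetical codings of ten studies (values are code-sheet categories).
coder_a = [3, 1, 4, 4, 2, 5, 1, 3, 2, 4]
coder_b = [3, 1, 4, 2, 2, 5, 1, 3, 2, 4]
print(f"agreement rate: {agreement_rate(coder_a, coder_b):.0%}")   # 90%
```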
Generally, an agreement rate of 80 percent is considered acceptable but 90 percent or higher is preferable.19 Although this method of measuring agreement is simple, it has statistical shortcomings. If, for instance, only a few studies of a smoking-cessation program report weight gain, coders might differ in their interpretation of the data in those studies but still have a high agreement rate because they reliably agree about all those other studies in which weight gain is not reported. Some meta-analysts have therefore recently devised more trustworthy, but mathematically more complicated, ways of measuring intercoder agreement.20

SMELTING

Gene Glass, scorning statistical significance as a measure of what an intervention achieves, based his meta-analytic findings in his AERA address on the 830 effect sizes that he and Mary Lee Smith extracted from the 375 studies in their sample. It is hardly possible to overstate the importance to meta-analysis of the concept of effect size; one recent authoritative overview of the method called it "the key variable in meta-analysis, the hub around which the whole enterprise revolves."21 The critical step in the analytic phase of meta-analysis is combining the data-smelting all the effect sizes into one summary effect size. As already described, Glass achieved this by transforming all the varied measurements of effect into standard deviation units, a commensurable metric.

In the two decades since Glass introduced meta-analysis, a good deal of careful examination of his method-simple adding and averaging of effect sizes-has revealed certain weaknesses in it and led to the development of more advanced and reliable techniques. The details are beyond the ken of the nonstatistician, but at least two of the major improvements are easy enough to understand.
The first: Because the chance of sampling error in a small study is greater than that in a large one, adding and averaging effect sizes with every study on the same footing can distort the overall result. Formulas have therefore been developed to "weight" the effect size for each study according to the size of its sample; this gives small samples less input and so corrects for their greater sampling errors.22 The second: Because the information from which effect sizes are calculated may incorporate errors due to unreliable measurements and other methodological factors, a number of statisticians, Glass among them, have worked out ways to adjust the raw effect sizes so that they are less vulnerable to such distortions.

As we have seen, Glass was averse to using statistical significance to appraise the effects of any intervention and so included no combined significance data in his 1976 presentation. Nonetheless, combined significance testing has remained a major method of aggregating data ever since it was first used by Tippett in 1931, and by one authority's reckoning there are now nine methods for doing so.23 It remains an important tool in the meta-analytic workshop because some studies do not report enough information for an effect size to be calculated; research reports often give group means but no standard deviations. Other times p values are given without accompanying test statistics. In such cases, a significance approach can at least combine the p values of the findings.24

Much as researchers performing original studies use p values to test the null hypothesis (that any observed effect happened by chance), so meta-analysts use combined p values to determine whether the combined result of a group of studies may be only a chance result. Take a hypothetical example: We want to know whether a particular kind of psychotherapy is effective and find sixteen studies that compare a therapy group to an untreated group. Ten report that the therapy did not significantly affect the patients' well-being but unfortunately give no information on the exact p level associated with this result; they also fail to give the means and standard deviations from which one could calculate effect sizes. The other six studies report a significant positive effect of the therapy, four finding it significant at the p < .05 level, one at p < .01, and one at p < .005. Given these results, can we conclude that the therapy has an effect?

First, the data needed to calculate effect sizes is missing from too many of the sixteen studies to permit a confident overall estimate of effect size. Failing that, we can conduct a simple vote-count, concluding
that, since studies finding no significant result outnumber those that do, the therapy has no effect. But for reasons discussed earlier, we prefer to ask a more precise and trustworthy question of our data, namely, "What is the likelihood that we would find sixteen studies with these p levels if the therapy is ineffective?" To answer this question, we have to assume that the ten studies with nonsignificant findings have exactly equal results for both the therapy and nontherapy groups (that is, we assume these ten have a p level of .50). Next, we combine the p levels by the "adding-Zs method," a statistical procedure that converts the p levels to another statistic (Z score), which also expresses how unlikely it is that no difference exists but further permits the results to be combined. When we carry out this procedure, we get a combined level of p < .02, from which we confidently conclude the results were not due to chance alone; the patients in the therapy groups scored reliably higher than patients in the control groups.

Thus, even when effect sizes cannot be calculated, combining p levels can lead to a finding in which researchers can place considerable confidence. An extreme example: A 1995 meta-analysis synthesized eighty-eight studies on the long-debated issue of whether attitudes predict future behavior and reported the combined probability of the null hypothesis as p < .0000000000001, a figure that would convince even Doubting Thomas that attitudes do predict behavior.25 On the other hand, even the most extreme combined p value tells nothing about the magnitude of the effect. According to a monograph by the American Statistical Association, combined p values make no use of the "information content" of the separate data sets and, considered by themselves, can be remarkably misleading. If the effect of a new treatment is statistically highly significant but trifling in actual size, policymakers and physicians may be misled into using it although a more precise analysis would have guided them away from it.26

The two meta-analytic methods are thus complementary and serve best to reinforce each other. A combined significance test of p < .05 supports the hypothesis that the combined effect size is not a chance result, while the larger the effect size, at any given level of confidence, the more important the finding.27 Combining significance levels and combining effect sizes are therefore the two principal ways of synthesizing the results of a group of studies, although several other methods, more complicated and less often used, are also presented in books of meta-analytic technique.
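A minimal sketch of the adding-Zs procedure (often called Stouffer's method) applied to the sixteen hypothetical studies follows. The stated p levels are taken at face value, with the nonsignificant results set to .50 as in the text; the exact combined value depends on those assumptions, and with these inputs it comes out even smaller than the p < .02 bound quoted above:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

# One-tailed p levels assumed for the sixteen hypothetical studies:
# ten nonsignificant results treated as exactly p = .50, four at .05,
# one at .01, and one at .005 (illustrative values only).
p_values = [0.50] * 10 + [0.05] * 4 + [0.01, 0.005]

# Adding-Zs (Stouffer) method: convert each p to a Z score, sum the Z scores,
# divide by the square root of the number of studies, and convert back to p.
z_scores = [norm.inv_cdf(1 - p) for p in p_values]
combined_z = sum(z_scores) / sqrt(len(z_scores))
combined_p = 1 - norm.cdf(combined_z)

# With these assumed inputs the combined p falls well below the .02 bound
# reported in the text, and far below the conventional .05 threshold.
print(f"combined Z = {combined_z:.2f}, combined p = {combined_p:.4f}")
```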
Genesis of Meta-Analysis: Part II

In his 1976 AERA address, Gene Glass presented the meta-analysis of psychotherapy outcomes briefly-in about 1,100 words-and only by way of illustration. But he and Smith were already busily reanalyzing their 375 studies with an eye to publishing the first full-blown meta-analysis. Based on what they had learned and what they had not, they developed a considerably longer and more detailed coding form. It called for nearly one hundred items of information, some of which required the coder to choose among ten or twenty options. Coding the effect size was the most difficult and time-consuming part of the work; it necessitated any one of a dozen different computations for each of the 833 reported treatment effects. Despite Glass's poor opinion of significance ratings, the new coding form also required the calculation of the statistical significance of each effect size. Four graduate students helped with the coding but Smith did much of it. I asked her how she managed to stay in good humor during that arduous process. She sighed, "Let's just say I didn't. It was incredibly tedious and I would never do it again." (She did take part later in a few small meta-analyses, then stopped doing them altogether.)

Glass, along with his solution to the problem of converting effect sizes to a commensurable form, also resolved, at least in his own view, another issue of equal magnitude, namely, whether it was legitimate to combine measures as dissimilar as self-esteem ratings, anxiety, sobriety, disruptive behavior, job achievement, dating, galvanic skin response, and others. With his experience both as a therapy client and as a therapist, Glass was certain that the answer was yes; as he and Smith wrote in their report of the second version of the meta-analysis:

Mixing different outcomes together is defensible. First, it is clear that all outcome measures are more or less related to "well-being" and so at a general level are comparable.... [Moreover,] each primary researcher made value judgments concerning the definition and direction of positive therapeutic effects for the particular clients he or she studied. It is reasonable to adopt these value judgments and aggregate them in the present study.28

This justification notwithstanding, the combining of different effects would become a major bone of contention between the advocates and the critics of meta-analysis. It is often referred to as the "apples and oranges" problem, which we look at more closely in the next chapter.
Despite the finer-grained analysis Glass and Smith performed in the expanded version of their study, their major finding was consistent with that of the first: "On average, the typical therapy client is better off than 75% of untreated clients." And although their second version sorted out and evaluated the effectiveness of ten forms of therapy-the first version did so for only four-their conclusion was, again, that "few important differences could be established among many quite different types of psychotherapy." They stressed that they found virtually no difference in effectiveness between the behavioral therapies and the nonbehavioral ones, a conclusion certain to infuriate all partisans of behavior therapy, Eysenck in particular.

Their report of the meta-analysis was immediately accepted by American Psychologist, a principal journal of the American Psychological Association, and appeared in its September 1977 issue. In this medium, the methodology of meta-analysis and the findings about the outcomes of psychotherapy reached a far wider audience than had Glass's 1976 address; it had a major impact on the worlds of psychotherapy and the social sciences, and to a lesser extent other sciences. Clinical psychologists and other psychotherapists were deeply gratified. "I was thrilled by it," Robert Rosenthal told me, "I was doing therapy at the time and I felt sure I was benefiting people, but"-and he chuckled disarmingly-"I wasn't at all sure that any other therapists were. It was wonderful to learn that they were."

A number of people in other disciplines were impressed by the methodology of meta-analysis and saw its applicability to their own fields. An exemplar is the case of Larry Hedges, whose life course was determined by the 1976 and 1977 papers. As a graduate student in statistics at Stanford University, he read Glass's presidential address and, the following year, Smith and Glass's meta-analysis and saw his own future. He explains:

I was struck by the argument that so much of the equivocation about research findings in the social sciences is due to a failure to apply quantitative methods to summarize the research. When you do so, as Glass did, the picture often gets strikingly clearer. I was convinced that meta-analysis was ultimately going to prove absolutely fundamental to scientific work-and that there was a terrific opportunity for me to do something in this new and unplowed field.
Hedges soon became a leading developer of the statistical methods and theory of meta-analysis, and as a full professor in the department of
education at the University of Chicago has conducted his own meta-analyses of educational research.

Although most reactions to the Smith and Glass article were highly favorable, some were harshly critical. This was to be expected; academics often disagree for good and honest reasons but also because one way to get one's name known in academia is to vigorously attack a new and highly regarded piece of research. One or perhaps both of those motives inspired a vitriolic attack by Philip Gallo, a professor at San Diego State University, in the May 1978 issue of American Psychologist; he reworked certain Smith and Glass data and proved to his own satisfaction that psychotherapy has a "quite weak effect." He thereupon summarily dismissed the whole Smith and Glass meta-analysis: Their findings, he said, were based on aggregating a great many different measures of effect (the apples and oranges argument) and "any attempt to extricate meaningful information from such a hodgepodge is impossible."29

Other critics carped about other aspects of the meta-analysis, often in caustic terms, as has become the norm for academic disagreements in recent years, but Eysenck outdid them all. Understandably nettled that Smith and Glass had called his papers on psychotherapy "tendentious diatribes" and said that the "Eysenck myth" was thoroughly disproven by later studies, he waded in like a street fighter. In the May 1978 issue of American Psychologist, as already noted, he called their meta-analysis "an exercise in mega-silliness" and, characterizing their inclusion of both high-quality and lesser-quality studies as "garbage in, garbage out," said that if their analytic methods were to be taken seriously "it would mark the beginning of a passage into the dark age of scientific psychology." He was optimistic, however, that that would not happen:

The notion that one can distill scientific knowledge from a compilation of studies mostly of poor design, relying on subjective, unvalidated, and certainly unreliable clinical judgments, and dissimilar with respect to nearly all the vital parameters, dies hard. This article [Smith and Glass, 1977], it is to be hoped, is the final death rattle of such hopes.

As to their findings about the effectiveness of psychotherapy*: "It would be highly dangerous to take seriously the 'results' reported by Smith and Glass.... I must regretfully restate my conclusion of 1952, namely that there is still no acceptable evidence for the efficacy of psychotherapy."30

* Eysenck and some other British psychologists use the term "psychotherapy" to mean nonbehavioral techniques but not behavioral treatment; in America, behavioral treatment is considered a form of psychotherapy.
For more than a decade after the appearance of the 1977 Smith and Glass paper, other researchers reanalyzed their data or subsets of it, either working from the published paper or obtaining copies of the data tapes from Smith and Glass. To say whether more of these studies backed Smith and Glass or rebutted their conclusions would be to fall into the trap of vote-counting. But perhaps two other observations are pertinent. First, those who selected special portions of the Smith and Glass data and used statistical techniques different from theirs were likely to arrive at conclusions differing from, or amending, Smith and Glass's; those who used their entire body of data or their methods almost always confirmed their findings.31 Second, a vast 1993 study meta-analyzed 302 meta-analyses-yes, you read that correctly-of a total of nearly five thousand primary studies and came to "a strongly favorable conclusion about the efficacy of well-developed psychological treatment."32

An authoritative comment on the matter appears in the scholarly volume Meta-Analysis for Explanation: A Casebook. In the introductory chapter, an advisory committee of experts in meta-analysis writes:

The achievements of meta-analysis have been considerable for a method with such a short history. Some practical questions that formerly fomented wide disagreement now seem to have been resolved by the method. Gone, for instance, are the days when a conference on individual psychotherapy would devote many hours to discussing whether it was effective in general. Since the work of Smith and Glass (1977), and its follow-up by Landman and Dawes (1982), among others, the debate is stilled.33
That is not to say that no die-hard dissenters remain; Eysenck and a handful of others continue to denigrate meta-analysis and to assert that psychotherapy is ineffective, and a new book by William Epstein, a professor of social work at the University of Nevada, The Illusion of Psychotherapy, says flatly that there is no credible clinical evidence of its effectiveness. As Jonathan Swift said long ago, "There's none so blind as they that won't see."
A curious postscript: Although Glass and Smith co-authored a book in 1980 meta-analytically demonstrating the benefits of psychotherapy, and another in 1981 on methods of meta-analysis in social research, the two of them lost interest in the subject thereafter. Glass, who has switched from one major interest to another several times in his life, simply had enough of meta-analysis; these days his primary interest is editing a journal of education policy analysis on the Internet.
Smith, divorced from Glass for some years, says that her interest in meta-analysis was fulfilled by the psychotherapy study; she is happy to have played a part in the development of a new methodology but has no continuing interest in it, and in recent years has been studying the effects of testing on various aspects of schooling and teaching. In their different ways, both Glass and Smith seem a trifle amused and perhaps rueful at how distant they now are from the thing they gave birth to and which has grown and flourished without their further care.
IN BRIEF ...

To illustrate the subsequent wide-ranging use of meta-analysis to resolve other issues about psychotherapy, here are short accounts of two other, quite dissimilar, case histories.
Does Marital and Family Therapy Work?

"For years, researchers have debated whether marital and family therapies are effective," began a 1993 paper in the Journal of Consulting and Clinical Psychology. This was a low-key restatement of an issue that had long troubled the study's primary investigator, William Shadish. In 1985, when he was a young research associate at the University of Memphis (where he is now a professor of psychology), he got into a heated argument with a colleague, a marriage and family therapist, who claimed that research showed that his form of therapy worked better than any other and with any kind of problem. Shadish's own experiences as a patient in individual psychotherapy during his twenties and as a psychotherapist for a few years afterward made him feel certain his colleague was wrong, but the colleague felt just as certain that he was right. The research literature, which Shadish browsed through for an answer, proved a vast jumble out of which one could pick whatever results one preferred.

Exasperated, he decided to do a meta-analysis to wring some sense out of the untidy mess although he had no training in the methodology. "I had become aware of meta-analysis when it was invented," he says. "I thought the 1977 article by Gene Glass and Mary Lee Smith was a stupendous achievement. But I didn't try to do any meta-analysis myself until that discussion with my colleague." To train himself, Shadish studied two books on the subject by Glass, Smith, and coauthors, and once, when the 1989 San Francisco earthquake kept his
plane grounded, he spent two whole days in a hotel room beating his way through Hedges and Olkin's dense and difficult Statistical Methods for Meta-Analysis, a feat comparable to the mortification of the flesh practiced by early Christian anchorites.

Shadish, already skilled at writing grant proposals, won a substantial grant from the National Institute of Mental Health, and later another from the Russell Sage Foundation. He needed them: The project, even with the help of more than half a dozen graduate students, lasted nearly ten years. In collecting references, he and his graduate students had an advantage over Glass and Smith: By the mid-1980s Psychological Abstracts, Dissertation Abstracts, and three other major indexes were available online and could be swiftly and efficiently searched by computer. The team also read, the old-fashioned way, the bibliographies of relevant articles and the tables of contents of journals, hunting for items not contained in the on-line indexes; they also wrote hundreds of letters to specialists asking for yet other suggestions. A year and a half of such efforts netted Shadish a mighty haul of roughly two thousand references.

On examination, however, it turned out that fewer than 10 percent met his requirements. He wanted only studies that specifically employed a form of marital or family therapy, randomly assigned subjects to treatment or control conditions, and dealt with subjects who were actually distressed rather than merely seeking marital enrichment. "We tossed out more than half the references immediately either on the basis of the title or the abstract," Shadish recalled, "things with titles like 'A case study of' and so on. Then we read the remaining ones and again eliminated most of them. I still have two boxes of dissertations"-he waved toward a far corner of his large, cluttered office-"that we couldn't use and that cost me forty dollars each." In the end, from the two thousand references he had distilled a set of 163 studies that met all his conditions.

Shadish and a student assistant then developed a code book. The art of meta-analysis had advanced so much in a decade that the book ran to twenty-three pages and well over two hundred items. The coding, done primarily by students, was considerably more complex and time-consuming than it had been for Smith and Glass. To answer some questions, the coders had to search for clues and judge what to make of them. One question, for instance, asked whether the researcher was blind to the condition (that is, whether he or she knew which subjects were being treated and which were controls, since knowing might have influenced the observations); many studies did not explicitly say, and only a painstaking reading might find hints.
The most difficult part of coding was the calculation of effect sizes. Shadish taught his coders to use several methods, developed after the 1977 Smith and Glass meta-analysis, that, among other refinements, corrected for small-sample bias. This, too, put a burden on the coders. "If the study gave the actual number of subjects, calculating effect size was easy," says Ivey Bright, a young psychotherapist who, as one of the coders, spent twenty hours a week at it for a year. "But sometimes we'd read that maybe ten subjects dropped out before the end, and the study didn't say when they did, and we'd have to figure out from one clue or another how many subjects there were at any point in the study. It was really hard."

When it came to combining the data, Shadish again benefited from a decade of statistical innovation. "I followed the Hedges and Olkin procedures because they weighted for sample size," he says. "I used outlier techniques to check the central tendencies in the effects." (Outliers are atypical cases, far from the median, that distort the averages; sometimes a truer picture emerges when outliers are dropped.) He continued:

I looked at moderator variables to see if they, rather than the kind of treatment, accounted for differences in effect size, and found that a number of them did. For instance, effect sizes were higher if they were based on behavioral measures rather than nonbehavioral measures, and higher if based on ratings by others rather than self-reports. And if variables in a study were correlated, I used regression techniques to sort out and measure the effects of each one separately.
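The small-sample correction and the weighting for sample size that Shadish mentions are not spelled out in this chapter, but a minimal sketch conveys the idea. In the Hedges-Olkin approach, each standardized mean difference is shrunk slightly to remove small-sample bias, and each corrected effect is then weighted by the inverse of its sampling variance, so that large studies count for more. The Python below uses invented numbers and is an illustration of the general technique, not a reconstruction of Shadish's analysis.

    from statistics import NormalDist

    def hedges_g(d, n1, n2):
        # Small-sample correction applied to a standardized mean difference d.
        df = n1 + n2 - 2
        return d * (1 - 3.0 / (4 * df - 1))

    def variance_of_g(g, n1, n2):
        # Approximate sampling variance of the corrected effect size.
        return (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))

    # Hypothetical studies: (uncorrected d, treated n, control n).
    studies = [(0.62, 20, 20), (0.35, 45, 50), (0.80, 12, 14), (0.41, 60, 58)]

    numerator = denominator = 0.0
    for d, n1, n2 in studies:
        g = hedges_g(d, n1, n2)
        w = 1.0 / variance_of_g(g, n1, n2)   # larger studies get larger weights
        numerator += w * g
        denominator += w

    mean_g = numerator / denominator
    print(round(mean_g, 2))                      # about 0.45 here
    # Smith-and-Glass-style translation: the share of untreated clients the
    # average treated client would surpass (about 0.67 for this mean effect).
    print(round(NormalDist().cdf(mean_g), 2))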
When Shadish put all this through the computer-not once but several times, refining and reworking the project over the years-he got a number of clear-cut and gratifying results:34

• The average effect size was roughly half a standard deviation, which meant that the typical client of marital and family therapy was better off than 70 percent of control clients. But, according to a subsample of studies that compared marital and family therapies with individual therapies, the former work no better than the latter, as Shadish had thought from the start.

• Marital therapies had higher effect sizes than family therapies, though not to a significant degree; this minor difference was probably due to the fact that the problems requiring family therapy are more difficult than those requiring marital therapy.

• Some types of marital and family therapy appeared to work better than others, but when the differences in the quality of the methods used were eliminated by regression analysis, the differences in effectiveness disappeared. Humanistic therapies-Rogerian "client-centered" therapy
and others emphasizing such 1960s values as "authenticity," "self-actualization," and "personal growth"-showed no positive effects in any analysis, however, a bit of a disappointment to Shadish, whose original training in psychotherapy had been Rogerian.

• The meta-analysis also yielded some valuable methodological findings. First, effect sizes were larger when researchers were not "blind" to treatment; apparently, if they knew what was going on, they unwittingly saw more improvement. Second, results reported in dissertations, most of which had not been published, were nearly 40 percent smaller on average than results in published reports; whatever the reason, it meant that meta-analysts who ignored unpublished studies would likely overestimate effects.

Shadish's meta-analysis, listing five graduate students as co-authors, was published in 1993 in the Journal of Consulting and Clinical Psychology and won the 1994 Outstanding Research Publication Award of the American Association for Marriage and Family Therapy.

Second-Level Meta-Analysis
Although the primary goal of meta-analysis is to combine studies, its secondary goal, one of increasing importance in recent years, is to disentangle the knotted skein of causal influences in a set of combined studies in order to find out why they differed in their results. This is what Shadish was doing when he analyzed the "moderator variables."

Moderator variables are any characteristics of the studies that are associated with differences in effect size. Year of publication is one: Recent studies of a particular subject may, for various reasons such as changes in methodology, tend to report larger-or smaller-effects than earlier ones. The measure (criterion) of effect is another: As mentioned above, self-reports of changes brought about by marital and family therapy yield smaller effect sizes than reports by observers. Race, ethnicity, and IQ of school children can importantly influence the effects of programs meant to improve teaching and learning. The sex of the researcher can make a difference: Many studies have shown that women are more easily influenced than men by persuasive arguments, group pressure, and so on, but a meta-analysis has revealed that male researchers regularly find a larger effect of this nature than female researchers.35

Correlations between moderator variables and effect size sometimes point to associations that are of little or no interest, such as when they indicate that studies appearing in journals reveal different effects from those appearing in books. But in other cases, the analysis of moderator
variables enables the meta-analyst to judge how much of the effect is due to substantive issues-the treatment itself, the setting, the kind of outcome observed, and the like.36 In still other cases, it yields crucially important information about how and when the treatment works best; indeed, meta-analysis can reveal such interactions far more effectively than a single study.37 This is perhaps most easily seen in medical research. "Meta-analysis gives you answers at a first level," says the eminent statistician Ingram Olkin.

That level says, for instance, lumpectomy plus drugs is as effective as radical mastectomy. But that's only a general, overall statement. Now we have to go to a second level and find out if it is as effective for heavier women as for lighter women, for younger women as for older women, and so on. We have to go into what the medical people call subgroups and statisticians call covariates-other variables that affect the situation.
Among other covariates, in medical meta-analyses, are factors such as when and how a medication is used, as a recent discussion of the treatment of heart attack spells out:

Meta-analysis should not be regarded simply as a "pooling process" of available trials that address a similar question. Specific questions (e.g., Do β-blockers, when started in the acute phase, reduce mortality after myocardial infarction?) are much more useful for patient care than broader ones (e.g., What are the pooled results of the available trials of β-blockers in ischemic heart disease?).38
Searching out such information is done by analyzing the relationships between moderator variables and effect sizes. Meta-analysts do this primarily by means of special statistical techniques that test variables to see how they are associated with differences in effect sizes from study to study. The standard methods of looking for such relationships-analysis of variance and multiple regression-are prone to certain kinds of error when used in meta-analysis, but techniques analogous to them have been developed by statisticians for meta-analytic use.39
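One of those analogues, the counterpart of a one-way analysis of variance, splits the effect sizes by a moderator and asks whether the weighted subgroup averages differ by more than sampling error would allow. The Python sketch below uses invented effect sizes and inverse-variance weights, grouped by how the outcome was measured; it illustrates the general idea rather than any particular published analysis.

    # Hypothetical effect sizes (g) and inverse-variance weights (w),
    # grouped by a moderator: how the outcome was measured.
    behavioral  = [(0.71, 30.0), (0.55, 45.0), (0.64, 25.0)]
    self_report = [(0.30, 40.0), (0.42, 50.0), (0.25, 28.0)]

    def weighted_mean(studies):
        return sum(w * g for g, w in studies) / sum(w for _, w in studies)

    groups = [behavioral, self_report]
    grand_mean = weighted_mean([s for grp in groups for s in grp])

    # Between-groups statistic: does the moderator account for more variation
    # in effect sizes than chance alone would produce?
    q_between = sum(sum(w for _, w in grp) * (weighted_mean(grp) - grand_mean) ** 2
                    for grp in groups)

    print(round(weighted_mean(behavioral), 2),
          round(weighted_mean(self_report), 2))   # 0.62 versus 0.34
    print(round(q_between, 2))   # about 4.3; compare with chi-square, 1 df (3.84 at .05)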
Trying to Stay Sick in Order to Get Well

Not all successful meta-analyses examine hundreds or even scores of studies. One that was sufficiently meritorious to have been published in the Journal of Consulting and Clinical Psychology in 1989 was based on only ten studies. In turn, those studies contained a dozen data sets, all
of which lent themselves to partial meta-analysis but only two of which were suitable for full meta-analysis. The principal researcher was Varda Shoham*, then a postdoctoral fellow working with Robert Rosenthal at Harvard and now a professor at the University of Arizona.

* At that time Shoham-Salomon.

Shoham, a tall, dark, intense Israeli, had earned her doctorate in clinical psychology at Tel Aviv University in 1982, and by training and preference was a psychodynamic therapist. But during her internship in 1977-78 at the Mental Research Institute in Palo Alto, psychologists Paul Watzlawick and Carlos Sluzki introduced her to the bizarre, noninsight, short-term form of therapy known as paradoxical intervention, used by some therapists since the mid-1960s to treat a limited number of disturbing symptoms or conditions and reported by them to be highly effective with some patients. What makes this technique bizarre is that in one way or another, the therapist encourages, praises, or prescribes the very behavior the client is seeking to change. Typically, the therapist will tell an insomniac to stay awake, a procrastinator to put things off, and a depressed person to stay depressed.40 Remarkably, this often has the opposite effect and the client ceases to suffer from the symptom.

As improbable as the outcome seems, Shoham experienced it personally the first time she used it. She was treating a couple who chronically fought with each other, and she was supposed to tell them to go ahead and fight.

But I felt that it was a form of trickery, and it went against all my training and values. I just couldn't do it. My supervisor, John Weakland, was watching through a one-way mirror, and finally he called me on the telephone and said, "Okay, don't do it"-and suddenly I was able to do it. His call had acted as paradoxical intervention with me! He told me it was all right for me to be unable to use paradoxical intervention-and that was just what made me able to use it.
This is only one of many ways in which paradoxical intervention can be applied. The therapist can tell the client to do just what the client wants to stop doing, or merely tell the client not to change; in either case, the therapist can directly prescribe the bothersome behavior ("Do this!") or can reframe the symptom as a good and useful thing. There are different explanations of why the method works. One is that the client's effort to continue having the problem redefines it in his mind as controllable behavior.
Another is that the client resists and defies what he takes to be the therapist's attempt to control him. A third is that if the client cannot produce the symptom on demand, the problem is lessened, while if he can, he gains a sense of mastery; as Shoham and two colleagues titled an article they wrote on the subject, "You're Changed if You Do and Changed if You Don't."

Shoham, despite her experiences at Palo Alto, remained unconvinced, however, that paradoxical intervention was as effective as other forms of treatment, and unsure that it worked at all. In 1983, when she was a postdoctoral student at Harvard, she suggested a meta-analysis of the matter to Rosenthal, her mentor, who enthusiastically agreed to work with her on it. After the usual efforts to collect and screen studies, Shoham found herself with twelve usable sets of data, from which, with Rosenthal's help in data analysis, she and he were able to draw a number of worthwhile meta-analytic conclusions:

• Overall, paradoxical interventions were as effective as a large variety of other therapeutic procedures.

• The phenomenon was robust: The method worked as well in real clinical settings as in university research settings.

• A "positive" approach to paradoxical intervention (changing the meaning of the symptom) was more effective than the more commonly used "neutral" approach (telling the client to experiment by deliberately producing the unwanted symptom).

• Paradoxical interventions remained more effective a month after treatment than other treatments.

• Finally, and most interestingly, only two studies (an earlier one of Shoham's and one by Michael Ascher of Temple University) provided correlations between symptom severity and treatment effectiveness. Although Ascher's data were given in a raw form, Shoham and Rosenthal were able to use the data to compute correlations that could be compared with hers-and, in fact, turned out to be virtually identical. Using these two sets of findings, they performed a mini-meta-analysis; though it consisted of only two studies, it lent important strength to the conclusion that severe cases benefit more from paradoxical intervention than mild cases-a paradox piled upon a paradox.

The data on hand did not permit Shoham and Rosenthal to reach any firm conclusions about how and why paradoxical intervention works, but perhaps their most interesting finding was that paradoxical intervention does not directly change the patient's behavior; rather, it produces a change in the patient's situation-his perception of the meaning of the symptom, his feeling of resistance to being manipulated, or his
sense of his own ability to produce the symptom. They hypothesized that one or more of these changes mediates the therapy; that is, these situational changes, resulting from the therapy, become the cause of behavioral change in the client. (Shoham later put this and other hypotheses to the test in further investigations, with confirming results.) As Shoham and Rosenthal conclude:

This small-scale meta-analysis is not designed to reach firm conclusions on the state of the field. Rather, its purpose is to provoke future research to ask more focused questions.... In addition to further study of resistance as a potential mediator and moderator for the operation of paradoxical interventions, the time is ripe for the study of other process variables that can provide more understanding of the way these frequently applied but little-understood interventions operate.41
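The chapter does not spell out the arithmetic behind that two-study "mini-meta-analysis," but a standard way to pool a pair of correlations is to convert each to Fisher's z, average the z values with weights based on sample size, and convert back. The Python sketch below uses invented correlations and sample sizes purely for illustration; it is not Shoham and Rosenthal's actual computation.

    import math

    def fisher_z(r):
        # Convert a correlation to Fisher's z, a scale on which averaging is safe.
        return 0.5 * math.log((1 + r) / (1 - r))

    def inverse_fisher_z(z):
        return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

    # Hypothetical correlations between symptom severity and treatment benefit
    # from two studies, with their sample sizes.
    studies = [(0.45, 28), (0.43, 34)]

    weighted_z = sum((n - 3) * fisher_z(r) for r, n in studies)
    total_weight = sum(n - 3 for _, n in studies)
    print(round(inverse_fisher_z(weighted_z / total_weight), 2))   # pooled r, about 0.44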
Between Treatment and Effect: A Rube Goldberg View

Finding the average effects of any form of treatment is the primary goal of meta-analysis, but this reveals nothing about when, where, and how the treatment works.42 For that vital information, researchers must turn to the analysis of moderator and mediator variables. Moderator variables, as we have seen, may either accentuate or minimize the influence that the independent variable (treatment or intervention) exerts over the dependent variable (outcome). Mediator variables, though less obvious and less often studied, play a role that is just as significant, if not more so. They are effects of the independent variable which become causes of change in the dependent variable.43 Shadish and a collaborator, Rebecca Sweeney, offer an example: A therapist treating a troubled couple may, because of his or her therapeutic technique (the independent variable), choose to assess how they communicate (the assessment is a first mediator) and decide to change some aspects of it (a second mediator); the result is an increase in their level of marital satisfaction.44

Most meta-analyses of psychotherapeutic outcomes hypothesize that there is a direct relationship between the therapeutic orientation (technique) and the outcome; this is a moderator-variable view of the matter. But Shadish and Sweeney argue that mediational models of the process are far more realistic and plausible. By "mediational models" they mean verbal or diagrammatic schemes showing how the several variables are linked. In a simple nonmediational model (see figure 2-2a), the variables are portrayed as boxes with arrows leading directly to the outcome (effect size); the strength of the connection between each variable and the outcome is a correlation derived by regression analysis and other methods.
[Figure 2-2a, Model Without Mediating Variables, and Figure 2-2b, Model with Mediating Variables. Source: Shadish and Sweeney (1991). Numbers next to arrows are path coefficients, a form of correlations that have been "partialled out," or computed with other variables held constant; asterisks mark coefficients significant at p < .05.]
In contrast, a "path model" is a sort of Rube Goldberg scheme showing how some variables influence other variables, then the latter influence still others, and those still others eventually affect the outcome (see figure 2-2b).

Many scientists-social scientists in particular-are hesitant to call such correlational chains causal pathways. Too often they have seen correlations merely turn out to be co-occurring effects of some other cause. Even more confusing, a correlation does not always specify the direction in which the process is taking place; marital conflict, for instance, may be correlated with inability to communicate, but whether faulty communication produces conflict or conflict produces faulty communication is often unclear, and sometimes the process appears to be a reciprocal interaction-in common terms, a vicious circle. Nonetheless, path models are possible explanations of how the treatment is related to the effect; the crucial question is whether the explanation can be construed as a causal one. The question is not only crucial but highly arguable. We will therefore return to the matter of explanations and causality in the next chapter.
Chapter 3
Clarifying Murky Issues in Education

Is It True That "Throwing Money at Schools" Does No Good?
Eric A. Hanushek, professor of economics and political science at the University of Rochester, had written technical studies on sundry public policy issues for fifteen years without making any great stir until, at age thirty-eight, he hit on a topic that made his name in the world of educational policy. In 1981, at the beginning of the Reagan administration, he wrote an article, "Throwing Money at Schools," published in the Journal of Policy Analysis and Management. In it, he maintained that although citizens' and teachers' groups constantly urge government bodies to increase school budgets, empirical studies show that more spending does not increase student achievement. This startlingly counterintuitive finding was catnip to conservatives, who favored scaling back taxes, budgets, and government control over public processes. Hanushek soon became an expert witness for the defense at hearings and court cases brought by citizen groups against school boards they accused of miserly budgeting.

Having hit upon a good thing, Hanushek continued to research the issue and to lecture and write about it. In 1989, using data from 187 studies published in thirty-eight articles and books over the years, he wrote his most influential article yet, "The Impact of Differential Expenditures on School Performance," which appeared in Educational Researcher. Repeatedly stating that the available evidence fails to support conventional wisdom, Hanushek bluntly summed up his findings:

Two decades of research into education production functions have produced startlingly consistent results: Variations in school expenditures are not systematically related to variations in student performance.... The concentration on expenditure difference in, for example, school finance court cases or legislative deliberations, appears misguided, given the evidence.1
Not surprisingly, Hanushek's notions distressed and angered parent groups, educators, and education policy-makers. One of the latter was Richard Laine, a tall, serious, mature-looking man in his late twenties who (after serving as an aide to California's Democratic delegation to Congress) started graduate school in 1990 at the University of Chicago. Once in Chicago, Laine became involved with several school reform organizations and began working with them to increase funding for Illinois's public schools through legislative pressure and a lawsuit against the state.

From his graduate studies and work with the school reform organizations, Laine soon became aware that Hanushek exerted a major influence on school finance policies. "He had not only published influential articles stating that money does not matter," Laine told me, "but had been the expert witness for the state in many school finance lawsuits. He was the brick wall I kept running into. So I read his work, and my reaction was that at the gut level it just didn't make sense. But I needed a better statistical background to deal with it, so in the spring of 1992 I enrolled for a seminar with Larry Hedges on research synthesis."*

The dozen seminar students used as their textbook the page proofs of The Handbook of Research Synthesis, then in press, of which Hedges was a co-editor (with Harris Cooper). Within the first two weeks of the seminar Laine learned from the Handbook that meta-analysts view vote-counting, the method used by Hanushek, as simplistic and apt to produce wrong conclusions, and that the statistical techniques of meta-analysis were far more trustworthy and informative. Galvanized, Laine persuaded three friends in the seminar, Rob Greenwald, Bill McKersie, and Rochelle Gutierrez, to collaborate in a meta-analysis of Hanushek's data as their required course paper. Greenwald recalls Laine's compelling plea:

Rich said, "Hanushek has been getting away with this for a long time. He's been saying money doesn't matter, and [Secretary of Education] Bill Bennett has been saying, 'Hanushek shows that money doesn't matter and don't throw more money down the drain.' But there's got to be something wrong with this idea; it doesn't make sense. Maybe meta-analysis can determine what the data really show." We were all friends, we'd all worked in groups with policy applications, and we were easily sold on the idea of coming in with Rich.
* "Research synthesis" is synonymous with "meta-analysis" as the latter tenn is used in this book. Hedges and some others, however, use meta-analysis to mean only the statistical, dataanalyzing phase of the process.
First, however, they had to get Hedges' approval. "At one seminar session," Hedges told me, "when I asked the students what they planned to write about, Laine said that he, Greenwald, McKersie, and Gutierrez had read Hanushek's work, and Hanushek had used vote-counting!-and wasn't that a terrible thing to do, and wouldn't this be a great opportunity to carry out one of the first meta-analyses in education and see whether Hanushek's findings would hold up or not?" Hedges, youthful and fit in appearance but judicious and thoughtful in manner, expressed his concern that it would be too tough a job; he warned them that carrying out a meta-analysis was very difficult and that they had only ten weeks to do it.

But Laine laid out the enticing challenge: In the 1989 article, Hanushek had written that of sixty-five studies for which per pupil expenditure figures were computable, thirteen showed a statistically significant positive relationship with pupil achievement, three a significant negative relationship, and forty-nine a nonsignificant relationship. From this vote-counting Hanushek had concluded, "There is no strong or systematic relationship between school expenditures and student performance."2

Hedges, who was not particularly familiar with Hanushek's work, was intrigued. "It was clear," he said, "that what Hanushek had done wasn't sound and that the four students were deeply interested in the substantive issue. So was I, as a professor of education. I asked them to write me a memo of their plans and advised them not to tackle all seven factors that Hanushek analyzed but to limit themselves to the single most important resource variable, 'per pupil expenditures.'"

What made it feasible for the team to attempt the project in the time allotted was that they would do only a fragment of a meta-analysis: They did not need to formulate the problem (it was implicit from the start) and, far more important, they did not have to conduct a literature search, since, as they said in their memo to Hedges,

In order to do a research synthesis of high quality with the utmost reliability, it is necessary to undertake an exhaustive search of the literature in order to fully define the universe of research studies which can be synthesized.... Fortunately, this project's initial scope will focus on a limited and previously defined universe of studies ... [those] used by Professor Eric Hanushek in his articles.... [Each was] 1) published in a book or refereed journal, 2) relates some objective measure of student output to characteristics of the family and schools attended, and 3) provides information about the statistical significance of estimated relationships.
To which they added, with a touch of hubris:

While it is this group's belief that Hanushek's definition of his universe was adequate, we believe that the methodology he used to synthesize his universe was flawed. It is upon this premise that we are limiting our universe in this research synthesis project to that used by Professor Hanushek. It is our intention to replicate his work with more emphasis on the methodology of synthesizing the studies, rather than on achieving a preconceived conclusion.3
Of Universes and Their Boundaries

The term "universe"-some methodologists prefer "population"-has a special meaning to meta-analysts: It signifies the total body of studies of a subject about which the researchers wish to make generalizations.4 But is it actually possible to assemble such a universe? Many studies are published in obscure journals or books and difficult to locate except by onerous, time-consuming processes. Others, never published, do not appear in indexes and exist only in file drawers, unknown to outsiders and virtually undiscoverable.

Even assuming all the relevant studies could be found, it might be a ruinous waste of time, energy, and money to collect them all for a meta-analysis. In surveys, polls, and many census studies, researchers rely on scientifically gathered samples, maintaining that a properly gathered sample represents the universe they are studying (within a known, narrow margin of error) and that one can therefore draw from it generalizations that can be validly applied to the universe. Frequently, however, it is not certain that a sample accurately represents the universe because its boundaries are vague; when that is the case, researchers can make generalizations from the sample only at a lower level of confidence.5

Not only for that reason but also because every sample involves the possibility of sampling error, some researchers prefer to try to collect everything relevant. Yet even if they more or less succeed in that aim, what they assemble is a universe of studies of a phenomenon, which may not be identical with the universe of actual instances of that phenomenon. If, for instance, a researcher collected all studies ever conducted of the outcome of psychotherapy with neurotics, would that universe of clients be identical with the universe of all neurotics? Surely not; the universe of studies deals with neurotics in treatment, and they may differ in some way from neurotics who do not seek or cannot obtain it. Further, the types of therapies, therapists, and clients who take part in
research may be different from the entire universe of all therapies, therapists, and clients. One solution to this difficulty is to define the universe of studies narrowly so that it is small, distinct, and more likely to be coterminous with the universe of reality it is said to represent. Hanushek had done so in his study, and Laine and his team accepted Hanushek's defined and bounded universe for their experimental meta-analysis.

Meta-analysts who conduct their own document search start by defining the boundaries of the universe they mean to meta-analyze. They may choose to include case histories in which subjects serve as their own controls, or to limit their universe to studies with separate control groups, or more narrowly to controlled studies in which subjects were randomly assigned either to treatment or control, or still more narrowly to studies meeting these conditions that were also published by a leading journal in the field. But although such criteria can bound the universe, almost every universe grows the longer the meta-analysts press their search. Some years ago, William Stock, an educational psychologist and methodologist at Arizona State University, was asked by a gerontologist colleague to join him in a meta-analysis of a subject whose literature, he assured Stock, consisted of about ninety articles. Stock was willing but said they had better do a comprehensive search to make sure. The more widely and deeply they searched, the more studies they found; their search dragged on for years and eventually netted over eight hundred items.

Whether to sample or to try to collect the whole universe of studies is the question that bedevils meta-analysts. Some cannot rest until they have found everything; others consider completeness unnecessary and misguided. "Why are we trying to get all these data?" grumbled the Harvard statistician Frederick Mosteller at the 1986 National Research Council workshop on the future of meta-analysis. "Why shouldn't we have some statistical device that tells us when we've got enough? Or tells us we've got the core of it which is important?"6

Whatever meta-analysts decide, some basic rules must be followed. The search must be systematic to avoid gross errors due to biases of one sort or another. "Those planning to undertake meta-analyses," writes a research team in the British Medical Journal, "should not underestimate the difficulty or expense of performing a well conducted systematic review. There is no question that the choice of methods used for data collection is the key to the validity of such a review."7 The Nobel Laureate Linus Pauling claimed in his 1986 book How to Live Longer and Feel Better that vitamin C prevents colds and cited
some thirty studies, nearly all of which reported that it does. But Pauling said nothing about how he had collected the studies; it was unclear whether his sample genuinely represented the universe of such studies or comprised those he found within easy reach. Paul Knipschild, a Dutch epidemiologist who was interested in the question, undertook a systematic search; he and several associates started with MEDLINE (the on-line medical data base), went on to other indexes, pored through textbooks, contacted researchers in the field, and visited special medical libraries. They gathered sixty-one studies (including Pauling's thirty), graded them according to methodological soundness, and discovered that five of the best fifteen had not been mentioned by Pauling. The fruits of this systematic search led them to a conclusion very different from Pauling's: "Vitamin C, even in gram quantities per day, cannot prevent a cold."8 Knipschild's result may be disappointing, but it is almost certainly closer to the truth.
"Dollars and Sense: Reassessing Hanushek" Returning to the reworking of Hanushek's research: Richard Laine and Rob Greenwald located Hanushek's thirty-eight sources in several university libraries, photocopied them, and divided up the copies among the four members of the team. The first task was to comb through the 187 studies encompassed in these sources, extract the information from which Hanushek had calculated the per pupil expenditures, and check his figures. The team members, having few clues as to how Hanushek had derived per pupil expenditures from such data as budgets and school enrollments, had to work out the calculations for themselves, an arduous chore. Their aim was to take nothing for granted but to replicate Hanushek's data extraction as closely as possible; in the end, their figures differed only minimally from Hanushek's.9 Similarly, they also extracted pupil performance data from the studies. Then, having both sides of the equation-money input and student achievement-they were ready to put Hanushek's findings to the test. Greenwald, a lean, clean-cut, highly charged young man, practically bounces out of his chair as he recalls the experience: We knew fairly early that combined significance testing would be the first leg. If Hanushek's conclusion, based on vote-counting, was COfrect, our conclusion, based on combined significance methodology, should be the same as his. That was the first question. But if our con-
conclusion differed from Hanushek's-if we found that there was a significant relationship-this would lead to the second question: If there are effects, how big are they? Big enough to really count? Effect-size estimation would allow us to answer that question.

Laine had a computer at home into which he fed the per pupil expenditure figures and the pupil performance data the team had compiled. He was proficient in the spreadsheet program Lotus 1-2-3 and was able to compute the combined significance figures following a procedure given in The Handbook of Research Synthesis. His recollections:

Once we had dumped all the data into the computer, the actual run was pretty quick. Almost as soon as we put the data in, calculated the p values, and started combining them, we could see that Hanushek's data didn't support his conclusions. With our meta-analytic methods, we were getting findings different from his and that really spurred us on. We worked day and night. Hanushek's stuff had been carrying a lot of weight, and it was exciting to see that using his own data and better methodology, we could rebut his argument that money doesn't matter.

Combining the p values showed that the overall relationship between PPE (per pupil expenditure) and pupil achievement was positive, and significant at the .05 level-that is, there was less than a 5 percent probability that so strong a positive relationship would have appeared by chance if spending made no real difference; increasing PPE evidently does bring about an increase in pupil achievement.10 But how great an increase? That was the crucial question, since rebutting Hanushek would mean little if the effect was trivial. In the original studies the output variable, student achievement, was given in many different forms, so the team used the standard meta-analytic method introduced by Gene Glass in 1976 to standardize the different forms, converting them into units of standard deviation. As for the input variable, PPE, Hedges pointed out that the data were all given in dollars-and thus already standardized and usable as is.

When the data were digested by Laine's computer, the emerging effect sizes were dramatic: An increase of $100 spent per pupil per year (in 1989 dollars, the year of Hanushek's analysis) would raise pupil achievement by one-fifth of a standard deviation. Put another way, such an increase would reduce by almost half the number of students whose achievement was in the lowest tenth of all students.11 "We were amazed at the relationship between PPE and effect size," Greenwald says. "I was even afraid at times that people in the field would think we had cooked the books. We said to each other, 'We
don't believe this!-yet this is what the data say!' We kept telling each other that Hanushek's data don't support his conclusions but exactly the opposite."

Armed with these results, the team drafted their paper, "Dollars and Sense: Reassessing Hanushek," each of the four students writing a section and editing the others' contributions. Making no effort to be tactful or diplomatic, they concluded,

Falling victim to three of the main weaknesses of vote counting, Hanushek failed to identify an existing significant and positive effect between student performance and expenditure levels.... He stated that he did not believe that a more sophisticated analysis would alter his findings. Our preliminary meta-analysis shows that he was mistaken: three different meta-analytic techniques refute his central conclusion about the lack of a connection between expenditures and achievement.12
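Neither the seminar paper nor this chapter reproduces the team's formulas, but one widely used combined significance procedure, the Stouffer sum of Zs, shows why such a test can contradict a vote count: each study's one-tailed p value is converted to a standard normal deviate, and the deviates are summed and divided by the square root of their number. A set of results that individually fall short of the .05 mark, but that all lean the same way, can add up to a decisive combined result. The Python sketch below uses invented p values and is only an illustration of the general method, not the team's actual analysis.

    from math import sqrt
    from statistics import NormalDist

    norm = NormalDist()

    def stouffer(p_values):
        # Combine one-tailed p values (direction: more spending helps) into one test.
        z_scores = [norm.inv_cdf(1 - p) for p in p_values]   # each p becomes a z
        z_combined = sum(z_scores) / sqrt(len(z_scores))
        return z_combined, 1 - norm.cdf(z_combined)          # combined z and p

    # Hypothetical one-tailed p values: only one dips below .05 on its own,
    # so a vote count would read "no effect."
    p_values = [0.04, 0.11, 0.18, 0.09, 0.23, 0.15, 0.30, 0.12]
    z, p = stouffer(p_values)
    print(round(z, 2), round(p, 4))   # about 3.08 and 0.001: decisive overall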
Apples and Oranges

Although Laine and his colleagues accepted Hanushek's universe as suitable for meta-analysis, a critic of the new methodology might scoff at their combining results from so diverse a collection of materials. Some of the studies in the sample dealt with single school districts, others with multiple districts; half were based on data from primary schools, half from secondary; most of the figures on student achievement were derived from students' individual scores but others from data aggregated at any one of three levels-the school, the district, or the state. In short, the universe would seem to exemplify the "apples and oranges" criticism of meta-analysis: If you add apples and oranges and then average their weights, sizes, flavors, and shelf lives, you get meaningless figures.

Eysenck, who is still attacking meta-analysis as vigorously today as in 1978, recently wrote, "Meta-analysis is only properly applicable if the data summarized are homogeneous-that is, treatment, patients, and end points must be similar or at least comparable. Yet often there is no evidence of any degree of such homogeneity and plenty of evidence to the contrary."13 He added, using his favorite whipping boy, the Smith and Glass psychotherapy study, as an example: "The resolute search for some general effect for psychotherapy appears fruitless; the data used are too heterogeneous to be analyzed."

Gene Glass has impatiently dismissed such criticisms by saying it is a good thing to mix apples and oranges when we are trying to generalize about fruit.14 (One might, for example, generalize that fruits are the
results of pollinated ova, about where fruit seeds are located, how the edible parts attract creatures who will distribute the seeds, and so on.) Glass has also faulted the apples and oranges criticism for being at odds with the very nature of research:

The claim that only studies which are the same in all respects can be compared is self-contradictory; there is no need to compare them since they would obviously have the same findings within statistical error. The only studies which need to be synthesized or aggregated are different studies. Generalizations will necessarily entail ignoring some distinctions that can be made among studies. Good generalizations will be arrived at by ignoring only those distinctions that make no important difference. But ignore we must; knowledge itself is possible only through the orderly discarding of information.15

It is when meta-analysts incautiously combine studies in which the differences in the raw data are very great that they may produce meaningless or misleading findings. Judith Hall and three co-authors of a chapter in The Handbook of Research Synthesis rebut the apples and oranges criticism but go on to warn:

Synthesists must be sensitive to the problem of attempting aggregation of too diverse a sampling of operations and studies. Combining apples and oranges to understand something about fruit may make more sense than combining fruits and humans to understand something about organic matter. The synthesist must ask, "Does this level of generalization add to our explanation and understanding of a phenomenon?" Too diverse a sampling of studies could obscure useful relationships within subgroupings of the studies and not provide information at the level of the more abstract categorization.16

By paying attention to the narrative and qualitative aspects of individual studies, meta-analysis can avoid lumping together those that are excessively diverse. There are, moreover, precise ways of evaluating the diversity of any group of studies, the most common being a statistical tool known as "the homogeneity test." Before combining the effect sizes of a group of studies, the researcher checks their variability-how widely the effect sizes in the studies differ. The homogeneity test compares the variability in the effect sizes to the variability that would be expected if sampling error alone were responsible. In essence, the homogeneity test asks, "What are the chances that this much variation in effect sizes is due to sampling error?" If that probability is less than .05, it becomes necessary to look for other factors-moderator and mediator variables-
that might account for the excess in variation. If such factors can be linked to the disparities in effect sizes, a combined effect size may be meaningless and misleading; the meta-analyst would do better to divide up the studies into subsamples and process each for a more meaningful meta-analytic finding.17 Apples and oranges are manageable; apples, raspberries, watermelons, kiwi fruit, seedless grapes, and strawberries generally are not.
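In its usual form the homogeneity test boils down to a single statistic, Q: the weighted sum of squared deviations of the individual effect sizes from their weighted mean, which is then compared with a chi-square distribution having one less degree of freedom than there are studies. The Python sketch below uses invented effect sizes and weights to illustrate the computation.

    # Homogeneity statistic Q: how much the observed effect sizes scatter
    # around their weighted mean, relative to sampling error alone.
    effect_sizes = [0.05, 0.70, 0.12, 0.80, 0.02, 0.72]    # hypothetical
    weights      = [22.0, 18.0, 30.0, 15.0, 26.0, 20.0]    # inverse variances

    g_bar = sum(w * g for g, w in zip(effect_sizes, weights)) / sum(weights)
    q = sum(w * (g - g_bar) ** 2 for g, w in zip(effect_sizes, weights))
    df = len(effect_sizes) - 1

    print(round(g_bar, 2), round(q, 2), df)   # about 0.34, 14.4, 5
    # The .05 critical value of chi-square with 5 df is about 11.07, so this
    # much scatter is unlikely to be sampling error alone: look for moderators
    # or split the studies into subsamples before combining.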
Going Public Against Hanushek

As the seminar paper took shape, Laine and Greenwald began talking about a grander goal, a full-scale meta-analysis of Hanushek's work. (The other two students, preoccupied with work on their dissertations, dropped out of the project.) After Hedges read the paper, he encouraged Laine and Greenwald to go ahead with the larger effort and, when they asked whether he might join them, agreed to do so. It was a golden opportunity for them: two graduate students would be joined by an eminent professor in producing a controversial study with direct policy implications. For Hedges, it was a fine chance to apply his methodology to a critically important issue in education.

In a memo dated June 16, 1992, Hedges outlined for Laine and Greenwald the analyses they were to perform: They would do for each of Hanushek's seven variables what previously they had done only for per pupil expenditures. The project, he told them, had the potential to be important. This was no routine compliment; Hedges considered it so significant that he discussed it with Eric Wanner, president of the Russell Sage Foundation, which awarded him a $44,000 grant for salaries and expenses.

During the summer, the three worked separately, Laine in San Diego, Greenwald in Chicago, and Hedges in New York, and conferred regularly by phone. By fall, they had completed 90 percent of the coding, but what was left was the hardest part, requiring high-order statistical skills. Laine recollected,

Larry was the driving force on methods. He had an amazing knowledge of how to pull the data we needed out of what was there but not obvious, and to fill in data where there were gaps. He could pull out standard deviations where we had different subcategories of kids but needed standard deviations by overall categories; for him, that was simple. Or he could show us in minutes how to extract standard deviations by separating different means from an aggregated mean.
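The last trick Laine describes, recovering an overall standard deviation from separately reported subgroups, is ordinary algebra once a study gives each subgroup's size, mean, and standard deviation. A minimal Python sketch, with invented numbers rather than figures from any of Hanushek's studies:

    import math

    def pooled_mean_and_sd(subgroups):
        # Each subgroup is (n, mean, sd), with sd computed on n - 1 observations.
        n_total = sum(n for n, m, s in subgroups)
        grand_mean = sum(n * m for n, m, s in subgroups) / n_total
        sum_squares = sum((n - 1) * s ** 2 + n * (m - grand_mean) ** 2
                          for n, m, s in subgroups)
        return grand_mean, math.sqrt(sum_squares / (n_total - 1))

    # Hypothetical test scores reported separately for two subgroups of pupils.
    mean, sd = pooled_mean_and_sd([(40, 52.0, 9.0), (60, 61.0, 10.0)])
    print(round(mean, 1), round(sd, 1))   # 57.4 and about 10.5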
Hanushek had investigated seven cost variables-per pupil expenditures, teacher-pupil ratio, teacher education, teacher experience, teacher salary, administrative inputs, and facilities-and by means of vote-counting found that only teacher experience had a possibly significant positive effect on pupil achievement, although even that was doubtful.18 Hedges, Laine, and Greenwald would retest the relationships of all seven variables to pupil achievement by meta-analytic methods.

As the work got under way, they felt obliged to inform Hanushek of the project. Hedges wrote him that he, Laine, and Greenwald were attempting to duplicate Hanushek's table showing how many studies yielded statistically significant positive, negative, and nonsignificant relationships between each of the seven variables and pupil achievement; they also asked for his help in resolving a few discrepancies. Hanushek replied by letter and phone; he provided some help, but perhaps sensing trouble ahead asked Hedges to send him copies of their work sheets. None were ready until late September, by which time exciting data were emerging from the computer. (The team, now meeting regularly in Hedges' office to review their results, were using a work station tied to a university mainframe equipped with high-powered statistical programs.)

When they began assembling the data, Hedges alerted Hanushek to their progress and offered him the opportunity to criticize and correct their preliminary findings:

We would like to carry out a "reanalysis" or a resynthesis of the studies that you have examined. One reason is that while vote count or box-score reviews of the type you did are certainly sensible, we do know that they can sometimes be misleading. Thus we were interested to see what might be found if a different statistical analysis strategy were used to summarize the results of the studies. We want to try two things: formal combined significance tests and some summary of an index of effect magnitude.
He then hinted at what was coming by adding that "very preliminary analyses on the relation between PPE and outcome seem interesting," that combined significance testing showed at least some positive relations and no negative ones, and that effect size analysis indicated "a typical effect that may be big enough to have practical significance." Not unexpectedly, his letter and the accompanying preliminary draft of the article elicited a bristling reply. Hanushek objected to their analysis of his data on a number of highly technical grounds and was dismissive of the entire proceeding:
I realize there is a market for strong conclusions, but at least the exposition of your current draft seems to go beyond the evidence and data you describe. I presume that part of this is [due to] the exuberance of graduate students. While you may be completely comfortable and confident that the results will hold up under full analysis, it seems that this draft piece is a bit premature. I personally think that you are putting out some potentially misleading conclusions that you would want to be very confident of before publishing.
The correspondence continued but led to only minor modifications of the article, and by May 1993, Hedges, Laine, and Greenwald crossed the academic Rubicon and submitted their article to Educational Researcher. After receiving the criticisms of peer reviewers, they revised it, and it was accepted and published in the April 1994 issue under the title, "Does Money Matter? A Meta-Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes." The article reviewed Hanushek's work in detail, discussed the shortcomings of vote-counting, and presented the results of the meta-analysis of all seven factors, utilizing both combined significance tests and effect size analyses. The findings of the combined significance tests flatly contradicted Hanushek's conclusions: [The] data imply that over all the studies, with the few exceptions noted above, there are at least some positive relations between each of the types of educational resources inputs studied and student outcome .... These analyses are persuasive in showing that, with the possible exception of facilities, there is evidence of statistically reliable relations between educational resource inputs and school outcomes, and that there is much more evidence of positive relations than of negative relations between resource inputs and outcomes. 19
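What a combined significance test actually does can be sketched in a few lines of Python. The fragment below uses Stouffer's method, one standard way of pooling the one-tailed p-values of independent studies into a single overall test; the p-values shown are invented for illustration, not taken from the Hedges, Laine, and Greenwald analysis.

    import math

    def phi(z):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def phi_inv(p):
        """Inverse of phi, found by bisection (adequate for a sketch)."""
        lo, hi = -10.0, 10.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
        return (lo + hi) / 2.0

    def stouffer_combined_test(p_values):
        """Pool one-tailed p-values from independent studies (Stouffer's method)."""
        zs = [phi_inv(1.0 - p) for p in p_values]   # each p becomes a z-score
        z = sum(zs) / math.sqrt(len(zs))            # combined Z
        return z, 1.0 - phi(z)                      # overall one-tailed p-value

    # hypothetical p-values from several studies of one resource input
    print(stouffer_combined_test([0.04, 0.20, 0.11, 0.35, 0.07]))

Each study contributes evidence in proportion to how far its result departs from chance; even a set of individually nonsignificant studies can yield a decisively significant combined result, which is the point the article's combined significance tests were making.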
The article then turned to the effect size analysis. Not all of the seven factors were related to a positive effect on pupil achievement, but "taken together the effect size analyses suggest a pattern of substantially positive effects for global resource inputs (PPE) and for teacher experience." The effects of several other factors were usually, though not always, positive; that of class size was mixed; and one factor, teacher education, was mystifyingly correlated with lessened pupil achievement. Combining all the results, however, led to an unequivocal conclusion: This [overall effect] coefficient is large enough to be of considerable practical importance. It suggests that an increase of PPE by $500 (approximately 10% of the national average) would be associated with a
0.7 standard deviation increase in student outcome. By the standards of educational treatment interventions, this would be considered a large effect. 20
In more familiar terms, students in a school that raised per pupil expenditure by $500 would enjoy a nearly 24 percent increase in achievement compared with similar students in a school that pursued no spending increase.21 For good measure, the article included the results of "robustness tests," or sensitivity analyses, of their findings. Such tests examine what happens to the results when potentially distorting or deceptive factors such as outliers-rare, extreme cases-are removed. If the results remain substantially the same, the findings are said to be robust or sound. Three such tests showed this to be the case. (A bare-bones illustration of such a check appears at the end of this section.) "No matter how you slice the data," says Hedges, "you get essentially the same answer." For Hedges, publication of the article, though gratifying, meant only one more item in his long list of writings; for Laine and Greenwald, it was an epochal and thrilling event. Less thrilling, however, was Hanushek's reply in the next issue of Educational Researcher (the editors had sent him advance proofs and requested his reply). Hanushek said that Hedges, Laine, and Greenwald claimed their meta-analytic methods were "more sophisticated" than vote-counting but that "more sophisticated is not synonymous with correct." He called their interpretation of the data "absurd," "unwarranted," and "potentially very misleading when it comes to policy matters."22 After finding fault with a number of their procedures, he summarily rejected their meta-analytic techniques as fancy terms of little substance:

The couching of their analysis in technical phrases like "combined significance tests," "robustness testing," and "median half-standardized regression coefficients" gives the misleading impression that sophisticated statistical methodology has led to conclusive results, where the previous analysis did not. Such a conclusion is clearly wrong .... [Their] conclusion is, I believe, misleading and potentially damaging .... It would be very unfortunate if policy-makers were confused into believing that throwing money at schools is effective. More serious reform is required if we are to realize the full benefits of our schools.23
Hedges, Laine, and Greenwald had the last word, however, in the form of a brief "Reply to Hanushek." Rebutting each of his criticisms, they defended meta-analysis against his aspersions:

Hanushek seems to question the validity of meta-analysis. We, and many other scientists, disagree. For example, a recent report of a
committee of the Mathematical Sciences Board of the National Research Council (1992) concluded that "quantitative research synthesis-meta-analysis-has gained increasing use in recent years and rightly so. Meta-analysis offers a powerful set of tools for extracting information from a body of related research."24

Laine and Greenwald were all for thoroughly bashing Hanushek, but Hedges, with long years of academic experience, preferred a more politic approach, and the "Reply to Hanushek" ended on a collegial and positive note:

This interchange has moved the discussion forward. It has evolved from (a) the position that Hanushek's sample of studies show definitively that there is no relation between resources and outputs to (b) a discussion of how large the positive relations might be and of the characteristics of the studies that best reveal this relation. This strikes us as progress.25

Privately, they have other thoughts. Laine, at least, says candidly, "We felt that Hanushek knew there was a better method of analyzing the data but ignored it because it detracts from his message; it attacks his bread and butter. Recently, though, he's begun to modify what he says; now it's 'We need to focus on the ways in which money does matter.' But his previous work lives on in a lot of people's minds, and conservatives still say that money doesn't matter." Greenwald's feeling, based on meetings he has attended lately, is that the meta-analytic view is now widely accepted in the field. Science, the New York Times, Phi Delta Kappan, and Education Week have all paid major attention to the Hedges, Laine, and Greenwald analysis, and the Journal of Education and Finance devoted its entire summer 1994 issue to "Further Evidence on Why and How Money Matters in Education." Hedges, Laine, and Greenwald, though they remain embroiled in controversy with Hanushek, have moved into a larger sphere of inquiry: They have adopted what Greenwald refers to enthusiastically as "the new universe"-a collection of studies far more extensive than Hanushek's-and have meta-analyzed it for the effects of educational inputs, including money, on pupil achievement. They believe this larger sample is more definitive, wide-ranging, and fine-grained than Hanushek's.26 What began as a seminar paper five years ago still has a claim on the time of all three co-authors and promises to do so for years to come. Most gratifying, perhaps, their work seems likely to play a significant part in the continuing struggle over the funding of public education in America.
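The robustness tests mentioned above are conceptually simple, whatever the particular statistics involved. A minimal sketch of the idea, using invented effect sizes rather than any of the actual data discussed here, is to recompute the summary with each study left out in turn and again with the most extreme study dropped; if the answer barely moves, the finding is robust.

    from statistics import mean

    # hypothetical standardized effect sizes from a set of studies
    effects = [0.42, 0.35, 0.55, 0.29, 0.61, 1.80, 0.38]   # 1.80 is an apparent outlier

    overall = mean(effects)

    # leave-one-out: drop each study in turn and see how much the summary shifts
    leave_one_out = [mean(effects[:i] + effects[i + 1:]) for i in range(len(effects))]

    # trimmed: drop the single study farthest from the overall mean
    trimmed = sorted(effects, key=lambda e: abs(e - overall))[:-1]

    print(f"all studies:     {overall:.2f}")
    print(f"leave-one-out:   {min(leave_one_out):.2f} to {max(leave_one_out):.2f}")
    print(f"outlier removed: {mean(trimmed):.2f}")

A real robustness analysis would use the same weighted estimator as the main analysis rather than a simple mean, but the logic-vary the ingredients and watch the answer-is the same.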
IN BRIEF ... What Good Is Homework?

Is homework useful or a waste of time, good for students or harmful, an aid or an impediment to the learning process? Educators and the public have swung back and forth in their views of this matter in response to changing fads in education, historic events (the launching of Sputnik, for example), and other factors. Surely, research can provide an answer. Indeed, it has provided a number of answers-which, however, contradict each other. "Reviews of homework research," writes Harris Cooper, professor of psychology at the University of Missouri, "give appraisals that generally fit the tenor of their times. Through selective attention and imprecise weighting of the evidence, research can be used to muster a case to back up any position."27

In 1986 Cooper, whose two young children were then entering school, saw this as a challenge that could be met by meta-analysis, about which he had been an enthusiast for over a decade. For some, meta-analysis is an acquired taste; for others, a case of love at first sight. Cooper, a lean, denim-clad, bearded man in his early forties, was instantly smitten by it when he was a postdoctoral fellow at Harvard. He had meant to become an experimental psychologist, but in 1975 his Harvard mentor, Robert Rosenthal, was writing a book on methodology and let Cooper read a draft copy of a chapter on combining the results of independent studies. Cooper says he was "tickled" by it, though "hooked" would be more accurate. In the two decades since then, although he has done a fair amount of social psychological research, more than half of his time has been devoted to teaching, writing, and editing books about meta-analysis. And of course, practicing it. The meta-analysis he considers his best concerns the homework issue.28 When this topic captured his attention, he secured a grant from the National Science Foundation, hired graduate students as assistants, and went at it full throttle. Defining the universe of studies very broadly, Cooper and his assistants did a huge on-line and library search, then slogged through the bibliographies of retrieved articles and books, hunting for other sources. And that was only the beginning. Cooper recalls:

I had a feeling that in this area there might be a lot of unpublished material, so I wrote to fifty-odd state departments of education and to twenty deans of the most active educational research schools. I also
located an organization of research evaluation specialists in education, got the names of their members in 106 school districts, and wrote to all of them. I asked everyone I wrote to for recent doctoral dissertations or any other forms of research on homework-I requested copies, or abstracts, or mere citations, or whatever they could give me. Most of them helped to some degree, but the state departments were particularly good and sent me a lot of "fugitive literature"-surveys and reports meant for internal use that would never show up on any formal index. I looked at about a thousand documents, read the 250 or so that were worth reading, and used the 120 that were empirical studies of whether or not homework works. The rest were either advocacy pieces or reviews of someone else's work or other stuff that wasn't useful for a meta-analysis. Many of the 120 were by teachers and some of these were very bad in terms of methodology, but I didn't exclude any on grounds of quality. I did, though, code their methodology carefully so I could find out later whether poor studies gave different effects than good studies and, if so, adjust for it. For the same reason, I used both controlled and uncontrolled studies but coded them so I could see later whether that made a difference in the reported effect size, and, if so, take it into account.

Cooper developed a number of hypotheses to be tested by meta-analytic methods. The most obvious was whether students given homework assignments got higher grades than students given none. Another was whether the amount of time spent on homework made a difference in student achievement. A third was whether there were any differences in the effectiveness of homework for boys and for girls.

Cooper's data analysis relied on two standard methods. He calculated average effect sizes-the difference in achievement between treatment and control groups or between two different treatments-and estimated 95 percent confidence intervals for the findings. He also did a homogeneity test to be sure the samples were drawn from the same population.29

The meta-analysis decisively settled some long-debated issues. Students given homework assignments did better than those not given homework assignments by about one-fifth of a standard deviation; this signified that the average student doing homework outperformed about 60 percent of those students not doing homework (a brief worked illustration of this conversion appears at the end of this section). The amount of time spent on homework was even more important: Long assignments yielded almost four-tenths of a standard deviation of improvement over short assignments. Homework was equally beneficial for both sexes.30 All of this answered questions that had simmered in the education field for some time. But one finding that emerged from the meta-analysis was totally unanticipated:
I had routinely called for cross-tabulations by grade level but never expected what I found-that homework is very effective in high school
but almost totally ineffective in elementary school. High school students doing homework average about one-half of a standard deviation higher in achievement than no-homework students, junior high students one-quarter higher, and elementary school students only about one-twentieth higher.31 That was dramatic and absolutely unexpected-a question nobody had ever asked, and a finding I came upon never expecting such a difference.

Nothing in Cooper's sources or meta-analysis indicates why such differences exist but, Cooper speculates, "I think it has to do with children's ability to teach themselves. Elementary school children haven't learned how to study, haven't learned how to learn on their own. High school students have; it's the mental capacity we call 'meta-cognition'-the awareness of our own thinking processes."

Cooper's meta-analysis appeared in short form in the journal Educational Leadership in November 1989 and in detailed form as a book, Homework. The article and book briefly hauled Cooper out of his quiet academic life and into the public arena: He appeared on the Larry King show, was interviewed by a dozen newspaper reporters and radio talk show hosts, and was quoted in various periodicals, including the Wall Street Journal and Reader's Digest. Most important to him was the opportunity to recommend to educators a detailed homework policy based on his results; he set forth the specifics in Homework (1989) and in The Battle over Homework (1994). The most noteworthy recommendations concern his unanticipated finding regarding grade differences: He urged that elementary school students be given homework but that it be short, use materials commonly found in the home, and lead to successful experiences; it should not be expected to improve test scores but merely to help children develop good study habits and positive attitudes toward school, and to acquire the idea that learning takes place at home as well as at school. Not until junior high school should the academic function of homework begin to emerge, and not until high school should homework serve as an extension of the classroom, requiring students independently to integrate skills or different parts of the curriculum.32
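Statements such as "outperformed about 60 percent of the no-homework students" are standardly derived from the normal curve: a treated group whose mean lies d standard deviations above the comparison group's mean has an average member who exceeds a fraction Phi(d) of that comparison group. A minimal sketch of the conversion, using the approximate effect sizes reported above:

    import math

    def share_outperformed(d):
        """Fraction of the comparison group scoring below the average treated student
        (the standard normal cumulative distribution function evaluated at d)."""
        return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

    # effect sizes in standard deviation units, roughly as reported in Cooper's synthesis
    for label, d in [("all students", 0.20), ("high school", 0.50),
                     ("junior high", 0.25), ("elementary school", 0.05)]:
        print(f"{label:18s} d = {d:.2f} -> beats about "
              f"{100 * share_outperformed(d):.0f}% of no-homework students")

For d = 0.2 the figure is about 58 percent, which rounds to the "about 60 percent" quoted above; for the one-twentieth of a standard deviation found in elementary school it is barely above 50 percent, no practical difference at all.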
New Knowledge

As in Cooper's meta-analysis of homework studies, new knowledge, particularly in the form of findings not sought by the original studies, is a frequent product of the meta-analytic process. Sometimes this is
due to the greater statistical power of the meta-analysis, as two examples illustrate:

• Meta-analyses of perinatal care have found, contrary to the accepted findings of small clinical studies, that routine episiotomy (surgical incision of the vulva during birth to enlarge the canal) was not generally beneficial, and have led to recommendations against it.33

• A number of small clinical trials reported that lidocaine was useful to control arrhythmias (dangerous irregularities in the heartbeat). When the trials were meta-analytically combined, however, it turned out that lidocaine actually increased the risk of mortality, at least in certain groups of patients.34

In these two cases, new knowledge came from combining studies, but it also, and perhaps more typically, comes from comparing different studies-that is, searching for the moderator and other variables (differences in the circumstances, personnel, methodology, and subjects of the studies) that might account for discrepancies among the findings. Such comparisons sometimes enable the meta-analyst to make inferences that "go well beyond the original results," according to The Handbook of Research Synthesis.35 Harris Cooper and Larry Hedges, the editors of the Handbook, amplify the point: "Current methods ... permit the testing of hypotheses that may never have been tested in primary studies. For yesterday's synthesist, the variation among studies was a nuisance; it clouded interpretation. For today's synthesist, variety is the spice of life."36 By comparing results across studies that look at the same phenomenon but that involve different methods, researchers, subjects, and other variables, the analyst can test new hypotheses. Betsy Jane Becker, an educational statistician at Michigan State University, sees this as even more exciting than what can be learned by combining studies. "What meta-analysis is really great for," she says, "is what Dick Light has called 'capitalizing on variation.' It's a way of looking at differences in studies and saying, 'Why are these studies different?' And understanding that really does give you the solution to the puzzle."

When studies of any given phenomenon vary considerably or actually contradict each other, meta-analysts will search first for errors and methodological flaws that might be to blame. But if important differences remain after these have been excluded, the meta-analysts try to identify the moderator and mediator variables that might be responsible. Statistical tests of subsamples show whether such a variable, present in some studies and not in others, plays a significant part.
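In practice, such a subsample test often amounts to pooling the effect sizes within each level of the suspected moderator and asking whether the two pooled averages differ by more than chance would allow. The sketch below uses invented effect sizes and variances, not data from any study discussed here, and a simple fixed-effect z-test of the between-group difference.

    import math

    def pooled(effects, variances):
        """Inverse-variance (fixed-effect) weighted mean effect and its variance."""
        weights = [1.0 / v for v in variances]
        mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
        return mean, 1.0 / sum(weights)

    # hypothetical effect sizes (with sampling variances) split by a moderator,
    # say studies conducted under one condition versus another
    group_a = ([0.30, 0.42, 0.25], [0.02, 0.03, 0.02])
    group_b = ([0.05, 0.12, -0.02], [0.02, 0.04, 0.03])

    mean_a, var_a = pooled(*group_a)
    mean_b, var_b = pooled(*group_b)

    # z-test of the difference between the two subgroup means;
    # |z| > 1.96 is significant at the .05 level
    z = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    print(f"group A: {mean_a:.2f}   group B: {mean_b:.2f}   z = {z:.2f}")

If the gap is significant, the moderator variable, and not mere sampling noise, is a plausible explanation for why the studies disagree.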
An example of this type of meta-analytic "by-product" was briefly alluded to in the previous chapter. Alice Eagly and Linda Carli meta-analyzed 148 studies of the persuasibility and tendency to conform of men and women in a variety of group situations. The meta-analysis confirmed the familiar findings of many primary studies that under group pressure women are more easily influenced and more conformist than men. Because there were inconsistencies across the three subgroups of studies-those dealing with persuasion, with conformity under group pressure, and with conformity not under group pressure-Eagly and Carli looked for variables that might help explain them. To their surprise, the sex of the researcher turned out to be strikingly influential: Male authors were much more likely to report that girls and women were more easily influenced and more conformist under group pressure than boys and men.37 Eagly and another colleague, Wendy Wood, then reanalyzed the meta-analysis mentioned earlier in which Judith Hall reported that women were better at decoding nonverbal cues than men. Again they found that the sex of the researcher was an influential variable: Female authors reported larger tendencies for women to be accurate at such decoding than did male authors. In each case, therefore, Eagly and her collaborators made the unexpected finding that researchers tend to report effects that are flattering to their own sex.38 The finding does not mean that such studies are worthless; rather, it provides a basis for adjusting the results to counteract gender-associated bias.

A quite different source of new knowledge in meta-analysis comes from outlier studies-those with outcomes far higher or lower than the majority. The usual attitude toward outliers is that they are pestiferous; as an authoritative biostatistics textbook puts it, "Almost any collection of data will be infested with outliers."39 Even one or two extreme outliers can considerably affect the average of the sample, and if the unusual effect sizes of the outliers are due to recording errors or other methodological flaws, they distort the central truth the meta-analysis would otherwise reveal. One way of dealing with them, therefore, is to excise them from the sample. But there is another possibility: "Careful examination of outliers," writes Frederic Wolf of the University of Michigan, an expert in medical meta-analysis, "can provide important understandings and generate new hypotheses that might otherwise be missed."40 In a meta-analysis of school data, for instance, one researcher focused on the outliers rather than the main effect; by looking at the exceptionally successful schools,
he was able to discover certain features special to them.41 In other cases, an outlier may have an unusual effect size because it combines certain moderator and/or mediator variables that together exert a special "interaction effect." If that combination of variables more often leads to the desired outcome than the individual variables working alone, the meta-analysis has produced new knowledge as to what works best, and why.
Sex and Science Achievement

To hear Becker talk about meta-analysis or to read her abstruse contributions to statistics, you would never suppose that in high school in the 1970s she opted out of chemistry and physics in favor of French and chorus and that she nearly managed to sidestep math-which turned out to be her main interest in life. (Her doctorate is in education, but her specialty is statistical methodology.) Becker, a slim, pert woman who is forty but looks thirty, has recently spent time exploring those decisions in a meta-analysis of the differences between males and females in scientific achievement. In particular, it is commonly argued that females have exactly the same cognitive capacities as males but are steered away from science by parents, teachers, and the male power structure of the sciences. Becker, however, suspected that this opinion was seriously incomplete; had it been correct, she herself would never have become a statistician and meta-analyst. A more complete and objective view, she felt, should encompass the empirical evidence of the "predictors" of science achievement for both boys and girls. A full explanation of how girls and boys make their choices would take into account whether these predictors have the same influence on each sex. Do science-related aptitudes and attitudes impel girls toward scientific achievement as strongly as they do boys? Does persistence in taking science courses have the same impact on each sex? How important is the influence of socializers (parents, teachers, and others) on the scientific achievement of boys and of girls? And so on.42

Becker's opportunity to explore the matter came in 1986, when the Russell Sage Foundation invited her and a handful of other scholars to propose meta-analyses that would provide explanations, rather than just syntheses, of research issues.43 She responded with a plan for using a path model of sex differences in science achievement. As she later wrote,

One approach to the synthesis of studies predicting achievement and persistence in science might be to simply amass all available studies with
those variables as outcomes and to synthesize the existing results for each of the predictor-outcome relationships found in the literature .... A second approach, used here, is to guide the synthesis by the use of conceptual and empirical models. The models were drawn from the literature on science achievement and the literature on social and psychological influences on the development of general achievement behaviors.44
The purpose of using a path model is to picture the successive stages or links in a complex process. The strength of the connections among the stages is indicated by correlations found by research; the net result is to suggest the causal chain of events leading to the outcome, because often it is not possible to know which component in a correlation is cause and which is effect unless one of them clearly occurs first. Becker used two theoretical models of science achievement that had been proposed by other social scientists, one a simple structure of four components with six interconnections, the other a complex structure of eleven components with eighteen interconnections. Her goal was to find out, by meta-analyzing relevant studies, how many of these suggested linkages were validated by correlations. To the extent that they were, she would have an evidence-based, rather than merely hypothetical, explanation of how boys and girls become scientific achievers.

With her grant, Becker hired a research assistant and collected 522 titles of interest, of which only thirty-two, containing thirty-eight studies, met her requirements. But those thirty-eight presented her with an alarming mass of possibilities: From their data on age, grade levels, achievement scores, aptitudes, self-image, socializing influences, and so on, she and her assistants-she now had four-extracted a daunting total of 446 correlations. Through statistical legerdemain of various kinds, Becker combined these correlations into a relative handful. She found five that corresponded to and validated five links of the simple model, four that fit four links of the complex model. The simple model, with the correlations, is shown in figure 3-1. The numbers in parentheses indicate confidence intervals, at the 95 percent confidence level, around the average correlations; those in the upper left corner, for instance, show that for boys the correlation between abilities and interest is fairly strong, lying somewhere between .36 and .42, and for girls somewhat less strong, lying between .24 and .31.
Figure 3-1 Simple Path Model of Predictors of Male and Female Scientific Achievement. [The figure links four components-students' abilities (aptitudes), students' interest in and liking for science (affect), socializers' attitudes and expectations, and science achievement behaviors-and gives, for each link, the 95 percent confidence interval around the average correlation, separately for males (M) and females (F).] Source: Becker and others, in Cook and others (1992), p. 248. Note: Asterisks represent sets of heterogeneous correlations. The number of correlations for each sex is denoted as k. Correlations shown are significant at the 0.05 level.
For girls, the correlation between abilities and actual achievement is stronger (between .31 and .33)-and, intriguingly, virtually the same as that for boys (between .31 and .34). As for the influence of socializers, the confidence intervals are extremely wide and hence the actual correlations are hard to estimate, but the midpoints of the two intervals, which are at least suggestive, are only .15 for boys but .58 for girls. The intricacies of the complex model are beyond the scope of this book; for the curious, however, figure 3-2 shows it with the confidence intervals, at the 95 percent confidence level, entered at the four links to which they apply.
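One common way to produce an averaged correlation with a confidence interval of this kind-not necessarily the exact procedure Becker used-is Fisher's r-to-z transformation: each study's correlation is converted to a z, the z's are averaged with weights based on sample size, an interval is placed around the average, and everything is converted back to the correlation scale. A minimal sketch, with hypothetical correlations and sample sizes rather than Becker's actual data:

    import math

    def average_correlation(rs, ns):
        """Sample-size-weighted average correlation with a 95% confidence interval,
        computed through Fisher's r-to-z transformation."""
        zs = [math.atanh(r) for r in rs]          # r -> z
        weights = [n - 3 for n in ns]             # var(z) is roughly 1 / (n - 3)
        z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
        se = 1.0 / math.sqrt(sum(weights))
        lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
        return math.tanh(z_bar), (math.tanh(lo), math.tanh(hi))   # back to r

    # hypothetical abilities-interest correlations for boys from four studies
    r_avg, (r_lo, r_hi) = average_correlation([0.35, 0.44, 0.38, 0.41], [120, 85, 240, 60])
    print(f"average r = {r_avg:.2f}, 95% CI ({r_lo:.2f}, {r_hi:.2f})")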
Figure 3-2 Complex Path Model of Predictors of Male and Female Scientific Achievement. [The figure links eleven components-the cultural milieu; socializers' behaviors, self-concepts, attitudes, and expectations for the child; the child's differential aptitudes; past events such as grades and standardized test scores; the child's interpretation of past events; the child's perception of socializers' attitudes and expectations; the child's perception of task value; the child's goals and general self-schemata; the child's task-specific beliefs; expectancies; and achievement behaviors (persistence, choice, performance)-and enters 95 percent confidence intervals at the four links validated by the meta-analysis.] Source: Becker and others, in Cook and others (1992), p. 249. Note: Asterisks represent sets of heterogeneous correlations. The number of correlations for each sex is denoted as k. Correlations shown are significant at the 0.05 level.
[Chart: WIC outcome questions arrayed by the quality of studies and the credibility of available information, ranging from conclusive evidence through some or moderate evidence to gaps in knowledge. The questions charted: (1) increase in mean birthweights; (2) decrease in percentage of low-birthweight infants; (3) effects, for high-risk groups and for those participating longer than 6 months, on birthweights; (4) improvement in maternal nutrition; (5) decrease in incidence of anemia in infants and children; (6) decrease in incidence of fetal and neonatal mortality; (7) effects, by length of participation and for high-risk groups, on maternal nutrition, fetal and neonatal mortality, and anemia in infants and children; (8) decrease in incidence of mental retardation in infants and children; (9) effects of the three separate WIC components. Source: U.S. General Accounting Office (1984), p. iii.]
questions are pushed toward the "gaps in knowledge" corner of the chart, indicated by the darker shading.... In sum, the information is insufficient for making any general or conclusive judgments about whether the WIC program is effective or ineffective overall. However, in a limited way, the information indicates the likelihood that WIC has modestly positive effects in some areas. 12
In a press release issued by Helms' committee a few weeks later, Helms was quoted as saying, "GAO has produced a balanced and objective analysis of WIC program studies. Based on the actual evidence, it seems that some past witnesses clearly have exaggerated the effectiveness of the program." The press release, citing the report's qualified conclusions, added, a touch sourly, "It would be much more reassuring to the taxpayers to be more confident that the money spent is actually improving the health of poor women, infants, and children-its intended and worthwhile purpose." Although this seemed to forecast trouble for WIC, the general tenor of both sessions of the hearing-March 15th and April 9th-was amiable and distinctly positive. Senators Helms and Thad Cochran were present at the first hearing, Senator Rudy Boschwitz at the second. The senators, their aides, and the witnesses sat around a conference table in a room of the Russell Senate Office Building, a milieu Senator Helms preferred to the formal courtlike setting so familiar to television viewers in recent years. The first witness was Chelimsky, who briefly gave the main findings of the report. Although only two of the eighteen senators on the committee were present that day and only one at the second session, what may have impressed the others when they read the transcript was that many other witnesses echoed what Chelimsky reported about WIC's influence on birth weights and stressed its importance. Dr. David Rush of Albert Einstein College of Medicine, who was heading a nearly completed five-million-dollar study of WIC for the Department of Agriculture, said he had to apologize because his remarks would be so much like Chelimsky's. Dr. David Paige, professor of maternal and child health at the Johns Hopkins University, said that the effects on birth weights that the PEMD reported had "a major public health implication." Robert Greenstein, director of the Center on Budget and Policy Priorities, a liberal think-tank, said that the PEMD's finding of a decrease in low-birthweight infants by close to 20 percent was of "striking significance."13 In addition to the PEMD appraisal of the existing evidence, the committee heard the pro-WIC testimony of twelve other witnesses and received pro-WIC statements from a number of other experts, plus a report by the Congressional Research Service that judged WIC to be cost effective.
Of all this input, the PEMD meta-analysis carried the most weight with Helms, according to Chelimsky, who told the 1994 National Conference on Research Synthesis, "Our finding that the program was responsible for a 3.9 percent increase in infant birthweight convinced Senator Jesse Helms not to go ahead with plans to reduce or zero the program's funding."14 Undoubtedly, the other testimony played some part in the outcome, but the committee's decision to endorse continued funding for WIC and Congress's vote to follow suit are clearly an instance, and probably the first on record, of meta-analysis having a direct influence on congressional policy-making.
Meta-Analyses and Congressional Policy-Making
There are good reasons why policy-makers are likely to find a meta-analysis, particularly in the social sciences, more persuasive than primary research. One is its political impartiality. As Kenneth Wachter, professor of demography and statistics at the University of California-Berkeley, says,

A lot of social science research, even though it's an attempt to do good science, comes looking tainted by a certain set of political values-liberal or conservative-held by the people doing research in the field. But meta-analysis has the ability to bring studies from all sides together and review them in a quantitative way. That convergence of views gives them some protection from the charge of politicization.
Another reason meta-analyses carry special weight with legislators and agency staffers is that meta-analysts, in looking at the impact of a program, often dig deeper than main effects and ferret out the special effects of moderator and mediator variables. Policy-makers, as a result, hear not only whether it works but an answer to the question, "What works best?"15 Meta-analysts can even build what Mark Lipsey calls "policy models" for difficult social problems, such as drug abuse. A policy model, he explains, is "an interconnected set of statements of relationships that embrace the key variables in the problem" and that enable the analyst and policy-maker to carry out "what if" simulations of the probable results of specific changes in the treatment, the kinds of people administering it, and other projections of the future.16

A noteworthy example of meta-analysis effectively bringing nonpartisan findings to bear on a fiercely debated issue is a series of PEMD reports in the 1980s of studies of the "Bigeye bomb." This was a chemical weapon being developed by the Department of Defense (DOD) that
would produce a lethal agent by combining two nonlethal substances. DOD claimed that its studies demonstrated the potential effectiveness of the new weapon, but Representative Dante Fascell, chairman of the House Committee on Foreign Affairs, wanted a more impartial appraisal and asked PEMD for a meta-analysis of those studies. PEMD's analysis found that the DOD studies cited expert opinions as if they were objective evidence but that they often lacked experimental data. Moreover, the meta-analysis of such objective evidence as existed found serious technical uncertainties as to the reliability and effectiveness of the proposed weapon. PEMD's initial report-it conducted several-startled, alarmed, and enlightened many members of Congress. Representative Fascell later wrote to Chelimsky that PEMD's work "proved to be a crucial factor in the formulation of U.S. policy," leading to high-level negotiations and agreements with the Soviet Union about chemical weapons, the elimination by Congress of all funds for such weapons, and the eventual cancellation of the Bigeye bomb program by the DOD.17

The breadth of vision a meta-analysis can offer policy-makers is exemplified by the impact of a PEMD report on a bill affecting the nation's supply of low-cost rental housing. In December 1989, Representative Henry Gonzalez, chairman of a subcommittee of the House Committee on Banking, Finance, and Urban Affairs, asked for a PEMD meta-analysis of the probable effects of a proposed change in the National Housing Act. The change would have enabled owners of more than a third of a million housing units to prepay their federally insured mortgages; if enough chose to do so, and then opted to sell their houses at a profit, the national supply of low-rental properties would undergo considerable shrinkage. A "windfall profits test" was supposed to control the shrinkage, but existing studies, made in different regions of the country, differed widely as to the likely effect of the proposed change. The meta-analysis portrayed the probable impact of the proposed change as no single study had done and as no mere review of the studies could do. PEMD concluded that it was very uncertain whether the "windfall profits test" could control the loss of rental housing and that the potential loss was serious enough to warrant revising the windfall profits test guidelines. These findings, presented at a hearing on the new housing bill, impressed Representative Gonzalez enough that he changed the bill then and there along the lines of PEMD's recommendations.18

Meta-analytic evidence has also played an important role in passing new legislation aimed at the public's general well-being, even when it conflicts with partisan and commercial interests. A particularly striking instance is PEMD's study of the effects of drinking-age laws. Legislation
enacted in 1984 required that a portion of federal highway funds be withheld from states that did not raise the minimum drinking age to twenty-one. Most states complied, but some state legislatures, pressured by beer lobbyists, resisted, maintaining that raising the drinking age did little to reduce driving accidents. Some other states that had passed such laws were considering repealing them, and South Dakota sued to overturn the federal requirement, claiming that it was unconstitutional.19

In late 1985, Representative James Oberstar, chairman of a subcommittee of the House Committee on Public Works and Transportation, asked PEMD for a meta-analysis of the effects of drinking-age laws on traffic accident rates. PEMD collected more than four hundred relevant documents and selected fourteen that met all its criteria for inclusion, among them five kinds of outcome measure, such as driver death, driver injury, and total fatalities. The meta-analysis, released in March 1987, found that raising the drinking age to twenty-one unquestionably and significantly reduced alcohol-related accidents among youths.20 The U.S. Supreme Court used the meta-analysis in 1987 to dismiss questions as to the effectiveness of such laws; then, reviewing South Dakota's case, it ruled that the federal law was constitutional. The remaining noncomplying states thereupon enacted drinking-age laws and those considering whether to repeal theirs abandoned those plans. The National Highway Traffic Safety Administration publicly credited PEMD with convincing the states and the Supreme Court of the effectiveness of drinking-age laws.21

These odds and ends of evidence certainly suggest, though they do not prove, that PEMD meta-analyses influence the decisions of policy-makers. Since no rigorous study of the matter exists, we have only the subjective impressions of those involved to bolster that view, though admittedly these are skewed by temperament and general outlook. For example, York (the codirector of PEMD when I visited him), elegantly tall, slim, white-haired, and low-keyed, said,

Staff people read our reports first, and it's their job to pass on the gist of the reports to their bosses. But I don't really think they pay more attention to an evaluation synthesis than to other kinds of studies. And very few congressmen and congresswomen actually read more than the executive summary. We try to convince ourselves that the person who reads our evaluation synthesis will be overwhelmed by the power and majesty of our presentation, but that happens in our dreams, not in reality. To be realistic, I'd say that the policy impact of our evaluation syntheses is modest.
Chelimsky, spirited and chronically cheerful, was considerably more upbeat:

Congressmen and congresswomen are political people, and their understanding of your work has to do with how relevant it is to their constituencies, how much money it involves, and other power-type questions. But we have had a major effect in a number of cases-the Bigeye bomb case, the drinking-age-law case, WIC, and others. From a policy point of view, meta-analysis is the answer to all those policy-makers who say, "I saw this tried in Podunk and therefore I think it won't work" or "It worked in Podunk, so I think it would be good for the United States as a whole." And it's also very useful for informing policy before it's started; we've had that kind of impact a number of times-the low-cost housing bill is a good case in point.
She summed up the policy influence of meta-analysis in her keynote address to the 1994 National Conference on Research Synthesis in these words:

The use of [our] findings has been appropriate and considerable; new agency research has been commissioned or mandated when information was missing; sound studies have been used and reused; and finally, the synthesis has not-as we had feared-driven out congressional requests for studies requiring original data collection: on the contrary, I'd say it has actually sharpened the congressional appetite for impact evaluation.22
A postscript: In mid-1996 PEMD was disbanded and its staff redistributed throughout the GAO as a result of a tightened budget made necessary by an economy-minded Republican-dominated Congress. Wherever they are now located, however, GAO staffers who have worked on meta-analyses will still be able to conduct others when requested to do so by Congress.
IN BRIEF ... Lumpectomy Versus Mastectomy

Early in 1993, Judith Droitcour, a PEMD staff member, was discussing breast cancer research with George Silberman, a senior staffer with whom she had often worked. Both were interested in the subject, Droitcour because her sister-in-law had recently been diagnosed with it, Silberman because, having done a number of cancer studies, he was in the
habit of keeping up with new developments in the field. Droitcour, an attractive, birdlike little woman, recalled for me how their conversation led to a PEMD meta-analysis:

We talked about the fact that large numbers of women are diagnosed with breast cancer every year and that many of them have to make a decision between lumpectomy and mastectomy but that it's difficult to do so because there's some controversy about the matter. We knew that there had been randomized clinical trials and that experts had said in 1990 that patient survival rates following lumpectomy and mastectomy were equivalent, but the trials had been conducted in cancer centers and it wasn't at all certain that the results of the treatments were the same in day-to-day general medical practice. That seemed an important question and one that could best be answered by a "cross-design synthesis"-a combination of a meta-analysis of clinical studies and data from case records of medical practice in the community.
(The previous year, Droitcour and Silberman had worked out the methodology of cross-design synthesis, and GAO had published a 121-page report on it written by them.23) With Chelimsky's approval, Droitcour took on a cross-design synthesis of breast conservation versus mastectomy. She worked mostly alone, though she got some help from another staffer, Eric Larson, and from half a dozen other PEMD staffers and eighteen outside oncologists, meta-analysts, and statisticians. When the study was nearing completion, a member of GAO's Office of Congressional Relations described the project to staff members of the Subcommittee on Human Resources of the House Committee on Government Operations. Because that committee oversees the activities of, and grants made by, the National Cancer Institute, the subcommittee chairman formally requested the report. To gather her data, Droitcour, aided by a librarian, did an on-line data base search for randomized studies conducted in major cancer centers. She looked for studies that compared the five-year results of the two treatments on node-negative patients (patients showing no evidence of metastases to the lymph nodes) and that included radiation as part of the treatment. Droitcour also wrote to a number of breast cancer researchers to make sure there were no other studies she might have missed. Only six studies met all the criteria-three in single centers, three in multiple associated centers-but all were of high quality and taken together had a total of nearly 2,500 patients. In each of the six, lumpectomy and mastectomy patients had similar survival rates, and in all six
the odds ratios of both treatments were close to one, suggesting equal effectiveness of the two. The patients in the studies were all seventy or younger and, for the most part, had tumors measuring four centimeters or less in size. Droitcour, with the guidance of several of her expert advisers and the help of an advanced software program, synthesized the data to get precise statistical estimates of the treatment effects. The combined odds ratio for patients treated in cancer centers was 1.05, which indicated a minuscule advantage for lumpectomy; actually, there was no difference in five-year survival rates for the two kinds of treatment.24 (A rough sketch of how such a pooled odds ratio is computed appears at the end of this section.)

Then, for data on patients in day-to-day general medical practice in the United States, Droitcour turned to SEER (an acronym for Surveillance, Epidemiology, and End Results), a data base maintained by the National Cancer Institute. SEER is an up-to-date compilation of almost all cancer cases in five states (Connecticut, Hawaii, Iowa, New Mexico, and Utah) and four metropolitan areas (Atlanta, Detroit, San Francisco-Oakland, and Seattle-Puget Sound); these nine areas are varied enough in population to effectively represent the nation. From SEER's huge data base, Droitcour was able to extract cases that matched those in her combined set of randomized clinical trials: The computer came up with 5,326 women aged seventy or younger who had had node-negative breast cancer with tumors four centimeters or smaller in size. About one-fifth had had lumpectomy plus radiation and nodal dissection; the rest had had mastectomy. All of them had been followed for five years, or until death if it occurred sooner.

Some of these patients, however, were more likely to have been offered lumpectomy by their doctors than others and to have selected it-younger women, women living in sophisticated areas, and so on-so it was not appropriate simply to calculate the survival rate for all lumpectomy cases and compare it with the survival rate for all mastectomy cases. Droitcour had to sort the women into five groups according to how likely they were to have received lumpectomy, whether or not they actually got it. In the least likely fifth were women close to seventy, those diagnosed as early as 1983, those living in Iowa, and those with tumors close to four centimeters in size; in the most likely fifth were younger women, those more recently diagnosed, those living in San Francisco-Oakland, and so on. At that point Droitcour could make a valid comparison of the outcomes of lumpectomy and mastectomy, since her survival rates were now free of the selection bias arising from group differences in lumpectomy rates. She then combined the lumpectomy rates and the mastectomy
rates she had calculated for each category within each sample; the combined rates for the two forms of treatment turned out to be very similar, there being only a six-tenths of 1 percent better survival rate for mastectomy, a nonsignificant difference.25 The crucial next step was to compare these SEER cases with those of the randomized clinical trials. Since the clinical trials had been conducted in cancer centers and the SEER cases came from general medical practices, one might suppose that the results of lumpectomies in clinical trials would be superior to those found in general practice. But the comparison offered genuine hope to all women diagnosed early with breast cancer. As Droitcour summed up the findings in her report,

Our three-step analysis indicated that ... the effectiveness of breast-conservation therapy has, on average, been similar to that of mastectomy in community medical practice as well as in randomized studies. Specifically, for medical practice cases, the adjusted 5-year survival rates ... were 86.3 percent for breast-conservation patients and 86.9 percent for mastectomy patients. These results clearly correspond to the results of multicenter randomized studies (88 percent 5-year survival for breast conservation and 88 percent for mastectomy). Single-center studies reported somewhat higher survival for both treatment groups. Thus, on average, for breast cancer patients of physicians in regular medical practice who are similar to patients in randomized studies, there appears to be no appreciable risk associated with selecting breast-conservation therapy rather than mastectomy.26
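The pooled odds ratio mentioned earlier is the kind of figure produced by inverse-variance weighting of log odds ratios, a workhorse technique of medical meta-analysis. The sketch below is illustrative only: the counts are invented stand-ins, not the data of the six randomized trials, and not necessarily the exact estimator the report used.

    import math

    def combined_odds_ratio(tables):
        """Fixed-effect pooling of 2x2 tables by inverse-variance weighting
        of the log odds ratios."""
        log_ors, weights = [], []
        for a, b, c, d in tables:        # a, b = deaths, survivors in one arm; c, d in the other
            log_or = math.log((a * d) / (b * c))
            var = 1/a + 1/b + 1/c + 1/d  # Woolf's estimate of the variance of log OR
            log_ors.append(log_or)
            weights.append(1.0 / var)
        pooled = sum(w * lo for w, lo in zip(weights, log_ors)) / sum(weights)
        return math.exp(pooled)

    # hypothetical five-year death/survival counts, lumpectomy arm vs. mastectomy arm
    trials = [(30, 170, 28, 172), (45, 355, 47, 353), (22, 178, 25, 175)]
    print(f"combined odds ratio: {combined_odds_ratio(trials):.2f}")

A combined odds ratio near one, as in the report, says that after each trial is weighted by its precision, the odds of the outcome are essentially the same in the two treatment arms.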
Into the Black Hole

The breast conservation report was issued on November 15, 1995, but as of late 1996 the Subcommittee on Human Resources of the House Committee on Government Operations, to which the report was submitted, had made no recommendations to Congress or congressional agencies based on it. From that standpoint, the PEMD report, like a small star sucked into the unappeasable maw of a black hole, had disappeared without a trace. To be sure, PEMD had included no policy recommendations in its report, but the lack of response may also have been due to the fact that the report was delivered less than two weeks after the national election that resulted in Republican domination of both houses of Congress, which brought about the reshuffling of committee chairmanships and staffing; a number of the staffers who knew most about the breast conservation report had gone, and the new chairman's staffers knew nothing about it and had many other things on their minds.
From another standpoint, however, the report had not vanished but had been noticed by, and had an impact on, several nonfederal audiences: physicians treating early breast cancer, breast cancer patients, and health researchers concerned with ways of assessing the outcomes of alternate medical treatments. The National Cancer Institute invited Droitcour to present her findings at a national cancer conference and sent out copies of the report to cancer organizations around the country. The knowledge yielded by her study has been seeping down through the oncological community to general medical practitioners and, to some extent, to the newspaper-reading and television-watching public. It has therefore probably had some influence-albeit unmeasurable-on the National Cancer Institute's official position on lumpectomy and on the policies of cancer centers and oncologists throughout the country. But it has had no impact on congressionally established policy to date, and may never have.

That is the fate of a fair number of PEMD meta-analyses, for various reasons. One is, as with the breast conservation report, that a requested meta-analysis may be completed just when compelling political circumstances preempt the attention of potentially interested legislators, the result being that the pertinent legislation is mothballed. By the time the legislation is again discussed, the report either has been forgotten or appears dated. Another reason a requested meta-analysis may have little or no impact on policy-making is that representatives and senators are rarely moved to action solely by scientific evidence; loyalty to their constituencies is often a stronger influence. As Chelimsky wrote in Science a few years ago: "The relation between researchers and decision-makers remains one of inherently imperfect understanding, based as it is on the uneasy juxtaposition of different kinds of rationality and the dominance of politics over scientific logic in democratic societies."27 Congressional committee members may choose to ignore a report whose conclusions run counter to their stance or belittle it by invoking moral values, the basic principles of their party, or the views of one of the Founding Fathers. Or they may play the bumpkin role, jesting that the statistical analysis in the report is beyond their comprehension. Or with a fine disregard for the methodology of the study, they may base their proposals and proffered amendments on the few items in the report that support their position, carefully ignoring those that do not.28 Moreover, legislators, often under pressure to reach decisions and set policies, may be impatient with an exhaustive but still inconclusive meta-analysis; little wonder if they ignore a study that says no good evidence yet exists and offers no policy guidance.29
Worse still for the meta-analyst: If a report says what a legislator in a powerful position-a committee chairperson, for instance-does not want to hear, he or she may, rather than blast away at it, simply bury it without obituary or funeral. Several years ago, a meta-analysis that found the opposite of what a senator chairing a committee advocated vanished after delivery. It was never mentioned in the committee's deliberations, let alone at its hearings, and has never been heard of elsewhere.
Smoke Gets in Your Eyes-and Lungs, and That's Nothing to Sing About

In 1964 Surgeon General Luther L. Terry issued an impressive report marshaling evidence that cigarette smoking is a cause of lung cancer and chronic bronchitis; the tobacco industry's Tobacco Research Council counterattacked by funding a number of studies that, not surprisingly, discounted the risks of smoking. Congress, faced with conflicting evidence and, more important, immobilized by the tobacco industry's contributions to political campaigns and the quid pro quo relationships between tobacco-state congressmen and their colleagues, did nothing to limit smoking for many years. Not until the 1970s did the antismoking forces win a victory, and then only a minor one, a law requiring warning labels on cigarette packs.

With Congress so reluctant to act against smoking-the cause of an estimated 434,000 deaths each year, 40 percent of all cancer deaths30-it is not surprising that it was even less inclined to act against "passive smoking" (breathing air contaminated by the smoke of others). It was not for lack of evidence, however: In 1986 the National Research Council and the Surgeon General independently issued reports that found "environmental tobacco smoke" (ETS)-passive smoking-to be a cause of lung cancer in adult nonsmokers and of respiratory ailments in children.31 The tobacco industry again responded by citing other studies-in this round, not funded by it-which contradicted those findings; as mentioned in chapter 1, two studies published in 1984 actually found less lung cancer among people exposed to ETS than among those not exposed. But in 1988 a meta-analysis by an independent risk-assessment organization combined the hard-to-believe 1984 studies with fifteen others and concluded that ETS increased the risk of lung cancer by fifty percent in women and more than doubled it in men.32 Still Congress took no action.
But another avenue exists via which evidence obtained by meta-analytic techniques can affect policy, namely, regulatory agencies to which Congress has delegated power, in this case, the U.S. Environmental Protection Agency (EPA). In 1986 Congress passed the Superfund amendment to the Radon Gas and Indoor Air Quality Research Act, authorizing EPA to provide "information and guidance" on the potential hazards of indoor air pollution. On those grounds, James Repace, a senior policy analyst in the Indoor Air Division of EPA, suggested to his superiors that if EPA conducted its own scientific study of the risks of ETS, it could legitimately put forth a policy on passive smoking. Robert Axelrad, director of the Indoor Air Division, welcomed the suggestion, provided the major funding for such a study in 1988, and named Steven Bayard, a biostatistician in the EPA's Office of Research and Development, its technical director. Bayard, unfamiliar with the scientific literature on ETS, at first doubted that there could be a link between second-hand smoke and cancer.33 Because he thought that lung cancer studies, taken individually, had very little power to detect the effects of ETS, since virtually everyone is exposed to at least some second-hand smoke, he was more inclined to believe that the combined data of a number of studies would yield a believable conclusion.

Bayard assembled a staff of seven highly qualified people: two EPA staffers, an epidemiologist at Yale, a medical researcher at the University of Arizona, and three staff members of his prime contractor, an engineering and risk-assessment research firm in Chapel Hill, North Carolina. He also lined up eighteen specialists at universities and health agencies to serve as reviewers and consultants. The team then exhaustively searched the scientific literature for research on the effects of ETS, ending up with thirty studies of ETS and lung cancer in adults plus more than fifty other recent studies of ETS and respiratory disorders in children. Because the thirty lung cancer studies, despite having been conducted in eight different countries, used similar methodologies and measures of effect, the team's statisticians were able to meta-analyze them. They also separately analyzed a subset of the higher-quality studies and another of the eleven studies conducted in the United States. For all these poolings, they compared the estimates of the risk for death from lung cancer among nonsmoking wives exposed to their husbands' smoke with that of nonsmoking wives not so exposed. These and other comparisons filled seventy-eight tables of data in the five-hundred-plus-page report. Three of the studies showed, contrary to common sense, that passive smokers had less chance of getting lung cancer than other
nonsmokers, but when all the studies were meta-analytically combined, the results were quite clear-and totally overcame Bayard's doubts. After extensive public and expert review, the report was released in January 1993. Some of its most striking conclusions:
• Based on the pooled data of the eleven studies conducted in the United States, the risk of lung cancer death in women nonsmokers exposed to their husbands' smoke was 19 percent greater than the risk for women nonsmokers with nonsmoking husbands. When the eight higher-quality studies were used, the risk was 22 percent greater. The differences were statistically significant.34
• Based on the seven U.S. studies that had data on men, the risk for male nonsmokers exposed to their wives' smoke was 40 percent higher than for male nonsmokers not so exposed; based on eleven studies from various countries, it was 60 percent higher. Because far fewer men than women had been studied, the results were statistically less certain; even so, the team concluded, the relative risks for men were at least as great as, and probably greater than, those for women.35
• While not all studies showed a statistically significant association between ETS and lung cancer, the pooled results of all of the studies did. The risks in other countries and places, among them Greece, Japan, and Hong Kong, were much greater than those in the United States. The four studies from Western Europe, when combined, also showed a positive correlation, but not at a statistically significant level. Only the four studies from China showed no such connection when combined, but many of the Chinese homes were so polluted by heating with smoky coal and by smoky indoor cooking that any ETS effect would be masked.36
The report's summary presented these important estimates of the effects of ETS:37
• ETS is responsible for about three thousand lung cancer deaths annually in U.S. nonsmokers.
• ETS annually causes an estimated 150,000 to 300,000 cases of lower respiratory tract infections, such as bronchitis and pneumonia, among American infants and young children up to eighteen months of age.
• ETS exposure causes upper respiratory tract irritation and a small but significant reduction in lung function.
• In anywhere from 200,000 to 1,000,000 U.S. children with asthma, ETS causes additional episodes and more severe symptoms. In children who have not yet shown symptoms of asthma, ETS increases the risk of developing it.
In a foreword to the EPA report, Louis W. Sullivan, secretary of the U.S. Department of Health and Human Services, and William K. Reilly, administrator of EPA, wrote,
[This report] provides important new documentation of the emerging scientific consensus that tobacco smoke is not just a health risk for smokers. It is, in fact, also a significant risk for nonsmokers, particularly for children. This report demonstrates conclusively that environmental tobacco smoke increases the risk of lung cancer in healthy nonsmokers. The report estimates that roughly 30 percent of all lung cancers caused by factors other than smoking are attributable to exposure to environmental tobacco smoke. Put another way, a nonsmoker exposed to environmental tobacco smoke during everyday activities faces an increased lifetime risk of lung cancer of roughly 1-in-500 to 1-in-1,000. By comparison, EPA generally sets its standards or regulations so that increased cancer risks are below 1-in-10,000 to 1-in-a-million. In other words, estimated lung cancer risks associated with environmental tobacco smoke are more than ten times greater than the cancer risks which would normally elicit an action by EPA.38
EPA did, in fact, take action half a year later. In July 1993, it announced guidelines on smoking in public buildings. All companies and agencies operating public buildings were asked to either ban smoking or use ventilation to assure that people are protected from other people's smoke.39 By then, EPA's determination that ETS was a cancer hazard was having multiple policy effects: A number of states and cities had enacted or were considering controls over smoking in public places; the Occupational Safety and Health Administration (OSHA) had proposed banning smoking in six million workplaces nationwide; DOD had prohibited smoking in its workplaces worldwide; and hundreds of private real estate companies, restaurateurs, operators of fast-food eateries, and other entrepreneurs were making efforts to control ETS in order to avoid lawsuits based on EPA's finding.40 The tobacco industry, accustomed to fighting, struck back: In a June 1993 hometown filing in Winston-Salem, North Carolina, R.J. Reynolds sued the EPA in an effort to overturn EPA's designation of ETS as a known human carcinogen, claiming that EPA had manipulated its scientific studies and ignored accepted statistical practices in order to arrive at its risk assessment.41 Reynolds and Philip Morris conducted massive advertising campaigns in newspapers, attacking both EPA's science and its motives and depicting smokers as victims of government oppression. Another attack on the EPA's findings came from a different quarter. The Congressional Research Service, in response to a congressional re-
quest, prepared an economic analysis of the Clinton administration's proposal to fund health care reform using new cigarette taxes; the authors of the study, two economists, concluded that the tax would be relatively ineffective, and then, curiously, went beyond the bounds of their assignment to attack the EPA study of passive smoking, citing primarily studies funded by the tobacco industry.42 It is worth noting that one of those studies found the levels of ETS in a number of public places to be relatively low-and that Representative Henry Waxman, chairman of the House Health and Environment subcommittee at that time, revealed that three workers on the study told people on his staff that their superiors had altered their data and consistently underreported the measurements of cigarette smoke.43 But the tobacco industry's attacks and the Congressional Research Service report did not reverse or stay the course of events. The National Research Council's findings, supported by the EPA's meta-analysis, have been the propelling force behind a steady series of legislative and nonlegislative moves to control and eliminate ETS in public places. Angry smokers have put on a few half-hearted demonstrations, and a few hotheads have defied the ban on smoking in airplanes or restaurants (and gotten themselves arrested for their pains). The resistance has come to little; meta-analytic knowledge has prevailed.
"More Ways than One to Skin a Cat"-Mark Twain As the case of passive smoking shows, meta-analysis can exert an important influence on social policies by means other than legislative acts. The chief such mechanisms are the establishing of policies and practices by agencies to which Congress, state legislatures, and city governments have aSSigned the authority to do so. An impressive example is the publication of medical guidelines by the Agency for Health Care Policy and Research (AHCPR). This agency, a branch of the Public Health Service, was established in December 1989 by Public Law 101-239 to "conduct and support general health services research." As part of its broad mandate, AHCPR publishes reports, for the benefit of health care providers, policy-makers, and the public, of the best and latest research on various medical problems and treatments. The reports come in two forms: Clinical Practice Guidelines, which are substantial volumes on a number of subjects, including management of cancer pain, diagnosis and treatment of benign prostatic enlargement, and the treatment of major depression; and Quick Reference Guides for
Clinicians, which are slim brochures presenting the same material in abbreviated form. Each Guideline is the work of a panel of independent experts commissioned by AHCPR; the panelists gather information in a number of ways, the major one being a literature review that results in a meta-analysis if enough experimental studies are found. A typical example is the panel that produced Management of Cancer Pain (Guideline Number 9); it searched nineteen data bases, screened 9,600 citations, and, in a meta-analysis of 550 of them, appraised the effectiveness of a number of interventions.44 (The Guidelines run anywhere from about 100 to 260 pages, the Quick Reference Guides from twelve to forty pages. The seventeen Guidelines and seventeen Quick Reference Guides published thus far are available free to practitioners, scientists, educators, and "consumers" [patients]; they can also be downloaded free from the National Library of Medicine's data base, Health Services Technology Assessment Text.) A similar program exists in the United Kingdom: Teams at two centers established by the National Health Service, one in Oxford and one in York, prepare "systematic reviews"-their term for meta-analyses-of existing information on medical subjects and make their findings available to physicians and other interested parties.45 In addition to the AHCPR, various institutes within the National Institutes of Health are beginning to base their advisory notices to the medical and scientific communities on meta-analyses. The National Heart, Lung, and Blood Institute, for one, recently issued a warning that the short-acting form of nifedipine, a calcium channel blocker employed to control high blood pressure and heart disease, should be used "with great caution, if at all." Although the Heart Institute's warning was not binding on doctors, it undoubtedly carried considerable weight with them. It was based on a meta-analysis conducted for the institute by a team at the Bowman Gray School of Medicine in Winston-Salem headed by Dr. Curt D. Furberg, chairman of the department of health sciences. The team meta-analyzed sixteen clinical trials involving 8,350 heart attack patients given nifedipine. The meta-analysis showed that during the period covered, patients on high doses of the drug had an average risk of death at least twice that of similar patients receiving either a placebo or no drug.46 The meta-analytic findings about nifedipine are exerting an influence on policy in a second way: Dr. Robert Temple, a top official at the Food and Drug Administration (FDA), said that the study provided credible data and that the FDA would be looking at it closely and considering a change in labeling for nifedipine.47 The FDA itself has recently
begun to use meta-analytic findings in some of its reviews of new drug applications. Normally, the FDA reviews randomized controlled clinical studies submitted by the drug companies wishing to market a new drug; when the findings of the studies are inconsistent, the FDA has begun using meta-analysis to resolve the uncertainties and reach decisions about the safety of the new drugs and establish labeling requirements.48 A number of the meta-analyses presented in this book appear to be having some influence-"appear," because no measurements of it exist-on policies and practices of America's schools, hospitals, state welfare programs, mental health clinics, courts, prisons, and other institutions. The influence of meta-analytic findings has been exerted through legislative measures, through regulatory policies, and through voluntary changes in behavior, the moral equivalent of policy. That, at least, is the impression of knowledgeable people interested in meta-analysis and policy-making. There is also psychological reason to believe it correct: Information is known to change attitudes, and attitude change is known to be the precursor of behavior change. With that, let us rest the case for the influence of meta-analysis on policy and conclude with a few speculations about the future of this powerful new method of extracting the truth from the thorny and resistant fruits of research.
Chapter 7
Epilogue: The Future of Meta-Analysis
Elixir of Forecast: Take Minimal Dose
How will meta-analysis develop from this point on? It is a question one may well be wary of answering, for predictors of the future, particularly of social phenomena, often turn out to be embarrassingly wrong. Herbert Hoover assured Americans in 1932 that "prosperity is just around the corner"; it turned that corner fifteen years later. Hitler said that his Third Reich would last a thousand years; it lasted a dozen. As Francis Bacon tartly commented, predictions are good only for winter talk by the fireside. Well, yes and no. Actually, anyone can quite accurately predict the weather five minutes hence, but even the highly computerized National Weather Service does poorly when forecasting a week ahead. To say how meta-analysis will develop and what role it will play in science a generation from now would be mere guesswork, but one can be fairly certain how it is likely to develop in the next several years. There are good grounds on which to base these short-range extrapolations. The best evidence is the remarkable growth of meta-analyses in scientific journals (none in 1977, nearly 400 in 1994) and in data banks (none in 1977, nearly 3,500 in 1994). The leveling off of that growth in the last several years suggests that meta-analysis has found and is filling its proper niche in contemporary science. The foregoing chapters have tried to highlight its achievements in fields ranging from biology and medicine to education and social psychology. What meta-analysis will be able to accomplish in the near future follows logically from them; it is almost certain that meta-analysts will continue to improve the precision and certainty of their findings and be increasingly capable of determining causal connections. The American Statistical Association, hardly a hotbed of enthusiasts and radicals, sees the methodology as having the potential to revolutionize how research, particularly in medicine, is done,1 and in an article about meta-analysis in Science, the statistician and meta-analyst Frank Schmidt of
the University of Iowa is cited as predicting that meta-analysis could transform research in the behavioral sciences.2 Another indicator of the short-term future of meta-analysis is the waning of disbelief in and opposition to it. No longer do journal editors reject a meta-analysis on the grounds that it is "not original research," and no longer are speakers who present meta-analytic findings at scientific symposia scoffed at. Even the usually arthritic and rule-bound agencies of the federal government are beginning to recognize the value of meta-analysis, and some even to practice it; that growing acceptance is likely to continue over the next several years. To be sure, there still are some nonbelievers and critics. H. J. Eysenck is doing business at the same old stand and repeating all his well-worn castigations-"garbage in, garbage out," "apples and oranges," and so on. When I recently visited him in London, he even upgraded his old epithet for meta-analysis, "mega-silliness," to "mega-imbecility." In fairness to him, however, it should be said that he does not initiate these attacks on meta-analysis. "It's a sideline, as far as I'm concerned," he said. "I get invited by various journals to write on meta-analysis because everybody seems to be for it and they're looking for somebody who's against it. So they come to me." A few American critics of meta-analysis are as dead-set against it as Eysenck. In an article in the American Journal of Epidemiology, Samuel Shapiro of the Slone Epidemiology Unit, Brookline, Massachusetts, sneeringly characterizes it as follows: "Meta-analysis begins with scientific studies, usually performed by academics or government agencies, and sometimes incomplete or disputed. The data from the studies are then run through computer models of bewildering complexity, which produce results of implausible precision."3 In American Psychologist, David Sohn, a psychologist at the University of North Carolina, is even more caustic and dismissive. As quoted earlier, he argues that primary research is the only valid way to make discoveries, rates as simply farcical the claim that meta-analysis is a superior mechanism of discovery, and scorns the notion that "the process of arriving at truth is mediated by a literature review."4 But his identification of meta-analysis with literature reviews is itself farcical; anyone who has even a nodding acquaintance with the subject knows that meta-analysis is a far different, more rigorous, and statistically objective procedure for extracting truths from a mass of diverse reports. But not many such voices are raised against meta-analysis today. Sounder and more useful criticisms of, and predictions about, meta-analysis come from those who support and use it while pointing out its
pitfalls and shortcomings. They do not scoff at the "apples and oranges" criticism, for instance, but specify how and when it applies, and how and when it does not. They recognize the risk of publication bias but go to great pains to minimize it and to test for it. They admit the multiple threats to validity in meta-analysis but lay out the procedures by which those threats can be circumvented. They freely acknowledge the many kinds of error or distortion possible in meta-analysis and prescribe remedies for each. Robert Rosenthal spends an entire chapter of his Meta-Analytic Procedures for Social Research on the principal weaknesses of meta-analysis and ways to deal with them. Harris Cooper and Larry Hedges, editors of the ponderous and authoritative Handbook of Research Synthesis, allocate half a dozen of its thirty-two chapters to the hazards of meta-analytic methodology and how to avoid them, and in their concluding remarks they foresee meta-analysis, incorporating all these precautionary procedures, as becoming part of the standard armamentarium of scientific research: There is no reason to believe that carrying out a sound research synthesis is any more complex than carrying out sound primary research. ... When research synthesis procedures have become as familiar as primary research procedures much of what now seems overwhelming will come to be viewed as difficult, but manageable.5
A Few Specific Prophecies
The future is bound to bring numerous minor refinements and innovations in meta-analytic procedures, particularly the statistical ones. Many have been suggested in this book, but some larger and perhaps more important trends are also clear enough to permit credible short-term predictions. Four are worth noting:
Finer slices of the information pie: Founding Father Gene Glass, whose original emphasis was on the main effect, now says,
What I've come to think meta-analysis really is-or, rather, what it ought to be-is not little single-number summaries such as "This is what psychotherapy's effect is" but a whole array of study results that show how relationships between treatment and outcome change as a function of all sorts of other conditions-the age of the people in treatment, what kinds of problems they had, the training of the therapist, how long after therapy you're measuring change, and so on. That's what we really want to get-a total portrait of all those changes and shifts, a complicated landscape rather than a single central point. That would be the best contribution we could make.
We have, in fact, already seen plentiful evidence in previous chapters of a trend in that direction-the increasing emphasis on seeking out not only main effects but the special relationships between moderator variables, mediator variables, and outcomes that account for the differing results of studies of the same subject. This kind of fine-toothed analysis yields findings as important as the central effect, namely, causal explanations and findings as to what variants of the treatment work best and why. Continuously updated meta-analytic findings: Joseph Lau and Thomas Chalmers, as we have seen, have campaigned for the more rapid adoption by physicians of meta-analyses of clinical trials and, to make that more feasible, developed the technique of cumulative meta-analysis. Taking that concept a giant step further, in 1992 Lau, Chalmers, and four colleagues proposed to the AHCPR that it fund their research in developing a "Real-Time Meta-Analysis System" (RTMAS) and were awarded a $1.2 million, three-year grant. By "real-time" Lau and his colleagues mean that their system will provide medical leaders and practicing physicians with updated meta-analyses of drug trials as fast as new studies are completed. In contrast, the present dissemination of meta-analytic data on new treatment outcomes is slow and tortuous; years can elapse between the completion of an important clinical trial and its incorporation and publication in a meta-analysis. The RTMAS, when fully developed, will solve this problem: It will be a methodology, fully translated into a systematic software program, for promptly evaluating and processing the data of new controlled clinical trials and then incorporating those data into one or many ongoing and always up-to-date cumulative meta-analyses. It will be, Lau says, "a research tool to do meta-analyses more efficiently and reproducibly via standardization of data extraction and analytic methods, while maintaining flexibility and the 'recycling' of previous research efforts."6 The Cochrane Collaboration: A different and even more ambitious use of meta-analysis is already being made by the Cochrane Collaboration. This organization, as mentioned in chapter 4, is an international network of two dozen "collaborative review groups"-teams of volunteers in each of many health care specialties who keep track of the latest findings in their fields, evaluate them, and incorporate them in systematic reviews and, when feasible and appropriate, meta-analyses. The original data and the pooled results are published electronically through The Cochrane Database of Systematic Reviews, available on disk, CD-ROM, and on-line. The system enables doctors to keep easily abreast of the current state of knowledge.
The Cochrane Collaboration and its products operate through Cochrane centers in several countries. The UK Cochrane Centre at Oxford is funded chiefly by the U.K.'s National Health Service Research and Development Programme; the Cochrane centers in other countries are funded by various public agencies and, to a limited extent, by foundations. Despite their diverse sources of funding, the centers in the United Kingdom, the United States, Canada, Italy, the Nordic countries, and Australasia are cooperatively linked in the Cochrane Collaboration. It may well become one of the chief mechanisms by which meta-analysis will play an increasingly important role in medicine.7 Envisioning the perfect study: At a 1986 workshop on meta-analysis, Donald Rubin felt that everyone was too much in agreement and set out to foment some dissension. He did so by telling his audience: Often, the right way to add provocative stimuli is to claim that everything everybody is doing is wrong. Even when this isn't true (and it certainly can't be in this context), it is often useful to take such an attitude and see how far it can be pushed. So I begin by claiming that everything everybody's been doing statistically for meta-analysis, including the things Bob Rosenthal and I have done and do, are irrelevant and have missed the point. We all should really be doing something else.8 Having thoroughly aroused his listeners, he then outlined his idea of the something else they should be doing. Instead of combining and summing up a batch of imperfect studies, they should be developing statistical methods of estimating what a perfect study would find. To do so, they should take a set of studies of some treatment and its effects, examine how a change in the value of each variable affects the outcome, and then, using curves plotted from those data, extrapolate to what a perfect study-all-embracing, free of sampling or other errors, and without any methodological uncertainties-would yield. If, for instance, the effect size changes with the size of the sample, they would project statistically what the outcome would be if the sample included 100 percent of the individuals in the universe being studied. Or if the effect size varies with the experience of the researchers, they would project the findings of ideally experienced researchers, and if these results covaried with the size of the sample, they would calculate the interactions all along the sample-size curve by a second crisscrossing set of curves for the variable of experience. And so on for every conceivable variable that affects the outcome: time, place, method of treatment, characteristics of the one who is treated and of the one who is administering the treatment, and so forth.
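A toy version of the extrapolation Rubin describes can be written in a few lines of Python. The studies below are invented, and a straight-line fit on a single study characteristic-each study's sampling error-stands in for the rich, many-variable response surface he envisions; the fitted line is simply projected to the point at which sampling error vanishes, the statistical equivalent of sampling the entire universe.

    import numpy as np

    # invented studies: sample size and observed effect size
    n = np.array([25, 50, 100, 200, 400])
    d = np.array([0.55, 0.50, 0.45, 0.42, 0.40])

    se_proxy = 1 / np.sqrt(n)                    # a proxy for each study's sampling error
    slope, intercept = np.polyfit(se_proxy, d, 1)

    # project to a study with no sampling error at all ("100 percent of the universe")
    print(f"extrapolated 'perfect study' effect: {intercept:.2f}")   # about 0.35 for these data

In Rubin's scheme the single predictor would be replaced by every variable that affects the outcome, and the fitted line by a many-dimensional surface; this fragment shows only the direction of the idea, not its execution.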
The effects of all these relevant variables extrapolated to the theoretically perfect study, and the results of their interactions with every other variable, can be envisioned not as a set of curves on a two-dimensional graph but as a "response surface"-an undulating three-dimensional landscape of plains, valleys, and hills representing all the points of reality as revealed by the perfect study. What is all wrong with meta-analysis, according to Rubin, is that it interpolates-looks for an answer inside a set of studies; what it should do is extrapolate-look for an answer beyond them by using their data to project the findings of an ideal study. These would be the "true" effects, not the probable effects yielded by present techniques. "Standard statistical tools can be used to build and extrapolate this treatment-effect response surface," Rubin said, but he has thus far offered only a few hints as to how that would be done. He admits that developing the necessary statistical tools and procedures would require "a relatively massive statistical effort in the context of real examples," and, being chairman of the statistics department at Harvard, he is too busy to do it; he hopes, though, that some graduate student will come along who will carry out the work under his guidance. Reactions at the workshop and in the field to Rubin's vision have been mixed. His concept is dizzying and even alarming, and some meta-analysts and statisticians regard it as high-flown flapdoodle. Others, however, find it challenging, constructive, and a reasonable prediction of the future. A review of the book in which Rubin's presentation appeared concluded, "The chapter that truly speaks to the future of meta-analysis, and not to its distant and recent past, is the chapter by Donald B. Rubin entitled 'A New Perspective.' ... [It is] an eminently sensible, indeed altogether wholesome, conceptualization of what research synthesis should be."9 The reviewer was Gene Glass.
A Parting Word
I have been a science writer nearly all my working life, and in the course of my work I have listened to and read the work of many bright and some brilliant men and women: mathematicians, physicists, chemists, physicians, geneticists, and, most of all, psychologists and sociologists. They have been a very varied lot, but the one trait they share is a pervasive and communicable enthusiasm about what they do. All serious and gifted
scientists, I think, are adventurers-intellectual explorers making their way into terra incognita and enduring the dreadful tedium and innumerable travails and disappointments in the hope of achieving that transfiguring moment when at last they glimpse the truth and feel much like
stout Cortez when with eagle eyes
He star'd at the Pacific-and all his men
Look'd at each other with a wild surmise-
Silent, upon a peak in Darien.
(John Keats, "On First Looking into Chapman's Homer")
That has most certainly been true of the fifty-odd men and women I met and interviewed for this book. Some were young, others not so young, and a few old; some were wonderful explainers and others hard to follow; many were ebullient and some sober; but all-truly, all-were enthusiastic about what they were doing and conveyed a sense of high intellectual adventure. I found it exciting and enriching to listen to them, read the fruits of their studies, and share their journeys and discoveries. I regret that I have come to the end of my travels with them. I can only hope that you do, too.
Appendix
Some Finer Points in Meta-Analysis
Harris Cooper
The main body of the text introduced meta-analysis to the general reader, with a particular eye to underscoring the field's novel contribution to social science methods. As a consequence, it dealt with the actual mechanics of meta-analysis in a necessarily cursory way. This appendix aims to remedy that deficiency, and much of the material in it requires an acquaintance with basic statistics. (Needless to say, even this amplifying discussion is somewhat simplified and leaves many issues untouched. For the interested reader, the many excellent meta-analysis methods books mentioned in the text can provide a thorough picture of the field.)
This appendix is divided into three sections, each addressing a major issue in conducting a meta-analysis. The first section concerns how meta-analysts obtain effect sizes from individual studies in a form that is combinable across studies. The second section describes the mechanics of combining effect sizes and analyzing them for differences related to moderator variables. The third section deals with how meta-analysts interpret effect sizes, assigning degrees of "largeness" or "smallness" and of "importance" or "triviality" to the statistic.
Obtaining Effect Sizes in Combinable Form
About a dozen effect size metrics actually exist; they are described in a book by Jacob Cohen (1988) entitled Statistical Power Analysis for the Behavioral Sciences. In practice, however, the three effect size metrics discussed in the text-the r-index, the d-index, and the odds ratio-nearly always do the trick. They are focused and informative, and one of the three fits just about every situation. Each effect size metric accommodates relationships among variables with different measurement characteristics. The r-index is used to express the relationship between two continuous variables, such as when
class size is related to achievement. The d-index relates one dichotomous variable to a continuous variable, such as comparing clients who do or do not receive psychotherapy (dichotomous) using subjective measures of well-being (continuous). The odds ratio is used when both variables are dichotomous, such as the effect of taking or not taking aspirin on the occurrence or nonoccurrence of heart attack. The choice of how to express the effect sizes, then, is determined by which metric best fits the measurement characteristics of the variables under consideration. A meta-analyst makes that choice by looking at the central relationship under consideration and at how the variables are measured. A problem arises when the meta-analyst comes across separate studies that examine the same relationship but do so using different metrics. Suppose we are looking for studies on class size and academic achievement. We find one study that employed as data all the naturally occurring classes in a school district. Clearly, the r-index best fits this study because both class size and achievement vary over a continuous range of values. However, we find another study that used a large school district and compared only those classes with more than thirty students, labeled the "large class" group, with classes of fewer than twenty students, labeled the "small class" group. Classes with twenty-one to twenty-nine students were discarded from the data set. This study seems to fit the d-index best because the class size variable is measured as if it were dichotomous. Which should the meta-analyst use? The r-index is still the metric of choice because the variables are conceptually continuous, even though the second study created two "artificial" extreme groups (with the probable intention of maximizing the chance that a significant difference would be found). So, for the second study we would calculate a d-index from the large-class and small-class means and standard deviations and then convert it to an r-index using a simple formula. Conversion formulas, and tables of equivalent values based on them, allow meta-analysts to move from one effect size metric to another. They are available in meta-analysis textbooks. Suppose we find a study that compares three groups, say, a traditional psychotherapy versus a brief psychotherapy versus no treatment. How would we generate an effect size? In this instance, we would most likely consider calculating two d-indexes, one comparing traditional psychotherapy to no treatment and another comparing brief psychotherapy to no treatment (we could also consider comparing traditional and brief therapies if this were the focus of our review). The two d-indexes would not be statistically independent, since both rely on the means and standard deviations (sds) of the same no-
treatment group. But this complicating factor (discussed below) is preferable to the alternative strategy of using yet a fourth effect size metric called percentage of variance (PV). PV tells us how much of the variance in the dependent variable is explained by group membership. PV has the initially appealing characteristic of being usable regardless of the number of groups in the study (indeed, it can be used with two continuous measures as well). However, it has the unappealing characteristic of being relatively uninformative: The resulting effect size tells us nothing about which of the three treatments is most effective. Identical PVs can result from any rank ordering of the three group means. Thus, PV is an unfocused effect size metric that is rarely, if ever, used by meta-analysts. Another problem facing meta-analysts is that many research reports do not contain the information needed to estimate an effect size. This is probably the most frustrating problem encountered by quantitative reviewers. To address it, the meta-analyst must comb a research report to find the data needed to make the calculation and sometimes must make some assumptions. Let's stick with the example of the effects of psychotherapy. In the best of all possible worlds, every research report that compared the well-being of psychotherapy patients with untreated controls would present either a d-index calculated by the primary researchers or a table that included the means, standard deviations, and sample sizes for each group. Often, too often, that does not happen. Sometimes only the means associated with a significant difference are given. In this case, the meta-analyst first looks to see if the primary researchers reported the exact value of the statistical test associated with the psychotherapy comparison. In its simplest form, this will be a t-test, a test of the difference between two means. Say the researchers report that the comparison of final well-being indicated that clients receiving psychotherapy reported greater subjective well-being (M = 52.4) than the no-treatment controls (M = 48.2), where M signifies the mean, and that the t-value of 2.11, based on sixty subjects, was significant at p < .05. No standard deviations are reported, so the d-index cannot be calculated directly. However, because the d-index and the t-test rely on much the same information (means, sds, and sample sizes), the meta-analyst can use a simple formula (also available in meta-analysis texts) to estimate the effect size from the value of the significance test. In this case, d = .55. If the exact t-value is not given but the primary researchers do report a p level and sample size, the meta-analyst can estimate the t-value (by finding the two known quantities in a table of t-values) and plug it into the formula.
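For readers who want to see the arithmetic, here is a minimal sketch in Python of the standard textbook formulas just mentioned. The worked numbers echo the psychotherapy example above (sixty subjects, t = 2.11); the 2 x 2 table supplied for the odds ratio is invented purely for illustration.

    # Standard effect-size calculations (textbook formulas).
    import math

    def d_from_means(m1, m2, sd1, sd2, n1, n2):
        """Standardized mean difference using the pooled standard deviation."""
        pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
        return (m1 - m2) / pooled_sd

    def d_from_t(t, df):
        """Estimate d when only the t-value and its degrees of freedom are reported."""
        return 2 * t / math.sqrt(df)

    def r_from_d(d):
        """Convert a d-index to an r-index (equal group sizes assumed)."""
        return d / math.sqrt(d**2 + 4)

    def odds_ratio(a, b, c, d):
        """Odds ratio from a 2 x 2 table: a, b = events/non-events in group 1."""
        return (a * d) / (b * c)

    print(round(d_from_t(2.11, 58), 2))          # the example's sixty subjects: ~0.55
    print(round(r_from_d(0.55), 2))              # the equivalent r-index: ~0.27
    print(round(odds_ratio(20, 80, 10, 90), 2))  # hypothetical table: OR = 2.25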
If a single-degree-of-freedom F-value from a one-factor analysis of variance is given, it can be converted to a t-value (F = t²) and the formula used. However, if an F-value is based on a multiple-group comparison (say, traditional versus brief versus no therapy), or, if it comes from an analysis of variance with more than one factor (say, both treatment and sex of the client are part of the computations), then additional steps must be taken to make the F-value equivalent to the simple t-test. Sometimes these adjustments can be made, sometimes not. Interestingly, studies that relate two continuous variables are quite likely to report the exact r-index associated with the result. If not, there is also a formula for converting a t-test to an r-index. Similarly, studies that relate two dichotomous variables are likely to present a table with the needed cross break of data. The odds ratio is then easily calculated. Meta-analysts have invented lots of other seat-of-the-pants techniques that allow them to generate effect size estimates from incomplete reports. In nearly all cases, the meta-analyst is forced to make assumptions about the data that can be plausible or suspect. How plausible or suspect they are is for the users of the meta-analysis to decide. For all effect size metrics, the biggest problem arises when primary researchers report that the comparison of interest was "not significant" and give no statistics whatsoever. Then, meta-analysts are really stuck. They can try to contact the researchers and ask for the data they need. This strategy meets with limited success. Alternatively, they can ignore the study, but this may upwardly bias the average effect size estimate. Or, they can assume the missing effect size is exactly zero, but this will likely bias their average estimate downward. Or, they can assume the effect size fell just below significance and calculate its largest possible nonsignificant value. If this value is assumed to be in the prevailing direction of other effect sizes, it will likely cause an overestimation of the average effect. If this value is assumed to be in the direction opposite to most other effect sizes, it will cause an underestimation. Given all these possibilities, it is not unusual for meta-analysts to present multiple estimates of average effect sizes based on different assumptions about missing data. For example, a meta-analyst can present an average effect size based on the known data and an average based on very conservative assumptions about missing data, sort of stacking the deck against the hypothesis. If both approaches lead to a similar conclusion about the existence and importance of a relationship, the users of the meta-analysis can have greater confidence in its conclusion. A bias in estimates of average effect sizes (and in comparisons between effect sizes) can result if missing data lead to the exclusion of the
study. For this reason, meta-analysts are obligated to discuss how many data were missing from their reports, how the missing data were handled, and why they chose the methods they did. A sample statistic-be it an effect size, a mean, or a standard deviation-is based on measurements of a small number of people drawn from a larger population. Sometimes these sample statistics will differ in known ways from the value obtained if every person in the population could be measured. Because effect size estimates based on samples are not always true reflections of their underlying populations, meta-analysts have devised ways to adjust for the bias. A d-index based on small samples (less than ten people in each group) tends to overestimate the population value. So, a correction factor can be calculated to adjust small-sample d-indexes, making the values smaller but more accurate. Formulas and tables for calculating the adjustment are available. Some ways of calculating odds ratios also can lead to over- or underestimates, and similar adjustments may be needed. In a related vein, r-indexes that estimate underlying population values very different from zero have nonnormal sampling distributions. This occurs because r-indexes are limited to values between +1.00 and -1.00. So, as a population value approaches either of these limits, the range of possible values for a sample estimate will be restricted toward the approached limit. Therefore, the distribution of sampled values will become more skewed away from normal. To use a visual image, if a set of sampled r-indexes centers around a value of, say, .40, then more positive values can range only from .41 to 1.00 while less positive values can range from .39 to -1.00. Thus, the distribution will appear to have its upper tail (the one at the r = +1.00 end) cut off. To adjust for this, some meta-analysts convert r-indexes to their associated z scores, which have no limiting value and which will be normally distributed, before the estimates are combined. In essence, the transformation "stretches" the upper tail and restores the bell shape. Then, the z scores are combined and averaged. Once an average z score has been calculated, it can be converted back to an r-index. If the r-index is close to zero, there is really no need to do the transformation. (Indeed, some meta-analysts eschew this transformation regardless of the r-index value.) An examination of the r-to-z transformation table reveals that the two values are near identical until r = .25. However, when the r-index equals .50 the associated z score equals .55, and when the r-index equals .8 the associated z score equals 1.1. Another statistical problem arises when a single study contains multiple effect size estimates. This is most bothersome when more than one
measure of the same construct is taken and the measures are analyzed separately. Suppose we are conducting a meta-analysis comparing the effects of two workgroup structures on job satisfaction. We find a study that compared the two structures using both an attitude measure and a measure of absenteeism. Obviously, since the same workers are involved, these measures are not independent estimates of the effect. Thus, it would seem unfair to combine these two estimates plus a third from a separate study to arrive at an average effect. The first study would be given too much weight. Also, the assumption that effect size estimates are independent underlies the other meta-analysis procedures described below. There are several approaches that meta-analysts use to handle dependent effect sizes. Some treat each effect size as independent, regardless of the number that come from the same sample of people, and assume that the effect of violating the independence assumption is not great. Other meta-analysts use the study as the unit of analysis. In this strategy, they calculate the mean effect size or take the median result and use this value to represent the study. So, if a study reports three nonindependent d-indexes of, say, .10, .15, and .35, the study might be represented by a single value of .20 if the mean is used, or .15 if the median of the several measures is used. Recently, sophisticated statistical models have been suggested as a solution to the problem of dependent effect size estimates. However, the viability of these procedures lies in whether the meta-analyst can credibly estimate the actual degree of relation among dependent measurements. Because this approach is new and complex, and estimating dependencies is tricky, it has not been used much yet. The issue of dependent estimates is not confined to the problem of multiple measures taken on the same sample. Results in the same study that are reported separately for different samples of people also share certain influencing factors. Suppose the effect of workgroup structure on job satisfaction is estimated separately for men and women within the same study. Then, the samples are independent but the type of work might not be, nor the quality of management, nor the study's design and execution. All these things will likely make these two effect sizes more similar than any two effects drawn at random. Taken a step further, a reviewer also might conclude that separate but related studies from the same group of investigators are not independent. So when does it stop? In practice, most meta-analysts ignore the study-level interdependencies in effect sizes but not those based on shared samples.
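The adjustments and reductions described in this section are equally simple to compute. The sketch below, again in Python with invented numbers, applies the usual small-sample correction factor to a d-index, performs Fisher's r-to-z transformation and its inverse, and reduces a study's nonindependent d-indexes to a single mean or median value.

    # Small-sample correction, Fisher's r-to-z, and dependent effect sizes.
    import math
    import statistics

    def corrected_d(d, n1, n2):
        """Shrink a small-sample d-index with the usual correction factor."""
        df = n1 + n2 - 2
        return d * (1 - 3 / (4 * df - 1))

    def r_to_z(r):
        """Fisher's transformation; averaging is done on z, not on r."""
        return math.atanh(r)

    def z_to_r(z):
        """Back-transform an average z score to an r-index."""
        return math.tanh(z)

    print(round(corrected_d(0.80, 8, 8), 2))                  # 0.80 from tiny groups shrinks to ~0.76
    print(round(r_to_z(0.50), 2), round(r_to_z(0.80), 2))     # ~0.55 and ~1.10, as noted in the text

    # three nonindependent d-indexes from one study, reduced to one value per study
    study = [0.10, 0.15, 0.35]
    print(round(statistics.mean(study), 2), statistics.median(study))   # 0.2 and 0.15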
Combining Effect Sizes Across Studies
State-of-the-art meta-analytic procedures call for the weighting of effect sizes by their sample size (or, more accurately, by the inverse of their variance) when they are averaged across studies. The reason for doing so is that studies with larger samples yield more precise estimates and so should contribute more to the combined result. This weighting produces an estimated effect size that is closer to the population value. As was mentioned in the text, a good meta-analysis will report the 95 percent confidence interval around an average weighted effect size estimate. The average estimate gives the most likely population value of the effect size, based on the known studies. The confidence interval gives the range of values that we can say with 95 percent certainty will contain the population value. Also, the confidence interval can be used in place of the combined p level techniques. It can be used to reject the null hypothesis that the difference between two means, or the size of a correlation, is zero. If the 95 percent confidence interval does not include d = 0 or r = 0, then the null hypothesis can be rejected. Having calculated an effect size for each study and then a weighted average of all the effects, a good meta-analyst will then want to calculate average effect sizes for subsets of studies. Often, a meta-analyst wants to compare separate estimates for studies that used different outcomes, different treatment implementations, or different types of subjects, to name just a few of the multitudinous ways studies can differ. If average effect sizes across subsets of studies are similar, then the meta-analyst can be more confident that the overall conclusion is generalizable across different situations. If the averages are not similar, then the meta-analyst has identified important qualifiers, or moderators, of the overall results. The ability to raise questions about variables that moderate effects is one of the major contributions of meta-analysis. Even if no individual study has compared different outcomes, implementations, or types of subjects, the meta-analysis can give a first hint about whether these moderators are important. Suppose we are interested in whether brief psychotherapies are more effective with male or female clients. We might examine the literature and discover that no single study compares the effect of brief therapies on men and women. However, we might find some studies that compare brief therapies to no treatment that include just men and others that include just women. By calculating the average effect sizes across these two groups of studies and comparing them, we get a first indication of whether the sex of the patient moderates the effect of therapy.
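A minimal sketch of this weighting, in Python, follows. The d-indexes and sample sizes are invented; the code pools each effect by the inverse of its variance, reports the 95 percent confidence interval, and then repeats the pooling within two hypothetical subsets-studies of men only and studies of women only-to give the kind of first look at a moderator just described.

    # Inverse-variance pooling of d-indexes, overall and within subgroups.
    import math

    def var_d(d, n1, n2):
        """Large-sample variance of a d-index."""
        return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

    def weighted_mean(studies):
        """Weighted average d and its 95 percent confidence interval."""
        w = [1 / var_d(d, n1, n2) for d, n1, n2 in studies]
        mean = sum(wi * d for wi, (d, _, _) in zip(w, studies)) / sum(w)
        se = math.sqrt(1 / sum(w))
        return mean, (mean - 1.96 * se, mean + 1.96 * se)

    # (d, n_treated, n_control); the "men" and "women" studies are hypothetical
    men   = [(0.45, 20, 20), (0.30, 35, 30), (0.55, 25, 25)]
    women = [(0.20, 40, 40), (0.10, 30, 35), (0.25, 50, 45)]

    for label, group in [("men only", men), ("women only", women), ("all studies", men + women)]:
        m, ci = weighted_mean(group)
        print(f"{label}: d = {m:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")

If neither subgroup interval includes zero and the two averages differ, the meta-analyst has a first, informal hint of a moderator; whether the difference is reliable is the business of the formal tests described next.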
We would not accept the conclusions of a primary research report if it did not formally test for the difference between two group means. Likewise, calculating average effect sizes for subsamples of studies and detailing the differences are not sufficient to claim that the estimates are truly different. Meta-analysts must formally test the difference between average effect sizes to determine if they are significant, or to rule out chance as an explanation for the difference. Early on, meta-analysts used the familiar parametric inference tests, namely analysis of variance or multiple regression, that are used in primary research. The effect size served as the dependent variable and features of the study, such as treatment or sample differences, were used as "independent," or predictor, variables. This approach was criticized, however, because meta-analytic data often do not meet the underlying assumptions of the familiar techniques. For example, the parametric inference procedures assume that each measurement of the dependent variable is influenced by a roughly equal amount of error (an assumption called "homoscedasticity of error"). However, it is unlikely this assumption is met when effect sizes are the dependent variables because the error in each effect size estimate is a function of its sample size (more error in smaller samples) and sample sizes can vary markedly from one estimate to another. For this reason, the familiar procedures have been largely abandoned in meta-analysis. In their place, two approaches have gained acceptance. The first compares the variation in the observed effect sizes with the variation expected if only sampling error were making the effect size estimates differ. This approach involves calculating (1) the actual variance in the effect sizes that the meta-analyst obtains from studies and (2) the expected variance in these effects, given that all observed effects are estimating the same underlying population value. This expected value is a simple function of the average effect size estimate, the number of estimates, and the sample sizes. Sampling theory gives us quite precise estimates of how much sampling variation to expect in a group of effects. The meta-analyst then compares the observed with the expected variance. Proponents of this procedure generally refrain from formal tests to judge whether a significant difference exists between the observed and expected variances. Instead, they establish an arbitrary criterion and use it. They might say that if the observed variance is twice as large as the expected sampling variance then they will assume the two are reliably different. Whatever the criterion, if the variance estimates are deemed similar, then sampling error is the simplest explanation for
why the effect sizes differ. If they are deemed different, that is, if the observed variance is much greater than that expected from sampling error alone, then the search for systematic influences on effect sizes begins. In addition, proponents of this approach often adjust effect size estimates to account for methodological artifacts that have calculable influences. For example, we know that social science measurements vary in reliability because they are influenced by factors extraneous to the construct they are meant to gauge: A measure of academic achievement is influenced not only by what students know but also by their health on the day of the test and the degree of commotion in the testing room. For some measures, the amount of unreliability can be estimated. With them, the effect sizes can be adjusted to tell us what the relationship among variables might be if the measures were pristine. When two variables are involved, these adjusted effect sizes are always larger than the observed ones. If the estimates are not available, the adjustments can be carried out using plausible assumed values or not done at all. The second approach, called homogeneity analysis, provides the most complete guide to making inferences about a research literature. As just noted, effect sizes will vary somewhat, due to sampling error, even if they all estimate the same underlying population value. Like the first approach, a homogeneity analysis compares the observed variance to that expected from sampling error. It differs in that it includes a calculation of how probable it is that the variance exhibited by the effect sizes would be observed if sampling error alone were making them differ. In essence, the homogeneity test asks the question, "What are the chances of observing this much variance in a set of effect sizes if sampling error only is operating?" Just as in a significance test of the difference between two means, if the probability of observing the variance is less than .05, we reject the notion that sampling error alone is the cause of differences in effect sizes. Suppose an analysis reveals a homogeneity statistic, typically called Q, that has an associated p value of .05. This means that only five times in a hundred would sampling error create this amount of variance in effect sizes. Thus, we would reject the (null) hypothesis that sampling error alone explains the variance in effect sizes and begin the search for additional influences. The meta-analyst then tests whether study characteristics explain variation in effect sizes. Studies are grouped by features, and the average effect sizes for those groups are tested for homogeneity in the same way as the overall average effect size. In the Handbook of Research Synthesis, Larry Hedges gave a good example of how a homogeneity analysis is car-
ried out. Hedges presented a table with data from ten studies examining whether a sex difference exists in the tendency to conform (see Cooper and Hedges 1994, table 19.2). The ten d-indexes were -.33, .07, -.30, .35, .70, .85, .40, .48, .37, -.06 (with positive values indicating females are more conforming). The weighted (by sample size) average of these d-indexes is +.123, indicating that females are more likely to conform than males. The overall homogeneity statistic for the set of effects is Q = 31.799, which is significant beyond p < .005. Therefore, Hedges could safely reject the hypothesis that the variance in this set of effect sizes was due solely to sampling error. What else might be causing the effects to differ, or more precisely, which characteristics of the studies might be systematically associated with variance in effect sizes? As one possibility, Hedges decided to look at the sex of the authors of the research reports. He noted that the first two effect sizes came from studies for which 25 percent of the authors were male, the third effect size came from a study whose authors were 50 percent male, and the final seven effects came from studies for which all the authors were male. Is it possible that grouping studies in this manner accounts for some of the variance in effects? First, Hedges calculated the average d-index for each of the three groups of studies. He found that the weighted average effect for the first two studies (25 percent male) was d = -.15. This indicates that studies with relatively few male authors find that males are more conforming than females. However, the 95 percent confidence interval includes values from -.39 to +.09. Therefore, because zero is contained in the interval, Hedges cannot reject the null hypothesis that this set of studies found no reliable sex difference in conformity. The single study whose authors were 50 percent male has an effect size of d = -.30, which is significantly different from zero; the confidence interval ranges from -.59 to -.01. Finally, the seven studies authored entirely by males reveal an average weighted effect of d = +.34, which is also significantly different from zero (the confidence interval is .19 to .49) but in the opposite direction. These studies indicate females are more conforming than males. Apparently, Hedges was on to something. Whether a study found males or females more conforming appeared related to the gender composition of the authors: Studies authored exclusively by males found greater female conformity; studies with at least one female author did not. But, just looking at the average d-index in the three groups of studies does not tell us if they are reliably different. To do this, Hedges returned to homogeneity analysis. This time, however, he analyzed the
variance in the three average effect sizes: -.15, -.30, and +.34. The associated Q statistic is 20.706, significant well beyond p < .005. So, Hedges could now say that grouping studies in this manner showed that some of the variance in effect sizes, beyond that expected by sampling error alone, could be accounted for by the gender composition of the team of authors. In Hedges' meta-analysis, he assumed that the effect sizes coming from the separate studies were estimating a fixed population value rather than a random one. The distinction between fixed and random effects often baffles even sophisticated data analysts. In essence, an effect size is said to be fixed when the sampling error discussed above is the only random factor affecting its estimation. Sometimes, however, other features of studies can be viewed as random variables. For example, in studies of class size and academic achievement, the effect of the teacher will differ from class to class in unsystematic ways. Might it therefore be appropriate to consider teachers to be randomly sampled from a population of teachers? Or to give another example, in a broad-based evaluation of Head Start, programs will differ in a multitude of ways from site to site. Should the included programs be considered a random sample of all programs? If the researcher answers "yes," then the statistical analysis must proceed in a fashion that takes this additional randomness into account. In meta-analysis, the question is whether the effect sizes in a data set are affected by a large number of these uncontrollable influences-in the Head Start example, by, say, differences in teachers, facilities, community economics, state regulations, and so on. If they are, the meta-analyst must choose a statistical model that takes this random variance in effect sizes into account. If they are not, then random variance in effect sizes is ignored (or, more accurately, set to zero) and a fixed-effects statistical model is used. It is rarely clear-cut which assumption, fixed or random, is most appropriate for a particular set of effect sizes. In practice, most meta-analysts opt for the fixed-effects assumption because it is analytically easier to manage. Some meta-analysts argue that fixed-effects models are too often used when random-effects models are more appropriate (and conservative). Others counterargue that a fixed-effects statistical model can be applied if a thorough, appropriate search for influences on effect sizes has been part of the analytic strategy, that is, if the meta-analyst looks at the systematic effects of influences such as teachers, facilities, community economics, and state regulations.
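The homogeneity test itself takes only a few lines of Python. The sketch below computes Q for Hedges' ten conformity d-indexes under a fixed-effects model; because his sample sizes are not reproduced in the text, each study is arbitrarily given fifty subjects per group, so the weighted mean and Q will not match his +.123 and 31.799-the mechanics, not the particular numbers, are the point.

    # Fixed-effects homogeneity (Q) test on Hedges' ten conformity d-indexes,
    # with invented sample sizes of fifty subjects per group in every study.
    import math

    d  = [-0.33, 0.07, -0.30, 0.35, 0.70, 0.85, 0.40, 0.48, 0.37, -0.06]
    ns = [(50, 50)] * 10                       # hypothetical (n1, n2) for each study

    def var_d(di, n1, n2):
        """Large-sample variance of a d-index."""
        return (n1 + n2) / (n1 * n2) + di**2 / (2 * (n1 + n2))

    w     = [1 / var_d(di, n1, n2) for di, (n1, n2) in zip(d, ns)]
    d_bar = sum(wi * di for wi, di in zip(w, d)) / sum(w)
    Q     = sum(wi * (di - d_bar) ** 2 for wi, di in zip(w, d))

    print(f"weighted mean d = {d_bar:.3f}, Q = {Q:.1f} on {len(d) - 1} df")
    print("critical chi-square (9 df, p = .05) = 16.92")
    print("heterogeneous" if Q > 16.92 else "consistent with sampling error alone")

The same Q machinery, applied to the three subgroup averages instead of the individual effects, yields the between-groups test Hedges used to show that author gender accounted for real variance.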
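If the random-effects assumption is chosen instead, the usual remedy is to estimate the between-study variance and add it to each study's sampling variance before weighting. The sketch below shows one common way of doing this, the method-of-moments estimator associated with DerSimonian and Laird; it is offered only as an illustration of the idea, not as the specific procedure used in any analysis described in this book, and it assumes per-study effects d and sampling variances v of the kind computed in the previous sketch.

def random_effects_mean(d, v):
    # Return (pooled mean, tau-squared) under a simple random-effects model.
    w = [1.0 / vi for vi in v]                     # fixed-effects weights
    d_fixed = sum(wi * di for wi, di in zip(w, d)) / sum(w)
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, d))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)        # between-study variance; zero if Q < df
    w_star = [1.0 / (vi + tau2) for vi in v]       # broader, more conservative weights
    d_random = sum(wi * di for wi, di in zip(w_star, d)) / sum(w_star)
    return d_random, tau2

# Example call with three hypothetical effects and variances:
# pooled, tau2 = random_effects_mean([-0.33, 0.07, 0.35], [0.03, 0.03, 0.03])

When the effect sizes are homogeneous, tau-squared is estimated as zero and the random-effects average collapses to the fixed-effects one; when they are heterogeneous, the extra variance widens the confidence interval and gives small and large studies more nearly equal influence.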
Interpretation of Effect Sizes

Effect size estimates are of little value unless users can understand their substantive, practical, and statistical significance. The first guides to interpreting effect sizes were provided by Jacob Cohen (1988), who proposed a set of values to serve as definitions of "small," "medium," and "large" effects. He recognized, however, that judgments of "largeness" and "smallness" are relative, requiring a comparison between two items. Therefore, in defining these adjectives he compared the different average effect sizes he had encountered in the behavioral sciences. Cohen defined a small effect as d = .2 or r = .1, values his experience suggested were typical of those found in personality, social, and clinical psychology research. A large effect of d = .8 or r = .5 was more likely to be found in sociology, economics, and experimental or physiological psychology. According to Cohen, then, a clinical psychologist might interpret an effect size of r = .1 associated with the impact of psychotherapy as "small" when compared with behavioral science effects in general but "about average" when compared with other clinical psychology effects. Because his contrasting elements were so broad, Cohen was careful to stress that his conventions should be used only as a last resort. At the time he was writing, comparing the effect of psychotherapy to "all behavioral science effects" might have been the best contrasting element available. Today, with so many meta-analytic estimates of effect available, other, more meaningful contrasting effects can often be found.

Suppose we have just completed a meta-analysis of the effect of printed public service advertisements on adolescents' attitudes toward smoking. We find that the average effect is d = .2, indicating that teenagers exposed to print ads hold attitudes toward smoking that are two-tenths of a standard deviation more negative than those of teenagers not so exposed. Using Cohen's guide, we would label this effect "small." However, additional contrasting elements might be available to us. These might come from other meta-analyses that looked at entirely different ways to change smoking attitudes, such as high school health classes. Or they might come from meta-analyses of different but related treatments, such as the effect of television ads. We might also look for meta-analyses that shared the same treatment or predictor but varied in outcome measure, for example, ones that examined actual smoking behavior or behavioral intentions rather than attitudes.

Effect sizes also need to be interpreted in relation to the methodology used in the primary research. Thus, studies with more extensive treatments (in our example, more frequent exposures to ads), more sensitive research designs (within-subject versus between-subject designs),
and more trustworthy measures can be expected to reveal larger effect sizes, all else being equal. If we are comparing the results of a meta-analysis involving carefully controlled laboratory studies of print ads and attitudes with one involving field studies of TV ads and behavior, we have to consider the methodological differences before drawing a conclusion about the relative bigness or smallness of print and TV ad effects. In this instance, the deck is stacked against TV ads.

Judgments of size are not synonymous with judgments of importance or value. The relative merit of different treatments or explanations involves considerations in addition to the numerical size of a relationship, most notably the cost of particular treatments and the value placed on the changes the treatment is meant to foster. Suppose we wanted to explore the relative cost-effectiveness of decreasing class sizes from twenty-five to twenty students versus extending the school day by fifteen minutes. First, we would have to establish an estimate of the effect of each change. Assume that a meta-analysis finds the improvement in student achievement when class sizes decrease to be d = .15, and that another (or the same) meta-analysis finds the improvement associated with adding fifteen minutes of instruction to be d = .10. Reducing class size has the bigger effect, but that is not enough. Next, we must calculate the cost per student of each change. Suppose we find that the new classrooms and teachers needed to reduce class size add $200 to the cost of educating each child, while the additional teacher salary needed to lengthen the school day, plus building utilities and other expenses, adds $100 per child. We would then estimate a cost-effectiveness ratio by calculating the effect size gain per dollar spent: .10 per $100, or .0010 standard deviation per dollar, for the longer day, against .15 per $200, or .00075, for the smaller classes. In this analysis, extending the school day yields more brains for the buck.
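The arithmetic of that comparison is simple enough to write out explicitly. The Python lines below use the illustrative numbers from the example above (both the effect sizes and the per-pupil costs are hypothetical values from the example, not empirical estimates) and compute the effect-size gain per dollar for each option.

options = {
    "reduce class size from 25 to 20": {"effect_d": 0.15, "cost_per_pupil": 200.0},
    "add 15 minutes to the school day": {"effect_d": 0.10, "cost_per_pupil": 100.0},
}

for name, o in options.items():
    # Cost-effectiveness ratio: standard deviations of achievement gained per dollar spent.
    ratio = o["effect_d"] / o["cost_per_pupil"]
    print(f"{name}: {ratio:.5f} sd per dollar")

# Prints .00075 sd per dollar for smaller classes versus .00100 for the longer day,
# so the longer school day wins on cost-effectiveness despite its smaller raw effect.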
Conclusion

It is hoped that the reader has not been left with the impression that conducting a good meta-analysis presents a set of insurmountable practical problems. It does not. When social scientists first sought ways to measure variables or conduct experiments, the task must have seemed equally intimidating. The problems, however, are also the source of the challenges, not just in the search for new techniques but also in the refinement and application of old ones. And it is these challenges, along with a clear understanding of the benefits that come with overcoming them, that motivate meta-analysts to continue working at the frontiers of scientific method.
Notes
Unpublished quotations throughout the text are from personal interviews conducted by the author.
Chapter 1
1. Gina Kolata. 1994. "New Study Finds Vitamins Are Not Cancer Preventers." The New York Times, July 21, 1994, p. A20.
2. Jane E. Brody. 1995. "Study Says Exercise Must Be Strenuous to Stretch Lifetime." The New York Times, April 19, 1995, p. A1. The original study is Lee, Hsieh, and Paffenbarger (1995).
3. Jane E. Brody. 1995. "Trying to Reconcile Exercise Findings." The New York Times, April 19, 1995, p. A1.
4. Thompson (1994) fig. 2.
5. C. G. Moertel, "Improving the Efficiency of Clinical Trials," cited in Begg and Berlin (1988).
6. U.S. National Institutes of Health (1994).
7. Jerry E. Bishop. 1995. "Link Between EMF, Brain Cancer Is Suggested by Study at Five Utilities." The Wall Street Journal, Jan. 11, 1995, p. B6. The article refers to a study by David A. Savitz and Dana P. Loomis in American Journal of Epidemiology.
8. Joel Greenhouse and others, in a meta-analytic review of the aphasia treatment literature, in Wachter and Straf (1990) p. 31.
9. Becker (1990).
10. Mann (1994).
11. Light and Smith (1971).
12. Jeffrey Schneider, in an introduction to a set of meta-analyses of desegregation, in Wachter and Straf (1990) pp. 55-57.
13. Eagly and Johnson (1990) pp. 233-234.
14. The massive and influential review is Maccoby and Jacklin (1974). The later studies are summarized in Eagly and Wood (1991).
15. Fischer (1991).
16. Not all meta-analysts agree with this contention; see Hedges (1987).
17. Mulrow (1994).
18. The two 1984 studies: Buffler and others; Kabat and Wynder. The larger, later study: Janerich and others (1990).
19. Bunge (1991) pp. 553-554; Bunge (1992) pp. 72-73; Malcolm W. Browne. 1995. "Scientists Deplore Flight from Reason." The New York Times, June 6, 1995, pp. C1, C7. The article reports on a conference held at the New York Academy of Sciences.
20. Ibid.
21. White in Cooper and Hedges (1994) p. 42.
22. Goldfried, Greenberg, and Marmar (1990) p. 675.
23. Mulrow (1987).
24. T. Chalmers and Lau (1994).
25. Light and Pillemer (1984) pp. 3-4.
26. Pearson (1904) pp. 1243-1246.
27. Tippett (1931).
28. Cochran (1937) pp. 102-118.
29. For details, see Glass, McGaw, and Smith (1981); Light and Pillemer (1982); Sacks and others (1987); and Ingelfinger and others (1994). Developments listed in the paragraph refer to Light and Pillemer (1984) and to brief summaries in Glass (1976) and in Glass, McGaw, and Smith (1981) pp. 24-25.
30. Glass (1976). Some authorities divide them into six phases, some into many substages. See Cooper (1982); Durlak and Lipsey (1991); T. Chalmers and Lau (1993a); and Cooper and Hedges, in Cooper and Hedges (1994) pp. 8-13.
31. In Wachter and Straf (1990) p. 47.
32. Eysenck (1978).
33. The peer review is from Cooper's personal files.
34. Hall, interview with the author.
35. The figures are from my own search in late 1995. Higher ones have been reported for the three social science data bases between 1978 and 1989; see White in Cooper and Hedges (1994) p. 43, but this search did not eliminate duplications among the data bases. Another search reported five hundred listings for 1992 in MEDLINE alone, but included articles that mentioned or discussed meta-analysis and were not themselves meta-analyses (I. Chalmers and Haynes 1994).
36. Yusuf and others (1994).
37. Held, Yusuf, and Furberg (1989).
38. Baum and others (1981).
39. Hayashi and T. Chalmers (1993).
40. Lawrence K. Altman. 1992. "Tiny Cancer Risk in Chlorinated Water." The New York Times, July 1, 1992, p. A18. The report by Robert Morris and others appeared in American Journal of Public Health (July 1992).
41. Kranzler and Jensen (1989). For the statistically inquisitive, the figure -.54 is the measure of the regression of IQ on IT when both variables are scaled in standardized units. If two people differ by, say, twenty standardized units in IT, the slower person will be about ten units lower in IQ.
42. Ray (1993) p. 35.
43. Bushman and Cooper (1990).
44. Paik and Comstock (1994); Comstock and Paik (1990).
45. Lehman and Cordray (1993).
46. Greenberg and others (1992, 1994).
47. Olkin and Shaw (1994).
48. Sohn (1995).
49. In Wachter and Straf (1990) pp. 6-7.
50. Freeman Dyson. 1995. "The Scientist as Rebel." New York Review of Books, May 25, 1995, p. 33.
Chapter 2
1. "The talking cure" was what Anna O., a patient of Freud's early mentor and collaborator Josef Breuer, called Breuer's primitive form of psychotherapeutic treatment.
2. Luborsky (1972); Luborsky, Singer, and Luborsky (1975). The 93 percent improvement rate: Morton (1955).
3. Eysenck (1952) pp. 661-662.
4. Eysenck (1990).
5. Eysenck in Lindzey (1980) p. 165.
6. Reisman (1976) p. 352; Luborsky (1954).
7. The two studies: Bergin, summarized in Smith and Glass (1977), and Meltzoff and Kornreich, cited in Luborsky, Singer, and Luborsky (1975).
8. Light and Smith (1971) p. 433.
9. National Task Force on the Prevention and Treatment of Obesity (1994).
10. Fisher (1926) p. 504.
11. Bailar and Mosteller (1992) p. 358; Rosenthal and Rubin (1979) p. 1165.
12. Glass, McGaw, and Smith (1981) p. 95.
13. Details in the section titled "Genesis of Meta-Analysis, Part I" are drawn from interviews with Glass and Smith; Glass (1976); and Smith and Glass (1977).
14. Smith and Glass (1977) p. 753.
15. Hunter and Schmidt (1990) p. 85. Also quoted in Cooper and Hedges (1994) p. 126.
16. Altick (1975) p. 13.
17. See Orwin in Cooper and Hedges (1994) p. 141.
18. Durlak and Lipsey (1991) pp. 311-313; Gleser and Olkin, in Cooper and Hedges (1994) pp. 340-341, 350.
19. Yeaton and Wortman (1993); Orwin in Cooper and Hedges (1994) pp. 145-153.
20. Orwin in Cooper and Hedges (1994).
21. Durlak and Lipsey (1991) p. 305.
22. Ibid., p. 310.
23. Rosenthal (1991a) pp. 97-98.
24. Becker in Cooper and Hedges (1994) p. 218.
25. Kraus (1995).
26. Draper and others (1992) pp. 178-180.
27. Bailar and Mosteller (1992) pp. 184-185.
28. Smith and Glass (1977) p. 753.
29. Gallo (1978) p. 516.
30. Eysenck (1978) p. 517.
31. Mary Lee Smith interview. For typical articles contesting Smith and Glass's findings, see Gallo (1978); Prioleau, Murdock, and Brody (1983); and Orwin and Cordray (1985). For typical articles confirming them, see Andrews and Harvey (1981); Landman and Dawes (1982); and Shapiro and Shapiro (1982).
32. Lipsey and Wilson (1993) p. 1200. Since many of the original studies appeared in more than one meta-analysis, the actual number of original studies is somewhat less than five thousand.
33. Cook and others (1992) p. 13.
34. Shadish and others (1993); also, Shadish and others (1995).
35. Eagly and Carli (1981).
36. Cook (1991) p. 264; Miller and Pollock (forthcoming).
37. Light (1984) p. 59.
38. Flather, Farkouh, and Yusuf (1994) p. 394.
39. Hedges in Cooper and Hedges (1994) p. 298; Hedges and Olkin (1985) pp. 11-12.
40. Shoham-Salomon and Rosenthal (1987); Shoham and Rohrbaugh (1994).
41. Shoham-Salomon and Rosenthal (1987) p. 27.
42. Shadish and Sweeney (1991). They are speaking of meta-analyses of psychotherapy, but their remarks apply to all forms of intervention.
43. Baron and Kenny (1986).
44. Shadish and Sweeney (1991).
Chapter 3
1. Hanushek (1989) p. 45.
2. Ibid., p. 47.
3. Memo to Larry Hedges from Laine and others, April 23, 1992.
4. Hedges in Cooper and Hedges (1994) p. 30.
5. Ibid., pp. 30-33.
6. Wachter and Straf (1990) p. 167.
7. Dickersin, Scherer, and LeFebvre (1994) p. 1290.
8. Knipschild (1994) p. 719.
9. Laine and others (1992) pp. 4-8.
10. Laine and others (1992) pp. 19-25. They give their results in other mathematical terms; I have used the p value given in Hedges, Greenwald, and Laine (1994a) pp. 9-10.
11. Laine and others (1992) pp. 39-40.
12. Ibid., p. 38. The third technique, a correlation matrix analysis, yielded less impressive results than the two discussed here.
13. Eysenck (1994) p. 791.
14. Glass (1978).
15. Glass (1983) pp. 401, 404.
16. Hall and others, in Cooper and Hedges (1994) p. 20.
17. Matt and Cook, in Cooper and Hedges (1994) p. 515.
18. Hanushek (1989).
19. Hedges, Laine, and Greenwald (1994a) pp. 10, 11.
20. Ibid., p. 11.
21. That is, 0.7 times the sd of 34 percent.
22. Hanushek (1994) p. 5.
23. Ibid., pp. 7, 8.
24. Hedges, Laine, and Greenwald (1994b) p. 9; National Research Council (1992) p. 2, also cited in Draper and others (1992).
25. Hedges, Laine, and Greenwald (1994b) p. 10.
26. Hedges, Greenwald, and Laine (forthcoming).
27. Cooper (1989b) pp. 85-86.
28. Details that follow are drawn from Cooper (1989a); Cooper (1989b); and interviews with Cooper.
29. Cooper (1989a) p. 54.
30. Cooper (1989a) table 10.4; for the 60 percent figure, p. 163; and for homework and gender, pp. 74-75.
31. For published findings, see Cooper (1989a) p. 164, table 10.4.
32. Cooper (1989b).
33. I. Chalmers and others (1989).
34. Coplen and others (1990) and Hine and others (1989).
35. Hall and others, in Cooper and Hedges (1994) p. 18.
36. Cooper and Hedges (1994) p. 523.
37. Eagly and Carli (1981).
38. Eagly and Wood, in Cooper and Hedges (1994) p. 492.
39. Ingelfinger and others (1994) p. 123.
40. Wolf in Wachter and Straf (1990) p. 147.
41. Kippel (1981).
42. Becker in Cook and others (1992) p. 212.
43. The studies sponsored by the grants appear in Cook and others (1992).
44. Becker in Cook and others (1992) p. 215.
45. Cook and others (1992) p. 24.
46. Cook and Shadish (1994) p. 547 and almost all social psychology and sociology textbooks.
47. Cooper and Hedges (1994) p. 527.
48. Shadish in Cook and others (1992) p. 171. The "orientation effects" are those in his study, described earlier, of marital and family therapies, Shadish and others (1993); Shadish and others (1995).
49. Durlak and Lipsey (1991) p. 328.
Chapter 4
1. T. Chalmers (1994) p. 12.
2. "First major meta-analysis": Yusuf (1993) p. 3. The meta-analysis: T. Chalmers and others (1977).
3. "Of special interest": Lau and others (1992); "One of the most important papers": Iain Chalmers, interview, referring to Antman and others (1992).
4. Cook and others (1992) p. 285.
5. Bailar and Mosteller (1992) pp. 358, 372.
6. Lau, Schmid, and Chalmers (1995), items 1 through 15 in fig. 2.
7. Ibid., fig. 3.
8. Colorectal surgery: Baum and others (1981); endoscopy: Sacks and others (1990).
9. T. Chalmers and Lau (1993a) pp. 164-165.
10. Glass (1976).
11. Emerson and others (1990) p. 339.
12. "Smaller effect sizes": Gilbert, McPeek, and Mosteller (1977) pp. 130-131; "results can always be improved": Hugo Muench, cited in U.s. General Accounting Office (1992a) p. 75; "larger effect sizes": Shadish, Heinsman, and Ragsdale (1993). 13. See items 12, 13, 16, and 18 in "References" in Lau and others (1992). The study giving the 22 percent figure: Yusuf and others (1985). 14. "268 randomized clinical trials"-Lau, interview. Previous meta-analyses: See "References" in Lau and others (1992). 15. Lau, Schmid, and T. Chalmers (1995). 16. Laird and Mosteller (1990). 17. Devine and Cook (1983); Devine, interview. 18. Greenhouse and others, in Wachter and Straf (1990) p. 34. 19. Lipsey in Cook and others (1992) p. 98. 20. Antman and others (1992). 21. Ibid., p. 243. 22. Ibid., p. 245. 23. Lau, Schmid, and T. Chalmers (1995). 24. T. Chalmers and Lau (1993a) p. 166. 25. T. Chalmers and Lau (1993b) p. 99. 26. Mosteller and T. Chalmers (1992). 27. The meta-analysis: Teo and others (1991); the ISIS-4 trial: ISIS-4 (1995). ISIS is a group that conducted one of the two very large trials of thrombolytic treatment alluded to earlier; the acronym stands for International Study of Infarct Survival. The other very large thrombolytic trial was conducted by GISSIGruppo italiano per 10 studio della streptochinasi nell'infarto miocardicowhich, like ISIS, conducts massive clinical trials. 28. Draper and others (1992) p. 186. 29. Flather, Farkouh, and Yusuf (1994) pp. 397-398. 30. Ibid., p. 404. 31. Collins and others (1995) pp. 20, 26. 32. Early Breast Cancer Trialists' Collaborative Group (1992). 33. Light and Pillemer (1984) p. 69. 34. Rosenthal and Rubin (1982a). 35. Law, Wald, and Thompson (1994). 36. Adapted from Moore and McCabe (1993) p. 189. 37. Cited in U.S. General Accounting Office (1992) pp. 75-76. 38. Laird in Wachter and Straf (1990) p. 51; Hine and others (1989). 39. Colditz (1994). 40. Ibid., tables 1 and 2. 41. Ibid., p. 701. 42. Ingelfinger and others (1994) pp. 349-351; Cooper and Hedges (1994) chapters 19 and 20. 43. Laird and Mosteller (1990) p. 15. 44. Hedges in Wachter and Straf (1990) p. 23. 45. Hedges (1982); Rosenthal and Rubin (1982a).
Chapter 5
1. Ambady and Rosenthal (1992).
2. Cooper in Wachter and Straf (1990) p. 90, citing Jacob Cohen, Statistical Power Analysis for the Behavioral Sciences (Hillsdale, N.J.: Erlbaum, 1988).
3. Lipsey in Cook and others (1992) pp. 86, 98.
4. Latane and Darley (1968).
5. Latane and Nida (1981).
6. The example is adapted from one in Cooper and Lemke (1991) p. 247.
7. Cooper and Lemke (1991) p. 250.
8. Ambady and Rosenthal (1992) pp. 267-269.
9. "Ten times": Greenwald (1975); "Two-thirds ... one-third": Rosenthal (1991a) pp. 106-107; "55 percent ... 22 percent": T. Chalmers and others (1987a); "Other studies": Begg in Cooper and Hedges (1994) p. 400.
10. Quoted in Hetherington and others (1989) p. 374.
11. Wachter (1988).
12. Shadish, Doherty, and Montgomery (1989).
13. Yusuf (1987) p. 285.
14. Durlak and Lipsey (1991) p. 323.
15. Rosenthal (1979).
16. Rosenthal and Rubin (1978) p. 381.
17. Begg in Cooper and Hedges (1994) p. 406.
18. Draper and others (1992) pp. 120-121; Begg in Cooper and Hedges (1994) pp. 406-407.
19. Rosenthal and Jacobson (1966).
20. Rosenthal and Jacobson (1968).
21. Rosenthal and Rubin (1978).
22. Cook and Campbell (1979).
23. The example is adapted from Matt and Cook, in Cooper and Hedges (1994) p. 513.
24. Cook and others (1992) p. 299; Matt and Cook, in Cooper and Hedges (1994) pp. 506-513.
25. Hunter and Schmidt, in Cooper and Hedges (1994) p. 326.
26. Ibid., p. 327.
27. Eagly and Wood, in Cooper and Hedges (1994) p. 486.
28. Quoted in Mann (1990).
29. The 1975 survey: Lipton, Martinson, and Wilks (1975); the 1977 update: Greenberg (1977) p. 141.
30. Details that follow are from Lipsey in Cook and others (1992) pp. 83-127; Lipsey (1995); and an interview with Lipsey.
31. Lipsey in Cook and others (1992) pp. 97-98. A similar way of assessing the real importance of what seem like small effects is the Binomial Effect Size Display of Rosenthal and Rubin. They appraise the effect of a treatment in terms of the proportionate change in the improvement or cure rate itself. A rise in cure rate from 40 percent to 60 percent thus is not a 20 percent rise but a 50 percent rise (the 40 percent rate was increased by 20 points, or 50 percent). See Rosenthal and Rubin (1982c) and Rosenthal (1991a) pp. 132-135.
32. Lipsey (1995) p. 5.
33. Lipsey in Cook and others (1992) p. 98.
34. Lipsey (1995) pp. 11-13.
35. Ibid., p. 16.
36. Yeaton and Wortman (1993).
Chapter 6 1. 2. 3. 4. 5.
6. 7. 8. 9. 10. 11.
12. 13. 14.
15. 16. 17.
18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.
u.s. General Accounting Office (1982) app., pp. 59-60. Chelimsky (1994). Ibid., pp. 4-5. Chelimsky (1994). National Academy of Public Administration (1994) pp. 14-15; Robert Pear. 1994. "U.s. Watchdog Gets Criticism on Objectivity." The New York Times, Oct. 17,1994, AI, BlO. U.S. General Accounting Office (1995); see especially letter from DHHS. U.S. General Accounting Office (1984) p. 6. Ibid., pp. 7,9. Ibid., p. 16, footnote e. Ibid., pp. ii, iv. Ibid., pp. vi, vii. Ibid., p. ii. U.s. Senate Committee on Agriculture, Nutrition, and Forestry (1984) pp. 8, 16,77. Chelimsky (1994) p. 17. Cordray in Wachter and Straf (1990) p. Ill; see also Cook and others (1992) p.338. Lipsey (1993). Letter of June 14, 1990, from Representative Dante Fascell to Eleanor Chelimsky (in PEMD's Bigeye bomb files). On the outcomes: Chelimsky (1994) pp. 13-14. U.S. General Accounting Office (1991); Chelimsky (1994). U.S. General Accounting Office (1987) p. 2; Chelimsky (1994). U.S. General Accounting Office (1987) p. 3. Chelimsky (1994). Ibid., p. 2l. U.S. General Accounting Office (1992b). U.S. General Accounting Office (1994) p. 9. Ibid., p. 13. Ibid., pp. 1-2. Chelimsky (1991). Ibid. Ibid. Centers for Disease Control (1991). National Research Council (1986); U.S. Department of Health and Human Services (1986).
Notes
32. 33. 34. 35. 36. 37. 38. 39. 40.
41. 42. 43. 44. 45. 46.
47. 48.
191
Wells (1988) p. 251, tables 1,2. Bayard is quoted to that effect in Newsday, Jan. 8,1993, p. 5. U.S. Environmental Protection Agency (1992) chapter 6: p. 16. Ibid., chapter 6: pp. 17, 19. Ibid., chapter 5: pp. 34, 35. Ibid., chapter 1: p. l. Ibid., p. xv. Philip]. Hilts. 1993. "U.s. Issues Guidelines to Protect Nonsmokers." The New York Times, July 22, 1993, p. A14. Tom Kenworthy and David Brown. 1993. "Tobacco Firms Sue EPA on Cancer Ruling." The Washington Post, June 23, 1993, p. AI; John Schwartz. 1994. "Smoking Recast: From Sophistication to Sin." The Washington Post, May 29, 1994, p. Al. Tom Kenworthy and David Brown. 1993. "Tobacco Firms Sue EPA on Cancer Ruling." The Washington Post, June 23, 1993, p. AI. Gravelle and Zimmerman (1994a, 1994b). Philip]' Hilts. 1994. "Danger ofTobacco Smoke Is Said to Be Underplayed." The New York Times, Dec. 21, 1994, p. D23. Jacox and others (1994). 1. Chalmers and Haynes (1994). Lawrence K. Altman. 1995. "Agency Issues Warning for Drug Wisely Used for Heart Disease." The New York Times, Sept. 1, 1995, pp. AI, A20. The metaanalysis was published in Circulation, September 1, 1995. Lawrence K. Altman. 1995. "Agency Issues Warning for Drug Wisely Used for Heart Disease." The New York Times, Sept. 1, 1995, pp. AI, A20. Dr. Robert Temple, FDA, interview.
Chapter 7
1. Draper and others (1992) p. 43.
2. Mann (1994).
3. Shapiro (1994) p. 771.
4. Sohn (1995); see chapter 1 for a fuller quote.
5. Cooper and Hedges (1994) p. 522.
6. Details are from Lau and others' proposal to the AHCPR (unpublished) and from two interviews with Lau.
7. I. Chalmers and Altman (1995) pp. 86-94; interview with I. Chalmers; and leaflet of February 12, 1995, from the Baltimore Cochrane Center.
8. Rubin in Wachter and Straf (1990) p. 155; see also Rubin (1992, 1993).
9. Glass (1991) pp. 1141-1142.
References
This is a list of sources cited in the text and in the Notes. It also includes a relatively small number of articles and books which, though not cited, are amplifications of those which are and may be of interest to readers who wish to pursue any of the matters discussed in this book. Altick, Richard. 1975. The Art of Literary Research. New York: Norton. Ambady, Nalini, and Robert Rosenthal. 1992. "Thin Slices of Expressive Behavior as Predictors of Interpersonal Consequences: A Meta-Analysis." Psychological Bulletin 111(2): 256-274. - - - . 1993. "Half a Minute: Predicting Teacher Evaluations from Thin Slices of Nonverbal Behavior and Physical Attractiveness." Journal of Personality and Social Psychology 64(3): 431-441. Andrews, Gavin, and Robin Harvey. 1981. "Does Psychotherapy Benefit Neurotic Patients? A Reanalysis of the Smith, Glass, and Miller Data." Archives of General Psychiatry 38: 1203-1208. Antman, Elliott M., and others. 1992. "A Comparison of Results of Meta-Analyses of Randomized Control Trials and Recommendations of Clinical Experts. Treatments for Myocardial Infarction." Journal of the American Medical Association 268(2): 240-248. Bailar, john C, and Frederick Mosteller. 1992. Medical Uses of Statistics. Boston: NEjM [New England journal of Medicine] Books. Bangert-Drowns, Robert L. 1986. "Review of Developments in Meta-Analytic Method." Psychological Bulletin 99(3): 388-399. Baron, R. M., and D. A. Kenny. 1986. "The Moderator-Mediator Variable Distinction in Social Psychological Research." Journal of Personality and Social Psychology 51: 1173-1182. Baum, Mark, and others. 1981. "A Survey of Clinical Trials of Antibiotic Prophylaxis in Colon Surgery: Evidence Against Further Use of No-Treatment Controls." New EnglandJournal of Medicine 305: 795-799. Becker, Betsy jane. 1990. "Coaching for the Scholastic Aptitude Test: Further Synthesis and Appraisal." Review of Educational Research 60(3): 373-417. Begg, Colin B., and jesse A. Berlin. 1988. "Publication Bias: A Problem in Interpreting Medical Data." Journal of the Royal Statistical Society 151(3): 419-463. Buffler, P A., and others. 1984. "The Causes of Lung Cancer in Texas." In M. Mizell and P Correa, eds., Lung Cancer: Causes and Prevention. New York: Verlag Chemie International. Bunge, Mario. 1991. "A Critical Examination of the New Sociology of Science. Part 1." Philosophy of the Social Sciences 21 (4): 524-560.
- - - . 1992. "A Critical Examination of the New Sociology of Science. Part 2." Philosophy of the Social Sciences 22(1): 46-76. Bushman, Brad j., and Harris M. Cooper. 1990. "Effects of Alcohol on Human Aggression: An Integrative Research Review." Psychological Bulletin 107(3): 341-354. Centers for Disease Control, u.s. Department of Health and Human Services. 1991. "Smoking-Attributable Mortality and Years of Potential Life." Morbidity and Mortality Weekly Report 40: 62-71. Chalmers, lain, and Douglas G. Altman. 1995. Systematic Reviews. London: BMJ Publishing Group. Chalmers, lain, and Brian Haynes. 1994. "Reporting, Updating, and Correcting Systematic Reviews of the Effects of Health Care." British Medicaljournal 309: 862-865. Chalmers, lain, and others. 1989. Effective Care in Pregnancy and Childbirth. Oxford: Oxford University Press. Chalmers, Thomas. 1994. "Meta-Analysis and Its Role in Evaluating Endoscopic Therapies for Bleeding Peptic Ulcer." Masters in Gastroenterology 6(2): 12-17. Chalmers, Thomas c., and Joseph Lau. 1993a. "Meta-Analytic Stimulus for Changes in Clinical Trials." Statistical Methods in Medical Research 2: 161-172. - - - . 1993b. "Randomized Control Trials and Meta-Analyses in Gastroenterology: Major Achievements and Future Potential." In Kenneth S. Warren and Frederick Mosteller, eds., Doing More Good Than Harm: The Evaluation of Health Care Interventions. New York: New York Academy of Sciences. - - - . 1994. "What Is Meta-Analysis?" Emergency Care Research Institute 12: 1-5. Chalmers, Thomas c., and others. 1977. "Evidence Favoring the Use of Anticoagulants in the Hospital Phase of Acute Myocardial Infarction." New EnglandJournal of Medicine 297: 1091-1096. ---.1981. "A Method for Assessing the Quality of a Randomized Clinical Trial." Controlled Clinical Trials 2: 31-49. - - - . 1987a. "Meta-Analysis of Clinical Trials as a Scientific Discipline. I: Control of Bias and Comparison with Large Co-operative Trials." Statistics in Medicine 6: 315-325. - - - . 1987b. "Meta-Analysis of Clinical Trials as a Scientific Discipline. II: Replicate Variability and Comparisons of Studies That Agree and Disagree." Statistics in Medicine 6: 733-744. Chelimsky, Eleanor. 1984. "Statement on Evaluations ofWIC's Effectiveness, before the Committee on Agriculture, Nutrition, and Forestry, United States Senate. Washington, DC: U.S. General Accounting Office. Unpublished. - - - . 1991. "On the Social Science Contribution to Governmental DecisionMaking." Science 254: 226-231. - - - . 1994. "Politics, Policy, and Research SyntheSiS." Keynote address, Russell Sage Foundation National Conference on Research Synthesis, Washington, D.C.,June 21. Cochran, William G. 1937. "Problems Arising in the Analysis of a Series of Similar Experiments." Journal of the Royal Statistic Society 4(suppl.): 102-118. Cohen, Jacob. 1988. Statistical Power Analysis in the Behavioral Sciences. Hillsdale, NJ: Erlbaum.
Colditz, Graham A. and others 1993. The Efficacy ofBCG in the Prevention of Tuberculosis: Meta-Analyses of the Published Literature. Boston: Technology Assessment Group, Harvard School of Public Health. ---.1994. "Efficacy ofBCG Vaccine in the Prevention of Tuberculosis." Journal of the American Medical Association 271(9): 698-702. Collins, R., and others. 1995. "2.4: Large-scale Randomized Evidence: Trials and Overviews." In D. Weatherall,]. G. G. Ledingham, and D. A. Warrell, eds., Oxford Textbook of Medicine. Oxford: Oxford University Press. Committee on Undergraduate Education. 1994. CUE 1994-95: Harvard University Course Evaluation Guide. [Cambridge, MAl: Harvard College. Comstock, George, and Haejung Paik. 1990 "The Effects of Television Violence on Aggressive Behavior: A Meta-Analysis." Paper prepared for the National Research Council Panel on the Understanding and Control of Violent Behavior. S. 1. Newhouse School of Public Communication, Syracuse University. Cook, Thomas D. 1991. "Meta-Analysis: Its Potential for Causal Description and Causal Explanation Within Program Evaluation." In Gunther Albrecht and Hans-Uwe Otto, eds., Social Prevention and the Social Sciences. Berlin: Walter de Gruyter. Cook, Thomas D., and D. T. Campbell. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin. Cook, Thomas D., and William R. Shadish. 1994. "Social Experiments: Some Developments over the Past Fifteen Years." Annual Review of Psychology 45: 545-580. Cook, Thomas D., and others. 1984. School Desegregation and Black Achievement. Unpublished report. Washington, DC: National institute of Education. ERIC no. ED 241 67l. - - - . 1992. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Foundation. Cooper, Harris M. 1982. "Scientific Guidelines for Conducting Integrative Research Reviews." Review of Educational Research 53(2): 291-302. - - - . 1989a. Homework. New York: Longman. - - - . 1989b. "Synthesis of Research on Homework." Educational Leadership (November): 85-91. - - - . 1989c. Integrating Research: A Guide for Literature Reviews, 2nd ed. Newbury Park, CA: Sage Publications. Cooper, Harris, and Larry V. Hedges, eds. 1994. The Handbook of Research Synthesis. New York: Russell Sage Foundation. Cooper, Harris M., and Kevin Lemke, 1991. "On the Role of Meta-Analysis in Personality and Social Psychology." Personality and Social Psychology Bulletin 17(3): 245-25l. Coplen, S. E., and others. 1990. "Efficacy and Safety of Quinidine Therapy for Maintenance of Sinus Rhythm After Cardioversion. A Meta-Analysis of Randomized Control Trials." Circulation, 82: 1106-1116. Cordray, David S., and Robert T. Fischer. 1994. "Job Training and Welfare Reform: A Policy-Driven Synthesis." Paper presented at Russell Sage Foundation National Conference on Research Synthesis, June 1994, Washington, D.C. Unpublished.
Devine, Elizabeth C 1992. "Effects of Psychoeducational Care for Adult Surgical Patients: A Meta-Analysis of 191 Studies." Patient Education and Counseling 19: 129-142. Devine, Elizabeth C, and Thomas D. Cook. 1983. "A Meta-Analytic Analysis ofEffects of Psycho educational Interventions on Length of Postsurgical Hospital Stay." Nursing Research 32(5): 267-274. - - - . 1986. "Clinical and Cost-Saving Effects of Psycho educational Interventions with Surgical Patients: A Meta-Analysis." Research in Nursing & Health 9: 89-lO5. Dickersin, Kay, Roberta Scherer, and Carol LeFebvre. 1994. "Identifying Relevant Studies for Systematic Reviews." British MedicalJournal309: 1286-1291. Draper, David, and others. 1992. Contemporary Statistics, Number 1: Combining Information. Statistical Issues and Opportunities for Research. Washington: National Academy Press, for the American Statistical Association. Druckman, Daniel. 1993. "The Situational Levers of Negotiating Flexibility." Journal of Conflict Resolution 37(2): 236-276. - - - . 1994. "Determinants of Compromising Behavior in Negotiation." Journal of Conflict Resolution 38(3): 507-556. Durlak, Joseph A., and Mark W Lipsey. 1991. "A Practitioner's Guide to MetaAnalysis." AmericanJournal of Community Psychology 19(3): 291-332. Eagly, Alice H., and Linda L. Carli. 1981. "Sex of Researchers and Sex-Typed Communications as Determinants of Sex Differences in Influenceability: A MetaAnalysis of Social Influence Studies." Psychological Bulletin 90(l): 1-20. Eagly, Alice H., and Blair T. Johnson. 1990. "Gender and Leadership Style: A MetaAnalysis." Psychological Bulletin 108(2): 232-256. Eagly, Alice H., and Wendy Wood. 1991. "Explaining Sex Differences in Social Behavior: A Meta-Analytic Perspective." Personality and Social Psychology Bulletin 17(3): 306-315. Early Breast Cancer Trialists' Collaborative Group. 1992. "Systematic Treatment of Early Breast Cancer by Hormonal, Cytotoxic, or Immune Therapy." The Lancet 339: 1-15, 71-85. Emerson, John D., and others. 1990. "An Empirical Study of the Possible Relation of Treatment Differences to Quality Scores in Controlled Randomized Clinical Trials." Controlled Clinical Trials 11: 339-352. Epstein, William. 1995. The Illusion of Psychotherapy. New Brunswick, N]: Transaction Publishers. Eysenck, H.]. 1952. "The Effects of Psychotherapy: An Evaluation." Journal of Consulting Psychology 16: 319-324. - - - . 1978. "An Exercise in Mega-Silliness." American Psychologist 33: 517. ---.1990. Rebel with a Cause. London: W H. Allen. - - - . 1992. "Meta-Analysis, Sense or Nonsense?" Pharmaceutical Medicine 6: 113-119. - - - . 1994. "Meta-Analysis and Its Problems." British Medical Journal 309: 789-792. Fischer, P J. 1991. Alcohol, Drug Abuse and Mental Health Problems Among Homeless Persons: A Review of the Literature. Washington: U.s. Department of Health and Human Services.
Fisher, R. A. 1926. "The Arrangement of Field Experiments." Journal of the Ministry of Agriculture of Great Britain 33: 505-513. Flather, M. D., M. E. Farkouh, and S. Yusuf. 1994. "Meta-Analysis in the Evaluation of Therapies." In Desmond G.Julian and Eugene Braunwald, eds., Management of Acute Myocardial Infarction. London: W B. Saunders Company, Ltd. Gallo, Philip S. 1978. "Meta-Analysis-A Mixed Metaphor?" American Psychologist 33(5): 515-517. Gilbert, John P., Bucknam McPeek, and Frederick Mosteller. 1977. "Progress in Surgery and Anesthesia: Benefits and Risks of Innovative Therapy." In John P. Bunker, Benjamin A. Barnes, and Frederick Mosteller, eds., Costs, Risks, and Benefits of Surgery. New York: Oxford University Press. Glass, Gene V 1976. "Primary, Secondary, and Meta-Analysis of Research." The Educational Researcher 10: 3-8. - - - . 1978. "In Defense of Generalization." The Behavioral and Brain Sciences 3: 394-395. ---.1983. "Synthesizing Empirical Research: Meta-Analysis." In S. A. Ward and L. J. Reed, eds., Knowledge Structure and Use. Philadelphia: Temple University Press. - - - . 1991. "Review of The Future of Meta-Analysis [Wachter and Straf (1990))." Journal of the American Statistical Association 86(416): 1141. Glass, Gene V, Barry McGaw, and Mary Lee Smith. 1981. Meta-Analysis in Social Research. Beverly Hills, CA: Sage Publications. Goldfried, Marvin R., Leslie S. Greenberg, and Charles Marmar. 1990. "Individual Psychotherapy: Process and Outcome." Annual Review of Psychology 41: 659-688. Gravelle, Jane G., and Dennis Zimmerman. 1994a. CRS Report for Congress. Cigarette Taxes to Fund Health Care Reform: An Economic Analysis. Washington: Congressional Research Service, Library of Congress. - - - . 1994b. Statement on Environmental Tobacco Smoke before the Subcommittee on Clean Air and Nuclear Regulation, Committee on Environment and Public Works, U.S. Senate, May 11,1994. Washington: Congressional Research Service. Greenberg, D. F. 1977. "The Correctional Effects of Corrections: A Survey of Evaluations." In D. F. Greenberg, ed., Corrections and Punishment. Newbury Park, CA: Sage Publications. Greenberg, Roger P., and others. 1992. "A Meta-Analysis of Antidepressant Outcome Under 'Blinder' Conditions." Journal of Consulting and Clinical Psychology 60(5): 664-669. - - - . 1994. "A Meta-Analysis of Fluoxetine Outcome in the Treatment of Depression." Journal of Nervous and Mental Disease 182(10): 547-551. Greenwald, A. G. 1975. "Consequences of Prejudice Against the Null HypotheSiS." Psychological Bulletin 85: 845-857. Hall, Judith A. 1978. "Gender Effects in Decoding Nonverbal Cues." Psychological Bulletin 85(4): 845-857. Hanushek, Eric A. 1981. "Throwing Money at Schools." Journal of Policy Analysis and Management 1: 19-41.
- - - . 1989. "The Impact of Differential Expenditures on School Performance." Educational Researcher 18(4): 45-62. - - - . 1994. "Money Might Matter Somewhere: A Response to Hedges, Laine, and Greenwald." Educational Researcher 23(4): 5-8. Hayashi, K., and T. C. Chalmers. 1993. "Famotidine in the Treatment of Duodenal Ulcer: A Meta-Analysis of Randomized Control Trials." Gastroenterology International 6(1): 19-25. Hedges, Larry V. 1982. "Estimation of Effect Size from a Series of Independent Experiments." Psychological Bulletin 92: 490-499. - - - . 1987. "How Hard Is Hard Science, How Soft Is Soft Science? The Empirical Cumulativeness of Research." American Psychologist 42: 443-455. Hedges, Larry v., Rob Greenwald, and Richard D. Laine. 1996. "The Effect of School Resources on Student Achievement." Review of Educational Research 66:361-396. Hedges, Larry v., Richard D. Laine and Rob Greenwald. 1994a. "Does Money Matter? A Meta-Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes." Educational Researcher 23(3): 5-14. - - - . 1994b. "Money Does Matter Somewhere: A Reply to Hanushek." Educational Researcher 23(4): 9-10. Hedges, Larryv., and Ingram Olkin. 1985. Statistical Methodsfor Meta-Analysis. Orlando, FL: Academic Press. Held, Peter H., Salim Yusuf, and Curt D. Furberg. 1989. "Calcium Channel Blockers in Acute Myocardial Infarction and Unstable Angina: An Overview." British Medical Journal 299: 1187-1192. Hetherington, jini, and others. 1989. "Retrospective and Prospective Identification of Unpublished Controlled Trials: Lessons from a Survey of Obstetricians and Pediatricians." Pediatrics 84(2): 374-380. Hine, L. K., and others. 1989. "Meta-Analytic Evidence Against Prophylactic Use of Lidocaine in Acute Myocardial Infraction." Archives of Internal Medicine 149: 2694-2698. Hunter,john E., and Frank L. Schmidt. 1990. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, CA: Sage Publications. Ingelfinger, joseph A., and others. 1994. Biostatistics in Clinical Medicine, 3rd ed. New York: McGraw-Hill, Inc. ISIS-4 [Fourth International Study of Infarct Survival] Collaborative Group. 1995. "ISIS-4: A Randomised Factorial Trial Assessing Early Oral Captopril, Oral Mononitrate, and Intravenous Magnesium Sulphate in 58,050 Patients with Suspected Acute Myocardial Infarction." The Lancet 345: 669-685. jacox, A., and others. 1994. Management of Cancer Pain: Clinical Practice Guideline No.9. Washington: u.s. Department of Health and Human Services, Public Health Service. janerich, D. T., and others 1990. "Lung Cancer and Exposure to Tobacco Smoke in the Household." New EnglandJournal of Medicine 323: 632-636. Kabat, G. c., and E. L. Wynder. 1984. "Lung Cancer in Nonsmokers." Cancer 53: 1214-1221. Kippel, G. 1981. "Identifying Exceptional Schools." New Directions for Program Evaluation 11: 83-100.
Knipschild, Paul. 1994 "Some Examples." British Medicaljournal309: 719-72l. Kranzler, John H., and Arthur Jensen. 1989. "Inspection Time and Intelligence: A Meta-Anaylsis." Intelligence 13: 329-347. Kraus, Stephen]. 1995. "Attitudes and the Prediction of Behavior: A Meta-Analysis of the Empirical Literature." Personality and Social Psychology Bulletin 210): 58-75. Laine, Richard, and others. 1992. "Dollars and Sense: Reassessing Hanushek." Unpublished paper submitted in Education 411, University of Chicago School of Education, June 16, 1992. Laird, Nan M., and Frederick Mosteller. 1990. "Some Statistical Methods for Combining Experimental Results." InternationalJournal of Technology Assessment in Health Care 6: 5-30. Landman,Janet Tracy, and Robyn M. Dawes. 1982. "Psychotherapy Outcome: Smith and Glass' Conclusions Stand Up Under Scrutiny:" American Psychologist 37(5): 504-516. Latane, Bibb, and John Darley. 1968. "Group Inhibition of Bystander Intervention in Emergencies." Journal of Personality and Social Psychology 10(31): 215-221. Latane, Bibb, and Steve Nida. 1981. "Ten Years of Research on Group Size and Helping." Psychological Bulletin 89(2): 308-324. Lau,]oseph, Christopher H. Schmid, and Thomas C. Chalmers. 1995. "Cumulative Meta-Analysis of Clinical Trials Builds Evidence for Exemplary Medical Care." Journal of Clinical Epidemiology 48(1): 45-57. Lau,joseph, and others. 1992. "Cumulative Meta-Analysis of Therapeutic Trials for Myocardial Infarction." New England Journal of Medicine 327: 248-254. Law, M. R., N.J. Wald, andS. G. Thompson. 1994. "By How Much and How QUickly Does Reduction in Serum Cholesterol Concentration Lower Risk of Ischaemic Heart Disease?" British Medical Journal 308: 367-373. Lee, I.M., C. C. Hsieh, and R. S. Paffenbarger, Jr. 1995. "Exercise Intensity and Longevity in Men. The Harvard Alumni Health Study:" Journal of the American Medical Association 273(5): 1179-1184. Lehman, Anthony F., and David S. Cordray. 1993. "Prevalence of Alcohol, Drug, and Mental Disorders Among the Homeless: One More Time." Contemporary Drug Problems (Fall): 355-383. Light, Richard. 1984. "Six Evaluation Issues That SynthesiS Can Resolve Better than Single Studies." In W. H. Yeaton and P. M. Wortman, eds., Issues in Data Synthesis. New Directions for Program Evaluation, no. 24. San Francisco: JosseyBass. Light, Richard]., and David B. Pillemer. 1982. "Numbers and Narrative: Combining Their Strengths in Research Reviews." Harvard Educational Review 52(1): 1-26. - - - . 1984. Summing Up: The Science of Reviewing Research. Cambridge, MA: Harvard University Press. Light, Richard, and Paul V. Smith. 1971. "Accumulating Evidence: Procedures for Resolving Contradictions Among Different Research Studies." Harvard Educational Review 41(4): 429-471. Lindzey, Gardner, ed. 1980. A History of Psychology in Autobiography, Vol. VII. San Francisco: W H. Freeman.
Lipsey, Mark W 1993. "Using Linked Meta-Analysis to Build Policy Models." Paper prepared for National Institu te of Drug Abuse Technical Review, Bethesda, MD. ---.1995. "What Do We Learn from 400 Research Studies on the Effectiveness of Treatment with Juvenile Delinquents?" In]. McGuire, ed., What Works: Effective Methods to Reduce Re-Offending. New York: John Wiley. Lipsey, Mark. W, and David B. Wilson. 1993. "The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis." American Psychologist 48(12): 1181-1209. Lipton, D., R. Martinson, and]. Wilks. 1975. The Effectiveness of Correctional Treatment: A Survey of Treatment Evaluation Studies. New York: Praeger. Luborsky, Lester. 1954. "A Note on Eysenck's Article, The Effects of Psychotherapy: An Evaluation.'" BritishJournal of Psychology 45: 129-13l. - - - . 1972. "Another Reply to Eysenck." Psychological Bulletin 78(5): 406-408. Luborsky, Lester, Barbara Singer, and Lisa Luborsky. 1975. "Comparative Studies of Psychotherapies: Is It True That 'Everyone Has Won and All Must Have Prizes'?" Archives of General Psychiatry 32: 995-1008. Luborsky, Lester, and others. 1993. "The Effect of Dynamic Psychotherapies: Is It True That 'Everyone Has Won and All Must Have Prizes'?" In N. Miller, L. Luborsky, and others, eds., Psychodynamic Treatment Research: A Handbook for Clinical Practice. New York: Basic Books. Maccoby, Eleanor E., and Carol N. Jacklin. 1974. The Psychology of Sex Differences. Standford: Stanford University Press. Mann, Charles C. 1990. "Meta-Analysis in the Breech." Science 249: 476-478. - - - . 1994. "Can Meta-Analysis Make Policy?" Science 266(11): 960-962. Miller, Norman, and Vicki Pollock. Forthcoming. "Use of Meta-Analysis for Testing Theory." In Evaluation and the Health Professions. Moore, David S., and George P. McCabe. 1993. Introduction to the Practice of Stat istics, 2nd ed. New York: W H. Freeman. Morton, R.B. 1955. "An Experiment in Brief Psychotherapy." Pscyhological Monographs 69 (386). Mosteller, Frederick, and Thomas C. Chalmers. 1992. "Some Progress and Problems in Meta-Analysis of Clinical Trials." Statistical Science 7(2): 227-236. Mulrow, Cynthia D. 1987. "The Medical Review Article: State of the Science." Annals of International Medicine 106: 485-488. - - - . 1994. "Rationale for Systematic Reviews." British Medical Journal 309: 597-599. National Academy of Public Administration. 1994. The Roles, Mission and Operation of the U.S. General Accounting Office. Report Prepared for the Committee on Governmental Affairs, United States Senate, Washington: National Academy of Public Administration. National Research Council. 1986. Environmental Tobacco Smoke: Measuring Exposures and Assessing Health Effects. Washington: National Academy Press. National Task Force on the Prevention and Treatment of Obesity. 1994. "Weight Cycling." Journal of the American Medical Association 272(15): 1196-1202. aikin, Ingram, and Douglas V. Shaw. 1994. "Meta-Analysis and Its Applications in Horticultural Science." Stanford University; University of California, Davis. Unpublished paper.
Orwin, Robert G., and David S. Cordray. 1985. "Effects of Deficient Reporting on Meta-Analysis: A Conceptual Framework and Reanalysis." Psychological Bulletin97(l): 134-147. Paik, Haejung, and George Comstock. 1994. "The Effects of Television Violence on Antisocial Behavior: A Meta-Analysis." Communication Research 21(4): 516-546. Pearson, Karl. 1904. "Report on Certain Enteric Fever Inoculation Statistics." British Medical Journal 3: 1243-1246. Prioleau, Leslie, Martha Murdock, and Nathan Brody. 1983. "An Analysis of Psychotherapy Versus Placebo Studies." Behavioral and Brain Sciences 6: 275-310. Ray, William]. 1993. Methods: Toward a Science of Behavior and Experience. Pacific Grove, CA: Brooks/Cole Publishing Company. Reisman, John M. 1976. A History of Clinical Psychology. New York: Irvington. Rosenthal, Robert. 1979. "The 'File-Drawer Problem' and Tolerance for Null Results." Psychological Bulletin 86: 638-641. - - - . 1991a. Meta-Analytic Procedures for Social Research, rev. ed. Newbury Park, CA: Sage Publications. - - - . 1991b. "Teacher Expectancy Effects: A Brief Update 25 Years After the Pygmalion Experiment." Journal of Research in Education 1(3): 3-12. Rosenthal, Robert, and Lenore Jacobson. 1966. "Teachers' Expectancies: Determinants of Pupils' IQ Gains." Psychological Reports 19: 115-118. - - - . 1968. Pygmalion in the Classroom. New York: Holt, Rinehart &: Winston. Rosenthal, Robert, and Donald. B. Rubin. 1978. "Interpersonal Expectancy Effects: The First 345 Studies." Behavioral and Brain Sciences 3: 377-415. - - - . 1979. "Comparing Significance Levels of Independent Studies." Psychological Bulletin 86(5): 1165-1168. - - - . 1982a. "Comparing Effect Sizes of Independent Studies." Psychological Bulletin 92: 500-504. - - - . 1982b. "Further Meta-Analytic Procedures for Assessing Cognitive Gender Differences." Journal of Educational Psychology 74(5): 708-712. - - - . 1982c. "A Simple, General Purpose Display of Magnitude of Experimental Effect." Journal of Educational Psychology 74(2): 166-169. Rubin, Donald B. 1992. "Meta-Analysis: Literature Synthesis or Effect-Size Surface Estimation?" Journal of Educational Statistics 17 (4): 363-374. - - - . 1993. "Statistical Tools for Meta-Analysis: From Straightforward to Esoteric." In Peter David Blanck, ed., Interpersonal Expectations: Theory, Research, and Applications. Cambridge: Cambridge University Press. Sacks, Henry 5., and others. 1987. "Meta-Analysis of Randomized Controlled Trials." New EnglandJournal of Medicine 316: 450-455. - - - . 1990. "Endoscopic Hemostasis: An Effective Therapy for Bleeding Peptic Ulcers." Journal of the American Medical Association 264(4): 494-499. Shadish, William R., Jr., Maria Doherty, and Linda M. Montgomery. 1989. "How Many Studies Are in the File Drawer? An Estimate from the FamilylMarital Psychotherapy Literature." Clinical Psychology Review 9: 589-603. Shadish, William R., Jr., Donna T. Heinsman, and Kevin Ragsdale. 1993. "Comparing Experiments and Quasi-Experiments Using Meta-Analysis." Paper
presented at Annual Convention of the American Psychological Association, Dallas, TX, November 4. Shadish, William R.,Jr., and Rebecca R. Sweeney. 1991. "Mediators and Moderators in Meta-Analysis: There's a Reason We Don't Let Dodo Birds Tell Us Which Psychotherapies Should Have Prizes." Journal of Counseling and Clinical Psychology 59(6): 883-893. Shadish, William R., Jr., and others. 1993. "Effects of Family and Marital Psychotherapies: A Meta-Analysis." Journal of Counseling and Clinical Psychology 61(6): 992-1002. - - - . 1995. "The Efficacy and Effectiveness of Marital and Family Therapy: A Perspective from Meta-Analysis." Journal of Marital and Family Therapy 21: 343-358. Shapiro, David A., and Diana Shapiro. 1982. "Meta-Analysis of Comparative Therapy Outcomes: A Reply to Wilson." Behavioural Psychology 10: 307-310. Shapiro, Samuel. 1994. "Point/Counterpoint: Meta-Analysis of Observational Studies. Meta-Analysis/Shmeta-analysis." American Journal of Epidemiology 140(9): 771-778. Shoham, Varda, and Michael Rohrbaugh. 1994. "Paradoxical Intervention." Encyclopedia of Psychology 3: 5-8. Shoham-Salomon, Varda, Ruth Avner, and Rivka Neeman. 1989. "You're Changed If You Do and Changed If You Don't: Mechanisms Underlying Paradoxical Interventions." Journal of Consulting and Clinical Psychology 57(3): 590-598. Shoham-Salomon, Varda, and Robert Rosenthal. 1987. "Paradoxical Interventions: A Meta-Analysis." Journal of Consulting and Clinical Psychology 55(1): 22-28. Smith, Mary Lee, and Gene V Glass. 1977. "Meta-Analysis of Psychotherapy Outcome Studies." American Psychologist 32(9): 752-760. - - - . 1987. Research and Evaluation in Education and the Social Sciences. Beverly Hills: Sage Publications. Sohn, David. 1995. "Meta-Analysis as a Means of Discovery." American Psychologist. (February): 108-110. Teo, K. K., and others. 1991. "Effects of Intravenous Magnesium in Suspected Acute Myocardial Infarction." British Medical Journal 303: 1499-1503. Thompson, Simon G. 1994. "Why Sources of Heterogeneity in Meta-Analysis Should Be Investigated." British Medical Journal 309: 1351-1355. Tippett, Leonard H. C. 1931. The Methods of Statistics. London: Williams &. Norgate. U.S. Department of Health and Human Services. 1986. The Health Consequences of Involuntary Smoking: A Report of the Surgeon General. DHHS Pub. No. (PHS) 87 -8398. Washington: U.S. Department of Health and Human Services. U.S. Environmental Protection Agency, Office of Research and Development. 1992. Respiratory Health Effects of Passive Smoking: Lung Cancer and Other Disorders. Washington: U.S. Environmental Protection Agency. U.S. General Accounting Office. 1984. Report to the Committee on Agriculture, Nutrition, and Forestry, United States Senate. GAOIPEMD-84-4 [Evaluation of the Special Supplemental Program for Women, Infants, and Childrenl. Washington: U.S. General Accounting Office.
- - - . 1987. Drinhing-Age Laws: An Evaluation Synthesis of Their Impact on Highway Safety. GAOIPEMD-87-10. Washington: U.S. General Accounting Office. - - - . 1991. Rental Housing: Implementing the New Federal Incentives to Deter Prepayments of HUD Mortgages. GAO/PEMD-91-2. Washington: U.s. General Accounting Office. - - - . 1992a. The Evaluation Synthesis. GAO/PEMD-1O.1.2. Washington: U.S. General Accounting Office. - - - . 1992b. Cross Design Synthesis: A New Strategy for Medical Effectiveness Research. GAO/PEMD-92-1S. Washington: U.s. General Accounting Office. - - - . 1992c. Hispanic Access to Health Care: Significant Gaps Exist. GAO-PEMD92-6. Washington, DC: U.S. General Accounting Office. ---.1994. Breast Conservation Versus Mastectomy: Patient Survival in Day-to-Day Medical Practice and in Randomized Studies. GAO/PEMD-95-9. Washington: U.s. General Accounting Office. - - - . 1995. Welfare to Worh: State Programs Have Tested Some of the Proposed Reforms. GAOIHEHS-95-93. Washington: U.S. General Accounting Office. U.S. National Institutes of Health, National Task Force on the Prevention and Treatment of Obesity. 1994. "Weight Cycling." Journal of the American Medical Association 272: 1196-1202. U.S. Senate Committee on Agriculture, Nutrition, and Forestry. 1984. Hearings of March 15 and April 9, 1984, on Evaluation and Reauthorization of the Special Supplemental Food Program for Women, Infants, and Children [WIC]. Washington: U.S. Government Printing Office. Wachter, Kenneth W 1988. "Disturbed by Meta-Analysis?" Science 241: 1407-1408. Wachter, Kenneth W, and Miron Straf, eds. 1990. The Future of Meta-Analysis. New York: Russell Sage Foundation. Wells, A. Judson. 1988. "An Estimate of Adult Mortality in the United States from Passive Smoking." Environment International 14: 249-265. Wolf, F. W 1986. Meta-Analysis: Quantitative Methods jilr Research SyntheSis. Newbury Park, CA: Sage Publications. Wortman, Paul M. 1992. "Lessons from the Meta-Analysis of Quasi-Experiments." In Fred B. Bryant and others, eds., Methodological Issues in Applied Social Psychology. New York: Plenum. Yeaton, William H., and Paul M. Wortman. 1993. "On the Reliability of MetaAnalytic Reviews." Evaluation Review 17(3): 292-309. Yusuf, Salim. 1987. "Obtaining Medically Meaningful Answers from an Overview of Randomized Clinical Trials." Statistics in Medicine 6: 281-286. ---.1993. "Cardiologist Salim Yusuf Gets to the Heart of Meta-Analysis." [Interview with Yusuf.l Science Watch 4(8): 3-4, Sept.lOct. Yusuf, Salim, and others. 1985. "Intravenous and Intracoronary Fibrinolytic Therapy in Acute Myocardial Infarction: Overview of Results on Mortality, Reinfarction and Side-Effects from 33 Randomized Controlled Trials." European Heart Journal 6: 556-585. - - - . 1994. "Effect of Coronary Artery Bypass Graft Surgery on Survival: Overview of lO-Year Results from Randomised Trials by the Coronary Artery Bypass Graft Surgery Trialists Collaboration." The Lancet, 344(8922): 563-570.
Index
Agency for Health Care Policy and Research (AHCPR), 158-59 agreement rate (AR), 37 Altick, Richard, 35 Ambady, Nalini, 109-12,115-18 American Psychological Association, 41 American Statistical Association: on combined p values, 39; potential of meta-analysis, 161 Antman, Elliott, 94 "apples and oranges" issue, 61-63 artifacts, 131-34 Ascher, Michael, 50 averaging: effect sizes, 31-33; metaanalysis, 9-10, 51; Pearson's, 8 Bacille Calmette-Guerin (BCG), 104-8 Barber, Theodore X., 122, 125 Battle over Homework, The ( Cooper), 70 Bayard, Steven, 155-56 Becker, Betsy Jane, 24n, 71, 73-74, 77 Bergin, Allen, 26 biases: in effect size estimates and samples, 172-73; errors resulting from, 58; publication, 118--21, 132 Boschwitz, Rudy, 145 breast cancer meta-analyses, 99-101, 149-54 Breton, Andre, 9,20 Bright, rvey, 46 Bushman, Brad, 16 Calmette, Leon, 104 Carli, Linda, 72 causality: explanations of, 77-80; using meta-analysis to establish, 74-77
Centers for Disease Control (CDC), 104-7 Chalmers, lain, 83, 86, 94, 98--99 Chalmers, Thomas, 7,81-82,84, 86-89,92,94-96,103,164 Chelimsky, Eleanor, 135-39, 142, 145-46,149,153 clinical trials: meta-analysis of acute myocardial infarction treatment using streptokinase, 86-99; metaanalysis oflarge, 99-101; question of small versus large, 96-99; results of meta-analyses of, 84; using RTMAS to hasten, 164 Cochran, Thad, 145 Cochran, William G., 11 Cochrane Collaboration: funding of, 165; U.K. Centre, 83; use of metaanalysis, 164; work of, 98 Cochrane Database of Systematic Reviews, 164 coding: of data for meta-analysis, 29, 35-37,45-46; decision making and compatibility in, 36-37; for effect size, 33, 36-37; form used in meta-analysis, 40; task of, 35 Cohen,Jacob,33, 169, 180 Colditz, Graham, 104-8 Collins, Rory, 97-101 colon cancer meta-analysis, 2; 2, confidence interval: around weighted effect size estimate, 175; as range of real value, 24-25; substitute for p level techniques, 175 Cook, Thomas, 79 Cooper, Harris, 13, 16, 24n, 55, 68-71, 114, 163, 177-78 Cordray, David, 16, 128, 139, 141
Darley, John, 112-13
data: assumptions about, 39, 172, 176; coding of, 35-37, 45-46; combining differentiated, 61-63; combining divergent (Pearson), 8, 10; combining for WIC program analysis, 141; combining in meta-analysis, 12, 34, 37, 46; correlation and combined correlation among, 9-10; dealing with missing, 172-73; dissemination of meta-analytic, 164; extraction and selection for meta-analysis, 35-39; finding data in research reports, 171; generalization (Tippett), 10-11; publication by Cochrane Collaboration, 164; used in development of effect size, 30
data, outlier: outlier studies, 72-73; techniques in handling, 46
data analysis: for behavior meta-analysis, 116; meta-analysis of homework research, 69; in meta-analysis of juvenile delinquency treatment, 130
data bases: creation for meta-analysis, 94-95; SEER (Surveillance, Epidemiology, and End Results), 151-52
Devine, Elizabeth, 92
d-index: adjustment for bias in estimates, 173; calculation of, 171; criteria for choosing, 170; expression of effect size using, 84; function of, 169-70
Droitcour, Judith, 149-53
Druckman, Daniel, 12
Durlak, Joseph, 79
Dyson, Freeman, 19
Eagly, Alice, 72
effect size metrics: defined, 31; identification and function of, 169-71; problems of, 172. See also d-index; odds ratio; percentage of variance (PV); r-index
effect sizes: absence of measurement in vote-counting, 25; adjustment for exogenous variables, 177-79; assumptions related to, 172, 174; averaging, 31-33; biases in estimates of, 172-73; coding of, 40, 46; combining, 11-12, 36, 38-39, 123-24; concept of, 33-34, 37; directions for generating, 170; heterogeneity, 102; homogeneity analysis of, 62-63, 177-79; interpreting, 180-81; measurement in medical research, 84; meta-analysis of juvenile delinquency treatment, 129-31; meta-analysis techniques for estimating, 172; multiple effect size, 36, 173-74; multiple estimates based on different assumptions, 172; from percentage of variance, 171; random and fixed models, 107-8, 179; standardization of measures of, 30-31; for subsets of studies, 175; summarized, 37; variance in, 107-8, 176-77; weighting of, 38, 85, 175. See also coding; fixed effects model; random effects model; variables, moderator
environmental tobacco smoke meta-analysis, 154-58
Epstein, William, 43
errors: artifactual, 132; avoiding, 37-38, 58-59; in effect size estimates, 38, 176; homoscedasticity of error, 176; random, 132; sampling, 4, 38, 177; techniques to avoid, 48
Evaluation of Psychotherapeutic Outcomes, The (Bergin and Garfield), 26
expectancy effect theory, 121-25
Eysenck, H. J., 13, 20-21, 25-27, 34, 42-43, 61, 162
Fascell, Dante, 147
file-drawer problem, 118-21
Fineberg, Harvey, 105
Fisher, Ronald A., 24
fixed effects model, 107-8, 179
Fossett, Chris, 140-41, 143
F-value, 172
Gadlin, Howard, 125
Gallo, Philip, 42
Garfield, Sol, 26
General Accounting Office (GAO), Program Evaluation and Methodology Division (PEMD), 135-60
generalization: conditions for, 175; with differentiated universe, 61-63; testing for possibility of, 113-14; Tippett's, 10-11
Glass, Gene V., 11-12, 25-34, 61-62, 85, 91, 108, 163-64
Gonzalez, Henry, 147
Greenstein, Robert, 145
Greenwald, Rob, 55, 59-61, 63, 65-67
Guérin, Camille, 104
Hall, Judith, 13, 35, 62, 72
Handbook of Research Synthesis, The (Hedges and Cooper), 55, 60, 62, 71, 163, 177
Hanushek, Eric A., 54-56, 64-66
Hedges, Larry, 13, 24n, 41-42, 45, 55, 56, 63-67, 71, 108, 163, 177-79
Helms, Jesse, 135-36, 139-40, 142, 145
heterogeneity test, 102
Hilgard, Ernest, 125
Hine, L. K., 103
Homework (Cooper), 70
homework meta-analysis, 69-70
homogeneity analysis, 177-79
homogeneity test, 62-63, 108, 126
Hunter, John, 35, 127
Illusion of Psychotherapy, The (Epstein), 43
inference: in homogeneity analysis, 177; parametric inference procedures, 176
Jacobson, Lenore, 122
Jensen, Arthur, 15
juvenile delinquency treatment meta-analysis, 128-34
Kelvin, William Thomson (Lord), 7
Knipschild, Paul, 59
Koestler, Arthur, 20
Kranzler, John, 15
Laine, Richard, 55, 56, 59-61, 63, 65-67
Laird, Nan, 12, 92, 103
Larson, Eric, 150
Latané, Bibb, 112-13
Lau, Joseph, 7, 87-91, 92, 94-96, 164
Lehman, Anthony, 16
Lemke, Kevin, 114
Light, Richard J., 3, 7, 102, 139, 141
Lipsey, Mark, 79, 93, 128-34, 146
Mann, Thomas, 20
marital and family therapy meta-analysis, 44-53
measures of outcome, 91-93
mediational models, 51-53
mediator variables. See variables, mediator
meta-analysis: achievements of, 14-17; averaging, 9-10, 51; benefits, 18-19; combining of research findings in, 10; cumulative, 87; definition and functions, 1-4, 55; emergence of process, 12; file-drawer problem, 118-21; in GAO's PEMD division, 135-39; growth of, 161; of Hanushek's work, 59-61, 63-67; influence and acceptance of, 43, 67; opposition to and criticism of, 13, 18, 42, 61, 162; other names for, 2n; outcome measures in, 91-93; of paradoxical intervention studies, 49-51; performing very large meta-analyses, 99-101; primary and secondary goals, 47, 51; real-time, 164; role
of Cochrane Collaboration, 164-65; statistical problems of, 172-74; support for, 13, 41-42; for U.S. Congress, 137-39, 146-49; uses in government health-related areas, 158-60; using large or small clinical trials, 96-99; validity and threats to validity of, 125-28
Meta-Analysis for Explanation: A Casebook, 43, 78
Meta-Analytic Procedures for Social Research (Rosenthal), 163
methodology: development of meta-analytic, 25-39; effect of improved, 102; interpreting differences in, 180-81
Miller, George, 33
moderator variables. See variables, moderator
Mosteller, Frederick, 13, 22, 58, 82-83, 85, 92, 104-5, 113
Moynihan, Daniel Patrick, 139
Newton, Isaac, 1
noise in data. See artifacts
null hypothesis: conditions for rejection of, 175; p values to test, 38; in test of significance, 24n
Oberstar, James, 148
odds ratio: adjustment for bias in estimates, 173; calculation of, 172; described, 84-85; function of, 169-70; weighting, 85
Olkin, Ingram, 13, 17, 19, 45, 48, 128
outliers. See data, outlier
Paige, David, 145
paradoxical intervention studies, 49-51
parametric inference tests, 176
path models: in explaining causality, 79; function of, 53; simple and complex, 74-77
Pauling, Linus, 58-59
Pearson, Karl, 8
PEMD (GAO Program Evaluation and Methodology Division). See Program Evaluation and Methodology Division (PEMD), General Accounting Office
percentage of variance (PV), 171
Peto, Richard, 97-98, 100
Pillemer, David, 7, 102
Planck, Max, 125
p levels: combining values, 38-39, 123; conversion to z score, 39; substitution of confidence interval for, 175
point estimate, 24
policy models, 146
population. See universe
probabilities: combining (Tippett), 11; in statistical analysis, 23-24
Program Evaluation and Methodology Division (PEMD), General Accounting Office, 135-60
psychotherapeutic effectiveness meta-analysis, 25-44
publication bias, 118-21, 132
PV (percentage of variance). See percentage of variance (PV)
Pygmalion effect, 122
quality: rating or data validity score, 106, 141; weighting for, 85-86
random effects model, 107-8, 179
Real-Time Meta-Analysis System (RTMAS), 164
Rebel with a Cause (Eysenck), 21
regression, multiple, 48
Reilly, William K., 157
research: chaotic and anarchic nature of contemporary, 1-4; discrepancies and contradictions, 4-6; vote-counting analysis, 22-25
research findings: combining, 102; data validity scoring, 106, 141; limitations of literature review articles, 6-7; meta-analysis to uncover new knowledge, 70-73; seeming disarray, 2-3;
synthesis, 39; unpublished, 118-21; used in development of meta-analysis, 25-39; weighting for quality, 85-86
research synthesis. See meta-analysis
Reynolds, William, 81
r-index: conversion to z scores, 173; criteria for choosing, 170; function of, 169-70; sampling distributions, 173; t-test conversion to, 172
Rosenthal, Robert, 11, 36, 41, 50-51, 68, 102, 110-12, 116-18, 120-25, 132, 163
RTMAS. See Real-Time Meta-Analysis System (RTMAS)
Rubin, Donald, 102, 120-21, 123-25, 165-66
Rush, David, 145
sample statistics, 173
sampling errors: chance of and correction for, 38, 57; effect of, 4; failure to test for, 7; as random factor, 179; relation to effect size estimate, 176-77
sampling theory, 176
Schmidt, Frank, 35, 126, 161-62
school expenditure and student achievement meta-analysis, 59-67
science achievement by gender meta-analysis, 73-77
scientific findings. See research findings
sd (standard deviation). See standard deviation (sd)
sensitivity analysis, 66
Shadish, William, 44-47, 51, 52, 79
Shapiro, Samuel, 162
Shaw, Douglas, 17
Shoham, Varda, 49-51
significance levels, combined, 39
significance testing, 38, 65
Silberman, George, 149-52
Singer, Jerome, 125
Sluzki, Carlos, 49
Smith, Mary Lee, 12, 27-30, 33-34, 91
Sohn, David, 18, 162
standard deviation (sd), 30-31, 60
standardization: of data extraction, 164; in meta-analysis, 30-31, 60
statistical analysis: combining probability values (Tippett), 11; Pearson's averaging, 8; problems for meta-analysts, 170-74; statistical significance, 23-24, 29; vote-counting method, 22-25
Statistical Methods for Meta-Analysis (Hedges and Olkin), 43
Statistical Power Analysis for the Behavioral Sciences (Cohen), 169
Stock, William, 58
Sullivan, Louis W., 157
Summing Up (Light and Pillemer), 7, 102
Sweeney, Rebecca, 51, 52
Swift, Jonathan, 43
Temple, Robert, 159
Terry, Luther L., 154
"thin-slices" of behavior meta-analysis, 109-18
Tippett, Leonard H. C., 10-11, 38
t-test: conversion to r-index, 172; defined, 171; relation to F-value, 172
t-value estimate, 171-72
universe: defined by Hanushek, 58; defining boundaries of, 58, 93; determining representative sample of studies, 107-8; with differentiated data, 61-63; significance of term, 57; of studies, 57-58
validity: scoring, 106, 141; threats to, 125-28
variables: effect of interacting, 4-5; finding relationships among, 177; in medical studies, 84; multiple and interacting, 78; in parametric inference procedures, 176; random, 179; in simple and complex path models, 74-77
variables, mediator, 51-53
variables, moderator: definition and effect of, 47-48, 51; influence on effect size, 102, 175-76
variance: assessing factors creating, 62-63, 65-66; in homogeneity analysis, 177; meta-analysis approaches, 48, 176-77; parametric inference tests, 176; random and fixed effect size, 179
vote-counting: critique of, 22-25; defined, 22; in Hanushek analyses, 56, 61, 64-65
Wachter, Kenneth, 146
Wanner, Eric, 63
Watzlawick, Paul, 49
Waxman, Henry, 158
weighting: of effect size, 38, 85, 175; of odds ratios, 85; for quality, 85-86
White, Howard, 6
WIC program: chart of meta-analytic findings, 143-45; GAO meta-analysis for, 135-36, 139-46
WIC (Special Supplemental Food Program for Women, Infants, and Children). See WIC program
Wilkins, Wallace, 125
Wolf, Frederic, 72
Wood, Wendy, 72
York, Robert, 138, 148
Yusuf, Salim, 97, 119
z scores: conversion of p levels to, 39; conversion of r-index to, 173