A THERAPEUTICALLY EQUIVALENT PRODUCT MAY BE DISPENSED AND ADMINISTERED UNLESS CHECKED IN THE LEFT COLUMN. DATE AND TIME NEONATAL COMFORT CARE ORDERS MUST BE ENTERED – DISCHARGE ALLERGIES: Weight: _____________ kg Check (✓) all that apply and fill in the blank if applicable 1. May discharge to: Other __________________________ Medical Diagnosis: __________________
doi:10.1016/j.jaci.2004.09.023

Current reviews of allergy and clinical immunology
(Supported by a grant from GlaxoSmithKline, Inc, Research Triangle Park, NC)

Statistical errors in immunologic research

This activity is available for CME credit. See page 22A for important information.

Medical research articles have always been subject to errors in reporting statistical results. Although most of these are minor, they raise questions about the integrity of medical research. Most of the errors come from a misunderstanding about the tools used in statistical analysis. This article discusses some of the most frequent errors and provides examples of how to deal with them correctly. (J Allergy Clin Immunol)

Key words: Statistics, P value, power, errors in reporting, variance

From the Division of Biostatistics, National Jewish Medical and Research Center.
Disclosure of potential conflict of interest: J. R. Murphy, none disclosed.
Received for publication September 9, 2004; revised September 17, 2004; accepted for publication September 21, 2004.
Reprint requests: James R. Murphy, PhD, Division of Biostatistics, National Jewish Medical and Research Center, 1400 Jackson St, Denver, CO 80206.
© 2004 American Academy of Allergy, Asthma and Immunology

Recently, Olsen published a commentary indicating the number and type of statistical errors that she found in reviewing articles in Infection and Immunity. She also referenced several articles indicating similar problems in medical journals in other specialty areas. In a related paper in BMC Medical Research Methodology, Garcia-Berthou and Alcaraz noted that almost 12% of the articles reviewed in Nature and the British Medical Journal contained P values that did not match the values of the test statistics given. Given that these same problems have been previously reported at least as far back as [...] and appear to be ubiquitous, this article presents a more in-depth discussion of the problems, paying particular attention to how they might affect research in the areas [...]

All of the problems hinge on the understanding of what a statistical test is doing and what a P value means. Therefore this article starts with a brief description. There are a variety of statistical tests used for comparison, modeling, prediction, and reduction of data. The errors discussed in the article deal with problems when statistics are used to compare groups, and this article will [...]

A statistical test is a procedure to determine whether a defined quantity is larger (smaller) than you should expect by chance. The major effect of the test comes from the way in which the study is designed (ie, set up and executed) to get you to a point where you can do the test. Although readers might focus a great deal on whether a test result is significant, this is really not very important unless the study is designed properly in the first place. A statistical test is therefore somewhat different from a laboratory test in which a change of color or light intensity might be all that is needed to detect a physiologic change, although even here you need to have proper design and procedure to make sure that contaminants have [...]

The indicator that some change has taken place greater than the level of chance is the P value. There is consensus in medical research that a P value of less than .05 means that the statistical test you are conducting is significant (ie, that the result you see is not just the result of chance). It is worth noting again that a significant P value does not indicate a physiologic or chemical change in a process. Instead it indicates that the accumulation of data by consensus has reached a level that supports whatever change you proposed in your design. The P level you achieve in a study is only one value on a continuous scale of P values. In immunologic research its closest counterpart is probably a light intensity reading taken from a microarray in which the .05 level of significance is similar to the 2-fold difference in intensity required to indicate that upregulation or downregulation has occurred.

Focusing on a significant P value provides a certain economy for making decisions when you are examining a large number of situations in which decisions need to be made, such as examining a number of articles for possible [...] However, the difference between a P value of .051 (nonsignificant) and a P value of .049 (significant) might mean nothing at all in terms of the underlying physiology or chemistry. Significant P values need to be supported with additional logic and secondary evidence to be truly meaningful. It might also be useful to think of the P value as an indication of how likely you are to be able to replicate your current result rather than as a signal that change has occurred. The smaller the P value, the less likely the result is due to chance, and the more likely you [...]

The power of a study, which is a property of its sample size, indicates how likely you are to be able to detect the size of difference (change) you have specified in the design. Taken together, the statistical test, P value, and power can also be understood in the context of the false-positive and false-negative rates of a screening or diagnostic test. The statistical test is the test, the .05 level is the false-positive rate you are willing to accept, and the power is the true-positive rate of your test if the change you are specifying is correct. Adopting this view might also help to explain that a nonsignificant result on a statistical test does not mean that you have disproved the null hypothesis. It is more likely to mean that because of inadequate sample size or imprecision in your measurements, you are getting more false-negative results than you expected.

Methods: Children aged 6 to 17 years with mild to moderate persistent asthma, taking only albuterol as needed, were characterized during 2 visits 1 week apart before being randomly assigned into a clinical trial. At the screening visit, online measurements of eNO, spirometry before and after bronchodilator, and biomarkers of peripheral blood eosinophils, serum eosinophil cationic protein, total serum IgE, and urinary leukotriene E4 were obtained. During a week characterization period before randomization, symptoms were recorded on a diary and peak expiratory flows were measured twice daily using an electronic device. At the randomization visit, eNO was repeated followed by a methacholine challenge and aeroallergen skin testing. Correlations and rank regression analyses between eNO and clinical characteristics, pulmonary function and biomarkers were evaluated.
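The link the article draws between sample size, power, and false-negative results can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions, not the article's method: it uses a normal approximation for a two-sided comparison of two group means with a known, common SD, and the function name `approx_power` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def approx_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test.

    Assumes (for illustration only) normally distributed data with a
    known common SD and equal group sizes.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    se_diff = sd * (2 / n_per_group) ** 0.5      # SE of the difference in means
    shift = diff / se_diff                       # true difference in SE units
    # Probability that the test statistic falls beyond either critical value
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


if __name__ == "__main__":
    # Hypothetical design: a true difference of half an SD between groups
    for n in (5, 20, 80, 320):
        power = approx_power(diff=0.5, sd=1.0, n_per_group=n)
        print(f"n per group = {n:4d}  power = {power:.3f}  false-negative rate = {1 - power:.3f}")
```

With no true difference the function returns the false-positive rate (alpha), and for a fixed true difference the power rises toward 1 as the sample size grows, which is the behavior the surrounding text describes.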
It is important to note that if I would like to show that the difference between 2 groups is not significant, all I have to do is to design a study with too small a sample size and that result is guaranteed. Conversely, if I want to show that something is significant, all I have to do is get a large enough sample size and I can prove that. The logic and science of the process is in selecting a value for change that is meaningful in a scientific or clinical sense and then specifying a statistical design and testing structure that allows you to demonstrate this meaningful difference with the smallest number of subjects. The design and testing structure of the study, including the major statistical results, should be found in the abstract at the beginning of the article.

Below I list Olsen's main problem areas with brief discussions. After each problem, I give examples of acceptable ways to deal with the problem in a succinct manner. These examples are taken from articles by Strunk et al, Asero et al, Adams et al, Sorensen et al, and Eggesbø et al in recent issues of the JACI. The quotes used in the current article are usually taken from a section of the quoted articles labeled "Statistical methods" or "Statistical analysis," although Sorensen et al use the term "Data analysis." All of these articles avoid the errors listed below and could be used as good examples of the correct way to report statistics in research.

I have used the abstract of Strunk et al as a good example of what you should find at the beginning of an article. Their abstract gives a good, concise description of the objectives and design of their study. All of the articles cited have similar descriptions that immediately allow the reader to determine what they should be looking for in the article itself. In the case of Strunk et al, the reader knows that in the article they should be seeing a discussion of correlation, rank regression, randomization, and a discussion of definitions for mild and moderate asthma.

Objective: The aim of this study was to find correlations between eNO and other characteristics of children with mild to moderate asthma currently not taking medications.

It is quite true that some of the errors Olsen found, items 1 to 6 below, could be the result of editorial policies that restricted some of the detail given in the original articles. Because removing these errors is critical to the correct interpretation of statistical results, I hope that editorial problems are becoming less of an issue. Recent guidelines for publications in biomedical journals would suggest that the type of detail suggested in Olsen's article [...]

1. Failure to document the statistical method used or using an incorrect method. Like any test, statistical tests work best in the specific situations for which they are constructed. Because most statistical tests use a P value of .05 or less as their significance indicator, it might seem like they are interchangeable, but differences in the type of data used, the variability of the data, and the study design dictate that there is an appropriate test that should be used for each situation. Statistical software packages have made it easy to try a variety of tests in situations in which they might or might not apply. Using the incorrect test is similar to performing a laboratory test without using proper controls. You can get a significant result, but you will not know what it means. If you are uncertain of what type of test to use, it is best to consult a statistician.

In their "Statistical methods" section, Strunk et al list 2 tests that were used for the analyses in the article. These are the Spearman correlation coefficient as a measurement of the main objective and stepwise rank regression for further modeling. The results, tables, and figures are all clearly related to one of the procedures mentioned.

Asero et al note the test used, a correction for it, and the probability level considered significant. "Proportions were compared by using the chi-square test with the Yates correction. Probability level of less than 5% were considered statistically significant."

Adams et al note that a P value of .05 or less is not the only criterion that they use for significance. "Multiple logistic regression was used to assess independent effects in models for anti-inflammatory therapy use. Variables
significant at the 0.10 level in bivariate analysis were entered simultaneously into a multiple logistic regression [...]"

Eggesbø et al make a comment about the power of their study for certain outcomes. As mentioned above, making a note of the power or lack thereof is as important as determining the significance of the test results.

"The small groups make the study prone to type II errors (ie, not detecting factors that might be of importance). This uncertainty affects, among other results, the results concerning antibiotics. Furthermore, it might be the reason why the association between egg allergy and cesarean delivery failed to reach significance in the strata of allergic mothers in spite of the strength of the observed association."

To determine the correctness of these tests, you need to determine that they are appropriate for the objective of the study, the design used, and the type of data being used. Part of the job of the statistical reviewer of a journal is to make sure that the statistical procedures published in that [...]

2. Failure to adjust for multiple comparisons (multiple tests of the same data). In the same way that clinical tests can produce false-positive and false-negative results, statistical tests can do the same. As in clinical testing, the more statistical tests you do, the more likely you are to get a positive (true or false) result at some point. In the clinical setting the potentially misleading results can be discounted if they do not fit the developing pattern of the differential diagnoses. In statistical testing an adjustment is made to the P value's level of significance so that the more tests you do, the more difficult it is to get a significant result. Even among statisticians, there is disagreement about when this adjustment needs to be made and exactly how much of an adjustment needs to be made. There is agreement, however, that results that are obtained as a result of frequent examination of the data are qualitatively different from results that come about because of a prestated hypothesis. At a minimum, an author should inform his audience when this additional data examination has been necessary to provide a significant result. The audience is then free to question whether it is likely that this result could be replicated in a similar study.

Sorensen et al report how they handled multiple comparisons. "Post hoc multiple comparisons were only performed on variables and challenges for which significant differences between the CFS and control groups were found, including group-by-time interaction. The Tukey-Kramer multiple com[...]"

There are several types of adjustments in addition to the Tukey-Kramer that can be used for multiple comparisons. For a very practical discussion of this issue, see Curran-Everett.

3. Reporting observations without a statistical test. Significance, like beauty, might be in the eye of the beholder. Although it is true that there are situations in which a unique observation is all that is needed to make your case, these are becoming increasingly rare. Graphic representations can be misleading, and large differences between groups that come with large variability might not be significant, no matter how they look. To really make the case that your result is larger than chance requires a statistical test.

Strunk et al present graphic scatter plots that suggest a relationship between exhaled nitric oxide and certain other variables. However, they make the case for these relationships not on the basis of the graphic but on the basis of the statistical test for correlation, which they present in the "Biomarkers results" section.

Sorensen et al present plots with standard error bars, which do seem to indicate differences. They make their point, however, by referring to the significance of the test results in the "Study data" section.

4. Using a statistical test that requires an underlying normal distribution on data that are not normally distributed. The normal (also known as the Gaussian or bell curve) distribution has some very nice properties that make it easy to combine measurements and calculate test results and P values. Before the advent of computers, these properties made it worthwhile to try to transform data so that it was normally distributed and to concentrate on tests that were created for a normal distribution. It is also the case that a normal distribution might be a good approximation to how a large group of observations are distributed. Knowing the distribution (particularly knowing the mean and variance and their relation to one another) of your measurements puts you into a category of tests called parametric. There are other distributions in this category in addition to the normal, but the normal is the one for which most statistical tests were originally [...]

Unfortunately, much biomedical research data are not normally distributed, and using tests designed for normally distributed data on this type of data can cause misleading results. Fortunately, most tests requiring normality have nonparametric counterparts that make very minimal assumptions about how the data are distributed. Most statistical packages contain a series of nonparametric tests that can be used in place of or along with parametric tests. These tests usually have slightly lower powers than their normal counterparts, but they give [...]

An alternative to using a nonparametric test is to transform the data so that it becomes more normal. A popular transformation is to take the natural logarithm of your measurements and create what is called a log normal distribution. Here again there are a number of transformations that can be used, and it is important to test the results to make sure that the resultant observations are normal. You cannot simply assume that the transformation [...]

Strunk et al identify nonnormal distributions and use both nonparametric tests and transformations where they are appropriate. The following is from their "Statistical methods" section. Note that the Spearman test is a nonparametric test.

"Spearman correlation coefficients were used to assess the relationships among the biomarkers, clinical and pulmonary function measures, and allergy skin test reactivity. In this manner, potential predictors of eNO were identified. PC20 was analyzed on the log base 2 scale because of doubling doses in the methacholine challenge procedure. The eNO measurements and many of the biomarkers displayed a skewed distribution, and these were analyzed on the [...]"

Sorensen et al note the nonnormality of their data and the need for transformation in the "Data analysis" section. "The data were generally highly right-skewed, thus natural log-transformed variables were analyzed."

5. Not identifying or properly labeling the type of variance estimate being used. There are a variety of variance structures that might apply, but the 2 that are most commonly not labeled or mislabeled are the SD and the SE. The SD measures the variability in your original data. The SE measures the variability in the mean or average value of your data. When you want to demonstrate how variable your mean is (this is important for reproducibility and statistical testing), you want to use the SE and corresponding SE bars around the mean. When you want to show how variable a single observation could be expected to be, you want to use the SD and error bars based on the SD around a single measurement. You also need to state whether you are using 1, 2, or sometimes 3 times the SE or SD. Each one of these limits encompasses a different proportion of the variables you are examining. For example, a 95% CI around a mean taken from normally distributed data is obtained by taking ±2 × SE. Using 1 instead of 2 gives you a 68% CI.

In Table II, Strunk et al clearly indicate that they are providing the mean ± SD. What they want to demonstrate is the variability of their single observations. In this presentation they used the mean as a representation of a single observation and gave the SD as a measure of the variability in single observations. They also provide the median, a nonparametric measure of the central value, and the lower and upper quartiles of the variable's distribution, a nonparametric indication of variability. By providing all of this information, they allow the reader to make his or her own assessment of the variability of the data, which does not have to depend on an assumption that the data have a Gaussian (normal) distribution.

TABLE II. Characteristics of the 144 CLIC participants: Demographics, pulmonary function test and clinical characteristics and skin test reactivity
Row labels: Symptoms during characterization, days per wk; Spirometry: forced vital capacity, % predicted; Bronchodilator reversibility, % change in FEV1 from [...]; Skin test reactivity, No. positive tests of 8 tested.
*Geometric mean, coefficient of variation. †Median and 1st and 3rd quartiles are reported on the log2 scale. ‡AM PEF − PM PEF/([AM PEF + PM PEF]/2) × 100.
(Reproduced with permission from: Strunk RC, Szefler SJ, Phillips BR, Zeiger RS, Chinchilli VM, Larsen G, et al. Relationship of exhaled nitric oxide to clinical and inflammatory markers of persistent asthma in children. J Allergy Clin Immunol 2003;112(5):883-892.)

Adams et al and Eggesbø et al both provide CIs for their measures of variability. The presentation of Adams et al is preferable because they specify that they are using 95% CIs, whereas Eggesbø et al only use the term "confidence interval." It is a good assumption that what is intended is a 95% interval, but it is better to state that.

6. Incomplete description of the statistical test used. When describing the statistical tests you have used, think in terms of providing the reader with enough information so that if they had the data, they would be able to do exactly the same test that you did and hopefully get the same result. It is not enough to say you have done a t test when in fact you have done a 2-sample t test with unequal variances.

In the final paragraph of their "Statistical methods" section, Strunk et al provide additional information that would allow the reader to duplicate their results. They have also provided information on the statistical program (software) they used for their analysis. This might also be needed for a reader to reproduce results because software programs might differ in how they do calculations, and different programs might give you slightly different results.

"A complete cohort of randomly assigned CLIC subjects was used for the model building. The cohort was revised according to variables in the final model to include all possible patients. All summary statistics and analyses were performed by means of SAS Version 8 statistical software (SAS Institute, Cary, NC). Significance was established at [...]"

Sorensen et al also provide information on the analysis package used and, in the section below, additional information that a reader might need to reproduce their results. "All reported P values in this article are based on 2-tailed tests, unless otherwise mentioned. It should be noted that the use of a 2-tailed test when the hypothesis involves an expected increase (or decrease) yields a con[...]"

It is important to have the right perspective about correctly reporting statistical tests in the medical literature. It is very easy to think that because misreporting has been going on for so long without seeming to harm medical research, this is not a big issue. In the words of John Bailar, "... there may be greater danger to the public welfare from statistical dishonesty than from almost any other form of dishonesty." This article is not about dishonesty but about carelessness in the reporting of statistical methods. The harmful results, however, could be the same, and the first step in separating the one from the other is to agree on rigorous reporting in medical journals of statistical methods. This will make it easier for researchers to understand what constitutes the standard of excellence to which all of us should aspire.

Dr Bailar's quote is particularly appropriate because the recommendations for statistical reporting that he and Dr Mosteller laid out in 1988 are still the basis for reporting statistical methods that are recommended by the International Committee of Medical Journal Editors. Their 15 points are listed below. More detail can be [...]

1. Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results.
2. When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (eg, confidence intervals).
3. Avoid sole reliance on statistical hypothesis testing, such as the use of P values, which fails to convey [...]
4. Discuss eligibility of experimental subjects.
5. Give details about randomization.
6. Describe the methods for, and success of, any [...]
9. Report losses to observation (eg, dropouts from [...])
10. References for study design and statistical methods should be to standard works (with pages stated) when possible rather than to papers where designs or [...]
11. Specify any general use computer programs used.
12. Put general descriptions of statistical methods in the Methods section. When data are summarized in the Results section, specify the statistical methods used [...]
13. Restrict tables and figures to those needed to explain the argument of the paper and to assess its support. Use graphs as an alternative to tables with many entries; do not duplicate data in graphs and tables.
14. Avoid nontechnical uses of technical terms in statistics, such as "random" (which implies a randomizing device), "normal," "significant," "corre[...]
15. Define statistical terms, abbreviations, and most [...]

Recently, several major physiology journals have adopted guidelines for reporting statistical information that adopt many of the above recommendations. These guidelines contain practical and straightforward suggestions for reporting statistical information that should help you to avoid the problems discussed above. The single most important thing to do is to consult a statistician well before you undertake the analysis of your data. If the early steps of design and data collection are done correctly, the actual analysis of the data is usually a relatively minor issue.

References

1. Olsen CH. Review of the use of statistics in infection and immunity.
2. Garcia-Berthou E, Alcaraz C. Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology 2004.
3. White S. Statistical errors in papers in the British Journal of Psychiatry.
4. Strunk RC, Szefler SJ, Phillips BR, Zeiger RS, Chinchilli VM, Larsen G, et al. Relationship of exhaled nitric oxide to clinical and inflammatory markers of persistent asthma in children. J Allergy Clin Immunol 2003;112:883-92.
5. Asero R, Mistrello G, Roncarlo D, Amato S, Zanoni D, Barocci F, et al. Detection of clinical markers of sensitization to profilin in patients allergic to plant-derived foods. J Allergy Clin Immunol 2003;112(2):
6. Adams RJ, Weiss ST, Fuhlbrigge A. How and by whom care is delivered influences anti-inflammatory use in asthma: results of a national population survey. J Allergy Clin Immunol 2003;112:445-50.
7. Sorensen B, Streib JE, Strand M, Make B, Giclas PC, Fleshner M, et al. Complement activation in a model of chronic fatigue syndrome. J Allergy Clin Immunol
8. Eggesbø M, Botten G, Stigum H, Nafstad P, Magnus P. Is delivery by cesarean section a risk factor for food allergy? J Allergy Clin Immunol
9. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. Ann Intern Med
10. Curran-Everett D. Multiple comparisons: philosophies and illustrations. Am J Physiol Regul Integr Comp Physiol 2000;279:R1-8.
11. Bailar JC. Bailar's laws of data analysis. Clin Pharmacol Ther 1976;20:
12. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med 1988;108:266-73.
13. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol
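The distinction the article draws under item 5 between the SD (variability of single observations), the SE (variability of the mean), and a multiple of the SE used as an interval can be sketched in a few lines. This is a minimal illustration, not taken from any of the cited articles: the function name `summarize`, the `multiplier` parameter, and the example readings are all hypothetical, and the interval is the approximate "mean ± 2 × SE" rule quoted in the text rather than an exact t-based interval.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(values, multiplier=2):
    """Return mean, SD, SE, and a mean +/- multiplier * SE interval.

    multiplier=2 corresponds to the approximate 95% interval described
    in the article; multiplier=1 gives the approximate 68% interval.
    """
    n = len(values)
    m = mean(values)
    sd = stdev(values)      # sample SD: spread of the individual observations
    se = sd / sqrt(n)       # SE: spread expected in the mean itself
    return {
        "n": n,
        "mean": m,
        "sd": sd,
        "se": se,
        "multiplier": multiplier,
        "interval": (m - multiplier * se, m + multiplier * se),
    }


if __name__ == "__main__":
    # Made-up measurements, for illustration only
    readings = [12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5]
    s = summarize(readings)
    low, high = s["interval"]
    # Label what is reported, as the article recommends
    print(f"mean = {s['mean']:.2f}, SD = {s['sd']:.2f}, SE = {s['se']:.2f} (n = {s['n']})")
    print(f"mean ± {s['multiplier']} × SE: {low:.2f} to {high:.2f}")
```

Printing the multiplier and naming SD and SE explicitly mirrors the reporting advice: a reader should never have to guess which variance estimate or which interval width an error bar represents.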
Instructions for a Renal Captopril Study Department of Nuclear Medicine & Centre for PET Name:________________________________ Appointment: Date:______________ Time: ______________ Your doctor would like you to have a Renal Captopril Scan . This is a test to look at the blood flow to kidneys as a possible cause of high blood pressure. Please report to the Depa