Current reviews of allergy and clinical immunology
(Supported by a grant from GlaxoSmithKline, Inc, Research Triangle Park, NC)
Statistical errors in immunologic research
Medical research articles have always been subject to errors
in reporting statistical results. Although most of these are minor, they raise questions about the integrity of medical
A statistical test is a procedure to determine whether
research. Most of the errors come from a misunderstanding
a defined quantity is larger (smaller) than you should
about the tools used in statistical analysis. This article discusses some of the most frequent errors and provides examples of how
expect by chance. The major effect of the test comes from
to deal with them correctly. (J Allergy Clin Immunol
the way in which the study is designed (ie, set up and
executed) to get you to a point where you can do the test. Although readers might focus a great deal on whether
Key words: Statistics, P value, power, errors in reporting, variance
a test result is significant, this is really not very important unless the study is designed properly in the first place. A statistical test is therefore somewhat different from
Recently, Olsen published a commentary indicating
a laboratory test in which a change of color or light
the number and type of statistical errors that she found in
intensity might be all that is needed to detect a physiologic
reviewing articles in Infection and Immunity. She also
change, although even here you need to have proper
referenced several articles indicating similar problems in
design and procedure to make sure that contaminants have
medical journals in other specialty areas. In a related paper
in BMC Medical Research Methodology, Garcia-Berthou
The indicator that some change has taken place greater
and Alcaraz noted that almost 12% of the articles
than the level of chance is the P value. There is consensus
reviewed in Nature and the British Medical Journal
in medical research that a P value of less than .05 means
contained P values that did not match the values of the
that the statistical test you are conducting is significant (ie,
test statistics given. Given that these same problems have
that the result you see is not just the result of chance). It is
been previously reported at least as far back as and
worth noting again that a significant P value does not
appear to be ubiquitous, this article presents a more in-
indicate a physiologic or chemical change in a process.
depth discussion of the problems, paying particular
Instead it indicates that the accumulation of data by
attention to how they might affect research in the areas
consensus has reached a level that supports whatever
change you proposed in your design. The P level you
All of the problems hinge on the understanding of what
achieve in a study is only one value on a continuous scale
a statistical test is doing and what a P value means.
of P values. In immunologic research its closest counter-
Therefore this article starts with a brief description. There
part is probably a light intensity reading taken from
are a variety of statistical tests used for comparison,
a microarray in which the .05 level of significance is
modeling, prediction, and reduction of data. The errors
similar to the 2-fold difference in intensity required to
discussed in the article deal with problems when
indicate that upregulation or downregulation has occurred.
statistics are used to compare groups, and this article will
Focusing on a significant P value provides a certain
economy for making decisions when you are examining a large number of situations in which decisions need to be made, such as examining a number of articles for possible
From the Division of Biostatistics, National Jewish Medical and Research
However, the difference between a P value of .051
Disclosure of potential conflict of interest: J. R. Murphy—none disclosed.
(nonsignificant) and a P value of .049 (significant) might
Received for publication September 9, 2004; revised September 17, 2004;
accepted for publication September 21, 2004.
mean nothing at all in terms of the underlying physiology
Reprint requests: James R. Murphy, PhD, Division of Biostatistics, National
or chemistry. Significant P values need to be supported
Jewish Medical and Research Center, 1400 Jackson St, Denver, CO 80206.
with additional logic and secondary evidence to be truly
meaningful. It might also be useful to think of the P value
as an indication of how likely you are to be able to
© 2004 American Academy of Allergy, Asthma and Immunology. doi:10.1016/j.jaci.2004.09.023
replicate your current result rather than as a signal that
change has occurred. The smaller the P value, the less
Methods: Children aged 6 to 17 years with mild to moderate
likely the result is due to chance, and the more likely you
persistent asthma, taking only albuterol as needed, were
characterized during 2 visits 1 week apart before being
The power of a study, which is a property of its sample
randomly assigned into a clinical trial. At the screening visit,
online measurements of eNO, spirometry before and after
size, indicates how likely you are to be able to detect the
bronchodilator, and biomarkers of peripheral blood eosino-
size of difference (change) you have specified in the design.
phils, serum eosinophil cationic protein, total serum IgE, and
Taken together, the statistical test, P value, and power can
urinary leukotriene E4 were obtained. During a week
also be understood in the context of the false-positive and
characterization period before randomization, symptoms
false-negative rates of a screening or diagnostic test. The
were recorded on a diary and peak expiratory flows were
statistical test is the test, the .05 level is the false-positive
measured twice daily using an electronic device. At the
rate you are willing to accept, and the power is the true-
randomization visit, eNO was repeated followed by a methacholine
challenge and aeroallergen skin testing. Correlations
positive rate of your test if the change you are specifying is
correct. Adopting this view might also help to explain that
and rank regression analyses between eNO and clinical
a nonsignificant result on a statistical test does not mean
characteristics, pulmonary function and biomarkers were evaluated.
that you have disproved the null hypothesis. It is more likely to mean that because of inadequate sample size or imprecision in your measurements, you are getting more false-negative results than you expected.
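The link between sample size, power, and false-negative results can be sketched numerically. The function below is a minimal illustration, not a substitute for a proper design calculation: it assumes a 2-sided, 2-sample comparison of means with a known common SD and uses the normal approximation. The function name, the effect size of 0.5, and the SD of 1.0 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist


def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a 2-sided, 2-sample z test.

    delta: true difference between group means (hypothetical).
    sd: common standard deviation, assumed known.
    n_per_group: number of subjects in each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # critical value for the chosen alpha
    se = sd * (2 / n_per_group) ** 0.5       # SE of the difference in means
    shift = delta / se                       # true difference in SE units
    # Probability that the test statistic lands in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Power (true-positive rate) rises with sample size for the same true difference
for n in (5, 10, 20, 50, 100):
    print(n, round(power_two_sample(delta=0.5, sd=1.0, n_per_group=n), 2))
```

With a true difference of zero, the function returns the false-positive rate (alpha). With a real difference and few subjects per group, the power is low, so a nonsignificant result is the expected outcome and says little about the null hypothesis.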
It is important to note that if I would like to show that
the difference between 2 groups is not significant, all I
It is quite true that some of the errors Olsen found,
have to do is to design a study with too small a sample size
items 1 to 6 below, could be the result of editorial policies
and that result is guaranteed. Conversely, if I want to show
that restricted some of the detail given in the original
that something is significant, all I have to do is get a large
articles. Because removing these errors is critical to the
enough sample size and I can prove that. The logic and
correct interpretation of statistical results, I hope that
science of the process is in selecting a value for change that
editorial problems are becoming less of an issue. Recent
is meaningful in a scientific or clinical sense and then
guidelines for publications in biomedical journals would
specifying a statistical design and testing structure that
suggest that the type of detail suggested in Olsen’s article
allows you to demonstrate this meaningful difference with
the smallest number of subjects. The design and testing
1. Failure to document the statistical method used or
structure of the study, including the major statistical
using an incorrect method. Like any test, statistical tests
results, should be found in the abstract at the beginning
work best in the specific situations for which they are
constructed. Because most statistical tests use a P value of
Below I list Olsen’s main problem areas with brief
.05 or less as their significance indicator, it might seem like
discussions. After each problem, I give examples of
they are interchangeable, but differences in the type of
acceptable ways to deal with the problem in a succinct
data used, the variability of the data, and the study design
manner. These examples are taken from articles by Strunk
dictate that there is an appropriate test that should be used
et al, Asero et al, Adams et al, Sorensen et al, and
for each situation. Statistical software packages have made
Eggesbe et al in recent issues of the JACI. The quotes
it easy to try a variety of tests in situations in which they
used in the current article are usually taken from a section
might or might not apply. Using the incorrect test is similar
of the quoted articles labeled ‘‘Statistical methods’’ or
to performing a laboratory test without using proper
‘‘Statistical analysis,’’ although Sorensen et al use the
controls. You can get a significant result, but you will
term ‘‘Data analysis.’’ All of these articles avoid the
not know what it means. If you are uncertain of what type
errors listed below and could be used as good examples
of test to use, it is best to consult a statistician.
of the correct way to report statistics in the research
In their ‘‘Statistical methods’’ section, Strunk et al list 2
tests that were used for the analyses in the article. These
I have used the abstract of Strunk et al as a good
are the Spearman correlation coefficient as a measurement
example of what you should find at the beginning of an
of the main objective and stepwise rank regression for
article. Their abstract gives a good, concise description of
further modeling. The results, tables, and figures are all
the objectives and design of their study. All of the articles
clearly related to one of the procedures mentioned.
cited have similar descriptions that immediately allow the
Asero et al note the test used, a correction for it, and the
reader to determine what they should be looking for in the
probability level considered significant. ‘‘Proportions
article itself. In the case of Strunk et al, the reader knows
were compared by using the chi-square test with the
that in the article they should be seeing a discussion of
Yates correction. Probability level of less than 5% were
correlation, rank regression, randomization, and a discus-
considered statistically significant.’’
sion of definitions for mild and moderate asthma.
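The chi-square test with the Yates correction quoted from Asero et al can be sketched for a single 2 × 2 table. This is an illustrative sketch only: the function name and the cell counts in the usage line are hypothetical and are not data from any of the cited articles. The P value uses the fact that, for 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)).

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Chi-square test with Yates continuity correction for the table
    [[a, b], [c, d]]. Returns (statistic, 2-sided P value), 1 df."""
    n = a + b + c + d
    r1, r2 = a + b, c + d            # row totals
    c1, c2 = a + c, b + d            # column totals
    if 0 in (r1, r2, c1, c2):
        raise ValueError("every row and column total must be positive")
    # Yates: shrink |ad - bc| by n/2, never letting it fall below zero
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / (r1 * r2 * c1 * c2)
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 10/30 positive in one group, 20/30 in the other
stat, p = chi2_yates_2x2(10, 20, 20, 10)
print(round(stat, 2), round(p, 3))
```

Reporting the test, the correction, and the significance level together, as Asero et al do, is what allows a reader to repeat a calculation of this kind.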
Adams et al note that a P value of .05 or less is not the
only criterion that they use for significance. ‘‘Multiple
Objective: The aim of this study was to find correlations
logistic regression was used to assess independent effects
between eNO and other characteristics of children with mild
in models for anti-inflammatory therapy use. Variables
to moderate asthma currently not taking medications.
significant at the 0.10 level in bivariate analysis were
entered simultaneously into a multiple logistic regression
be significant, no matter how they look. To really make the
case that your result is larger than chance requires the
Eggesbe et al make a comment about the power of their
study for certain outcomes. As mentioned above, making
Strunk et al present graphic scatter plots in their
a note of the power or lack thereof is as important as
that suggest a relationship between exhaled nitric oxide
determining the significance of the test results.
and certain other variables. However, they make the case
The small groups make the study prone to type II errors (ie,
for these relationships not on the basis of the graphic but
not detecting factors that might be of importance). This
on the basis of the statistical test for correlation, which
uncertainty affects, among other results, the results concern-
they present in the ‘‘Biomarkers results’’ section.
ing antibiotics. Furthermore, it might be the reason why the
Sorensen et al present plots with standard error bars in
association between egg allergy and cesarean delivery failed
which do seem to indicate differences.
to reach significance in the strata of allergic mothers in spite of
They make their point, however, by referring to the
the strength of the observed association.
significance of the test results in the ‘‘Study data’’ section.
To determine the correctness of these tests, you need to
4. Using a statistical test that requires an underlying
determine that they are appropriate for the objective of the
normal distribution on data that are not normally
study, the design used, and the type of data being used.
distributed. The normal (also known as the Gaussian or
Part of the job of the statistical reviewer of a journal is to
bell curve) distribution has some very nice properties that
make sure that the statistical procedures published in that
make it easy to combine measurements and calculate test
results and P values. Before the advent of computers, these
2. Failure to adjust for multiple comparisons (multiple
properties made it worthwhile to try to transform data so
tests of the same data). In the same way that clinical tests
that it was normally distributed and to concentrate on tests
can produce false-positive and false-negative results,
that were created for a normal distribution. It is also the
statistical tests can do the same. As in clinical testing,
case that a normal distribution might be a good approx-
the more statistical tests you do, the more likely you are to
imation to how a large group of observations are
get a positive (true or false) result at some point. In the
distributed. Knowing the distribution (particularly know-
clinical setting the potentially misleading results can be
ing the mean and variance and their relation to one
discounted if they do not fit the developing pattern of the
another) of your measurements puts you into a category
differential diagnoses. In statistical testing an adjustment
of tests called parametric. There are other distributions in
is made to the P value’s level of significance so that the
this category in addition to the normal, but the normal is
more tests you do, the more difficult it is to get a significant
the one for which most statistical tests were originally
result. Even among statisticians, there is disagreement
about when this adjustment needs to be made and exactly
Unfortunately, much biomedical research data are not
how much of an adjustment needs to be made. There is
normally distributed, and using tests designed for nor-
agreement, however, that results that are obtained as
mally distributed data on this type of data can cause
a result of frequent examination of the data are qualita-
misleading results. Fortunately, most tests requiring
tively different from results that come about because of
normality have nonparametric counterparts that make
a prestated hypothesis. At a minimum, an author should
very minimal assumptions about how the data are
inform his audience when this additional data examination
distributed. Most statistical packages contain a series of
has been necessary to provide a significant result. The
nonparametric tests that can be used in place of or along
audience is then free to question whether it is likely that
with parametric tests. These tests usually have slightly
this result could be replicated in a similar study.
lower powers than their normal counterparts, but they give
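The effect of repeated testing on the false-positive rate, and one way of adjusting for it, can be sketched as follows. The sketch uses the Bonferroni and Holm adjustments because they are simple to show; these are not the Tukey-Kramer procedure used by Sorensen et al, and the family-wise error formula assumes the tests are independent. The function names and the P values in the usage lines are hypothetical.

```python
def familywise_error(alpha, m):
    """Chance of at least one false-positive result among m independent
    tests, each run at significance level alpha, when no real effect exists."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Declare a test significant only if its P value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down adjustment: compare the smallest P value with alpha/m,
    the next with alpha/(m-1), and so on, stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break                      # all larger P values also fail
    return significant


print(round(familywise_error(0.05, 20), 2))   # 20 unadjusted tests
print(bonferroni([0.01, 0.04, 0.03, 0.005]))
print(holm([0.01, 0.04, 0.03, 0.005]))
```

Whichever adjustment is chosen, naming it in the methods section lets the reader judge how the reported P values were obtained.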
Sorensen et al report their multiple comparison
An alternative to using a nonparametric test is to
Post hoc multiple comparisons were only performed on
transform the data so that it becomes more normal. A
variables and challenges for which significant differences
popular transformation is to take the natural logarithm of
between the CFS and control groups were found, including
your measurements and create what is called a log normal
group-by-time interaction. The Tukey-Kramer multiple com-
distribution. Here again there are a number of trans-
formations that can be used, and it is important to test the
There are several types of adjustments in addition to the
results to make sure that the resultant observations are
Tukey-Kramer that can be used for multiple comparisons.
normal. You cannot simply assume that the transformation
For a very practical discussion of this issue, see Curran-
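The natural-log transformation described above can be illustrated on a small right-skewed sample. The measurements below are hypothetical, and the moment-based skewness coefficient is used only as a crude check of symmetry; it does not replace a formal assessment of normality on the transformed data.

```python
import math
from statistics import mean, pstdev


def skewness(xs):
    """Moment-based skewness: near 0 for symmetric data,
    positive for data with a long right tail."""
    m = mean(xs)
    s = pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


# Hypothetical right-skewed measurements (for example, a serum biomarker)
values = [1.2, 1.5, 1.9, 2.4, 3.1, 4.0, 5.6, 8.3, 14.7, 41.0]
logged = [math.log(v) for v in values]

print("skewness, raw scale:", round(skewness(values), 2))
print("skewness, log scale:", round(skewness(logged), 2))

# The mean on the log scale back-transforms to the geometric mean
geometric_mean = math.exp(mean(logged))
print("arithmetic mean:", round(mean(values), 2))
print("geometric mean:", round(geometric_mean, 2))
```

If the transformed data are still clearly asymmetric, a different transformation or a nonparametric test is the safer choice.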
Strunk et al identify nonnormal distributions and use
3. Reporting observations without a statistical test.
both nonparametric tests and transformations where they
Significance, like beauty, might be in the eye of the
are appropriate. The following is from their ‘‘Statistical
beholder. Although it is true that there are situations in
methods’’ section. Note that the Spearman test is a non-
which a unique observation is all that is needed to make
your case, these are becoming increasingly rare. Graphic
Spearman correlation coefficients were used to assess the
representations can be misleading, and large differences
relationships among the biomarkers, clinical and pulmonary
between groups that come with large variability might not
function measures, and allergy skin test reactivity. In this
TABLE II. Characteristics of the 144 CLIC participants: Demographics, pulmonary function test and clinical characteristics and skin test reactivity
Symptoms during characterization, days per wk
Spirometry: forced vital capacity, % predicted
Bronchodilator reversibility, % change in FEV1 from
Skin test reactivity, No. positive tests of 8 tested
(Reproduced with permission from: Strunk RC, Szefler SJ, Phillips BR, Zeiger RS, Chinchilli VM, Larsen G, et al. Relationship of exhaled nitric oxide to clinical and inflammatory markers of persistent asthma in children. J Allergy Clin Immunol 2003;112(5):883-892.)
*Geometric mean, coefficient of variation.
†Median and 1st and 3rd quartiles are reported on the log2 scale.
‡(AM PEF − PM PEF)/([AM PEF + PM PEF]/2) × 100.
manner, potential predictors of eNO were identified. PC20
median, a nonparametric measure of the central value, and
was analyzed on the log base 2 scale because of doubling
the lower and upper quartiles of the variable’s distribution,
doses in the methacholine challenge procedure. The eNO
a nonparametric indication of variability. By providing all
measurements and many of the biomarkers displayed
of this information, they allow the reader to make his or
a skewed distribution, and these were analyzed on the
her own assessment of the variability of the data, which
does not have to depend on an assumption that the data have a Gaussian (normal) distribution.
Sorensen et al note the nonnormality of their data and
Adams et al and Eggesbe et al both provide CIs for
the need for transformation in the ‘‘Data analysis’’ section.
their measures of variability. The presentation of Adams et
‘‘The data were generally highly right-skewed, thus
al is preferable because they specify that they are using
natural log-transformed variables were analyzed.’’
95% CIs, whereas Eggesbe et al only use the term
5. Not identifying or properly labeling the type of
‘‘confidence interval.’’ It is a good assumption that what is
variance estimate being used. There are a variety of
intended is a 95% interval, but it is better to state that
variance structures that might apply, but the 2 that are most
commonly not labeled or mislabeled are the SD and the
6. Incomplete description of the statistical test used.
SE. The SD measures the variability in your original data.
When describing the statistical tests you have used, think
The SE measures the variability in the mean or average
in terms of providing the reader with enough information
value of your data. When you want to demonstrate how
so that if they had the data, they would be able to do
variable your mean is (this is important for reproducibility
exactly the same test that you did and hopefully get the
and statistical testing), you want to use the SE and
same result. It is not enough to say you have done a t test
corresponding SE bars around the mean. When you
when in fact you have done a 2-sample t test with unequal
want to show how variable a single observation could be
expected to be, you want to use the SD and error bars
In the final paragraph of their ‘‘Statistical methods’’
based on the SD around a single measurement. You also
section, Strunk et al provide additional information that
need to state whether you are using 1, 2, or sometimes 3
would allow the reader to duplicate their results. They
times the SE or SD. Each one of these limits encompasses
have also provided information on the statistical program
a different proportion of the variables you are examining.
(software) they used for their analysis. This might also be
For example, a 95% CI around a mean taken from
needed for a reader to reproduce results because software
normally distributed data is obtained by taking ±2*SE.
programs might differ in how they do calculations, and
Using 1 instead of 2 gives you a 68% CI.
different programs might give you slightly different
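The distinction between the SD and the SE, and the effect of the multiplier on the width of the interval, can be sketched as follows. The sample values and function names are hypothetical, and the interval assumes normally distributed data; the multiplier of 2 is the usual rounding of 1.96.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize(xs, multiplier=2):
    """Return the mean, SD (spread of single observations), SE (spread of
    the mean), and the interval mean +/- multiplier * SE."""
    n = len(xs)
    m = mean(xs)
    sd = stdev(xs)                  # sample SD, n - 1 in the denominator
    se = sd / math.sqrt(n)          # SE of the mean
    return {
        "mean": m,
        "sd": sd,
        "se": se,
        "interval": (m - multiplier * se, m + multiplier * se),
    }


def coverage(multiplier):
    """Proportion of a normal distribution within +/- multiplier units."""
    z = NormalDist()
    return z.cdf(multiplier) - z.cdf(-multiplier)


sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
print(summarize(sample))
print(round(coverage(1), 3), round(coverage(2), 3))
```

Because the SE is always smaller than the SD for more than 1 observation, error bars that are not labeled can make the same data look either tightly or loosely clustered.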
In their table, Strunk et al clearly indicate that they
are providing the mean ± SD. What they want to demonstrate
is the variability of their single observations. In
A complete cohort of randomly assigned CLIC subjects was
used for the model building. The cohort was revised
this presentation they used the mean as a representation of
according to variables in the final model to include all
a single observation and gave the SD as a measure of the
possible patients. All summary statistics and analyses were
variability in single observations. They also provide the
performed by means of SAS Version 8 statistical software
(SAS Institute, Cary, NC). Significance was established at
11. Specify any general use computer programs used.
12. Put general descriptions of statistical methods in the
Sorensen et al also provide information on the analysis
Methods section. When data are summarized in the
package used and, in the section below, additional
Results section, specify the statistical methods used
information that a reader might need to reproduce their
results. ‘‘All reported P values in this article are based on
13. Restrict tables and figures to those needed to explain
2-tailed tests, unless otherwise mentioned. It should be
the argument of the paper and to assess its support.
noted that the use of a 2-tailed test when the hypothesis
Use graphs as an alternative to tables with many
involves an expected increase (or decrease) yields a con-
entries; do not duplicate data in graphs and tables.
14. Avoid nontechnical uses of technical terms in
statistics, such as ‘‘random’’ (which implies a randomizing device), ‘‘normal,’’ ‘‘significant,’’ ‘‘corre-
15. Define statistical terms, abbreviations, and most
It is important to have the right perspective about
correctly reporting statistical tests in the medical literature.
Recently, several major physiology journals have
It is very easy to think that because misreporting has been
adopted guidelines for reporting statistical information
going on for so long without seeming to harm medical
that adopt many of the above These guide-
research, this is not a big issue. In the words of John
lines contain practical and straightforward suggestions for
reporting statistical information that should help you to
. there may be greater danger to the
public welfare from statistical dishonesty than from
avoid the problems discussed above. The single most
almost any other form of dishonesty.’’ This article is not
important thing to do is to consult a statistician well before
about dishonesty but about carelessness in the reporting of
you undertake the analysis of your data. If the early steps
statistical methods. The harmful results, however, could
of design and data collection are done correctly, the actual
be the same, and the first step in separating the one from
analysis of the data is usually a relatively minor issue.
the other is to agree on rigorous reporting in medical journals of statistical methods. This will make it easier for
researchers to understand what constitutes the standard of
1. Olsen CH. Review of the use of statistics in infection and immunity.
excellence to which all of us should aspire.
2. Garcia-Berthou E, Alcaraz C. Incongruence between test statistics and P
Dr Bailar’s quote is particularly appropriate because the
values in medical papers. BMC Medical Research Methodology 2004.
recommendations for statistical reporting that he and Dr
Mosteller laid out in 1988 are still the basis for reporting
statistical methods that are recommended by the
3. White S. Statistical errors in papers in the British Journal of Psychiatry.
International Committee of Medical Journal Editors.
4. Strunk RC, Szefler SJ, Phillips BR, Zeiger RS, Chinchilli VM, Larsen G,
Their 15 points are listed below. More detail can be
et al. Relationship of exhaled nitric oxide to clinical and inflammatory
markers of persistent asthma in children. J Allergy Clin Immunol 2003;112:883-92.
1. Describe statistical methods with enough detail to
5. Asero R, Mistrello G, Roncarlo D, Amato S, Zanoni D, Barocci F, et al.
enable a knowledgeable reader with access to the
Detection of clinical markers of sensitization to profilin in patients
original data to verify the reported results.
allergic to plant-derived foods. J Allergy Clin Immunol 2003;112(2):
2. When possible, quantify findings and present them
with appropriate indicators of measurement error or
6. Adams RJ, Weiss ST, Fuhlbrigge A. How and by whom care is delivered
influences anti-inflammatory use in asthma: results of a national
uncertainty (eg, confidence intervals).
population survey. J Allergy Clin Immunol 2003;112:445-50.
3. Avoid sole reliance on statistical hypothesis testing,
7. Sorensen B, Streib JE, Strand M, Make B, Giclas PC, Fleshner M, et al.
such as the use of P values, which fails to convey
Complement activation in a model of chronic fatigue syndrome. J Allergy
8. Eggesbe M, Botten G, Stigum H, Nafstad P, Magnus P. Is delivery by
4. Discuss eligibility of experimental subjects.
cesarean section a risk factor for food allergy? J Allergy Clin Immunol
5. Give details about randomization.
6. Describe the methods for, and success of, any
9. International Committee of Medical Journal Editors. Uniform require-
ments for manuscripts submitted to biomedical journals. Ann Intern Med
10. Curran-Everett D. Multiple comparisons: philosophies and illustrations.
Am J Physiol Regul Integr Comp Physiol 2000;279:R1-8.
9. Report losses to observation (eg, dropouts from
11. Bailar JC. Bailar’s laws of data analysis. Clin Pharmacol Ther 1976;20:
10. References for study design and statistical methods
12. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for
medical journals. Ann Intern Med 1988;108:266-73.
should be to standard works (with pages stated)
13. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in
when possible rather than to papers where designs or
journals published by the American Physiological Society. Am J Physiol