Journal of the American Medical Informatics Association Volume 9 Number 3 May / Jun 2002
Factors Associated with Success in Searching MEDLINE and Applying Evidence to Answer Clinical Questions
WILLIAM R. HERSH, MD, M. KATHERINE CRABTREE, RN, DNSC, DAVID H. HICKAM, MD, MPH, LYNETTA SACHEREK, MLS, CHARLES P. FRIEDMAN, PHD, PATRICIA TIDMARSH, JD, CRAIG MOSBAEK, DALE KRAEMER, PHD
Abstract
Objectives: This study sought to assess the ability of medical and nurse practitioner students to use MEDLINE to obtain evidence for answering clinical questions and to identify factors associated with the successful answering of questions.
Methods: A convenience sample of medical and nurse practitioner students was recruited. After completing instruments measuring demographic variables, computer and searching attitudes and experience, and cognitive traits, the subjects were given a brief orientation to MEDLINE searching and the techniques of evidence-based medicine. The subjects were then given 5 questions (from a pool of 20) to answer in two sessions using the Ovid MEDLINE system and the Oregon Health & Science University library collection. Each question was answered using three possible responses that reflected the quality of the evidence. All actions capable of being logged by the Ovid system were captured. Statistical analysis was performed using a model based on generalized estimating equations. The relevance-based measures of recall and precision were measured by defining end queries and having relevance judgments made by physicians who were not associated with the study.
Results: Forty-five medical and 21 nurse practitioner students provided usable answers to 324 questions. The rate of correctness increased from 32.3 to 51.6 percent for medical students and from 31.7 to 34.7 percent for nurse practitioner students. Ability to answer questions correctly was most strongly associated with correctness of the answer before searching, user experience with MEDLINE features, the evidence-based medicine question type, and the spatial visualization score. The spatial visualization score showed multi-collinearity with student type (medical vs. nurse practitioner). Medical and nurse practitioner students obtained comparable recall and precision, neither of which was associated with correctness of the answer.
Conclusions: Medical and nurse practitioner students in this study were at best moderately successful at answering clinical questions correctly with the assistance of literature searching. The results confirm the importance of evaluating both search ability and the ability to use the resulting information to accomplish a clinical task.
■ J Am Med Inform Assoc. 2002;9:283–293.
Affiliations of the authors: Oregon Health & Science University, Portland, Oregon (WRH, MKC, DHH, LS, PT, CM, DK); University of Pittsburgh, Pittsburgh, Pennsylvania (CPF).
This study was supported by grant LM-06311 from the National
Correspondence and reprints: William Hersh, MD, Division of Medical Informatics and Outcomes Research, Oregon Health & Science University, BICC, 3181 SW Sam Jackson Park Road, Portland, OR 97201; e-mail: <[email protected]>.
Received for publication: 7/25/01; accepted for publication: 11/29/01.
HERSH ET AL., Searching MEDLINE to Answer Clinical Questions
The MEDLINE database and techniques of evidence-based medicine are increasingly used by health care providers, but little research has elucidated how helpful they are in assisting with clinical decisions. A great deal of work has focused on how well users are able to retrieve relevant documents using information retrieval systems to search MEDLINE, but little work has focused on how well the resulting use of the literature leads to improving ability to answer clinical questions.1 A number of studies have shown that the techniques of evidence-based medicine can be learned and applied correctly in educational settings,2 but none has looked at how well they can be applied by students to answer clinical questions.
In the evaluation of information retrieval systems, most studies have focused on measuring the quantities of relevant documents retrieved, using measures of recall and precision. Although useful in measuring retrieval system performance, these measures do not capture the interactive nature of the actual use of systems,3 tend to focus the assessment on the system and ignore the user,4 and do not necessarily correlate
A more recent user-centered approach to the evaluation of information retrieval systems has focused on the ability of users to perform tasks with the information retrieval system. The approach assumes that the primary objective of the user is not to retrieve relevant documents but rather to answer questions or obtain new knowledge. The first “task-oriented” evaluation of an information retrieval system was performed by Egan et al.7 when evaluating the ability of students to answer questions on statistics using the SuperBook hypertext system. Others have subsequently used this general approach to evaluate the abilities of college students to find information in a textbook on Sherlock Holmes8 and of medical students to answer questions in an online factual data-
The interactive track at the Text Retrieval Conference (TREC) has adopted a task-oriented framework to assess how well real users can retrieve information from the TREC test collection.11 This approach has also been used to assess medical students using online
The specific research questions addressed in this
■ How well are senior medical students and final-year nurse practitioner students able to search MEDLINE with an information retrieval system to
■ What factors are associated with successful use of an information retrieval system to obtain correct
■ Are recall and precision, as measured by conventional recall-precision analyses, associated with successful answering of clinical questions?
On the basis of results from a prior study,14 we developed a model of factors that could be associated with the successful answering of questions. Most of these factors were derived from an exhaustive categorization of factors associated with successful use of information retrieval systems, developed by Fidel and Soergel,15 with some modifications for end-user searching in the health care domain. We also included detailed attributes for determinants of search experience, in particular whether searchers had heard of or used certain advanced MEDLINE features; specifically, Medical Subject Headings (MeSH) terms, subheadings, explosions, and publication types. Table 1 shows the final model of potential predictor factors related to searching ability to be
The dependent variable in the model is the ability of the user to answer clinical questions correctly. The set of questions for this study was developed in the prior study14 but modified for conversion to a format that incorporated a judgment of the adequacy of evidence supporting the answer. This was done by wording the questions so they could be answered by one of three statements: “yes, with adequate evidence”; “no, with adequate evidence”; or “insufficient evidence to answer question.”
Clinical Questions
The questions used for searching were taken from sources that represented a diverse spectrum of real-world and examination-style information queries. For clinical relevance, the first group of questions was generated by practicing clinicians, and these questions were known to have answers that could be found by searching MEDLINE.16 We also included some traditional examination-style questions from the Medical Knowledge Self-Assessment Program (MKSAP, American College of Physicians, Philadelphia, Pennsylvania) after converting them from multiple-choice to yes/no form. There were ten questions from each
Table 1. Potential Predicting Factors Influencing Successful Use of an Information Retrieval System by End-users Answering Clinical Questions in a Medical Library Setting Using MEDLINE
HrdSH—ever heard of subheadings (yes, no)
Answer—answer to question (yes, no, or insufficient
HrdExp—ever heard of explosions (yes, no)
Type—EBM type (therapy, diagnosis, harm, prognosis)
HrdPT—ever heard of publication types (yes, no)
PreAns—answer before searching (yes, no, or insufficient
UsedPT—ever used publication types (yes, no)
PracHard—practice easier or harder with computers
PreCorr—answer correct before search (true, false)
PreCert—certainty of answer before search (1 = high, 5 = low)
EnjComp—enjoy using computers (yes, no)
PostAns—answer after searching (yes, no, or insufficient
MedSpec—medical specialty will be entering (from list,
PostCorr—answer correct after search (true, false)
YrsNurse—years worked as a nurse (years, nurse practitioner
PostCert—certainty of answer after search (1 = high, 5 = low)
Preferred—who user would seek for answer (from list)
nurse practitionerSpec—nurse practitioner specialty (from list,
Stacks—whether searcher went to stacks (true, false)
Order—order question done by this search (2 to 6; all did
School—school student enrolled (medical or nurse practitioner)
Helpful—citations helpful answer (number)
Justified—citations justifying answer (number)
ProdSW—use productivity software once a week (yes, no)
Viewed—total MEDLINE references viewed (number)
OwnPC—own a personal computer (yes, no)
FTViewed—full-text documents viewed (number)
Modem: personal computer has a modem (yes, no)
Internet: personal computer connects to Internet (yes, no)
LitSrch—literature searches per month (number)
Quis—QUIS average for this searcher (number)
WebSrch—Web searches per month (number)
WebMed—Web searches for medical information per month
Retrieved—number of articles retrieved by user in terminal set(s)
Precision—user’s precision for retrieval of definitely or
TrainEBM—ever had instruction in evidence-based medicine
Recall—user’s recall for retrieval of definitely or possibly
HrdMsh—ever heard of MeSH terms (yes, no)
Experimental Protocol
To obtain subjects for the experiment, a convenience sample of senior medical students from Oregon Health & Science University (OHSU) and nurse practitioner students from OHSU and Washington State University–Vancouver was recruited by e-mail, paper mail, and, in the case of nurse practitioner students, announcements in classes. Students were offered remuneration of $100 for successful comple-
The general experimental protocol was to participate in three sessions: a “large-group” session where the students would be administered questionnaires and receive an orientation to MEDLINE, the techniques of evidence-based medicine, and the experiment, followed by two hands-on sessions where they would do the actual searching, read the articles, and answer
The large-group sessions, consisting of 3 to 15 subjects at a time, took place in a computer training room. At each session, subjects were first administered a questionnaire on their personal characteristics and experience with computers and searching factors, from Table 1. Next they were tested for the following cognitive attributes, measured by validated instruments from the Educational Testing Service Kit of Cognitive Factors17 (ETS mnemonic in parentheses): paper folding test to assess spatial visualization (VZ-2), nonsense syllogisms test to assess logical reasoning (RL-1), and advanced vocabulary test I to
These cognitive factors were assessed because they have been found to be associated with successful use of computer systems in general and retrieval systems
Spatial visualization: The ability to visualize spatial relationships among objects has been associated with retrieval system performance by nurses,18 ability to locate text in a general retrieval system,19 and ability to use a direct-manipulation (three-dimensional) retrieval system user interface.20
Logical reasoning: The ability to reason from premise to conclusion has been shown to improve selectivity in assessing relevant and nonrelevant cita-
Verbal reasoning: The ability to understand vocabulary has been shown to be associated with the use of a larger number of search expressions and high-frequency search terms in a retrieval system.21
The large-group session also included a brief orientation to the searching task of the experiment as well as a 30-minute hands-on training session covering basic MEDLINE and evidence-based medicine principles. The following searching features were chosen for coverage: MeSH headings, text words, explosions, combinations, limits, and scope notes. These features were chosen because they are taught in medical informatics training courses for health care providers offered at OHSU, and they constitute a basic skill set for MEDLINE searching by a health care provider. The overview of evidence-based medicine described the basic notions of framing the appropriate question, determining which evidence would be most appropriate for a given question, and the best searching strategies for finding such evidence. The teaching was done by a medical informatician experienced in teaching MEDLINE and evidence-based medicine to
The hands-on sessions took place 2 to 4 weeks after the subject had completed the large-group session. He or she had been encouraged to practice the searching skills taught in the large-group session but was given no other explicit instructions. The searching sessions took place in the OHSU Library. All searching was done using the Ovid information retrieval system (Ovid Technologies, New York, New York), which accesses MEDLINE and a collection of 85 full-text journals. We used the Web-based version of Ovid. We also employed its logging facility, which enabled all search statements to be recorded as well as the number of citations presented to and
Table 2
1. Is there any benefit of routine Pap smear in persons who have had a hysterectomy for benign disease?
2. Is ultrasound the best diagnostic test available to exclude the presence of lower extremity deep vein thrombosis?
3. Are nonacetylated salicylates really safer, e.g., have less incidence of acid-peptic problems, in patients with NSAID (nonsteroidal anti-inflammatory drug) gastrointestinal intolerance (who benefit from anti-inflammatory effect)?
4. Is the elevation of alkaline phosphatase a better indicator of recurring prostate cancer than a rising PSA (prostate-specific
5. Is the Cytobrush superior to a spatula for obtaining cells for Pap smears, in terms of technical quality (e.g., percentage of interpretable smears)?
6. Does dietary protein affect the level of proteinuria in patients
7. Is there any benefit of ultrasound as physical therapy for
8. Is penicillin superior to ciprofloxacin for the outpatient treatment of pelvic inflammatory disease?
9. Is anti-inflammatory therapy (NSAIDs) better than Tylenol for elderly patients with degenerative joint disease?
10. Is there evidence of an association between petroleum
Questions derived from medical test questions:
1. Is a high-dose (1,200 to 1,500 mg daily) regimen of zidovudine therapeutically superior to a low-dose (500 to 600 mg daily) one for reducing the progression to AIDS in patients with positive HIV antibody?
2. Will PSA screening lower the mortality rate from prostate cancer in low-risk men after they reach the age of 50 years?
3. Is there good evidence that an antibiotic can prevent endocarditis in an 18-year-old woman with rheumatic heart disease (mild mitral regurgitation) who is to have a dental
4. A 52-year-old woman recently had a modified radical mastectomy for infiltrating ductal carcinoma of the breast. Her axillary lymph nodes are negative for tumor. Would estrogen receptor negativity be more likely to indicate a relatively poor prognosis for this patient, rather than thyroid hormone receptor positivity?
5. A 40-year-old premenopausal woman consults you about her risk of breast cancer. Does prior use of birth control pills
6. Does anti-reflux surgery in patients with Barrett’s esophagus reduce the risk of developing adenocarcinoma?
7. Is long-distance running associated with intervertebral disc
8. Would plasma norepinephrine levels indicate poor prognosis in congestive heart failure better than hyponatremia?
9. Is Trental (pentoxifylline) the best drug available to improve
10. Do the majority (> 50 percent) of terminal AIDS patients have clinical symptoms of cardiac involvement?
In the two hands-on sessions, subjects searched six questions. For the first question of the first session, each user searched the same “practice” question, which was not graded. This was done not only to make searchers comfortable with the experimental process but also because a previous study had suggested a learning effect among inexperienced searchers.22 The remaining five questions (the last two from the first session and all three from the second session) were selected at random from the pool of 20 questions. Question selection was without replacement, i.e., the same pool of questions was used for four consecutive searchers.
Subjects were limited to one hour per question. Before searching, each subject was asked to record a pre-search answer and a rating of certainty on a scale of 1 (most) to 5 (least) for the questions on which they would search. Subjects were then instructed to perform their searching in MEDLINE and to obtain any articles that they wanted to read either in the library stacks or in the full-text collection available online. They were asked to record on paper their post-search answer, the certainty of their answer (on the 1-to-5 scale), which articles justified their answer, and any article that they looked at in the stacks or in full-text on the screen. On completion of the searching, they were administered the Questionnaire for User Interface Satisfaction (QUIS) 5.0 instrument to measure their satisfaction with the searching system. QUIS measures user satisfaction with a computer system, providing a score from 0 (poor) to 9 (excellent) on a variety of user factors, with the overall score determined by averag-
Searching time for each question was measured using a wall clock. All user–system interactions were logged by the Ovid system software. The search logs were processed to count the number of search cycles (each consisting of the entry of a search term or Boolean combination of sets) and the number of full MEDLINE references viewed on the screen.
Answer Scoring
After all the hands-on searching sessions were completed, the actual answers to the questions were determined by the research team. This was done by assembling all the articles retrieved for each question and giving them, along with the question, to three members of the study team (WRH, MKC, and DHH). The three first designated an answer individually (blinded to any answers that subjects may have provided) and then worked out their differences by consensus. After the answers were designated, two members of the study team (WRH and MKC) graded the answer forms, resolving any differences by consensus. The primary measure of correctness was whether the subjects selected the correct answer from “yes, with adequate evidence”; “no, with adequate evidence”; or “insufficient evidence to answer question.”
Statistical Analysis
The usual appropriate statistical analysis for studies with a binary outcome measure (correct vs. incorrect) is logistic regression. However, traditional logistic regression is not appropriate with these data because it does not take into account the within-subject correlation, i.e., the fact that individual questions are not independent, because each searcher answered five questions. To account for this, the analyses were done using generalized estimating equations (GEEs), which account for within-subject correlation.24 All analyses, both univariate and multivariate, were done using GEE on version 8.01 of the SAS statistical
Recall-Precision Analysis
The goal of the recall-precision analysis was to identify a relative measure of recall and precision that could be used to determine its contribution to predicting successful answering of the question. We aimed to carry out the study using the approaches most commonly reported in the information retrieval literature, such as using domain experts to judge relevance, pooling documents within a single query to mask the number of searchers who retrieved it, and assessing interrater reliability. Because of limitations of the retrieval process and of study resources, we were not able to calculate absolute recall and precision. We instead calculated relative measures for each that would allow assessment of their association
The recall-precision analysis was performed by use of searching logs. The first challenge in this process was to determine which sets to use for each user and question in the analysis. Ovid and other Boolean-oriented systems produce sets of results. Usually, the first sets are large and later ones are smaller, as the search is refined. The user usually does not start looking at the sets until they are smaller and refined. For example, a search on the first question derived from medical test questions shown in Table 2 (“Is a high dose (1,200 to 1,500 mg daily) regimen of zidovudine therapeutically superior to a low dose (500 to 600 mg daily) regimen for reducing the progression to AIDS in patients with positive HIV antibody?”) would probably begin with
sets created with the terms zidovudine and AIDS. Each of these sets yields large numbers of articles, but their combination with AND as well as applications of limits (such as publication type) would yield a more
Table 3. Values of All Searching-related Factors for All Searches, Stratified by Student Type (Medical
We therefore wanted to restrict our recall-precision
calculations to sets that the user would be likely to browse to view specific articles. We thus aimed to
identify the “end queries” of the search process,
which we identified as the terminal point of a search
strategy. This was defined as the point at which the
subject stopped refining (creating subsets) of a search
and began using new search terms or new combina-
tions of search terms. The document set retrieved by
the end queries also had to include the documents
cited by the subjects as justification for their post-
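The end-query rule described above (a set is terminal once the searcher stops narrowing it and moves on to new terms or combinations) can be pictured with a minimal Python sketch. The log layout, the `SearchSet` class, and every citation count below are illustrative assumptions, not the Ovid log format or study data; the 200-citation cutoff is the exclusion threshold this section reports for the relevance analysis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SearchSet:
    """One logged search statement (hypothetical layout, not Ovid's log format)."""
    set_id: int
    expression: str
    parents: Tuple[int, ...]  # earlier sets that this statement narrows or combines
    citations: int            # size of the resulting set (made-up numbers below)


def end_queries(log: List[SearchSet], max_citations: int = 200) -> List[SearchSet]:
    """Return sets that no later statement refines, dropping sets at or above the cutoff."""
    refined = {parent for entry in log for parent in entry.parents}
    return [
        entry for entry in log
        if entry.set_id not in refined and entry.citations < max_citations
    ]


# Illustrative log for the zidovudine question; all counts are invented.
log = [
    SearchSet(1, "zidovudine", (), 6000),
    SearchSet(2, "AIDS", (), 40000),
    SearchSet(3, "1 AND 2", (1, 2), 2500),
    SearchSet(4, "limit 3 to clinical trial", (3,), 120),
    SearchSet(5, "HIV seropositivity", (), 9000),  # new term: the earlier strategy ended at set 4
    SearchSet(6, "1 AND 5", (1, 5), 150),
]

print([s.set_id for s in end_queries(log)])
```

This sketch treats "never refined again" as the only criterion; the additional requirement that an end query contain the documents a subject cited would need the citation lists as a further input.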
These rules for end queries were given to a graduate
medical informatics student who was asked to read
the rules and identify end queries in ten systematical-
ly selected query sets. The selected query sets repre-
sented different users and study questions and were
from the beginning, middle, and end of query logs.
The graduate student’s identification of end queries was compared with the selection of end queries for the
same set by a member of the study team (PT). The
graduate student and study team member identified
34 end queries. They initially agreed on 23 of the 34
end queries (67.6 percent). The rules were refined by
consensus and then applied to all the study logs. End
queries that retrieved 200 or more citations were
excluded from the relevancy analysis. A total of 10,508
unique question/document pairs were identified and
To assess the reliability of the relevance judgment
process and determine the number of relevance
judges required per question/document pair, a pilot
study using 100 documents, selected from a random
sampling of five study questions, was performed.
The judgments were made by six physicians, all either general internal medicine or medical informatics postdoctoral fellows. All six judges rated the relevance of all 100 documents using a three-point rating
scale of “not relevant,” “possibly relevant,” and “def-
initely relevant.” Using Cronbach’s alpha, measured
at 0.93, it was determined that three judges per ques-
tion/document pair were sufficient for reliable
assessment of relevance in the larger collection.
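For readers unfamiliar with the reliability statistic used in the pilot, the following Python sketch computes Cronbach's alpha with judges treated as items and documents as observations. The numeric coding of the three-point scale (0, 1, 2) and the toy ratings are assumptions made for illustration; the paper does not state how the ratings were coded.

```python
from statistics import variance
from typing import Sequence

# ASSUMPTION: ordinal coding of the three-point scale; not stated in the paper.
RATING_SCORES = {"not relevant": 0, "possibly relevant": 1, "definitely relevant": 2}


def cronbach_alpha(ratings_by_judge: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of per-judge variances / variance of totals)."""
    k = len(ratings_by_judge)
    if k < 2:
        raise ValueError("need at least two judges")
    n_docs = len(ratings_by_judge[0])
    if any(len(r) != n_docs for r in ratings_by_judge):
        raise ValueError("every judge must rate the same documents")
    judge_variances = sum(variance(r) for r in ratings_by_judge)
    totals = [sum(doc_scores) for doc_scores in zip(*ratings_by_judge)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - judge_variances / total_variance)


# Toy ratings for five documents by three judges (invented for illustration).
judge_a = [0, 1, 2, 1, 0]
judge_b = [0, 1, 2, 2, 0]
judge_c = [1, 1, 2, 1, 0]
print(round(cronbach_alpha([judge_a, judge_b, judge_c]), 3))
```

Values near 1 indicate that the judges rank documents consistently, which is the sense in which the pilot result supported using three judges per pair.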
To have each question/document pair rated by three
judges, we could assess only half (5,254) of the docu-
ments retrieved by users, because of limited study
resources. The six judges who participated in the pilot study also participated in the complete study. Three of them judged each unique question/document pair. All judgments were done using the MEDLINE record distributed in an electronic file, although they were encouraged to seek the full text of the article in the library, if necessary.
A total of 66 searchers (45 medical students and 21 nurse practitioner students) performed five searches each, for a total of 330 searches. Six searches were discarded, five because the user did not search MEDLINE and one because the user did not provide an answer, which left 324 searches for analysis.
General Results
There were several differences between medical and nurse practitioner students in this study (Table 3). Use of computers and use of productivity software were higher for nurse practitioner students, but searching experience was higher for medical students. Medical students also had higher self-rating of knowledge and experience with advanced MEDLINE features. Nurse practitioner students tended to be older, and all were female (compared with medical students, of whom 50 percent were female). Medical students also had higher scores on the three cognitive tests. In searching, medical students tended to view more sets but fewer references. They also had a higher level of satisfaction with the information retrieval
Prior to searching, the performance of all students was slightly worse than chance, with 104 (32.1 percent) correct and 220 (67.9 percent) incorrect answers. The rate of correctness before searching for medical and nurse practitioner students was virtually identical (32.3 vs. 31.7 percent), as was the rating of certainty (mean, 3.16 for medical students and 3.23 for nurse practitioner students), which was low for both groups.
Following searching, there were 150 (46.3 percent) correct answers and 174 (53.7 percent) incorrect answers. The medical students had a higher rate of correctness than nurse practitioner students (51.6 vs. 34.7 percent). Examination of the results in more detail (Table 4) shows that medical students were better able to use searching to convert incorrect answers into correct ones. Both groups had comparable rates of initially correct answers staying correct or
Table 4. Cross-tabulation of Number and Percentage of Incorrect and Correct Answers Before and After Searching, for All Students, Medical Students,
NOTE: Percentages represent correct answers within each group of students.
Statistical Analysis
The goal of the statistical analysis was to build a model of the factors associated with successful searching, as defined by the outcome variable of correct answer after searching (PostCorr). A GEE model was built after individual variables were screened for their p values, using ANOVA for continuous variables and chi-square tests for categorical variables (Table 5). We also made one adjustment in the data, which was to combine the measures of MEDLINE experience (asking subjects if they had heard of or used four advanced MEDLINE search features: MeSH terms, subheadings, explosions, and publication types) into a set of scale variables. The most statistically predictive scale variable was Used2, which allocated one point if the subject said they had used publication types and one point if they had used explo-
A backward variable selection scheme was performed to determine the best model that predicted correct answering of the question after the MEDLINE search. All variables that predicted the outcome with a p value less than 0.25 were included in the initial backward regression model. The variable with the highest p value was deleted from the model, and the model was then re-run until all variables had p val-
After the backward scheme, variables were put back into the model to see whether any were significant. None of the excluded variables, when added to the
final model, had a p value less than 0.10. Interaction terms were tested with the final model, and none
Table 5. Values of Searching-related Factors Stratified by Correctness of Answer, along with p Values of Screening, for Statistical Analysis
A forward variable selection scheme yielded the same best model. The final model showed that PreCorr, VZ2, Used2, and Type were significant (Table 6). For the variable Type (evidence-based
medicine question type), questions of prognosis had
the highest likelihood of being answered correctly,
followed by questions of therapy, diagnosis, and
harm. The analysis also found that the VZ2 and
School variables demonstrated multi-collinearity, i.e.,
they were very highly correlated, and once one was
in the model, the other did not provide any additional statistical significance. The VZ2 variable was included in the final model because it led to a higher
overall p value for the model than School.
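The variable selection procedure described in this Statistical Analysis section (screen at p < 0.25, repeatedly drop the variable with the highest p value, then try each excluded variable again at p < 0.10) can be sketched in Python as follows. This is an outline of the control flow only: the `fit_p_values` callable is a hypothetical stand-in for refitting the GEE model, the stopping value `stay_cutoff` is an assumption because it is not recoverable from the text here, and the variable names and p values in the toy example are invented.

```python
from typing import Callable, Dict, List

# Stand-in for refitting the model: takes the variables currently in the model
# and returns a p value for each. In the study this role was played by a GEE fit.
PValueFitter = Callable[[List[str]], Dict[str, float]]


def backward_select(
    screening_p: Dict[str, float],
    fit_p_values: PValueFitter,
    entry_cutoff: float = 0.25,  # screening cutoff given in the text
    stay_cutoff: float = 0.05,   # ASSUMPTION: the stopping value is not given here
) -> List[str]:
    """Remove the variable with the highest p value and refit until all pass."""
    model = [name for name, p in screening_p.items() if p < entry_cutoff]
    while model:
        p_values = fit_p_values(model)
        worst = max(model, key=lambda name: p_values[name])
        if p_values[worst] < stay_cutoff:
            break
        model.remove(worst)
    return model


def recheck_excluded(
    model: List[str],
    candidates: List[str],
    fit_p_values: PValueFitter,
    readd_cutoff: float = 0.10,  # cutoff given in the text for re-added variables
) -> List[str]:
    """Add each excluded variable back on its own and keep those below the cutoff."""
    found = []
    for name in candidates:
        if name in model:
            continue
        if fit_p_values(model + [name])[name] < readd_cutoff:
            found.append(name)
    return found


# Toy fitter with made-up, fixed p values (a real refit would change them each round).
TOY_P = {"var_a": 0.001, "var_b": 0.02, "var_c": 0.40, "var_d": 0.18}


def toy_fit(variables: List[str]) -> Dict[str, float]:
    return {name: TOY_P[name] for name in variables}


screening = {"var_a": 0.001, "var_b": 0.03, "var_c": 0.20, "var_d": 0.60}
final_model = backward_select(screening, toy_fit)
print(final_model)
print(recheck_excluded(final_model, list(screening), toy_fit))
```

In a real analysis the fitter would refit the full model on each pass, so p values for the remaining variables can shift as others are removed; the fixed toy values above only exercise the loop.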
Next, a similar analysis was done to find the best
model using the subset of cases (n = 220) in which the subject did not have the right answer before the MEDLINE search. As shown in Table 7, the final best model
was very similar to the model for all questions, with
PreCorr obviously excluded. VZ2 and School again
To further assess the finding that success in answering
questions varied on the basis of evidence-based medi-
cine question type, we looked at the rate of correctness
for the four types (Table 8). Because of the exploratory
nature of this analysis, we did not perform any statis-
tical analysis. We did find, however, that all subjects
did best with prognosis questions, intermediately well
with therapy questions, and worst with diagnosis and
harm questions. The largest gap between medical and nurse practitioner students was with harm and therapy questions; nurse practitioner students did slightly
Table 6. Statistical Model for All Questions, Including Regression Value with Its p Value, Odds Ratio, and 95% Confidence Interval (CI) for the Odds
Overall p value for type: <0.0001
Table 8. Rate of Correctness by Evidence-based Medicine
ABBREVIATION: NP indicates nurse practitioner.
Recall–Precision Analysis
Three relevance judgments were made on each of the 5,254 question/document pairs using the responses “not relevant,” “possibly relevant,” and “definitely relevant.” The judges achieved 100 percent agreement (all three judges chose the same rating) for 4,265 of the judgments (81.2 percent), with partial agreement (two of three judges choosing the same rating) for 918 judgments (17.5 percent), and complete disagreement for 71 (1.3 percent).
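The three agreement categories above, together with the rules this section gives for assigning a final rating and for counting a document as relevant (rated definitely or possibly relevant), can be illustrated with a short Python sketch. Function names, document identifiers, and the sample judgments are hypothetical; the recall and precision computed here are relative to the judged pool for a question, in line with the relative measures the study describes.

```python
from collections import Counter
from typing import Dict, Set, Tuple

RELEVANT_LABELS = {"definitely relevant", "possibly relevant"}


def agreement_level(ratings: Tuple[str, str, str]) -> str:
    """Classify three judgments as full, partial, or no agreement."""
    return {1: "full", 2: "partial", 3: "none"}[len(set(ratings))]


def final_rating(ratings: Tuple[str, str, str]) -> str:
    """Majority rating; three-way disagreement falls back to 'possibly relevant'."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else "possibly relevant"


def recall_precision(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Recall and precision of one user's retrieved set against the judged pool."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Made-up judgments for one question (document id -> three judges' ratings).
judgments: Dict[str, Tuple[str, str, str]] = {
    "doc1": ("definitely relevant", "definitely relevant", "definitely relevant"),
    "doc2": ("possibly relevant", "possibly relevant", "not relevant"),
    "doc3": ("not relevant", "possibly relevant", "definitely relevant"),
    "doc4": ("not relevant", "not relevant", "not relevant"),
}

relevant_docs = {
    doc for doc, ratings in judgments.items()
    if final_rating(ratings) in RELEVANT_LABELS
}
recall, precision = recall_precision({"doc1", "doc4"}, relevant_docs)
print(sorted(relevant_docs), recall, precision)
```

Here a user who retrieved `doc1` and `doc4` finds one of the three relevant documents, so recall is one third and precision is one half.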
There were 20 unique groupings of the six relevance judges. For these 20 subsets of relevance judgments, the range of reliability, measured by Cronbach’s alpha, was 0.69 to 0.86. The weighted average of these measures was 0.81. Final document relevance was assigned according to the following rules: 1) If all judges agreed, the document was assigned that rating. 2) If two judges agreed, the document was assigned that rating. 3) If all three judges disagreed, the document was assigned the “possibly relevant” rating. The relevance judgments were then used to calculate recall and precision for each user/question pair.
For the 20 questions, 131 documents were judged definitely relevant (average, 6.6 per question) and 528 were judged possibly relevant (average, 26.4 per question). The calculation of recall and precision was done defining relevant documents as those rated definitely or possibly relevant. (Limiting relevance to those defined as definitely relevant only would have left many students’ questions with a recall of 0 percent.) As shown in Table 3, there was virtually no difference in recall and precision between medical and nurse practitioner students. Likewise, Table 5 shows that there was no difference in recall and precision between questions that were answered correctly and
Discussion
This study assessed the ability of a convenience sample of medical and nurse practitioner students to answer clinical questions by searching the literature and using the techniques of evidence-based medicine. We found that this task was challenging for students at this level of experience. They spent an average of more than 30 minutes conducting literature searches and were successful at correctly answering
Statistical Model for Questions Whose Answers Were Incorrect before Searching, Including Regression Value with Its p Value, Odds Ratio, and 95% Confidence Interval (CI) for the Odds
One of the main findings of the study was that medical students were able to use the information retrieval system to improve question answering, while nurse practitioner students were led astray by the system as often as they were helped by it. Another main finding was that experience in searching MEDLINE and spatial visualization ability were associated with the successful
Subjects were also better able to answer certain types of questions in the evidence-based medicine framework than others, doing best with questions of prognosis and worst with those of diagnosis and harm. Another major finding was that the often-studied measures of recall and precision were virtually identical between medical and nurse practitioner students and had no association with the correct
Our results showed some similarities to and some differences from a prior study.14 Somewhat similarly to this study, the prior study found that the most predictive factor of successful question answering was student type (medical vs. nurse practitioner). In that study, spatial visualization showed a trend toward predicting successful answering, but it was short of
In the previous study, unlike this one, the question-answering abilities of both medical and nurse practitioner students improved with use of the information retrieval system. Literature searching experience in that study, as in this one, was associated with the correct answering of questions. Factors that did not predict success in the previous study included age, gender, general computer experience, attitudes toward computers, other cognitive factors (logical reasoning, verbal reasoning, and associational fluency), Myers-Briggs personality type, and user satisfaction with the information retrieval system. One limitation of the prior study was that it did not assess the application of evidence-based medicine principles in the
The findings of this study are consistent with (and build on) the results of other studies of searching by medical students. Previous studies have shown that training and experience with MEDLINE lead to improved retrieval of relevant articles25,26 and increased use of MEDLINE in clinical settings.27 Other than our previous study, described above, there are no other studies of searching by nurse practitioner
This study supports the observation that the traditional measures to evaluate information retrieval systems, recall and precision, may have little value in the assessment of how well a system can be used in a real-world setting. While users obviously need to retrieve relevant articles to answer questions, the quantity of relevant articles retrieved had no bearing on the ability to answer them correctly in this study. These findings give credence to those who argue that researchers put too much emphasis on these measures as primary indicators of system efficacy.3–5 They also verify the nonmedical TREC studies that
Our results have significant implications for the use of information retrieval systems in clinical settings. The ability to answer clinical questions with the aid of MEDLINE is low. Further research is needed to determine whether additional training, either through the curricula or as part of the study, would change this outcome. Because we used a convenience sample, further research is needed to see whether our findings of differences between medical and nurse practitioner students are generalizable.
For both groups of students, the amount of time taken to answer questions is longer than the amount of time usually devoted to a single patient. Clearly, this type of information seeking is practical only “after hours” and not in the clinical setting. Indeed, a growing trend in the evidence-based medicine movement is toward the development of “synthesized” evidence-based content.28 It may well be that further emphasis should be put on the development of these sorts of information resources for the clinical setting.
One finding of the study, with uncertain meaning, was the strong association of spatial visualization ability with the ability to use an information retrieval system to successfully answer clinical questions. As this variable had such strong multi-collinearity with whether a subject was enrolled in medical or nurse practitioner school, determining which was causal
It may be instructive to explore other results that link computer tasks to spatial visualization. Egan and Gomez29 have shown that spatial visualization is associated with two processes in text editing: finding the location of characters to be edited and generating a syntactically correct sequence of actions to complete the task. Similarly, Vincente et al.30 have found that the ability to use a hierarchic file system is associated with spatial visualization as well as with vocabulary skills. In addition, Allen21 has shown that this trait is associated with the appropriate selection
This study had some additional limitations. The use of students, albeit in late stages of their training, limits the generalizability of the results beyond those at their level of clinical training. In future studies, community practitioners will also be included. This study was also limited by taking place in a laboratory setting, in that behaviors in the pursuit of actual clinical knowledge in a real clinical setting may be different from those shown in this controlled environment. However, the ability to use a defined set of tasks and questions provides a benefit that cannot be obtained in the real clinical setting.
In conclusion, this study shows that students in clinical training are at best moderately successful at answering clinical questions correctly with the assistance of searching the literature. Determining the reasons for the limited success of question answering in this study requires further research. The possibilities include everything from inadequate training to an inappropriate database (i.e., a large bibliographic database instead of more concise, synthesized references), problems with the retrieval system, and difficulties in judging evidence. Further studies must develop a priori hypotheses to determine the optimal use of information retrieval systems by clinicians.
References
1. Hersh WR, Hickam DH. How well do physicians use electronic information retrieval systems? A framework for investigation and review of the literature. JAMA. 1998;280:1347–52.
2. Norman GR, Shannon SI. Effectiveness of instruction in critical appraisal (evidence-based medicine): a critical appraisal.
3. Swanson DR. Historical note: Information retrieval and the future of an illusion. J Am Soc Inf Sci. 1988;39:92–8.
4. Harter SP. Psychological relevance and information science. J
5. Hersh WR. Relevance and retrieval evaluation: perspectives from medicine. J Am Soc Inf Sci. 1994;45:201–6.
6. Hersh W, Turpin A, Price S, et al. Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Inf
7. Egan DE, Remde JR, Gomez JM, Landauer TK, Eberhardt J, Lochbaum CC. Formative design-evaluation of Superbook.
8. Mynatt BT, Leventhal LM, Instone K, Farhat J, Rohlman DS. Hypertext or book: Which is better for answering questions? Proceedings of Computer-Human Interface ’92. 1992:19–25.
9. Wildemuth BM, de Bliek R, Friedman CP, File DD. Medical students’ personal knowledge, searching proficiency, and database use in problem solving. J Am Soc Inf Sci. 1995;46:590–607.
10. Friedman CP, Wildemuth BM, Muriuki M, et al. A comparison of hypertext and Boolean access to biomedical information. Proc AMIA Annual Fall Symp. 1996:2–6.
11. Hersh WR. Interactivity at the Text Retrieval Conference (TREC). Inf Proc Manag. 2001;37:365–6.
12. Hersh WR, Molnar A. Toward new measures of information retrieval evaluation. Proceedings of the 18th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval; Seattle, Washington.
13. Hersh WR, Pentecost J, Hickam DH. A task-oriented approach to information retrieval evaluation. J Am Soc Inf Sci.
14. Hersh WR, Crabtree MK, Hickam DH, Sacherek L, Rose L, Friedman CP. Factors associated with successful answering of clinical questions using an information retrieval system. Bull Med Libr Assoc. 2000;88:323–31.
15. Fidel R, Soergel D. Factors affecting online bibliographic retrieval: a conceptual framework for research. J Am Soc Inf Sci. 1983;34:163–80.
16. Gorman PN, Helfand M. Information seeking in primary care: how physicians choose which clinical questions to pursue and which to leave unanswered. Med Decis Making.
17. Ekstrom RB, French JW, Harmon HH. Manual for Kit of Factor-referenced Cognitive Tests. Princeton, NJ: Educational
18. Staggers N, Mills ME. Nurse–computer interaction: staff performance outcomes. Nurs Res. 1994;43:144–50.
19. Gomez LM, Egan D, Bowers C. Learning to use a text editor: some learner characteristics that predict success. Human–Computer Interaction. 1986;2:1–23.
20. Swan RC, Allan J. Aspect windows, 3-D visualization, and indirect comparisons of information retrieval systems. Proceedings of the 21st Annual International ACM Special Interest Group in Information Retrieval; Melbourne, Australia. New York: ACM Press, 1998:173–81.
21. Allen BL. Cognitive differences in end-user searching of a CD-ROM index. Proceedings of the 15th Annual International ACM Special Interest Group in Information Retrieval; Copenhagen, Denmark. New York:
22. Rose L, Crabtree K, Hersh W. Factors influencing successful use of information retrieval systems by nurse practitioner students. Proc AMIA Annu Fall Symp. 1998:1067.
23. Chin JP, Diehl VA, Norman KL. Development of an instrument measuring user satisfaction of the human–computer interface. Proceedings of CHI ’88: Human Factors in Computing Systems. New York: ACM Press, 1988:213–8.
24. Hu FB, Goldberg J, Hedeker D, Flay BR, Pentz MA. Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes. Am J
25. Pao ML, Grefsheim SF, Barclay ML, Woolliscroft JO, Shipman BL, McQuillan M. Effect of search experience on sustained MEDLINE usage by students. Acad Med. 1994;69:914–20.
26. Mitchell JA, Johnson ED, Hewett JE, Proud VK. Medical students using Grateful Med: analysis of failed searches and a six-month follow-up study. Comput Biomed Res. 1992;25:43–55.
27. Pao ML, Grefsheim SF, Barclay ML, Woolliscroft JO, McQuillan M, Shipman BL. Factors affecting students’ use of MEDLINE. Comput Biomed Res. 1993;26:541–55.
28. Hersh WR. “A world of knowledge at your fingertips”: the promise, reality, and future directions of online information retrieval. Acad Med. 1999;74:240–3.
29. Egan DE, Gomez LM. Assaying, isolating, and accommodating individual differences in learning a complex skill. In: Dillon R (ed). Individual Differences in Cognition, vol 2. New
30. Vincente KJ, Leske JS, Williges RC. Assaying and isolating individual differences in searching a hierarchical file system.