Journal of the American Medical Informatics Association Volume 9 Number 3 May / Jun 2002

Factors Associated with Success in Searching MEDLINE and Applying Evidence to Answer Clinical Questions

WILLIAM R. HERSH, MD, M. KATHERINE CRABTREE, RN, DNSC, DAVID H. HICKAM, MD, MPH, LYNETTA SACHEREK, MLS, CHARLES P. FRIEDMAN, PHD, PATRICIA TIDMARSH, JD, CRAIG MOSBAEK, DALE KRAEMER, PHD

Abstract

Objectives: This study sought to assess the ability of medical and nurse
practitioner students to use MEDLINE to obtain evidence for answering clinical questions and to
identify factors associated with the successful answering of questions.
Methods: A convenience sample of medical and nurse practitioner students was recruited.
After completing instruments measuring demographic variables, computer and searching
attitudes and experience, and cognitive traits, the subjects were given a brief orientation to
MEDLINE searching and the techniques of evidence-based medicine. The subjects were then
given 5 questions (from a pool of 20) to answer in two sessions using the Ovid MEDLINE system
and the Oregon Health & Science University library collection. Each question was answered using
three possible responses that reflected the quality of the evidence. All actions capable of being
logged by the Ovid system were captured. Statistical analysis was performed using a model based
on generalized estimating equations. The relevance-based measures of recall and precision were
measured by defining end queries and having relevance judgments made by physicians who were
not associated with the study.
Results: Forty-five medical and 21 nurse practitioner students provided usable answers to 324
questions. The rate of correctness increased from 32.3 to 51.6 percent for medical students and
from 31.7 to 34.7 percent for nurse practitioner students. Ability to answer questions correctly
was most strongly associated with correctness of the answer before searching, user experience
with MEDLINE features, the evidence-based medicine question type, and the spatial visualization
score. The spatial visualization score showed multi-collinearity with student type (medical vs.
nurse practitioner). Medical and nurse practitioner students obtained comparable recall and
precision, neither of which was associated with correctness of the answer.
Conclusions: Medical and nurse practitioner students in this study were at best moderately
successful at answering clinical questions correctly with the assistance of literature searching.
The results confirm the importance of evaluating both search ability and the ability to use the
resulting information to accomplish a clinical task.
J Am Med Inform Assoc. 2002;9:283–293.
Affiliations of the authors: Oregon Health & Science University, Portland, Oregon (WRH, MKC, DHH, LS, PT, CM, DK); University of Pittsburgh, Pittsburgh, Pennsylvania (CPF).
This study was supported by grant LM-06311 from the National
Correspondence and reprints: William Hersh, MD, Division of Medical Informatics and Outcomes Research, Oregon Health & Science University, BICC, 3181 SW Sam Jackson Park Road, Portland, OR 97201; e-mail: <[email protected]>.
Received for publication: 7/25/01; accepted for publication: 11/29/01.

HERSH ET AL., Searching MEDLINE to Answer Clinical Questions

The MEDLINE database and techniques of evidence-based medicine are increasingly used by health care providers, but little research has elucidated how helpful they are in assisting with clinical decisions. A great deal of work has focused on how well users are able to retrieve relevant documents using information retrieval systems to search MEDLINE, but little work has focused on how well the resulting use of the literature leads to improving ability to answer clinical questions.1 A number of studies have shown that the techniques of evidence-based medicine can be learned and applied correctly in educational settings,2 but none has looked at how well they can be applied by students to answer clinical questions.

In the evaluation of information retrieval systems, most studies have focused on measuring the quantities of relevant documents retrieved, using measures of recall and precision. Although useful in measuring retrieval system performance, these measures do not capture the interactive nature of the actual use of systems,3 tend to focus the assessment on the system and ignore the user,4 and do not necessarily correlate

A more recent user-centered approach to the evaluation of information retrieval systems has focused on the ability of users to perform tasks with the information retrieval system. The approach assumes that the primary objective of the user is not to retrieve relevant documents but rather to answer questions or obtain new knowledge. The first “task-oriented” evaluation of an information retrieval system was performed by Egan et al.7 when evaluating the ability of students to answer questions on statistics using the SuperBook hypertext system. Others have subsequently used this general approach to evaluate the abilities of college students to find information in a textbook on Sherlock Holmes8 and of medical students to answer questions in an online factual database

The interactive track at the Text Retrieval Conference (TREC) has adopted a task-oriented framework to assess how well real users can retrieve information from the TREC test collection.11 This approach has also been used to assess medical students using online

The specific research questions addressed in this

■ How well are senior medical students and final-year nurse practitioner students able to search MEDLINE with an information retrieval system to
■ What factors are associated with successful use of an information retrieval system to obtain correct
■ Are recall and precision, as measured by conventional recall-precision analyses, associated with successful answering of clinical questions?

On the basis of results from a prior study,14 we developed a model of factors that could be associated with the successful answering of questions. Most of these factors were derived from an exhaustive categorization of factors associated with successful use of information retrieval systems, developed by Fidel and Soergel,15 with some modifications for end-user searching in the health care domain. We also included detailed attributes for determinants of search experience, in particular whether searchers had heard of or used certain advanced MEDLINE features; specifically, Medical Subject Headings (MeSH) terms, subheadings, explosions, and publication types. Table 1 shows the final model of potential predictor factors related to searching ability to be

The dependent variable in the model is the ability of the user to answer clinical questions correctly. The set of questions for this study was developed in the prior study14 but modified for conversion to a format that incorporated a judgment of the adequacy of evidence supporting the answer. This was done by wording the questions so they could be answered by one of three statements—“yes, with adequate evidence”; “no, with adequate evidence”; or “insufficient evidence to answer question.”

Clinical Questions

The questions used for searching were taken from sources that represented a diverse spectrum of real-world and examination-style information queries. For clinical relevance, the first group of questions was generated by practicing clinicians, and these questions were known to have answers that could be found by searching MEDLINE.16 We also included some traditional examination-style questions from the Medical Knowledge Self-Assessment Program (MKSAP, American College of Physicians, Philadelphia, Pennsylvania) after converting them from multiple-choice to yes/no form. There were ten questions from each

Table 1. Potential Predicting Factors Influencing Successful Use of an Information Retrieval System by End-users Answering Clinical Questions in a Medical Library Setting Using MEDLINE

HrdSH—ever heard of subheadings (yes, no)
Answer—answer to question (yes, no, or insufficient
HrdExp—ever heard of explosions (yes, no)
Type—EBM type (therapy, diagnosis, harm, prognosis)
HrdPT—ever heard of publication types (yes, no)
PreAns—answer before searching (yes, no, or insufficient
UsedPT—ever used publication types (yes, no)
PracHard—practice easier or harder with computers
PreCorr—answer correct before search (true, false)
PreCert—certainty of answer before search (1 = high, 5 = low)
EnjComp—enjoy using computers (yes, no)
PostAns—answer after searching (yes, no, or insufficient
MedSpec—medical specialty will be entering (from list,
PostCorr—answer correct after search (true, false)
YrsNurse—years worked as a nurse (years, nurse practitioner
PostCert—certainty of answer after search (1 = high, 5 = low)
Preferred—who user would seek for answer (from list)
nurse practitionerSpec—nurse practitioner specialty (from list,
Stacks—whether searcher went to stacks (true, false)
Order—order question done by this search (2 to 6; all did
School—school student enrolled (medical or nurse practitioner)
Helpful—citations helpful answer (number)
Justified—citations justifying answer (number)
ProdSW—use productivity software once a week (yes, no)
Viewed—total MEDLINE references viewed (number)
OwnPC—own a personal computer (yes, no)
FTViewed—full-text documents viewed (number)
Modem—personal computer has a modem (yes, no)
Internet—personal computer connects to Internet (yes, no)
LitSrch—literature searches per month (number)
Quis—QUIS average for this searcher (number)
WebSrch—Web searches per month (number)
WebMed—Web searches for medical information per month
Retrieved—number of articles retrieved by user in terminal set(s)
Precision—user’s precision for retrieval of definitely or
TrainEBM—ever had instruction in evidence-based medicine
Recall—user’s recall for retrieval of definitely or possibly
HrdMsh—ever heard of MeSH terms (yes, no)

Experimental Protocol
To obtain subjects for the experiment, a convenience sample of senior medical students from Oregon Health & Science University (OHSU) and nurse practitioner students from OHSU and Washington State University–Vancouver was recruited by e-mail, paper mail, and, in the case of nurse practitioner students, announcements in classes. Students were offered remuneration of $100 for successful completion

The general experimental protocol was to participate in three sessions—a “large-group” session where the students would be administered questionnaires and receive an orientation to MEDLINE, the techniques of evidence-based medicine, and the experiment, followed by two hands-on sessions where they would do the actual searching, read the articles, and answer

The large-group sessions, consisting of 3 to 15 subjects at a time, took place in a computer training room. At each session, subjects were first administered a questionnaire on their personal characteristics and experience with computers and searching factors, from Table 1. Next they were tested for the following cognitive attributes, measured by validated instruments from the Educational Testing Service Kit of Cognitive Factors17 (ETS mnemonic in parentheses)—paper folding test to assess spatial visualization (VZ-2), nonsense syllogisms test to assess logical reasoning (RL-1), and advanced vocabulary test I to

These cognitive factors were assessed because they have been found to be associated with successful use of computer systems in general and retrieval systems

Spatial visualization—The ability to visualize spatial relationships among objects has been associated with retrieval system performance by nurses,18 ability to locate text in a general retrieval system,19 and ability to use a direct-manipulation (three-dimensional) retrieval system user interface.20

Logical reasoning—The ability to reason from premise to conclusion has been shown to improve selectivity in assessing relevant and nonrelevant citations

Verbal reasoning—The ability to understand vocabulary has been shown to be associated with the use of a larger number of search expressions and high-frequency search terms in a retrieval system.21

The large-group session also included a brief orientation to the searching task of the experiment as well as a 30-minute hands-on training session covering basic MEDLINE and evidence-based medicine principles. The following searching features were chosen for coverage—MeSH headings, text words, explosions, combinations, limits, and scope notes. These features were chosen because they are taught in medical informatics training courses for health care providers offered at OHSU, and they constitute a basic skill set for MEDLINE searching by a health care provider. The overview of evidence-based medicine described the basic notions of framing the appropriate question, determining which evidence would be most appropriate for a given question, and the best searching strategies for finding such evidence. The teaching was done by a medical informatician experienced in teaching MEDLINE and evidence-based medicine to

The hands-on sessions took place 2 to 4 weeks after the subject had completed the large-group session. He or she had been encouraged to practice the searching skills taught in the large-group session but was given no other explicit instructions. The searching sessions took place in the OHSU Library. All searching was done using the Ovid information retrieval system (Ovid Technologies, New York, New York), which accesses MEDLINE and a collection of 85 full-text journals. We used the Web-based version of Ovid. We also employed its logging facility, which enabled all search statements to be recorded as well as the number of citations presented to and

In the two hands-on sessions, subjects searched six questions. For the first question of the first session, each user searched the same “practice” question, which was not graded. This was done not only to make searchers comfortable with the experimental process but also because a previous study had suggested a learning effect among inexperienced searchers.22 The remaining five questions (the last two from the first session and all three from the second session) were selected at random from the pool of 20 questions. Question selection was without replacement, i.e., the same pool of questions was used for four consecutive searchers.

Table 2

1. Is there any benefit of routine Pap smear in persons who have had a hysterectomy for benign disease?
2. Is ultrasound the best diagnostic test available to exclude the presence of lower extremity deep vein thrombosis?
3. Are nonacetylated salicylates really safer, e.g., have less incidence of acid-peptic problems, in patients with NSAID (nonsteroidal anti-inflammatory drug) gastrointestinal intolerance (who benefit from anti-inflammatory effect)?
4. Is the elevation of alkaline phosphatase a better indicator of recurring prostate cancer than a rising PSA (prostate-specific
5. Is the Cytobrush superior to a spatula for obtaining cells for Pap smears, in terms of technical quality (e.g., percentage of interpretable smears)?
6. Does dietary protein affect the level of proteinuria in patients
7. Is there any benefit of ultrasound as physical therapy for
8. Is penicillin superior to ciprofloxacin for the outpatient treatment of pelvic inflammatory disease?
9. Is anti-inflammatory therapy (NSAIDs) better than Tylenol for elderly patients with degenerative joint disease?
10. Is there evidence of an association between petroleum

Questions derived from medical test questions:

1. Is a high-dose (1,200 to 1,500 mg daily) regimen of zidovudine therapeutically superior to a low-dose (500 to 600 mg daily) one for reducing the progression to AIDS in
2. Will PSA screening lower the mortality rate from prostate cancer in low-risk men after they reach the age of 50 years?
3. Is there good evidence that an antibiotic can prevent endocarditis in an 18-year-old woman with rheumatic heart disease (mild mitral regurgitation) who is to have a dental
4. A 52-year-old woman recently had a modified radical mastectomy for infiltrating ductal carcinoma of the breast. Her axillary lymph nodes are negative for tumor. Would estrogen receptor negativity be more likely to indicate a relatively poor prognosis for this patient, rather than thyroid hormone receptor positivity?
5. A 40-year-old premenopausal woman consults you about her risk of breast cancer. Does prior use of birth control pills
6. Does anti-reflux surgery in patients with Barrett’s esophagus reduce the risk of developing adenocarcinoma?
7. Is long-distance running associated with intervertebral disc
8. Would plasma norepinephrine levels indicate poor prognosis in congestive heart failure better than hyponatremia?
9. Is Trental (pentoxifylline) the best drug available to improve
10. Do the majority (> 50 percent) of terminal AIDS patients have clinical symptoms of cardiac involvement?

Subjects were limited to one hour per question. Before searching, each subject was asked to record a pre-search answer and a rating of certainty on a scale of 1 (most) to 5 (least) for the questions on which they would search. Subjects were then instructed to perform their searching in MEDLINE and to obtain any articles that they wanted to read either in the library stacks or in the full-text collection available online. They were asked to record on paper their post-search answer, the certainty of their answer (on the 1-to-5 scale), which articles justified their answer, and any article that they looked at in the stacks or in full-text on the screen. On completion of the searching, they were administered the Questionnaire for User Interface Satisfaction (QUIS) 5.0 instrument to measure their satisfaction with the searching system. QUIS measures user satisfaction with a computer system, providing a score from 0 (poor) to 9 (excellent) on a variety of user factors, with the overall score determined by averaging

Searching time for each question was measured using a wall clock. All user–system interactions were logged by the Ovid system software. The search logs were processed to count the number of search cycles (each consisting of the entry of a search term or Boolean combination of sets) and the number of full MEDLINE references viewed on the screen.

Answer Scoring

After all the hands-on searching sessions were completed, the actual answers to the questions were determined by the research team. This was done by assembling all the articles retrieved for each question and giving them, along with the question, to three members of the study team (WRH, MKC, and DHH). The three first designated an answer individually (blinded to any answers that subjects may have provided) and then worked out their differences by consensus. After the answers were designated, two members of the study team (WRH and MKC) graded the answer forms, resolving any differences by consensus. The primary measure of correctness was whether the subjects selected the correct answer from “yes, with adequate evidence”; “no, with adequate evidence”; or “insufficient evidence to answer question.”

Statistical Analysis

The usual appropriate statistical analysis for studies with a binary outcome measure (correct vs. incorrect) is logistic regression. However, traditional logistic regression is not appropriate with these data because it does not take into account the within-subject correlation, i.e., the fact that individual questions are not independent, because each searcher answered five questions. To account for this, the analyses were done using generalized estimating equations (GEEs), which account for within-subject correlation.24 All analyses, both univariate and multivariate, were done using GEE on version 8.01 of the SAS statistical

Recall-Precision Analysis

The goal of the recall-precision analysis was to identify a relative measure of recall and precision that could be used to determine its contribution to predicting successful answering of the question. We aimed to carry out the study using the approaches most commonly reported in the information retrieval literature, such as using domain experts to judge relevance, pooling documents within a single query to mask the number of searchers who retrieved it, and assessing interrater reliability. Because of limitations of the retrieval process and of study resources, we were not able to calculate absolute recall and precision. We instead calculated relative measures for each that would allow assessment of their association

The recall-precision analysis was performed by use of searching logs. The first challenge in this process was to determine which sets to use for each user and question in the analysis. Ovid and other Boolean-oriented systems produce sets of results. Usually, the first sets are large and later ones are smaller, as the search is refined. The user usually does not start looking at the sets until they are smaller and refined. For example, a search on the first question derived from medical test questions shown in Table 2 (“Is a high dose (1,200 to 1,500 mg daily) regimen of zidovudine therapeutically superior to a low dose (500 to 600 mg daily) regimen for reducing the progression to AIDS in patients with positive HIV antibody?”) would probably begin with sets created with the terms zidovudine and AIDS.
Each of these sets yields large numbers of articles, but their combination with AND as well as applications of limits (such as publication type) would yield a more

Table 3. Values of All Searching-related Factors for All Searches, Stratified by Student Type (Medical

We therefore wanted to restrict our recall-precision calculations to sets that the user would be likely to browse to view specific articles. We thus aimed to identify the “end queries” of the search process, which we identified as the terminal point of a search strategy. This was defined as the point at which the subject stopped refining (creating subsets) of a search and began using new search terms or new combinations of search terms. The document set retrieved by the end queries also had to include the documents cited by the subjects as justification for their post-

These rules for end queries were given to a graduate medical informatics student who was asked to read the rules and identify end queries in ten systematically selected query sets. The selected query sets represented different users and study questions and were from the beginning, middle, and end of query logs. The graduate student’s identification of end queries was compared with the selection of end queries for the same set by a member of the study team (PT). The graduate student and study team member identified 34 end queries. They initially agreed on 23 of the 34 end queries (67.6 percent). The rules were refined by consensus and then applied to all the study logs. End queries that retrieved 200 or more citations were excluded from the relevancy analysis. A total of 10,508 unique question/document pairs were identified and

To assess the reliability of the relevance judgment process and determine the number of relevance judges required per question/document pair, a pilot study using 100 documents, selected from a random sampling of five study questions, was performed. The judgments were made by six physicians, all either general internal medicine or medical informatics postdoctoral fellows. All six judges rated the relevance of all 100 documents using a three-point rating scale of “not relevant,” “possibly relevant,” and “definitely relevant.” Using Cronbach’s alpha, measured at 0.93, it was determined that three judges per question/document pair were sufficient for reliable assessment of relevance in the larger collection.
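As an illustration of the reliability statistic named in the pilot study, the following minimal sketch computes Cronbach's alpha by treating each judge as an "item" and each document as a "case." The numeric 0/1/2 coding of the three-point scale, the function name `cronbach_alpha`, and any ratings passed to it are assumptions made for illustration; the paper does not describe its computation at this level of detail.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a documents-by-judges table of numeric ratings.

    ratings: list of rows, one row per document, one column per judge.
    Assumed coding (not from the paper): 0 = not relevant,
    1 = possibly relevant, 2 = definitely relevant.
    """
    n_docs = len(ratings)
    n_judges = len(ratings[0])
    if n_docs < 2 or n_judges < 2:
        raise ValueError("need at least two documents and two judges")

    # Sample variance of each judge's ratings across documents.
    judge_vars = [
        variance([row[j] for row in ratings]) for j in range(n_judges)
    ]
    # Sample variance of the per-document total across judges.
    total_var = variance([sum(row) for row in ratings])
    if total_var == 0:
        raise ValueError("total score has zero variance; alpha is undefined")

    return n_judges / (n_judges - 1) * (1 - sum(judge_vars) / total_var)
```

With this formulation, judges who always agree yield an alpha of 1, and disagreement pulls the value down; the study's reported pilot value of 0.93 would correspond to the six-judge, 100-document table.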
Table 4. Cross-tabulation of Number and Percentage of Incorrect and Correct Answers Before and After Searching, for All Students, Medical Students,

To have each question/document pair rated by three judges, we could assess only half (5,254) of the documents retrieved by users, because of limited study resources. The six judges who participated in the pilot study also participated in the complete study. Three of them judged each unique question/document pair. All judgments were done using the MEDLINE record distributed in an electronic file, although they were encouraged to seek the full text of the article in the library, if necessary.
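The step from three judges' ratings to a per-user recall and precision can be sketched as follows. The consolidation rule mirrors the one the paper reports in its results (agreement of two or three judges wins; three-way disagreement becomes "possibly relevant"), and relevance is taken as "definitely or possibly relevant," as the paper states. The function names, and the choice to compute recall against the pool of judged-relevant documents for a question, are assumptions for illustration; the paper describes its measures only as relative ones.

```python
from collections import Counter

POSSIBLY = "possibly relevant"
DEFINITELY = "definitely relevant"
NOT_RELEVANT = "not relevant"


def consolidate(judgments):
    """Reduce exactly three judges' ratings to one final rating."""
    if len(judgments) != 3:
        raise ValueError("expected exactly three judgments")
    rating, count = Counter(judgments).most_common(1)[0]
    # Two or three judges agree: use their rating.
    if count >= 2:
        return rating
    # All three disagree: fall back to the middle rating.
    return POSSIBLY


def is_relevant(rating):
    """Relevant means rated definitely or possibly relevant."""
    return rating in (DEFINITELY, POSSIBLY)


def recall_precision(retrieved, relevant_pool):
    """Recall and precision for one user/question pair.

    retrieved: document ids in the user's end-query sets.
    relevant_pool: judged-relevant document ids for the question
    (hypothetical pooling across searchers; an assumption here).
    """
    retrieved, relevant_pool = set(retrieved), set(relevant_pool)
    hits = retrieved & relevant_pool
    recall = len(hits) / len(relevant_pool) if relevant_pool else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision
```

For example, a user who retrieved four documents, two of which are among three judged-relevant documents for the question, would have a recall of 2/3 and a precision of 1/2 under these assumptions.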
A total of 66 searchers—45 medical students and 21 nurse practitioner students—performed five searches each, for a total of 330 searches. Six searches were discarded, five because the user did not search MEDLINE and one because the user did not provide an answer, which left 324 searches for analysis.
General Results
NOTE: Percentages represent correct answers within each group of students.
There were several differences between medical and nurse practitioner students in this study (Table 3). Use of computers and use of productivity software were higher for nurse practitioner students, but searching experience was higher for medical students. Medical students also had higher self-rating of knowledge and experience with advanced MEDLINE features. Nurse practitioner students tended to be older, and all were female (compared with medical students, of whom 50 percent were female). Medical students also had higher scores on the three cognitive tests. In searching, medical students tended to view more sets but fewer references. They also had a higher level of satisfaction with the information retrieval

Prior to searching, the performance of all students was slightly worse than chance, with 104 (32.1 percent) correct and 220 (67.9 percent) incorrect answers. The rate of correctness before searching for medical and nurse practitioner students was virtually identical (32.3 vs. 31.7 percent), as was the rating of certainty (mean, 3.16 for medical students and 3.23 for nurse practitioner students), which was low for both groups.

Following searching, there were 150 (46.3 percent) correct answers and 174 (53.7 percent) incorrect answers. The medical students had a higher rate of correctness than nurse practitioner students (51.6 vs. 34.7 percent). Examination of the results in more detail (Table 4) shows that medical students were better able to use searching to convert incorrect answers into correct ones. Both groups had comparable rates of initially correct answers staying correct or

Statistical Analysis

The goal of the statistical analysis was to build a model of the factors associated with successful searching, as defined by the outcome variable of correct answer after searching (PostCorr). A GEE model was built after individual variables were screened for their p values, using ANOVA for continuous variables and chi-square tests for categorical variables (Table 5). We also made one adjustment in the data, which was to combine the measures of MEDLINE experience (asking subjects if they had heard of or used four advanced MEDLINE search features—MeSH terms, subheadings, explosions, and publication types) into a set of scale variables. The most statistically predictive scale variable was Used2, which allocated one point if the subject said they had used publication types and one point if they had used explosions

Table 5. Values of Searching-related Factors Stratified by Correctness of Answer, along with p Values of Screening, for Statistical Analysis

A backward variable selection scheme was performed to determine the best model that predicted correct answering of the question after the MEDLINE search. All variables that predicted the outcome with a p value less than 0.25 were included in the initial backward regression model. The variable with the highest p value was deleted from the model, and the model was then re-run until all variables had p values

After the backward scheme, variables were put back into the model to see whether any were significant. None of the excluded variables, when added to the final model, had a p value less than 0.10. Interaction terms were tested with the final model, and none

A forward variable selection scheme yielded the same best model. The final model showed that PreCorr, VZ2, Used2, and Type were significant (Table 6). For the variable Type (evidence-based medicine question type), questions of prognosis had the highest likelihood of being answered correctly, followed by questions of therapy, diagnosis, and harm. The analysis also found that the VZ2 and School variables demonstrated multi-collinearity, i.e., they were very highly correlated, and once one was in the model, the other did not provide any additional statistical significance. The VZ2 variable was included in the final model because it led to a higher overall p value for the model than School.
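The selection procedure described above can be sketched in outline. This is a hedged illustration only: `used2`, `screen`, and `backward_select` are hypothetical names, the model-fitting step is abstracted behind a caller-supplied `fit_p_values` function (the study fit GEE models in SAS, which this sketch does not reproduce), and the stopping threshold is left as a required parameter because the text does not state it in full. The 0.25 screening cutoff is the one given in the text.

```python
from typing import Callable, Dict, List


def used2(used_publication_types: bool, used_explosions: bool) -> int:
    """Scale variable: one point per advanced feature the subject has used."""
    return int(used_publication_types) + int(used_explosions)


def screen(univariate_p: Dict[str, float], entry_cutoff: float = 0.25) -> List[str]:
    """Keep variables whose univariate screening p value is below the cutoff."""
    return [name for name, p in univariate_p.items() if p < entry_cutoff]


def backward_select(
    candidates: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    stay_cutoff: float,
) -> List[str]:
    """Drop the variable with the highest p value and refit, repeatedly.

    fit_p_values: hypothetical callback that fits a model on the given
    variables and returns a p value for each of them.
    stay_cutoff: assumed stopping threshold supplied by the caller.
    """
    kept = list(candidates)
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] < stay_cutoff:
            break  # every remaining variable meets the threshold
        kept.remove(worst)
    return kept
```

In use, the caller would pass the screened variable names and a function that refits the model on each reduced set; a variable such as `Type`, which has several levels, would need a single overall p value per fit for this loop to apply.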
Next, a similar analysis was done to find the best model using the subset of cases (n = 220) in which thesubject did not have the right answer before the MED- LINE search. As shown in Table 7, the final best model was very similar to the model for all questions, with PreCorr obviously excluded. VZ2 and School again To further assess the finding that success in answering questions varied on the basis of evidence-based medi- cine question type, we looked at the rate of correctness for the four types (Table 8). Because of the exploratory nature of this analysis, we did not perform any statis- tical analysis. We did find, however, that all subjects did best with prognosis questions, intermediately well with therapy questions, and worst with diagnosis and Statistical Model for All Questions, Including Regression Value with Its p Value, Odds Ratio, and 95% Confidence Interval (CI) for the Odds Overall p value for type: <0.0001 Journal of the American Medical Informatics Association Volume 9 Number 3 May / Jun 2002 harm questions. The largest gap between medical and nurse practitioner students was with harm and thera- Rate of Correctness by Evidence-based Medicine py questions; nurse practitioner students did slightly Recall–Precision Analysis
Three relevance judgments were made on each of the 5,254 question/document pairs using the responses “not relevant,” “possibly relevant,” and “definitely relevant.” The judges achieved 100 percent agreement (all three judges choosing the same rating) for 4,265 of the judgments (81.2 percent), with partial agreement (two of three judges choosing the same rating) for 918 judgments (17.5 percent), and complete disagreement for 71 (1.3 percent). There were 20 unique groupings of the six relevance judges. For these 20 subsets of relevance judgments, the range of reliability, measured by Cronbach’s alpha, was 0.69 to 0.86. The weighted average of these measures was 0.81. Final document relevance was assigned according to the following rules: 1) If all judges agreed, the document was assigned that rating. 2) If two judges agreed, the document was assigned that rating. 3) If all three judges disagreed, the document was assigned the “possibly relevant” rating. The relevance judgments were then used to calculate recall and precision for each user/question pair.

For the 20 questions, 131 documents were judged definitely relevant (average, 6.6 per question) and 528 were judged possibly relevant (average, 26.4 per question). The calculation of recall and precision was done defining relevant documents as those rated definitely or possibly relevant. (Limiting relevance to those defined as definitely relevant only would have left many students’ questions with a recall of 0 percent.) As shown in Table 3, there was virtually no difference in recall and precision between medical and nurse practitioner students. Likewise, Table 5 shows that there was no difference in recall and precision between questions that were answered correctly and [...]

[Table note: NP indicates nurse practitioner.]

[Table caption: Statistical Model for Questions Whose Answers Were Incorrect before Searching, Including Regression Value with Its p Value, Odds Ratio, and 95% Confidence Interval (CI) for the Odds [...]]

Discussion

This study assessed the ability of a convenience sample of medical and nurse practitioner students to answer clinical questions by searching the literature and using the techniques of evidence-based medicine. We found that this task was challenging for students at this level of experience. They spent an average of more than 30 minutes conducting literature searches and were successful at correctly answering [...]

One of the main findings of the study was that medical students were able to use the information retrieval system to improve question answering, while nurse practitioner students were led astray by the system as often as they were helped by it. Another main finding was that experience in searching MEDLINE and spatial visualization ability were associated with the successful [...] Subjects were also better able to answer certain types of questions in the evidence-based medicine framework than others, doing best with questions of prognosis and worst with those of diagnosis and harm.
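As a minimal sketch of the relevance-assignment rules and the recall and precision calculation reported in the Recall–Precision Analysis, the following Python fragment applies the three stated rules to a set of three judgments per document and then computes recall and precision, counting documents rated definitely or possibly relevant as relevant. The function names, document identifiers, and example ratings are hypothetical illustrations and are not taken from the study's data or software; returning 0.0 when a denominator is empty is also an assumption made here for convenience.

```python
from collections import Counter

def final_relevance(judgments):
    """Assign a final rating from exactly three judges' ratings.

    Rules 1 and 2: if all three or two of three judges agree, use that rating.
    Rule 3: if all three disagree, assign "possibly relevant".
    """
    if len(judgments) != 3:
        raise ValueError("expected exactly three judgments")
    rating, votes = Counter(judgments).most_common(1)[0]
    return rating if votes >= 2 else "possibly relevant"

def relevant_documents(final_ratings):
    """Treat documents rated definitely or possibly relevant as relevant."""
    return {doc for doc, rating in final_ratings.items()
            if rating in ("definitely relevant", "possibly relevant")}

def recall_precision(retrieved, relevant):
    """Return (recall, precision) for sets of document identifiers.

    Assumption: an empty denominator yields 0.0 rather than an error.
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical judgments for one question (not study data).
judged = {
    "doc1": ["definitely relevant", "definitely relevant", "possibly relevant"],
    "doc2": ["not relevant", "not relevant", "possibly relevant"],
    "doc3": ["not relevant", "possibly relevant", "definitely relevant"],
    "doc4": ["possibly relevant", "possibly relevant", "possibly relevant"],
}
final = {doc: final_relevance(ratings) for doc, ratings in judged.items()}
relevant = relevant_documents(final)

# Hypothetical set of documents retrieved by one user for this question.
retrieved = {"doc1", "doc2", "doc3", "doc9"}
recall, precision = recall_precision(retrieved, relevant)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Under this definition, a user who retrieves none of the relevant documents has a recall of 0, which is the situation the text notes would have been common had only "definitely relevant" documents been counted.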
Another major finding was that the often-studied measures of recall and precision were virtually identical between medical and nurse practitioner students and had no association with the correct [...]

Our results showed some similarities to and some differences from a prior study.¹⁴ Somewhat similarly to this study, the prior study found that the most predictive factor of successful question answering was student type (medical vs. nurse practitioner). In that study, spatial visualization showed a trend toward predicting successful answering, but it was short of [...] In the previous study, unlike this one, the question-answering abilities of both medical and nurse practitioner students improved with use of the information retrieval system. Literature searching experience in that study, as in this one, was associated with the correct answering of questions. Factors that did not predict success in the previous study included age, gender, general computer experience, attitudes toward computers, other cognitive factors (logical reasoning, verbal reasoning, and associational fluency), Myers-Briggs personality type, and user satisfaction with the information retrieval system. One limitation of the prior study was that it did not assess the application of evidence-based medicine principles in the [...]

The findings of this study are consistent with (and build on) the results of other studies of searching by medical students. Previous studies have shown that training and experience with MEDLINE lead to improved retrieval of relevant articles²⁵,²⁶ and increased use of MEDLINE in clinical settings.²⁷ Other than our previous study, described above, there are no other studies of searching by nurse practitioner [...]

This study supports the observation that the traditional measures to evaluate information retrieval systems, recall and precision, may have little value in the assessment of how well a system can be used in a real-world setting. While users obviously need to retrieve relevant articles to answer questions, the quantity of relevant articles retrieved had no bearing on the ability to answer them correctly in this study. These findings give credence to those who argue that researchers put too much emphasis on these measures as primary indicators of system efficacy.³⁻⁵ They also verify the nonmedical TREC studies that [...]

Our results have significant implications for the use of information retrieval systems in clinical settings. The ability to answer clinical questions with the aid of MEDLINE is low. Further research is needed to determine whether additional training, either through the curricula or as part of the study, would change this outcome. Because we used a convenience sample, further research is needed to see whether our findings of differences between medical and nurse practitioner students are generalizable.

For both groups of students, the amount of time taken to answer questions is longer than the amount of time usually devoted to a single patient. Clearly this type of information seeking is practical only “after hours” and not in the clinical setting. Indeed, a growing trend in the evidence-based medicine movement is toward the development of “synthesized” evidence-based content.²⁸ It may well be that further emphasis should be put on the development of these sorts of information resources for the clinical setting.

One finding of the study, with uncertain meaning, was the strong association of spatial visualization ability with the ability to use an information retrieval system to successfully answer clinical questions. As this variable had such strong multi-collinearity with whether a subject was enrolled in medical or nurse practitioner school, determining which was causal [...]

It may be instructive to explore other results that link computer tasks to spatial visualization. Egan and Gomez²⁹ have shown that spatial visualization is associated with two processes in text editing: finding the location of characters to be edited and generating a syntactically correct sequence of actions to complete the task. Similarly, Vincente et al.³⁰ have found that the ability to use a hierarchic file system is associated with spatial visualization as well as with vocabulary skills. In addition, Allen²¹ has shown that this trait is associated with the appropriate selection [...]

This study had some additional limitations. The use of students, albeit in late stages of their training, limits the generalizability of the results beyond those at their level of clinical training. In future studies, community practitioners will also be included. This study was also limited by taking place in a laboratory setting, in that behaviors in the pursuit of actual clinical knowledge in a real clinical setting may be different from those shown in this controlled environment. However, the ability to use a defined set of tasks and questions provides a benefit that cannot be obtained in the real clinical setting.

In conclusion, this study shows that students in clinical training are at best moderately successful at answering clinical questions correctly with the assistance of searching the literature. Determining the reasons for the limited success of question answering in this study requires further research. The possibilities include everything from inadequate training to an inappropriate database (i.e., a large bibliographic database instead of more concise, synthesized references), problems with the retrieval system, and difficulties in judging evidence. Further studies must develop a priori hypotheses to determine the optimal use of information retrieval systems by clinicians.

References

1. Hersh WR, Hickam DH. How well do physicians use electronic information retrieval systems? A framework for investigation and review of the literature. JAMA. 1998;280:1347–52.
2. Norman GR, Shannon SI. Effectiveness of instruction in critical appraisal (evidence-based medicine): a critical appraisal. [...]
3. Swanson DR. Historical note: Information retrieval and the future of an illusion. J Am Soc Inf Sci. 1988;39:92–8.
4. Harter SP. Psychological relevance and information science. J [...]
5. Hersh WR. Relevance and retrieval evaluation: perspectives from medicine. J Am Soc Inf Sci. 1994;45:201–6.
6. Hersh W, Turpin A, Price S, et al. Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Inf [...]
7. Egan DE, Remde JR, Gomez JM, Landauer TK, Eberhardt J, Lochbaum CC. Formative design-evaluation of Superbook. [...]
8. Mynatt BT, Leventhal LM, Instone K, Farhat J, Rohlman DS. Hypertext or book: Which is better for answering questions? Proceedings of Computer-Human Interface ’92. 1992:19–25.
9. Wildemuth BM, de Bliek R, Friedman CP, File DD. Medical students’ personal knowledge, searching proficiency, and database use in problem solving. J Am Soc Inf Sci. 1995;46:590–607.
10. Friedman CP, Wildemuth BM, Muriuki M, et al. A comparison of hypertext and Boolean access to biomedical information. Proc AMIA Annu Fall Symp. 1996:2–6.
11. Hersh WR. Interactivity at the Text Retrieval Conference (TREC). Inf Proc Manag. 2001;37:365–6.
12. Hersh WR, Molnar A. Toward new measures of information retrieval evaluation. Proceedings of the 18th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval; Seattle, Washington. [...]
13. Hersh WR, Pentecost J, Hickam DH. A task-oriented approach to information retrieval evaluation. J Am Soc Inf Sci. [...]
14. Hersh WR, Crabtree MK, Hickam DH, Sacherek L, Rose L, Friedman CP. Factors associated with successful answering of clinical questions using an information retrieval system. Bull Med Libr Assoc. 2000;88:323–31.
15. Fidel R, Soergel D. Factors affecting online bibliographic retrieval: a conceptual framework for research. J Am Soc Inf Sci. 1983;34:163–80.
16. Gorman PN, Helfand M. Information seeking in primary care: how physicians choose which clinical questions to pursue and which to leave unanswered. Med Decis Making. [...]
17. Ekstrom RB, French JW, Harmon HH. Manual for Kit of Factor-referenced Cognitive Tests. Princeton, NJ: Educational [...]
18. Staggers N, Mills ME. Nurse–computer interaction: staff performance outcomes. Nurs Res. 1994;43:144–50.
19. Gomez LM, Egan D, Bowers C. Learning to use a text editor: some learner characteristics that predict success. Human–Computer Interaction. 1986;2:1–23.
20. Swan RC, Allan J. Aspect windows, 3-D visualization, and indirect comparisons of information retrieval systems. Proceedings of the 21st Annual International ACM Special Interest Group in Information Retrieval; Melbourne, Australia. New York: ACM Press, 1998:173–81.
21. Allen BL. Cognitive differences in end-user searching of a CD-ROM index. Proceedings of the 15th Annual International ACM Special Interest Group in Information Retrieval; Copenhagen, Denmark. New York: [...]
22. Rose L, Crabtree K, Hersh W. Factors influencing successful use of information retrieval systems by nurse practitioner students. Proc AMIA Annu Fall Symp. 1998:1067.
23. Chin JP, Diehl VA, Norman KL. Development of an instrument measuring user satisfaction of the human–computer interface. Proceedings of CHI ’88: Human Factors in Computing Systems. New York: ACM Press, 1988:213–8.
24. Hu FB, Goldberg J, Hedeker D, Flay BR, Pentz MA. Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes. Am J [...]
25. Pao ML, Grefsheim SF, Barclay ML, Woolliscroft JO, Shipman BL, McQuillan M. Effect of search experience on sustained MEDLINE usage by students. Acad Med. 1994;69:914–20.
26. Mitchell JA, Johnson ED, Hewett JE, Proud VK. Medical students using Grateful Med: analysis of failed searches and a six-month follow-up study. Comput Biomed Res. 1992;25:43–55.
27. Pao ML, Grefsheim SF, Barclay ML, Woolliscroft JO, McQuillan M, Shipman BL. Factors affecting students’ use of MEDLINE. Comput Biomed Res. 1993;26:541–55.
28. Hersh WR. “A world of knowledge at your fingertips”: the promise, reality, and future directions of online information retrieval. Acad Med. 1999;74:240–3.
29. Egan DE, Gomez LM. Assaying, isolating, and accommodating individual differences in learning a complex skill. In: Dillon R (ed). Individual Differences in Cognition, vol 2. New [...]
30. Vincente KJ, Leske JS, Williges RC. Assaying and isolating individual differences in searching a hierarchical file system. [...]

