Extraction of Conditional Probabilities of the Relationships Between Drugs, Diseases, and Genes from PubMed Guided by Relationships in PharmGKB
Martin Theobald, Ph.D1, Nigam Shah, M.B.B.S., Ph.D2, and Jeff Shrager, Ph.D3
Departments of (1) Computer Science, (2) Biomedical Informatics,
Stanford University, Stanford, CA 94305 USA
Abstract

Guided by curated associations between genes, treatments (i.e., drugs), and diseases in pharmGKB, we construct Bayesian networks based on conditional probability tables (cpt’s) extracted from co-occurrence statistics over the entire Pubmed corpus, producing a broad-coverage analysis of the relationships between these biological entities. The networks suggest hypotheses regarding drug mechanisms, treatment biomarkers, and/or potential markers of genetic disease. The cpt’s enable Trio, an inferential database, to query indirect (inferred) relationships via an SQL-like query language.

The goal of clinical research can be thought of as seeking the conditional probability (cp) of a cure given particular treatments and diseases; in terms of conditional probabilities: statistically quantified and directed relationships of the form p(cure|treatment,disease) [hereafter: p(c|t,d)]. Meta-analysis over clinical trials can obtain an [...] statistically across trials. Such meta-analyses extract a p(c|t,d) that is statistically tacit in the literature. In the present work we explore other potentially useful statistically tacit results available in the medical literature. Specifically, we compute conditional probabilities between treatments (usually drugs), diseases, and genes: p(t|d,g), by analyzing their co-occurrence in Pubmed (www.ncbi.nlm.nih.gov/pubmed). Although not as directly useful as p(c|t,d), these cp’s and their algebraic co-forms may be interpreted in a number of useful ways, for example as personalized (e.g., genetically guided) treatment hypotheses [p(t|g,d)], as drug mechanism hypotheses [p(g|t)], as treatment-response predictive biomarkers [p(g|t,d)], or as potential markers of genetic diseases [p(g|d)].

Suppose, for example, a patient is given a particular diagnosis, and a genomic analysis reveals a mutation in a gene for which a targeted treatment has been explored in the scientific literature, but for which there is no approved specific treatment. p(t|g,d) gives one a sense of which treatments have been explored the most in this circumstance. Once a treatment is chosen, [...] responses to watch in this patient (often not the same as the mutated gene). Here p(g|t,d) may offer a sense of what the literature suggests as [...] Thus the conditional probabilities among these entities across the scientific literature may lead to practical new hypotheses, and support inference to p(c|t,d), or eventually even to the holy grail of personalized genetic medicine: p(c|t,d,g).

Background

Many researchers have extracted association-based knowledge from the medical literature. Zhu et al. [1] computed co-occurrence of compounds and genes, and Jenssen et al. [2] computed a gene-to-gene co-citation network. These are relatively simple computations. Extracting conditional multivariate statistics is [...] computing all combinations of associations, plus background counts for normalization, and the potential vocabulary is very large. Wren [3] extracted a network of associations among genes, diseases, phenotypes, drugs, etc. using the mutual information of shared associations from Pubmed abstracts over a set of 10,000 common words. In order to control the computational complexity and avoid saturation (which is likely [...] calculations to only 100,000 abstracts. Similarly, Narayanasamy et al. [4] mined co-occurrence in Pubmed to build an association graph and ranked associations co-occurring with both the objects (equivalent to mutual information). Although most of these projects uncovered various suggestive associations, they have either used a small corpus, focused on only one kind of entity (e.g. gene-gene), focused only on co-occurrence (which is symmetric as opposed to conditional probability), or recognized concepts from only one (or few) ontologies. In the present work we mine quantified, directed (i.e., asymmetric) drug/gene/disease relationships over the entirety of more than 19 million Pubmed abstracts and [...]

We seek to extract all-way co-occurrence-based Bayesian networks among treatments (primarily drugs for this study), diseases, and genes. These can be estimated from subsets of conditional and non-conditional probabilities which are in turn derived from raw co-occurrence counts of drug/disease/gene entities in domain-specific corpora such as Pubmed. For non-conditional statistics, such a co-occurrence probability would be the number of documents (abstracts) that mention these items together, divided by the total number of documents contained in the corpus. The desired conditional probabilities are: p(drug|gene), p(drug|disease), p(drug|gene,disease), etc. One can easily see how to compute such conditional probabilities over an appropriately annotated Pubmed database, simply by counting the single and combinational co-occurrences of all of these entities, and performing the obvious calculation, i.e., p(drug A|gene B, disease C) = (# distinct abstracts containing A and B and C)/(# distinct abstracts containing B and C). Notice that more general relationships are conceivable, i.e., considering many-to-many relationships between drugs, diseases, and genes. In the present experiment we limit our Bayesian network to a maximum of six conditional variables and a single target variable, thus extracting up to 2^6 conditional [...] combinatorial complexity, and hence the number of co-occurrence queries issued against our [...]

The problem with this approach is that an enormous number of combinations must be computed. If there are, say, 20,000 genes, a thousand drugs or investigational drugs, and a thousand diseases, 2^22,000 combinations would have to be calculated. In order to make headway in this endeavor, we need guidance on which combinations to explore. One source of guidance could be a user query about the relationships between particular treatments, diseases, and genes. It seems unlikely, however, that a user would come up with likely combinations a priori. Instead, we draw on pharmGKB, which explicitly (although non-statistically) relates drugs, genes, and diseases (and other entities). PharmGKB offers relationships between drugs, diseases, and genes, based on specific papers and different types of evidence ranging from “clinical outcome” to simply “discussed”. (We dropped any marked “not related”.) Note that, although we use these relations in pharmGKB to guide our analysis, we do not prioritize the specific papers used in pharmGKB, but use the entire Pubmed database for our statistics. Thus no quantitative bias is introduced by the papers curated into pharmGKB.

Guided by the relations in pharmGKB (late 2007 snapshot of the pharmGKB database), we combined information from a tagged Pubmed corpus created by processing all Medline abstracts using the Mgrep tool (University of Michigan). Mgrep uses all of the alternative strings for UMLS concepts and identifies their occurrence in the abstract using a radix tree [...] processing without sacrificing precision [5]. In our experience, this method has an average precision of about 85% for diseases [6]. (We have not evaluated precision for other entities.) [...] experiment contains ~19 million articles and ~200 concepts assigned to each article resulting [...] Concept Unique Identifier (CUI) assignments. We combined these data with the highly reliable [...] gene/DATA). Using only the relationships marked as “related” or “positively related”, we extracted 1,730 disease/drug/gene relationships with up to 6 conditional variables, and extracted their respective conditional probability tables using co-occurrence statistics over the ~3 billion distinct Pubmed ID/CUI pairs, resulting in
19,092 conditional probabilities (again, compare with ~2^22,000 for the full-joint distribution). Although this is clearly still an offline process, requiring several days, once extracted, these tables serve as input for our Bayesian nets and allow for an efficient execution of arbitrary inferential queries; any conditional probability of variables/entities expressed in a pharmGKB relationship can be directly computed from these.

Results and Extensions

The result of our method is a miniature Bayesian network for each of the pharmGKB relationships. For example: p(antidepressants | affective disorders, GNB3) = 0.33 (abbreviated: p(an|af,g) = 0.33). (Values are rounded to 2 decimal places, [...] abbreviations.) That is, out of all the documents that mention “affective disorder” and gene “GNB3”, about one third also mention antidepressants. The subordinate relationships in this set include: p(~an|af,g) = 0.67, p(an|~af,g) = 0.04, p(~an|~af,g) = 0.96, p(an|af,~g) = 0.11, p(~an|af,~g) = 0.89, p(an|~af,~g) = 0.0, and p(~an|~af,~g) = 1.0. (Note that complementary conditional probabilities add to 1.0.) Many of these subordinate relationships may, of course, [...] p(azidothymidine | HIV, ABCC4) = 0.6. That is, in 60% of the papers where HIV and ABCC4 are mentioned, azidothymidine is also mentioned.

Note that the order of the contexts (following the vertical bar in the c.p.) is not relevant, but the targeted posterior (left side of the vertical bar) is relevant, and that the relationships are not symmetric across the conditional (vertical bar). Contrast, for example: p(mercaptopurine | azathioprine, thioguanine, TPMT) = 0.84, p(thioguanine | azathioprine, mercaptopurine, [...] mercaptopurine, thioguanine, TPMT) = 0.89. And the subordinate relationships: p(azathioprine | thioguanine, TPMT) = 0.86, p(thioguanine | azathioprine, TPMT) = 0.44. The most clinically important results are, of course, the conditional probabilities of different treatments (drugs), for the same disease&gene combination. For example: p(salmeterol | Asthma, ADRB2) = 0.07 and p(salbutamol | Asthma, ADRB2) = 0.16.

Aside from direct relationships, one may want to assess indirect (i.e., inferred) relationships based on the very same precomputed nets. There are, of course, many more of these than the already burgeoning set of direct relationships. By directly extracting the conditional probabilities as input into a Bayesian net, our method allows for a more compact representation of the desired dependencies than would be possible via capturing the full joint distribution of all variables involved in such a relationship. Any conditional probability of variables involved in the net can be efficiently derived via Bayesian inference. For example, marginalized conditional probabilities of the form p(disease|gene) can be directly calculated from a conditional probability table initially extracted for p(drug|gene,disease) without going back to the source to extract more co-occurrence statistics. In this special setting, the obtained net always has a tree structure, permitting linear-time inference queries. This approach can be extended to extract arbitrary relationships, by sampling co-occurrence statistics from Pubmed for arbitrary text tokens.

Implementation

The present work is implemented as an extension of the Trio system [7], a database system for the integrated management of data, uncertainty and lineage. Trio uses an extended relational schema to capture data uncertainty (in the form of alternative attribute values and confidences [...] alternatives), as well as data lineage (i.e., pointers to internal or external sources of the data). For the specific inference setting explored here, it turns out that the notion of lineage can nicely be generalized to capturing arbitrary relationships between entities (or records in a database), thus providing pointers to other entities (again other records), which allows for a convenient way of encoding Bayesian nets directly on top of this extended relational setting. Each present/absent combination of variables in a pharmGKB relationship is encoded as a different alternative of such an “uncertain” record, along with a confidence value, which allows us to [...] distributions (including cpt’s) for each record in [...] Moreover, this affords a simple, declarative way of issuing true inference queries on top of the precomputed conditional and non-conditional [...]
would simply select the conditional probabilities p(mercaptopurine|azathioprine,thioguanine) from the precomputed cpt’s for DRUGS. Conversely (still based on the same input table DRUGS capturing the cpt’s for the initial target variable mercaptopurine, but also pointers to the non-conditional priors of azathioprine and thioguanine), we can, for example, initiate an on-[...] p(azathioprine|mercaptopurine), thus swapping the direction of the conditional probability and marginalizing the distribution (i.e., eliminating the conditional variable thioguanine) in a single, [...]

The result of this inferential query is a new cpt for p(azathioprine|mercaptopurine) that had not been precomputed, and whose computation is triggered by the “COMPUTE INFERENCE” clause using the inferencing extension in Trio. Issuing such an inference query is much faster over these simple (in our case tree-like) Bayesian nets than going back to the entire Pubmed database and mining for the respective co-occurrence statistics at query processing time. Issuing an inference query in Trio over the precomputed cpt’s takes less than a second, whereas extracting the raw co-occurrence statistics from the entire set of Pubmed abstracts for a single pharmGKB relation with up to 6 variables may take several minutes in our current, [...]

Limitations and Directions

The probabilities that we derive reflect only co-occurrence in the literature, and not, for example, recommendations, so one must be cautious in interpreting these results. What, then, are they telling us, and is what they are telling us useful? Because the literature is historical, these results are not telling us what to try, but what has been tried, or, possibly, what has been suggested (if not actually tried). Under this analysis one might regard a high conditional probability as a sort of ranking of hypotheses regarding potential treatments given the context of disease&gene combinations, and, symmetrically: hypotheses about the potential biomarkers (genes) given the context of a disease&treatment combination.

Interpretation aside, many aspects of this present analysis need improvement before this method can be applied. First, as with most statistical methods applied to natural language, we have ignored the specific relationships between the entities, both in pharmGKB and in Pubmed, and especially the possibility of negatively expressed relationships. Of course, we already know, from [...] correlation, because we filter out those that are marked as “not related” in that database. Moreover, given that our statistics include a huge number of papers, it is unlikely that a large fraction of them are telling us that "drug X does NOT have any effect on disease Y" (etc.), especially as the scientific literature does not often report negative results. Second, the particular tagger that we used does a poor job of [...] resolution is dependent on the UMLS CUIs. We use all synonymous strings for a CUI while doing the tagging, and the output only contains the CUI. This is not a particularly good solution [...]

We recognize that our use of Mgrep, as well as our use of co-occurrence as a substitute for actual relationships, may introduce significant inaccuracies. In a project in progress we are generating parse trees of each sentence in the abstract and then only using the noun phrases for recognizing mentions of diseases and drugs. This [...]

More critically, because this method is focused [...] relations that might be important, but which are [...] resolve this might be to build the method into a search engine and use the combinations that are explicitly searched for as guidance (instead of using pharmGKB); indeed, this is explicitly enabled by the Trio infrastructure, and given the precalculated Pubmed/CUI database, seeking any given relationship set takes only a few minutes over the entire set of annotated Pubmed abstracts. There could be other sources of such guidance as well. For example, one could use the literature itself: co-mentions in specific abstracts, either all of them (only a few million computations vs. 2^22,000), or perhaps a reduced hash that selects all unique co-mentions (certainly less than the [...]
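
The count-based calculations discussed in this paper can be illustrated with a minimal Python sketch. It is not the Trio or Mgrep implementation: the toy corpus, the concept tags, and the function names (`count`, `cond_prob`, `cpt`, `marginalize`, `invert`) are all hypothetical and chosen only for illustration. The sketch computes p(target | context) as a ratio of distinct-abstract counts, enumerates the present/absent combinations of the context variables, and then derives a marginalized and a direction-swapped conditional probability.

```python
from itertools import product

# Hypothetical toy corpus: abstract ID -> set of concept tags (stand-ins for CUIs).
ABSTRACTS = {
    1: {"azathioprine", "mercaptopurine", "TPMT"},
    2: {"azathioprine", "TPMT"},
    3: {"mercaptopurine", "TPMT"},
    4: {"azathioprine", "mercaptopurine"},
    5: {"TPMT"},
    6: set(),
}

def count(present, absent=()):
    """Distinct abstracts mentioning every concept in `present` and none in `absent`."""
    return sum(
        1 for tags in ABSTRACTS.values()
        if all(c in tags for c in present) and not any(c in tags for c in absent)
    )

def cond_prob(target, present, absent=()):
    """p(target | present, ~absent) as a count ratio; None if the context never occurs."""
    denom = count(present, absent)
    if denom == 0:
        return None
    return count(list(present) + [target], absent) / denom

def cpt(target, context):
    """Table over all 2^k present/absent combinations of the k context variables."""
    table = {}
    for flags in product([True, False], repeat=len(context)):
        present = [c for c, f in zip(context, flags) if f]
        absent = [c for c, f in zip(context, flags) if not f]
        table[flags] = cond_prob(target, present, absent)
    return table

def marginalize(target, keep, drop):
    """p(target | keep), eliminating `drop` by summing over its two states.
    Assumes both contexts occur in the corpus (no None entries)."""
    p_drop = cond_prob(drop, [keep])
    return (cond_prob(target, [keep, drop]) * p_drop
            + cond_prob(target, [keep], [drop]) * (1 - p_drop))

def invert(target, given):
    """p(given | target) from p(target | given) and the two priors (Bayes' rule)."""
    n = len(ABSTRACTS)
    return cond_prob(target, [given]) * (count([given]) / n) / (count([target]) / n)

table = cpt("mercaptopurine", ["azathioprine", "TPMT"])
p_marginal = marginalize("mercaptopurine", "azathioprine", "TPMT")
p_swapped = invert("mercaptopurine", "azathioprine")
```

On this toy corpus the table has four entries, one per present/absent combination of the two context variables, and the marginalized and swapped values agree with the same quantities counted directly from the abstracts. With six context variables the same enumeration yields the 2^6 entries per relationship mentioned earlier.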
Regardless of these proposed extensions to the [...] approaches to its limitations, exhaustive evaluation is clearly needed to justify its utility.

Acknowledgements

MT’s work was supported by NSF grants IIS-0324431 and IIS-0414762 and by grants from the Boeing and Hewlett-Packard Corporations. NS’s work was supported by NIH grant U54 HG004028 and a gift from CommerceNet. JS’s work was supported by CollabRx, Inc. The [...]

References

1. Zhu, S., Okuno, Y., Tsujimoto G., Mamitsuka H. 2005. Mining literature co-occurrence data using a probabilistic model. IPSJ SIG Technical [...]
2. Jenssen, T-K, Lægreid, A, Komorowski, J., Hovig, E. A literature network of human genes for high-throughput analysis of gene expression. [...]
3. Wren, J.D. 2004. Extending the mutual information measure to rank inferred literature relationships. BMC Bioinformatics. 5:145.
4. Narayanasamy V, Mukhopadhyay S, Palakal [...] transitive associations among biological objects from text. J Biomed Sci. 11(6):864-73.
5. Dai, M., Shah, N.H., Xuan, W., Musen, M.A., Watson, S.J., Athey, B., Meng, F. 2008. An Efficient Solution for Mapping Free Text to Ontology Terms. Poster at AMIA Summit on Translational Bioinformatics, San Francisco.
6. Bhatia N., Shah N.H., Rubin D.L., Chiang A.P. [...] Recognizers for Ontology-Based Indexing: MGREP vs. MetaMap. Paper accepted to the AMIA Summit on Translational Bioinformatics, [...]
7. Benjelloun, O. Das Sarma, A., Halevy, A. [...] databases with uncertainty and lineage. The International Journal on Very Large Data Bases, [...]