[...] issue anaphora. We propose a candidate-ranking model for this-issue anaphora resolution that explores different issue-[...] to nominal or verbal antecedents; rather, it is able to identify antecedents that are arbitrary spans of text. [...] results show that (a) the model outperforms [...] as distinguished from issue-specific features [...] NPs such as this problem and this debate; and (c) it is possible to reduce the search space in order to improve performance.
Anaphora in which the anaphoric expression refers to an abstract object such as a proposition, a property, or a fact is known as abstract object anaphora. This is seen in the following examples.
(1) [Be careful what you wish... because wishes [...] Semiconductor Industry Association, which represents U.S. manufacturers, has been learning.
(2) This prospective study suggested [that oral carvedilol is more effective than oral metoprolol in the prevention of AF after on-pump CABG]_i. It is well tolerated when started before and continued after the surgery. However, further prospective studies are needed to clarify [this issue]_i.
(3) In principle, he said, airlines should be allowed [to sell standing-room-only tickets for adults]_i, as long as [this decision]_i was approved by [...]
These examples highlight a difficulty not found with nominal anaphora. First, the anaphors refer to abstract concepts that can be expressed with different syntactic shapes which are usually not nominals. The anaphor That in (1) refers to the proposition in the previous utterance, whereas the anaphor this issue in (2) refers to a clause from the previous text. In (3), the anaphoric expression this decision refers to a verb phrase from the same sentence. Second, the antecedents do not always have precisely defined boundaries. In (2), for example, the whole sentence containing the marked clause could also be thought to be the correct antecedent. Third, the actual referents are not always the precise textual antecedents. The actual referent in (2), the issue to be clarified, is whether oral carvedilol is more effective than oral metoprolol in the prevention of AF after on-pump CABG or not, a variant of the antecedent text.
Generally, abstract anaphora, as distinguished from nominal anaphora, is signalled in English by pronouns this, that, and it (Müller, 2008). But in abstract anaphora, English prefers demonstratives to personal pronouns and definite articles.¹
¹ This is not to say that personal pronouns and definite articles do not occur in abstract anaphora, but they are not common.
Demonstratives can be used in isolation (That in (1)) or with nouns (e.g., this issue in (2)). The latter follows the pattern demonstrative {modifier}* noun. The demonstrative acts as a determiner and the noun following the demonstrative imposes selectional constraints for the antecedent, as in examples (2) and (3). Francis (1994) calls such nouns label nouns, which “serve to encapsulate or package a stretch [...] shell nouns, a metaphoric term which reflects different functions of these nouns such as encapsulation, pointing, and signalling.
Demonstrative nouns, along with pronouns like both and either, are referred to as sortal anaphors (Castaño et al., 2002; Lin and Liang, 2004; Torii and Vijay-Shanker, 2007). Castaño et al. observed that sortal anaphors are prevalent in the biomedical literature. They noted that among 100 distinct anaphors derived from a corpus of 70 Medline abstracts, 60% were sortal anaphors. But how often do demonstrative nouns refer to abstract objects? We observed that from a corpus of 74,000 randomly chosen Medline abstracts, of the first 150 most frequently occurring distinct demonstrative nouns (frequency > 30), 51.3% were abstract, 41.3% were concrete, and 7.3% were discourse deictic. This shows that abstract anaphora resolution is an important component of general anaphora resolution in the biomedical domain. However, automatic resolution of this type of anaphora has not attracted much attention and the previous work for this task is limited.
The present work is a step towards resolving abstract anaphora in written text. In particular, we choose the interesting abstract concept issue and demonstrate the complexities of resolving this-issue anaphora manually as well as automatically in the Medline domain. We present our algorithm, results, and error analysis for this-issue anaphora resolution.
The abstract concept issue was chosen for the following [...] kinds of text from newspaper articles to novels to scientific articles. There are 13,489 issue anaphora instances in the New York Times corpus and 1,116 instances in 65,000 Medline abstracts. Second, it is abstract enough that it can take several syntactic and semantic forms, which makes the problem interesting and non-trivial. Third, issue referents in scientific literature generally lie in the previous sentence or two, which makes the problem tractable. Fourth, issues in Medline abstracts are generally associated with clinical problems in the medical domain and spell out the motivation of the research presented in the article. So extraction of this information would be useful in any biomedical information retrieval [...]
Anaphora resolution has been extensively studied in computational linguistics (Hirst, 1981; Mitkov, 2002; Poesio et al., 2011). But CL research has mostly focused on nominal anaphora resolution (e.g., resolving multiple ambiguous mentions of a single entity representing a person, a location, or an organization) mainly for two reasons. First, nominal anaphora is the most frequently occurring anaphora in most domains, and second, there is a substantial amount of annotated data available for this kind of anaphora.
Besides pronominal anaphora, some work has been done on complement anaphora (Modjeska, 2003) (e.g., British and other European steelmakers). There is also some research on resolving sortal anaphora in the medical domain using domain knowledge (Castaño et al., 2002; Lin and Liang, 2004; Torii and Vijay-Shanker, 2007). But all these approaches focus only on the anaphors with nominal antecedents.
By contrast, the area of abstract object anaphora remains relatively unexplored mainly because the standard anaphora resolution features such as agreement and apposition cannot be applied to abstract anaphora resolution. Asher (1993) built a theoretical framework to resolve abstract anaphora. He divided discourse abstract anaphora into three broad categories: event anaphora, proposition anaphora, and fact anaphora, and discussed how abstract entities can be resolved using discourse representation theory. Chen et al. (2011) focused on a subset of event anaphora and resolved event coreference chains in terms of the representative verbs of the events from the OntoNotes corpus. Our task differs from their work as follows. Chen et al. mainly focus on events and actions and use verbs as a proxy for the non-nominal antecedents. But this-issue antecedents cannot usually be represented by a verb. Our work is not restricted to a particular syntactic type of the antecedent; rather we provide the flexibility of marking arbitrary spans of text as antecedents.
There are also some prominent approaches to abstract anaphora resolution in the spoken dialogue domain (Eckert and Strube, 2000; Byron, 2004; Müller, 2008). These approaches go beyond nominal antecedents; however, they are restricted to spoken dialogues in specific domains and need serious adaptation if one wants to apply them to arbitrary text.
In addition to research on resolution, there is also some work on effective annotation of abstract anaphora (Strube and Müller, 2003; Botley, 2006; Poesio and Artstein, 2008; Dipper and Zinsmeister, 2011). However, to the best of our knowledge, there is currently no English corpus annotated for issue [...]
To create an initial annotated dataset, we collected 188 this {modifier}* issue instances along with the surrounding context from Medline abstracts.³ Five instances were discarded as they had an unrelated (publication related) sense. Among the remaining 183 instances, 132 instances were independently annotated by two annotators, a domain expert and a non-expert, and the remaining 51 instances were annotated only by the domain expert. We use the former instances for training and the latter instances (unseen by the developer) for testing. The annotator's task was to mark arbitrary text segments as antecedents (without concern for their linguistic types). To make the task tractable, we assumed that an antecedent does not span multiple sentences but lies in a single sentence (since we are dealing with singular this-issue anaphors) and that it is a continuous [...]
Figure 1: Example of annotated data. Bold segments denote the marked antecedents for the corresponding anaphor ids. r_hj is the jth section identified by the annotator h.
This kind of annotation, identifying and marking arbitrary units of text that are not necessarily constituents, requires a non-trivial variant of the usual inter-annotator agreement measures. We use Krippendorff's reliability coefficient for unitizing (α_u) (Krippendorff, 1995), which has not often been used or described in CL. In our context, unitizing means marking the spans of the text that serve as the antecedent for the given anaphors within the given text. The coefficient α_u assumes that the annotated sections do not overlap in a single annotator's output and our data satisfies this criterion.⁴ The general form of coefficient α_u is
\alpha_u = 1 - \frac{{}_{u}D_{o}}{{}_{u}D_{e}}    (1)
where uD_o and uD_e are observed and expected disagreements respectively. Both disagreement quantities express the average squared differences between the mismatching pairs of values assigned by annotators to given units of analysis. α_u = 1 indicates perfect reliability and α_u = 0 indicates the absence of reliability. When α_u < 0, the disagreement is systematic. Annotated data with reliability of α_u ≥ 0.80 is considered reliable (Krippendorff, 2004).
³ Although our dataset is rather small, its size is similar to other available abstract anaphora corpora in English: 154 instances in Eckert and Strube (2000), 69 instances in Byron (2003), 462 instances annotated by only one annotator in Botley (2006), and 455 instances restricted to those which have only nominal or clausal antecedents in Poesio and Artstein (2008).
⁴ If antecedents overlap with each other in a single annotator's output (which is a rare event) we construct data that satisfies the non-overlap criterion by creating different copies of the same text corresponding to each anaphor instance.
Krippendorff's α_u is non-trivial, and explaining it in detail would take too much space, but the general idea, in our context, is as follows. The annotators mark the antecedents corresponding to each anaphor in their respective copies of the text, as shown in Figure 1. The marked antecedents are mutually exclusive sections r; we denote the jth section identified
There is a controversial debate (SBAR whether back school program might improve quality of life in back pain patients). This study aimed to address this issue.
(S Reduced serotonin function and abnormalities in the hypothalamic-pituitary-adrenal axis are thought to play a role in the aetiology of major depression.) We sought to examine this issue in the elderly.
(S (PP Given these data) (, ,) (NP decreasing HTD to < or = 5 years) (VP may have a detrimental effect on patients with locally advanced prostate cancer) (. .)) Only a randomized trial will conclusively clarify this issue.
As (NP the influence of estrogen alone on breast cancer detection) is not established, we examined this issue in the Women’s Health Initiative trial.
Table 1: Antecedent types. In examples, the antecedent type is in bold and the marked antecedent is in italics.
by the annotator h by r_hj. In Figure 1, annotators 1 and 2 have reached different conclusions by identifying 9 and 10 sections respectively in their copies of the text. Annotator 1 has not marked any antecedent for the anaphor with id = 1, while annotator 2 has marked r_21 for the same anaphor. Both annotators have marked exactly the same antecedent for the anaphor with id = 4. The difference between two annotated sections is defined in terms of the square of the distance between the non-overlapping parts of the sections. The distance is 0 when the sections are unmarked by both annotators or are marked and exactly the same, and is the summation of the squares of the unmatched parts if they are different. The coefficient is computed using intersections of the marked sections. In Figure 1, annotators 1 and 2 have a total of 14 intersections. The observed disagreement uD_o is the weighted sum of the differences between all mismatching intersections of sections marked by the annotators, and the expected disagreement is the summation of all possible differences of pairwise combinations of all sections of all annotators normalized by the length of the text (in terms of the number of tokens) and the number of pairwise com[...]
For our data, the inter-annotator agreement was α_u = 0.86 (uD_o = 0.81 and uD_e = 5.81) despite the fact that the annotators differed in their domain expertise, which suggests that abstract concepts such [...]
A gold standard corpus was created by resolving the cases where the annotators disagreed. Among 132 training instances, the annotators could not resolve 6 instances and we broke the tie by writing to the authors of the articles and using their response to resolve the disagreement. In the gold standard corpus, 95.5% of the antecedents were in the current or previous sentence and 99.2% were in the current or previous two sentences. Only one antecedent was found more than two sentences back and it was six sentences back. One instance was a cataphor, but the antecedent occurred in the same sentence as the anaphor. This suggests that for an automatic this-issue resolution system, it would be reasonable to consider only the previous two sentences along with the sentence containing the anaphor.
The distribution of the different linguistic forms that an antecedent of this-issue can take in our data set is shown in Table 1. The majority of antecedents are clauses or whole sentences. A number of antecedents are noun phrases, but these are generally nominalizations that refer to abstract concepts (e.g., the influence of estrogen alone on breast cancer detection). Some antecedents are not even well-defined syntactic constituents⁵ but are combinations of several well-defined constituents. We denote the type of such antecedents as mixed. In the corpus, 18.2% of the antecedents are of this type, suggesting that it is not sufficient to restrict the antecedent search space to well-defined syntactic constituents.⁶
In our data, we did not find anaphoric chains for any of the this-issue anaphor instances, which indicates that the antecedents of this-issue anaphors are in the reader's local memory and not in the global memory. This observation supports the THIS-NPs hypothesis (Gundel et al., 1993; Poesio and Modjeska, 2002) that this-NPs are used to refer to entities which are active albeit not in focus, i.e., they are not the center of the previous utterance.
⁵ We refer to every syntactic constituent identified by the parser as a well-defined syntactic constituent.
⁶ Indeed, many of mixed type antecedents (nearly three-quarters of them) are the result of parser attachment errors, but [...]
For correct resolution, the set of extracted candidates must contain the correct antecedent in the first place. The problem of candidate extraction is non-trivial in abstract anaphora resolution because the antecedents are of many different types of syntactic constituents such as clauses, sentences, and nominalizations. Drawing on our observation that the mixed type antecedents are generally a combination of different well-defined syntactic constituents, we extract the set of candidate antecedents as follows. First, we create a set of candidate sentences which contains the sentence containing the this-issue anaphor and the two preceding sentences. Then, we parse every candidate sentence with the Stanford Parser. Initially, the set of candidate constituents contains a list of well-defined syntactic constituents. We require that the node type of these constituents be in the set {S, SBAR, NP, SQ, SBARQ, S+V}. This set was empirically derived from our data. To each constituent, there is associated a set of mixed type constituents. These are created by concatenating the original constituent with its sister constituents. For example, in (4), the set of well-defined eligible candidate constituents is {NP, NP1} and so NP1 PP1 is [...]
The set of candidate constituents is updated with the extracted mixed type constituents. Extracting mixed type candidate constituents not only deals with mixed type instances as shown in Table 1, but as a side effect it also corrects some attachment errors made by the parser. Finally, the constituents having a number of leaves (words) less than a threshold⁸ are discarded to give the final set of candidate [...]
⁸ The threshold 5 was empirically derived. Antecedents in our training data had on average 17 words.
We explored the effect of including 43 automatically extracted features (12 feature classes), which are summarized in Table 2. The features can also be broadly divided into two groups: issue-specific features and general abstract-anaphora features. Issue-specific features are based on our common-sense knowledge of the concept of issue and the different semantic forms it can take; e.g., controversy (X is controversial), hypothesis (It has been hypothesized X), or lack of knowledge (X is unknown), where X is the issue. In our data, we observed certain syntactic patterns of issues such as whether X or not and that X and the IP feature class encodes this information. Other issue-specific features are IVERB and IHEAD. The feature IVERB checks whether the governing verb of the candidate is an issue verb (e.g., speculate, hypothesize, argue, debate), whereas IHEAD checks whether the candidate head in the dependency tree is an issue word (e.g., controversy, uncertain, unknown). The general abstract-anaphora resolution features do not make use of the semantic properties of the word issue. Some of these features are derived empirically from the training data (e.g., ST, L, D). The EL feature is borrowed from Müller (2008) and encodes the embedding level of the candidate within the candidate sentence. The MC feature tries to capture the idea of the THIS-NPs hypothesis (Gundel et al., 1993; Poesio and Modjeska, 2002) that the antecedents of this-NP anaphors are not the center of the previous utterance.
The general abstract-anaphora features in the SR feature class capture the semantic role of the candidate in the candidate sentence. We used the Illinois Semantic Role Labeler for SR features. The general abstract-anaphora features also contain a few lexical features (e.g., M, SC). But these features are independent of the semantic properties of the word issue. The general abstract-anaphora resolution features also contain dependency-tree features, lexical-overlap features, and context features.
[Table 2 body: most feature and class names were lost in extraction; the surviving feature descriptions follow in their original order.]
1 iff the candidate follows the pattern SBAR → (IN whether) (S ...)
1 iff the candidate follows the pattern SBAR → (IN that) (S ...)
1 iff the candidate follows the pattern SBAR → (IN if) (S ...)
1 iff the candidate node is a sentence node
1 iff the candidate node is an SQ or SBARQ node
1 iff the candidate node is of type mixed
EMBEDDING LEVEL (EL) (Müller, 2008)
TLEMBEDDING: level of embedding of the given candidate in its top clause (the root node of the syntactic tree)
level of embedding of the given candidate in its immediate clause (the closest parent of type S or SBAR)
1 iff the candidate is in the main clause
1 iff the candidate is in the same sentence as anaphor
1 iff the candidate is in the adjacent sentence
1 iff the candidate occurs 2 or more sentences before the anaphor
1 iff the antecedent occurs before anaphor
1 iff the governing verb of the given candidate is an issue verb
1 iff the candidate is the agent of the governing verb
1 iff the candidate is the patient of the governing verb
1 iff the candidate is the instrument of the governing verb
1 iff the candidate plays the role of modification
1 iff the candidate plays no well-defined semantic role in the sentence
1 iff the candidate head in the dependency tree is an issue word (e.g., controversial, unknown)
1 iff the dependency relation of the candidate to its head is of type nominal, controlling or clausal subject
1 iff the dependency relation of the candidate to its head is of type direct object or preposition obj
1 iff the dependency relation of the candidate to its head is of type dependent
1 iff the candidate is the root of the dependency tree
1 iff the dependency relation of the candidate to its head is of type preposition
1 iff the dependency relation of the candidate to its head is of type continuation
1 iff the dependency relation of the candidate to its head is of type clausal or adjectival complement
1 iff candidate’s head is the root node
1 iff the given candidate contains a modal verb
PRESENCE OF SUBORDINATING CONJUNCTION (SC)
ISCONT: 1 iff the candidate starts with a contrastive subordinating conjunction (e.g., however, but, yet)
1 iff the candidate starts with a causal subordinating conjunction (e.g., because, as, since)
1 iff the candidate starts with a conditional subordinating conjunction (e.g., if, that, whether or not)
normalized ratio of the overlapping words in candidate and the title of the article
normalized ratio of the overlapping words in candidate and the anaphor sentence
proportion of domain-specific words in the candidate
1 iff the preceding word of the candidate is a preposition
1 iff the following word of the candidate is a preposition
1 iff the preceding word of the candidate is a punctuation
1 iff the following word of the candidate is a punctuation
Table 2: Feature sets for this-issue resolution. All features are extracted automatically.
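The candidate extraction procedure (candidate sentences, eligible node types, mixed type constituents, and the length threshold) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `phrase` helper, and the function names are hypothetical, the toy parse is built by hand rather than produced by the Stanford Parser, and "concatenating with sister constituents" is interpreted here as joining a constituent with a contiguous run of the sisters that follow it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Eligible node types and the threshold of 5 leaves are taken from the text.
ELIGIBLE_TYPES = {"S", "SBAR", "NP", "SQ", "SBARQ", "S+V"}
MIN_LEAVES = 5


@dataclass
class Node:
    """Hypothetical minimal parse-tree node (not the Stanford Parser API)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set only on token (leaf) nodes

    def leaves(self) -> List[str]:
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


def phrase(label: str, text: str) -> Node:
    """Helper: a flat constituent whose children are its tokens."""
    return Node(label, [Node("TOK", word=w) for w in text.split()])


def candidate_sentences(sentences: List[str], anaphor_idx: int) -> List[str]:
    """The anaphor sentence plus up to two preceding sentences."""
    return sentences[max(0, anaphor_idx - 2): anaphor_idx + 1]


def extract_candidates(root: Node, min_leaves: int = MIN_LEAVES) -> Dict[str, str]:
    """Map each candidate span (as a string) to its type."""
    well_defined: List[Tuple[str, List[str]]] = []
    mixed: List[List[str]] = []

    def visit(node: Node) -> None:
        for i, child in enumerate(node.children):
            if child.label in ELIGIBLE_TYPES:
                span = child.leaves()
                well_defined.append((child.label, span))
                # Assumption: a mixed candidate is the constituent followed
                # by a contiguous run of its right-hand sisters.
                for sister in node.children[i + 1:]:
                    span = span + sister.leaves()
                    mixed.append(span)
            visit(child)

    if root.label in ELIGIBLE_TYPES:
        well_defined.append((root.label, root.leaves()))
    visit(root)

    candidates: Dict[str, str] = {}
    # Well-defined constituents take priority over identical mixed spans.
    for label, span in well_defined:
        if len(span) >= min_leaves:
            candidates.setdefault(" ".join(span), label)
    for span in mixed:
        if len(span) >= min_leaves:
            candidates.setdefault(" ".join(span), "mixed")
    return candidates


# Hand-built toy parse (hypothetical bracketing).
np_diff = phrase("NP", "a difference in hemodynamic effects between pulse HVHF and CPFA")
np_obj = Node("NP", [phrase("NP", "no evidence"), Node("PP", [phrase("IN", "for"), np_diff])])
pp_in = phrase("PP", "in patients with septic shock already receiving CRRT")
vp = Node("VP", [phrase("VBP", "provide"), np_obj, pp_in])
root = Node("S", [phrase("NP", "The data from this pilot study"), vp])

cands = extract_candidates(root)
for span, kind in cands.items():
    print(f"{kind:6s} {span}")
```

In this toy parse the object NP and the following PP are sisters under the VP, so their concatenation appears as a mixed candidate, while the two-word NP "no evidence" falls below the threshold and is dropped. A span that joins constituents from different branches is never generated, which is the kind of case sister concatenation cannot recover.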
[...] anaphora resolution is to choose the best candidate [...] model proposed by Denis and Baldridge (2008). The advantage of the candidate-ranking model over the mention-pair model is that it overcomes the strong independence assumption made in mention-pair models and evaluates how good a candidate is [...]
We train our model as follows. If the anaphor is a this-issue anaphor, the set C is extracted using the candidate extraction algorithm from Section 4.1. Then a corresponding set of feature vectors, C_f = {C_f1, C_f2, ..., C_fk}, is created using the features in Table 2. The training instances are created as described by Soon et al. (2001). Note that the instance creation is simpler than for general coreference resolution because of the absence of anaphoric chains in our data. For every anaphor a_i and eligible candidates C_f = {C_f1, C_f2, ..., C_fk}, we create training examples (a_i, C_fi, label), ∀C_fi ∈ C_f. The label is 1 if C_i is the true antecedent of the anaphor a_i, otherwise the label is −1. The examples with label 1 get the rank of 1, while other examples get the rank of 2. We use SVMrank (Joachims, 2002) for training the candidate-ranking model. During testing, the trained model is used to rank the candidates of each test instance of this-issue anaphor.
In this section we present the evaluation of each [...]
The set of candidate antecedents extracted by the method from Section 4.1 contained the correct antecedent 92% of the time. Each anaphor had, on average, 23.80 candidates, of which only 5.19 candidates were nominal type. The accuracy dropped to 84% when we did not extract mixed type candidates. The error analysis of the 8% of the instances where we failed to extract the correct antecedent revealed that most of these errors were parsing errors which could not be corrected by our candidate extraction method.¹⁰ In these cases, the parts of the antecedent had been placed in completely different branches of the parse tree. For example, in (5), the correct antecedent is a combination of the NP from the S → VP → NP → PP → NP branch and the PP from S → VP → PP branch. In such a case, concatenating sister constituents does not help.
(5) The data from this pilot study (VP (VBP provide) (NP (NP no evidence) (PP (IN for) (NP a difference in hemodynamic effects between pulse HVHF and CPFA))) (PP in patients with septic shock already receiving CRRT)). A larger sample size is needed to adequately explore this issue.
We propose two metrics for abstract anaphora evaluation. The simplest metric is the percentage of antecedents on which the system and the annotated gold data agree. We denote this metric as EXACT-M (Exact Match) and compute it as the ratio of number of correctly identified antecedents to the total number of marked antecedents. This metric is a good indicator of a system's performance; however, it is a rather strict evaluation because, as we noted in section 1, issues generally have no precise boundaries in the text. So we propose another metric called RLL, which is similar to the ROUGE-L metric (Lin, 2004) used for the evaluation of automatic summarization. Let the marked antecedents of the gold corpus for k anaphor instances be G = g_1, g_2, ..., g_k and the system-annotated antecedents be A = a_1, a_2, ..., a_k. Let the number of words in G and A be m and n respectively. Let LCS(g_i, a_i) be the number of words in the longest common subsequence of g_i and a_i. Then the precision (P_RLL) and recall (R_RLL) over the whole data set are computed as shown in equations (2) and (3).
P_{RLL} = \frac{\sum_{i=1}^{k} LCS(g_i, a_i)}{n}    (2)
R_{RLL} = \frac{\sum_{i=1}^{k} LCS(g_i, a_i)}{m}    (3)
P_RLL is the total number of word overlaps between the gold and system-annotated antecedents normalized by the number of words in system-annotated antecedents and R_RLL is the total number of such word overlaps normalized by the number of words in the gold antecedents. If the system picks too much text for antecedents, R_RLL is high but P_RLL is low. The F-score, [...]
¹⁰ Extracting candidate constituents from the dependency trees did not add any new candidates to the set of candidates.
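The two evaluation metrics described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' evaluation script: antecedents are taken to be whitespace-tokenized word lists, the function names are hypothetical, and the F-score is assumed to be the balanced harmonic mean of precision and recall, which the surviving text does not confirm.

```python
from typing import List, Sequence, Tuple


def lcs_length(x: Sequence[str], y: Sequence[str]) -> int:
    """Number of words in the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def exact_match(gold: List[List[str]], system: List[List[str]]) -> float:
    """EXACT-M: correctly identified antecedents / marked antecedents."""
    correct = sum(1 for g, a in zip(gold, system) if g == a)
    return correct / len(gold)


def rll(gold: List[List[str]], system: List[List[str]]) -> Tuple[float, float, float]:
    """Corpus-level RLL precision, recall, and F-score."""
    overlap = sum(lcs_length(g, a) for g, a in zip(gold, system))
    m = sum(len(g) for g in gold)      # words in the gold antecedents
    n = sum(len(a) for a in system)    # words in the system antecedents
    precision = overlap / n if n else 0.0
    recall = overlap / m if m else 0.0
    # Assumption: balanced harmonic mean of precision and recall.
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy data: the first system antecedent is too long, the second is exact.
gold = [
    "whether back school program might improve quality of life".split(),
    "the influence of estrogen alone".split(),
]
system = [
    "back school program might improve quality of life in back pain patients".split(),
    "the influence of estrogen alone".split(),
]

p, r, f = rll(gold, system)
em = exact_match(gold, system)
print(f"P_RLL={p:.3f} R_RLL={r:.3f} F_RLL={f:.3f} EXACT-M={em:.2f}")
```

On this toy input the system selects too much text for the first anaphor, so recall (13/14) is higher than precision (13/17), while EXACT-M credits only the second antecedent.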
The lower bound of F_RLL is 0, where no true antecedent has any common substring with the predicted antecedents and the upper bound is 1, where all the predicted and true antecedents are exactly the same. In our results we represent these scores in [...]
There are no implemented systems that resolve issue anaphora or abstract anaphora signalled by label nouns in arbitrary text to use as a comparison. So we compare our results against two baselines: adjacent sentence and random. The adjacent sentence baseline chooses the previous sentence as the correct antecedent. This is a high baseline because in our data 84.1% of the antecedents lie within the adjacent sentence. The random baseline chooses a candidate drawn from a uniform random distribution over the [...]¹¹
We carried out two sets of systematic experiments in which we considered all combinations of our twelve feature classes. The first set consists of 5-fold cross-validation experiments on our training data. The second set evaluates how well the model built on the training data works on the unseen test [...]
[Table 3 body not recovered; the only surviving row label is "Oracle candidate sentence extractor + row 3".]
Table 3: this-issue resolution results with SVMrank. All means evaluation using all features. Issue-specific features = {IP, IVERB, IHEAD}. EX-M is EXACT-M.
Table 3 gives results of our system. The first two rows are the baseline results. Rows 3 to 8 give results for some of the best performing feature sets. All systems based on our features beat both baselines on F-scores and EXACT-M. The empirically derived feature sets IP (issue patterns) and D (distance) appeared in almost all best feature set combinations. Removing D resulted in a 6 percentage point drop in F_RLL and a 4 percentage point drop in EXACT-M scores. Surprisingly, feature set ST (syntactic type) was not included in most of the best performing feature set combinations. The combination of syntactic and semantic feature sets {IP, D, EL, MC, L, SR, DT} gave the best F_RLL and EXACT-M scores for the cross-validation experiments. For the test-data experiments, the combination of semantic and lexical features {D, C, LO, L, SC, SR, DT} gave the best F_RLL results, whereas syntactic, discourse, and semantic features {IP, D, C, EL, L, SC, SR, DT} gave the best EXACT-M results. Overall, row 3 of the table gives reasonable results for both cross-validation and test-data experiments with no statistically significant difference to the corresponding best EXACT-M scores in rows 6 and 5 respectively.¹²
To pinpoint the errors made by our system, we carried out three experiments. In the first experiment, we examined the contribution of issue-specific features versus non-issue features (rows 9 and 10). Interestingly, when we used only non-issue features, the performance dropped only slightly. The F_RLL results from using only issue-specific features were below baseline, suggesting that the more general features associated with abstract anaphora play a crucial role in resolving this-issue anaphora.
In the second experiment, we determined the error caused by the candidate extractor component of our system. Row 12 of the table gives the result when an oracle candidate extractor was used to add the correct antecedent in the set of candidates whenever our candidate extractor failed. This did not affect cross-validation results by much because of the rarity of such instances. However, in the test-data experiment, the EXACT-M improvements that resulted were statistically significant. This shows that our resolution algorithm was able to identify antecedents that were arbitrary spans of text.
In the last experiment, we examined the effect of the reduction of the candidate search space. We assumed an oracle candidate sentence extractor (Row 13) which knows the exact candidate sentence in which the antecedent lies. We can see that both RLL and EXACT-M scores markedly improved in this setting. In response to these results, we trained a decision-tree classifier to identify the correct antecedent sentence with simple location and length features and achieved 95% accuracy in identifying [...]
Our results show that general abstract-anaphora resolution features (i.e., other than issue-specific features) play a crucial role in resolving this-issue anaphora. This is encouraging, as it suggests that the approach could be generalized for other NPs, especially NPs having similar semantic constraints such as this problem, this decision, and this conflict.
The results also show that reduction of search space markedly improves the resolution performance, suggesting that a two-stage process that first identifies the broad region of the antecedent and then pinpoints the exact antecedent might work better than the current single-stage approach. The rationale behind this two-stage process is twofold. First, the search space of abstract anaphora is large and noisy compared to nominal anaphora.¹³ And second, it is possible to reduce the search space and accurately identify the broad region of the antecedents using simple features such as the location of the anaphor in the anaphor sentence (e.g., if the anaphor occurs at the beginning of the sentence, the antecedent is most likely present in the previous sentence).
We chose scientific articles over general text because in the former domain the actual referents are seldom discourse deictic (i.e., not present in the text). In the news domain, for instance, which we have also examined and are presently annotating, a large percentage of this-issue antecedents lie outside the text. For example, newspaper articles often quote sentences of others who talk about the issues in their own world, as shown in example (6).
(6) As surprising and encouraging to organizers of the movement are the Wall Street names added to their roster. Prominent among them is Paul Singer, a hedge fund manager who is straight and chairman of the conservative Manhattan Institute. He has donated more than $8 million to various same-sex marriage efforts, in states including California, Maine, New Hampshire, New Jersey, New York and Oregon, much of it [...] “It’s become something that gradually peo[...]
We have demonstrated the possibility of resolving complex abstract anaphora, namely, this-issue anaphora having arbitrary antecedents. The work takes the annotation work of Botley (2006) and Dipper and Zinsmeister (2011) to the next level by resolving this-issue anaphora automatically. We proposed a set of 43 automatically extracted features that can be used for resolving abstract anaphora.
¹¹ Note that our F_RLL scores for both baselines are rather high because candidates often have considerable overlap with one another; hence a wrong choice may still have a high F_RLL score.
¹² We performed a simple one-tailed, k-fold cross-validated paired t-test at significance level p = 0.05 to determine whether the difference between the EXACT-M scores of two feature classes is statistically significant.
¹³ If we consider all well-defined syntactic constituents of a sentence as issue candidates, in our data, a sentence has on average 43.61 candidates. Combinations of several well-defined syntactic constituents only add to this number. Hence if we consider the antecedent candidates from the previous 2 or 3 sentences, the search space can become quite large and noisy.
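The significance test mentioned in footnote 12 (a one-tailed, k-fold cross-validated paired t-test at p = 0.05) can be illustrated with the sketch below. It is a generic paired t-test over per-fold score differences, written under assumptions: the function names are hypothetical, the per-fold EXACT-M values are invented purely for illustration and are not results from the paper, and the critical values are the standard one-tailed 0.05 entries of the t-distribution table for small degrees of freedom.

```python
import math
from typing import Sequence

# Standard one-tailed critical values of Student's t at alpha = 0.05,
# indexed by degrees of freedom (k - 1 for k folds).
T_CRIT_ONE_TAILED_05 = {
    1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
    6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833,
}


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic for the mean per-fold difference (a minus b)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need the same number (>= 2) of folds for both systems")
    k = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = sum(diffs) / k
    variance = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    std_err = math.sqrt(variance / k)
    if std_err == 0.0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean / std_err


def a_significantly_better(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """One-tailed test of 'A scores higher than B' at the 0.05 level."""
    t = paired_t_statistic(scores_a, scores_b)
    return t > T_CRIT_ONE_TAILED_05[len(scores_a) - 1]


# Hypothetical per-fold EXACT-M scores for two feature sets (5 folds).
feature_set_a = [0.60, 0.58, 0.62, 0.61, 0.59]
feature_set_b = [0.55, 0.56, 0.57, 0.58, 0.54]

t_value = paired_t_statistic(feature_set_a, feature_set_b)
print(f"t = {t_value:.3f}, significant: {a_significantly_better(feature_set_a, feature_set_b)}")
```

With 5 folds there are 4 degrees of freedom, so a difference counts as significant in this sketch only when the t statistic exceeds 2.132.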
