The goal of relation extraction is to automatically build structured, queryable representations of unstructured text. The sheer volume of available data prohibits manual extraction, yet the ambiguity of natural language makes automated extraction difficult. I construct a joint probabilistic model of the components necessary for relation extraction – entity classification and relation extraction – that allows for a rich representation of the relation extraction process. This joint model helps alleviate the problem of error propagation common in pipelined models, where the output of one component is fed as a (potentially incorrect) input to the next component.
Even just considering the scientific domain, the speed at which scientists output new knowledge in the form of unstructured text far outpaces the ability of any one individual to read and process. This fundamental inability of individuals to keep up with the amount of textual information available has led to attempts to develop tools to automate part of the process. One such tool is known as relation extraction, and attempts to automatically identify relationships expressed in text.
In this post I show that joint modeling of entity classification and relation extraction is able to improve performance in both tasks over individual approaches. Joint inference allows confident predictions from one classifier to correct errors in the output of another classifier. Beyond use of the joint model for entity classification and relation extraction on the corpus used in Roth and Yih , I also applied the same techniques to the recently released Google relation extraction corpus. While the granularity of labels with the Google data is more suited to a distant supervision setting, I was able to extract a subset amenable to the approach of this post. I also compare exact and approximate inference in joint models and find that, for the models I construct, there is little difference in the final predictions relying on approximate or exact inference.
2 Related Work
GuoDong et al.  and Zhao and Grishman  are representative of early, supervised approaches to the relation extraction problem. GuoDong et al.  use lexical, syntactic, and semantic features to attempt to predict relationships. Zhao and Grishman  use string kernel methods to perform the same supervised classification task. Jiang and Zhai  present a comprehensive overview of feature selection for relation extraction.
Recently, distant supervision (Mintz et al. ) techniques have become the focus of much relation extraction research. Distant supervision begins with a database of known facts then mines large bodies of text to find sentences that contain two entities participating in a relation. These sentences are considered positive examples for training. This method incorporates large amounts of potentially incorrect examples, but compensates for this noise by building training sets far larger than is possible through human annotation.
Both direct and distant supervision methods typically rely on a “black box” named entity recognizer to examine each sentence and extract the entity mentions. This pipelined approach can introduce errors into the relation extraction by a failure in the named entity recognizer to extract the correct entity mention. GuoDong et al.  acknowledge this problem in the pipelined approach and attempt to compensate for it by including a wide variety of “shallow” and “deep” linguistic features, with the expectation that errors in one set of features can be compensated for by additional features.
Roth and Yih  present an early approach to modeling relation extraction as the joint inference problem of recognizing entities and relations simultaneously. Roth and Yih  refine this work and presents additional results and new evaluation metrics.
These, and other works, introduce a joint model of the components of relation extraction to overcome propagation of errors from the pipelined approach. In many cases, however, joint models have been reported to perform no better, or even worse, than the “simpler” pipelined approach. See Riedel and McCallum  for a discussion of this issue.
My work further explores the questions raised in Roth and Yih  and Kate and Mooney , where joint inference is applied to the supervised entity classification and relation extraction setting.
3 Problem Statement
Given an input sentence X with distinguished entity mentions E1,E2,…,E2, the goal is to predict the following:
- The types of each entity Ei
- The type of each possible relation between ordered pairs of entities (Ei,Ej),i≠j.
With this definition, I am restricting my attention to relationships expressed entirely within a single sentence. Relationships that span multiple sentences or documents are a more challenging problem beyond the scope of my current model.
Each corpus, in addition to sentences requiring the above classification of entities and relations, also has an associated set of constraints imposed between relation and entity types to ensure coherent predictions.
For example, the data set from Roth and Yih  has the constraint (Live In, Person, Location), expressing that only a person and location can participate in the lives in relation.
4 Tested Approaches
4.1 Relation Classification Only
My initial approach was to use a popular natural language processing toolkit from Stanford to process each input sentence and perform the entity classification as a pre-processing step. I then trained a variety of classifiers from the scikit-learn library to perform the relation classification between entities, treating the entity type predictions of the Stanford NLP toolkit as a ground truth.
This is a common approach in relation extraction, and representative of approaches following the so-called “pipeline” model, where distinct components are executed in sequence on the input data and the final prediction is made by the last classifier, using all previous models’ output as input features. This approach can be very successful, but it had obvious limitations for my intended work:
- This approach treats entity classification as a “black box”, with no associated probabilities output for every class
- Furthermore, it abstracts away a large part of the difficulty of performing relation extraction by assuming entity classification is a solved problem
- Relies on input sentences from a domain for which the Stanford named entity classification is already trained, which does hold in a general setting
For these reasons, I moved beyond using the Stanford named entity classification toolkit and developed my own models to handle the entity classification task.
4.2 Separate Entity and Relation Classification
My baseline implementation used two models to independently classify entities and relations.
The entity classifier accepts a set of features from the words in and around an entity mention, along with features such as the capitalization pattern of words. The word-based features include unigrams, bigrams, and trigrams of the words themselves, the parts-of-speech tags, and word/part-of-speech pairs. These are standard features from the literature (Roth and Yih , Kate and Mooney ).
The relation classifier uses a set of similar features from the two entity mentions and similar features drawn from the words between them.
I used scikit-learn’s implementations to build the entity and relation classifiers, with my own code written to process the data, extract features, perform k-fold cross-validation and report statistics. I tested several classifiers:
- Naive Bayes – while a simple naive Bayes model performed reasonably well in terms of precision and recall, the probability estimates derived from a multiclass naive Bayes classifier are essentially worthless. Almost all of the probability mass returned for a given input is centered on one class. In other words, the classifier is usually far too confident in its predictions. This precludes using its output for a joint inference model, and I did not consider this model further.
- Multiclass Logistic Regression – This performed well, with a short training and testing time on my datasets, and returns a reasonable approximation of the probability distribution over the output classes. This is the model I primarily used and for which I report results below.
- Linear SVM – This also performed well in terms of recall and precision, and returned probability estimates that worked well for the subsequent joint inference model. However, scitkit-learn implements Platt scaling to return probability estimates for SVMs. This involves fitting a logistic regression model to the SVMs output scores. This is a high-overhead operation and significantly increased train and test times. The study by Niculescu-Mizil and Caruana  indicates that, with proper tuning, the probabilities output by an SVM using Platt scaling will be superior to those of logistic regression, I viewed quantifying this over my datasets as beyond the scope of my project and relied on the (possibly slightly inferior) probability estimates of logistic regression for reasons of speed.
4.2.1 Alternate Feature Set
I also trained the separate models using an augmented feature set, known in the literature as an Omniscient classifier. This feature set uses the true labels of the entities as input features for the relation classifier, and the true labels of the relations as input features for the entity classifier. This is an entirely unrealistic model, as it presupposes the true labels are always known, but it serves as an upper bound on the performance possible by jointly considering entity and relation classification. Omniscient features significantly increase precision, recall, and F1 scores, as reported in the results. Interestingly, even with the true labels available at train and test time, the joint inference models are even able to enhance the performance of the Omniscient classifiers.
4.3 Bayesian Network for Joint Inference – Approach #1
Roth and Yih  present, to the best of my knowledge, one of the earliest attempts at performing relation extraction by building a joint model over entity and relation classifiers. Following their approach, I built a Bayesian network model that incorporated the independent classifiers trained as I described in the previous section. Training the classifiers proceeds as before, but at test time, the predictions of the classifiers are used to construct a Bayesian network. The most probable explanation (MPE) of the constructed network determines the final predictions for entities and relations within an entire sentence.
My initial approach to modeling the extraction process as a Bayesian network constructed a network for each sentence, with a node for each entity and potential relation, as in figure 1. This is consistent with the method described in Roth and Yih .
The network probabilities for entity nodes come directly from the probability estimates output by the trained entity classifier. The conditional probability tables (CPTs) for the relation nodes are likewise determined by the probabilities from the relation classifier, except for the application of constraints.
The intent of the joint model is to enforce constraints, as described in section 3, and thereby help correct erroneous predictions from one classifier by considering the predictions of another classifier. Therefore, the conditional probabilities of a relation node taking on a value that is not consistent with its constraints is set to zero, regardless of the probabilities output by the relation classifier.
As an example of this, consider the entity pair, “Sirhan Sirhan” and “Senator Kennedy”. If the entity classifier predicts with 90% probability that “Sirhan Sirhan” is a person and with 51% probability “Senator Kennedy” is a location, yet the relation classifier reports 95% probability that the “kill” relation holds between them, the goal of the Bayesian network model is to infer that “Senator Kennedy” is a person and, in a sense, override the entity classifier’s wrong but not very confident prediction that “Senator Kennedy” is a location.
Unfortunately, this network configuration heavily promotes a MPE that, if the base classifier predictions violate the constraints, simply predicts the same entity types as the base classifiers and defaults the relation to “other”. This is because relation predictions that are not consistent with the constraints have probability 0, which means that the “other” relation must have probability 1. This dominates even confident predictions of relations that satisfy the constraints.
This results in a joint model that only in rare cases outputs a final result better than just using the base classifiers independently, and overall performance (as measured by precision, recall, and F1 scores) was almost unchanged from the separate classifier approach. This was consistent with the results reported in Roth and Yih , where the joint model only made a small difference. As a consequence, I developed a slightly more complicated Bayesian network for modeling sentence-level predictions that was better able to enforce the constraints and increase recall, precision, and F1 scores.
4.4 Bayesian Network for Joint Inference – Approach #2
Instead of directly connecting entity to relation nodes, I defined an additional indicator variable for each possible relation, with incoming edges from the corresponding entities and relation nodes. The indicator CPTs are entirely deterministic. For all assignments of its entity and relation variables that do not violate a constraint, its value is true with probability 1. For all other assignments, its value is false with probability 1. An example network following this construction is shown in figure 2.
At inference time, the indicator variables are “observed” as evidence; in other words, instead of finding the most probable explanation’s assignment of all variables, we find the MPE of the entity and relation variables given that all indicator variables are observed to be true. In contrast to the previous approach, this network is able to correct many mistakes made by the underlying classifiers: the MPE explanation simply chooses the assignment of entity and relation variables that maximizes their joint likelihood subject to the constraints imposed by the network, which is exactly the desired effect. As seen in the following results section, using this construction of the Bayesian network has a significant positive impact on overall performance.
4.5 Circuit Compilation
A significant drawback of the joint model is the increased time spent in generating predictions. My initial program, written in Python, trains and tests the entity and relation classifiers, then generates a text file encoding the network and probabilities in the SamIAm Bayesian network file format (Darwiche). Next, a call is made to a Java program (the core of which is supplied with the SamIAm library, which I modified to interact with Python and return the MPE instantiation assignments). This performs approximate inference on the network. For these simple networks, the approximate answers are very close to exact; however, even approximate inference takes several seconds, as shown in figure 3.
As dynamic inference requires approximately 3 seconds to complete per sentence, in a corpus with only 1000 sentences, k-fold cross-validation requires 50 minutes just for the dynamic inference. To speed this up, I made use of the ACE package (Darwiche and Chavira), which compiles Bayesian networks into an arithmetic circuit.
For each corpus and its associated constraints, I used the ACE package to compile a Bayesian network for sentences with between two and eleven entities. Each compiled circuit represents a network encoding the appropriate constraints between all n(n- 1) possible relations between the n entities. The ACE package is distributed with a Java program that reads in a compiled network and returns the marginal probabilities; I altered this inference engine to accept as arguments the updated CPTs of the circuit gates corresponding to entity and relation probabilities, and to perform two passes over the circuit to calculate the MPE.
The benefits of compiling a Bayesian network into a circuit comes from the following:
- Circuits enable calculating the (exact) MPE of a Bayesian network in time linear in the size of the circuit (Darwiche )
- In contrast, exact calculation of the MPE in the network is NP-hard
- This enables a significant speed increase over dynamic inference (see figure 3)
- Compiling a circuit only has to be done once. The CPTs encoded by the Bayesian network are translated into the circuit, and can be updated efficiently to incorporate the probability estimates of the base classifier
Note that the circuits themselves may be of size exponential in the input network, but in many cases, such as the networks I generated, much more compact circuits can be found.
Using compiled circuits, overall inference time for a corpus of 1000 sentences drops from ≈ 50 minutes to ≈ 6 minutes.
5.1 Roth and Yih Data
The first dataset I used came from Roth and Yih . It was manually created from 1437 sentences exhibiting at least one of the following relations: located in, works for, organization based in, lives in, or kills. From these sentences, there are 5336 entities and 19048 pairs of entities (potential binary relations). A sample sentence from this dataset is shown in figure 4. This dataset has been used in other papers subsequently, including Kate and Mooney .
Results are presented, averaged over 5-fold cross-validation, in tables 1 (for entities) and 2 (for relations).
The Separate lines present the results using separately trained and tested entity and relation classifiers, with no joint inference performed. This can be seen as a baseline for comparing with the performances of the joint model.
The Joint lines present the results of building a Bayesian network, according to the approach described in section 4.4. The probabilities used in the Bayesian network come from the Separate classifiers, but predictions are made at the sentence-level based on the most probable explanation of the Bayesian network.
For entity classification, the joint model provides a slight increase in F1 for the person and location entity types. These are consistent in magnitude with the results reported in Roth and Yih .
However, for relation classification, the joint model increases F1 scores for every relation. The increases for the works for and kills relations are particularly high. It is interesting to see that the joint model tends to not increase, or even harm, recall, but increase precision by a significant amount. The results here are equivalent to those in Roth and Yih , with the largest increases in F1 having the same magnitude as in my model.
These results show that the joint model is capable of significantly improving the relation classification performance.
Finally, the Omniscient and Omniscient + Joint lines display the results for entity and relation classifiers augmented with the omniscient features described in section 4.2.1. While this represents an unrealistic feature set for a true application, it is interesting to note that even in this case, the joint model never hurts and often improves performance.
5.2 Google’s Relation Extraction Corpus
My original goal was to use the recently released Google relation extraction corpus. It contains a number of relations, such as birth place, where each relation is accompanied by a paragraph from Wikipedia which contains text sufficient to infer the relation.
However, the Google corpus offers a challenge for evaluating results: it contains one label per paragraph, but many of the paragraphs contain multiple relations. I ran into problems such as the following: the classifier predicts that a person was born in Minnesota, which is correct and labeled so in the data, but also that he was affiliated with several schools and other relations that may or may not be correct (the paragraph they are drawn from only has the birthplace label).
To bypass this problem, and to enable testing my model on a second corpus, I only considered the birth place and place of death relations from the Google corpus. For these, there is presumably only one correct answer. I further simplified the problem by filtering the relations to those which contained a single sentence mentioning both the person and the location. Some of these sentences may not actually express the relation, but I relied on there being enough correct labels to outweigh any noise in the data.
Additionally, unlike the previous corpus, the Google corpus does not explicitly mark the entity mentions in each sentence. To overcome this, I used the Stanford NLP toolkit to automatically extract entities and used its predicted types as the ground truth.
Results are presented, averaged over 5-fold cross-validation, in tables 3 (for entities) and 4 (for relations).
Because the entity labels are generated automatically and likely contain numerous errors, it is difficult to draw any conclusions from the entity results over the Google corpus. However, for the relations (which are human-annotated), the joint model once again improves on the performance of the independent classifiers. As with the previous corpus, joint inference tends to not help or slightly harm recall but boost precision, even when omniscient features are used.
- Benefits of Joint Inference Consistent with previously published result, my approach finds that joint inference between entity and relation classification at the sentence level provides a small but significant increase in overall performance, as measured by precision, recall, and F1 scores.
- Consistency of Predictions By incorporating the outputs of entity and relation classifiers into a joint probabilistic model, I ensured that the predictions for each sentence are self-consistent: the probabilistic models enforce constraints present in the data, such as only two people can participate in the kills relation.
- Compiling Bayesian Networks: Pre-compiling Bayesian networks greatly speeds up the inference process. Contrary to my original expectations, the ability to perform exact inference does not significantly enhance final prediction quality.
- Exact versus Approximate Inference: In the graphical models I constructed to examine the influence of joint inference, whether the inference procedure returns exact or approximate makes essentially no difference. There are two primary reasons for this. One, the graphical model structures are relatively simple, and approximate inference such as loopy belief propagation has been empirically shown to achieve good approximations for these case. Second, the underlying probabilities encoded in the Bayesian networks are themselves highly noisy approximations. In hindsight, this is an obvious limitation when considering any potential performance gain granted by exact inference.
7 Future Work
- Quality of Classifier-Generated Probabilities: An interesting aspect suggested, but not explored, by my work is quantifying the dependence of the joint model on the quality of the probabilities generated by the individual classifiers. An investigation of this aspect would combine elements of sensitivity analysis of Bayesian networks and accurate probability estimates from classifiers, two well-researched areas that have not (to the best of my knowledge) been studied from the vantage point of estimating the joint impact of sensitivity and probability estimates on an information extraction task.
- Learning Constraints from Data This is mentioned as a possibility in Roth and Yih , but not pursued. Rather than having to manually specify the constraints for each corpus, it should be possible to learn such constraints from the data, similar to the method of learning the MLE estimates of Bayesian network parameters discussed in class.
Adnan Darwiche. Samiam. Software available from http://reasoning.cs.ucla.edu/samiam.
Adnan Darwiche. Modeling and reasoning with Bayesian networks. Cambridge University Press, 2009.
Adnan Darwiche and Mark Chavira. Ace, an arithmetic circuit compiler. 2007. URL: http://reasoning.cs.ucla.edu/ace.
Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 427–434. Association for Computational Linguistics, 2005.
Jing Jiang and ChengXiang Zhai. A systematic exploration of the feature space for relation extraction. In HLT-NAACL, pages 113–120, 2007.
Rohit J Kate and Raymond J Mooney. Joint entity and relation extraction using card-pyramid parsing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 203–212. Association for Computational Linguistics, 2010.
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.
Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM, 2005.
Sebastian Riedel and Andrew McCallum. Fast and robust joint models for biomedical event extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1–12. Association for Computational Linguistics, 2011.
Dan Roth and Wen-tau Yih. Probabilistic reasoning for entity & relation recognition. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
Dan Roth and Wen-tau Yih. Global inference for entity and relation identification via a linear programming formulation. Introduction to Statistical Relational Learning, pages 553–580, 2007.
Shubin Zhao and Ralph Grishman. Extracting relations with integrated information using kernel methods. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 419–426. Association for Computational Linguistics, 2005.