I’ve been reading a lot of papers since starting at UW and I thought it might be nice to keep some written notes about a few of them.
Overview of the Paper
The paper discusses exact probabilistic inference and presents a framework (called relax, compensate, and recover) for approximate probabilistic inference. From what I have encountered so far, probabilistic inference in the CS community is used exclusively when talking about calculating probabilities using a graphical models (such as Bayesian networks), which are statistical models used to represent the joint distribution and conditional dependencies over a set of random variables.
For example, variables might represent our knowledge about particular aspects of the world, such as whether the lawn is wet or whether it is raining outside. Usually, some form of evidence is noted, such as the absence of a wet lawn, and an example of probabilistic inference is then to calculate the probability that it is raining after taking into account this evidence. Calculating probabilities on graphical models is a difficult computational problem, and so this paper explores ways to expand the cases where exact inference can be used and find better approximation techniques for intractable cases.
The approximate inference technique is based on the idea of taking the input graphical model, then “relaxing” some of its equivalence constraints, and performing exact inference on the now approximate model. The notions of compensation and recovery are explained later. Their purpose is to nudge the approximate model closer to the original model, which hopefully means that our solution will move closer to the true solution of the original problem.
The exact inference technique is based on the idea of converting a probabilistic inference problem into a propositional knowledge base (this Darwiche paper also refers to these as Boolean functions). The idea here is that, given a knowledge base in a certain form (detailed later), the problem can be compiled into a form that supports tractable answering of queries.
Interestingly, even though their approximate inference technique relies on exact inference over a modified model, they note that their exact and inexact approaches have not been integrated together in practice (at time of writing, at least).
They conclude the introduction with a note that their frameworks have performed well at several recent Uncertainty in Artificial Intelligence (UAI) inference competitions.
Exact Probabilistic Inference
The first concept introduced in this section is treewidth. Treewidth is a numeric characterization of a graph’s structure that is often used in graph algorithms to characterize complexity. As the paper notes, existing exact inference algorithms have runtimes that are generally exponential in the model’s treewidth.
The second concept introduced is negation normal form (NNF), which they define as a directed acyclic graph (or circuit) where internal nodes are labeled with either AND or OR, and the leaf nodes are all literals or explicitly labeled true or false. In other words, an NNF is a boolean circuit diagram built with only AND and OR gates.
An NNF can exhibit zero, one or both of the following properties: decomposability, which means that the left and right branches of an AND node share no variables, and determinism, which means that the left and right branches of an OR node are mutually exclusive (the expression in the left branch cannot be true if the expression in the right branch is true, and vice versa).
The basic idea of this section is that any type of graphical model can be encoded into a propositional knowledge base in conjunctive normal form (CNF), where each clause has a weight. If you translate the CNF into a decomposable, deterministic NNF (dd-NNF), then you can perform efficient inference in time linear in the NNF size. In this case, the NNF represents an arithmetic circuit and a single pass from the leaves to the root will compute a probability. This sounds great, and often is, but they point out the linear run time is in the size of the dd-NNF, not in the original size of the original knowledge base, and unfortunately the conversion from CNF to dd-NNF can result in an exponential size increase.
Still, they state that a key advantage of this compilation into a dd-NNF is that it “exploits local structure (i.e., the structure of the parameters that quantify the network)”. I wasn’t sure what this meant, and the paper doesn’t elaborate, but there’s actually a chapter about this in Darwiche’s book, and in a less easily digestible form in this paper. The simplest form of “local structure”, which just means any structure arising from the conditional probability tables of a graphical model instead of its network topology, is determinism: instead of saying X is 50% likely to occur when Y occurs, you say that X occurs with certainty when Y occurs. These constraints are deterministic because they always happen, not just some of the time. Awareness of deterministic constraints can help you avoid performing inference over the entire structure of the graph. Another example of local structure that can make inference easier is context-specific independence: if X takes on a specific value, one of its children may become independent of its other parents. There are other examples, and I haven’t read far enough to see exactly how dd-NNF’s make it easier to exploit local structure, but at least this gives some of the intuition behind what local structure is referring to.
Approximate Probabilistic Inference
In this section, Choi and Darwiche elaborate on how to perform approximate inference by taking the original problem, creating a relaxed version of it, and performing exact inference on the now-approximate model.
First up is the concept of relaxing equivalence constraints, which say that the value of variable X must be the same as Y in a valid model. Relaxing this equality constraint means simply ignoring the fact that X and Y should take on the same value. Exact inference in the relaxed model may produce a solution where, for instance, X is true and Y is false; this is invalid in the original problem, but okay in the relaxed model.
Next is compensation, which helps move the solutions on the relaxed model closer to the exact solution to the original model. In general, this is done by manipulating the approximate model to enforce a weaker notion of equivalence on the relaxed equality constraint. What exactly a weaker notion of equivalence looks like depends on the type of model, and I’ll talk about that more below.
Finally, recovery refers to restoring the equivalence constraints that, when removed, caused the worst decrease in accuracy of the approximate solution. They point to their previous work that proposes a number of recovery heuristics to try to identify the equivalence constraints whose relaxation most damaged the accuracy of the solution.
The authors also note that the exact solutions on a relaxed model coincide with the solutions produced by the mini-bucket elimination algorithm of Rina Dechter (http://www.ics.uci.edu/~csp/r62a.pdf), and that the relax, compensate, and recover framework subsumes loopy belief propagation. Perhaps some discussion of how and why this is so could serve as a future post; I don’t know enough details of either algorithm to contribute much now.
It’s worth noting that there is a connection between this inference framework and solving weighted MAX-SAT, which the authors briefly mention in this paper. They have a previous paper that discusses how to relax and compensate for weighted MAX-SAT that I found quite useful for wrapping my head around what “weaker notion of equivalency” actually means: it’s any property of the variables involved in a relaxation that, while not necessarily implying that the variables adopt the same value, helps in some fashion to move the solution of the approximate model closer to the solution of the original models. For weighted MAX-SAT, weaker equivalency corresponds to a relationship between the weights of the relaxed variables and the weights of (slightly) constrained optimal solutions in the relaxed model. For Bayesian networks, weaker equivalency corresponds to the relaxed variables having the same marginal probabilities (meaning that the probability of X being true is the same as the probability of Y being true), which makes a kind of intuitive sense: if the values of X and Y are the same, as in the original model, they must have the same marginals. So a slightly weaker definition of “the same” could be that their marginals alone are equal. Their is some more detail about this concept in this paper, and much more detail in their other work.
One not-completely obvious point about the relaxation of equivalence constraints: many times what you have in a graphical model is not an equivalence constraint but a correlation between variables. For instance, if you have connected variables X and Y, you can insert a new variable Y’ between X and Y, with the equivalence constraint that Y = Y’. When you relax the model, you let Y and Y’ take on different values: this means that connection between X and Y is broken, as X interacts with Y’ which is not related to Y in the relaxed model.
The paper concludes with a pointer to their Bayesian network toolkit, SamIam, which includes implementations of the inference framework they discuss.
They identify as future work the integration between their approximate and exact inference frameworks, as well as noting the potential for the further incorporation of knowledge compilation.