Home » Probabilistic Databases » The Hardness of Probabilistic Queries

# The Hardness of Probabilistic Queries

Since I started at UW last quarter, I have been spending a lot of time working on hardness results for queries over probabilistic databases. A (tuple-independent) probabilistic database is a normal relational database except that every tuple is annotated with a number between 0 and 1. This number gives the probability that the tuple is true in a randomly chosen world. Each tuple is assumed to be independent of all other tuples. The probability that a Boolean query (ie, a query which returns true or false) holds on a probabilistic database is given by the sum of the probabilities of all possible worlds in which that query holds.

The normal measure of complexity used in database theory is data complexity – for a given query, how does the running time of the query change with the size of the database. The normal operators of the relational calculus have polynomial data complexity. When considering only data complexity, any query, no matter how complex, runs in polynomial time in the size of the database.

The status for queries over probabilistic databases is not so nice. The results returned by a probabilistic query must also include a probability. In the case of a simple Boolean query such as:

${Q = \exists x.R(x)}$

We want the output to be a number between 0 and 1 indicating the probability that $Q$ is true. We use the fact that $P(Q) = 1 - P(\lnot Q)$, and $\lnot Q = \forall x.\lnot R(x)$. This shows that $P(\lnot Q)$ is the probability that all the tuples in $R$ are false. We can compute this easily, using the tuple-indepence condition:

${P(\lnot R(x_1) \land \lnot R(x_2) \land \cdots \land \lnot R(x_n))}$

${= P(\lnot R(x_1)) \cdots (P(\lnot R(x_n))}$

So $Q$ can be answered in polynomial time.

Consider this query:

${h_0 = \exists x \exists y.R(x) \land S(x,y) \land T(y)}$

We still have tuple-level independence, but now that we have added joins, we introduced correlations between possible elements of the answer set. There is no obvious way to efficiently calculate the probability that $h_0$ is true. In fact, a large amount of work has gone into showing that no such algorithm is ever likely to be found, for $h_0$ and a broad class of other positive, existentially quantified first-order queries. The culmination of this, for tuple-independent probabilistic databases and positive queries, can be found in the paper The dichotomy of probabilistic inference for unions of conjunctive queries (2012). It establishes that, for any query without universal quantifiers or negation, evaluating that query over a probabilistic database is either in PTIME or #P-hard in the size of the database.

## A Brief Introduction to #P

#P is the complexity class containing counting problems associated with problems in NP. An example of a #P problem is counting the number of satisfying assignments of the variables in an instance of 3SAT. A #P-hard problem is a problem such that a polynomial time solution would allow you to compute, in polynomial time, the solution for any problem in #P.  #P-hard problems are at least as hard as NP-hard problems. A #P-complete problem is a problem that is #P-hard and is itself in #P. The canonical #P-complete problem is #SAT, counting the number of satisfying solutions to an arbitrary Boolean formula.

One of the interesting things about the #P class is that there are #P-complete problems where the associated decision problem is in P. Or in other words, determining whether a solution exists can be done in polynomial time, but counting the number of solutions is #P-hard.

In a 1983 paper, The complexity of counting cuts and of computing the probability that a graph is connected, Provan and Ball prove that #PP2CNF is also #P-complete, where the problem is defined as follows: Let Boolean variables $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$, let $E$ be the edges of a bipartite graph connecting $X$ variables to $Y$ variables. The positive, partitioned 2-CNF (PP2CNF) formula is:

${\Phi = \land_{(i,j)\in E} (X_i \lor Y_j)}$

#PP2CNF is the number of satisfying assignments of $\Phi$. An immediate consequence is that the dual problem, #PP2DNF, is also  #P-hard:

${\Phi = \lor_{(i,j)\in E} (X_i \land Y_j)}$

## Proving $h_0$ is #P-hard

To show that $h_0$ is #P-hard, we reduce from #PP2DNF. If we can show that any #PP2DNF instance can be mapped to an instance of $h_0$, then we have shown that $h_0$ is at least as hard as #PP2DNF. What follows is directly from the paper by Dalvi and Suciu (The dichotomy of probabilistic inference for unions of conjunctive queries).

Recall $h_0$‘s definition:

${h_0 = \exists x \exists y.R(x) \land S(x,y) \land T(y)}$

Given a PP2DNF instance $\Phi$ with $E \subseteq [n_1] \times [n_2]$, define a database with three tables: R(x), S(x,y), and T(y).

Set $R = [n_1], S = E, T = [n_2]$. Define the probabilities as follows:

${P(R(i)) = \frac{1}{2}}$

${P(S(i, j)) = 1}$

${P(T(j)) = \frac{1}{2}}$

Then every possible world has probability $1 / 2^{n_1 + n_2}$. On each possible world, $h_0$ is true if and only if there exists an $(i, j)$ pair in $E$ such that $R(i)$ and $T(j)$ are true. This is the same condition necessary for the PP2DNF instance $\Phi$ to be satisfied. We have:

${P(h_0) = 1 / 2^{n_1 + n_2} \, \# \Phi }$

If we can find $P(h_0)$ in polynomial time, we can also find $\# \Phi$. This completes the hardness proof for $h_0$.

## Wrapping Up

Although this is a fairly straightforward proof that $h_0$ is #P-hard, the full dichotomy result for unions of positive existential queries into PTIME or #P-hard require an extensive amount of machinery. While the discussion in this post barely scratches the surface, it hopefully has illustrated that evaluating queries in probabilistic databases is a challenging problem and relates directly to an interesting area of complexity theory.