Friday, November 14, 2008

Cross-column statistics III

I'm hoping, possibly in vain, that in this post I'll actually get to the point I was hoping for when I started these posts (previous editions available here (part 1) and here (part 2). This wasn't originally supposed to be a series, much less an epic.

The last post introduced a few questions we need to answer in order to implement cross-column statistics, and got as far as demonstrating how PostgreSQL uses histograms and why it assumes columns are independent. Again, this assumption stems entirely from the fact that the planner has no better information go on, not because it's a particularly correct assumption. As an example, height and weight of an individual are fairly closely correlated values; people who are relatively tall tend to be relatively heavy. In statistical terms, they're "dependent random variables". But since PostgreSQL doesn't know that, it can't make query plans that take advantage of the fact.

There has been a fair bit of study into modeling the dependence between random variables. Some go so far as to say dependence is the most studied topic in statistics[1]. So there are a few different well-studied ways of modeling interactions between database columns. But the simple histogram method used for single columns probably won't work, for several reasons. Principle among them is what's called the "curse of dimensionality"[2], which says in effect that as you add more and more dimensions (in this case, columns) to a dataset (in this case, a histogram), the volume of the space increases exponentially, but the amount of data available to describe that space doesn't, so the details of the space are less and less defined as dimensions are added.

There are other ways. Right now I'm interested in what's called a copula[3], suggested on the pgsql-hackers list by Nathan Boley. A copula is a function defined on the domain [0, 1] in multiple dimensions whose arguments are uniform random variables. Assume we are given two random variables x and y, drawn from populations X and Y respectively, and functions F(x) and G(y), where F(x) gives the percentage of members of the population X that are less than the value x, and similarly for G(y). In our application, X could be all the values from one column of a table, and Y could be all the values from another column in either the same table or some other table. Existing histogram techniques allow us to determine F(x) and G(y). Our cross column statistics are trying to give us a value for H(x, y), which is the proportion of the joint population X + Y whose values are less than both x and y. For two variables, Sklar's theorem[see 3, again] states that H(x, y) = C(F(x), G(y)), where C is a copula.

So what's so neat about copulas that we'd want to investigate them? Perhaps most important is that they help overcome the curse of dimensionality. They model a probability space, not a population space, and every new dimension you add to a copula is defined across its entire probability. Just think about that one, because I can't figure out how to explain it better. That's a blow to my ego, but a blessing in that it means this post won't be as long. :)

Copulas (or if you want to be pretentious, copulae) are also interesting because they're getting a lot of attention, which means there are people figuring out how to use them effectively (which is nice if we want to use them effectively within PostgreSQL). Economists and actuaries and the like use copulas a lot to model interactions between variables so they can tell you, for instance, how much they're raising your insurance premium.

All that said, if we decide copulas work to model this stuff, there will be a bunch more questions to answer:

1) What kind of copula will we use? There are lots to choose from, with names like "Ali-Mikhail-Haq" and "Farlie-Gumbel-Morgenstern"
2) How will we store the copula in some quickly usable but not terribly huge format?
3) How will we actually use these things, anyway?
4) Which columns will we store this way?
... and lots of others.

Finally, by request, an almost entirely unrelated point.

[1] K. Jogdeo, Dependence, concepts of. in: Samuel kotz, norman l. johnson, and campbell b.
read, Encyclopedia of Statistical Sciences 2 (1982), 324–334. as cited by
Ioannidis, Christos and Williams, Julian M.,A Black Box Approach to Copulas: The Non-Parametric Empirical Copula Approach. Available at SSRN: http://ssrn.com/abstract=955493
[2] http://en.wikipedia.org/wiki/Curse_of_dimensionality
[3] http://en.wikipedia.org/wiki/Copula_(statistics)
[4] http://archives.postgresql.org/pgsql-hackers/2008-10/msg00865.php

1 comment:

The Wattson Family said...

Thanks for the support...literally. The article is great and now I know what I want for Christmas:)