Friday, November 14, 2008

Cross-column statistics III

I'm hoping, possibly in vain, that in this post I'll actually get to the point I had in mind when I started these posts (previous editions available here (part 1) and here (part 2)). This wasn't originally supposed to be a series, much less an epic.

The last post introduced a few questions we need to answer in order to implement cross-column statistics, and got as far as demonstrating how PostgreSQL uses histograms and why it assumes columns are independent. Again, this assumption stems entirely from the fact that the planner has no better information to go on, not because it's a particularly correct assumption. As an example, a person's height and weight are fairly closely correlated: people who are relatively tall tend to be relatively heavy. In statistical terms, they're "dependent random variables". But since PostgreSQL doesn't know that, it can't make query plans that take advantage of the fact.

There has been a fair bit of study into modeling the dependence between random variables; some go so far as to say dependence is the most studied topic in statistics[1]. So there are a few different well-studied ways of modeling interactions between database columns. But the simple histogram method used for single columns probably won't work, for several reasons. Principal among them is what's called the "curse of dimensionality"[2], which says in effect that as you add more and more dimensions (in this case, columns) to a dataset (in this case, a histogram), the volume of the space increases exponentially but the amount of data available to describe that space doesn't, so the space becomes less and less well described as dimensions are added.

There are other ways. Right now I'm interested in what's called a copula[3], suggested on the pgsql-hackers list by Nathan Boley[4]. A copula is a function defined on the multidimensional unit cube [0, 1]^n whose arguments are uniform random variables. Assume we are given two random variables x and y, drawn from populations X and Y respectively, and functions F(x) and G(y), where F(x) gives the proportion of members of the population X that are less than the value x, and similarly for G(y). In our application, X could be all the values from one column of a table, and Y could be all the values from another column in either the same table or some other table. Existing histogram techniques allow us to determine F(x) and G(y). Our cross-column statistics are trying to give us H(x, y), the proportion of rows whose first value is less than x and whose second value is less than y. For two variables, Sklar's theorem[see 3, again] states that H(x, y) = C(F(x), G(y)), where C is a copula.
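
To make Sklar's theorem concrete, here's a minimal sketch in Python (none of this is PostgreSQL code; the two "columns" are invented, correlated samples). The empirical CDFs play the role of F and G, which in PostgreSQL would come from the existing per-column histograms, and the empirical copula plays the role of C:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented "columns": height (cm) and weight (kg), deliberately correlated.
height = rng.normal(170, 10, size=5000)
weight = 0.9 * (height - 170) + rng.normal(70, 8, size=5000)

def F(sample, value):
    """Empirical CDF: fraction of the sample at or below value (our F and G)."""
    return np.mean(sample <= value)

def H(xs, ys, x, y):
    """Empirical joint CDF: fraction of rows at or below both x and y."""
    return np.mean((xs <= x) & (ys <= y))

def C(xs, ys, u, v):
    """Empirical copula: the joint CDF evaluated at the u- and v-quantiles."""
    return H(xs, ys, np.quantile(xs, u), np.quantile(ys, v))

x, y = 180.0, 75.0
u, v = F(height, x), F(weight, y)

# Sklar's theorem: H(x, y) = C(F(x), G(y)).
print("H(x, y)       =", H(height, weight, x, y))
print("C(F(x), G(y)) =", C(height, weight, u, v))
print("F(x) * G(y)   =", u * v)  # what the independence assumption would give
```

The first two numbers agree (up to sampling noise), as Sklar's theorem promises; the third, the independence guess, does not, because the columns are correlated.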

So what's so neat about copulas that we'd want to investigate them? Perhaps most important is that they help get around the curse of dimensionality. A copula models a probability space, not a population space: its arguments are the marginal probabilities F(x) and G(y), each of which ranges over all of [0, 1] no matter what the underlying column looks like. The marginals are already handled by the existing single-column statistics, so each new dimension added to the copula only has to describe how that column interacts with the others, not re-describe its values. That also means this post won't be as long. :)

Copulas (or if you want to be pretentious, copulae) are also interesting because they're getting a lot of attention, which means there are people actively figuring out how to use them effectively (which is nice if we want to put them to work within PostgreSQL). Economists and actuaries and the like use copulas a lot to model interactions between variables so they can tell you, for instance, how much they're raising your insurance premium.

All that said, if we decide copulas work to model this stuff, there will be a bunch more questions to answer:

1) What kind of copula will we use? There are lots to choose from, with names like "Ali-Mikhail-Haq" and "Farlie-Gumbel-Morgenstern".
2) How will we store the copula in some quickly usable but not terribly huge format?
3) How will we actually use these things, anyway?
4) Which columns will we store this way?
... and lots of others.


[1] K. Jogdeo, "Dependence, Concepts of," in Samuel Kotz, Norman L. Johnson, and Campbell B. Read (eds.), Encyclopedia of Statistical Sciences 2 (1982), 324–334; as cited by Christos Ioannidis and Julian M. Williams, "A Black Box Approach to Copulas: The Non-Parametric Empirical Copula Approach." Available at SSRN: http://ssrn.com/abstract=955493
[2] http://en.wikipedia.org/wiki/Curse_of_dimensionality
[3] http://en.wikipedia.org/wiki/Copula_(statistics)
[4] http://archives.postgresql.org/pgsql-hackers/2008-10/msg00865.php

Thursday, November 6, 2008

Cross-column statistics II

My last post talked generally about how PostgreSQL uses statistics for query planning, and was written to introduce the topic of cross-column statistics, in which I've long been interested. This post will blather on for a while about what such things are, and why they might be neat. Or it might end up being more on single-column statistics, because a basis in one-column stats is useful for understanding cross-column stats.

PostgreSQL's statistics are kept for each column, and each column's statistics are independent of those for another column. The planner assumes that the data in one column are independent of the data in another column; that is, that the probability of a particular value showing up in a given column and row is independent of the other values in the row. The planner makes this assumption not because it's reasonable (it's not), but because it's a lot of work to do otherwise. In order to get past this independence assumption, the planner would need to keep "cross-column" statistics, or statistics that model the interactions between values of different columns.
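
To see what that assumption costs, here's a small sketch in Python (the table, columns, and predicates are all invented; this is not planner code). It compares the row count the independence assumption would predict for a two-clause WHERE against the actual count when the columns are correlated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented table: height (cm) and weight (kg), deliberately correlated.
n_rows = 100_000
height = rng.normal(170, 10, size=n_rows)
weight = 0.9 * (height - 170) + rng.normal(70, 8, size=n_rows)

# Imagine: SELECT * FROM person WHERE height > 185 AND weight > 85
sel_height = np.mean(height > 185)                    # single-column selectivity
sel_weight = np.mean(weight > 85)                     # single-column selectivity
sel_both   = np.mean((height > 185) & (weight > 85))  # true combined selectivity

# Under the independence assumption, the planner multiplies the two
# single-column selectivities; for correlated columns that can be badly wrong.
print("independence estimate:", int(sel_height * sel_weight * n_rows), "rows")
print("actual               :", int(sel_both * n_rows), "rows")
```

With these correlated columns the independence estimate comes out several times too low, and that misestimate is what the planner feeds into its cost model when choosing between, say, an index scan and a sequential scan.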

There are a few questions to answer when considering implementing something like this. They include:
  • What statistics could we keep that would improve query planning? In other words, what statistics can we use?

  • How can we keep those statistics without taking too long to build them, too much space to keep them, or too long to use them when planning queries?

  • How do we know which columns have interesting interactions? In other words, which are dependent enough and involved in enough queries together to make it worth tracking their interactions?

  • Does this work for columns that are part of the same table, or can it work for columns in different tables, joined by some defined relationship?



What statistics could we keep that would be useful?
To answer this question, it helps to know how PostgreSQL uses its current statistics. Remember that the point of keeping column statistics is to estimate the number of rows in a table that meet a given condition, such as "LAST_NAME = 'Heiseldorf'" in "SELECT * FROM PERSON WHERE LAST_NAME = 'Heiseldorf'". To make these estimates, PostgreSQL keeps several statistics values for each column, including:
  • the number of distinct values in the column

  • the fraction of NULL values (NULLs are often treated separately, as it's reasonable to assume the distribution of NULLs would behave differently than the distribution of other values)

  • the most common values in the column and their frequencies

  • a histogram describing the distribution of values in the column. "Histogram" means different things for different people; in this case it's a list of the 1/Nth quantiles, where N is the column's statistics target value. In other words, if statistics target is 10, this will include a list of 10 values in order. The first value in the list is greater than 10% of the values in the column, the second value in the list is greater than 20% of the values in the column, and so on.


The planner can use these values to approximate the number of rows where a particular column's value is equal to, greater than, or less than any other given value. For instance, to find how many values in a column equal "Foo", the planner looks for "Foo" in the most common values list. If it finds it, it uses the estimated number of total rows in the relation (pg_class.reltuples) and the estimated frequency of that particular value (available in pg_stats.most_common_freqs) to guess the number of rows containing that value. If the value isn't in the most common values list, it assumes the remaining rows are distributed uniformly across all the other, "less common" values.

For inequalities (e.g. "WHERE some_field < 'Foo'"), the planner looks through the histogram buckets to find which one contains "Foo". Assuming it finds "Foo" in the 5th bucket, it knows "Foo" is greater than all the values in the first four buckets. If there are 20 buckets (e.g. statistics target = 20), that means "Foo" is greater than 4/20, or 20%, of the rows in the table. Multiplying by pg_class.reltuples gives the number of rows the inequality operator is expected to return. For data types with a distance metric (like numeric values, where it can use the subtraction operator), PostgreSQL can assume a linear distribution of values within the partially covered bucket and interpolate for a presumably more accurate estimate; that refinement only works when the data type's values come from a metric space, that is, where there's a distance metric.
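
Here's a rough sketch of that arithmetic in Python, using invented stand-ins for pg_class.reltuples and the pg_stats fields described above. The real planner logic (in PostgreSQL's selfuncs.c) handles many more cases, but the basic shape is the same:

```python
from bisect import bisect_left

# Invented statistics for a single integer column.
reltuples         = 1_000_000                # pg_class.reltuples
most_common_vals  = [7, 13, 42]              # pg_stats.most_common_vals
most_common_freqs = [0.04, 0.03, 0.02]       # pg_stats.most_common_freqs
n_distinct        = 5_000                    # pg_stats.n_distinct
histogram_bounds  = list(range(0, 105, 5))   # 21 bounds = 20 equal-frequency buckets

def rows_equal(value):
    """Estimated rows for 'column = value'."""
    if value in most_common_vals:
        return reltuples * most_common_freqs[most_common_vals.index(value)]
    # Not a most-common value: spread the remaining rows evenly across the
    # remaining distinct values.
    remaining_frac = 1.0 - sum(most_common_freqs)
    return reltuples * remaining_frac / (n_distinct - len(most_common_vals))

def rows_less_than(value):
    """Estimated rows for 'column < value', counting whole histogram buckets."""
    buckets = len(histogram_bounds) - 1
    containing_bucket = bisect_left(histogram_bounds, value)  # bucket holding value
    full_buckets = max(0, min(containing_bucket - 1, buckets))
    # Without a distance metric, whole buckets are as precise as it gets;
    # with one, the planner also interpolates inside the partial bucket.
    return reltuples * full_buckets / buckets

print(rows_equal(13))      # an MCV: 0.03 * 1,000,000 = 30,000 rows
print(rows_equal(99))      # not an MCV: roughly 182 rows
print(rows_less_than(27))  # value falls in the 6th of 20 buckets -> 250,000 rows
```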

Trying to extend this idea into multiple dimensions quickly becomes difficult, especially when one remembers that it has to apply to data types without a distance metric. If everything had a distance metric, we could (for instance) divide the space between the maximum and minimum values in each column into N uniform segments (N again being the column's statistics target) and build a P-dimensional matrix (where P is the number of columns involved) with N^P cells, each cell holding the count of rows whose values fall within the multidimensional histogram bucket that cell defines.
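
For illustration, here's a sketch of that naive scheme for two numeric columns (the data, N, and the query box are invented; numpy's histogramdd is just a convenient way to build the N^P matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two invented numeric columns from the same table, deliberately correlated.
n_rows = 10_000
a = rng.normal(0.0, 1.0, size=n_rows)
b = 0.8 * a + rng.normal(0.0, 0.6, size=n_rows)

N = 10  # uniform segments per column (the "statistics target")
# counts is an N x N matrix; each cell holds the number of rows falling into
# the corresponding two-dimensional bucket.  edges holds the bucket boundaries.
counts, edges = np.histogramdd(np.column_stack([a, b]), bins=N)

def estimated_rows(a_lo, a_hi, b_lo, b_hi):
    """Estimate rows with a in [a_lo, a_hi) and b in [b_lo, b_hi) by summing
    the cells whose centers fall inside the box (a crude approximation)."""
    a_edges, b_edges = edges
    a_centers = (a_edges[:-1] + a_edges[1:]) / 2
    b_centers = (b_edges[:-1] + b_edges[1:]) / 2
    a_in = (a_centers >= a_lo) & (a_centers < a_hi)
    b_in = (b_centers >= b_lo) & (b_centers < b_hi)
    return counts[np.ix_(a_in, b_in)].sum()

box = (a >= 0.5) & (a < 4.0) & (b >= 0.5) & (b < 4.0)
print("estimated:", int(estimated_rows(0.5, 4.0, 0.5, 4.0)))
print("actual   :", int(box.sum()))
```

The estimate is approximate even with just two columns and plenty of rows per cell; the next paragraph describes why it degrades badly as columns are added.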

This approach is simple, but fails miserably in practice. For starters, as you add columns to the mix, the number of histogram cells (and the space needed to store them) grows exponentially. Worse, it falls victim to the "curse of dimensionality": although the matrix grows exponentially as you add columns, the amount of data describing it does not, so the number of rows represented by each cell shrinks until you know effectively nothing about most of the matrix, and it becomes useless for estimation.
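
To put rough numbers on that (the sample size and bucket count here are invented, not PostgreSQL's defaults): the number of rows sampled to build the statistics stays roughly fixed, while the number of cells to fill grows as N^P.

```python
sample_rows = 30_000   # invented: rows sampled to build the statistics
N = 10                 # buckets per column

for p in range(1, 6):
    cells = N ** p
    print(f"{p} column(s): {cells:>7,} cells, "
          f"~{sample_rows / cells:,.2f} sampled rows per cell")
```

By five columns there is, on average, less than one sampled row per cell, at which point the matrix tells you almost nothing about most of the space.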

Further posts will explore more of the questions above, as well as ways to implement cross-column statistics.