Thursday, November 6, 2008

Cross-column statistics II

My last post talked generally about how PostgreSQL uses statistics for query planning, and was written to introduce the topic of cross-column statistics, in which I've long been interested. This post will blather on for a while about what such things are, and why they might be neat. Or it might end up being more on single-column statistics, because a basis in one-column stats is useful for understanding cross-column stats.

PostgreSQL keeps statistics for each column individually, and those statistics say nothing about how one column's data relates to another's. The planner assumes that the data in one column are independent of the data in any other column; that is, that the probability of a particular value showing up in a given column of a row is independent of the other values in that row. The planner makes this assumption not because it's reasonable (it's not), but because it's a lot of work to do otherwise. In order to get past this independence assumption, the planner would need to keep "cross-column" statistics, or statistics that model the interactions between values of different columns.
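
To make the cost of that assumption concrete, here's a rough back-of-the-envelope sketch in Python. The table, column names, and numbers are all invented for illustration; the point is just that multiplying per-column selectivities together badly underestimates the result when the columns are correlated (think city and zip code):

    # Hypothetical table of people, with invented statistics.
    total_rows = 1_000_000

    # Selectivities the planner might derive from single-column statistics:
    # say 1 in 500 rows match city = 'Springfield', and 1 in 10,000 rows
    # match zip = '62704'.
    sel_city = 1 / 500
    sel_zip = 1 / 10_000

    # Under the independence assumption, the planner multiplies selectivities
    # to estimate rows matching "WHERE city = ... AND zip = ...".
    independent_estimate = total_rows * sel_city * sel_zip
    print(independent_estimate)    # 0.2 rows -- "expect to find almost nothing"

    # But a zip code determines its city, so every row matching the zip also
    # matches the city; the real count is closer to:
    correlated_estimate = total_rows * sel_zip
    print(correlated_estimate)     # 100.0 rows -- off by a factor of 500

A hundredfold misestimate like that can easily push the planner toward the wrong index or join strategy.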

There are a few questions to answer when considering implementing something like this. They include:
  • What statistics could we keep that would improve query planning? In other words, what statistics can we use?

  • How can we keep those statistics without taking too long to build them, too much space to keep them, or too long to use them when planning queries?

  • How do we know which columns have interesting interactions? In other words, which are dependent enough and involved in enough queries together to make it worth tracking their interactions?

  • Does this only work for columns that are part of the same table, or can it also work for columns in different tables, joined by some defined relationship?



What statistics could we keep that would be useful?
To answer this question, it helps to know how PostgreSQL uses its current statistics. Remember that the point of keeping column statistics is to estimate the number of rows whose value in some column satisfies a given condition, such as "LAST_NAME = 'Heiseldorf'" from "SELECT * FROM PERSON WHERE LAST_NAME = 'Heiseldorf'". To make estimates like that, PostgreSQL keeps several statistics for each column, including:
  • the number of distinct values in the column

  • the fraction of NULL values (NULLs are often treated separately, as it's reasonable to assume the distribution of NULLs would behave differently than the distribution of other values)

  • the most common values in the column and their frequencies

  • a histogram describing the distribution of values in the column. "Histogram" means different things for different people; in this case it's a list of the 1/Nth quantiles, where N is the column's statistics target value. In other words, if statistics target is 10, this will include a list of 10 values in order. The first value in the list is greater than 10% of the values in the column, the second value in the list is greater than 20% of the values in the column, and so on.


The planner can use these values to approximate the number of rows where a particular column's value is equal to, greater than, or less than some given value. For instance, to find how many rows in a column equal "Foo", the planner looks for "Foo" in the most common values list. If it finds it, it uses the estimated total number of rows in the relation (pg_class.reltuples) and the estimated frequency of that particular value (from pg_stats.most_common_freqs) to guess the number of rows containing that value. If the value isn't in the most common values list, the planner assumes it's distributed uniformly among all the other "less common" values.

For inequalities (e.g. "WHERE some_field < 'Foo'"), it looks through the histogram buckets to find which one contains "Foo". Say it finds "Foo" in the 5th bucket: then "Foo" is greater than all the values in the first four buckets. If there are 20 buckets (i.e. statistics target = 20), that means "Foo" is greater than the values in 4/20, or 20%, of the rows in the table. Multiplying that fraction by pg_class.reltuples gives the number of rows the inequality is expected to return. For data types with a distance metric (numeric values, say, where the planner can use the subtraction operator), PostgreSQL can go a step further, assume a linear distribution of values within the bucket containing "Foo", and interpolate a presumably more accurate estimate; that refinement only works when the data type's values come from a metric space.
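
Here's a small Python sketch of that arithmetic. The statistics are invented and the functions gloss over a lot of what the real selectivity code does (for one thing, the MCV list and the histogram interact in ways this ignores), but it shows the shape of the two estimates: check the most-common-values list for equality, walk the histogram for inequalities.

    # Invented statistics for one text column, loosely mirroring what
    # pg_stats exposes (most_common_vals, most_common_freqs, histogram_bounds,
    # n_distinct) plus pg_class.reltuples.
    reltuples = 100_000                      # estimated rows in the table
    mcv = {'Smith': 0.02, 'Jones': 0.015}    # value -> estimated frequency
    n_distinct = 5_000
    # Histogram: ten boundary values, roughly the 10%, 20%, ... quantiles.
    histogram = ['Adams', 'Baker', 'Cole', 'Drake', 'Evans',
                 'Fox', 'Grant', 'Hill', 'Irwin', 'Young']

    def estimate_equal(value):
        """Guess how many rows have column = value."""
        if value in mcv:
            return reltuples * mcv[value]
        # Not a common value: spread the remaining rows evenly over the
        # remaining distinct values.
        remaining_fraction = 1.0 - sum(mcv.values())
        remaining_distinct = n_distinct - len(mcv)
        return reltuples * remaining_fraction / remaining_distinct

    def estimate_less_than(value):
        """Guess how many rows have column < value, using the histogram."""
        # Each histogram boundary the value exceeds represents one full
        # bucket, i.e. 1/len(histogram) of the rows, below it.
        buckets_below = sum(1 for bound in histogram if bound < value)
        return reltuples * buckets_below / len(histogram)

    print(estimate_equal('Smith'))         # 2000.0
    print(estimate_equal('Heiseldorf'))    # about 19.3
    print(estimate_less_than('Foo'))       # 'Foo' clears 5 boundaries: 50000.0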

Trying to extend this idea into multiple dimensions quickly becomes difficult, especially when one remembers that it has to apply to data types without a distance metric. If everything had a distance metric, we could, for instance, divide the space between the minimum and maximum values of each column into N uniform segments (N again being the column's statistics target) and build a P-dimensional matrix (where P is the number of columns involved) with N^P cells, each cell holding the count of rows whose values fall within the histogram bucket that cell defines.
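
A rough Python sketch of that construction, using a tiny invented data set of two numeric columns (it leans on subtraction to compute bucket widths, which is exactly the distance metric that non-numeric types may not have):

    # Invented two-column numeric data set.
    rows = [(23, 1.5), (47, 2.2), (51, 2.4), (52, 2.5), (90, 7.8)]
    N = 4    # uniform segments per column (the "statistics target")
    P = 2    # number of columns involved

    # Each column's minimum and bucket width, from its min and max.
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    width = [(max(c) - min(c)) / N for c in cols]

    def bucket(value, dim):
        """Which of column dim's N uniform segments this value falls into."""
        b = int((value - lo[dim]) / width[dim])
        return min(b, N - 1)    # the column's maximum lands in the last bucket

    # The N**P cells of the P-dimensional histogram, kept sparsely in a dict
    # keyed by a tuple of per-column bucket numbers.
    cells = {}
    for row in rows:
        key = tuple(bucket(value, dim) for dim, value in enumerate(row))
        cells[key] = cells.get(key, 0) + 1

    print(cells)    # {(0, 0): 1, (1, 0): 3, (3, 3): 1}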

This approach is simple, but fails miserably in practice. For starters, the number of histogram cells (and thus the space needed to store and scan them) grows exponentially as you add columns. Worse, it falls victim to the "curse of dimensionality": the matrix grows exponentially with each added column, but the amount of data does not, so the number of rows landing in each cell keeps shrinking until most of the matrix tells you effectively nothing and it becomes useless for estimation.
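
A quick bit of arithmetic shows how fast that happens. With a statistics target of 100 and a million-row table (both numbers pulled out of thin air):

    N = 100            # segments per column (statistics target)
    rows = 1_000_000   # rows in the table

    for P in range(1, 5):    # number of columns tracked together
        cells = N ** P
        print(f"{P} column(s): {cells:>11,} cells, "
              f"{rows / cells:>10.2f} rows per cell on average")

    # 1 column(s):         100 cells,   10000.00 rows per cell on average
    # 2 column(s):      10,000 cells,     100.00 rows per cell on average
    # 3 column(s):   1,000,000 cells,       1.00 rows per cell on average
    # 4 column(s): 100,000,000 cells,       0.01 rows per cell on average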

Further posts will explore more of the questions above, as well as ways to implement cross-column statistics.
