Monday, October 20, 2008

Cross-column statistics

Of late, I've been involved in an interesting thread on the pgsql-hackers mailing list regarding cross-column statistics. I can admit no immediate personal use for such a feature, but that hasn't stopped me before (cf. PL/LOLCODE), and it's a very interesting subject. I'm blogging about it because, well, I can. I expect, though, that my kids won't let me finish it all now, and that this will be a multi-part post.

The idea of planner statistics on single columns is well understood and implemented all over the database world. That's what you run ANALYZE for -- to refresh those statistics. Ideally the statistics the database keeps will help it determine the best way to execute queries. For instance, assume you have a table called PERSON which stores FIRST_NAME and LAST_NAME, with an index on both columns, and you tell the database to SELECT * FROM PERSON WHERE FIRST_NAME = 'Alfred' AND LAST_NAME = 'Heiseldorf'. The planner can choose several different ways to execute this plan. It can:
  • Scan the entire table and check each row to find matches to the query

  • Scan the index on FIRST_NAME for 'Alfred', and scan the results for LAST_NAME = 'Heiseldorf'

  • Scan the index on LAST_NAME for 'Heiseldorf' and scan the results for FIRST_NAME = 'Alfred'


The planner has to choose which of these three options would be the fastest, and to do that it has to guess how many rows are involved in each method -- hence the statistics. PostgreSQL keeps track of, among several other things, a list of the most common values in each column, and what percentage of the table each comprises (most_common_vals and most_common_freqs in pg_stats). The length of each list is determined by the statistics target value for each column (or default_statistics_target if a target for that column is not specified). PostgreSQL will look at the statistics for the FIRST_NAME and LAST_NAME columns to see if 'Alfred' or 'Heiseldorf' are in the lists of most_common_values for their respective columns. See here for more specifics, but using these values, along with estimated numbers of rows in the table, the planner can guess that there are, say, 310 'Alfred's in the table and 4 'Heiseldorf's. In that case, the optimizer will probably choose to look for 'Heiseldorf' first using an index scan on the LAST_NAME field, meaning it will have to scan only those four rows looking for 'Alfred's.

There are lots of other details, and to avoid being too long (but mostly to avoid embarrassing myself by getting them wrong) I'll leave them out. Suffice it to say that having good guesses about the number of rows a query will return is important for getting decent query plans.

More on this subject later...

No comments: