Monday, October 20, 2008

Cross-column statistics

Of late, I've been involved in an interesting thread on the pgsql-hackers mailing list regarding cross-column statistics. I can admit no immediate personal use for such a feature, but that hasn't stopped me before (cf. PL/LOLCODE), and it's a very interesting subject. I'm blogging about it because, well, I can. I expect, though, that my kids won't let me finish it all now, and that this will be a multi-part post.

The idea of planner statistics on single columns is well understood and implemented all over the database world. That's what you run ANALYZE for -- to refresh those statistics. Ideally the statistics the database keeps will help it determine the best way to execute queries. For instance, assume you have a table called PERSON which stores FIRST_NAME and LAST_NAME, with an index on both columns, and you tell the database to SELECT * FROM PERSON WHERE FIRST_NAME = 'Alfred' AND LAST_NAME = 'Heiseldorf'. The planner can choose several different ways to execute this plan. It can:
  • Scan the entire table and check each row to find matches to the query

  • Scan the index on FIRST_NAME for 'Alfred', and scan the results for LAST_NAME = 'Heiseldorf'

  • Scan the index on LAST_NAME for 'Heiseldorf' and scan the results for FIRST_NAME = 'Alfred'


The planner has to choose which of these three options would be the fastest, and to do that it has to guess how many rows are involved in each method -- hence the statistics. PostgreSQL keeps track of, among several other things, a list of the most common values in each column, and what percentage of the table each comprises (most_common_vals and most_common_freqs in pg_stats). The length of each list is determined by the statistics target value for each column (or default_statistics_target if a target for that column is not specified). PostgreSQL will look at the statistics for the FIRST_NAME and LAST_NAME columns to see if 'Alfred' or 'Heiseldorf' are in the lists of most_common_values for their respective columns. See here for more specifics, but using these values, along with estimated numbers of rows in the table, the planner can guess that there are, say, 310 'Alfred's in the table and 4 'Heiseldorf's. In that case, the optimizer will probably choose to look for 'Heiseldorf' first using an index scan on the LAST_NAME field, meaning it will have to scan only those four rows looking for 'Alfred's.

There are lots of other details, and to avoid being too long (but mostly to avoid embarrassing myself by getting them wrong) I'll leave them out. Suffice it to say that having good guesses about the number of rows a query will return is important for getting decent query plans.

More on this subject later...

Sunday, October 12, 2008

Calculating pi in LOLCODE

Magnus Hagander asked me for a simple function to return a number using PL/LOLCODE. Here's a pi calculator, based on a post by Grzegorz Jaƛkiewicz on the PostgreSQL GENERAL mailing list showing how to find pi using PostgreSQL's new recursive query feature:

create or replace function lol_pi() returns float immutable language pllolcode as $$
HAI

I HAS A PIADD ITZ 0.0
I HAS A PISUB ITZ 0.0
I HAS A ITR ITZ 0
I HAS A ITRZ ITZ 20000
I HAS A T1
I HAS A T2

IM IN YR LOOP

T1 R QUOSHUNT OF 4.0 AN SUM OF 3.0 AN ITR
T2 R QUOSHUNT OF 4.0 AN SUM OF 5.0 AN ITR
PISUB R SUM OF PISUB AN T1
PIADD R SUM OF PIADD AN T2
ITR R SUM OF ITR AN 4.0
BOTH SAEM ITR AN BIGGR OF ITR AN ITRZ, O RLY?

YA RLY, GTFO

OIC

IM OUTTA YR LOOP
FOUND YR SUM OF 4.0 AN DIFF OF PIADD AN PISUB

KTHXBYE
$$;

select lol_pi();

UPDATE: Fixed indenting to make this look nicer, and credited the mailing list post
UPDATE PART 2: Note that this particular function doesn't use anything special about PL/LOLCODE, and AFAICS should run just fine in any LOLCODE interpreter.
UPDATE PART 3: Note also that there are better facilities for loops in LOLCODE, but PL/LOLCODE doesn't implement them yet, which is why I've got to track the counter explicitly, and explicitly GTFO when it hits ITRZ.
UPDATE PART 3: Some investigation has revealed that this method of calculating pi is called a Gregory-Leibniz series

Wednesday, October 8, 2008

PL/LOLCODE HOWTO


I'm working on a presentation for this weekend's PostgreSQL Conference. It's 90 minutes of me chatting about turning LOLCODE, of all things, into a PostgreSQL procedural language.