Thursday, December 18, 2008

I CAN HAS SQRT?

I was pointed at this today. Although it doesn't follow the LOLCODE spec as I read it (perhaps it follows some consensus about future specs that I've not paid attention to), it's neat: square roots in LOLCODE.

Friday, December 12, 2008

ForwardAgent

Periodically I'll learn something so useful I immediately wish I'd known it years ago, and so simple I'm almost ashamed to admit it was news to me. UNIX systems in general, and Linux in particular, seem to be especially good at coming up with such things. Yesterday was one such experience, when I learned about SSH's ForwardAgent option.

SSH agents let you use passphrase-protected SSH keys without typing in your passphrase all the time. If I connect from Host1 to Host2 via SSH, Host2 sends me a key challenge and my SSH agent answers it to authenticate me. But if I then use SSH to go from Host2 to Host3, by default I need to either type a password on Host3 (which is annoying and not terribly secure) or copy my private key to Host2 (which is annoying and terribly not secure) (yes, I meant to write it that way).

ForwardAgent (configured in ssh_config) allows Host3 to send a key challenge to Host2, and Host2 to forward the challenge to the SSH agent living on Host1. It's as if my private key followed me wherever I went. Neat :)
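
For reference, enabling it looks something like this in ~/.ssh/config on Host1 (the hostname below is made up; ssh -A does the same thing for a single connection). Worth doing only for hosts you trust, since root on the intermediate host can borrow the forwarded agent while you're connected:

# Forward my local SSH agent when connecting to this host, so key challenges
# from further hops get answered by the agent back on Host1.
Host host2.example.com
    ForwardAgent yes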

Thursday, December 11, 2008

Woah... a MUMPS person

Yesterday, on the bus, I found out that an otherwise unassuming acquaintance, who works at the local veterans' hospital, is a devotee of the VA's VistA health information system and the MUMPS system that supports it. I knew there were people still forced to use MUMPS; I didn't know anyone actually enjoyed it.

The VA open sourced VistA while I was working in medical software. I spent a few hours typing inscrutable strings of strange characters into an only semi-responsive console before giving up with the conclusion that anything that esoteric wouldn't be much of a competitor to our product. Which is not to say our product actually worked, just that it was easier to install.

Sunday, December 7, 2008

New sourdough starter

We like sourdough. It's neat, it's fun, it's yummy... but we've had problems with sourdough starters. I got a sourdough starter from a coworker a while back. It came with a pedigree: a settler had started it over a century before in the Joe's Valley area of Utah, and it had been shared and passed down through families ever since. This coworker grew up on sourdough pancakes from this same starter. For a while it worked well, and even when we didn't remember to treat it kindly, it still made decent bread. We learned that even though we couldn't see the difference between quality wheat and crummy wheat, the starter could, and flour from good wheat would rise better. We learned that the electric flour mill we used heated the flour too much, so the starter wouldn't grow well, and when we finally got our manual grain mill the starter worked much better. But eventually the starter just got nasty. Sourdough recipes talk about a "pleasant sour odor" but ours was far from pleasant. It was more like sour feet than sour dough.

So we started a new one. It's easy enough -- mix flour and water and let it sit out for a few days, periodically adding more flour and more water, and occasionally changing the bowl it's in to a nice clean one. Cheesecloth or perforated plastic wrap on the top is good for letting wild yeasts in and keeping bugs out, and in a few days you get something kinda foamy. That starter worked decently enough, but since last summer we've been fairly busy, and haven't been able to feed it regularly. So the other day we retired it from its dark corner of the refrigerator where it had spent the last few months, and I ground up some rye to start a new one.

Rye is supposed to be better than other grains for starting sourdough cultures because it has less phytic acid. Eventually we'll have to -- gently -- introduce wheat flour to this starter; it takes a while for a starter to make the transition to a new flour. But for now, we've got a pasty flour blob sitting in a bowl showing faint signs of bubbling. This being winter, the wild yeasts we collect aren't as active and our kitchen counter isn't as warm, so it takes a while for it to get started, but here are some pictures, just for kicks...

Friday, November 14, 2008

Cross-column statistics III

I'm hoping, possibly in vain, that in this post I'll actually get to the point I had in mind when I started these posts (previous editions available here (part 1) and here (part 2)). This wasn't originally supposed to be a series, much less an epic.

The last post introduced a few questions we need to answer in order to implement cross-column statistics, and got as far as demonstrating how PostgreSQL uses histograms and why it assumes columns are independent. Again, this assumption stems entirely from the fact that the planner has no better information to go on, not because it's a particularly correct assumption. As an example, height and weight of an individual are fairly closely correlated values; people who are relatively tall tend to be relatively heavy. In statistical terms, they're "dependent random variables". But since PostgreSQL doesn't know that, it can't make query plans that take advantage of the fact.
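
To see how much the independence assumption can cost, here's a quick sketch in plain Python -- nothing PostgreSQL-specific, and the numbers are invented -- comparing the product of two per-column selectivities (essentially what the planner does for an AND of two conditions) with the true joint selectivity for correlated height and weight columns:

import random

random.seed(42)

# Fake a table of people with correlated height (cm) and weight (kg):
# taller people tend to be heavier.
rows = []
for _ in range(100000):
    height = random.gauss(170, 10)
    weight = 0.9 * (height - 170) + random.gauss(70, 8)
    rows.append((height, weight))

# Predicate: height > 185 AND weight > 85
tall  = sum(1 for h, w in rows if h > 185) / len(rows)
heavy = sum(1 for h, w in rows if w > 85) / len(rows)
both  = sum(1 for h, w in rows if h > 185 and w > 85) / len(rows)

print("selectivity(height > 185)       = %.4f" % tall)
print("selectivity(weight > 85)        = %.4f" % heavy)
print("independence estimate (product) = %.4f" % (tall * heavy))
print("actual joint selectivity        = %.4f" % both)
# The product badly underestimates the joint selectivity, because it
# ignores the correlation between the two columns.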

There has been a fair bit of study into modeling the dependence between random variables. Some go so far as to say dependence is the most studied topic in statistics[1]. So there are a few different well-studied ways of modeling interactions between database columns. But the simple histogram method used for single columns probably won't work, for several reasons. Principal among them is what's called the "curse of dimensionality"[2], which says in effect that as you add more and more dimensions (in this case, columns) to a dataset (in this case, a histogram), the volume of the space increases exponentially, but the amount of data available to describe that space doesn't, so the details of the space become less and less defined as dimensions are added.

There are other ways. Right now I'm interested in what's called a copula[3], suggested on the pgsql-hackers list by Nathan Boley[4]. A copula is a function defined on the domain [0, 1] in multiple dimensions whose arguments are uniform random variables. Assume we are given two random variables x and y, drawn from populations X and Y respectively, and functions F(x) and G(y), where F(x) gives the proportion of members of the population X that are less than the value x, and similarly for G(y). In our application, X could be all the values from one column of a table, and Y could be all the values from another column in either the same table or some other table. Existing histogram techniques allow us to determine F(x) and G(y). Our cross-column statistics are trying to give us a value for H(x, y): the proportion of rows whose X value is less than x and whose Y value is, at the same time, less than y. For two variables, Sklar's theorem[see 3, again] states that H(x, y) = C(F(x), G(y)), where C is a copula.
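
Here's a toy illustration of Sklar's relationship, again in plain Python with synthetic data rather than anything from the planner: map each column through its own empirical distribution function to get values on [0, 1], and whatever joint structure remains is the copula's job to describe.

import bisect
import random

random.seed(1)

# Two correlated columns, standing in for, say, heights and weights.
xs, ys = [], []
for _ in range(10000):
    x = random.gauss(0, 1)
    y = 0.8 * x + 0.6 * random.gauss(0, 1)
    xs.append(x)
    ys.append(y)

def ecdf(values):
    # Empirical distribution function: F(v) = fraction of the sample <= v.
    ordered = sorted(values)
    return lambda v: bisect.bisect_right(ordered, v) / len(ordered)

F, G = ecdf(xs), ecdf(ys)

# Sklar: H(x, y) = C(F(x), G(y)).  Estimate both sides at one point.
x0, y0 = 0.5, 0.5
H = sum(1 for x, y in zip(xs, ys) if x <= x0 and y <= y0) / len(xs)
u, v = F(x0), G(y0)   # the uniform "coordinates" the copula works with
print("F(x0) = %.3f, G(y0) = %.3f" % (u, v))
print("joint proportion H(x0, y0) = %.3f" % H)
print("independence guess u * v   = %.3f" % (u * v))
# The gap between H and u*v is exactly the dependence a copula C is meant to capture.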

So what's so neat about copulas that we'd want to investigate them? Perhaps most important is that they help overcome the curse of dimensionality. They model a probability space, not a population space: each column gets mapped through its own distribution function onto [0, 1] first, so every new dimension you add to a copula is defined across that entire unit interval rather than across an ever-thinner slice of the data. Just think about that one, because I can't figure out how to explain it better. That's a blow to my ego, but a blessing in that it means this post won't be as long. :)

Copulas (or if you want to be pretentious, copulae) are also interesting because they're getting a lot of attention, which means there are people figuring out how to use them effectively (which is nice if we want to use them effectively within PostgreSQL). Economists and actuaries and the like use copulas a lot to model interactions between variables so they can tell you, for instance, how much they're raising your insurance premium.

All that said, if we decide copulas work to model this stuff, there will be a bunch more questions to answer:

1) What kind of copula will we use? There are lots to choose from, with names like "Ali-Mikhail-Haq" and "Farlie-Gumbel-Morgenstern"; a sketch of the latter appears after this list.
2) How will we store the copula in some quickly usable but not terribly huge format?
3) How will we actually use these things, anyway?
4) Which columns will we store this way?
... and lots of others.
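
To make the first question a bit more concrete, here's the Farlie-Gumbel-Morgenstern family mentioned above, sketched in plain Python. The single parameter theta (between -1 and 1) captures the dependence; the inputs u and v would come from the single-column histograms we already have, and the particular numbers below are made up:

def fgm_copula(u, v, theta):
    # Farlie-Gumbel-Morgenstern copula C(u, v), valid for -1 <= theta <= 1.
    # theta = 0 reduces to independence: C(u, v) = u * v.
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))

# Suppose the single-column histograms say 30% of rows have height <= x and
# 40% have weight <= y, and suppose ANALYZE had somehow fit theta = 0.8.
u, v = 0.30, 0.40
print("independence estimate:", u * v)                  # 0.12
print("FGM estimate:         ", fgm_copula(u, v, 0.8))  # about 0.16

FGM can only express fairly weak dependence, which is part of why question 1 matters at all.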

Finally, by request, an almost entirely unrelated point.

[1] K. Jogdeo, "Dependence, Concepts of", in: Samuel Kotz, Norman L. Johnson, and Campbell B. Read, Encyclopedia of Statistical Sciences 2 (1982), 324–334; as cited by Ioannidis, Christos and Williams, Julian M., "A Black Box Approach to Copulas: The Non-Parametric Empirical Copula Approach", available at SSRN: http://ssrn.com/abstract=955493
[2] http://en.wikipedia.org/wiki/Curse_of_dimensionality
[3] http://en.wikipedia.org/wiki/Copula_(statistics)
[4] http://archives.postgresql.org/pgsql-hackers/2008-10/msg00865.php

Thursday, November 6, 2008

Cross-column statistics II

My last post talked generally about how PostgreSQL uses statistics for query planning, and was written to introduce the topic of cross-column statistics, in which I've long been interested. This post will blather on for a while about what such things are, and why they might be neat. Or it might end up being more on single-column statistics, because a basis in one-column stats is useful for understanding cross-column stats.

PostgreSQL's statistics are kept for each column, and each column's statistics are independent of those for another column. The planner assumes that the data in one column are independent of the data in another column; that is, that the probability of a particular value showing up in a given column and row is independent of the other values in the row. The planner makes this assumption not because it's reasonable (it's not), but because it's a lot of work to do otherwise. In order to get past this independence assumption, the planner would need to keep "cross-column" statistics, or statistics that model the interactions between values of different columns.

There are a few questions to answer when considering implementing something like this. They include:
  • What statistics could we keep that would improve query planning? In other words, what statistics can we use?

  • How can we keep those statistics without taking too long to build them, too much space to keep them, or too long to use them when planning queries?

  • How do we know which columns have interesting interactions? In other words, which are dependent enough and involved in enough queries together to make it worth tracking their interactions?

  • Does this work for columns that are part of the same table, or can it work for columns in different tables, joined by some defined relationship?



What statistics could we keep that would be useful?
To answer this question, it helps to know how PostgreSQL uses its current statistics. Remember that the point of keeping column statistics is to estimate the number of rows in the table that meet a given condition, such as "LAST_NAME = 'Heiseldorf'" from "SELECT * FROM PERSON WHERE LAST_NAME = 'Heiseldorf'". To that end, PostgreSQL keeps several statistics values for each column, including:
  • the number of distinct values in the column

  • the fraction of NULL values (NULLs are often treated separately, as it's reasonable to assume the distribution of NULLs would behave differently than the distribution of other values)

  • the most common values in the column and their frequencies

  • a histogram describing the distribution of values in the column. "Histogram" means different things to different people; in this case it's a list of bucket boundaries that divide the column's values into N roughly equal-frequency buckets, where N is the column's statistics target value. In other words, if the statistics target is 10, the first boundary is greater than roughly 10% of the values in the column, the second boundary is greater than roughly 20% of the values, and so on.


The planner can use these values to approximate the number of rows where a particular column's value is equal to, greater than, or less than any other given value. For instance, to find how many values in a column equal "Foo", the planner will look for "Foo" in the most common values list. If it finds it, it uses the estimated number of total rows in the relation (pg_class.reltuples) and the estimated frequency of that particular value (available in pg_stats.most_common_freqs) to guess the number of rows containing that value. If the value isn't in the most common values list, it assumes the value is uniformly distributed among all the other "less common" values. For inequalities (e.g. "WHERE some_field < 'Foo'"), it looks through the histogram buckets to find which one contains "Foo". Assuming it finds "Foo" in the 5th bucket, it knows "Foo" is greater than all the values in the first four buckets. If there are 20 buckets (e.g. statistics target = 20), that means "Foo" is greater than 4/20, or 20%, of the rows in the table. Multiplying by pg_class.reltuples gives the number of rows the inequality operator is expected to return. For data types where a distance metric exists (like numeric values, where it can use the subtraction operator), PostgreSQL can assume a linear distribution of values within the histogram bucket and get a presumably more accurate estimate. But that refinement only works if the data type's values come from a metric space, that is, if there's a distance metric to interpolate with.
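
Here's the gist of that logic in plain Python -- a heavily simplified sketch with made-up statistics, not what the planner actually does internally:

# Toy per-column statistics, shaped loosely like pg_class / pg_stats entries.
reltuples = 100000                       # estimated rows in the table
most_common_vals  = ["Smith", "Jones", "Brown"]
most_common_freqs = [0.04, 0.03, 0.02]
# Histogram bucket boundaries; each bucket holds roughly an equal share of the rows.
histogram_bounds  = ["Adams", "Cook", "Foo", "Lee", "Ng", "Zwick"]

def estimate_equal(value):
    # Rows where column = value.
    if value in most_common_vals:
        return reltuples * most_common_freqs[most_common_vals.index(value)]
    # Otherwise assume the non-MCV rows are spread evenly over the remaining
    # distinct values (the real planner gets that count from n_distinct).
    n_distinct_non_mcv = 500
    return reltuples * (1.0 - sum(most_common_freqs)) / n_distinct_non_mcv

def estimate_less_than(value):
    # Rows where column < value, counting only whole histogram buckets.
    buckets = len(histogram_bounds) - 1
    for i in range(buckets):
        if value < histogram_bounds[i + 1]:
            # value lands in bucket i, so i full buckets lie entirely below it
            return reltuples * i / buckets
    return reltuples

print(estimate_equal("Jones"))       # MCV hit: 100000 * 0.03 = 3000
print(estimate_less_than("Foo"))     # 'Foo' lands in the 3rd of 5 buckets -> 40000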

Trying to extend this idea into multiple dimensions quickly becomes difficult, especially when one remembers that it has to apply to data types without a distance metric. If everything had a distance metric, we could (for instance) divide the space between the maximum and minimum values in each column into N uniform segments (N again being the column's statistics target) and create a P-dimensional matrix (where P is the number of columns involved) with N^P cells, each cell containing the count of rows whose values fall within the histogram bucket that cell defines.

This approach is simple, but fails miserably in practice. For starters, as you add columns to the mix, the number of histogram cells goes up exponentially. Also, this falls victim to the "curse of dimensionality"; put simply, although the size of the matrix increases exponentially as you add columns, the amount of data does not increase, so the amount of data in each part of the matrix decreases, until it's so small that you know effectively nothing about most of the matrix, and it becomes useless for estimation.
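
A quick back-of-the-envelope calculation (plain Python, made-up table size) shows how fast that goes wrong:

reltuples = 1000000          # a million-row table
stats_target = 100           # histogram buckets per column (N)

for columns in (1, 2, 3, 4, 5):
    cells = stats_target ** columns        # N^P histogram cells
    print("%d column(s): %15d cells, %12.4f rows per cell on average"
          % (columns, cells, reltuples / cells))
# By three or four columns most cells are empty, and the histogram says
# essentially nothing about most of the space it's supposed to describe.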

Further posts will explore more of the questions above, as well as ways to implement cross-column statistics.

Monday, October 20, 2008

Cross-column statistics

Of late, I've been involved in an interesting thread on the pgsql-hackers mailing list regarding cross-column statistics. I can admit no immediate personal use for such a feature, but that hasn't stopped me before (cf. PL/LOLCODE), and it's a very interesting subject. I'm blogging about it because, well, I can. I expect, though, that my kids won't let me finish it all now, and that this will be a multi-part post.

The idea of planner statistics on single columns is well understood and implemented all over the database world. That's what you run ANALYZE for -- to refresh those statistics. Ideally the statistics the database keeps will help it determine the best way to execute queries. For instance, assume you have a table called PERSON which stores FIRST_NAME and LAST_NAME, with an index on each column, and you tell the database to SELECT * FROM PERSON WHERE FIRST_NAME = 'Alfred' AND LAST_NAME = 'Heiseldorf'. The planner can choose several different ways to execute this query. It can:
  • Scan the entire table and check each row to find matches to the query

  • Scan the index on FIRST_NAME for 'Alfred', and scan the results for LAST_NAME = 'Heiseldorf'

  • Scan the index on LAST_NAME for 'Heiseldorf' and scan the results for FIRST_NAME = 'Alfred'


The planner has to choose which of these three options would be the fastest, and to do that it has to guess how many rows are involved in each method -- hence the statistics. PostgreSQL keeps track of, among several other things, a list of the most common values in each column, and what percentage of the table each comprises (most_common_vals and most_common_freqs in pg_stats). The length of each list is determined by the statistics target value for each column (or default_statistics_target if a target for that column is not specified). PostgreSQL will look at the statistics for the FIRST_NAME and LAST_NAME columns to see whether 'Alfred' or 'Heiseldorf' appears in the most_common_vals list for its column. See here for more specifics, but using these values, along with estimated numbers of rows in the table, the planner can guess that there are, say, 310 'Alfred's in the table and 4 'Heiseldorf's. In that case, the optimizer will probably choose to look for 'Heiseldorf' first using an index scan on the LAST_NAME field, meaning it will have to scan only those four rows looking for 'Alfred's.
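
In code, that decision boils down to something like the following rough sketch (made-up numbers and statistics; the real planner compares costs, not just row counts):

reltuples = 2000000   # guessed table size

# Most-common-value statistics for each column, as fractions of the table.
first_name_freqs = {"Alfred": 0.000155, "John": 0.02}
last_name_freqs  = {"Heiseldorf": 0.000002, "Smith": 0.01}

est_alfred     = reltuples * first_name_freqs["Alfred"]      # about 310 rows
est_heiseldorf = reltuples * last_name_freqs["Heiseldorf"]   # about 4 rows

# Drive the plan from whichever predicate is expected to match fewer rows,
# then apply the other predicate as a filter on just those rows.
if est_heiseldorf <= est_alfred:
    plan = "index scan on LAST_NAME, then filter FIRST_NAME = 'Alfred'"
else:
    plan = "index scan on FIRST_NAME, then filter LAST_NAME = 'Heiseldorf'"

print("estimates: Alfred ~ %.0f rows, Heiseldorf ~ %.0f rows" % (est_alfred, est_heiseldorf))
print("chosen plan:", plan)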

There are lots of other details, and to avoid being too long (but mostly to avoid embarrassing myself by getting them wrong) I'll leave them out. Suffice it to say that having good guesses about the number of rows a query will return is important for getting decent query plans.

More on this subject later...

Sunday, October 12, 2008

Calculating pi in LOLCODE

Magnus Hagander asked me for a simple function to return a number using PL/LOLCODE. Here's a pi calculator, based on a post by Grzegorz Jaśkiewicz on the PostgreSQL GENERAL mailing list showing how to find pi using PostgreSQL's new recursive query feature:

create or replace function lol_pi() returns float immutable language pllolcode as $$
HAI

I HAS A PIADD ITZ 0.0
I HAS A PISUB ITZ 0.0
I HAS A ITR ITZ 0
I HAS A ITRZ ITZ 20000
I HAS A T1
I HAS A T2

IM IN YR LOOP

T1 R QUOSHUNT OF 4.0 AN SUM OF 3.0 AN ITR
T2 R QUOSHUNT OF 4.0 AN SUM OF 5.0 AN ITR
PISUB R SUM OF PISUB AN T1
PIADD R SUM OF PIADD AN T2
ITR R SUM OF ITR AN 4.0
BOTH SAEM ITR AN BIGGR OF ITR AN ITRZ, O RLY?

YA RLY, GTFO

OIC

IM OUTTA YR LOOP
FOUND YR SUM OF 4.0 AN DIFF OF PIADD AN PISUB

KTHXBYE
$$;

select lol_pi();

UPDATE: Fixed indenting to make this look nicer, and credited the mailing list post
UPDATE PART 2: Note that this particular function doesn't use anything special about PL/LOLCODE, and AFAICS should run just fine in any LOLCODE interpreter.
UPDATE PART 3: Note also that there are better facilities for loops in LOLCODE, but PL/LOLCODE doesn't implement them yet, which is why I've got to track the counter explicitly, and explicitly GTFO when it hits ITRZ.
UPDATE PART 4: Some investigation has revealed that this method of calculating pi is called a Gregory-Leibniz series.

Wednesday, October 8, 2008

PL/LOLCODE HOWTO


I'm working on a presentation for this weekend's PostgreSQL Conference. It's 90 minutes of me chatting about turning LOLCODE, of all things, into a PostgreSQL procedural language.

Monday, September 15, 2008

Why You Should Consider PostgreSQL



I'll be presenting at the Utah PHP Users Group (UPHPU) this Thursday on PostgreSQL. I'm not much of a PHP person. The last time I used it heavily was shortly after PHP5 was released, and I found that between the changes I had to make to get PHP4 code to work in PHP5, the frequent security patches I had to install, and the changes I had to make to get my code to work with the patched PHP, it was all more work than I wanted to endure. So I switched to Java. That was also an inappropriate choice for many reasons, but it was an excellent introduction to the sorts of things the programmers were doing at the job I started around that same time, and since the project for which I was doing all this programming never went anywhere anyway, it's probably all for the best.

Anyway, I met "mindjuju", the president of UPHPU, at the recent Utah Open Source Conference, and since I'm president of the Utah Database Users Group we started chatting about user groups. We're trying to get UDBUG off the ground, and since most of our membership is already involved in other users groups, we figured one way to do that is to hold meetings in conjunction with other groups.

A few days later, I got a message on IRC from mindjuju suggesting that, as I'd expressed my devotion to all things PostgreSQL to him at UTOSC, I might want to do a presentation to introduce people to PostgreSQL and let them know why they might want to use it. Another message a few days later told me that the scheduled presenter at this month's UPHPU meeting might have to cancel and could I step in if that happened. Hence the talk, the first draft of which is above.

Sunday, September 14, 2008

Schematic, etc.


I'm hoping to get this blog caught up with the current state of the project, and as such, have spent the last $way_too_long messing with KiCad to draw a workable schematic of what I plan to do. See the image associated with this post for the somewhat disappointing results. KiCad made a nice PostScript schematic, but I had to settle for an ATmega8 in the image instead of the ATmega168 I'm using, and the multiplexers I've got aren't the 74LS154s shown, because of limitations in the KiCad component libraries as they exist on my laptop. Also, blogger.com wouldn't, so far as I could see, allow me to upload the original PostScript, so I've had to convert it to PNG, with a fair bit of lost resolution as a result.

Anyway, the idea is that the microcontroller counts, as quickly as possible, from 0 to 255, and puts the value out on PORTB, four bits of which act as the selectors for each of two 1:16 multiplexers. One multiplexer has a constant +5V input, and it electrifies one of 16 possible rows in a matrix of switches, depending on the selector values from the ATmega168. Then the other multiplexer, whose selectors are the lower four bits of PORTB, iterates through all 16 columns of the matrix. The output of that multiplexer goes to a pin on the microcontroller; if a charge manages to show up on that pin, the microcontroller knows that a switch has been pushed. The software on the ATmega168 will keep track of which keys are pushed at any given time, and when something changes, will send a signal to a computer somewhere. Here's hoping it actually works.
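
Since the scan is easier to see in code than in prose, here's a quick simulation of the addressing scheme in Python -- emphatically not the actual AVR firmware, just the logic:

# A 16x16 matrix of switches; pressed[row][col] is True when that key is down.
pressed = [[False] * 16 for _ in range(16)]
pressed[3][7] = True                       # pretend one pedal is being pushed

def sense(port_b):
    # Stand-in for the microcontroller's input pin: +5V only gets through when
    # the row chosen by the high nibble and the column chosen by the low nibble
    # meet at a closed switch.
    row = (port_b >> 4) & 0x0F             # selector bits for the row multiplexer
    col = port_b & 0x0F                    # selector bits for the column multiplexer
    return pressed[row][col]

# The scan loop: count 0..255, put the value "on PORTB", and read the sense pin.
keys_down = []
for port_b in range(256):
    if sense(port_b):
        keys_down.append(((port_b >> 4) & 0x0F, port_b & 0x0F))

print("switches detected:", keys_down)     # -> [(3, 7)]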

The multiplexers are really neat. In this project, they serve simply to give me a bunch more pins to work with. You can put a signal on a multiplexer's "input" pin and send it to one of 16 possible outputs by applying a binary number between 0 and 15 inclusive to the four selector pins. They also work in reverse, so if you're applying signals to the 16 pins, you can choose which signal ought to be fed out of what would then be the "output" pin. That lets me control a 16x16 matrix of 256 "switches" (the organ pedals, keys, and whatever else I want to control) with only eight microcontroller pins to select the switch I'm interested in. I feed +5V into one multiplexer constantly, and use one more microcontroller pin to see if that +5V gets through the switch matrix to tell me a key is pressed. Quick thanks to Intersil for sending me two free 1:8 multiplexers (three selector bits, eight possible outputs). If I don't find a place for them in this project, I'll find a place in another project :)

Yesterday I soldered together a 5V power supply with pretty much the schematic shown here. Now my test circuit on a breadboard on the floor next to the baby cradle behaves predictably :)

Tuesday, September 9, 2008

Background

I started this blog principally so I could document projects I'm working on, or thinking of working on, or concluding not to work on, etc. Principal among these is the organ I'm hoping to refurbish. One Christmas, my grandma called me to say that an aunt of mine was getting a new organ for Christmas (similar, if I remember correctly, to this one), giving her old one to grandma, and would I like grandma's old Hammond that's been sitting in the basement for years (similar to this one) and kinda still works?

I said yes, and with much grunting, straining, and herniating, we got the thing up the stairs and into my second floor condo, whereupon I discovered what little it could do. Half the time it doesn't even turn on, and the sounds it makes are pretty rough. However, thanks to a nice, open source, software organ synthesizer called Aeolus, my computer can do a decent organ impression. So I figured I'd build a MIDI interface into my new Hammond, whose console is pretty decent, internals notwithstanding, and via MIDI the console could tell my computer what sounds to make.

So I started learning electronics. It's loads of fun -- I've wanted to learn this stuff for years anyway, and now that I have a real project to apply it to, I find it's not too tough. Yet. :)

Monday, September 8, 2008

Why?

Someone asked me today, "Are you blogging this somewhere?" Having never felt the need to start a blog, much less update one (regularly or otherwise), I was somewhat taken aback. Who, after all, could possibly be interested, and if no one was interested, what difference if I just stick to scribbling occasional poorly composed ramblings into my neglected journal while slumped listlessly in a chair at the kitchen table at 4:00 AM, having once again allowed the nocturnal demands of my two little boys to chase from me all capacity for more sleep?

Whew... take a breath...

The answer: The Greater Good. Or perhaps not. In fact, it has been suggested that some of the activities I've seen fit to engage in (or to make vague plans one day to engage in) might prove useful or interesting or something to various denizens of the InterTubes. Not only that, I realized that my embarrassingly spotty journal writing habits didn't lend themselves well to the strict documentation required by experiments such as I intend to conduct in relation to the above activities, and that perhaps the convenience of blogging would somehow encourage more consistent record-keeping.

If you've made it this far with some measure of comprehension, my congratulations to you.

In short, I'm writing this because someone suggested they might be interested. That being the case, I'll try to ensure that the stuff I post aligns with the expressed interests of my readers, such as they are(n't). Should I fail in those efforts, well, phooey.