I have gotten the second large data source flowing through our metrics system and have the first report hooked up to it.
It's going to be interesting comparing the performance of the two data sources. Both have similar volume, but this second one is in a much cleaner star schema, as opposed to the extremely denormalized single-table format of the first. Vertica handles both layouts well, so I'm eager to see how close the performance is.
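To make the comparison concrete, here's roughly what the same report looks like against each layout, timed naively over JDBC. This is just a sketch: the table and column names (events_flat, fact_events, dim_region, dim_date) are made up, and you'd need the Vertica JDBC driver on the classpath with real connection details.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SchemaComparison {
    public static void main(String[] args) throws Exception {
        // Vertica-style JDBC URL; host, database, and credentials are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:vertica://localhost:5433/metrics", "dbadmin", "");

        // Against the denormalized table, the report is a scan of one wide table.
        String flatQuery =
            "SELECT region, COUNT(*) FROM events_flat "
          + "WHERE event_date >= '2008-01-01' GROUP BY region";

        // Against the star schema, the same report joins the fact table to dimensions.
        String starQuery =
            "SELECT r.name, COUNT(*) FROM fact_events f "
          + "JOIN dim_region r ON f.region_key = r.region_key "
          + "JOIN dim_date d ON f.date_key = d.date_key "
          + "WHERE d.cal_date >= '2008-01-01' GROUP BY r.name";

        for (String sql : new String[] { flatQuery, starQuery }) {
            long start = System.currentTimeMillis();
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) { /* drain the result set */ }
            }
            System.out.println(sql + "\n  -> " + (System.currentTimeMillis() - start) + " ms");
        }
        conn.close();
    }
}
```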
Kettle (a.k.a. Pentaho Data Integration) is a real winner here when it comes to developing and maintaining these very complex ETL processes. The ETL for the previous data source, working against the single table, clips along at over 30,000 records per second. This new ETL is a good bit slower, both because the file format I'm parsing is structured differently and because it does foreign-key lookups against seven dimensions. There's still plenty of room to optimize this ETL, though.
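For a sense of what each of those seven lookups is doing, here's a rough sketch: map a natural key from the source file to the dimension's surrogate key, caching answers along the way. The class and interface names (DimensionLookup, DimensionDao) are mine for illustration; in Kettle this work is handled by lookup steps, not hand-written code.

```java
import java.util.HashMap;
import java.util.Map;

public class DimensionLookup {
    private final Map<String, Long> cache = new HashMap<>();
    private final DimensionDao dao;  // hypothetical database access interface

    public DimensionLookup(DimensionDao dao) {
        this.dao = dao;
    }

    // One lookup per dimension per row: with seven dimensions, the ETL
    // pays this cost seven times for every record it parses.
    public Long surrogateKeyFor(String naturalKey) {
        // Caching is the biggest single optimization available here:
        // a warm cache turns a database round trip into a hash probe.
        return cache.computeIfAbsent(naturalKey, dao::findSurrogateKey);
    }

    public interface DimensionDao {
        Long findSurrogateKey(String naturalKey);
    }
}
```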
It is somewhat difficult to optimize a transformation's throughput on a headless server, or when running a clustered transformation in Kettle. Pentaho is supposed to be coming out with new management tools that will hopefully streamline that.
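For reference, running a transformation on a headless box doesn't require the Spoon GUI at all: the stock Pan command-line tool will do it, or you can drive it from Kettle's Java API. Here's a minimal sketch of the latter; the package and class names follow the org.pentaho.di API, but check them against your PDI version, and the .ktr path is a placeholder.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class HeadlessRunner {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                        // bootstrap the Kettle environment
        TransMeta meta = new TransMeta("/etl/parse_source2.ktr");
        Trans trans = new Trans(meta);
        trans.execute(null);                             // start all step threads
        trans.waitUntilFinished();                       // block until the pipeline drains
        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors.");
        }
    }
}
```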
One of the interesting things I ran into: because Kettle runs each step in a separate thread, and those steps pass rows of data around as array objects, certain server-class hardware can actually perform much slower than desktop-class hardware.
A case in point: a very simple transformation that does nothing more than generate several million records and pass them through a few steps can run at more than 700,000 records per second on my MacBook Pro with a 2.5 GHz Intel Core 2 Duo processor. The exact same transformation running on an HP blade with dual quad-core 2.5 GHz Intel Xeon processors and 16 GB of ECC memory tops out at about 350,000 records per second. Let me tell you, that was pretty depressing to witness! The saving grace is that when there is a lot more work to do than just passing pages of memory around between cores, the server can do a lot more work, faster. That is another thing I'm hoping some R&D at Pentaho will help solve.
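Here's a stripped-down sketch of that kind of benchmark, with nothing Kettle-specific in it: a generator thread hands Object[] rows through a pass-through thread via bounded queues, which is the same shape as Kettle wiring steps together with row buffers. Queue sizes and row counts are arbitrary. The point it illustrates is that throughput is dominated by cross-thread handoffs, which can be costlier between sockets on a multi-socket server than between two laptop cores sharing a cache.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RowPassingBench {
    static final Object[] DONE = new Object[0];  // end-of-stream marker
    static final int ROWS = 5_000_000;

    public static void main(String[] args) throws Exception {
        BlockingQueue<Object[]> q1 = new ArrayBlockingQueue<>(10_000);
        BlockingQueue<Object[]> q2 = new ArrayBlockingQueue<>(10_000);

        // "Generate rows" step: produces rows and pushes them downstream.
        Thread generator = new Thread(() -> {
            try {
                for (int i = 0; i < ROWS; i++) {
                    q1.put(new Object[] { (long) i, "row-" + i });
                }
                q1.put(DONE);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Pass-through step: does no work, just moves rows between queues.
        Thread passThrough = new Thread(() -> {
            try {
                Object[] row;
                while ((row = q1.take()) != DONE) q2.put(row);
                q2.put(DONE);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        long start = System.nanoTime();
        generator.start();
        passThrough.start();

        // Sink step: count rows as they arrive, then report throughput.
        long count = 0;
        while (q2.take() != DONE) count++;
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d rows in %.1fs = %.0f rows/sec%n", count, secs, count / secs);
    }
}
```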