
[info]daniele


Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery


Been a long difficult week
I have wonderful results to show for it though.

I have gotten the second large data source flowing through our metrics system and have the first report hooked up to it.

It's going to be interesting comparing the performance of the two data sources. Both have similar volume, but this second one is in a much cleaner-looking star schema, as opposed to the extremely denormalized single-table format of the first. Vertica handles both of these formats well, so I'm eager to figure out how close the performance is.

Kettle (a.k.a. Pentaho Data Integration) is a real winner here as far as enabling me to develop and maintain these very complex ETL processes. The ETL for the previous data source, working against the single table, clips along at over 30,000 records per second. This new ETL is a good bit slower, both because of a difference in the structure of the files I'm parsing and because I am doing foreign key lookups against seven dimensions. There is a lot of room for optimization in this ETL too, though.
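To make the foreign key lookup step concrete, here is a minimal sketch in Python. It is not Kettle's implementation and not the actual ETL described above: the class name, the column names, and the in-memory key assignment are all hypothetical stand-ins for a real dimension table lookup. It only shows why each incoming record costs one lookup per dimension before the fact row can be written.

```python
from typing import Dict, Hashable, List, Tuple


class DimensionLookup:
    """Hypothetical stand-in for a dimension table lookup.

    Maps a natural key (e.g. a product name) to a surrogate key and
    caches the result in memory, so repeated values skip the expensive path.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        self._keys: Dict[Hashable, int] = {}
        self._next_id = 1

    def key_for(self, natural_key: Hashable) -> int:
        surrogate = self._keys.get(natural_key)
        if surrogate is None:
            # Assumption: a real ETL would query (or insert into) the
            # dimension table here; this sketch just hands out the next id.
            surrogate = self._next_id
            self._keys[natural_key] = surrogate
            self._next_id += 1
        return surrogate


def to_fact_row(record: dict,
                dims: Dict[str, DimensionLookup],
                measures: List[str]) -> Tuple:
    """Replace each dimension column with its surrogate key, keep the measures."""
    keys = tuple(dims[col].key_for(record[col]) for col in dims)
    values = tuple(record[m] for m in measures)
    return keys + values
```

With seven dimensions, `to_fact_row` performs seven lookups per record, which is one reason such a transformation runs slower than one that writes straight into a single wide table. Caching the lookups is one of the more common places to look for optimization.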

It is somewhat difficult to optimize the throughput of the transformation for a headless server or when running a clustered transformation in Kettle. Pentaho is supposed to be coming out with some new management tools that will hopefully streamline things there.

One of the interesting things I ran into was the fact that because Kettle runs each step in a separate thread and these steps are passing around rows of data as array objects, certain server class hardware can actually perform much slower than desktop class hardware.
A case in point: a very simple transformation that does nothing more than generate several million records of data and pass them through a few steps can run at more than 700,000 records per second on my MacBook Pro with a 2.5 GHz Intel Core 2 Duo processor. The exact same transformation running on an HP blade with dual quad-core 2.5 GHz Intel Xeon processors and 16 GB of ECC memory tops out at about 350,000 records per second. Let me tell you, that was pretty depressing to witness! Of course, the saving grace here is that when there is a lot more work to be done than just passing pages of memory around between cores, the server can do a lot more work, faster. That is another thing that I'm hoping some R&D at Pentaho is going to help solve.
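The shape of that test transformation can be sketched as follows, assuming a thread per step with a bounded queue between neighbouring steps. This is a Python illustration of the structure only, not Kettle code: the function names and the `buffer_size` parameter are hypothetical, and because of Python's global interpreter lock the rates it reports are not comparable to the Kettle numbers above.

```python
import queue
import threading
import time

_DONE = object()  # sentinel that marks the end of the row stream


def generate(n_rows, out_q):
    """First step: produce rows as small list objects."""
    for i in range(n_rows):
        out_q.put([i, "row"])
    out_q.put(_DONE)


def pass_through(in_q, out_q):
    """Middle step: do no work, just hand each row to the next step."""
    while True:
        row = in_q.get()
        out_q.put(row)
        if row is _DONE:
            return


def run_pipeline(n_rows, n_steps=3, buffer_size=10000):
    """Run generator -> n_steps pass-through steps; return (rows, rows/sec)."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(n_steps + 1)]
    threads = [threading.Thread(target=generate, args=(n_rows, queues[0]))]
    for i in range(n_steps):
        threads.append(threading.Thread(
            target=pass_through, args=(queues[i], queues[i + 1])))

    start = time.perf_counter()
    for t in threads:
        t.start()
    count = 0
    while queues[-1].get() is not _DONE:  # the main thread is the final step
        count += 1
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return count, count / elapsed
```

A call such as `run_pipeline(1_000_000)` returns the row count and an observed rows-per-second figure. Since the steps do nothing, nearly all of the time goes to handing rows from one thread to the next, which is the cost the laptop-versus-blade comparison is exposing.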




Kettle vs Jitterbit?

(Anonymous)

2009-03-25 10:17 pm (UTC)

Interesting stuff on Kettle. I've been meaning to get around to trying it. I've been playing with Jitterbit, and it seems to do ETL pretty well, but it also has an EAI slant. Any idea how Kettle compares to Jitterbit's data integration solution (http://www.jitterbit.com)?

Re: Kettle vs Jitterbit?

[info]daniele

2009-03-26 03:08 am (UTC)

Kettle's name was originally coined because it was designed first and foremost as an ETL tool. When Pentaho acquired it, they renamed it Pentaho Data Integration, but I don't think it has that much in common with EAI data integration tools like Jitterbit. Jitterbit seems like a pretty decent integration tool from what I've read on the website. I wasn't able to see any demonstrations without signing up. I did see that they allow downloads without sign-ups via SourceForge, but I have not yet downloaded it to take it for a spin.

What I saw on the website seems to indicate that it has strong features for web service integration and XML handling. Kettle has a lot of strong features on the ETL side for things like streaming DB lookups, data warehouse population, sorting, filtering, (de)normalization, etc.

Thanks for the link.

