Clip Man


Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery

Entries by tag: business intelligence

Been a long difficult week
Clip Man
I have wonderful results to show for it though.

I have gotten the second large data source flowing through our metrics system and have the first report hooked up to it.

It's going to be interesting comparing the performance of the two data sources. Both have similar volume, but this second one is in a much cleaner looking star schema as opposed to the extremely denormalized single table format. Vertica handles both of these formats well, so I'm eager to figure out how close the performance is.

Kettle (a.k.a. Pentaho Data Integration) is a real winner here as far as enabling me to develop and maintain these very complex ETL processes. The ETL for the previous data source working against the single table clips along at over 30,000 records per second. This new ETL is a good bit slower, both because of a difference in the file structure of what I'm parsing, and because I have seven dimensions that I am doing foreign key lookups in. There is lots of room for optimization in this ETL too though.

It is somewhat difficult to optimize the throughput of the transformation for a headless server or when running a clustered transformation in Kettle. Pentaho is supposed to be coming out with some new management tools that will hopefully streamline things there.

One of the interesting things I ran into was the fact that because Kettle runs each step in a separate thread and these steps are passing around rows of data as array objects, certain server class hardware can actually perform much slower than desktop class hardware.
A case in point: a very simple transformation that does nothing more than generate several million records of data and pass them through a few steps can run at more than 700,000 records per second on my MacBook Pro with a 2.5 GHz Intel Core 2 Duo processor. The exact same transformation running on a HP blade with dual quad core 2.5 GHz Intel Xeon processors and 16 GB of EEC memory tops out at about 350,000 records per second. Let me tell you, that was pretty depressing to witness! Of course, the saving grace here is that when there is a lot more work to be done than just passing pages of memory around between cores, the server can do a lot more work, faster. That is another thing that I'm hoping some R&D at Pentaho is going to help solve.

Powered by ScribeFire.

Status update
Clip Man
Released the first alpha of my project. And I've been pretty happy with the results so far. Gotten some good feedback and there is lots more work to be done.

The project is based on Pentaho and uses a Vertica cluster as the DB backend. I've gotten pretty amazing results out of the combination.

I've been spending a lot of time working with two community additions to Pentaho, the Community Build Framework (CBF) and the Community Dashboard Framework (CDF). These two amazing projects are being driven by Pedro Alves, a BI consultant specializing in Pentaho. They have really allowed my project to move along rapidly in the direction I wanted to take it.

The other exciting thing I hope to blog about further in the near future is the choropleth map I managed to implement in Pentaho. It was based on an example from Chris Schmidt. While writing this post, I just discovered that he lives nearby. I think I might have to take him out to lunch as a treat for the help he's given me. :)

I need to look into integrating the Simile Timeplot widget into my Pentaho dashboards. I really need the ability to provide rich annotations for momentary or duration events.

No posts this week. I won't let that be a foreshadow of things to come.
Clip Man
It was quite frustrating that no matter how many hours I poured into this proof of concept project, the deeper I got into the data source, the uglier the data got.  It is funny because the first data file that I chose pseudo-randomly to work with could not have been a worse choice.  It started out looking pretty reasonable and it just got steadily worse as I parsed through more of it.

I made pretty good progress after an all nighter Friday, but I'll need to speed things up quite dramatically to have any chance of reporting success.  I know of a half dozen things I can do to start optimizing and there are probably two dozen that I haven't thought of yet, but obviously, it was more important to me to get the process working completely first.

Making progress
Clip Man
I was worried up through lunch, but I am relieved to say that I've made a lot of progress today.
I was able to get sample data flowing through my reporting system from the database into the analysis engine.

The first part of next week will be spent in ETL land transforming our log data into a star schema and then migrating this working pipeline over to use that data instead of test data.

Whew.  I'm very happy to be able to end my first week here with something demonstrable.  If only there was someone here who I could show it to!  Most everyone who would care is off today. ::grumble::

You are viewing daniele