Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery


Entries by tag: vertica

Counting unique visitors in SQL
A lot of web metrics solutions out there like NetTracker or Omniture allow you to perform analysis on the number of unique visitors over time. This is a pretty important metric to a lot of companies, and I recently needed to perform such an analysis, but it was on data stored in a SQL database rather than in one of these proprietary solutions' data stores.

Doing any sort of distinct counting on a large volume of data in SQL can be very costly, both in terms of storage of the raw data (since you can't aggregate it), and in query performance since there are relatively few optimizations that can be performed on the table or the query.
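
To make the problem concrete, here is a minimal sketch of the naive form of the query, against a hypothetical page_views table with visitor_id and view_date columns (the names and schema here are just placeholders, not the ones I actually used):

    -- Naive daily unique-visitor count (illustrative only).
    -- COUNT(DISTINCT ...) forces the database to track every distinct
    -- visitor id per day, which is what makes this expensive at volume.
    SELECT view_date,
           COUNT(DISTINCT visitor_id) AS unique_visitors
    FROM   page_views
    GROUP  BY view_date
    ORDER  BY view_date;

The real pain shows up when you want uniques over an arbitrary date range: the daily counts can't simply be summed, because the same visitor may show up on many different days.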

Below are some highlights of how I implemented this.

Been a long difficult week
I have wonderful results to show for it though.

I have gotten the second large data source flowing through our metrics system and have the first report hooked up to it.

It's going to be interesting comparing the performance of the two data sources. Both have similar volume, but this second one is in a much cleaner-looking star schema, as opposed to the extremely denormalized single-table format. Vertica handles both of these formats well, so I'm eager to see how close the performance is.
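
To illustrate the difference (with made-up table and column names, not our real schema), the same question looks roughly like this against each layout:

    -- Denormalized single table: every attribute lives on one wide row.
    SELECT browser_name,
           COUNT(*) AS events
    FROM   events_flat
    GROUP  BY browser_name;

    -- Star schema: the fact table carries surrogate keys and joins out
    -- to small dimension tables at query time.
    SELECT d.browser_name,
           COUNT(*) AS events
    FROM   event_facts f
    JOIN   dim_browser d ON d.browser_key = f.browser_key
    GROUP  BY d.browser_name;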

Kettle (a.k.a. Pentaho Data Integration) is a real winner here when it comes to enabling me to develop and maintain these very complex ETL processes. The ETL for the previous data source, working against the single table, clips along at over 30,000 records per second. This new ETL is a good bit slower, both because of a difference in the file structure of what I'm parsing and because I'm doing foreign key lookups against seven dimensions. There is lots of room for optimization in this ETL too, though.
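
Conceptually, each of those lookups amounts to something like the following (again, purely illustrative names); a lookup step can cache this, but it is still per-row work across seven dimensions:

    -- Resolve the natural key parsed from the log line into the
    -- dimension's surrogate key before the fact row is written.
    -- One lookup like this happens per dimension, per incoming record.
    SELECT d.browser_key
    FROM   dim_browser d
    WHERE  d.user_agent = ?;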

It is somewhat difficult to optimize the throughput of the transformation for a headless server or when running a clustered transformation in Kettle. Pentaho is supposed to be coming out with some new management tools that will hopefully streamline things there.

One of the interesting things I ran into was the fact that because Kettle runs each step in a separate thread and these steps are passing around rows of data as array objects, certain server-class hardware can actually perform much slower than desktop-class hardware.
A case in point: a very simple transformation that does nothing more than generate several million records of data and pass them through a few steps can run at more than 700,000 records per second on my MacBook Pro with a 2.5 GHz Intel Core 2 Duo processor. The exact same transformation running on an HP blade with dual quad-core 2.5 GHz Intel Xeon processors and 16 GB of ECC memory tops out at about 350,000 records per second. Let me tell you, that was pretty depressing to witness! Of course, the saving grace here is that when there is a lot more work to be done than just passing pages of memory around between cores, the server can do a lot more work, faster. That is another thing that I'm hoping some R&D at Pentaho is going to help solve.



Status update
I released the first alpha of my project, and I've been pretty happy with the results so far. I've gotten some good feedback, and there is lots more work to be done.

The project is based on Pentaho and uses a Vertica cluster as the DB backend. I've gotten pretty amazing results out of the combination.

I've been spending a lot of time working with two community additions to Pentaho, the Community Build Framework (CBF) and the Community Dashboard Framework (CDF). These two amazing projects are being driven by Pedro Alves, a BI consultant specializing in Pentaho. They have really allowed my project to move along rapidly in the direction I wanted to take it.

The other exciting thing I hope to blog about further in the near future is the choropleth map I managed to implement in Pentaho. It was based on an example from Chris Schmidt. While writing this post, I just discovered that he lives nearby. I think I might have to take him out to lunch as a treat for the help he's given me. :)

I need to look into integrating the Simile Timeplot widget into my Pentaho dashboards. I really need the ability to provide rich annotations for momentary or duration events.


Nyah, I didn't really want to release today.
So I was doing final cleanup and preparation for releasing the first version of my project to a wider audience at work, and one of my machines developed a bad case of fingers-in-ears. By that I mean it stopped listening to its cluster brothers, and then it stopped listening to me. The load average kept climbing higher and higher, and while I was able to log in to the machine, commands I'd issue would frequently just hang, and no amount of breaking or kill -9 would help them.

When I gave it up for lost and tried the reboot command, I discovered to my chagrin that it didn't work either.

So, until one of the IT people (who are all swamped with FF3 release stuff) can get around to logging in to the console and killing it, I'm going to have to find other things to occupy myself with. :/

I did come across this interesting comment on linuxquestions.org talking about what can cause a process to disregard kill -9:

What to do when kill -9 doesn't work

Okay, so I fibbed a bit.
It was a foreshadow of further inactivity to come.

Work at Mozilla has been incredibly engaging and fun, but also very stressful. The first project I picked up to work on turned out to be much more complicated, in several different ways, than I had anticipated. That, plus the general ramp-up of coming up to speed with a new company and with new technology, has left me far behind where I thought I would be. :/

I am beginning to see the light at the end of the tunnel though.  Things are starting to fall into place quicker and I suspect very soon we will have something very exciting that we can begin to show to people.

No posts this week. I won't let that be a foreshadow of things to come.
It was quite frustrating that no matter how many hours I poured into this proof-of-concept project, the deeper I got into the data source, the uglier the data got. It's funny, because the first data file that I chose pseudo-randomly to work with could not have been a worse choice. It started out looking pretty reasonable and just got steadily worse as I parsed through more of it.

I made pretty good progress after an all-nighter on Friday, but I'll need to speed things up quite dramatically to have any chance of reporting success. I know of a half dozen things I can do to start optimizing, and there are probably two dozen that I haven't thought of yet, but obviously it was more important to me to get the process working completely first.

Making progress
I was worried up through lunch, but I am relieved to say that I've made a lot of progress today.
I was able to get sample data flowing through my reporting system from the database into the analysis engine.

The first part of next week will be spent in ETL land transforming our log data into a star schema and then migrating this working pipeline over to use that data instead of test data.

Whew.  I'm very happy to be able to end my first week here with something demonstrable.  If only there was someone here who I could show it to!  Most everyone who would care is off today. ::grumble::
