Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery

Performance improvements at the cost of complexity
I discovered something that I feel is a bit of a bug in the Sun Java implementation. 

If you pass in a string to the method InetAddress.getByName(), it does a bunch of testing to see if it is a domain name or a literal IP address.
If it is an IPv4 address, it will then use String.split() to split the four parts.  String.split() uses regexes to do its work.

That means that if you are querying for hundreds or millions of addresses in a tight loop (as I've been doing), the JVM is spawning and compiling hundreds or millions of regex objects, in addition to a String array and four String objects per call.

So at first, I just worked around it by doing basic substringing instead of splitting.  That gave me about 100x performance improvement. But then I realized I was still generating four string objects for every call.
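
The linked test program is not reproduced here, so the following is only a minimal sketch of what a substring-based workaround could look like. The class name `SubstringIpv4` and its error handling are assumptions for illustration. It avoids the regex by locating the dots with `indexOf`, but it still allocates four `String` objects per call:

```java
final class SubstringIpv4 {
    /** Parses a dotted-quad IPv4 literal into an unsigned 32-bit value held in a long. */
    static long parse(String s) {
        // Locate the three separators without using a regex.
        int p1 = s.indexOf('.');
        int p2 = s.indexOf('.', p1 + 1);
        int p3 = s.indexOf('.', p2 + 1);
        if (p1 < 0 || p2 < 0 || p3 < 0) {
            throw new IllegalArgumentException("Expected four dot-separated parts: " + s);
        }
        // Each substring() call still creates a new String object.
        return (octet(s.substring(0, p1)) << 24)
             | (octet(s.substring(p1 + 1, p2)) << 16)
             | (octet(s.substring(p2 + 1, p3)) << 8)
             | octet(s.substring(p3 + 1));
    }

    private static long octet(String part) {
        int v = Integer.parseInt(part); // throws NumberFormatException on non-numeric input
        if (v < 0 || v > 255) {
            throw new IllegalArgumentException("Octet out of range: " + part);
        }
        return v;
    }
}
```

With this sketch, `SubstringIpv4.parse("192.168.1.10")` returns 3232235786.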

So I came up with this mapping method and it runs about 1000x faster with a near constant minimal memory footprint.

I pre-calculate a multidimensional array of shorts where each element is indexed by the digits making up a number from 0 to 255, with each index being the literal character value minus 48 (the character code of '0').

With that array available, at run time, I can do a simple lookup of the short value and then do the math to get the long representation of the IP address.  I'm still generating a couple of references and a few intermediate int values, but the JIT optimizer can make quick work of that.
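
As a rough illustration of the lookup idea, here is a sketch under some assumptions: a 10 x 10 x 10 table of shorts, shorter octets left-padded with zeros, and -1 returned for malformed input. The class name `Ipv4Lookup` and these layout choices are hypothetical and may differ from the linked test program.

```java
final class Ipv4Lookup {
    // OCTET[a][b][c] holds the value of the digit string "abc", or -1 if it exceeds 255.
    private static final short[][][] OCTET = new short[10][10][10];

    static {
        for (int a = 0; a < 10; a++) {
            for (int b = 0; b < 10; b++) {
                for (int c = 0; c < 10; c++) {
                    int v = a * 100 + b * 10 + c;
                    OCTET[a][b][c] = (short) (v <= 255 ? v : -1);
                }
            }
        }
    }

    /** Returns the address as an unsigned 32-bit value in a long, or -1 if malformed. */
    static long parse(CharSequence s) {
        long result = 0;
        int octets = 0;
        int d0 = 0, d1 = 0, d2 = 0; // last three digits seen, left-padded with zeros
        int digits = 0;
        int n = s.length();
        for (int i = 0; i <= n; i++) {
            char ch = i < n ? s.charAt(i) : '.'; // treat end of input as a final separator
            if (ch == '.') {
                if (digits == 0 || octets == 4) return -1;
                short v = OCTET[d0][d1][d2];     // table lookup instead of parsing a String
                if (v < 0) return -1;
                result = (result << 8) | v;
                octets++;
                d0 = d1 = d2 = 0;
                digits = 0;
            } else {
                int d = ch - 48;                 // '0' has character value 48
                if (d < 0 || d > 9 || digits == 3) return -1;
                d0 = d1;
                d1 = d2;
                d2 = d;
                digits++;
            }
        }
        return octets == 4 ? result : -1;
    }
}
```

In this sketch, `Ipv4Lookup.parse("192.168.1.10")` returns 3232235786 without creating any intermediate `String` or array objects per call.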

Linked is the test program I created to play with the different methods:  InetAddressParse test

Powered by ScribeFire.

Don't listen to bash, it will lie to you!
Remember folks,  if you mv a directory, and there is a bash shell currently in that directory, the bash prompt will not update to reflect the new name until you cd out of the directory and then back in.

I just spent way too long making changes and being frustrated because the changes weren't having any effect.  I was clearing caches and restarting applications and monitoring log files.  It wasn't until I happened to do a :pwd in vim while editing the file for the umpteenth time that I finally noticed that the file I had been editing was actually in a backup of the folder that I had just made.



The best DHTML date range picker I've ever seen
Filament Group's Date Range Picker

It uses jQuery and a JavaScript date parsing library by the name of Date.js.  This thing is simply amazing.  Some of the reasons I think so:

  • The developer can configure start and end date limits based on what is valid for the system (e.g. if you only have data going back to 1999, there is no sense in letting the user choose a date in 3000 BC)
  • The developer can configure a set of predefined ranges such as "Last week", "Month to date", "Year to date".
  • If the developer allows it, the user can use any combination of preconfigured ranges, a single date, an arbitrary range of dates, or they can use the back and forward arrows to roll the current date range forward or back.
  • It is smooth and crisp, able to be easily themed, and seems pretty extensible/tweakable.
It is still a work in progress (they just released it today), but I think it is still usable.  The only downside that I've found so far is that the back and forward arrows in this very first released version can produce some unexpected ranges.  They are currently strictly math based, so if you do something like select the current month and then hit the back arrow thinking it will select the previous month, you'll probably get something slightly different since most adjacent months don't have the same number of days.

I'm also pretty sure it has an off-by-one error in it that I suspect they'll fix shortly.  If you select Sunday to Saturday of a week and then scroll backward, the next range is actually Monday to Sunday, and the one after that is Tuesday to Monday...
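
To make the two quirks concrete, here is a small Java sketch. It is not the widget's actual JavaScript, which is not shown here; the class `RangeShift` and its methods are hypothetical, and `java.time` is used only for convenience. It shows a purely arithmetic shift, how leaving the end day out of the range length would produce the Sunday-to-Saturday symptom, and one calendar-aware alternative for whole months.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

final class RangeShift {
    /**
     * Shifts the inclusive range [start, end] back by its own length in days.
     * With countEndDay == false the length is one day short, so a
     * Sunday-to-Saturday week moves back six days instead of seven.
     */
    static LocalDate[] shiftBack(LocalDate start, LocalDate end, boolean countEndDay) {
        long days = ChronoUnit.DAYS.between(start, end) + (countEndDay ? 1 : 0);
        return new LocalDate[] { start.minusDays(days), end.minusDays(days) };
    }

    /**
     * If the range is exactly one calendar month, returns the previous
     * calendar month; otherwise falls back to the arithmetic shift.
     */
    static LocalDate[] shiftBackCalendarAware(LocalDate start, LocalDate end) {
        YearMonth month = YearMonth.from(start);
        if (start.equals(month.atDay(1)) && end.equals(month.atEndOfMonth())) {
            YearMonth previous = month.minusMonths(1);
            return new LocalDate[] { previous.atDay(1), previous.atEndOfMonth() };
        }
        return shiftBack(start, end, true);
    }
}
```

For July 1 to July 31, 2008, the arithmetic shift gives May 31 to June 30, while the calendar-aware version gives June 1 to June 30. For Sunday July 6 to Saturday July 12, 2008, leaving out the end day gives Monday June 30 to Sunday July 6, which matches the behavior described above.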

Ignore these nitpicks and go check it out right away if your website needs a date picker though.  To get such a fantastic widget in the very first release can only mean that it is going to be the bee's knees after a little public beta testing.


Open Source Hardware
I thought that this article in Slate about Open Source Hardware was a fun read and worth sharing.
There is an interesting similarity between the way Arduino open sources its design while reserving the trademark to preserve brand quality and the way Mozilla handles the Firefox trademark.

If you like reading about geeks going against the status quo in their industry and trying to make the world a better place, give the article a read.


Good bye Mountain View
It has been a great two weeks out here in the office.  I've gotten to see a lot of people face to face and had some useful meetings about my projects.  I just kicked off another round of massive data loads to run over the weekend while I'm out of pocket. Hopefully they will run smoothly and deliver me high quality data.

There are some really exciting things coming up this quarter:
  • I'll be working on one of the largest data sets yet, our AMO data.  We have several really cool mechanisms for visualizing individual extension projects hosted on AMO. The developer has control over whether to make the statistics public or not.  As an example, you can take a look at the statistics for Adblock Plus.  I'll be working on ways to be able to integrate data across projects so we can get a better understanding of the extension community that means so very much to Mozilla.
  • I'll hopefully be blogging a little more about the complexities of processing the large amount of data that I have to crunch through.
  • I'll be making several pieces of my Pentaho Data Integration (Kettle for those of you in the know) ETL scripts available in an open source repository.  It will help with the blogging, they might be useful to other people doing similar things, and who knows, maybe some people will even have suggestions for improvements!
  • Later in the quarter, I'll be working on an exciting new project to take some of the aggregated data that Mozilla has, such as the number of downloads of Firefox for given time periods, and making it available publicly for the community to explore and visualize.  At the moment, I'm leaning toward trying to use the Many-Eyes project from IBM AlphaWorks.  If anyone has any better ideas, please let me know.


Been a long difficult week
I have wonderful results to show for it though.

I have gotten the second large data source flowing through our metrics system and have the first report hooked up to it.

It's going to be interesting comparing the performance of the two data sources. Both have similar volume, but this second one is in a much cleaner looking star schema as opposed to the extremely denormalized single table format. Vertica handles both of these formats well, so I'm eager to figure out how close the performance is.

Kettle (a.k.a. Pentaho Data Integration) is a real winner here as far as enabling me to develop and maintain these very complex ETL processes. The ETL for the previous data source working against the single table clips along at over 30,000 records per second. This new ETL is a good bit slower, both because of a difference in the file structure of what I'm parsing, and because I have seven dimensions that I am doing foreign key lookups in. There is lots of room for optimization in this ETL too though.

It is somewhat difficult to optimize the throughput of the transformation for a headless server or when running a clustered transformation in Kettle. Pentaho is supposed to be coming out with some new management tools that will hopefully streamline things there.

One of the interesting things I ran into was the fact that because Kettle runs each step in a separate thread and these steps are passing around rows of data as array objects, certain server class hardware can actually perform much slower than desktop class hardware.
A case in point: a very simple transformation that does nothing more than generate several million records of data and pass them through a few steps can run at more than 700,000 records per second on my MacBook Pro with a 2.5 GHz Intel Core 2 Duo processor. The exact same transformation running on an HP blade with dual quad core 2.5 GHz Intel Xeon processors and 16 GB of ECC memory tops out at about 350,000 records per second. Let me tell you, that was pretty depressing to witness! Of course, the saving grace here is that when there is a lot more work to be done than just passing pages of memory around between cores, the server can do a lot more work, faster. That is another thing that I'm hoping some R&D at Pentaho is going to help solve.
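
The shape of that workload can be sketched in plain Java. This is not Kettle's actual row-passing implementation; the class `RowPipeline`, the queue sizes, and the row contents are assumptions. It only mimics the pattern described: one thread per step, with rows handed between threads as object arrays, so nearly all the time goes into moving references between cores rather than doing work on the rows.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class RowPipeline {
    private static final Object[] END = new Object[0]; // sentinel marking end of stream

    /** Generates rowCount rows, passes them through one pass-through step, and counts them. */
    static long run(int rowCount) {
        BlockingQueue<Object[]> generated = new ArrayBlockingQueue<>(10_000);
        BlockingQueue<Object[]> passed = new ArrayBlockingQueue<>(10_000);

        // Step 1: generate rows, each one an Object[] like a row of field values.
        Thread generator = new Thread(() -> {
            try {
                for (int i = 0; i < rowCount; i++) {
                    generated.put(new Object[] { (long) i, "row" });
                }
                generated.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Step 2: do no work at all, just hand each row to the next step.
        Thread passThrough = new Thread(() -> {
            try {
                Object[] row;
                while ((row = generated.take()) != END) {
                    passed.put(row);
                }
                passed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        generator.start();
        passThrough.start();

        // Step 3 (calling thread): count the rows that arrive.
        long count = 0;
        try {
            while (passed.take() != END) {
                count++;
            }
            generator.join();
            passThrough.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while draining rows", e);
        }
        return count;
    }
}
```

Timing `RowPipeline.run(...)` with `System.nanoTime()` on different machines gives a rough feel for how much of the cost is pure hand-off between threads; the numbers will vary with hardware and JVM.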


A post about personal data
Mitchell Baker, the Chairperson of Mozilla Foundation and Mozilla Corporation, recently posted a series of blog entries about data:

  • Thinking About Data
  • Framework for discussing “data”
  • Why focus on data?
  • Data Relating to People
  • Data — getting to the point

    This discussion is something I've been looking forward to seeing at Mozilla since I started back in March. In the work that I do, I make every effort to safeguard data and make sure that what I process and store can't turn around and bite me later.

    One thing that I felt could use a different approach is the way the different forms of personal data that people are likely to generate or come across in the web world are listed out.

    To me, the best way to categorize these types of personal data is with a matrix. I've created one below that has the origin of the data as the X axis and the classification of the data as the Y axis. Inside each cell, I've placed a few examples that I think represent that intersection of data.

    I'd encourage anyone interested in this to comment on other origins, classifications, or examples of personal data. The more we have defined, the easier it will be to make sure that our discussions about data don't leave anything out.

    I've also saved this document on docs.google.com (Personal data types matrix).
    If anyone wishes to collaborate with me on enhancing it, please just let me know in the comments and I'll send you a collaboration invitation.








    Name/Address (partial)

    IP address

    Contact information (comprehensive)


    E-mail address [3]

    Blog URL

    Credit card information




    Website filters

    Friend invitations

    Friends list

    Friends watched/followed


    Blog posts [4]

    PGP key

    Contact information (comprehensive)


    Blog posts [5]


    Friends list



    Personal search terms

    Extrapolated interests

    clickstream in site

    Web history

    People watched/followed

    [1] Multiple pieces of potentially identifying information are usually needed to make a definite identification or direct contact.

    [2] Data may be elicited as a requirement for interaction with the data collector (e.g. IP address required to view a web page or shipping information required for a purchase) or it may be optional (e.g. a blog comment form requesting your URL).

    [3] E-mail address is a definite identification because it immediately allows a person to contact you directly.

    [4] Blog posts talking about who you are or where you live are potentially identifying.

    [5] Blog posts talking about topics that interest you or things you do are characterizing.

Mozilla 2008 Summit
I'm in Whistler, B.C., Canada attending the Mozilla 2008 Summit. It is a huge crowd of people. Should be lots of fun. More later.

Finally. It took work to get this apache front-end configured properly
I've spent most of the day trying to get an Apache 2.2 server set up to do both LDAP authentication and AJP proxying to a tomcat back-end.

The trickiest parts were translating changes from Apache 2.0's implementation of auth_ldap.

In 2.0, the following directory directives were needed to do group based authentication:

AuthType Basic
AuthName "Use your ldap username/password"
AuthLDAPBindDN xxx
AuthLDAPBindPassword xxx
AuthLDAPURL ldap://server/o=xxx?xxx
Require group cn=xxx,ou=groups

However, in 2.2, the syntax changed slightly and the following was what it took for me to get it going:

AuthType Basic
AuthName "Use your ldap username/password"
#AuthBasicProvider defaults to file so it is required if you aren't loading mod_authn_file
AuthBasicProvider ldap
AuthLDAPBindDN xxx
AuthLDAPBindPassword xxx
AuthLDAPUrl ldap://server/o=xxx?xxx
#It is now ldap-group instead of just group
Require ldap-group cn=xxx, ou=groups

Before I put the AuthBasicProvider ldap directive in place, I was getting an error in the logs:
configuration error: couldn't check user. No user file?: /

It took me longer to figure out the ldap-group vs group problem. In the logs, I was seeing incorrect password attempts being logged properly, but if I typed the right password, there was nothing in the log and the authorization dialog was just redisplayed in the browser.
I imagine that maybe if I tweaked some log settings somewhere I'd find that it was possible to see the Require directive failing.

Also note that most auth_ldap examples show putting the directives in a <Directory> section. Well, if you are using mod_proxy or mod_proxy_ajp, there is no directory so you put the auth directives in a <Location /> section instead.
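
Putting the pieces together, a combined front-end configuration might look roughly like the sketch below. The xxx placeholders are kept as-is from the snippets above, and the back-end address ajp://localhost:8009/ is an assumption (8009 is Tomcat's default AJP connector port), so adjust it to wherever your Tomcat actually listens.

```apache
# Auth directives go in <Location /> because a proxied URL has no directory on disk
<Location />
    AuthType Basic
    AuthName "Use your ldap username/password"
    AuthBasicProvider ldap
    AuthLDAPBindDN xxx
    AuthLDAPBindPassword xxx
    AuthLDAPURL ldap://server/o=xxx?xxx
    Require ldap-group cn=xxx, ou=groups
</Location>

# Hypothetical back-end host and port; requires mod_proxy and mod_proxy_ajp to be loaded
ProxyPass / ajp://localhost:8009/
```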

Status update
Released the first alpha of my project. And I've been pretty happy with the results so far. Gotten some good feedback and there is lots more work to be done.

The project is based on Pentaho and uses a Vertica cluster as the DB backend. I've gotten pretty amazing results out of the combination.

I've been spending a lot of time working with two community additions to Pentaho, the Community Build Framework (CBF) and the Community Dashboard Framework (CDF). These two amazing projects are being driven by Pedro Alves, a BI consultant specializing in Pentaho. They have really allowed my project to move along rapidly in the direction I wanted to take it.

The other exciting thing I hope to blog about further in the near future is the choropleth map I managed to implement in Pentaho. It was based on an example from Chris Schmidt. While writing this post, I just discovered that he lives nearby. I think I might have to take him out to lunch as a treat for the help he's given me. :)

I need to look into integrating the Simile Timeplot widget into my Pentaho dashboards. I really need the ability to provide rich annotations for momentary or duration events.