Clip Man

[info]daniele


Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery


Performance of Rhino JS engine and Janino library in Kettle
My friend Roland Bouman made an interesting blog post regarding the performance of a bit of JavaScript for Kettle that he saw on a different blog.

Given the large amounts of data that I shove through Kettle every day, I tend to be extremely concerned about performance. Even a small inefficiency can lead to dramatic slowdowns. So when I saw his post, I got to thinking about how I would approach the problem if it involved the large data sets I work with and therefore required extreme optimization.

I didn't have a lot of spare time to dedicate to this experiment, so I opted for a screen-cast instead of a nicely formatted blog post. That said, I think there is a certain benefit in being able to see the workflow of someone who is very comfortable with Kettle.

The screen-cast is currently in Apple QuickTime format. Bleh. I need to get a new Ogg Theora transcoder because the one that I tried to use last time is not happy with me and I didn't have time to fiddle with it.

So, if you use Kettle and are interested in these things, here is the screen-cast. Be warned it is 30 minutes long and probably not extremely exciting to anyone outside of the ETL field.

Kettle string transformation optimization walk-through

If you are familiar with developing plug-ins for Kettle and you'd like to take a look at the User Defined Java Class plug-in I demonstrated at the end of the screen-cast, you can pick it up from the Pentaho SVN plugins repository. Just wear gloves because it has rough edges.
User Defined Java Class plug-in

Advice regarding using Travelex Cash Passport cards for travel
If you are traveling to Europe and considering getting one of these debit cards to make life easier for you while there, my advice is: don't! Figure out which of your credit cards charges the lowest fees for international usage and just use it.

(Further ranting and whining below the cut...)

Ubuntu screen-profiles customization
I recently loaded Ubuntu server 9.04 onto a new machine and encountered Ubuntu's screen-profiles.
In general, I like it. I had one problem and one customization that I wanted to share:

I use Mac OS X's Terminal.app to connect to my remote machines, and by default, it has custom mappings for F1 through F4. I have no idea what those keybindings mean, but they prevent screen-profiles's keybindings from working. It took a little fiddling to figure out how to fix them. Basically, you need to:
  1. Open up the preferences dialog for Terminal.app
  2. Go to the Settings pane
  3. Click on the Keyboard tab button
  4. Edit the action for each of the F1 through F4 keys
  5. When editing, click the "delete one character" button twice to erase the characters currently in there (leave the \033 escape)
  6. Type the following characters: [ 1 1 ~ (11 is F1, 12 is F2, 13 is F3, 14 is F4)
  7. The new entries should look just like the F5 through F8 actions.
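
For reference, the four sequences from step 6 can be spelled out with a small shell helper. This is only a sketch: the function name fkey_sequence is made up for illustration, and it covers just F1 through F4 using the mapping given above (11 is F1 through 14 is F4). It prints the text as it would appear in the Terminal.app action field, with \033 standing for the escape character.

```shell
# Hypothetical helper: print the key action for F1..F4 as listed in step 6.
fkey_sequence() {
    case "$1" in
        1|2|3|4)
            # '\\033' in the format prints a literal backslash followed by 033.
            printf '\\033[1%s~\n' "$1"
            ;;
        *)
            echo "only F1 through F4 are covered here" >&2
            return 1
            ;;
    esac
}

# List the action to enter for each of the four keys.
for key in 1 2 3 4; do
    printf 'F%s: ' "$key"
    fkey_sequence "$key"
done
```
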
Once I was able to use the F2, F3, and F4 keys, I decided that they weren't that useful to me. I prefer to use a combination of screen regions and windows. The window commands are very easy for me, but I've always found the split, focus, and remove keybindings to be uncomfortable, so I figured those would be great commands to map to F2, F3, and F4. Here is how I did that:
  1. sudo cp /usr/share/screen-profiles/keybindings/common /usr/share/screen-profiles/keybindings/regions
  2. sudo vi /usr/share/screen-profiles/keybindings/regions
  3. replace the first four entries with the new entries below
  4. save and close the file
  5. In screen, hit F9 to bring up the menu
  6. Select the option for "Change keybinding set"
  7. Select the new "regions" entry
  8. Hit F5 to reload your screen-profile and pick up the new keybindings.

register n "^aS^a^i^a^c^aA" # | Goes with the F2 definition
bindkey -k k2 process n # F2 | Create new region and window (and name it)
bindkey -k k3 focus # F3 | Next region
bindkey -k k4 remove # F4 | Remove region

Shell script analytics
I just made a rather lengthy post on the Mozilla blog of data about shell script analytics.  I'll try hard not to cross post stuff like this too often, but I thought I'd allow myself the spam this time around because using Bash and AWK to do things like this really is an important part of who I am personally as a geek in addition to what I do for Mozilla. :)

I've always thought my job was fun. Now I hear it is sexy too!
I just finished reading this lovely little post from the company dataspora titled The Three Sexy Skills of Data Geeks.

By far, my favorite quote was, "A good data munger excels at turning coffee into regular expressions and parsers".  That certainly describes me to a tee. :)

I've always found each of these three facets of working with data fascinating.  One of the comments mentioned that decision making was an important missing trait.  I could go either way there.  I feel it is good to be able to tell a compelling story with the data that helps others to understand it, and then those people take the understanding you imparted to them and make decisions based on it.

It is incredibly hard to find a person who is skilled in just one or two of these facets.  When you find the data geek who has all three, then you count yourself lucky.  Expecting someone who has that caliber of devotion to data to also be capable of making decisions like a CEO is a bit unrealistic in my opinion.

Anyway, the article is a good, quick read.  It also quite nicely summarizes the major passions in my professional life right now.


Interesting crowd-sourced solution sites
A good friend of mine runs the site bug.gd (and its more professional pseudonym, errorhelp.com).  This service provides something that is slightly missing from the typical Google search for an error to find a solution.  It allows you to enter the full text of the error message or stack trace instead of just a couple of keywords, and it provides rich community feedback on solutions.  You can even tip people for their solutions through tipjoy.com integration.

I recently came across two other nice sites created by a different company that provide a similar and complementary service:
stackoverflow.com - A site dedicated to crowd-sourcing answers to programming questions
serverfault.com - A site dedicated to crowd-sourcing answers to system administration questions

I think it is very helpful to have a list of sites like these where you can post a question and hopefully get an answer that will even be moderated by the community to help you determine its value.  This is something that typically takes a lot longer if you search for a forum or mailing list site and post there.  While it is less immediate than IRC, the moderation and ability to leave a question and get an answer "soon" are nice features you are less likely to see in IRC (although I've always gotten great results from #java, #sql, #mysql, and #bash).

As you can tell, I'm a big fan of crowd-sourcing.  I have run a couple of contests on 99designs.com and have been incredibly pleased with the results that came out of that community of freelance graphic designers.

Check these places out and see if they can help you or if you can help them!


Ways to visualize and share data
Mozilla needs to be able to provide useful extracts of data such as download trends, etc. and allow the community to perform their own analysis on them, so I'm always keeping a lookout for useful tools to further that goal.

When Tony Wright posted the blog entry Just How Important is the Valley? Let’s Look at some Data on April 17th 2009, he was kind enough to publish the data set (it needs an attribution / license though) and the data looked interesting so I thought I'd spend a little time playing with it using some tools that I've been keeping my eye on.

First, I slurped the table into DabbleDB, a website that is very well suited to messing with this type of data (i.e. sourced from the web, might need a bit of cleanup, etc.). You can view and edit the data I imported to DabbleDB here: Acquired Startups Data

DabbleDB does a great job at allowing a user to sort, filter, group, and modify data using a simple interface, but it does not have a large array of visualizations. For that, we head over here to the IBM AlphaWorks lab's project, Many Eyes Wikified.  I created a quick wiki dashboard for throwing together a few visualizations: Acquired Startups Visualizations

This was just a quick break from real work I've been doing, so I spent less than an hour on this.  I only took about 20 minutes with DabbleDB: importing the data, cleaning the dollar values, then creating two new views that group the data by country or by state for visualization.  Then I moved over to Many Eyes and played with a few visualizations to try to find some interesting views of the data and threw them into the dashboard and two sub pages.

Being able to quickly extract, transform, and visualize this data is the big win for DabbleDB and Many Eyes in my opinion.  With both applications having open licensing of the data and collaboration as a key focus, they are tools that I hope to be able to take advantage of at Mozilla soon.






Counting unique visitors in SQL
A lot of web metrics solutions out there like NetTracker or Omniture allow you to perform analysis on the number of unique visitors over time. This is a pretty important metric to a lot of companies, and I recently needed to perform such an analysis, but it was on data stored in a SQL database rather than in one of these proprietary solutions' data stores.

Doing any sort of distinct counting on a large volume of data in SQL can be very costly, both in terms of storage of the raw data (since you can't aggregate it), and in query performance since there are relatively few optimizations that can be performed on the table or the query.

(Below are some highlights of how I implemented this.)

TinyArro.ws URLs
A friend just released a URL shrinking service that I enjoy:  tinyarro.ws (more nifty when written as ➡.ws).
It has a few great features over the current mainstream shrinkers:

1. Cool/fun URLs (e.g. http://➽.ws/囨 for my website)
2. Very short URLs due to Unicode suffixes (great for Twitter!)
3. Preview by default! (no tweak to the URL to remember)
4. Option to enter your own custom suffix (TinyURL now has this, but it was too useful to not mention).
5. A Ubiquity command ›.ws/☺ (eventually to be integrated directly on the site)

Some news about the site:
TinyArro.ws: 10 new unicode domains. Defaulting previews to ON.
Ask HN: Thoughts on TinyArro.ws? Tiniest urls in the world (or your money back)

Willingness to be a little evil
I have been a supporter of Firefox and Mozilla for several years now, and while I don't write patches and fix bugs, a major part of that support is educating people about Mozilla, open source, and user empowerment whenever a conversation about technology allows for it.

I've found that people who use proprietary software and operating systems often fall into two broad categories for rationalizing that choice:
1. They are told to do so by some authority (usually their employer, sometimes their social tech support person, and in some cases, just because they were told it was the right thing to do by an ad or magazine article).
2. They started using it for some reason (typically reason #1 above) a long time ago and are now just accustomed to it.

I'm sure all this is going to be old news to most people reading this, but I bring it up because of an interesting article I read today.

In the 1960s and early '70s, psychologist Stanley Milgram performed a series of famous experiments that tested the willingness of people to do something they would normally object to on moral grounds when they are in a strictly controlled environment and instructed to do so by an authority figure.

More recently, psychologist Jerry Burger had the opportunity to perform a series of similar experiments.  This AlterNet article describes the story and discusses the findings.  As I read the results and Dr. Burger's statements regarding the findings, I started thinking about how easy it is for people to choose to give up their freedom to a piece of proprietary software for reasons similar to the ones described in these experiments.

In a green field, these people would normally opt for software that provided them with more freedom and in many cases, subjectively better security, but because they are instructed by an authority figure, or because they got started with it a long time ago and just slid deeper and deeper in, those preferences are not enough by themselves to prompt the person to change their behavior.

Now even this thought in and of itself would not be enough to prompt me to blog about this topic.  We're still well in the territory where the people who haven't gotten lost in a Wikipedia article about toothbrush hygiene they found when they clicked my first link are saying, "um, DUH!"  So here is my point:

At the end of the article, Dr. Burger focuses on an interesting finding of both experiments.  When a person is instructed to do something "wrong", they are significantly less likely to do so if they are surrounded by peers who object first.

So when you talk to someone who is sighing about how much they hate product X but they don't have a choice, don't hate on them and don't deride them for not having a backbone, but just tell them and show them how you chose to stand up for your freedom and your security.  An example can go a long way toward giving them the courage to listen to that little voice inside saying, "I want something better!"


Bash functions for going up to a directory
Sometimes, if I'm in a really deep directory, I don't want to cd from / nor do I want to cd ../../../..
I just want to either go up 5 directories, or maybe I want to go up to the parent directory "src" when I'm in /home/dre/src/projects/foo/bar/classes/org/apache/blah

This set of Bash functions lets me do that.
The first, up(), changes your directory. The second, upstr(), just prints the desired directory name instead, which makes it easy to mv a file up higher, for example.

If you pass no arguments, it just goes up one directory.
If you pass a numeric argument it will go up that number of directories.
If you pass a string argument, it will look for a parent directory with that name and go up to it.
(Note, there is a small display bug there. If you give it an invalid name, cd reports the "No such file or directory" error, which is good, but it has a bogus path. Since you can't know what path they were actually trying to go to, it should just say "No such parent directory: ${yourbogusname}". I don't have time to figure that out right now though.)

Just put these functions in your ~/.bashrc file and don't forget to source it. (  source ~/.bashrc )

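function up()
{
    # No argument: go up one level. Number: go up that many levels.
    # Name: go up to the nearest parent directory with that name.
    local dir="" x=0
    if [ -z "$1" ]; then
        dir=..
    elif [[ $1 =~ ^[0-9]+$ ]]; then
        while [ "$x" -lt "$1" ]; do
            dir=${dir}../
            x=$((x+1))
        done
    else
        # Strip everything after the nearest "/name/" component of $PWD.
        dir=${PWD%/"$1"/*}/$1
    fi
    cd "$dir"
}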

function upstr()
{
    echo "$(up "$1" && pwd)";
}
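
The display bug mentioned in the note above could be handled by checking for the parent directory before calling cd. The following is a sketch rather than a drop-in replacement: up_target and up_checked are names invented here, and the error wording is the one suggested in that note.

```shell
# Hypothetical helper: print the directory to change to, or fail with a
# clear message when no parent directory has the requested name.
up_target() {
    local dir="" n=0
    case "$1" in
        '')
            dir=..
            ;;
        *[!0-9]*)
            # Not purely numeric, so treat the argument as a directory name.
            case "$PWD" in
                */"$1"/*) dir=${PWD%/"$1"/*}/$1 ;;
                *) echo "No such parent directory: $1" >&2; return 1 ;;
            esac
            ;;
        *)
            # Purely numeric: build ../ once per requested level.
            while [ "$n" -lt "$1" ]; do
                dir=${dir}../
                n=$((n + 1))
            done
            dir=${dir:-.}
            ;;
    esac
    printf '%s\n' "$dir"
}

# Hypothetical variant of up() that only calls cd when a target was found.
up_checked() {
    local target
    target=$(up_target "$1") || return
    cd "$target"
}
```

With this in place, `up_checked nosuchname` prints "No such parent directory: nosuchname" and leaves you where you were.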

All hail Ken Kovash!
It may be showing my ignorance, but I was unaware until recently of the officially recognized day for celebrating the man, the myth, and the math that is Ken Kovash.  To think that all the time leading up to this point, I had just been satisfied with the joyous feeling in my heart every day I interacted with him.

Ken can be a harsh task-master sometimes.
"Daniel, where are my numbers from yesterday?"
"Daniel, why are the funnelcake trends low here and high there? Your data are wrong, go find it and fix it!"
But the pain is worth it when I see him take my crude raw data and masterfully sculpt it into bounteous bevies of tables, raging rivers of trend lines, triumphant towers of bar charts, overwhelming ontologies of pie graphs, and gilt-edged grids of treemaps.

One must weep to behold it.
 

Powered by ScribeFire.


Performance improvements at the cost of complexity
I discovered something that I feel is a bit of a bug in the Sun Java implementation. 

If you pass in a string to the method InetAddress.getByName(), it does a bunch of testing to see if it is a domain name or a literal IP address.
If it is an IPv4 address, it will then use String.split() to split the four parts.  String.split() uses regexes to do its work.

That means that if you are querying for hundreds or millions of addresses in a tight loop (as I've been doing), the JVM is spawning and compiling hundreds or millions of regex objects, in addition to a String array and four String objects per call.

So at first, I just worked around it by doing basic substringing instead of splitting.  That gave me about a 100x performance improvement. But then I realized I was still generating four string objects for every call.

So I came up with this mapping method and it runs about 1000x faster with a near constant minimal memory footprint.

I pre-calculate a multidimensional array of shorts in which each element is indexed by the character values, minus 48 (the ASCII code for '0'), of the digits making up a number from 0 to 255.

With that array available, at run time, I can do a simple lookup of the short value and then do the math to get the long representation of the IP address.  I'm still generating a couple of references and a few intermediate int values, but the JIT optimizer can make quick work of that.
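
The linked test program is Java and is not reproduced here. Purely to illustrate the last step, the math that folds four octets into one number, here is a small shell sketch. It is not the lookup-table method described above and makes no performance claim; the function name ip_to_int is made up for this example.

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to one integer.
ip_to_int() {
    local a b c d octet
    # Split on dots; a fifth field, if any, ends up inside $d and is rejected.
    IFS=. read -r a b c d <<EOF
$1
EOF
    for octet in "$a" "$b" "$c" "$d"; do
        case "$octet" in
            # Reject empty fields, non-digits, and leading zeros (which shell
            # arithmetic would otherwise read as octal).
            ''|*[!0-9]*|0?*)
                echo "not a dotted-quad IPv4 address: $1" >&2
                return 1
                ;;
        esac
        if [ "$octet" -gt 255 ]; then
            echo "octet out of range: $octet" >&2
            return 1
        fi
    done
    # Each octet is one base-256 digit of the result.
    echo $(( ((a * 256 + b) * 256 + c) * 256 + d ))
}

ip_to_int 192.168.1.1   # prints 3232235777
```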

Linked is the test program I created to play with the different methods:  InetAddressParse test




Don't listen to bash, it will lie to you!
Remember folks,  if you mv a directory, and there is a bash shell currently in that directory, the bash prompt will not update to reflect the new name until you cd out of the directory and then back in.

I just spent way too long making changes and being frustrated because the changes weren't having any effect.  I was clearing caches, restarting applications, and monitoring log files.  It wasn't until I happened to do a :pwd in vim while editing the file for the umpteenth time that I finally noticed that the file I had been editing was actually in a backup of the folder that I had just made.
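
The effect is easy to reproduce. The sketch below assumes a system with mktemp and an external /bin/pwd, and the function name stale_pwd_demo is invented for this example. It renames the current directory out from under a shell and then compares what the shell remembers in $PWD with what the system reports.

```shell
# Hypothetical demo; the body runs in a subshell, so the caller's own
# working directory is left untouched.
stale_pwd_demo() (
    root=$(mktemp -d) || exit 1
    mkdir "$root/site" && cd "$root/site" || exit 1

    # Rename the directory while the shell is still sitting inside it.
    mv "$root/site" "$root/site.bak"

    echo "shell:  $PWD"              # still ends in /site
    echo "system: $(/bin/pwd -P)"    # ends in /site.bak

    # Clean up the temporary directory.
    cd / && rm -rf "$root"
)

stale_pwd_demo
```

Running the external `/bin/pwd -P` asks the system where the shell really is instead of trusting the remembered path, which is the check that would have saved me the trouble here.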

Sheesh.




The best DHTML date range picker I've ever seen
Filament Group's Date Range Picker


It uses jQuery and a JavaScript date parsing library by the name of Date.js.  This thing is simply amazing.  Some of the reasons I think so:

  • The developer can configure start and end date limits based on what is valid for the system (e.g. if you only have data going back to 1999, there is no sense in letting the user choose a date in 3000 BC)
  • The developer can configure a set of predefined ranges such as "Last week", "Month to date", "Year to date".
  • If the developer allows it, the user can use any combination of preconfigured ranges, a single date, an arbitrary range of dates, or they can use the back and forward arrows to roll the current date range forward or back.
  • It is smooth and crisp, able to be easily themed, and seems pretty extensible/tweakable.
It is still a work in progress (they just released it today), but I think it is still usable.  The only downside that I've found so far is that the back and forward arrows in this very first released version can produce some unexpected ranges.  They are currently strictly math based, so if you do something like select the current month and then hit the back arrow thinking it will select the previous month, you'll probably get something slightly different since most adjacent months don't have the same number of days.

I'm also pretty sure it has an off-by-one error in it that I suspect they'll fix shortly.  If you select Sunday to Saturday of a week and then scroll backward, the next range is actually Monday to Sunday, and the next is Tuesday to Monday...
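
One plausible explanation for that drift, and this is a guess rather than something taken from the widget's source, is that the range is shifted by the difference between its end and start dates instead of by its inclusive length. The shell sketch below uses plain day numbers (with day 0 falling on a Sunday) and made-up function names to show that a 6-day shift reproduces the Monday to Sunday result, while a 7-day shift keeps the range aligned.

```shell
# Map a non-negative day number to a weekday name; day 0 is a Sunday.
day_name() {
    case $(( $1 % 7 )) in
        0) echo Sunday ;;
        1) echo Monday ;;
        2) echo Tuesday ;;
        3) echo Wednesday ;;
        4) echo Thursday ;;
        5) echo Friday ;;
        6) echo Saturday ;;
    esac
}

# shift_back START END STEP: describe the range moved back by STEP days.
shift_back() {
    echo "$(day_name $(( $1 - $3 ))) to $(day_name $(( $2 - $3 )))"
}

start=14   # a Sunday
end=20     # the following Saturday

shift_back "$start" "$end" $(( end - start ))       # Monday to Sunday
shift_back "$start" "$end" $(( end - start + 1 ))   # Sunday to Saturday
```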


Ignore these nitpicks and go check it out right away if your website needs a date picker though.  To get such a fantastic widget in the very first release can only mean that it is going to be the bee's knees after a little public beta testing.




Open Source Hardware
I thought that this article in Slate about Open Source Hardware was a fun read and worth sharing.
There is an interesting similarity between the way Arduino open-sources its designs but reserves the trademark to preserve brand quality and the way Mozilla handles the Firefox trademark.

If you like reading about geeks going against the status quo in their industry and trying to make the world a better place, give the article a read.




Good bye Mountain View
It has been a great two weeks out here in the office.  I've gotten to see a lot of people face to face and had some useful meetings about my projects.  I just kicked off another round of massive data loads to run over the weekend while I'm out of pocket. Hopefully they will run smoothly and deliver me high quality data.

There are some really exciting things coming up this quarter:
  • I'll be working on one of the largest data sets yet, our AMO data.  We have several really cool mechanisms for visualizing individual extension projects hosted on AMO. The developer has control over whether to make the statistics public or not.  As an example, you can take a look at the statistics for Adblock Plus.  I'll be working on ways to be able to integrate data across projects so we can get a better understanding of the extension community that means so very much to Mozilla.
  • I'll hopefully be blogging a little more about the complexities of processing the large amount of data that I have to crunch through.
  • I'll be making several pieces of my Pentaho Data Integration (Kettle for those of you in the know) ETL scripts available in an open source repository.  It will help with the blogging, they might be useful to other people doing similar things, and who knows, maybe some people will even have suggestions for improvements!
  • Later in the quarter, I'll be working on an exciting new project to take some of the aggregated data that Mozilla has, such as the number of downloads of Firefox for given time periods, and make it available publicly for the community to explore and visualize.  At the moment, I'm leaning toward trying to use the Many-Eyes project from IBM AlphaWorks.  If anyone has any better ideas, please let me know.



Been a long difficult week
I have wonderful results to show for it though.

I have gotten the second large data source flowing through our metrics system and have the first report hooked up to it.

It's going to be interesting comparing the performance of the two data sources. Both have similar volume, but this second one is in a much cleaner looking star schema as opposed to the extremely denormalized single table format. Vertica handles both of these formats well, so I'm eager to figure out how close the performance is.

Kettle (a.k.a. Pentaho Data Integration) is a real winner here as far as enabling me to develop and maintain these very complex ETL processes. The ETL for the previous data source working against the single table clips along at over 30,000 records per second. This new ETL is a good bit slower, both because of a difference in the file structure of what I'm parsing, and because I have seven dimensions that I am doing foreign key lookups in. There is lots of room for optimization in this ETL too though.

It is somewhat difficult to optimize the throughput of the transformation for a headless server or when running a clustered transformation in Kettle. Pentaho is supposed to be coming out with some new management tools that will hopefully streamline things there.

One of the interesting things I ran into was the fact that because Kettle runs each step in a separate thread and these steps are passing around rows of data as array objects, certain server class hardware can actually perform much slower than desktop class hardware.
A case in point: a very simple transformation that does nothing more than generate several million records of data and pass them through a few steps can run at more than 700,000 records per second on my MacBook Pro with a 2.5 GHz Intel Core 2 Duo processor. The exact same transformation running on an HP blade with dual quad core 2.5 GHz Intel Xeon processors and 16 GB of ECC memory tops out at about 350,000 records per second. Let me tell you, that was pretty depressing to witness! Of course, the saving grace here is that when there is a lot more work to be done than just passing pages of memory around between cores, the server can do a lot more work, faster. That is another thing that I'm hoping some R&D at Pentaho is going to help solve.



A post about personal data
Mitchell Baker, the Chairperson of Mozilla Foundation and Mozilla Corporation, recently posted a series of blog entries about data:

  • Thinking About Data
  • Framework for discussing “data”
  • Why focus on data?
  • Data Relating to People
  • Data — getting to the point

    This discussion is something I've been looking forward to seeing at Mozilla since I started back in March. In the work that I do, I make every effort to safeguard data and make sure that what I process and store can't turn around and bite me later.

    One thing that I felt could use a different approach is the listing of the different forms of personal data that people are likely to generate or come across in the web world.

    To me, the best way to categorize these types of personal data is with a matrix. I've created one below that has the origin of the data as the X axis and the classification of the data as the Y axis. Inside each cell, I've placed a few examples that I think represent that intersection of data.

    I'd encourage anyone interested in this to comment on other origins, classifications, or examples of personal data. The more we have defined, the easier it will be to make sure that our discussions about data don't leave anything out.

    I've also saved this document on docs.google.com (Personal data types matrix).
    If anyone wishes to collaborate with me on enhancing it, please just let me know in the comments and I'll send you a collaboration invitation.


    Elicited [2]
      • Identifying (potential [1]): Name/Address (partial); IP address
      • Identifying (definite): Contact information (comprehensive); SSN; E-mail address [3]; Blog URL; Credit card information
      • Characterizing (self): Demographics; Location; Interests; Website filters
      • Characterizing (relationships): Friend invitations; Friends list; Friends watched/followed

    Published
      • Identifying (potential [1]): Blog posts [4]
      • Identifying (definite): PGP key; Contact information (comprehensive)
      • Characterizing (self): Interests; Blog posts [5]; Wishlists
      • Characterizing (relationships): Friends list

    Harvested
      • Identifying (potential [1]): Cookies; Personal search terms
      • Identifying (definite): (none listed)
      • Characterizing (self): Extrapolated interests; Clickstream in site; Web history
      • Characterizing (relationships): People watched/followed


    [1] Multiple pieces of potential identifying information are usually needed to make definite identification or direct contact.

    [2] Data may be elicited as a requirement for interaction with the data collector (e.g. IP address required to view a web page or shipping information required for a purchase) or it may be optional (e.g. a blog comment form requesting your URL).

    [3] E-mail address is a definite identification because it immediately allows a person to contact you directly.

    [4] Blog posts talking about who you are or where you live are potentially identifying.

    [5] Blog posts talking about topics that interest you or things you do are characterizing.


Mozilla 2008 Summit
I'm in Whistler, B.C., Canada attending the Mozilla 2008 Summit. It is a huge crowd of people. Should be lots of fun. More later.
