Clip Man

daniele


Daniel Einspanjer's journal

Data warehousing, ETL, BI, and general hackery


Previous Entry Share Next Entry
Performance improvements at the cost of complexity
Clip Man
daniele
I discovered something that I feel is a bit of a bug in the Sun Java implementation. 

If you pass in a string to the method InetAddress.getByName(), it does a bunch of testing to see if it is a domain name or a literal IP address.
If it is an IPv4 address, it will then use String.split() to split the four parts.  String.split() uses regexes to do its work.

That means that if you are querying for hundreds or millions of addresses in a tight loop (as I've been doing), the JVM is spawning and compiling hundreds or millions of regex objects, in addition to a String array and four String objects per call.

So at first, I just worked around it by doing basic substringing instead of splitting.  That gave me about 100x performance improvement. But then I realized I was still generating four string objects for every call..

So I came up with this mapping method and it runs about 1000x faster with a near constant minimal memory footprint.

I pre-calculate a multidimensional array of shorts where each element is indexed by the literal character value - 48 of the digits making up the number 0 - 255.

With that array available, at run time, I can do a simple lookup of the short value and then do the math to get the long representation of the IP address.  I'm still generating a couple of references and a few intermediate int values, but the JIT optimizer can make quick work of that.

Linked is the test program I created to play with the different methods:  InetAddressParse test


Powered by ScribeFire.


  • 1
You're using NetBeans! How do you like it?

This was actually my first ever foray into NetBeans. I tried it out because I wanted to be able to easily run both CPU performance and Memory performance profiling. Eclipse's TPTP just wasn't cutting it either directly on my Mac or for remote profiling of my Linux server via my Mac.

Doing this testing in NetBeans was a breeze! Unfortunately, it was a big pain in the ass to try to get the actual source code project imported so I gave that up and just used NetBeans for the testing. Still using Eclipse for the actual Java project coding.

Merge!

(Anonymous)
Please raise a bug with Sun and include a patch and testing information!

Out of interest, does your implementation handle completely numeric IPs (without digits), and does Sun's? Also how does it deal with invalid entries (e.g. Unicode text)?

I have that on my todo list to file a bug.

My method is extremely optimized for my particular case of valid dotted quad IPv4 strings, so it doesn't handle either of those cases.

Please note however, that I wouldn't consider my method to be the solution to what I described as a "Sun bug". The "bug" that I feel is there is that InetAddress uses String.split() which is very inefficient in a tight loop since it compiles a regex on the fly. Good Java developers (of which I'm sure Sun is packed) know that if you are going to potentially use a regex more than once, you compile it once and save a reference to the compiled pattern. Whomever implemented InetAddress.getByName() just slipped up there.

They might go so far as to implement my method2() test of using substring instead of String.split since it saves the storage of the regex and the creation of a String[], but I'm not sure it would be as big of a win for the general case because they still have to do several more tests to see if the address is valid or if it is one of the other formats they support.

Premature optimisation

(Anonymous)
Is it actually worthwhile doing this testing, when the OS is going to do it for you anyway, particularly as it has to handle all the variants such as 2130706433 (yes, that's a valid alias for localhost)?

Re: Premature optimisation

I'm not sure what part of the conversations exactly you are referring to as potentially unnecessary, but I can describe a bit more of what I"m doing and maybe that will help clarify the situation I found myself in:

Java has an InetAddress class that can be used to do things like DNS lookups and also to establish a TCP/IP connection. My case was not really the most normal one. I was doing GeoIP lookups from raw IP addresses. Because of the fact that I was performing this work on millions of IP addresses, the inefficiency I describe above with the String.split() method became very apparent when I performed some performance and memory profiling.

As I mentioned above, my particular solution is not the correct way to deal with the general case, but it was the optimal way to handle my case. I think that the fact that I found this problem through performance testing and that I validated the worth of my changes through more testing would clear it of the possibility of "premature optimisation". Now I must admit, I'm not sure how many other people out there might be performing InetAddress lookups in tight loops so maybe there isn't any need for Sun to change their code.

As to the part about the OS doing it for you anyway, I believe that comes down to the fact that there are few things that Java can rely on the OS to do for it since in those cases, they have to have appropriate code and tests in place for each different OS that they want to support.

Why that complex?

(Anonymous)
Wouldn't something like this give approximately the same perf?:

int length = ip.length();
long res = 0;
int block = 0;
for (int i = 0; i < length; i++) {
char c = ip.charAt(i);
if (c == '.') {
res = res<<8 + block;
block = 0;
} else {
block = block*10 + c - '0';
}
}
return res;

Re: Why that complex?

I have to agree that this code makes a lot of sense. There were a couple of bugs in it, and unfortunately, there were a couple of bugs in my method as well. Once those were cleaned up, this one outperformed my method3 very slightly because it avoids the String.indexOf() call.

  • 1
?

Log in

No account? Create an account