Wide Finder 2 in Java
Posted 3 months, 3 weeks ago at 10:05 pm.
I wanted to give WF2 a decent shot in Java, to see how it compares to the funky OCaml / Scala / Ruby versions that other people were making. It runs pretty well, in about 15 minutes, and isn’t that complex. Okay, I did make it a single class file just to keep it simple, so OO purists won’t like it.
It’s a hack, a test, just to see how well it will run.
Notes from my big run with it…
nohup time ~bratton/jdk1.6.0_06/bin/java -d64 -XX:+UseConcMarkSweepGC -Xmx13000m -Xms13000m -cp . egb.MTNIOStats 40 128 /wf1/data/logs/O.all > nohup.30T.128k.out &
real 15:33.9
user 6:02:50.9
sys 8:36.0
Let’s see…
UseConcMarkSweepGC is a good thing. UseParallelGC sounds like it would be good with lots of cores, but it kept killing the VM about 70% through.
I went with a gigantomous heap, just because. I don’t really need it all, but as the queues get bigger, it’s nice to have. It probably would run fine with much, much less.
Java 6 does much better than Java 5. I’m using a locally-installed version since the /usr version wasn’t working. You need to unpack the 32-bit sparc9 binaries first, then unpack the 64-bit binaries on top of that. You have to use -d64 to get a heap bigger than 3G.
The -server JIT flag didn’t make any difference in processing time for me.
Interestingly, NIO block read size didn’t matter that much from 32k up to 4M, and neither did the number of threads, whether 25, 30, 35, 40, 50, 60, or even 90. Wacky! Not too much overhead in terms of context switching…
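For anyone who wants to poke at the block-size question themselves, here is a minimal sketch of a block-at-a-time NIO read loop. It is not lifted from MTNIOStats.java; the class name `BlockReadSketch` and the method `countBytes` are made up for illustration, and it only counts bytes rather than parsing anything. It uses try-with-resources, so it assumes Java 7 or later rather than the Java 6 used for the run above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: read a file in fixed-size blocks through a FileChannel.
class BlockReadSketch {

    /** Reads the file in blocks of at most blockSize bytes; returns total bytes read. */
    static long countBytes(String path, int blockSize) throws IOException {
        long total = 0;
        try (FileInputStream in = new FileInputStream(path);
             FileChannel channel = in.getChannel()) {
            ByteBuffer buf = ByteBuffer.allocate(blockSize);
            // read() returns -1 once the end of the file is reached
            while (channel.read(buf) != -1) {
                buf.flip();                // switch the buffer from filling to draining
                total += buf.remaining();  // a real worker would parse these bytes
                buf.clear();               // reuse the same buffer for the next block
            }
        }
        return total;
    }
}
```

Timing `countBytes` with block sizes such as `32 * 1024` and `4 * 1024 * 1024` on the same file is one way to check whether block size matters on a given machine.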
My app design has a single thread doing all the IO, simple reads, and then it hands the ByteBuffers to a separate blocking queue for each worker thread, to avoid any lock overhead. I think that’s probably irrelevant now, and that locking would be nanoseconds, so maybe I’ll redesign it, but it works fine.
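The shape of that design, one producer feeding a private blocking queue per worker, can be sketched roughly like this. Again, this is an illustration and not the actual MTNIOStats.java code: the class name `PerWorkerQueueSketch`, the round-robin handoff, the queue capacity of 16, and the empty-buffer end marker are all assumptions, and the "work" is just adding up buffer sizes. It uses lambdas, so it assumes Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one producer, one blocking queue per worker thread.
class PerWorkerQueueSketch {

    // Sentinel buffer telling a worker there is no more input (compared by identity).
    private static final ByteBuffer DONE = ByteBuffer.allocate(0);

    /** Hands blocks to `workers` threads round-robin; returns total bytes the workers saw. */
    static long process(List<ByteBuffer> blocks, int workers) throws InterruptedException {
        List<BlockingQueue<ByteBuffer>> queues = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        AtomicLong totalBytes = new AtomicLong();

        for (int i = 0; i < workers; i++) {
            // Each worker owns its queue, so workers never contend with each other.
            BlockingQueue<ByteBuffer> q = new ArrayBlockingQueue<>(16);
            queues.add(q);
            Thread t = new Thread(() -> {
                try {
                    for (ByteBuffer b = q.take(); b != DONE; b = q.take()) {
                        totalBytes.addAndGet(b.remaining()); // stand-in for log parsing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }

        // The calling thread plays the role of the single IO thread.
        int next = 0;
        for (ByteBuffer block : blocks) {
            queues.get(next).put(block);   // blocks if that worker's queue is full
            next = (next + 1) % workers;
        }
        for (BlockingQueue<ByteBuffer> q : queues) {
            q.put(DONE);                   // tell every worker to finish
        }
        for (Thread t : threads) {
            t.join();
        }
        return totalBytes.get();
    }
}
```

A single shared queue would be the simpler alternative mentioned above; the per-worker version trades a little bookkeeping for less contention between workers.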
My biggest problem is that my worker threads are for the most part waiting on IO to get more data to process. And the reduce phase at the end is not very long, 64 seconds, and it’s actually single threaded for now, because shrinking 64 seconds to 64/5 seconds isn’t going to drop me from 15 minutes to 7 minutes.
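To put numbers on that: with the 15:33.9 wall time from the run above (933.9 seconds) and a 64 second reduce phase, a 5x faster reduce would give 933.9 - 64 + 64/5 = 882.7 seconds, or roughly 14:43. A tiny sketch of the same arithmetic, with the hypothetical names `ReduceSpeedupSketch` and `newTotalSeconds`, and assuming everything outside the reduce phase stays exactly the same:

```java
// Hypothetical sketch: wall time if only the reduce phase gets faster.
class ReduceSpeedupSketch {

    /** Total seconds after speeding up only the reduce phase by the given factor. */
    static double newTotalSeconds(double totalSeconds, double reduceSeconds, double speedup) {
        return totalSeconds - reduceSeconds + reduceSeconds / speedup;
    }

    public static void main(String[] args) {
        double total = 15 * 60 + 33.9;                    // "real 15:33.9" in seconds
        double after = newTotalSeconds(total, 64.0, 5.0); // reduce phase 5x faster
        System.out.println(total + " s before, " + after + " s after");
    }
}
```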
And okay, my results are off by .01% or something, but I haven’t re-run since I updated my parser to handle spaces in URLs. Close enough for me, not close enough for some other WF2ers.
It all depends on your business domain what your accuracy needs to be.
MTNIOStats.java - file name capitalization may be munged - thanks WordPress!