Neil - I think you've done an impressive job with your thread-profiling package. Let me relay my own experience with performance tuning, which covers plenty of languages, though not yet Java.
Ever play one of those arcade games where something pops up and you have to try to shoot it? That's how I do performance tuning. I take samples of the stack manually with the pause button. As soon as I see a line of code appear on more than one stack, it is a potential target. If it is clearly a bottleneck, I shoot it down (fix it and start over). If I'm not sure, I keep sampling until I do see such a target.
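For readers who want a programmatic stand-in for the pause button, here is a rough Java sketch. The class name `PauseSampler` and its methods `sample` and `repeatedLines` are made up for illustration; the only library call relied on is the standard `Thread.getStackTrace()`. It is not the manual procedure itself, just one way to approximate it under the assumption that a handful of snapshots a few milliseconds apart is good enough.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: approximates "hit pause and look at the stack" in code.
class PauseSampler {

    // Take up to `count` stack snapshots of `target`, pausing between them.
    public static List<StackTraceElement[]> sample(Thread target, int count, long intervalMillis) {
        List<StackTraceElement[]> samples = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            samples.add(target.getStackTrace());
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop early
                break;
            }
        }
        return samples;
    }

    // Lines (Class.method:line) that show up in more than one sample: the potential targets.
    public static Set<String> repeatedLines(List<StackTraceElement[]> samples) {
        Map<String, Integer> seenIn = new HashMap<>();
        for (StackTraceElement[] stack : samples) {
            // Count a line once per sample, even if recursion puts it on the stack twice.
            Set<String> unique = new HashSet<>();
            for (StackTraceElement frame : stack) {
                unique.add(frame.getClassName() + "." + frame.getMethodName() + ":" + frame.getLineNumber());
            }
            for (String line : unique) {
                seenIn.merge(line, 1, Integer::sum);
            }
        }
        Set<String> targets = new HashSet<>();
        for (Map.Entry<String, Integer> entry : seenIn.entrySet()) {
            if (entry.getValue() > 1) {
                targets.add(entry.getKey());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // A deliberately busy worker thread to have something to sample.
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        });
        worker.setDaemon(true);
        worker.start();

        List<StackTraceElement[]> samples = sample(worker, 10, 20);
        worker.interrupt();

        for (String line : repeatedLines(samples)) {
            System.out.println(line);
        }
    }
}
```

Note that every frame on the stack is considered, not only the top one, so a repeated line anywhere in the stack counts as a candidate.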
Where this differs from your technique is that you have the concept of an "interesting target", such as the first line in the user's own code. What I've found is that the bigger the code is, the deeper the stack is, and the more likely I am to find things mid-stack that are easily shot down.
It works because if a line of code appears on a sample, that slice of time is being spent because that line is there, and would not be spent if it weren't there. The fraction of time the line is responsible for is roughly the fraction of samples that contain it, and rough is good enough.
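The estimate is simple enough to write down: if a line appears on k of n samples, it is responsible for roughly k/n of the time. A minimal Java sketch follows. The class `SampleFractions`, the method `fractions`, and the frame labels are all hypothetical, and samples are represented as plain lists of strings so the arithmetic is easy to follow.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a set of stack samples into per-line time estimates.
class SampleFractions {

    // For each line, the fraction of samples whose stack contains it (k / n).
    public static Map<String, Double> fractions(List<List<String>> samples) {
        Map<String, Integer> seenIn = new HashMap<>();
        for (List<String> stack : samples) {
            // A line counts once per sample, however many times it is on that stack.
            for (String line : new HashSet<>(stack)) {
                seenIn.merge(line, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : seenIn.entrySet()) {
            result.put(entry.getKey(), entry.getValue() / (double) samples.size());
        }
        return result;
    }

    public static void main(String[] args) {
        // Four made-up samples; "Report.render:42" sits mid-stack on three of them.
        List<List<String>> samples = Arrays.asList(
            Arrays.asList("main:5", "Report.render:42", "String.format:1"),
            Arrays.asList("main:5", "Report.render:42", "Report.render:42"),
            Arrays.asList("main:5", "Report.render:42"),
            Arrays.asList("main:6", "Db.query:7"));

        Map<String, Double> estimate = fractions(samples);
        System.out.println(estimate.get("Report.render:42")); // prints 0.75
    }
}
```

With only four samples the 0.75 is a coarse figure, but it is enough to say the line is worth a look, which is all the technique asks of it.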
-Mike
P.S. Here's a rant which, even if it's a rant, is true:
http://stackoverflow.com/questions/1777556/alternatives-to-gprof/17...