Before you blame the game, try blaming the player
I am referring to my post last week, where I was looking into JVM garbage collection issues, and in particular an issue I was having with spontaneous ColdFusion 9 server restarts. As it turned out, the problem was one of swap space, or rather a lack of it.
Looking into JVM memory issues can be very daunting, and I think initially I went through a set of emotions not unlike the five stages of grief.
- Denial – This exception cannot be anything to do with my application
- Anger – How could Adobe release a product with such shoddy memory management?!
- Bargaining – Can we just get one more new 64-bit server and fill it with RAM?
- Depression – Jaysus, will this JVM garbage collection explanation ever make any sense to me?
- Acceptance – OK, I’ll accept I might have a memory leak, now how to find it?
I’ve spent the guts of a week looking into this issue, and after I posted on Stack Overflow I started to accept that I might actually have a memory leak. It was difficult to accept. I thought the stupid permanent generation just needed to be tweaked up, but I was wrong, and this post explains how I found out that it was in fact not Java, not ColdFusion (well, not entirely!) but mostly me that was to blame.
So how did I know I might have a memory leak, apart from the obvious (and terminal) exceptions? Well, I noticed that jConsole showed my old generation ever increasing; I was expecting to see that generation cleaned up at some point.
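If you prefer the command line to jConsole, `jstat` – also bundled with the JDK – can show the same trend. A sketch, where the PID `1234` is a placeholder for your own JRun process ID:

```bat
:: Sample GC utilisation for PID 1234 every 5 seconds. Watch the O column:
:: it is old-generation occupancy as a percentage of capacity, and in a
:: leak it creeps upwards and never drops back after a full GC.
jstat -gcutil 1234 5000
```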
Taking a Java Dump
Stop giggling! The first thing I decided to do was to let the server run for a while, consuming memory, until I knew from past experience it was close to crashing. I then used `jmap`, which is bundled with the JDK, to take a heap dump from JRun. This turned out to be impossible over Remote Desktop or Terminal Services, because my user on Windows 2008 was not the same user under which the CF 9 service was running. The error was “Not enough storage is available to process this command”, and in the end a suggestion to use SysInternals `psexec` was brilliant. Using its `-s` switch, I was able to remotely kick off `jmap` on the remote server with the JRun PID (process ID), which I got from Task Manager on the remote server.
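For reference, the combination looked roughly like this – a sketch only, where the server name, JDK path, dump path and PID are all placeholders:

```bat
:: psexec -s runs jmap as the local SYSTEM account on the remote box,
:: which can attach to the JRun service where my RDP user could not.
psexec \\cfserver -s "C:\Program Files\Java\jdk1.6.0\bin\jmap.exe" -dump:format=b,file=C:\dumps\jrun-heap.bin 1234
```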
Half a Gig of JVM Goodness
Yep, the heap dump worked, but `psexec` hung: I had to kill it on my local machine, and then kill the `jmap` process on the remote machine – but the dump was intact. It was also nearly half a GB in size. I renamed it to have a `.bin` extension, zipped it up (which reduced the size to about 150MB) and then copied it to my local machine.
Next I needed to analyse the heap dump, and there is no better (free!) tool than the Eclipse Memory Analyzer plugin. I couldn’t get the BIRT charting plugin to work, so I had to do without the charting features, but I could live without the eye candy (normally I love graphs though).
With a little digging around you can really see what objects are using memory in your JVM, and the killer feature for me was the interestingly named ‘dominator tree’ which lists the biggest objects and what they keep alive. I had one object actually keeping 70% of the heap alive. You read that right – 70% of all my server’s heap memory. At that point that was over 200MB of memory.
That was the key really. Once I drilled down I could see the class names of the CFCs being kept alive, and when I went and inspected the code, these objects were used exclusively in a long-lived thread that operates as a queue.
ColdFusion 9 is not entirely blameless, though. I went back to the first time we had this error on our server, and it was just after we upgraded to CF 9. Very suspicious.
CF 9 introduced the idea of a `LOCAL` scope, and as it turned out, the objects that were accumulating were all held together by a `HashMap` object, i.e. a ColdFusion `struct`. It seems that between the way I had coded the long-lived thread – calling a function that never ended – and the introduction of this new scope, I inadvertently created a memory leak. Since my function never ended, I decided to assign any temporary variables to the `VARIABLES` scope and then explicitly unassign them, and this appears to have done the trick.
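To make that concrete, here is a minimal sketch of the workaround – not my actual code, and `getNextItem()` / `handleItem()` are hypothetical helpers standing in for the real queue logic:

```cfml
<cfcomponent>
	<!--- Hypothetical sketch: a queue-processing function that never returns. --->
	<cffunction name="processQueue" access="public" returntype="void">
		<cfloop condition="true">
			<!--- Put the temporary in the VARIABLES scope instead of LOCAL... --->
			<cfset variables.currentItem = getNextItem()>
			<cfset handleItem(variables.currentItem)>
			<!--- ...and explicitly unassign it when done, so nothing
			      accumulates while the function runs forever. --->
			<cfset structDelete(variables, "currentItem")>
		</cfloop>
	</cffunction>
</cfcomponent>
```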
So I wanted to let people know that, all things being equal, the problem is probably
- Not a driver memory leak
- Not a ColdFusion bug
- Not poorly configured JVM startup parameters
- Not the lack of a 64-bit server and unlimited RAM
Instead it is often (but not exclusively) some code you have written that has a subtle memory leak, and you just need to persevere and track it down. Good luck!
You might also be interested in the bookmarks I saved as part of my research.