Thursday, December 24, 2009

Confluence, iSCSI, NetApp, Flexiscale, Fail

We've been hosting the FudgeMsg website using Confluence (and mad props for the free Open Source license by the way!) on a VM hosted by Flexiscale. We chose Flexiscale for the following reasons:
  • Confluence is ridiculously tricky to cluster, meaning that you can't benefit from the scale-out capabilities of the Amazon EC2 model. (Edit 2009-12-24 - We're not attempting to cluster it; this statement is to indicate that because it's so tough to cluster, we're not trying to. Confluence is thinking we might be, and crashing as a result).
  • Getting your AMI+EBS+ElasticIP for such a single-point vendor app is very difficult and time consuming, again meaning that the Amazon EC2 model isn't ideal.
  • We didn't want to go for dedicated hardware for a low-volume application stack.
  • We don't have a stable enough internet connection to host the application stack ourselves.
We thought we were pretty darn clever to be honest, but then we noticed that things started going wrong. In particular, we started getting this error a lot:


We were running a relatively complicated Java setup (jsvc to run as chrooted user, on Tomcat, with multiple web applications running), so we thought we had done something wrong. After all, the first rule in software engineering is always, always, assume you have caused the break.

We were wrong.

What we've since found out is that the entire Flexiscale model is that your VMs have effectively no local storage whatsoever, and everything is iSCSI hosted off of a NetApp cluster. That means that your storage is fully persistent, and survives migrations of your VMs across physical hardware. Which is a great concept when it works.

Except that the Flexiscale NetApp cluster is borked. Essentially, sporadically the iSCSI services will completely shut down. Sometimes for a second, sometimes for 4 hours. Your processes will still be running, but if they attempt to hit disk for any reason, they'll hang and ultimately timeout. Your processes are still running, but they can't talk to disk.

In the case of Confluence, which was actually trying to talk to PostgreSQL, which itself was trying to talk to disk, Confluence detects the complete timeout hang of PostgreSQL as a cluster violation, and puts itself into crash mode.

You want to know the best part of this? When you're in this state, there's nothing you can do. You can't SSH into the VM, because it can't read the password file to let you in. The Flexiscale tools won't even let you hard bounce the machine. And they won't tell you when it's back up directly, so you just have to keep trying until you can finally get into the VM, to restart your servlet container.

This has caused no fewer than 20 instances where Confluence has died for us, sometimes lasting hours until we can actually recover the VM, and twice now in 2 days. It makes us look like morans who can't even run Confluence, much less guide development of a message encoding system.

So until we manage to get off of Flexiscale (haha, can't even get in to back up the data at the moment), if you see that error when going to the Fudge Messaging website, now you'll know why.

There are two morals of the story:

blog comments powered by Disqus