POSTMORTEM ON MXTOOLBOX EMAIL OUTAGE

4/25/2011

We screwed up. We cannot thank our customers enough for their patience.

As of Monday morning (4/25) we are fully operational, with all mailboxes recovered and operating normally. We are still in full response mode, and by Monday evening we will have additional contingency plans in place. The objective of these activities is to ensure that if a similar event were to occur tomorrow, we would be in a position to respond much faster.

As it happens, we were also in the middle of a six-month project to improve our redundancy and disaster recovery plans. The lessons learned during this outage will certainly be put to use as we build out this planned expansion.

We will be contacting our directly impacted customers over the next couple of days with additional information and to answer any questions.

On behalf of our entire team, thank you again for your understanding and patience.

Eric Rachal, President
MxToolBox, Inc.

Additional Technical Information
MxToolBox hosts 15 email servers within the Amazon EC2 east coast region across 3 availability zones. On Thursday morning, April 21st, Amazon had major failures in two of these zones. During the event, many of our servers became unresponsive and we were unable to access our backups. Approximately 15% of our email customers were unable to send or receive email until Friday morning, and unable to access their previous mailbox data until 6:00am Monday morning. The MxToolBox.com website was not impacted.

On Friday morning we decided to move all affected mailboxes onto a standby server. This restored mail flow, so customers could send and receive email again, but without their previous mailbox data. In hindsight we should have done this much earlier. At the time we believed we would have access to our data volumes much sooner, allowing for a cleaner and quicker recovery path for our users.

Once we launched the backup mailboxes on the standby server and were heading into the weekend, we elected to slow down and proceed more cautiously. We tested and retested the process for merging the backup mailboxes with the restored data, with the objective of being fully operational for the start of business early Monday morning.

It is important to note that we don’t host all of our critical infrastructure “in the cloud” or with a single provider. We maintain critical email routing gateways, spam filtering services, LDAP directories, etc. on hardware we fully control.
