We screwed up. We cannot thank our customers enough for their patience.
As of Monday morning (4/25) we are fully operational, with all mailboxes recovered and operating normally. We are still in full response mode and by Monday evening we will have additional contingency plans in place. The objective of these activities is to ensure that if a similar event were to occur tomorrow, we would be in a position to respond much faster.
As it happens, we were also in the middle of a six-month project to improve our redundancy and disaster recovery plans. The lessons learned during this outage will certainly be put to use as we build out this planned expansion.
We will be contacting our directly impacted customers over the next couple of days to provide additional information and answer any questions.
On behalf of our entire team, thank you again for your understanding and patience.
Eric Rachal, President
Additional Technical Information
MxToolBox hosts 15 email servers within the Amazon EC2 east coast region across 3 availability zones. On Thursday morning, April 21st, Amazon had major failures in two of these zones. During the event, many of our servers became unresponsive and we were unable to access our backups. Approximately 15% of our email customers were unable to send or receive email until Friday morning, and unable to access their previous mailbox data until 6:00am Monday morning. The MxToolBox.com website was not impacted.
On Friday morning we made the decision to move all affected mailboxes onto a standby server to restore mail flow, so that users could send and receive email again, but without their previous mailbox data. In hindsight we should have done this much earlier. At the time we believed we would have access to our data volumes much sooner, allowing for a cleaner and quicker recovery path for our users.
Once we launched the backup mailboxes on the standby server and were heading into the weekend, we elected to slow down and proceed more cautiously. We tested and retested the process for merging the backup mailboxes with the restored data, with the objective of being fully operational for the start of business early Monday morning.
It is important to note that we don’t host all of our critical infrastructure “in the cloud” or with a single provider. We maintain critical email routing gateways, spam filtering services, LDAP directories, etc. on hardware we fully control.