September 01, 2011

There Went 158.

By David McKelvey

 

A little after 4 p.m. Tuesday, the server that drives the screens crashed. Node.js, the technology serving the webpages and content to the screens, ran into a memory issue, ballooned, and forced a restart of both itself and Redis, the NoSQL database that houses a local copy of the events data for the screens. In the process, all of the screens’ events data was lost.

For better or worse, I’ll bet almost none of you noticed, which I count as a good thing. :) The screens all keep their own copy of the events while running them, so over half just kept cycling events as ever; they simply wouldn’t receive any new updates. The few that failed to cycle had been communicating with the server when it went down or shortly after, and so they went black or showed a maintenance screen.
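For the curious, here is a rough sketch of that "keep your own copy" idea. It is not the actual screen code (which runs in the browser against Node.js); it is a small Python illustration, and the names `ScreenEventCache`, `healthy_server`, and `crashed_server` are made up for the example. The point is only that a failed refresh leaves the last good event list in place, so the cycle carries on.

```python
from typing import Callable, List, Optional


class ScreenEventCache:
    """Hypothetical per-screen cache that keeps the last good event list."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.position = 0
        self.stale = False  # True when the most recent refresh failed

    def refresh(self, fetch: Callable[[], List[str]]) -> bool:
        """Try to pull fresh events; keep the old copy if the server fails."""
        try:
            fresh = fetch()
        except Exception:
            # Server unreachable: keep cycling what we already have.
            self.stale = True
            return False
        self.events = list(fresh)
        self.position = 0
        self.stale = False
        return True

    def next_event(self) -> Optional[str]:
        """Return the next event in the cycle, or None if nothing is cached."""
        if not self.events:
            return None
        event = self.events[self.position % len(self.events)]
        self.position += 1
        return event


def healthy_server() -> List[str]:
    # Stand-in for a successful request to the server (made-up data).
    return ["Concert", "Lecture"]


def crashed_server() -> List[str]:
    # Stand-in for a request made while the server is down.
    raise ConnectionError("server is down")


cache = ScreenEventCache()
cache.refresh(healthy_server)   # first load succeeds
cache.refresh(crashed_server)   # later refresh fails, cached events survive
shown = [cache.next_event() for _ in range(3)]
print(shown)  # ['Concert', 'Lecture', 'Concert']
```

A screen with an empty cache gets `None` back from `next_event()`, which is roughly the case where a real screen would have nothing to show but a maintenance page.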

It took about four to five hours to restore the system, including fixing a few of the issues that crimped it in the first place and restoring all the event data from LiveWhale. The system is now hardened against the same failure, although other issues still exist. I had pushed to have the system ready for the 24th, so incidents like this are only to be expected.

But sadly, I had to reset our 158-day no-server-meltdowns record. Now at day 2, we hope to surpass it about five months from now.