Server crashed because it was invisibly designed to break • The Register

On-Call The week, and even the year, may be drawing to their respective conclusions, but The Register keeps working hard on On-Call, our weekly story of technicians triumphing in difficult circumstances.

This week, meet a reader we’ll rename “Kris”, who arrived at his office on a Monday morning to find the phone ringing, his voicemail full, and his pager buzzing with urgent messages from his manager.

The cause of all this urgency was a dead server that ran an important application.

Kris quickly inspected the machine, which was connected to power and had working inverters. No telltale smell suggested anything inside had expired. Turning everything off and on again had no effect.

“The unit was dead as a doornail, as they say,” Kris wrote.

So the only thing to do was to call the company’s service provider for help.

“It was one of those situations where I really didn’t want to deal with them because in most cases by the time they sent a technician the issue was already resolved either by the user or by myself,” Kris told us. But at this point, a second opinion was the only option.

“About five hours later, the tech shows up smelling like he’s been living in his car and sleeping in his ashtray,” Kris told On-Call. In less than five minutes, he had diagnosed that the server was faulty and that new power supplies were the solution.

That hardware arrived two days later and – after Kris and the tech grunted and lifted the server to install it – failed to fix the dead box.

Kris confessed to The Register that part of him didn’t mind at all, because watching the smug technician get taken down a peg was kind of fun.

However, at the time – three days after the start of an incident that robbed the company of an important app – Kris was under more than a little pressure.

So he agreed with the visiting technician’s amused observation that the server was still down and that the motherboard must be the real problem.

Three days later a new motherboard arrived and – after more lifting and sweating – did nothing to get the server back up and running.

But all that work put Kris on the right track to finding a solution, because handling the case sparked another thought: Did the interlock switches work?

Interlock switches, for the uninitiated, are safety mechanisms that shut off power when enclosures are opened. Which is a good idea because no one should get electrocuted while working on a server.

It turned out that one of the switches had broken on this server, but the fault was invisible and undetectable.

“I had never opened the server since I had this job,” Kris said. “I made a quick trip to R&D and one of the engineers pulled out a similar switch from their parts and gave me one. Tech wired in a new switch and we were good to go.”

The triumph in this incident is not a typical On-Call win.

It came – or rather, didn’t come – in the weeks and months that followed, when Kris’s inbox never contained an invoice for the repair attempts by his outside service provider. So, although the incident was unpleasant, at least it didn’t cost Kris’s employer a penny!

On-Call will run in its usual Friday the 23rd timeslot, then offer seasonal specials. So if you have stories like Kris’s, or tech support stories from the holidays, click here to email On-Call. ®
