March 29, 2012
The alert came shortly after 11 a.m. on Saturday: Blackbox 1, a modular data center behind Building 50 that houses 252 computers dedicated to SLAC’s BaBar experiment, was down. Les Cottrell, SLAC’s manager of networking and telecommunications, went with network architect Antonio Ceseracciu and technical coordinator Ron Barrett to investigate and get the system back up and running as fast as possible.
The power was on, so the problem was somewhere in the network equipment or cables.
To determine the precise location of the break, Ceseracciu ran a test that sends a pulse of light down the cable. The pulse travels to the place where the cable is broken and reflects back. By measuring how long the round trip takes, much as a bat measures distance by echolocation with sound waves, they ascertained that the damaged area was 15 meters down the 100-meter cable.
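The arithmetic behind such a test is simple enough to sketch. The short Python example below is only an illustration, not the tool the SLAC team used: the function names are hypothetical, and the group index of 1.468 is an assumed typical value for silica fiber, since the article does not say what kind of fiber was involved. The pulse covers the distance to the break twice, so the one-way distance is half the round-trip time multiplied by the speed of light in the fiber.

```python
# Speed of light in vacuum, in meters per second (exact by SI definition).
C_VACUUM_M_PER_S = 299_792_458

# Assumed group index for silica fiber; the real value depends on the
# fiber type and wavelength, neither of which the article specifies.
ASSUMED_GROUP_INDEX = 1.468


def distance_to_fault_m(round_trip_s, group_index=ASSUMED_GROUP_INDEX):
    """One-way distance to a reflection, given the pulse's round-trip time."""
    speed_in_fiber = C_VACUUM_M_PER_S / group_index
    # The pulse travels out and back, so halve the total path length.
    return speed_in_fiber * round_trip_s / 2


def round_trip_time_s(distance_m, group_index=ASSUMED_GROUP_INDEX):
    """Inverse of distance_to_fault_m: echo delay for a fault at distance_m."""
    return 2 * distance_m * group_index / C_VACUUM_M_PER_S


if __name__ == "__main__":
    delay = round_trip_time_s(15.0)  # the break reported in the article
    print(f"round trip: {delay * 1e9:.1f} ns")
    print(f"distance:   {distance_to_fault_m(delay):.2f} m")
```

Under these assumptions, a break 15 meters down the cable returns its echo in well under a millionth of a second, which is why the measurement needs dedicated test equipment rather than a stopwatch.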
Meanwhile, as a preventive measure, Neal Adams of the scientific computing team disabled the jobs in the computing queue to avoid potential problems in the batch submission system software.
Cottrell, Barrett and Ceseracciu checked wires and switches one by one, and when they looked at the cable junction box behind Blackbox 1, they found the problem, as well as evidence of the likely culprit. Something had chewed through two of the yellow cables and nibbled away at the coating on the red one.
And the guilty party left behind a pine cone, neatly stashed in a corner of the junction box. All signs pointed to the gray squirrel that lives in the tree between the data center and Building 50.
Squirrels gnaw on twigs; they will just as readily chew on wires and cables. Rodents of any kind are notorious for nibbling, according to Paul Rezendes, author of Tracking and the Art of Seeing. As for the pine cone, it may have been a cache. This may not have been the squirrel’s first visit, Rezendes said. Like an opening in a tree, a junction box is “a good place to stash things or to make a home. Maybe it was just starting to renovate.”
To make the fix, Cottrell engaged Hien Trinh to perform a fusion splice to mend the break in the two compromised cables. The next step, already in progress, is to seal the opening that allowed the squirrel to get into the junction box.
Asked if it was better that the outage happened on a weekend instead of during the week when more SLAC employees are on campus, Cottrell explained that the data center does batch computing, which runs 24/7, so an outage is “bad at any time.” But within four hours of the notification, he said, everything was working smoothly once again.