The life of a sysadmin

Well, that was unnecessarily exciting…


You know, I had just gotten home from about five hours of grading, dealing with a couple of entirely unrelated matters, and so on, and was looking forward to a calm night… suddenly, out of nowhere, came a spate of weird error messages when I tried to check my e-mail.

I pinged the server. Did it answer? Yes.

Did it answer SSH? No.

Did the main file server answer pings? No.

Oh, bugger. That file server is a tank; I’m kinda surprised that anything short of it catching on fire could panic it. But it didn’t catch fire, since the mail server is only a meter and a half away, and if the building were on fire it wouldn’t be answering pings either.

So I get there, and have a fun time fixing it. Apparently the memory had somehow failed – but after taking it out and cleaning the contacts, everything was fine again. I’m not sure if there’s a deeper problem in there somewhere.

(Of course, this makes it sound easy. What really happened is that I got there, found it panicked and incapable of bouncing; the motherboard was giving an error message. During the bloody eternal time it took one of the machines to load up the mboard manual to decipher what that meant, I had grabbed another not-quite-as-important machine, and was about 90% of the way through a full-brain transplant: taking every part of the file server that knew it was a file server and putting it in that small computer. Then I found out that the error code was memory trouble, so I decided to check and find the source; taking out the RAM chips and fiddling a bit made it work again, so I reverse-brain-transplanted. This mostly worked, except that the ethernet suddenly was unhappy; but taking it out, cleaning its contacts and putting it back seemed to take care of that too.)

So, some important discoveries for the evening:

  • Response time is doing pretty damned well. From server failure to my finding out about it was an hour; from my finding out about it to it being completely fixed (including time to drive like a bat out of hell from home to campus) was just under an hour. The only real way to speed that up would be to have it on a pager, and frankly they don’t pay me enough for that.
  • Lack of hardware, OTOH, is a problem. When I was doing that full-brain transplant I realized that the little box of screws, mounting rails, and so on – simple pieces of machined steel that one needs when fixing things – was nowhere to be found. This was turning into a significant issue when I decided to reverse the procedure. Conclusion: A box of hardware always needs to be easily on hand.
  • Good hardware: The SuperMicro SC760 server case is fscking excellent. It’s possible to open it up completely, from all sides, and field-strip it within seconds, or keep it running, with basically no effort. In future, all cases, rack and tower alike, that I buy will be from this company. The motherboard (an Epox EP-8KHA+) is also nice; its diagnostics, both LEDs and on-screen, are excellent.
  • Flaky hardware: Something strange is going on in there. I don’t know if it’s the RAM (Crucial, registered ECC), the motherboard, or something else that caused this. The CPU fan was making some noise until I played with it a bit, and I noticed that the main CPU voltage was running a bit low – 3.1V instead of 3.3V. But it’s back up to full now. I need to figure out what the hell is going on.
  • Metal: Ow ow fsck fsck goddammit… but adrenaline keeps one from noticing that until after the repairs are done with.
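
The ping-then-SSH triage at the top of this post, and the hour it took to notice the failure, suggest a small automated check. What follows is only a minimal sketch, not what was actually running that night: it assumes a plain TCP connect to port 22 is a good enough stand-in for "answers SSH", and the function names `tcp_port_open` and `failed_checks` are made up for illustration.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def failed_checks(checks, probe=tcp_port_open):
    """Return the (name, host, port) entries whose probe fails.

    `probe` is injectable so the logic can be exercised without a network.
    """
    return [(name, host, port) for name, host, port in checks
            if not probe(host, port)]
```

Run from cron on a machine that does not depend on the file server, something like `failed_checks([("file server SSH", "files.example.edu", 22)])` (the host name is a placeholder) returns the services that need attention. It distinguishes "box answers pings but no longer accepts logins" from "box is fine", which is the case that showed up here; it does not replace a pager.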

Ah, the joys of sysadmin life. At least I can bill for this.

Published on March 21, 2003 at 00:16

5 Comments

  1. My, that sounds far more exciting than a meowing siamese who refuses to eat his special diet.

  2. Hey, that’s the motherboard I put into my home-built PC! After some reasonably thorough research, I concluded that it’s got one of the best price/performance ratios of any motherboard out there (at the time I got it, anyway, about 1.5 years ago).

  3. …and as of this morning, it’s failed once again, and I’m in the middle of brain surgery right now. Hopefully I’ll be able to make it to game tonight. RAID reconstruction is in progress.

  4. Do you mean the +3.3v power supply voltage was low, or that the motherboard settings were hosed so that it was only supplying +3.1v to the CPU? Quite different things.
    I’ll spare you the horror stories I had of how I found out that one particular set of motherboards’ chipset had a completely unfixable IDE bug – except under Windows, where there was an utterly asstarded workaround.

  5. The motherboard settings were right, but the CPU was reporting Vcc=3.1V instead of 3.3. Which is leading me to suspect some sort of cooling problem… but I’ll go at it with a scope later.


