[rescue] ECC [was: Re: WOT: Ebay changes to IBM from Sun E10K servers?]
newell at cei.net
Tue Jul 15 11:30:58 CDT 2003
>> The typical response from a PC to an ECC error is an NMI. While Linux
>> doesn't really "support" it, it's rather hard to ignore, and I know from
>> experience that it'll bitch like mad if you get one.
>Just because you get an NMI, doesn't mean you know what to do with it,
>or that the hardware gives you enough information.
Does it throw an NMI on a correctable or uncorrectable ECC condition?
What's the proper response for an uncorrectable error--halt the machine,
log and continue as if nothing happened, something in between?
Do any systems implement memory washing (a.k.a. scrubbing)? I've read
about this with sats, especially LEO birds: the EDAC system loops through
all of memory, reading each word, correcting the recoverable errors, and
writing the corrected word back. The idea is to catch the bad bits before
they become too numerous to correct.
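
To make the washing idea concrete, here is a rough sketch of one scrub pass.
It is only a toy model, not how any particular EDAC chip or OS does it: it
assumes a tiny SECDED code (Hamming(7,4) plus one overall parity bit) instead
of the wider codes real ECC memory uses, and the names encode, check and
scrub are made up for illustration. It also shows the difference between a
correctable (single-bit) and an uncorrectable (double-bit) condition.

```python
# Toy SECDED scrubber: Hamming(7,4) plus an overall parity bit in bit 0.
# All function names here are hypothetical, chosen for this sketch only.

def encode(nibble):
    """Pack 4 data bits into an 8-bit codeword (bits 3, 5, 6, 7 hold data)."""
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1             # overall parity: whole word is even
    return sum(b << i for i, b in enumerate(bits))

def check(word):
    """Return (status, word), with a single-bit error repaired if present."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            syndrome ^= pos                 # XOR of set positions; 0 if clean
    odd_parity = bin(word).count("1") & 1
    if odd_parity:
        # Exactly one bit flipped; syndrome 0 means the parity bit itself.
        return "corrected", word ^ (1 << syndrome)
    if syndrome:
        # Even parity but non-zero syndrome: two bits flipped, cannot repair.
        return "uncorrectable", word
    return "ok", word

def scrub(memory):
    """One pass over memory: write back corrected words, report the rest."""
    corrected, bad = 0, []
    for addr, word in enumerate(memory):
        status, fixed = check(word)
        if status == "corrected":
            memory[addr] = fixed            # the write-back is the "washing"
            corrected += 1
        elif status == "uncorrectable":
            bad.append(addr)
    return corrected, bad

memory = [encode(n) for n in range(16)]
memory[3] ^= 0b00100000                     # one upset: correctable
memory[9] ^= 0b01000010                     # two upsets in one word: not
print(scrub(memory))                        # (1, [9])
```

Word 3 gets repaired because it was scrubbed while it still had only one bad
bit; word 9 had already collected two and can only be reported.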