[rescue] Drive Replacement Question

Brian Deloria bdeloria at gmail.com
Fri Sep 7 17:11:07 CDT 2007

On 9/7/07, Ahmed Ewing <aewing at gmail.com> wrote:
> On 9/7/07, Brian Deloria <bdeloria at gmail.com> wrote:
> > I had also read that I may have had to use the metaclear and metainit
> > command.  I don't believe that this would have solved my problem plus I
> > was fairly unclear on the syntax and usage and was quite concerned I'd
> > kill the mirror.  It seemed like the examples wanted me to drop the good
> > submirror and recreate the mirror and attach each submirror again.  I was
> > also concerned over the vagueness of the examples as to which submirror
> > would overwrite the other.  The last thing that I wanted to have happen
> > is for the good submirror to be overwritten by the blank disk.
> If I'm understanding correctly, you're alluding to a complete removal
> of DiskSuite/SVM from the disks altogether. I've always considered
> this horrible overkill, especially within the constraints of a limited
> maintenance window for a production box; it's simply not necessary and
> adds extra confusion to an already busy procedure.
> Note, though, that metaclear is required in my recommendation
> too--except limited only to clearing and initializing the submirrors
> on the failed / replaced disk and *not* the top level mirror. This
> means that the config is a "1-way mirror" (top level mirror with a
> single submirror attached) during the replacement. There are no
> changes necessary to /etc/system or /etc/vfstab, and reboots aren't
> inherently required.
> In any event, the DiskSuite/SVM documentation freely available on
> http://docs.sun.com is some of the best OEM-produced stuff I've come
> across. Clear, concise, and best of all, task-oriented (see here:
> http://docs.sun.com/app/docs/doc/816-4519/6manoju18?l=en&a=view).  It
> has plenty of syntax examples. You should check it out to help clear
> up any uncertainty you might have in that regard.
> > [insert "KS" anecdotes here]
> IMHO, those horror stories are all the more reason to stick with the
> tried and true methods that have documentation to back them up.
> Sometimes that's the only thing available to CYA when faced with an
> overzealous and hardheaded colleague who has more influence on
> management than you do.
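
For anyone following along, the submirror-only replacement described in the
quoted text above can be sketched as a dry-run planner. This is only an
illustration, not a tested procedure: the metadevice names (d10, d12) and the
slice (c0t1d0s0) are made up, and the explicit metadetach step is my reading
of the "1-way mirror" remark rather than something spelled out in the thread.
It only prints commands and runs nothing.

```python
def submirror_replacement_plan(mirror, submirror, slice_name):
    """Return the SVM commands, in order, for rebuilding one submirror.

    Only the submirror on the replaced disk is touched; the top-level
    mirror stays in place and runs as a 1-way mirror in the meantime.
    All names passed in are hypothetical examples.
    """
    return [
        # Mirror drops to a 1-way mirror on the good submirror.
        f"metadetach {mirror} {submirror}",
        # Clear only the failed submirror, never the top-level mirror.
        f"metaclear {submirror}",
        # (The disk is physically replaced and repartitioned at this point.)
        # Recreate the submirror as a one-slice concat on the new disk.
        f"metainit {submirror} 1 1 {slice_name}",
        # Reattach; the resync copies from the mirror to the new submirror.
        f"metattach {mirror} {submirror}",
    ]


for cmd in submirror_replacement_plan("d10", "d12", "c0t1d0s0"):
    print(cmd)
```

Because the new submirror is attached to a mirror that already holds the good
data, the resync runs from the surviving side to the blank disk, which is the
direction I was worried about.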

I've learned that problem coworkers are best dealt with via email.  I make
no verbal requests, and whenever I've disagreed with something in
conversation they get sent a "per our conversation <problem>, let me know if
anything in my account is incorrect" message.

When things have blown up in the past and people have tried dragging me into
it, either by saying I told them to do it that way or that I never told them
not to do it that way, this paper trail has been extremely useful.  At $WORK
everything goes through purchasing.  After making verbal requests or handing
pieces of paper to purchasing, I've seen everything from quantities and
shipping speed being changed, to an order for X and Y being changed to X or
Y, to things never getting ordered at all.  I've had to go back to this
tactic, and at least once a month it saves me from getting bitched out.

Management was in fear of this guy leaving and so they never really
disciplined his subordinate cronies.

> Ah well, thanks again everyone for their input.  I too prefer to 'break'
> things and 'prove' that replacement failover / raid reattachments do in
> fact work and do so properly.  You end up with a better understanding of
> how a repair is supposed to go and the timeframe for it to take place.  I
> unfortunately have walked into a situation where there are many legacy
> systems to consider, the dependencies are ridiculous at times, and the
> documentation is nonexistent.
I agree, there's not much worse than having to inherit legacy systems
with inexplicable configs. And the most fun part is being blamed when
you can't get it squared away in short order after a failure.

I came across a great one where a customer couldn't figure out why he
couldn't restore his RAID5 metadevice from its "Needs Maintenance"
state after a proper replacement of the offending disk. At first
glance at the metastat output, I missed it too. Turns out, the guy's
predecessor saw fit to use *two slices per disk* to make the RAID5,
so when a single disk failed the RAID5 lost two members and all data
was immediately lost. There was a hot spare pool configured, but they
never associated it with any volumes, so it sat idly by. Not that it
would have been able to do anything, unless there were read/write
errors on only one of the two slices and the reconstruction had time
to complete before errors were detected on the other slice... A
further check of servers at the site showed that several others were
configured with this ticking timebomb configuration... it was the
definition of a hot sticky mess.
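
A quick way to spot that kind of layout is to group a metadevice's component
slices by physical disk. The sketch below is only an illustration: the
component list is invented, and it assumes the usual Solaris cNtNdNsN slice
naming, so anything named differently would need its own parsing.

```python
import re
from collections import defaultdict


def shared_disks(components):
    """Group slice names by physical disk; return disks used more than once."""
    by_disk = defaultdict(list)
    for comp in components:
        # Solaris slice names look like c1t2d0s3; strip the trailing sN
        # to get the physical disk the slice lives on.
        disk = re.sub(r"s\d+$", "", comp)
        by_disk[disk].append(comp)
    return {disk: slices for disk, slices in by_disk.items() if len(slices) > 1}


# Hypothetical RAID5 built from two slices on each of three disks.
raid5 = ["c1t0d0s0", "c1t0d0s1", "c1t1d0s0", "c1t1d0s1", "c1t2d0s0", "c1t2d0s1"]
for disk, slices in shared_disks(raid5).items():
    print(f"{disk}: {len(slices)} RAID5 members on one disk: {', '.join(slices)}")
```

Any disk this reports takes more than one RAID5 member down with it when it
fails, which is more than single-parity RAID5 can survive.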

Had he actually tested and tried proving that he could recover from a drive
failure, he would have learned of his mistake before it became an issue.

This has all been a good reminder for me to start finishing some long
overdue server bibles... :-)

rescue list - http://www.sunhelp.org/mailman/listinfo/rescue
