[rescue] Inside Solaris newsletter

Jason T silent700 at gmail.com
Thu Oct 2 12:30:36 CDT 2014

On Thu, Oct 2, 2014 at 9:26 AM, Andrew Hoerter <amh at pobox.com> wrote:
> Thanks for doing this, Jason.  Your mail was a reminder that I've got
> some of these newsletters myself.  I have these issues from each year:
> 1996: 3-5, 7-12
> 1997: 1-3, 7-12
> 1998: 1-6
> 1999: 4

I'm willing to take on additional scans, if you want to send them
over.  I do chop the edges, though, so if you want them back, they'll
be a little narrower :)

We've got 3-5/98 and 4/99 up, so that leaves the 96 and 97 issues plus
1-2/98 from your stack.  Email me directly if you'd like to have me do
them and I'll get you the address.

> Speaking of scanning, are there any resources out there for figuring
> out the ideal scanners to buy or pitfalls to avoid when sucking in old
> documentation?  Maybe my google-fu is deficient.  The bitsavers.org
> page has a quick blurb on what they use, but it would be nice to read
> in more detail how people are saving classic computing documents.

Someday I'm going to have to write up my process and let others tell
me what I'm doing wrong :)   The Bitsavers info was a good start but I
think it's a bit dated.  I imagine AEK has better equipment and a more
refined process now.  One of my "scanning mentors" has been DLH over
at bombjack.org.  He's the scanning king in the Commodore world.

At the moment, I have two main scanners.  The first is an old Fujitsu
4097d, which does dual-side B/W and greyscale faster than anything
else I have access to.  It's huge, SCSI-only and has no driver support
beyond XP (although it does work with SANE in current Linux, from what
I've read).  It keeps on truckin', though, and really tears through
stacks of B/W manuals.  It can do 11x17 pages and longer pages for
things like fold-out schematics as well (11x40 and so on).  I got it
for $25 off eBay (+lots of shipping) and just picked up a NIB refurb
kit for it, so I hope to get a lot more use out of it.

The other is a Fujitsu FI-5530C, which I bought from a local guy who
refurbs scanners.  It's not the newest line but new enough to have
USB2 (ever tried color scanning through USB1.1?  Don't.)  It will do
dual-sided 11x17 pages and, having no flatbed, takes up a lot less
space.  Slower than the big 4097 but not terribly so.

I have access to a big Xerox Docucenter at work that I have used for
some scans.  Unlike the two scanners at home, it mechanically "flips"
pages to do duplexing, instead of having two sensors and a smooth,
straight paper path.  Lately the duplexer can barely get through a
stack without a jam, so I've taken to scanning the two faces
separately and putting them back together with a script.  Worse, it
has badly mangled documents, and that's never good.
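
Putting the two faces back together is mostly a matter of interleaving
two file lists.  A minimal sketch in Python, assuming the fronts and
backs were saved as separate image sequences and that the backs came
out in reverse order (which is what happens when the whole stack is
flipped over and fed again).  The function name and file names are
made up for illustration; this is not the actual script mentioned
above.

```python
def interleave_sides(fronts, backs, backs_reversed=True):
    """Merge separately scanned fronts and backs into reading order.

    fronts and backs are lists of file names, one per sheet.
    backs_reversed=True assumes the stack was flipped before the second
    pass, so the last sheet's back was scanned first.
    """
    if len(fronts) != len(backs):
        raise ValueError("front and back stacks differ in length")
    if backs_reversed:
        backs = list(reversed(backs))
    merged = []
    for front, back in zip(fronts, backs):
        merged.extend([front, back])
    return merged


if __name__ == "__main__":
    # Hypothetical file names for a three-sheet document.
    fronts = ["front_001.tif", "front_002.tif", "front_003.tif"]
    backs = ["back_001.tif", "back_002.tif", "back_003.tif"]
    for name in interleave_sides(fronts, backs):
        print(name)
```

The length check is there because a jam or double-feed on either pass
throws every later page out of step, and it is better to find that
out before assembling the PDF.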

For fragile docs and really nice color scans, I've got an Epson V300
flatbed ($9.99 at Goodwill).  A cheap Brother all-in-one
printer/scanner thing handles 11x17 color scans better than you'd
expect, although I can't think of the last time I used it.

Since I am, for better or worse, still bound to the Windows desktop, I
usually scan directly into Acrobat Pro using the TWAIN drivers for the
various scanners, with the exception of the flatbed, where I use
Epson's own tool to scan to a TIFF (usually for the color covers of an
otherwise B/W doc.)   I scan B/W at 400dpi and color covers usually at
300.  Really nice color docs, like brochures with lots of photos, I'll
do at 400 or better.  The Solaris newsletters were only 300dpi but
still came out pretty good.

Lately I've been thinking I should scan to image files first, then
compile them in Acrobat, but I have not yet found a Windows TWAIN tool
that I like.  I often have to touch up a page or correct a bad skew,
which might be easier to do with raw images rather than the
Acrobat->Photoshop edit function.  Acrobat has no image editing tools
of its own.

Sometimes B/W docs will have photos, which can be tough to capture
with any quality in B/W mode.  At 400dpi you're scanning with enough
resolution to capture the halftone dots most printers used, but
tweaking the contrast/brightness settings can be a pain.  In those
cases, I'll go back and scan just the pages with photos in 8-bit
greyscale mode.  It increases the doc size but not by too much if you
compress wisely.
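
For a rough sense of the size difference before compression, here is
the arithmetic as a small Python sketch.  The letter-size page
(8.5x11 inches) is an assumption for illustration, and the numbers are
raw bitmap sizes only; what ends up in the PDF depends entirely on the
compression used.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed bitmap size in bytes for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    bw = raw_page_bytes(8.5, 11, 400, 1)    # 1-bit B/W
    grey = raw_page_bytes(8.5, 11, 400, 8)  # 8-bit greyscale
    print(bw)          # 1870000
    print(grey)        # 14960000
    print(grey // bw)  # 8
```

So an uncompressed greyscale page is eight times the size of the same
page in B/W, which is why it makes sense to rescan only the pages that
actually have photos.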

In Acrobat I take care of assembling the pages (separately scanned
covers, multi-session scans, missed pages, etc.), PDF page numbering
and, if I'm feeling like putting in the effort, chapter titles as PDF
bookmarks.  Next is OCR, done with the "searchable image exact"
setting, so no bitmaps are removed or DPI settings changed and the OCR
layer is hidden.  IMHO doing OCR is a best-effort, "search terms only"
endeavor.  We'll never have 100% accuracy and we're aiming to preserve
document images, not just content.  So far, I've only used Acrobat's
OCR engine; there is probably better stuff out there.  Also, when you
upload to archive.org, they re-OCR the docs with their own engine, as
does Google (I've found many non-OCR'd docs on bitsavers by way of a
search on some phrase inside the doc.)

The major final step is compression, which is a touchy subject.
Definitely to be done post-OCR but lately I'm reluctant to do much of
it at all. It's probably the area where I have the least experience.
Acrobat's compression for B/W pages is "lossless" (I think it's just
ZIP) but for color and grey, it can really make a mess of a page.  At
this point I run it on the second-from-top compression setting, no
background removal or any of the other stuff besides deskewing (which
rarely seems to do anything) and save a separate copy to compare to
the pre-compressed version.  If the quality is still good and it saved
a lot of file size, I'll go with it.  Otherwise it's probably not
worth it.
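
The file-size half of that comparison is easy to automate (judging the
quality still takes eyeballs).  A minimal Python sketch, assuming the
pre- and post-compression copies are saved as separate files; the
function name is hypothetical and no particular "worth it" threshold
is implied.

```python
import os


def size_savings(original_path, compressed_path):
    """Return the fraction of bytes saved by the compressed copy.

    0.25 means the compressed file is 25% smaller; a negative value
    means the "compressed" copy actually grew.
    """
    original = os.path.getsize(original_path)
    compressed = os.path.getsize(compressed_path)
    if original == 0:
        raise ValueError("original file is empty")
    return (original - compressed) / original
```

Calling it with the two file paths returns a number that can be
printed as a percentage alongside a side-by-side look at the pages.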

Lastly I add some metadata to the PDF, in case I or someone else
someday writes a web file browser that uses it.  It also allows me to
store the complete, correct name of the document rather than the
shortened (and illegal character-removed) version in the filename.
For file naming I use the bitsavers standard of [part#_title_rev_date]
or at least something close to it.  I have a lot of old docs to go
back and rename...
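
For the renaming, something along these lines could build the file
name from the four fields.  This is only a sketch in Python: the field
order follows the [part#_title_rev_date] pattern above, but the exact
set of characters stripped (the ones Windows rejects in file names)
and the use of underscores for spaces are my assumptions, and the
example values are invented.

```python
import re

# Characters Windows does not allow in file names.
ILLEGAL = r'[<>:"/\\|?*]'


def make_filename(part_no, title, rev, date, ext="pdf"):
    """Join part number, title, revision and date with underscores."""
    fields = []
    for field in (part_no, title, rev, date):
        cleaned = re.sub(ILLEGAL, "", field)
        cleaned = re.sub(r"\s+", "_", cleaned.strip())
        if cleaned:  # skip fields that are missing or end up empty
            fields.append(cleaned)
    return "_".join(fields) + "." + ext


if __name__ == "__main__":
    # Hypothetical document, not a real part number.
    print(make_filename("123-4567-01", "Widget: User Guide?", "B", "Jan85"))
    # 123-4567-01_Widget_User_Guide_B_Jan85.pdf
```

The full, unmangled title ("Widget: User Guide?" here) is what would
go in the PDF metadata.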

This definitely won't be everyone else's process, especially as
regards the software.  There are surely cheaper (and often free) tools
out there to work with PDFs, both on Windows and *nix, but the general
processes are the same. Given some search effort, affordable hardware
is available.  It takes some practice (mainly of the trial-and-error
variety) but everyone can contribute to preserving history.

- jht
