Dragaera

Need volunteer(s)

David Dyer-Bennet dd-b at dd-b.net
Fri Jul 5 13:07:43 PDT 2002

We need one or more volunteers to scan and OCR a book for us.
Specifically, we need to scan and OCR _The Phoenix Guards_ (TPG).

I did _Jhereg_, _Yendi_, and _Teckla_ myself.  Each of them took 2-3
hours to scan, OCR, and look at all the things the spellcheck queried.
I was using a somewhat slow flatbed scanner (I chose it for its
ability to handle photos, not for its speed; *and* it's several years
old), and ABBYY FineReader 6.0 OCR.  (Note that TPG is roughly 2-3
times as big as any of those three; so on this basis TPG might take a
bit over 6 hours total.)

(So far as I can see it's legal to OCR a book you own; it's what you
do with it *after* that that may be illegal.  It's certainly legal to
OCR a book *for the copyright holder*, that is, for Steven, and pass
the file to him.  If you do this for us, we do ask you *not* to
distribute the results to *anybody* else!)

Obviously you need to have a flatbed scanner to do this, and OCR
software.  The package I used is downloadable for a 15-hour trial
(hours of actual use, not elapsed hours).  And you need to be willing
to risk or sacrifice a copy of the book in question; we don't have a
budget or a pile of free copies of the books sitting around anywhere.

It was easiest to do with the pages loose rather than bound, which was
pretty easy to arrange -- for $0.50 Kinko's cut the binding off for me
on their big guillotine paper cutter.  This does have the downside of
ruining the book; but a used paperback in poor condition works fine
for scanning.  Doing it with the pages bound probably ruins the book
anyway, through pressing it down flat on the scanner.  When we get to
the rare books, it may be worth considering alternatives, like
photographing the pages with a digital camera.

Anybody interested, please get in touch with me by email,
dd-b at dd-b.net (no point tying up the group for it).  I will provide
more details about what we do and don't want in the resulting files,
and coordinate assigning page number ranges to various volunteers.
I'm hoping those who volunteer can put in a few hours fairly soon --
say, three people each putting in three hours in the next week would
get the book done.
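
For anyone curious how the page ranges might be divided, here is a
minimal sketch in Python.  It is only an illustration, not the actual
procedure: the function name split_page_ranges is made up for this
example, and the page count of 300 is a placeholder, not the real
length of the book.

```python
def split_page_ranges(first_page, last_page, volunteers):
    """Divide pages first_page..last_page (inclusive) into contiguous
    ranges, one per volunteer, with sizes differing by at most one page."""
    if volunteers < 1:
        raise ValueError("need at least one volunteer")
    total = last_page - first_page + 1
    if total < volunteers:
        raise ValueError("more volunteers than pages")
    # Everyone gets `base` pages; the first `extra` volunteers get one more.
    base, extra = divmod(total, volunteers)
    ranges = []
    start = first_page
    for i in range(volunteers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Placeholder page count (300), NOT the real length of the book.
for number, (start, end) in enumerate(split_page_ranges(1, 300, 3), start=1):
    print(f"volunteer {number}: pages {start}-{end}")
```

Contiguous ranges keep each volunteer's output in page order, so the
pieces can simply be concatenated afterwards.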

This is for the full-text search engine for the books.  We hope to
make this available in "alpha" test mode to a very few selected people
relatively soon (sorry, I've already figured out who I want to ask),
and it will be a part of the new public web site when it opens.  It'll
also be useful to Steven to have an etext of the published version of
the book (especially so for TPG, since the manuscript etext appears to
have been lost).  Oh, in fairness, I should say there's still some
possibility that a problem (political / legal) with making the search
engine publicly available *could* crop up, though it now looks
unlikely.  We will, of course, credit the people who scanned
particular books for us in the credits on the search engine.

I'm currently using Steven's manuscript etext for _Issola_ and _Five
Hundred Years After_, and will probably start with that for more recent books
where one is available.  But having the search engine reflect the
actual published text is the long-term goal, so eventually if this
goes ahead we'll need to scan the rest of the books, too.  (Looking at
the amount of work the scanning is, it's pretty clearly less work to
just do the scan than it is to collate the manuscript etext manually
against the published copy and update it.)
-- 
David Dyer-Bennet, dd-b at dd-b.net  /  New TMDA anti-spam in test
 John Dyer-Bennet 1915-2002 Memorial Site http://john.dyer-bennet.net
        Book log: http://www.dd-b.net/dd-b/Ouroboros/booknotes/
         New Dragaera mailing lists, see http://dragaera.info