A Flora of California by Willis Linn Jepson

Electronic Access to a Rare Book

The following rare book was made available to the public electronically through a unique collaboration between the UC Berkeley Digital Library Project, BSCIT, XEROX PARC, and The Jepson Herbarium.

A Flora of California
Jepson's A Flora of California is an important, insightful scholarly milestone that is still much in demand among researchers. This richly illustrated botanical masterpiece was published in several volumes spanning decades, but it was never completed before Jepson's death. The published volumes are a cornucopia of information, gathered over Jepson's lifetime, on the plants of California in the early 20th century.

Systematic botany differs from other sciences in its persistent reliance on its early literature, going back as far as the eighteenth century. Each new study of a species must be based on a thorough analysis of all previously published studies. Unlike The Jepson Manual (1993), the Flora contains specific references to individual specimens, flowering times, original descriptions, and other detailed information that could not fit into the field-portable Jepson Manual.

Inasmuch as many botanical locations originally described by Jepson have been disturbed or even obliterated (the Los Angeles basin, the dunes of San Francisco, etc.), any botanical information on now urbanized or disturbed sites becomes treasured; thus the original Flora's importance may be expected to increase over time. However, access to the existing volumes is limited because the volumes have been out of print for many years (a limited number is still available from the Jepson Herbarium). Past attempts to use commercial OCR systems on similar books have failed, presenting a rigorous challenge for current research.

The Jepson Herbarium used the digitized version of the older Flora as a template to be revised, updated, and completed for all species in California. The new Flora is now distributed online as the Jepson eFlora. This effort depended on access to the original Jepson Flora, which had not previously been available online either as image or text, and is difficult to find in print. The availability of the Flora in digital format with indexes provides a valuable resource for researchers of California plants around the world.

THE SCANNING PROCESS

The Bookscanner

Initial imaging of these volumes was completed using an experimental prototype scanner developed by XEROX PARC. The PARC Bookscanner is designed for use on rare and fragile books, such as A Flora of California, with minimal impact or damage. Traditional flatbed scanners can break the spine and cause delicate pages to tear from excessive operator handling.
The unique design of the Bookscanner carefully cradles the book, supporting the spine while both open pages are scanned simultaneously at high resolution. Image files in TIFF format are then automatically cataloged in sequential order in a digital archive. With this system, scanning rates of up to 280 pages an hour are possible.

Q & A about our Experience

Q: What was your actual throughput per hour?
A: Approximately 165 pages per hour. Initial scanning took 2 long days for 2200+ pages. After sanity checking, approximately 120 of the 2200+ page images had to be rescanned, mostly because of inadvertently cropped margins and other operator error.
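The figures in this answer can be checked with a little arithmetic. The sketch below is illustrative only: it takes 2200 as the page count (the text says "2200+"), assumes the two scanning days were of equal length, and uses the 280 pages per hour quoted above as the nominal maximum. The function and variable names are invented for this example.

```python
# Figures quoted in the text; 2200 is the lower bound of "2200+".
PAGES_SCANNED = 2200
ACTUAL_RATE = 165      # pages per hour, as reported
NOMINAL_RATE = 280     # pages per hour, stated maximum for the scanner
PAGES_RESCANNED = 120
SCANNING_DAYS = 2      # assumed to be of equal length


def scanning_summary(pages, actual_rate, nominal_rate, rescanned, days):
    """Derive hours of scanning and simple ratios from the reported figures."""
    hours = pages / actual_rate
    return {
        "scan_hours": hours,
        "hours_per_day": hours / days,
        "rescan_fraction": rescanned / pages,
        "fraction_of_nominal": actual_rate / nominal_rate,
    }


summary = scanning_summary(
    PAGES_SCANNED, ACTUAL_RATE, NOMINAL_RATE, PAGES_RESCANNED, SCANNING_DAYS
)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the initial pass works out to a little over 13 hours of scanning, with roughly 5 percent of pages rescanned, at a bit under 60 percent of the nominal maximum rate.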

Q: What resolution did you scan at and what format were the files stored in?
A: Images were acquired at 300 dpi as grayscale TIFF (the scanner's native format). They were converted to JPEG for web viewing.

Q: Did you find that the scanner gave you a quality image without damaging the book?
A: Yes. The Jepson Flora volumes were in generally good shape, and common enough that we were not worried if minor damage occurred. Throughput might be significantly reduced for a rarer, more fragile, or more brittle text that requires delicate handling. The cradle with drop-down scanner seemed to be a good design for valuable texts.

We did find some difference in exposure between left-side and right-side pages, which was never fully explained and which led to problems later when binarizing the images for OCR. One side would come out more exposed, which caused problems when trying to batch process all files at once. Tweaking of settings during scanning may correct this. Quality of the originals was generally good for our purposes, which were to show the full detail of the page (including scientific illustrations).

Ultimately, the left/right exposure difference prevented us from uniformly correcting the exposure for all pages. That, combined with a highly specialized botanical vocabulary, led to OCR error rates that were unacceptable for text recognition and indexing. Because we lack additional staff time to devote to this project, the electronic Jepson Flora is currently available as an image-only product.
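To illustrate why uneven exposure makes batch binarization hard, here is a minimal sketch in Python. It is not the procedure used in this project, and the text does not say which binarization method was tried. The sketch assumes pages are flat lists of 8-bit grayscale values, uses exaggerated synthetic pixel values, and swaps in Otsu's method (a standard technique that chooses a threshold per image) as one possible workaround. The names otsu_threshold and binarize are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t are treated as ink, pixels > t as paper.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of gray levels at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper) using the given threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


# Synthetic pages: 20 ink pixels and 80 paper pixels each (values are invented).
left_page = [40] * 20 + [140] * 80    # darker exposure
right_page = [150] * 20 + [240] * 80  # lighter exposure

# One batch-wide threshold tuned for the right page blacks out the left page.
batch_threshold = otsu_threshold(right_page)
left_with_batch = binarize(left_page, batch_threshold)

# A per-page threshold recovers the ink on both pages.
left_per_page = binarize(left_page, otsu_threshold(left_page))
right_per_page = binarize(right_page, otsu_threshold(right_page))

print("left page, batch threshold, ink pixels:", left_with_batch.count(0))
print("left page, own threshold, ink pixels:", left_per_page.count(0))
print("right page, own threshold, ink pixels:", right_per_page.count(0))
```

In this toy case no single threshold separates ink from paper on both pages, because the paper on the darker page is darker than the ink on the lighter page. Real scans would differ far less dramatically, and per-page thresholding alone would not address the specialized botanical vocabulary that also contributed to the OCR errors.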

BSCIT, University of California, Berkeley