
Back to Story List: November 2000
Digitization: What Does it Mean?

By Scott Ramsey

The Cavalier Daily Alumni Association has recently committed to a project of digitizing the print archives of The Cavalier Daily and College Topics. But just what does this mean? In short, it means creating a new archive in a digital format that can be stored, duplicated, distributed and viewed using personal computers and the Internet. There are two options:

1.) A simple archive can be browsed by date. This is done by digitally "photographing" all the newspapers in the archive.

2.) A more complex archive is searchable based on names or words that appear in headlines, bylines or the stories themselves. To do this, the information on the pages must be converted to a text format, i.e. digitally "reading" all the newspapers in the archive.

How is it done?

The CDAA investigated this approach and, while it is technologically feasible right now, it is so labor-intensive that the cost is too high.

An alternate method is to convert the pages in the bound volumes or on a set of microforms into scanned digital images, essentially taking a photograph of the entire newspaper page including the stories, headlines and graphic elements. The scanning process can be largely automated, especially if we start from microform reels. This will produce an enormous archive of digital photographs that can be catalogued and displayed on a computer screen, as well as stored on compact discs.

Before we scan, we will need to ensure that the existing microforms created by Alderman Library are all of good quality. Any issues that were poorly photographed in the past need to be re-transferred to microform.

To create a searchable archive, Optical Character Recognition (OCR) software can then be used to "read" the stories contained in the digital images. This will be a challenging task, as the individual elements for each story must be grouped together (such as linking a story to its jump and graphic elements).
In addition, older OCR packages have difficulty with unusual fonts and poor contrast between the page and text. However, the quality of OCR software is improving dramatically every year as it becomes more sophisticated. The CDAA envisions completing this step with a future OCR package, though some demonstrations of the potential can be performed using today's software.

How much will this process cost?

Based on current estimates, a complete browsable archive may cost $50,000 to generate, though that figure could be off by a factor of two or more based on what we learn. The cost of converting a browsable archive to a searchable one depends strongly on the quality of the OCR package, and should fall as better packages are developed. This project can be tackled incrementally, with the costs spread out over a long period of time.

How will the project be financed?

(Scott Ramsey is a former CD operations manager and business manager. He is heading the research into preserving the CD archives. If you have any information on the subject or would like to volunteer to help out, please contact him at 703-858-4009 (w) or skramsey@compuserve.com.)
Contact support@CDalumni.org with questions or problems.
©Cavalier Daily Alumni Association