Automated Genealogy
1901 Census Indexing Project

This is only a proposal and is subject to change.

Revised on November 5th, 2002

Introduction

The ultimate objective of the project is a complete, high-quality transcription of the 1901 Census of Canada that is accessible and searchable for free via the web. Since that would be a very large undertaking, likely taking many years to accomplish, we have adopted a staged approach.

The first stage of the project is to produce a searchable index to the census data. Images of the completed census forms are freely available on the National Archives web site. Unfortunately, without an index, locating people in the images is time-consuming, and nearly impractical if the researcher doesn't know an approximate location. An index will allow researchers to search for individuals and then go quickly to the image of the page on which they appear.

For many purposes an index that points to the original images is actually superior to a transcription, because the researcher is merely using the index to reach the original, primary data rather than relying on a secondary, transcribed version, which inevitably contains errors, omissions, or interpretations. In the past, most transcription work was conducted with material that was only available on microfilm, which made access to the primary source extremely inconvenient. Having a transcription in paper format was, in relative terms, extremely convenient. Because the goal was to minimize the need to go back to the microfilm to check for errors, the quality of the transcription was extremely important. Quality remains important in an online index, but the way an online index is used results in a different set of trade-offs. For example, since the user of an index to online images will presumably consult the original image, it may be appropriate to indicate where an entry is illegible or to provide multiple interpretations, whereas a paper transcription must place more emphasis on guessing a value. An online index is also much more amenable to frequent updates and to the addition of shared annotations by its users.

It is likely that there will continue to be a demand for complete transcriptions. In fact, this may provide a natural division between revenue-generating transcription efforts by genealogical societies and commercial entities on the one hand and an effort to produce a freely available online index on the other. A genealogical society might, for instance, contribute a subset of its data to an indexing project while publishing a full transcription in order to generate revenue.

An important change to understand is that the availability of the images on the web means that anyone with a browser and an internet connection can participate in the transcription process. This allows the load to be distributed over a much larger number of people than is currently typical in transcription projects based on microfilms. People can contribute as little or as much as they like, and local efforts can produce local transcriptions while participating in a unified nationwide effort to produce a single compatible database.

The Web Site

A web site has been constructed to illustrate the feasibility of online transcription; it is fully functional and in use. The web site allows the user to view the image of the census form and a web form for entering the transcribed data at the same time. Transcription is performed one line at a time, with each line stored in the database as it is entered. Using web forms makes data input simpler and more reliable. A field that is restricted to a few possible values can be filled in by selecting the value rather than typing it, thereby reducing input errors (selecting values does not require using the mouse or cursor keys). Validity checking can be used to further reduce input errors.
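As an illustration of the kind of validity checking described above, the following is a minimal sketch only and is not taken from the site's software. The field names, the set of codes for the restricted-value field, and the limits of 50 lines per page and 120 years of age are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical restricted-value field; the site's actual fields and codes
# are not listed in this proposal.
SEX_CODES = {"M", "F"}

# Assumed limits, used only for this example.
MAX_LINES_PER_PAGE = 50
MAX_AGE = 120


@dataclass
class CensusLine:
    line_number: int   # position of the line on the census page
    surname: str
    given_name: str
    sex: str
    age: str           # kept as text so an illegible age can be left blank


def validate_line(line: CensusLine) -> list:
    """Return a list of problems; an empty list means the line can be stored."""
    problems = []
    if not 1 <= line.line_number <= MAX_LINES_PER_PAGE:
        problems.append("line_number is outside the page")
    if not line.surname.strip():
        problems.append("surname is required")
    if line.sex not in SEX_CODES:
        problems.append("sex must be one of " + ", ".join(sorted(SEX_CODES)))
    if line.age:
        # Accept only plain ASCII digits so that int() cannot fail.
        is_number = line.age.isascii() and line.age.isdigit()
        if not is_number or int(line.age) > MAX_AGE:
            problems.append("age must be a whole number no greater than 120")
    return problems
```

A line would be stored only when validate_line returns an empty list; otherwise the problems could be shown next to the form so the transcriber can correct them before moving to the next line.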

The web site is still undergoing enhancement but is fully functional.

Intellectual Property Rights Proposal

Automated Genealogy proposes to provide the software and to host the transcription process on its site, and to turn the resulting data over to the National Archives. All data will be owned and copyrighted by the Archives, which will control access and distribution. In exchange for the data, the Archives will provide free access and search facilities on its web site. In exchange for providing the software and hosting services, Automated Genealogy will receive the right to maintain and use a copy of the data. In exchange for participating in the transcription process, the public will gain free access to a searchable index of the complete census. Records will be kept of who participated in the transcription process so that volunteers receive the credit they deserve. Everyone wins!

Issues

Offline Access

Many people use their phone lines to access the internet and cannot stay online while transcribing data. Since access to the images requires one to be online, this is a problem. Possible solutions are to provide a way of storing the images locally or of printing hard copies. We could also provide standard spreadsheet templates that would produce data in a compatible format for uploading. In the worst case we would need to rely on volunteers who can work online or who have online access through libraries or other facilities.

Intellectual Property

Intellectual property rights are a sensitive issue. Some people will refuse to work on any project where anyone receives any commercial benefit. Many people will be satisfied with an arrangement as long as the data is available for free. Free access should ensure that any commercial use has to provide added value in order to be viable. One option is to require commercial users to give notice that the data is freely available and to say where it can be found. This solution has been widely used in the free/open source software world.

Existing Data

A lot of transcription work has already been performed, but it is scattered over many locations and access is not free. Where access is not free, permission is not provided, or the data is not in a usable format, work will have to be duplicated in order to provide a complete and free index. Where data can be imported, we should do so. Some tools have been developed that allow data to be imported from a spreadsheet, and we have successfully imported some data in this fashion.
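To give a sense of what a spreadsheet import involves, here is a minimal sketch, assuming the spreadsheet has been saved as CSV text with a header row. It is not the project's import tool, and the column names (page, line, surname, given_name) are hypothetical stand-ins for whatever columns a real template would define.

```python
import csv
import io

# Hypothetical column names; a real template would use the index's own columns.
REQUIRED_COLUMNS = ["page", "line", "surname", "given_name"]


def read_spreadsheet_rows(csv_text):
    """Parse CSV text exported from a spreadsheet.

    Returns (rows, rejected): rows is a list of cleaned dictionaries ready
    for import, and rejected lists the line numbers in the CSV text of rows
    that failed the basic checks and need a person to look at them.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))

    rows, rejected = [], []
    for row in reader:
        # Short rows yield None for absent cells, so fall back to "".
        cleaned = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        page_ok = cleaned["page"].isdecimal()
        line_ok = cleaned["line"].isdecimal()
        if page_ok and line_ok and cleaned["surname"]:
            rows.append(cleaned)
        else:
            rejected.append(reader.line_num)
    return rows, rejected
```

Rejected rows are reported rather than silently dropped so that existing work is not lost; a volunteer could fix them in the spreadsheet and upload the file again.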

Field Selection

The following columns were chosen for the index:

This set of columns can be viewed at reasonable magnification in a reasonably sized image, and all of these columns are visible without scrolling on a 1024-pixel-wide screen. The software has been designed to allow a full transcription, or some intermediate set of fields, to be built on top of the index, either after the index is complete or while it is under construction.

Quality Control

The number of people who are willing to proofread data is much smaller than the number willing to transcribe, so there will be a time gap between transcription and review. Given that the index points to the original images, making unproofread data available is still preferable to waiting, potentially for years, for the data to be rigorously reviewed. The site presently incorporates features that allow users to record errors they find along with suggested corrections, and these are presented to the original transcriber and/or a corrections review group for consideration. In the near future, options will be added to make suggested corrections and alternative transcriptions viewable by users.
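The correction workflow described above can be pictured with a small sketch. This is an assumption about how such records might be structured, not a description of the site's database; the class names, field names, and status values are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    field_name: str
    suggested_value: str
    submitted_by: str
    status: str = "pending"   # "pending", "accepted" or "rejected"


@dataclass
class IndexEntry:
    values: dict                                  # the indexed fields
    corrections: list = field(default_factory=list)

    def suggest(self, field_name, suggested_value, submitted_by):
        """Record a user's suggested correction without changing the entry."""
        correction = Correction(field_name, suggested_value, submitted_by)
        self.corrections.append(correction)
        return correction

    def review(self, correction, accept):
        """Apply or dismiss a suggestion after the transcriber or a reviewer decides."""
        if accept:
            self.values[correction.field_name] = correction.suggested_value
            correction.status = "accepted"
        else:
            correction.status = "rejected"

    def alternatives(self, field_name):
        """Pending suggestions for a field, for display beside the indexed value."""
        return [
            c.suggested_value
            for c in self.corrections
            if c.field_name == field_name and c.status == "pending"
        ]
```

Keeping suggestions separate from the indexed values until they are reviewed means that unreviewed data stays usable, while alternative readings can still be shown to anyone who consults the entry.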