This is version 1. It is not the current version, and thus it cannot be edited.

Project Overview

The Goal

Imagine if you could look up any person who ever lived in Canada and find an organized collection of all the public records related to that person. With one search you would find their entry in each census done during their lifetime, links to their family members, their vital statistics records such as birth, marriage and death records, immigration records, homesteading records, biographies, war service records, photographs, mentions in newspapers and books, journals, family stories and more. That's our goal.

In a single sentence, our aim is to provide access to public records organized by person. This will complement the currently prevalent organization by source, but will have many additional advantages.

One of the primary activities in family history research is collecting together the various records related to each person in a family tree. This generally involves visiting the sites or web sites of the many institutions that house the relevant records, making copies, and building up a collection of records, usually on a personal computer. Generally each genealogist repeats this process for each person in their tree, repeatedly duplicating work done by other genealogists. Some collaboration and sharing of work does go on but this is complicated by the fact that most genealogy programs are more oriented to storing conclusions about a person than storing a collection of evidence about them.

Evidence and Conclusions

Birth records or census entries that include a birth date provide evidence about a person's birth date. After examining one or more pieces of evidence, a genealogist will often come to a conclusion about the person's birth date and record that conclusion in their genealogy program. It is these conclusions that are most readily shared when genealogists collaborate. The GEDCOM file format does a reasonable job of providing a standardized format for representing conclusions. It does not, however, standardize the formatting of sources and associated data. Each genealogist has their own naming convention for sources, and there is no established standard for including transcriptions or digitized images of the actual data. The various formats of generated reports also generally cater to presenting conclusions rather than evidence. The situation is further complicated by the fact that the evidence is often contradictory: one census gives one date, another census gives another date, and a soldier's enlistment record may give a third. There are no standardized exchange mechanisms that make it easy to cleanly exchange evidence about a person or group of people. Because each genealogy program handles events and sources differently, and each user of each program uses it differently, there is no easy way to transfer large collections of evidence between two collaborating genealogists, let alone share them globally.
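The distinction between evidence and conclusions can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the `Evidence` class, the function names, and the sample dates are all invented here, and none of it describes an existing genealogy program or file format. It stores each transcribed claim alongside its source and reports where sources disagree, without picking a winner.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One claim made by one source; nothing here is a conclusion."""
    source: str  # illustrative label, e.g. "1901 census"
    fact: str    # e.g. "birth_date"
    value: str   # the value exactly as transcribed from the source


def claims_by_fact(evidence):
    """Group transcribed values by fact, remembering which sources gave each."""
    grouped = defaultdict(lambda: defaultdict(list))
    for item in evidence:
        grouped[item.fact][item.value].append(item.source)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {fact: dict(values) for fact, values in grouped.items()}


def conflicting_facts(evidence):
    """Return the facts for which the sources disagree."""
    return sorted(
        fact
        for fact, values in claims_by_fact(evidence).items()
        if len(values) > 1
    )


# Made-up sample data: three sources give three different birth dates.
evidence = [
    Evidence("1901 census", "birth_date", "12 Mar 1880"),
    Evidence("1911 census", "birth_date", "Mar 1881"),
    Evidence("enlistment record", "birth_date", "12 Mar 1882"),
    Evidence("1901 census", "birth_place", "Ontario"),
    Evidence("1911 census", "birth_place", "Ontario"),
]

print(conflicting_facts(evidence))
print(claims_by_fact(evidence)["birth_date"])
```

Because every value stays attached to its source, a collaborator receiving this data can check each transcription independently and draw their own conclusion.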

A Different Approach

With the advent of the World Wide Web and its associated technologies, it is now practical to take a different approach to collaboration: building shared repositories of evidence. Instead of citing a record, such as a census record, one can link to that record in the common repository. The next step is to build a common mechanism for grouping the records that pertain to a particular person, so that one can refer to this set of records as a whole. Rather than each genealogist independently constructing these groupings, each record can be added to a grouping once and be instantly available to everyone. This will allow people to build on the available evidence, adding new records rather than repeatedly reconstructing record sets already constructed by others.
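One way to picture such a shared grouping mechanism is as a set of record addresses per person, where linking any two records merges their sets. The sketch below is a minimal illustration, not a description of how any existing site works; the `RecordGroups` class and the example URLs are hypothetical.

```python
class RecordGroups:
    """Shared groupings of record addresses, one group per person."""

    def __init__(self):
        # Maps each record URL to the (shared) set of URLs in its group.
        self._group_of = {}

    def link(self, url_a, url_b):
        """Assert that two records refer to the same person."""
        group_a = self._group_of.setdefault(url_a, {url_a})
        group_b = self._group_of.setdefault(url_b, {url_b})
        if group_a is group_b:
            return  # already grouped together
        # Merge the smaller group into the larger one.
        if len(group_a) < len(group_b):
            group_a, group_b = group_b, group_a
        group_a |= group_b
        for url in group_b:
            self._group_of[url] = group_a

    def group(self, url):
        """Return every record grouped with the given one, itself included."""
        return sorted(self._group_of.get(url, {url}))


# Hypothetical record addresses, for illustration only.
groups = RecordGroups()
groups.link("https://example.org/census1901/123",
            "https://example.org/census1911/456")
groups.link("https://example.org/census1911/456",
            "https://example.net/births/789")

for url in groups.group("https://example.org/census1901/123"):
    print(url)
```

Each link is recorded once, and anyone who later looks up any one of the three records sees the whole group.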

There are two obvious ways to go about building such a global resource: the data can all be collected together in one place, or the data can remain at the custodial institutions with the addition of standardized methods of access. On the surface, it is sufficient that any record be accessible via a simple web address, as long as the addresses are reasonably permanent. There are advantages, however, to deeper access. The process of organizing data can be greatly facilitated if we can construct tools for sorting, searching and matching, and those tools are likely to be most effective if they can efficiently access the raw data. Again, this can be accomplished either by moving data to a site where the tools can be applied, or by distributing the tools so they can be used where the data is located.

The Virtues of Reciprocity

If there are two related resources where a user is likely to be interested in both, having a link between those resources is likely to increase access to both. When a user discovers one resource they are likely to follow the link to discover the second, and vice versa, resulting in increased access to both resources.

A centralized directory of records can list the records that relate to a given individual, providing links to those records. Individual records can also provide a link back to the list at the central repository. By linking both ways both sides of the equation benefit. The value of a collection is enhanced by the ability to discover additional resources associated with a record in the collection, and the ease of finding a record in a collection is improved, providing the opportunity for wider access. The link from a record in a collection to the centralized list of associated records provides an additional point of access to that list, and increases the opportunities for that list to be found and accessed. It is also possible to make the centralized list transparent, with each record providing links directly to the associated records.

Institutional, Commercial, and Volunteer Approaches

Building up the lists of associated records is likely to be extremely labour intensive. Most custodial institutions will have limited resources they can afford or justify contributing. Approaching the project from a commercial perspective requires a huge investment of labour, and therefore of capital. It may be difficult to justify such an investment if it doesn't have an accompanying plan for a return on investment. In fact, it is possible that such a project would cannibalize the value of a company's existing resources and undermine its business model. If a company charges by time rather than by number of records accessed, it may be counterproductive to provide quicker access, at least in the absence of competitive pressure. A commercial approach may also encounter resistance in obtaining access to records, and reciprocity breaks down somewhat if it costs money to follow a link in one direction and not in the other. Lastly, a commercial approach may have difficulty leveraging volunteer resources if the volunteers perceive that they are providing free labour for someone else's commercial benefit.

Relying on volunteers is a viable approach to accomplishing this task. There are a huge number of genealogists who stand to gain from the results of the work, and many of them are willing to contribute their labour for free if access to the result is free. Genealogists are basically doing the same work for free anyway, so they have little to lose and much to gain by working collaboratively.

A Model For Collaboration

There are many institutions with records resources that want to maximize use of those resources but are not willing to simply hand them over to another institution or project. It would therefore be useful to set up a framework where these resources can be used and linked to without requiring the custodial institution to give up all control. One straightforward approach would be to maintain a central database of associated records which could be accessed by each participating institution. Each institution could then incorporate links into their web sites so that when they display a record that has associated records they also display either links to the other associated records or a link to a central site that displays all the associated records.
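The two display options described above can be sketched in a few lines. This is a minimal illustration under loud assumptions: `CENTRAL_INDEX`, `CENTRAL_SITE`, the function names, and every URL are hypothetical, and a real central database would be queried over the network rather than held in a dictionary.

```python
from urllib.parse import quote

# Hypothetical central database: record address -> person-group identifier.
# Every URL and identifier below is invented for illustration.
CENTRAL_INDEX = {
    "https://archive-a.example/census1901/123": "p1",
    "https://archive-a.example/census1911/456": "p1",
    "https://archive-b.example/births/789": "p1",
    "https://archive-b.example/births/790": "p2",
}
CENTRAL_SITE = "https://central.example/person/"


def associated_records(record_url):
    """Return the other records in the same person group, if any."""
    group_id = CENTRAL_INDEX.get(record_url)
    if group_id is None:
        return []
    return sorted(
        url
        for url, other_id in CENTRAL_INDEX.items()
        if other_id == group_id and url != record_url
    )


def links_to_display(record_url, direct=True):
    """Links an institution's record page could show beside the record.

    direct=True lists the associated records themselves;
    direct=False gives a single link to the central list instead.
    """
    if direct:
        return associated_records(record_url)
    group_id = CENTRAL_INDEX.get(record_url)
    if group_id is None:
        return []
    return [CENTRAL_SITE + quote(group_id)]


print(links_to_display("https://archive-b.example/births/789"))
print(links_to_display("https://archive-b.example/births/789", direct=False))
```

In either mode the institution keeps hosting its own record and only the association data is shared, which is the point of the framework.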

Although the various census indexes on Automated Genealogy are all hosted on the one site, they still provide a good basis for illustrating how linking between sites could work, as the same basics apply. Each record in each census can be linked to other records, either on the site or off the site. One can follow links from a record in one census to a list of the associated records, and from there to any of those associated records. Some associated records are on different sites, but currently none of those sites provide links back to the associated records.

Similar But Different Projects

There are several web sites and projects aiming toward similar ends, although not necessarily specific to Canada and not approached in the same manner. The two most prevalent approaches are building on the existing work of genealogists through contributed GEDCOM files, and building large collections of records. The former is usually hobbled by the quality of the contributed data and by the fact that it works primarily with conclusions rather than records. The latter approach is a good starting point, but few of the companies involved have an articulated plan for moving beyond just collecting records.

There are several web sites that ask people to submit their GEDCOM files, after which the system attempts to perform some sort of matching and/or merging process. In some cases the site merely collects a large number of GEDCOM files and allows you to search all of them at once; in other cases there is an attempt to build a single "family tree" by merging all the data together. Either way, these projects attempt to leverage the work of many people, but face a fundamental challenge: a lot of the submitted data is simply wrong, and there is no easy way to tell which information is correct and which is not. All projects based on contributions from large numbers of people face this issue: some contributors will provide high quality data, and some contributors will provide low quality data. It is difficult, both from a technical viewpoint and from a volunteer or customer relationship viewpoint, to assess the quality of contributions. It is especially difficult when those contributions are in the form of conclusions instead of evidence. Suppose two sources give different birth dates. It is much more difficult to assess a conclusion about which, if either, of the records is correct than to assess whether the transcription of each source is accurate. To assess whether the two records even refer to the same person does require some judgment, but still much less judgment than deciding which record is in error about the birth date.

A contrasting approach to this issue is to concentrate on data that is relatively easy to verify. Working on transcribing data from publicly accessible digital images of the original data is a good example. With transcribed census data that is directly linked to digital images of the original census forms you can verify or dispute the correctness of the transcribed data with just the click of a link. Where handwriting or image quality makes data ambiguous, having associated records easily accessible can provide disambiguation. The more data that is readily accessible the more opportunities there are to improve quality. Census data is unusual in that it provides a snapshot of data about all the members of a family at a given time, and its coverage is very high, i.e. almost everyone appears in each census. If you have all the census data for all the members of a family you have an excellent start toward building a family tree.

Starting by collecting evidence, before moving on to drawing conclusions, can be viewed as laying solid foundations before constructing the house. Building a global family tree by merging existing data of dubious origin is like building a house and planning to work on the foundation later. It could even be likened to setting out to construct a skyscraper not only without laying a solid foundation, but while planning to build with scrap material.

There are several companies and organizations attempting to build vast collections of databases, albeit where each database is essentially independent, tied together only by a common search facility. This approach is certainly a good starting point but falls short of the end goal. Although in one sense a unified search of many databases implicitly links together data that matches the search criteria, a large number of "false positive" search results get returned. It usually requires human judgment to sort out which are true matches for the person in question, and that work has to be redone every time the search is performed. Companies with broad collections of data have a solid foundation to build on, but to date few have expressed any intention to build on it. Most such companies seem to be content to have their users endlessly duplicate one another's efforts, and indeed have a financial incentive to leave things that way. These companies have created a large barrier to entry and can afford to rest on their laurels until someone threatens their market position with a superior service. If a competitor can save users time and effort by organizing records by person, these companies will be forced to respond, to everyone's benefit.

The "Natural Monopoly" Dilemma

The simplest way to achieve the goal of reorganizing data to provide a person-centric view is to gather all the records together in one place. It is much easier to provide the infrastructure if you have complete access to all the data. The problem is that the data is not all kept in one place currently, and the current custodians have interests that cause them to be reluctant to give up what control they have. In many cases their funding relies to some extent on the measured utilization of their collections. If they give all their data away, even just copies of it, they stand to lose users, and to some extent they lose some claim to their importance.

"Winner takes all" has always been a dominant meme among internet ventures. Whoever gets there first, gets the most users, and becomes best known will tend to become ever more dominant. Where sites are built on user generated content, the sites with the most users will have the most content and will therefore attract the majority of new users, so the gap between them and their competitors gets wider and wider. Competitors are then forced to establish niches or perish.

Because many important records are still under copyright, custodial institutions are able to limit the redistribution of their records, and it is difficult for anyone to collect them all under one roof. This is especially true in Canada in comparison to the United States, because US government records are generally freely available to the public without copyright restrictions once they are released, while the Canadian government maintains Crown copyright for 50 years after documents are made public.

There is, therefore, a need to find a way for all parties to use records without undermining the custodial institutions. Fortunately, web technologies provide exactly the tools needed to accomplish this, especially if custodial institutions and the institutions that fund them are willing to make small adjustments in how they measure utilization. Sometimes the number of hits on a particular web site is not the best metric of success.

As an alternative to collecting all the data on one site and having users access the records there, users could access records through links to the custodial institution. If access is made easier traffic may actually increase. Indeed, genealogy became more popular when ease of access to data was increased via the internet. If data is widely available there is a greater likelihood of innovation in providing access. For instance, one or more players may provide a person-centered view of the data instead of just another collection-oriented view.

A pathological exception should be noted. If you make records difficult to find, for example by providing inadequate search facilities or by insisting that users search databases individually, the user may have to perform several searches before finding the desired record. Indeed, more searches are likely to be performed when looking for a record that doesn't exist than for one that does. This is one of the reasons that browsing is sometimes superior to searching. If you can find all the records associated with an individual with one search, the total number of hits on web sites will be lower than if you have to perform multiple searches on many web sites. There is therefore a disincentive to provide the best possible service if it results in the user quickly finding what they need and leaving, at least if you measure success by traffic and duration of subscription rather than by how quickly the desired record is found. If users are paying per item, the incentive is to deliver items as quickly as possible; if they are paying by the minute, hour, day, month, or year, there is an incentive to keep them paying as long as possible. Again, however, a lower barrier to entry, both in money and effort, can lead to increased utilization, and this is exactly the stated mission of many institutions acting as records custodians.

Future edits

Swapping of data

Standards for linking


This particular version was published on 13:44 22-Sep-2007 by Lindsay Patten.
Automated Genealogy Wiki