Enabling the Distributed Family Tree

This is the official research blog for the Distributed Family Tree, an open network of genealogical data and metadata. In a nutshell, the big idea is that we can combine all available genealogical information on the Internet into a single distributed network. The foundation for this network is the subject of the master's thesis that I am currently working on.

Data Model Extensibility, Part 2

This is the second of a five-part series on the DFT data model. Part one covered the fundamentals: RDF, OWL, and Named Graphs.

From GEDCOM to the DFT

GEDCOM, the de facto standard for exchanging genealogical data between computer programs, uses two main record types: individuals and families. An individual record represents a person. A family record represents the marriage of a husband and wife, along with their children. Real data often cannot quite be mapped into this model, so GEDCOM supports a notes record that genealogists can use to make clarifying remarks. Individual genealogists also often devise their own systems for recording incongruous information consistently.

This specification has worked well enough in practice, but it does not permit the degree of precision necessary to support automated processes. A major thrust of this project is to get genealogical data into a machine-understandable format so that software agents can “perform research and other labor-intensive tasks on behalf of their human masters.” With this in mind, one of the ways that the DFT data model differs from GEDCOM is in how families are represented. Instead of explicitly representing families, relationships between individuals are represented through events. The family structure can be easily reconstructed from the interplay between individuals and events, without loss of information.

Individuals

An individual is simply a resource of the appropriate type. Recall that in RDF a resource is identified by a unique URI. For example, we can represent an individual with the URI #brandon like so:

#brandon  rdf:type  gc:Individual

This statement (given in N3 notation, which is much easier to read and write than XML) should be read “#brandon is an individual.” The rdf: and gc: prefixes are shorthand for the RDF and Genealogy Core (GC) ontologies (any use of the gc: prefix is purely suggestive because the GC ontology hasn’t been concretely defined yet).

The URI #brandon is an identifier, but not a name in the traditional sense. We can give #brandon a name and gender as follows:

#brandon  gc:name  "Brandon Gilbert"
#brandon  gc:sex   "M"
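To make the triple structure concrete, here is a minimal sketch in Python that holds the three statements above as (subject, predicate, object) tuples and looks values up by subject and predicate. The `objects` helper is a name made up for this illustration; it is not part of RDF or of any library, and a real application would more likely use an RDF toolkit.

```python
# Each statement is a (subject, predicate, object) tuple.
# The gc: terms are suggestive only, since the GC ontology is not yet defined.
triples = [
    ("#brandon", "rdf:type", "gc:Individual"),
    ("#brandon", "gc:name", "Brandon Gilbert"),
    ("#brandon", "gc:sex", "M"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(triples, "#brandon", "gc:name"))
```

The helper returns a list rather than a single value because nothing in RDF prevents a resource from having several values for the same property.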

Other attributes are denoted similarly. Notice that computer software has no way of knowing that the first name is Brandon and the last name is Gilbert; it can only assume this. Genealogists often enclose the surname in forward slashes (as in “Brandon /Gilbert/”), but with non-English names this can get very tricky. The whole problem can be resolved by making everything explicit with a “name parts” extension to the data model. For example, Brandon’s name can be given as a list:

#brandon  gc:name    #n1
#n1       rdf:type   rdf:List
#n1       rdf:first  "Brandon"
#n1       rdf:rest   #n2
#n2       rdf:type   rdf:List
#n2       rdf:first  "Gilbert"
#n2       rdf:rest   rdf:nil

Each name part can then be tagged as a “given name” or “surname.” While this may not appear too helpful for English names, it is very useful in Spanish, Danish, and other naming systems.
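An RDF list is a chain of nodes linked by rdf:first and rdf:rest and terminated by rdf:nil. As a rough sketch of how software could read the name parts back in order, the Python below walks that chain over plain tuples. The helper names `one` and `list_items` are assumptions made for this example.

```python
# The name-parts list for #brandon, as (subject, predicate, object) tuples.
triples = [
    ("#brandon", "gc:name", "#n1"),
    ("#n1", "rdf:type", "rdf:List"),
    ("#n1", "rdf:first", "Brandon"),
    ("#n1", "rdf:rest", "#n2"),
    ("#n2", "rdf:type", "rdf:List"),
    ("#n2", "rdf:first", "Gilbert"),
    ("#n2", "rdf:rest", "rdf:nil"),
]


def one(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def list_items(triples, head):
    """Follow rdf:first / rdf:rest links from head until rdf:nil."""
    items = []
    seen = set()  # guards against a malformed, cyclic list
    node = head
    while node is not None and node != "rdf:nil" and node not in seen:
        seen.add(node)
        items.append(one(triples, node, "rdf:first"))
        node = one(triples, node, "rdf:rest")
    return items


name_parts = list_items(triples, one(triples, "#brandon", "gc:name"))
print(name_parts)
```

Because each list node (#n1, #n2) is itself a resource, further statements can be attached to it, which is what makes tagging a part as a given name or surname possible.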

The data model can also be extended with additional attributes. For example, it may be useful to note that Brandon was a mechanic. Assuming the GC does not provide for occupations, we can create our own ontology which does and then add occupational information:

#brandon   ex:occupation   "Mechanic"

Here ex: is the prefix for our “extension” ontology.
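A prefix is only shorthand for a namespace URI, so terms from different ontologies cannot collide once expanded. The sketch below shows the expansion in Python. The rdf: namespace is the standard one; the gc: and ex: URIs are placeholders invented for this example, since neither ontology has a published address here.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",  # standard RDF namespace
    "gc": "http://example.org/gc#",   # placeholder, not a real GC namespace
    "ex": "http://example.org/ext#",  # placeholder extension namespace
}


def expand(term):
    """Expand prefix:local to a full URI; return other terms unchanged."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term


print(expand("ex:occupation"))
print(expand("rdf:type"))
```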

Events

An event is an appropriately typed resource identified by a unique URI.  There are many different kinds of events. For example, here is a birth event identified as #birth:

#birth   rdf:type gc:Birth

Events tend to happen at a specific time and in a specific place. This information can be recorded like so:

#birth   gc:date   "March 7, 1878"
#birth   gc:place  "Boston, MA"

This happens to be when and where Brandon was born, so we’ll associate the event with him:

#brandon   gc:born  #birth
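Since the date and place belong to the event rather than to the person, software has to follow the gc:born link to find them. A minimal Python sketch of that lookup follows; `describe_birth` is a hypothetical helper written for this example.

```python
triples = [
    ("#birth", "rdf:type", "gc:Birth"),
    ("#birth", "gc:date", "March 7, 1878"),
    ("#birth", "gc:place", "Boston, MA"),
    ("#brandon", "gc:born", "#birth"),
]


def describe_birth(triples, person):
    """Follow gc:born from person to the event and collect its properties.

    Values are kept in lists because a property may be stated more than once.
    Returns None when the person has no gc:born statement.
    """
    for s, p, event in triples:
        if s == person and p == "gc:born":
            properties = {}
            for es, ep, eo in triples:
                if es == event:
                    properties.setdefault(ep, []).append(eo)
            return properties
    return None


birth = describe_birth(triples, "#brandon")
print(birth["gc:date"], birth["gc:place"])
```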

Events, like individuals, can have additional attributes that are not specified in the GC. For example, an enterprising genealogist may want to record what the weather was like at a given event. As with occupation above, an extension ontology can be written to provide for this.

We happen to know a little about Brandon’s parents, Marcus and Sally. As they both had something to do with Brandon’s birth, we can associate them with the event as well:

#marcus   gc:fathered   #birth
#sally    gc:gaveBirth  #birth

Note that this information is very precise.  It says that Marcus fathered Brandon, and that Sally gave birth to him. It implies nothing about their relationship. We can, however, show that they are in fact married:

#marriage  rdf:type    gc:Marriage
#marcus    gc:married  #marriage
#sally     gc:married  #marriage

The enterprising genealogist might associate the minister who performed the marriage as well (perhaps with ex:performed).
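The earlier claim that family structure can be reconstructed from individuals and events can be sketched in a few lines of Python. The code below derives Brandon's parents from the birth event and Marcus's spouse from the marriage event. The function names are invented for this illustration, and the predicates are the suggestive gc: terms used above.

```python
triples = [
    ("#brandon", "gc:born", "#birth"),
    ("#marcus", "gc:fathered", "#birth"),
    ("#sally", "gc:gaveBirth", "#birth"),
    ("#marcus", "gc:married", "#marriage"),
    ("#sally", "gc:married", "#marriage"),
]


def subjects(triples, predicate, obj):
    """Return, sorted, every subject that has the given predicate and object."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def parents_of(triples, child):
    """Find the child's birth event(s), then who fathered or gave birth there."""
    parents = {"father": [], "mother": []}
    for s, p, birth in triples:
        if s == child and p == "gc:born":
            parents["father"] += subjects(triples, "gc:fathered", birth)
            parents["mother"] += subjects(triples, "gc:gaveBirth", birth)
    return parents


def spouses_of(triples, person):
    """Return everyone else attached to a marriage event the person is in."""
    found = set()
    for s, p, marriage in triples:
        if s == person and p == "gc:married":
            for other in subjects(triples, "gc:married", marriage):
                if other != person:
                    found.add(other)
    return sorted(found)


print(parents_of(triples, "#brandon"))
print(spouses_of(triples, "#marcus"))
```

Parentage and marriage come from separate events, so the sketch can report one without implying the other, which matches the precision described above.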

As with names, dates and places can also benefit from being more explicit. For example, it is not uncommon to see a date given in a record as “January 2, 1718/19.” No, there isn’t any uncertainty in the data; it’s simply an occurrence of a double date: January 2, 1718 according to the Julian calendar, and January 2, 1719 according to the Gregorian. So how to record it?

The data model can be extended to support explicit dates in different calendar systems. For example, a death event that occurred on “January 2, 1718/19” could be recorded as:

#death  rdf:type  gc:Death
#death  gc:date   "January 2, 1718/19"
#death  gc:date   #d1
#d1     rdf:type  ex:Julian
#d1     ex:day    2
#d1     ex:month  1
#d1     ex:year   1718
#death  gc:date   #d2
#d2     rdf:type  ex:Gregorian
#d2     ex:day    2
#d2     ex:month  1
#d2     ex:year   1719

Notice that #death has three gc:date properties. It is possible to restrict the cardinality of properties using an ontology, but I have chosen not to do this in order to accommodate both complementary and contradictory data. We’ll see more on this in part five.
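With no cardinality restriction, a consumer of the data has to expect several date values and decide what to do with each. The Python sketch below gathers every gc:date for an event, keeping the literal as a string and expanding each structured date into a dictionary. It assumes, purely for this example, that a value is a structured date whenever it also appears as the subject of other statements; `dates_of` is a hypothetical helper.

```python
triples = [
    ("#death", "rdf:type", "gc:Death"),
    ("#death", "gc:date", "January 2, 1718/19"),
    ("#death", "gc:date", "#d1"),
    ("#d1", "rdf:type", "ex:Julian"),
    ("#d1", "ex:day", 2),
    ("#d1", "ex:month", 1),
    ("#d1", "ex:year", 1718),
    ("#death", "gc:date", "#d2"),
    ("#d2", "rdf:type", "ex:Gregorian"),
    ("#d2", "ex:day", 2),
    ("#d2", "ex:month", 1),
    ("#d2", "ex:year", 1719),
]


def dates_of(triples, event):
    """Return all gc:date values for an event, in the order they were stated."""
    known_subjects = {s for s, _, _ in triples}
    dates = []
    for s, p, o in triples:
        if s == event and p == "gc:date":
            if o in known_subjects:
                # Assumption: a value with its own statements is a structured date.
                dates.append({dp: do for ds, dp, do in triples if ds == o})
            else:
                dates.append(o)  # plain literal, kept as written in the record
    return dates


for date in dates_of(triples, "#death"):
    print(date)
```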

Next Time

In part three I’ll show how to record information that changed over the lifetime of an individual (such as surname).
