Enabling the Distributed Family Tree

This is the official research blog for the Distributed Family Tree, an open network of genealogical data and metadata.  In a nutshell, the big idea is that we can combine all available genealogical information on the Internet into a single distributed network.  The foundation for this network is the substance of the Master's Thesis that I am currently working on.

Data Model Extensibility, Part 5

This is the last of a five-part series on the DFT data model. Part one covered the fundamentals: RDF, OWL, and Named Graphs. Part two demonstrated how basic genealogical information can be recorded in RDF. Part three showed how to record information that changed over the lifetime of an individual, such as a surname. Part four showed how to cite sources.

Why Contradictory Data?

When you work with your own genealogy on your computer, you generally have everything just right. Maybe the birthdate on your great-great-grandfather’s death certificate is different from the one on his birth certificate, but you know which one is correct, and that’s the one you enter. When you import someone else’s GEDCOM, there may be contradictory differences between their data and your own, but you resolve them through the merge process. In the end, you again have clean data.

This works fine when you’re doing your own genealogy on your own computer, but when you want to collaborate globally, this way of doing things simply won’t cut it. Instead we need to accommodate all data, even contradictory data, yet still allow people to see “clean” data. This is done using mediated views. As previously described, software uses the current user’s confidence ratings to determine what data to display. Other users can consult these confidence ratings as well, to help inform their own decisions.
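To make the idea of a mediated view a little more concrete, here is a minimal sketch in Python. It is not how Genesis does it; the claim ids, the user names, the dictionary layout, the mediated_view function, and the two birthdates are all invented for illustration. The only point is that both contradictory claims stay in the data, and each user’s own ratings decide which one that user sees.

```python
# Hypothetical sketch: none of these names or values come from the DFT or Genesis.

# Every claim is kept, even when claims contradict each other.
# The birthdates below are made-up placeholder values.
claims = [
    {"id": "claim1", "subject": "#ggf", "property": "birthdate", "value": "1841-03-02"},
    {"id": "claim2", "subject": "#ggf", "property": "birthdate", "value": "1842-03-02"},
]

# Each user rates claims independently (0.0 = no confidence, 1.0 = certain).
ratings = {
    "alice": {"claim1": 1.0, "claim2": 0.0},
    "bob": {"claim1": 0.2, "claim2": 0.9},
}


def mediated_view(user, subject, prop):
    """Return the value this user trusts most, or None if they rated nothing."""
    user_ratings = ratings.get(user, {})
    candidates = [
        c for c in claims
        if c["subject"] == subject
        and c["property"] == prop
        and c["id"] in user_ratings
    ]
    if not candidates:
        return None
    # Pick the claim with the highest confidence for this particular user.
    best = max(candidates, key=lambda c: user_ratings[c["id"]])
    return best["value"]


print(mediated_view("alice", "#ggf", "birthdate"))
print(mediated_view("bob", "#ggf", "birthdate"))
```

Alice and Bob each get their own “clean” answer from the same underlying data, and neither one had to delete the claim they disagree with.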

Confidence Ratings

Here we’ll consider how to actually record confidence ratings. Let’s look at the following data on Rodger:

#rodger  rdf:type  gc:Individual
#rodger  gc:name   "Rodger Kroeber"
#rodger  gc:name   "Rodger Grover"
#rodger  gc:name   "Roger Crower"

Rodger’s name was originally spelled “Rodger Kroeber,” but both his first and last names were butchered over the years. Because we have a copy of his original birth certificate, which attests to what we believe is the correct spelling, we put the names in two graphs and then rate them:

#correct  {
 #rodger  gc:name   "Rodger Kroeber"
}

#incorrect {
 #rodger  gc:name   "Rodger Grover"
 #rodger  gc:name   "Roger Crower"
}

#correct   gt:rated    #rate1
#rate1      rdf:type    gt:Rating
#rate1      gt:surety   1.0
#rate1      gt:comment  "I have a copy of the original birth certificate which attests to this spelling."
#incorrect  gt:rated    #rate2
#rate2      rdf:type    gt:Rating
#rate2      gt:surety   0.0

The first rating indicates that we are 100% sure that the information in the #correct graph is correct, and it carries a human-readable comment explaining why. The second rating indicates that we have no confidence whatsoever in the information in the #incorrect graph. Notice the use of the gt: prefix which, like the gc: and gp: prefixes, is purely suggestive.
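For readers who think better in code, the graphs and ratings above can be sketched in plain Python. This is only an illustration under a few assumptions: the Quad type, the graph_surety and trusted_names functions, and the 0.5 default threshold are hypothetical and are not part of the DFT data model or of any RDF library. The statements themselves are copied from the example.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    """A triple plus the named graph it belongs to (hypothetical helper type)."""
    graph: str
    subject: str
    predicate: str
    obj: object


# The name statements, each placed in its named graph.
quads = [
    Quad("#correct", "#rodger", "gc:name", "Rodger Kroeber"),
    Quad("#incorrect", "#rodger", "gc:name", "Rodger Grover"),
    Quad("#incorrect", "#rodger", "gc:name", "Roger Crower"),
]

# The ratings are ordinary triples whose subjects are the graphs themselves.
rating_triples = [
    ("#correct", "gt:rated", "#rate1"),
    ("#rate1", "rdf:type", "gt:Rating"),
    ("#rate1", "gt:surety", 1.0),
    ("#rate1", "gt:comment",
     "I have a copy of the original birth certificate which attests to this spelling."),
    ("#incorrect", "gt:rated", "#rate2"),
    ("#rate2", "rdf:type", "gt:Rating"),
    ("#rate2", "gt:surety", 0.0),
]


def graph_surety(graph):
    """Follow gt:rated to the rating, then return its gt:surety (None if unrated)."""
    for s, p, rating in rating_triples:
        if s == graph and p == "gt:rated":
            for s2, p2, value in rating_triples:
                if s2 == rating and p2 == "gt:surety":
                    return value
    return None


def trusted_names(subject, threshold=0.5):
    """Names whose graph is rated at or above the (assumed) threshold."""
    return [
        q.obj for q in quads
        if q.subject == subject
        and q.predicate == "gc:name"
        # Unrated graphs are treated as surety 0.0 in this sketch.
        and (graph_surety(q.graph) or 0.0) >= threshold
    ]


print(trusted_names("#rodger"))
```

With the default threshold only the name from the #correct graph is returned. Lowering the threshold to 0.0 brings the other two spellings back, which shows that nothing was thrown away.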

Conclusion

There’s actually a lot more that could be said about the DFT data model. For example, you can link to Web sites, digital photos, or scanned documents in the provenance model; carry on a conversation about confidence disputes; trust genealogy “authorities”; or add any number of other potential extensions. In this week’s series of posts I just wanted to make the basic concepts a little more concrete. As these features are implemented in Genesis I’ll describe them in greater depth.
