Enabling the Distributed Family Tree

This is the official research blog for the Distributed Family Tree, an open network of genealogical data and metadata.  In a nutshell, the big idea is that we can combine all available genealogical information on the Internet into a single distributed network.  The foundation for this network is the substance of the Master's Thesis that I am currently working on.

Storage and Querying of E-Commerce Data

I just finished reading Storage and Querying of E-Commerce Data by Rakesh Agrawal, Amit Somani, and Yirong Xu of the IBM Almaden Research Center in San Jose. It seems like all the cool stuff happens there. Anyhow, the paper considers alternate implementations and performance results for a vertical representation of sparse data. The traditional way of storing data in a table is what the authors call the horizontal representation: a row for each entity and a column for each attribute of that entity.  The data is considered sparse when only a handful of the attributes apply to any given entity (the rest of the attributes are NULL). A vertical representation, according to the authors, is a table with three columns: object identifier, attribute identifier, and attribute value.

Sounds familiar? I thought so too. I was especially excited to read about their analysis of why a vertical representation is desirable. The benefit that really stood out to me was rapid schema evolution: the ability to frequently add new attributes without incurring significant costs. This is key to the extensibility of the DFT data model.

The experimental results are the most interesting part, though. At first blush they are very promising: using a vertical representation for sparse data vastly outperforms the horizontal representation (though it does incur a significantly higher index cost). It’s important to consider what this means, however. Most of the experiments consider entities with 1,000 attributes, 5-10% of which have values for any given entity (the rest of the attributes are NULL). Will this really be the case with genealogical data?

I don’t have any hard numbers, but I suspect that genealogical data in the DFT will have some significant differences. I don’t think it’s too much of a stretch to imagine as many as 1,000 possible attributes (the GEDCOM 5.5 standard specifies around 130, and developers often extend it with many more). But it’s important to realize that only a dozen or so of them will be used very often. Name, sex, and events such as birth, marriage, and death are easily the most common; I expect that other attributes quickly drop off in frequency according to a power law distribution.

This suggests to me that a hybrid approach might work better. For example, Chimezie Ogbuji advocated using both a triples table and a types table (actually two triples tables, one for assertions and one for quotations, but I really don’t like RDF reification so I pretend that it doesn’t exist). The triples table functions as a vertical representation, while the types table functions as a horizontal representation of one attribute. This was done because rdf:type predicates occur so much more frequently than other predicates. With the DFT, it might make sense to have an individuals table with a handful of common attributes such as name and sex, and then a triples table for the many less-common attributes.

Producing such an implementation is outside the scope of my current project, though. My interest in performance is entirely pragmatic, and the current implementation (a single triples table… well, actually a quads table, to track named graphs) is working well enough so far.

No comments yet

Leave a Reply

cheap generic kamagra kamagra uk viagra, Viagra Buy generic viagra levitra and cialis pills