Enabling the Distributed Family Tree

This is the official research blog for the Distributed Family Tree, an open network of genealogical data and metadata.  In a nutshell, the big idea is that we can combine all available genealogical information on the Internet into a single distributed network.  The foundation for this network is the substance of the Master's Thesis that I am currently working on.

Still Replumbing + New Search

So far I’ve been largely successful in replacing my own RDF subsystem with the Jena, ARQ, SDB, and NG4J libraries.  Importing the PGV website directory used to take a minute or more; thanks to bulk loading, it now imports almost instantaneously.  Previously only a very limited class of queries on the data was possible, but now, with full SPARQL support, anything goes.  I’m really happy with it.

I’ve decided to scrap the old search, however.  I was going to simply refactor it in the interest of getting on with the thesis, but it will take just as much work to create a new search that works much better.  Central to this new search is a recent addition to the plumbing: full-text search.  Over the last two days I integrated Lucene, a full-text search engine, into the project.  Lucene makes it unbelievably easy to index content and then search it.  And it’s incredibly fast (at both)!
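To give a feel for what "index content and then search it" means, here is a minimal sketch of an inverted index, the kind of data structure that full-text engines such as Lucene are built around.  This is plain Java and not Lucene's API; the class name `TinyTextIndex`, its methods, and the sample site IDs are all hypothetical, made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (NOT Lucene): maps each lowercased word to the IDs of the documents containing it. */
class TinyTextIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits text into lowercase word tokens; anything that is not a letter or digit is a separator. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexing step: record the document ID under every word in its text. */
    public void add(String docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(docId);
        }
    }

    /** Search step: return the IDs of documents that contain every word of the query. */
    public List<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> hits = postings.getOrDefault(token, Collections.<String>emptySet());
            if (result == null) {
                result = new LinkedHashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.<String>emptyList() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyTextIndex index = new TinyTextIndex();
        // Hypothetical sample data standing in for website names.
        index.add("site-1", "Smith Family Tree");
        index.add("site-2", "Jones and Smith Genealogy");
        System.out.println(index.search("smith"));        // prints [site-1, site-2]
        System.out.println(index.search("smith family")); // prints [site-1]
    }
}
```

The work is done once, at indexing time, so a search only has to look up each query word and intersect the resulting sets.  A real engine adds a great deal on top of this (analyzers, relevance scoring, on-disk storage), so treat the sketch as an illustration of the idea only.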

As a proof of concept, I refactored the PGV website list to use a Lucene index (instead of a SPARQL query with caching).  Those of you who have used the PGV Websites view before will know that it was very slow.  You will be pleased to learn that it now takes less than a second to populate the list, even the first time it is opened!  True, the PGV Websites view is really not all that important (or at least not yet; in the future you’ll go there to add/remove sites and enter account details for non-anonymous access).  But it does suggest that using Lucene indexes for search will be very successful.
