Goodbye Database!
In a completely unanticipated reversal, Genesis is now shedding its database. “What!?”, you ask in blinking disbelief (as I seem to have attracted a particularly quiet readership, I get to put words in your mouth). An excellent question, I’m glad you asked. Permit me, if you will, to entertain it.
Way, way back in the beginning, long before I’d ever even heard of this semantic web thing, I was planning on Genesis being nothing more than a really good record manager (like PAF, only usable). This of course necessitated a database for storing all the data on the user’s computer. I always assumed that one would be there, even though the whole concept later evolved. It never occurred to me that a database wasn’t really necessary anymore.
The flash of insight came yesterday morning as I was contemplating the next step in the replumbing/resurfacing effort. I don’t recall the exact circumstances, but I do remember asking myself what would happen if I stopped caching data in the database. Well, performance would go through the roof, for starters! Startup and shutdown time would become negligible. Disk space usage would fall dramatically. And perhaps most important of all, I could take advantage of the OWL inference support in Jena!
This last point bears explanation. A major part of this project is the ability for the user to indicate that Person A and Person B are in fact the same person. This is done by creating an owl:sameAs relationship between the two. Given this fact, Genesis should infer (using the OWL inference rules) that anything said about Person A is also true about Person B, and vice versa; the two are effectively one. With some tricks, this could be done efficiently using a database. However, anything more complex would be next to impossible without bloating the size (and reducing the speed) of the database by several orders of magnitude, because all inferences would need to be precomputed each time new data is added to the database. But inferences like this can be done in-memory (without a database) on demand!
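To make the on-demand idea concrete, here is a minimal sketch in plain Python. It is not Jena's reasoner and not Genesis code: the function name `statements_about`, the tuple-based triples, and the `ex:` identifiers are all assumptions made for illustration. It only merges statements whose subject is one of the aliases, which is the narrow Person A / Person B case described above, and it computes the merge at query time instead of storing precomputed inferences.

```python
SAME_AS = "owl:sameAs"

def statements_about(triples, subject):
    """Return every (predicate, object) stated about `subject` or any alias.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Aliases are discovered by following owl:sameAs links in both
    directions (and transitively) using a small union-find structure.
    """
    triples = list(triples)
    parent = {}

    def find(x):
        # Locate the representative of x's alias group, compressing paths.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two sides of every sameAs statement into one group.
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Collect everything said about any member of the subject's group.
    root = find(subject)
    return {(p, o) for s, p, o in triples
            if p != SAME_AS and find(s) == root}
```

Nothing is written back to storage here; the alias groups are rebuilt from the raw statements each time the function is called, so adding a new owl:sameAs statement requires no recomputation of stored data.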
Well, there are obviously many positive aspects, but are there any downsides? The most obvious drawback is the fact that it takes PGVAgent a long time to search each PhpGedView website one-by-one. Having a database means that these search results can be cached for future searches. If the database goes, so does the persistent caching. In fact, this is why I had never considered dropping the database before. Which raises the question, why did I suddenly start considering it now?


Silent, yet avid readers…Unite!
It only makes sense, but it is also true that databases CAN be run mostly (if not entirely) within memory. Perhaps the database was set up poorly, caching everything to disk immediately and reading everything from the drive immediately? I know you can delay disk writes and still keep data in the database by having a separate disk-writer thread that works independently (I know this in part because last.fm works, or used to work, on that principle).
Also, I question the use of owl:sameAs, as it is a very strong statement to make, and I don’t know how its ontological semantics would work in the named graph context. More and more from my start on that independent library, I’m believing that the named graph set itself needs a dedicated named graph in which assertions can be made, such as the ‘owl:sameAs’ claim…
Lastly, I’d think that the difficulties seen, in particular, with the owl:sameAs unification in Genesis, are more algorithmic than database-related (though of course disk throughput may make a difference, depending on how much of the database is cached).
But this is all speculation on my part.
It’s true that the database may have been set up poorly, but even if I could get the database to run as quickly as an in-memory model, I would still have the unmentioned hassle of keeping Lucene indexes synchronized with the data in the database (which is problematic without transactions). The whole problem is greatly simplified by dropping the database altogether, seeing as it isn’t really necessary.
As for owl:sameAs, I agree that it should go into a special assertions graph. When it is asserted by an agent, it goes into the agent’s assertion graph, which is not trusted by default, so it gets ignored in reasoning. When it is asserted by the user, it goes into the user’s assertion graph, which is of course trusted and therefore used in reasoning. Because it can be either trusted or not trusted, I’m OK with it being a strong statement.
For reference, I’m planning on using WIQA to manage trust.
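The trusted/untrusted split described above can be sketched roughly as follows. This is plain Python for illustration only and does not attempt to reflect WIQA's actual interface; the function name `trusted_triples`, the graph names, and the dictionary layout are all assumptions.

```python
def trusted_triples(named_graphs, trusted):
    """Merge only the graphs whose names appear in `trusted`.

    `named_graphs` maps a graph name to a list of
    (subject, predicate, object) tuples. Statements in untrusted
    graphs are left out, so a reasoner run over the result never
    sees them.
    """
    merged = []
    for name, triples in named_graphs.items():
        if name in trusted:
            merged.extend(triples)
    return merged
```

Under this arrangement an agent's owl:sameAs claim stays on record in its own graph but has no effect on reasoning until that graph is added to the trusted set.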
So I like the idea that only the “view” gets saved. There can be an invisible “layer”, though, where some things can be cached in a BDB for speed purposes only, but I really like the idea of being able to go to any computer and use this. Just one question… are “collections” going to be stored locally?
That’s funny, just this morning I started working on caching using BDB (thanks to your encouragement in past discussions, BDB is awesome)! Maybe I should write a new post, “Hello Database!” :)
A “collection” can exist anywhere, whether it be local or remote. Putting data online is very strongly encouraged, but I can imagine reasons why one might want to maintain a local collection instead. If you’re a Luddite, maybe? ;)