Piggybacking on Google
Two months ago Yakov Shafranovich made an excellent recommendation: why not just use Google!? I didn’t quite realize at the time how insightful he was.
Searching PhpGedView websites
PGVAgent currently searches PhpGedView websites by randomizing the internal list of websites and querying ten of them at a time. It continues doing this until the list is exhausted, which can take quite a while. Using Google, however, PGVAgent can take a huge shortcut. A Google query of the form “inurl:individual.php inurl:pid= [search terms]” returns PGV individual pages that contain the given search terms. PGVAgent can use these results to prioritize PGV websites and potentially get to the interesting results a lot sooner.
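As a rough illustration of the prioritization idea (this is not PGVAgent's actual code), the sketch below builds the query string and moves sites that appear in the search results to the front of the list. The function names and the example URLs are made up for illustration, and fetching the results from Google is left out entirely.

```python
from urllib.parse import urlparse


def build_pgv_query(search_terms):
    """Build the 'inurl:' query string quoted in the post."""
    return f"inurl:individual.php inurl:pid= {search_terms}"


def prioritize_sites(known_sites, result_urls):
    """Order known sites by how many search results point at their host.

    Sites with more hits come first. Sites with no hits (or equal hits)
    keep their original relative order, because sorted() is stable.
    """
    hits = {}
    for url in result_urls:
        host = urlparse(url).netloc.lower()
        hits[host] = hits.get(host, 0) + 1

    def hit_count(site):
        return hits.get(urlparse(site).netloc.lower(), 0)

    return sorted(known_sites, key=lambda site: -hit_count(site))


# Hypothetical data: three registered sites and three search hits.
known = ["http://a.example/pgv/", "http://b.example/", "http://c.example/"]
results = [
    "http://c.example/individual.php?pid=I1",
    "http://c.example/individual.php?pid=I2",
    "http://b.example/individual.php?pid=I5",
]
ordered = prioritize_sites(known, results)
```

With this ordering, the existing "ten at a time" loop could stay as it is and simply start from the sites most likely to contain a match.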
Searching GEDCOM files
Yakov suggested another Google query that exposes a treasure trove of genealogical information: “filetype:ged [search terms]”. This query returns raw GEDCOM files that contain the given search terms. As PGVAgent already requires a GEDCOM parser to retrieve genealogical information beyond name, gender, birth, and death from PGV websites, I can simply reuse it to import data from raw GEDCOM sources. Granted, it doesn’t directly further the primary objective of my thesis, but it does make Genesis all the more useful to its users.
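To give a feel for what scanning a raw GEDCOM file for a name involves, here is a minimal sketch. It is not the parser used by PGVAgent; it only assumes the basic GEDCOM line layout ("level [xref] tag [value]") and looks at INDI records and their NAME lines. The function name and the sample lines are hypothetical.

```python
def find_individuals(gedcom_lines, search_terms):
    """Return (xref, name) pairs for individuals whose name has every term."""
    terms = [t.lower() for t in search_terms.split()]
    matches = []
    current_id = None  # xref of the INDI record being read, if any

    for raw in gedcom_lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, second = parts[0], parts[1]
        rest = parts[2] if len(parts) > 2 else ""

        if level == "0":
            # A level-0 line starts a new record, e.g. "0 @I1@ INDI".
            current_id = second if rest == "INDI" else None
        elif level == "1" and second == "NAME" and current_id:
            # Surnames are wrapped in slashes: "John /Smith/".
            name = " ".join(rest.replace("/", " ").split())
            if all(t in name.lower() for t in terms):
                matches.append((current_id, name))
    return matches


sample = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 SEX M",
    "0 @I2@ INDI",
    "1 NAME Mary /Jones/",
    "0 @F1@ FAM",
    "1 HUSB @I1@",
    "0 TRLR",
]
found = find_individuals(sample, "smith")
```

Because the function only needs an iterable of lines, it could read from a downloaded file or a streamed response without holding the whole file in memory.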
You can expect both enhancements in a future version of Genesis and PGVAgent (no idea how soon though).


Doesn’t it produce a lot of unnecessary downloads if we retrieve a complete Gedcom file just to find one name?
Another interesting possibility would be to download the Gedcom files into the Valhalla server. But that will probably cause copyright problems.
Could it be possible to use GENDEX … or TNG Network (http://tngnetwork.lythgoes.net/index.php) as it is called now? Perhaps just the GENDEX files could be used, so as not to sponge off TNG.
Jesper, you’re totally right. Downloading entire GEDCOM files just for a name is indeed a waste, but on the other hand it’s a simple measure that will work until something better (what you propose) is available. Once we have Valhalla, it’d be great to write a web crawler that feeds data into it. Or better yet, perhaps the folks at WeRelate.org would be interested in deploying Valhalla, or at least exposing their data in a DFT-friendly way.
crex, that’s a great idea as well! It would be pretty easy to use the TNG Network to find TNG websites. The trick though is getting the actual data out of the websites. I don’t think TNG supports any web services (as PhpGedView does), and it takes a lot of effort to screen scrape. I do want Genesis to be able to access any genealogy service in the future though. Well, now would be nice, actually, but one step at a time.
The GENDEX idea was invented by Gene Stark, and it had its own search engine on gendex.com (site dead since 2004). TNG has adapted that idea. Several genealogy programs can generate GENDEX files. Google for “gendex filetype:txt” and you’ll see what the files look like. There aren’t that many to be found now, but if they could be used in Genesis, more users might want to generate these index files. I guess it would be primarily for accessing static web pages. No relations are recorded in GENDEX, unfortunately. At one time the GENDEX genealogy search engine indexed “more than 22,000 online databases of genealogical information on 60 million people.” Perhaps someone will write the GENDEX 2 file format that also includes relations :)
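For anyone curious, reading such a file takes very little code. The sketch below assumes the commonly described layout of one person per line with pipe-separated fields (page reference, surname, full name with the surname in slashes, birth date, birth place, death date, death place); check this against a real file before relying on it. The names GendexEntry and parse_gendex_line, and the example line, are made up for illustration.

```python
from typing import NamedTuple


class GendexEntry(NamedTuple):
    url: str
    surname: str
    full_name: str
    birth_date: str
    birth_place: str
    death_date: str
    death_place: str


def parse_gendex_line(line, base_url=""):
    """Parse one pipe-delimited line (assumed layout, see note above)."""
    fields = line.rstrip("\n").split("|")
    # Pad short lines and drop anything past the seven expected fields.
    fields = (fields + [""] * 7)[:7]
    ref, surname, name, bdate, bplace, ddate, dplace = (f.strip() for f in fields)
    # Remove the slashes around the surname and collapse extra spaces.
    full_name = " ".join(name.replace("/", " ").split())
    return GendexEntry(base_url + ref, surname, full_name,
                       bdate, bplace, ddate, dplace)


entry = parse_gendex_line(
    "I1.html|SMITH|John /SMITH/|1 JAN 1850|Boston|3 MAR 1910|Chicago|",
    base_url="http://example.org/tree/",
)
```

Each entry carries only a link and a few facts about one person, which matches the point above that no relations are recorded.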
Hi Hilton,
It was good to meet you at the FHT conference. Keep up the great work!!
Something else that you can do with Google is ask it to give you a bigger list of phpgedview websites. This is what hackers have done in the past to try and look for exploits.
For example, enter the following search query into Google:
individual.php phpgedview version 4
You will have to filter out some noise, but it could give you more sites to search.
–John
Thanks John. It was good to meet you and your students as well!
Thanks for the great tip as well. I was surprised to learn that there are over 10,000 PGV websites out there. I guess I’d just assumed that the 1,000 or so listed in the public registry were nearly all of them. I’d really like to be able to access the rest of them as well.
Hi Hilton,
The idea of GENDEX is to make all private genealogical web pages searchable in one index.
A GENDEX server does not generate its own redundant web pages; it just points to the originals. It gets the simple information from a GEDCOM file which is placed somewhere on the user’s web space. The link to it has to be published to the GENDEX server, which reads it from time to time to update its database.
In addition to TNG, another GENDEX server is growing: http://www.familytreeseeker.com .
It looks more professional and contains twice as many records as TNG ( http://tngnetwork.lythgoes.net ).
See also GENDEX FAQ: http://www.familytreeseeker.com/faq.php?l=en&p=14
This is not my page, but I use it to find people who have the same ancestors.
I would recommend that everyone provide a GENDEX file.
Regards,
Michael Beuss
Thanks Michael!
Actually, within the last week I’ve been in contact with the owner of familytreeseeker.com about using the site in Genesis. Contrary to appearances, I’ve been working on Genesis as time becomes available, and a new version will be coming out soon. So far, the plan is to use familytreeseeker.com as the default search engine. It will be presented in a browser window within Genesis. I’m also working on a p2p genealogy system in one of my CS classes which uses GENDEX records to distribute the index. I think it’s a simple, straightforward, and effective format and hope that I can help influence adoption.