February 12, 2012

GEDCOM Standards – Where Genealogy Meets Technology — #Genealogy, #Technology, #Standards

by C. Michael Eliasz-Solomon

Stanczyk has been churning since about November of last year (2011). I have a number of ideas rummaging around my brain for genealogy apps. For over a quarter century, I have been a computer professional and have used and/or developed a lot of programs using a myriad of technologies. At my core, I am a data expert: design it, store it, query it, manage it, analyze it and protect it. “It” being the data.

Before going to #RootsTech 2012, I knew GEDCOM was the core of our hobby/business/research. GEDCOM is our de facto standard. It is how data is exchanged between us and our various programs. I say de facto because, as standards go, it is not a very open standard (one organization “owns” it, and the rest of us go along with it). It also has not changed in about a decade and a half, so Ryan Heaton was correct in calling it “stale”. It does still work... mostly. But when a standard does not progress, you get a lot of proprietary “enhancements” that prevent the complete interchange of data, since one vendor does not know how to deal with another vendor’s file in its totality.

At present, GEDCOM maxes out at version 5.5, although there are other variations you might see. But 5.5 was the last standard version. I counted 128 total tags and a provision for creating non-standard tags (they start with an underscore).

[Mike: thanks to Tamura Jones! Even though GEDCOM v5.5.1 was never finalized, it IS the de facto max version of GEDCOM. GEDCOM v5.5.1 added 9 tags and removed the BLOB tag, so we now have a total of 136 tags. I will need to update even my high-level graphic syntax diagram.]

Tags are like:

INDI, FAMC, FAMS, SOUR, REPO, HEAD, TRLR, etc.  -or-  ALIA, ANCE

The first bunch are familiar and are probably in your family tree (if you ever exported the GEDCOM file). The ALIA tag is one that Dallan Quass said was universally used wrong by all programs. After seeing its definition, I can see how it is confusing. As for the ANCE tag, I do not recall seeing any program that lets me do anything that might utilize this tag. It is probably one of those tags that Dallan said is not used at all.
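To make the tag talk concrete, here is a minimal sketch of reading GEDCOM lines and counting the tags in a file. It assumes the usual line shape of a level number, an optional @XREF@ identifier, a tag, and an optional value. The sample data and the `_NICK` custom tag are invented for illustration and do not come from any particular program.

```python
import re
from collections import Counter

# level, optional @XREF@, tag, optional value
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")


def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""


def tag_counts(text):
    """Count how often each tag appears in a GEDCOM text."""
    return Counter(parse_line(l)[2] for l in text.splitlines() if l.strip())


# Invented sample; _NICK is a made-up non-standard (underscore) tag.
SAMPLE = """\
0 HEAD
1 CHAR ANSEL
0 @I1@ INDI
1 NAME Wladyslaw /Elijasz/
1 FAMC @F1@
1 _NICK Walter
0 @F1@ FAM
0 TRLR
"""

if __name__ == "__main__":
    for tag, count in sorted(tag_counts(SAMPLE).items()):
        kind = "non-standard" if tag.startswith("_") else "standard"
        print(f"{tag:6} {count} ({kind})")
```

Running a count like this over an export is a quick way to see which of the 136 tags a program really writes, and which underscore tags it has added on its own.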

I looked at the “MULTIMEDIA” section of the standard. It looks woefully out of date and is probably not used at all (at least not in any standard way), which is probably why our pics, audio, and video (or any other media file like PDF or MS Word) do not move with the GEDCOM. Has any program ever used the ENCODING/DECODING of a multimedia file? The standard seems to imply a buffer of only 32K (for a line), so even if you strung a large number of CONC tags one after another, you would need 100 lines to store a 3.2MB file in-line in the GEDCOM. I do not think I have ever seen that in a GEDCOM. Programs probably store these binary large objects (BLOBs) outside the GEDCOM and refer to their path on the computer/network.

I did some noodling. I have 890 MB (or approximately 890,000 KB) in pictures and scanned source documents for about 1,000 people in my family tree. So I use nearly a gigabyte (1GB) for my family tree and all its multimedia, and I do not have any audio or video! That works out to almost 1MB/person.
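The back-of-the-envelope numbers above can be checked in a few lines. This sketch takes the 32K-per-line reading at face value and, like the round numbers in the text, treats 32K as 32,000 bytes and 3.2MB as 3,200,000 bytes. Both are assumptions made for the arithmetic, not statements about what the standard guarantees, and the sketch ignores any size growth from encoding binary data as text.

```python
LINE_BUFFER_BYTES = 32_000  # assumed: the 32K-per-line reading, rounded


def lines_needed(file_bytes, line_bytes=LINE_BUFFER_BYTES):
    """Number of full-length lines needed to hold file_bytes (rounded up)."""
    return -(-file_bytes // line_bytes)  # ceiling division


def mb_per_person(total_mb, people):
    """Average multimedia footprint per person in the tree."""
    return total_mb / people


if __name__ == "__main__":
    print(lines_needed(3_200_000))   # lines for one 3.2MB file
    print(mb_per_person(890, 1000))  # MB per person for 890 MB and 1,000 people
```

With these inputs the file needs 100 lines and the tree averages 0.89 MB per person, which matches the figures in the text.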

If we did have this magical new GEDCOM standard that could carry all of our multimedia from one GEDCOM program to another, the copying would take a long time. If I uploaded/downloaded it to/from the Internet, I might incur an overage on my ISP’s usage charges, even if this were technically feasible! Imagine if I did this multiple times a month (as I got updates). I am beginning to understand why no vendor has tackled the problem. I would also like to store PDFs and other documents besides the GIF/JPG/PNG files that a browser can display natively in web pages. Those are not a part of the existing GEDCOM standard. Let me sling some jargon: I’d want to store any file type that has a MIME type definition, that I can currently embed in emails or utilize in Java programs, or that the HTML5 standard will allow for multimedia.

GEDCOM 5.5 was in its infancy in dealing with character sets. It was predominantly ASCII with some funky ANSEL coding of characters to handle Latin alphabet diacriticals, although it is not clear how I would do the data entry for those and it looks incomplete. It did mention UNICODE, but only cursorily and just to remind us that the lengths in the GEDCOM standard were in ‘characters’, not bytes, which was correct. Those multibyte characters (say in Hebrew, Russian, Japanese or Chinese) would nonetheless quickly use up the 32K byte line buffer limit, which at four bytes per character would effectively become about 8K characters per line. In fact, GEDCOM 5.5 says it will only deal with LATIN alphabets and leave Cyrillic, Hebrew and Kanji for some far-flung future. Stanczyk is Slavic; I need UNICODE to represent my ancestors’ names and places. Fortunately, I do not feel the need for Cyrillic (Russian, Ukrainian, Belarusian, Macedonian, etc.) or I’d be out of luck. I’ll just use the Polish version of those names in their ‘Latinized’ forms.
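The characters-versus-bytes point is easy to see in code. This is a small sketch that assumes UTF-8 as the encoding (GEDCOM 5.5 itself does not specify UTF-8, so treat that as my assumption) and shows how the same byte budget holds fewer characters as the bytes per character go up. The sample names are just illustrations.

```python
def utf8_bytes(text):
    """Size of text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def chars_that_fit(budget_bytes, bytes_per_char):
    """How many characters fit in a byte budget at a fixed width."""
    return budget_bytes // bytes_per_char


# Illustrative samples; \u0119 is the Polish letter e with ogonek.
SAMPLES = {
    "ASCII": "Elijasz",
    "Polish": "K\u0119dzierski",
    "Cyrillic": "Елиаш",
    "Japanese": "家族",
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label:9} {len(text):2} characters, {utf8_bytes(text):2} bytes")
    # Worst case of 4 bytes per character against a 32K byte buffer.
    print(chars_that_fit(32 * 1024, 4))
```

In UTF-8, Polish diacritics and Cyrillic letters take two bytes each and most Japanese or Chinese characters take three, so the four-bytes-per-character figure is the worst case rather than the typical one.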

Oh, that is another area where the standard needs to be enhanced: NAMES. Dallan mentioned that Personal Names do not get a thorough treatment in the standard (I am refusing to read the data model, and I am a Data Architect). Location Names get almost no treatment; the standard does give you a place to store your locations (the PLAC tag). What language should I use? After all, my ancestors are from POLAND, for God’s sake. Besides the obvious Polish, I have German, Russian and Latin to deal with, and being American I prefer English. Slavic names often do not translate well. For example, Wladyslaw is Ladislaus in Latin, but in English there is no equivalent; maybe that is why my ancestors used ‘Walter’ instead. But the point is, how should I store the name? Can I store all of the equivalents and search on any of them? Nope.
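What “store all of the equivalents and search on any of them” could look like is sketched below. This is not something GEDCOM provides; `NameIndex` and its methods are hypothetical names made up for the illustration, and the person identifier is invented.

```python
from collections import defaultdict


class NameIndex:
    """Hypothetical index: every variant of a name points at the same person."""

    def __init__(self):
        self._by_variant = defaultdict(set)

    def add(self, person_id, *variants):
        """Register all known spellings/translations for one person."""
        for variant in variants:
            self._by_variant[variant.casefold()].add(person_id)

    def find(self, name):
        """Return the ids of everyone known under this variant."""
        return sorted(self._by_variant.get(name.casefold(), ()))


if __name__ == "__main__":
    index = NameIndex()
    index.add("@I1@", "Wladyslaw", "Ladislaus", "Walter")
    print(index.find("walter"))
    print(index.find("Ladislaus"))
```

The design choice here is simply that the variants are data attached to the person, so a search in any language lands on the same individual.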

Damn, Russian is Cyrillic. GEDCOM doesn’t deal with non-Latin alphabets. And even though I can read the Russian genealogy records, I’d rather not, nor would I want to try to do data entry that way either. Besides, the communists reformed the spelling in 1918 (making War & Peace considerably shorter in Russian). That reform eliminated several characters. Most modern software is not aware of the eliminated characters, much less able to generate them. This whole Language/Unicode/Name thing is complicated, and I have not even mentioned the changing borders or the renaming of cities in different languages or over time or their changing jurisdictions. I cannot fault GEDCOM for all of these woes. I have them in my own research and I have not yet found any satisfying way to handle them. I find it helps to have a very good memory and keep these things in my head, but there is no backup for that.

How are we ever going to arrive at the vision Jay Verkler put forth at #RootsTech? GEDCOM needs to become an open standard. Once it is standardized again, it needs to become modern again and deal with current technology, so we can get around to the tough problems of conforming names, places, sources/repositories and calendars/dates, and of doing complex analyses like Social Network Analysis as a way to gather wayward ancestors into a family for which we lack the documentation to prove it (genealogically). I hope the future includes Beider-Morse phonetic matching and can deal with folding diacritical characters into a base character (e.g. change ę into e) for searches.
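Folding diacritics into base characters for searching can be sketched with Python’s standard `unicodedata` module. The approach decomposes each character and drops the combining marks. The extra mapping for ł/Ł is my own addition, needed because that letter has no decomposition. This is only the folding step and does not attempt Beider-Morse phonetic matching.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit mapping.
EXTRA = str.maketrans({"\u0142": "l", "\u0141": "L"})  # ł and Ł


def fold(text):
    """Fold diacritics to base letters, e.g. ę -> e, ń -> n, ł -> l."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def matches(query, candidate):
    """Compare two names ignoring case and diacritics."""
    return fold(query).casefold() == fold(candidate).casefold()


if __name__ == "__main__":
    for name in ["Kędzierski", "Leszczyński", "Wleciałowski"]:
        print(name, "->", fold(name))
    print(matches("kedzierski", "Kędzierski"))
```

A search box that folds both the query and the stored names this way finds KĘDZIERSKI whether or not the user can type the ogonek.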

FamilySearch, if you are going to register GEDCOM tools, then please do a few more things for the NEW standard. First, make each vendor add to an APPENDIX the name and complete definition of their NON-STANDARD tags, in case anyone else wishes to implement or deal with them. Put a section in the header (HEAD tag) that lists all NON-STANDARD tags (just once each) along with their vendor, so that someone else can go look at the standard and see what these tags mean and possibly implement the good ones. Forget that two byte thing before the HEAD tag. Just make the HEAD tag’s CHAR sub-tag indicate the character set (ANSI | ANSEL | UNICODE). Please administer a #RootsTech keynote to vote on annual changes to the GEDCOM standard. Provide a GEDCOM validator and also a GEDCOM converter webpage to allow users/vendors to validate/convert their GEDCOM file(s).
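A first step toward such a validator could be as small as the sketch below: list every non-standard (underscore) tag in a file once, and read the character set declared by the CHAR sub-tag under HEAD. The function names are hypothetical, and the sample file, its vendor name and its underscore tags are invented for illustration.

```python
def custom_tags(gedcom_text):
    """Return each non-standard (underscore) tag once, sorted."""
    tags = set()
    for line in gedcom_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        # The tag is the second token unless that token is an @XREF@.
        if parts[1].startswith("@") and len(parts) > 2:
            tag = parts[2].split()[0]
        else:
            tag = parts[1]
        if tag.startswith("_"):
            tags.add(tag)
    return sorted(tags)


def declared_charset(gedcom_text):
    """Return the value of HEAD.CHAR, or None if it is missing."""
    in_head = False
    for line in gedcom_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            in_head = tag == "HEAD"
        elif in_head and level == "1" and tag == "CHAR":
            return parts[2].strip() if len(parts) > 2 else None
    return None


# Invented sample: the vendor and all underscore tags are made up.
SAMPLE = """\
0 HEAD
1 SOUR HypotheticalApp
1 CHAR UNICODE
0 @I1@ INDI
1 NAME Jan /Leszczynski/
1 _MILT Army
2 _UID 123
0 @N1@ _PLACEDEF
0 TRLR
"""

if __name__ == "__main__":
    print(declared_charset(SAMPLE))
    print(custom_tags(SAMPLE))
```

The sorted list is essentially what the proposed HEAD section would carry: each NON-STANDARD tag once, ready to be paired with its vendor’s definition.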

Make multimedia be meta-data and allow users to define “LOCATIONS” where multimedia files can be found using either a PATH or a URL (or a relative path / URL). Make it a part of the standard that the meta-data must move, but the multimedia files can optionally stay put. Multimedia should be able to be placed on a LOCAL/NETWORK drive, on the INTERNET, or on removable multimedia volume(s) [thumb drives, CDs, DVDs, etc.]. Make the multimedia “LOCATIONS” editable so a user can switch between LOCAL/NETWORK, INTERNET, or REMOVABLE, including using some of each type of LOCATION. Allow these files to exist or not (show “UNAVAILABLE” or some equivalent visual cue if they are accessed and do not exist). The mapping between an Individual (INDI) or a family (FAM) or some other future GROUP and its multimedia file(s) must move as a part of the meta-data (even if the multimedia file(s) do not). That way the end-user need only edit his LOCATIONS meta-data (and ensure the files are in that/those location(s)) when he runs the software.
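The LOCATIONS idea can be sketched as two small records: a named location the user can edit, and a link from an individual or family to a file relative to that location. Everything here (class names, field names, the example URL) is hypothetical and only illustrates the proposal; nothing like it exists in GEDCOM 5.5.

```python
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urljoin


@dataclass
class Location:
    name: str  # user-chosen label, e.g. "scans"
    kind: str  # "LOCAL", "INTERNET", or "REMOVABLE"
    base: str  # directory path or base URL; the only thing a user edits


@dataclass
class MediaLink:
    owner: str     # e.g. "@I1@" (INDI) or "@F1@" (FAM)
    filename: str  # relative to the location, so it survives a move
    location: str  # name of a Location


def resolve(link, locations):
    """Turn a link into a usable path or URL, or "UNAVAILABLE"."""
    loc = locations.get(link.location)
    if loc is None:
        return "UNAVAILABLE"
    if loc.kind == "INTERNET":
        # URLs are not checked for existence in this sketch.
        return urljoin(loc.base.rstrip("/") + "/", link.filename)
    path = Path(loc.base) / link.filename
    return str(path) if path.exists() else "UNAVAILABLE"


if __name__ == "__main__":
    locations = {
        "web": Location("web", "INTERNET", "https://example.org/media"),
        "usb": Location("usb", "REMOVABLE", "/media/thumbdrive"),
    }
    print(resolve(MediaLink("@I1@", "scans/birth.jpg", "web"), locations))
    print(resolve(MediaLink("@I1@", "scans/birth.jpg", "usb"), locations))
```

Because the links only name a location, moving the files from a thumb drive to a website means editing one `base` value while every INDI/FAM mapping travels unchanged.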

Define an API for GEDCOM plug-ins so that new software can access the GEDCOM data without parsing the GEDCOM file. The API should give the external plug-in a wrapped interface to the underlying data, so the plug-in does not have to know the data model, just the individual, family, or location, or a name list of individuals, families, or locations. This will allow new software to provide additional functionality to a family tree or to provide inter-operability between trees/websites. Obviously security/privacy rules would limit this kind of plug-in access.
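As a rough illustration of the plug-in idea, the sketch below defines a tiny interface that a plug-in codes against, plus one in-memory implementation that applies a privacy rule. All names (`TreeAPI`, `InMemoryTree`, `surname_report`) are hypothetical, and the privacy rule of hiding living people is just one assumed example of the security/privacy limits mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass(frozen=True)
class Person:
    id: str
    name: str
    living: bool


class TreeAPI(Protocol):
    """What a plug-in is allowed to see; no file format, no data model."""

    def individual(self, person_id: str) -> Optional[Person]: ...

    def individuals(self) -> List[str]: ...


class InMemoryTree:
    """One possible host-side implementation that hides living people."""

    def __init__(self, people, hide_living=True):
        self._people = {p.id: p for p in people}
        self._hide_living = hide_living

    def _visible(self, person):
        return not (self._hide_living and person.living)

    def individual(self, person_id):
        person = self._people.get(person_id)
        return person if person is not None and self._visible(person) else None

    def individuals(self):
        return sorted(p.name for p in self._people.values() if self._visible(p))


def surname_report(tree: TreeAPI) -> List[str]:
    """Example plug-in: uses only the API, never the underlying storage."""
    return [name.split()[-1] for name in tree.individuals()]


if __name__ == "__main__":
    tree = InMemoryTree([
        Person("@I1@", "Franciszek Leszczynski", living=False),
        Person("@I2@", "Living Relative", living=True),
    ])
    print(surname_report(tree))
```

The plug-in never touches a GEDCOM file, so the host program is free to change how it stores the tree without breaking the plug-in.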

That’s Stanczyk’s vision of the GEDCOM future!

February 12, 2012

#RootsTech Research – 2012 — #Polish, #Genealogy

by C. Michael Eliasz-Solomon

Stanczyk prepares before going to an archive or research library. So when I was awarded the prize of going to #RootsTech, I immediately started my preparations.

I favor the microfilms, which are free to read in the Family History Library in Salt Lake City.

Biechów –  MF # 1257788 (parts 8-10) which covered the years 1875-1877

Pacanów – MF #’s 1192352,  1192351 which covered 1876-1877   &  1875 [respectively]

Beszowa – 1257787

Tumlin – 1808856, 939955

Olesnica – 1807620 (parts 4-10)

Opatowiec – 1807620 (parts 11-16),  1192351 (parts 1-7)

Stopnica – 1807635 (parts 1-6)

Swiniary – 939951

And those were just the Polish villages (there were many more in the USA, but that is another floor).

Some of the above are on the list because I am expanding the search for records to surrounding parishes. That is called a proximity/circle search. As it turned out, the proximity also included nearby parishes where affiliated families said they were from. So I was looking for GAWLIK in Opatowiec and GRONEK in Stopnica/Olesnica. I always checked for ELIJASZ/LESZCZYŃSKI/WLECIAŁOWSKI in all villages. I was disappointed that I did not find KĘDZIERSKI in Tumlin.

I had prepared for some books (and/or maps) too. Sadly, many of these items were not located in the library and my three levels of assistants all failed to find them or even to explain why they could not be found:

943.8 E7sh (Malopolska cadastral. This was a high priority, so it was disappointing not to be able to locate these).

943.84 R2e (a register of Landowners — also not locatable).

A couple of books I did find were a disappointment because they did not contain any of my family. C’est la vie; that too is a part of the research. All told I had 10 spreadsheet pages of Family History Catalog items! That may seem like a lot, but it is always better to be overprepared because, as you see, some items cannot be located, some are dead ends, and some quickly show they do not contain what you are looking for after all. Being underprepared is just a time waster, but they do have PCs available to do catalog look-ups, so it is not a show stopper.

I dutifully check them off as I use them, and sometimes I note my findings (or lack thereof).

Future Research

Next time I will have to search Beszowa more thoroughly, even exhaustively [for Paluch]. I will also search Dobrowoda parish [for Major]. I will have to dedicate a lot of time to Swiniary too [for Elijasz, Leszczynski, Kordos, etc.] and also Szczucin.

I will have to find a way to get to Buffalo and find my great-uncle Franciszek Leszczynski’s records and hopefully his brother Jan (aka John) Leszczynski too.

I of course need to get to Poland and visit the actual archives and parishes of my ancestors to see those records that have not yet been microfilmed. I need to write down this research plan. I already know where the civil and diocesan archives are and of course the parishes themselves. I will need an abundance of time there to get around the language and customs and the learning curve of using these resources.

How do you prepare for your genealogy trips?
