Posts tagged ‘Standards’

February 13, 2012

Blog Bigos …

by C. Michael Eliasz-Solomon

Stanczyk added a new Page (Tech Diary) to record my technology doings.

While doing that and reading from my blogroll (and emails), I discovered some history about the “defacto standard GEDCOM” (wiki: GEDCOM). I strongly recommend you start from the “defacto” link rather than the Wikipedia link.

  • RootsTech 2012 – had two GEDCOM presentations by Ryan Heaton (FamilySearch, GEDCOMX project).
  • RootsTech 2012 – had one open source GEDCOM parser presentation by Dallan Quass. Dallan was quite remarkable in his efforts to achieve a 94% commonality amongst 7,000 different GEDCOM files. Dallan Quass has a GitHub project for his Open Source GEDCOM parser.
  • Modern Software Experience (Tamura Jones) had a couple of articles that prompted me to write this one. His most recent GEDCOM article that caught my eye was BetterGEDCOM (2/2/2012). I also noticed he had a GEDCOMX article from 12/12/2011. Together these two articles provide a good discussion. The BetterGEDCOM project also has its own project blog. [Also see his Gentle Introduction to GEDCOM article.]

I believe those provide the most current thoughts on GEDCOM (that I have not penned).

  • I have been studying GEDCOM v5.5 (the last GEDCOM standard).
  • I produced a partial Graphic Syntax Diagram of GEDCOM v5.5 [what I had been calling "Railroad Tracks"] just to demonstrate how I thought this diagram was a better vehicle to communicate the standard [than say UML object models].
  • I could not resist making slight tweaks to GEDCOM v5.5 even in my preliminary studies, mostly so we could discuss GEDCOM in a readable fashion (i.e. whitespace for formatting, and comment lines) or because the language cries out for consistency (i.e. requiring the HEAD tag to be at level zero, just like the TRLR tag).

My Graphic Syntax Diagram of GEDCOM v5.5 was produced using an open source tool. It is partial and still high level. I did put in a construct so that you can clearly see all 128 standard tags. The Graphic Syntax Diagrammer is an excellent tool. I will have to offer the author a suggestion for the PNG images that it outputs. I need to take my diagram and manually edit it to make the drawing a better fit for 8.5″ x 11.0″ (aka US Letter) paper. I need to graphically wrap the railroad tracks and add page breaks so that the image is itself usable for viewing/discussions. I will offer this sample drawing to any interested parties, including emailing the edited product to Ryan Heaton and Dallan Quass [who, since they did not request it, can feel free to ignore it].

My goal is to make minor tweaks to GEDCOM v5.5 via this diagram [not programming] and try to get DallanQ to produce a one-off parser for it (call it, say, GEDCOM 5.5.999). I hope that my tweaks will not undo Dallan’s hard work of achieving 94% compatibility. If they turn out to have virtually no effect on the 94% compatibility of his Open Source parser, then I can think about getting some software vendors to adopt the enhancements (via end user requests). Since the tweaks are trivial, the point is just to move the standard forward and to interest the vendors in looking at how we create a new Open Standard for GEDCOM.

P.S.

Thanks to Tamura Jones, I now know I need to update my diagram to GEDCOM v5.5.1 first.

February 12, 2012

GEDCOM Standards – Where Genealogy Meets Technology — #Genealogy, #Technology, #Standards

by C. Michael Eliasz-Solomon

Stanczyk has been churning since about November of last year (2011). I have a number of ideas rummaging around my brain for genealogy apps. For over a quarter century, I have been a computer professional and have used and/or developed a lot of programs using a myriad of technologies. At my core, I am a data expert: design it, store it, query it, manage it, analyze it and protect it. It being the data.

Before going to #RootsTech 2012, I knew GEDCOM was the core of our hobby/business/research. GEDCOM is our defacto standard. It is how data is exchanged between us and our various programs. I say defacto because, as standards go, it is not a very open standard (one organization “owns” it, and the rest of us go along with it). It also has not changed in about a decade and a half, so Ryan Heaton was correct in calling it “stale”. It does still work ... mostly. But if a standard does not progress, then you get a lot of proprietary “enhancements” that prevent the complete interchange of data, since one vendor does not know how to deal with another vendor’s file in its totality.

At present, GEDCOM maxes out at version 5.5, although there are various other variations you might  see. But 5.5 was the last standard version. I counted 128 total tags and a provision for creating non-standard tags (they start with an underscore).

[Mike thanks to Tamura Jones! Even though GEDCOM v5.5.1 was never finalized, it IS the defacto max version of GEDCOM. GEDCOM v5.5.1 added 9 tags, removed the BLOB tag, so we now have a total of 136 tags.   -- I will need to update even my high level graphic syntax diagram]

Tags are like:

INDI,   FAMC,   FAMS,   SOUR,   REPO,   HEAD,   TRLR    etc.   -or-      ALIA,   ANCE

The first bunch is familiar and is probably in your family tree (if you ever exported the GEDCOM file). The ALIA tag is one that Dallan Quass said was universally used wrong by all programs. After seeing its definition, I can see how it is confusing. As for the ANCE tag, I do not recall seeing any program offer me functionality that might utilize it. This tag is probably one of those tags that Dallan said is not used at all.
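To make the tag discussion concrete, here is a minimal sketch of how one GEDCOM line breaks down into a level number, an optional cross-reference id, a tag, and an optional value. The names `GedcomLine`, `parse_line` and `is_custom_tag` are my own illustrations (not from any real parser, and not Dallan's), and the sample person is made up.

```python
import re
from typing import NamedTuple, Optional

# level, optional @XREF@ id, tag, optional value (everything after one space)
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: Optional[str]


def parse_line(text: str) -> GedcomLine:
    """Split one GEDCOM line into its parts; raise ValueError if it does not fit."""
    match = LINE_RE.match(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {text!r}")
    level, xref, tag, value = match.groups()
    return GedcomLine(int(level), xref, tag, value)


def is_custom_tag(tag: str) -> bool:
    """Non-standard (vendor) tags start with an underscore."""
    return tag.startswith("_")


if __name__ == "__main__":
    # Hypothetical sample record, for illustration only.
    for raw in ["0 @I1@ INDI", "1 NAME Jan /Kowalski/", "1 FAMS @F1@", "0 TRLR"]:
        print(parse_line(raw))
```

This only handles the line shape; it knows nothing about which tags are legal where, which is the part the standard (and the railroad tracks) would have to supply.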

I looked at the “MULTIMEDIA” section of the standard. It looks like it is woefully out of date and probably not used at all (at least not in any standard way), which is probably why our pics, audio, and video (or any other media file like PDF, MS Word) do not move with the GEDCOM. Has any program ever used the ENCODING/DECODING of a multimedia file? The standard seems to imply a buffer of only 32K (for a line), and even if you used a large number of CONC tags strung one after another, you would need 100 lines to store a 3.2MB file in-line in the GEDCOM. I do not think I have seen that in a GEDCOM. Programs probably store these binary large objects (BLOBs) outside the GEDCOM and refer to their path on the computer/network. I did some noodling. I have 890 MB (or approximately 890,000 KB) in pictures and scanned source documents for about 1,000 people in my family tree. So I use nearly a gigabyte (1GB) for my family tree and all other multimedia, and I do not have any audio or video! So I use almost 1MB/person.
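The noodling above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it takes the 32K line limit at face value as 32,000 bytes (my assumption) and ignores any overhead from encoding binary data as text, which would make the counts larger.

```python
import math

LINE_LIMIT_BYTES = 32_000  # assumption: the "32K" line buffer read as 32,000 bytes


def lines_needed(file_bytes: int, line_limit: int = LINE_LIMIT_BYTES) -> int:
    """Number of full-length lines needed to hold a file in-line."""
    return math.ceil(file_bytes / line_limit)


def megabytes_per_person(total_mb: float, people: int) -> float:
    """Average media footprint per person in the tree."""
    return total_mb / people


if __name__ == "__main__":
    print(lines_needed(3_200_000))          # one 3.2MB scan
    print(lines_needed(890_000_000))        # the whole 890 MB collection
    print(megabytes_per_person(890, 1000))  # average per person
```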

If we did have this magical new GEDCOM standard that could carry all of our multimedia from one GEDCOM program to another GEDCOM program, the copying would take a long time. If I uploaded/downloaded it to/from the Internet, I might incur an overage on my ISP’s usage charges, if this were technically feasible! Imagine if I did this multiple times a month (as I got updates). I am beginning to understand why no vendor has tackled the problem. I would also like to store PDFs and other documents besides the GIF/JPG/PNG files that can be displayed natively in a browser. Those are not a part of the existing GEDCOM standard. Let me sling some jargon: I’d want to store any file type that has a MIME type definition, that I can currently embed in emails or utilize in Java programs, or that the HTML5 standard will allow for multimedia.

GEDCOM 5.5 was in its infancy in dealing with character sets. It was predominantly ASCII with some funky ANSEL coding of characters to handle Latin alphabet diacriticals, although it is not clear how I would do the data entry for those, and it looks incomplete. It did mention UNICODE, but only cursorily and just to remind us that the lengths in the GEDCOM standard were in ‘characters’, not bytes, which was correct. Even so, multibyte characters (say in Hebrew, Russian, Japanese or Chinese) would quickly use up the 32K byte line buffer limit, which would effectively become about 8K characters per line. In fact, GEDCOM 5.5 says it will only deal with LATIN alphabets and leave Cyrillic, Hebrew and Kanji for some far flung future. Stanczyk is Slavic; I need UNICODE to represent my ancestors’ names and places. Fortunately, I do not feel the need for Cyrillic (Russian, Ukrainian, Belorussian, Macedonian, etc.) or I’d be out of luck. I’ll just use the Polish version of those names in their ‘Latinized’ forms.
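The characters-versus-bytes point is easy to see in a few lines. This sketch assumes UTF-8 as the encoding (my choice for illustration; the standard's own encodings differ) and a worst case of 4 bytes per character, which is where a figure of roughly 8K characters for a 32K byte buffer comes from.

```python
def utf8_bytes(text: str) -> int:
    """Bytes needed to store the text in UTF-8 (an assumed encoding)."""
    return len(text.encode("utf-8"))


def worst_case_chars(line_limit_bytes: int, bytes_per_char: int = 4) -> int:
    """Characters guaranteed to fit if every character takes bytes_per_char."""
    return line_limit_bytes // bytes_per_char


if __name__ == "__main__":
    # The same given name in Latinized, Polish, and Russian spellings.
    for name in ["Wladyslaw", "Władysław", "Владислав"]:
        print(name, len(name), "characters,", utf8_bytes(name), "bytes")
    print(worst_case_chars(32_000))
```

A length limit stated in characters is the right idea, but any fixed byte buffer behind it shrinks as soon as the names leave plain ASCII.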

Oh, that is another area where the standard needs to be enhanced: NAMES. Dallan mentioned that Personal Names do not get a thorough treatment in the standard (I am refusing to read the data model, and I am a Data Architect). Location Names get almost no treatment, although the standard does give you a place to store your locations (the PLAC tag). What language should I use? After all, my ancestors are from POLAND, for God’s sake. Besides the obvious Polish, I have German, Russian and Latin to deal with, and being American I prefer English. Slavic names often do not translate well. For example, Wladyslaw is Ladislaus in Latin, but in English there is no equivalent; maybe that is why my ancestors used ‘Walter’ instead. But the point is, how should I store the name? Can I store all of the equivalents and search on any of them? Nope.

Damn, Russian is Cyrillic. GEDCOM doesn’t deal with non-Latin alphabets, and even though I can read the Russian genealogy records, I’d rather not, nor would I want to try to do data entry that way either. Besides, the communists reformed the language in 1918 (making War & Peace considerably shorter in Russian); that reform eliminated several characters. Most modern software is not aware of the eliminated characters, much less able to generate them. This whole Language/Unicode/Name thing is complicated, and I have not even mentioned the changing borders, the renaming of cities in different languages or over time, or their changing jurisdictions. I cannot fault GEDCOM for all of these woes. I have them in my own research and I have not yet found any satisfying way to handle them. I find it helps to have a very good memory and keep these things in my head, but there is no backup for that.

How are we ever going to arrive at the vision Jay Verkler put forth at #RootsTech? GEDCOM needs to become an open standard. Once it is standardized again, it needs to become modern again and deal with current technology, so we can get around to the tough problems of conforming names, places, sources/repositories and calendars/dates, and of doing complex analyses like Social Network Analysis as a way to gather wayward ancestors into a family for which we lack the documentation to prove the connection (genealogically). I hope the future includes Beider-Morse phonetic matching and can deal with folding diacritical characters into a base character (ex. change ę into e) for searches.
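Folding diacriticals for search is something today's tools can already approximate. The sketch below uses Python's standard `unicodedata` module: decompose each character, then drop the combining marks. One wrinkle worth knowing for Polish: ł and Ł have no Unicode decomposition, so they need an explicit mapping (the `EXTRA_FOLDS` table is my own addition and covers only those two letters). This is plain folding, not Beider-Morse matching, which is a much larger piece of work.

```python
import unicodedata

# Letters that do not decompose into base letter + combining mark.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L"})


def fold_diacritics(text: str) -> str:
    """Return text with diacritical marks removed, for search comparisons."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA_FOLDS))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_for_search(a: str, b: str) -> bool:
    """Compare two names ignoring case and diacriticals."""
    return fold_diacritics(a).casefold() == fold_diacritics(b).casefold()


if __name__ == "__main__":
    for word in ["ę", "Władysław", "Łódź"]:
        print(word, "->", fold_diacritics(word))
```

A search index would store the folded form alongside the original, so the name is still displayed as the ancestor wrote it.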

FamilySearch, if you are going to register GEDCOM tools, then please do a few more things for the NEW standard. First, make each vendor add to an APPENDIX the name and complete definition of their NON-STANDARD tags, in case anyone else wishes to implement or deal with them. Put a section in the header (HEAD tag) that lists all NON-STANDARD tags (just once each) along with their vendor, so that someone else can go look at the standard, see what these tags mean, and possibly implement the good ones. Forget that two byte thing before the HEAD tag. Just make the HEAD tag’s CHAR sub-tag indicate the character set (ANSI | ANSEL | UNICODE). Please administer a #RootsTech keynote to vote on annual changes to the GEDCOM standard. Provide a GEDCOM validator and also a GEDCOM converter webpage to allow users/vendors to validate/convert their GEDCOM file(s).
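Gathering the list of NON-STANDARD tags for such a header section would be simple. Here is a rough sketch that scans a GEDCOM text and reports each underscore tag once, in order of first appearance. The function name `custom_tags` and the sample tags `_PHOTO` and `_UID` are illustrative assumptions only; no vendor's actual tag definitions are implied.

```python
from typing import List


def custom_tags(gedcom_text: str) -> List[str]:
    """Return each underscore-prefixed tag once, in order of first appearance."""
    found: List[str] = []
    for raw in gedcom_text.splitlines():
        parts = raw.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # blank or malformed line: skip it
        # The tag is the second field, or the third when an @XREF@ id is present.
        tag = parts[2] if parts[1].startswith("@") and len(parts) > 2 else parts[1]
        if tag.startswith("_") and tag not in found:
            found.append(tag)
    return found


# Hypothetical sample file, for illustration only.
SAMPLE = """0 HEAD
1 CHAR UNICODE
0 @I1@ INDI
1 NAME Jan /Kowalski/
1 _PHOTO portrait.jpg
1 _UID 12345
0 @I2@ INDI
1 _PHOTO wedding.jpg
0 TRLR
"""

if __name__ == "__main__":
    print(custom_tags(SAMPLE))
```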

Make multimedia be meta-data and allow users to define “LOCATIONS” where multimedia files can be found using either a PATH or a URL (or a relative path / URL). Make it a part of the standard that the meta-data must move, but the multimedia files can optionally stay put. Multimedia should be able to be placed on a LOCAL/NETWORK drive, on the INTERNET, or on removable multimedia volume(s) [thumb drives, CDs, DVDs, etc.]. Make the multimedia “LOCATIONS” editable so a user can switch between LOCAL/NETWORK, INTERNET, or REMOVABLE, including using some of each type of LOCATION. Allow these files to exist or not (show “UNAVAILABLE” or some equivalent visual cue if they are accessed and do not exist). The mapping between an Individual (INDI) or a family (FAM) or some other future GROUP and its multimedia file(s) must move as a part of the meta-data (even if the multimedia file(s) do not). That way the end-user need only edit his LOCATIONS meta-data (and ensure the files are in that/those location(s)) when he runs the software.
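The LOCATIONS idea can be sketched as a small data structure: the links from records to relative file names travel with the tree, while the list of base locations is whatever the current user has configured. Everything here (`MediaLink`, `MediaCatalog`, the field names) is hypothetical, and to keep the sketch short it only checks local, network or removable paths; a URL location would need a network check that is left out.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class MediaLink:
    owner: str     # record id the file belongs to, e.g. "@I1@" or "@F1@"
    filename: str  # path relative to whichever location holds the file


@dataclass
class MediaCatalog:
    locations: List[Path] = field(default_factory=list)   # user-editable
    links: List[MediaLink] = field(default_factory=list)  # moves with the tree

    def resolve(self, link: MediaLink) -> Optional[Path]:
        """Return the first location where the file actually exists, if any."""
        for base in self.locations:
            candidate = base / link.filename
            if candidate.is_file():
                return candidate
        return None

    def describe(self, link: MediaLink) -> str:
        """Path to show the user, or UNAVAILABLE when no location has the file."""
        found = self.resolve(link)
        return str(found) if found is not None else "UNAVAILABLE"
```

Moving the tree to another computer then means copying the links unchanged and editing only `locations`.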

Define an API for GEDCOM plug-ins so that new software can access the GEDCOM without parsing the gedcom file. The API should give the external plug-in a wrapped interface to the underlying data model without having to know the data model, just the individual, family, or location, or a name list of individuals, families, or locations. This will allow new software to provide additional functionality to a family tree or to provide inter-operability between trees/websites. Obviously security/privacy rules would limit this kind of  plug-in access.
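What such a plug-in API might look like, in the roughest terms: the plug-in is written against a small read-only interface and never sees the file or the data model behind it. The interface `TreeReader`, the `Person` record and the `private` flag are all my own inventions for illustration; a real API would need families, locations and a much richer privacy model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class Person:
    id: str
    name: str
    private: bool = False  # e.g. a living person hidden from plug-ins


class TreeReader(Protocol):
    """The only surface a plug-in is allowed to touch."""

    def person(self, person_id: str) -> Optional[Person]: ...

    def people(self) -> List[Person]: ...


class InMemoryTree:
    """One possible host-side implementation; it enforces the privacy rule."""

    def __init__(self, persons: List[Person]) -> None:
        self._by_id: Dict[str, Person] = {p.id: p for p in persons}

    def person(self, person_id: str) -> Optional[Person]:
        found = self._by_id.get(person_id)
        return None if found is None or found.private else found

    def people(self) -> List[Person]:
        return [p for p in self._by_id.values() if not p.private]


def name_list(tree: TreeReader) -> List[str]:
    """A sample plug-in: an alphabetical name list, using only the interface."""
    return sorted(p.name for p in tree.people())
```

The plug-in (`name_list`) would work unchanged against any program that implements the same two methods, which is the inter-operability being asked for.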

That’s Stanczyk’s vision of the GEDCOM future!

February 5, 2012

Is GEDCOM Dead? Date/Place of Death, Please?

by C. Michael Eliasz-Solomon

The RootsTech Conference is living up to its name. Everywhere there was a sea of iPhones/Androids, iPads (in huge numbers), and laptops. Even the very elderly were geared up. Google, Dell, and Microsoft were at RootsTech, so why not Apple, especially since their customers were present in LARGE numbers??? [Note to Tim Cook: have Apple sponsor and show up as a vendor.]

According to Ryan Heaton (FamilySearch), “GEDCOM is stale.” He went on to speak about GEDCOMX as the next standard, as if GEDCOM were old and/or dead. They were not even going to make GEDCOMX backwards compatible! In a later session I had with Heaton, I asked the million dollar question: “How do I get my GEDCOM into GEDCOMX?” After a moment’s pause he said they’d write some sort of tool to import or convert the existing GEDCOM files. Well, that was reassuring??? So they want GEDCOMX to be a standard, but FamilySearch are the only ones working on it and they have not had the ability to reach out to the software vendors yet (I know, I asked).

My suggestion was to publish the language (like HTML, SQL, or GEDCOM). I asked for “railroad tracks”, what we used to call finite state automata and what Oracle uses to document SQL syntax: statements that are valid, with options denoted, and even APIs for embedding SQL into other programming languages. With those, it is easy to write a parser or something akin to a validator (like W3C has for HTML).
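To show how little it takes once the grammar is written down, here is a toy validator that checks just three structural rules: the file opens with a level-zero HEAD, closes with a level-zero TRLR, and no line's level is more than one deeper than the line before it. The function name `validate` and the wording of the messages are mine; a real validator would check every rule in the published syntax, not these three.

```python
from typing import List


def validate(gedcom_text: str) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    lines = [line for line in gedcom_text.splitlines() if line.strip()]
    if not lines:
        return ["file is empty"]

    problems: List[str] = []
    previous_level = 0
    for number, raw in enumerate(lines, start=1):
        parts = raw.split()
        if not parts[0].isdigit():
            problems.append(f"line {number}: missing level number")
            continue
        level = int(parts[0])
        if level > previous_level + 1:
            problems.append(
                f"line {number}: level jumps from {previous_level} to {level}"
            )
        previous_level = level

    if lines[0].split()[:2] != ["0", "HEAD"]:
        problems.append("first line is not 0 HEAD")
    if lines[-1].split()[:2] != ["0", "TRLR"]:
        problems.append("last line is not 0 TRLR")
    return problems
```

Run against a user's export, the list of problems is exactly the kind of report a validator webpage could show.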

Dallan Quass took a better tack on GEDCOM. His approach was more evolutionary, rather than revolutionary. He collected some 7,000+ gedcoms and wrote an open source parser for the current GEDCOM standard (v5.5). He analyzed the flaws in the current standard and saw unused tags, tags like ALIA that were always used wrong, custom tags and errors in applying the standard. He also pointed out that the concept of a NAME is not fully defined in the standard and so is left to developers (i.e. vendors) to implement as they want. These were the issues making gedcoms incompatible between vendors. He said his open source parser could achieve 94% round trip from one vendor to another vendor.

[Image: GEDCOM Tags]

Now that made the GEDCOMX guys take notice — here was their possible import/conversion tool.

The users just want true portability of their own gedcoms and the ability to not have to re-enter pics, audio, movies over and over again. RootsTech’s vision of APIs that would allow the use of “authorities” to conform names, places, and sources would also help move genealogy to the utopian future Jay Verkler spoke of at the keynote. APIs would also provide bridges into the GEDCOM for chart/output tools, utilities (merge trees), and Web 2.0 sharing across websites / search engines / databases (more utopian vision).

GEDCOM is the obvious path forward. Why not improve what is mostly working and focus on the end users and their needs?

FamilySearch, get vendors involved and, for God’s sake, get Dallan Quass involved. Publish a new GEDCOM spec with RailRoad tracks (aka Graphic Syntax Diagrams) and then educate vendors and users on the new gedcom/gedcomx. Create a new gedcom validator and let users run their current gedcoms against it to produce new gedcoms (which should be backward compatible with old gedcoms, to get at least the 94% compliance that Quass can already achieve)!

Ask users for new “segments” in the railroad tracks to get new features that real users and possibly vendors want in future gedcoms. Let there be an annual RootsTech keynote where all attendees can vote via the RootsTech app on the proposed new gedcom enhancements.

How about that, FamilySearch? Is that doable? What do you, my readers, think? Email me (or comment below).


P.S.
Do NOT use UML models to communicate the standard. They are simply not accessible to genealogists. Trust me, I am a Data Architect.
