Responses – Exploring Gedcom — #Technology, #Genealogy

by C. Michael Eliasz-Solomon

Mail Room

The mailroom received three emails / comments from the “Exploring Gedcom” article.

Tamura Jones (Modern Software Experience), Louis Kessler (BeholdGenealogy.com), and Stan Mitchell (GenApps.net / ezGED Viewer).

Good solid GEDCOM experts all (unlike this jester who is only a journeyman apprentice to these fine men) and as you can see they are also bloggers themselves.

Here’s my summation:

  1. WHITESPACE – All three disputed the whitespace proposal. Even though it was an optional feature toggled by an on/off check-box, I yield to what I see as a rising tide that I cannot swim against. I assume that XML will be treated just as badly by all for EXACTLY the same reasons: it is too verbose and makes for a poor data transfer mechanism because of the bloat.
  2. UNICODE – Tamura pointed out that it is in the 5.5.1 standard. I said maybe so, but it is hardly implemented and needs to be mandatory. I also hate the two-byte binary debris (the byte-order mark) that turns an otherwise TEXT file into a partially binary file. Tamura points out that PC editors commonly hide this byte-order indicator (I am old school and use vi, where nothing is hidden). Besides, I argued, the CHAR sub-tag of the HEAD record could be used to determine the character set and keep the file textual. Tamura said that would be a catch-22 (since you do not know the file's encoding before you read it). Tamura also points out that everyone does as I suppose anyway and reads the header as ASCII (or UTF-16LE or UTF-16BE) to determine the encoding. The documentation needs to be updated too. There is really almost no support for generating UNICODE characters in apps; it should be required, otherwise data entry is limited to clever users (i.e. tech-types or their friends).
  3. DATES – Nobody liked DATES (or NAMES) as a zero level. I can live without dates, as I can always create a dimension with every possible date to slice & dice my genealogy facts. Names were also not part of my vision, other than that I want a bunch of AKA names for an INDI.
  4. LOCN – Everyone agreed that the PLAC tag provides a minimalist capability. Two of the three saw a good reason to have LOCN at the zero level, as I proposed. Let’s hope this feature gets in. We may need to keep the PLAC tag for backwards compatibility until all gedcoms have been converted.
  5. EVENTS as a zero level tag seemed to interest people, as MULTI-PERSON events (aka EVENT_TYPE_FAMILY) are NOT adequately dealt with in GEDCOM 5.5.1. I also think people want to standardize these events as much as possible and leave open the ability for a user to add their own events. EVENTS was also related to GROUPS, which people seem to want in some fashion. Analyzing a social network needs better GROUP/EVENT/ROLE visibility than the current standard provides. I think we really need EVENT and EVENT_TYPE tags to keep from adding a new GEDCOM tag every time someone says we need a new event (BIRT | CHR | BAPM | BARM | BASM | BLES | SLGC). All TYPE tags should be from a standard list that a user can add to. The list should be allowed to be localized (into a native language). This keeps parsing to a minimum while allowing for expansion. OLD tags are kept for backwards compatibility until a gedcom is upgraded to a later version. I think we also need a ROLE_TYPE to replace ROLE_IN_EVENT and add more standard roles (e.g. GodMother, GodFather, Witness, Neighbor, MidWife, Rabbi, Brother, Sister, Aunt, Uncle, Cousin, Boarder, etc.), and this should also be localized and user upgradeable. Keep EVENT_TYPE_FAMILY and EVENT_TYPE_INDIVIDUAL for backwards compatibility.
  6. DOCUMENTS (deferred) – The case was not made nor was the concept adequately explained. To Stanczyk this is related to multimedia: it makes it possible to locate all documents of a certain type (e.g. Declaration of Intent), each with its date, location and a locator to the multimedia representation, and to pull these documents out in their own right regardless of the individual, family or group.
  7. GROUPS (deferred) – Some interest in defining a non-family group (e.g. military unit, college fraternity/sorority, religious society, etc.). These groups would be interesting to study in their own right. At RootsTech 2012, this seemed to be a novel idea that the audience received positively.

GROUPS is not intended for a non-traditional family unit which needs some thought and design in this Modern Family World.
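The catch-22 from the UNICODE item above can be sketched in a few lines. This is only a rough illustration, not how any particular genealogy program does it: the function name sniff_gedcom_encoding is made up for this post, and the fallback for character sets Python has no codec for (such as ANSEL) is my own assumption. The idea is that a byte-order mark, if present, settles the matter; without one, the fact that every file begins with "0 HEAD" lets us tell UTF-16 apart from ASCII-compatible encodings, and only then can the CHAR sub-tag be read.

```python
import codecs
import re


def sniff_gedcom_encoding(raw: bytes) -> str:
    """Guess a Python codec name for raw GEDCOM bytes (hypothetical helper)."""
    # 1. A byte-order mark, if present, settles the question.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # this codec consumes the BOM and picks the byte order
    # 2. No BOM: the file starts with "0 HEAD", so the zero byte next to
    #    the leading "0" (0x30) reveals a UTF-16 byte order.
    if raw.startswith(b"\x30\x00"):
        return "utf-16-le"
    if raw.startswith(b"\x00\x30"):
        return "utf-16-be"
    # 3. Otherwise the header is ASCII-compatible; read it and look for CHAR.
    head = raw[:4096].decode("ascii", errors="replace")
    match = re.search(r"^\s*1 CHAR (\S+)", head, flags=re.MULTILINE)
    declared = match.group(1).upper() if match else "ASCII"
    # Assumption: values without a standard-library codec (e.g. ANSEL)
    # fall back to latin-1 so the file can at least be opened.
    return {"UTF-8": "utf-8", "ASCII": "ascii"}.get(declared, "latin-1")
```

A caller would then do something like raw.decode(sniff_gedcom_encoding(raw)) before parsing any records.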

Tamura chided me that many of my “wants” are a part of an offshoot of GEDCOM called GEDCOM 5.5EL (Extended Location GEDCOM derived from version 5.5). The only difference is that I want to get rid of the need for _LOC (a custom tag under the GEDCOM definition) and use LOCN instead. I would also want their undefined tag called NAMC (possibly renamed NAMA for NAMe Alias or NAMe Alternate) to be 0:M, meaning that you can have zero, one or many alternate/alias names for this LOCN (or INDI, why not?).

Also the NAMC (or NAMA) should have sub-tags FONE and FONETYPE (soundex, DM, Beider-Morse, etc.) to aid in advanced searches or Google searches. But this is the argument for NAMES at zero level. The last names are usually where the soundex/phonetic matching needs to be stored. We do not need to repeat this data for each individual (INDI), just for each unique last name or alternate name. These things get created in Surname Index pages; how much easier if the NAMES (re surnames and alternate surnames) had a zero level with the FONE and FONETYPE, and the INDI had an XREF pointing to each NAME/NAME ALTERNATE s/he had used during their life. One might even hope that the zero level name had an XREF to each INDI too. If I were the Data Architect for GEDCOM, I would have a zero level NAMES (for SURNAMES) with their soundex/phonetic codes, plus XREFs back and forth to INDI.
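To make the zero level NAMES idea a little more concrete, here is a minimal sketch. It assumes the standard American Soundex rules; the record layout (the FONE, FONETYPE and INDI keys), the function names and the @I1@-style xrefs are all invented for illustration and are not part of any GEDCOM standard. The phonetic code is computed once per unique surname, and each surname record carries the xrefs of the INDIs who used it.

```python
# Letter groups for American Soundex; vowels, H, W and Y carry no code.
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(surname: str) -> str:
    """Return the 4-character American Soundex code for a surname."""
    # Only plain A-Z is handled; accented letters are skipped in this sketch.
    letters = [c for c in surname.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def build_surname_records(individuals: dict) -> dict:
    """Map each unique surname to one record holding its code and INDI xrefs."""
    records = {}
    for xref, surnames in individuals.items():
        for surname in surnames:
            record = records.setdefault(
                surname,
                {"FONE": soundex(surname), "FONETYPE": "soundex", "INDI": []},
            )
            record["INDI"].append(xref)
    return records


# Made-up data: one person used two spellings during their life.
records = build_surname_records({"@I1@": ["Eliasz", "Elias"], "@I2@": ["Eliasz"]})
```

Both spellings end up sharing one phonetic code, stored once per surname rather than once per person, which is the redundancy argument made above.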

I need to cut this response short, but a great thanks to all who read the article and the above three for improving my thoughts by their comments/emails.

P.S. – If you follow the GEDCOM 5.5EL link, you will notice they show their gedcom indented and then go on to say about the whitespace:

“improves readability … should (and will) not be performed at real Gedcom-output.”

My sentiments exactly. Sometimes you just need to show the gedcom (and indentation improves the presentation and understanding). I too never intended it to be processed into the gedcom output/exported to a file for transport (just for my own personal examination or writing purposes). However, I have capitulated on whitespace — so please no more email on whitespace and bloat.
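For the curious, indentation for display only is a few lines of code. The sketch below is my own illustration (the function name indent_for_display and the two-space default are arbitrary choices); it reads the level number at the start of each line and pads accordingly, leaving the stored or exported file untouched.

```python
def indent_for_display(gedcom_text: str, width: int = 2) -> str:
    """Indent each GEDCOM line by its level number, for viewing only."""
    shown = []
    for line in gedcom_text.splitlines():
        stripped = line.strip()
        level, _, _ = stripped.partition(" ")
        # Lines without a leading level number are shown unindented.
        depth = int(level) if level.isdecimal() else 0
        shown.append(" " * (width * depth) + stripped)
    return "\n".join(shown)


sample = "0 @I1@ INDI\n1 NAME John /Doe/\n2 GIVN John"
print(indent_for_display(sample))
```

The output is meant for the screen or a blog post, never for the file that gets transported.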

4 Comments to “Responses – Exploring Gedcom — #Technology, #Genealogy”

  2. Louis Kessler (@louiskessler)
    February 26, 2012 at 6:27 pm

    Stanczyk:

    Hmmm. Lots to chew on. Overall, you’re proposing “tweaks” to GEDCOM similar to (but different from) the tweaks I propose. Both of us differ from the vast majority who want to do a total re-write.

    There is no difference between GEDCOM syntax capability and XML capability and JSON capability. One can be mapped into the other. A simple mechanical method can be used to translate a GEDCOM file into an equivalent GEDCOM-XML file. If the standard is written in one, it is effectively written in the others as well. Any program should be allowed to input one or the other, because there will be programs written that will translate between them. The syntax and prettiness and size of file is irrelevant. The structure is what’s important.

    Who cares about readability? GEDCOM is meant for data transfer. For that reason, you want to make it easiest for the programs to read – not for humans. I know you’ve already capitulated on whitespace, but you continued on about other niceties like alternative localized tags and omitting the BOM (Byte Order Mark) to make it easier for people viewing the file. All these considerations simply make it harder for the programmer. For us, the simpler you can make it, and the fewer options there are to choose from, the better.

    Your “DOCUMENTS” are no different than the MULTIMEDIA_RECORD (OBJE) that is already in GEDCOM. So it’s there, and just in need of tweaks.

    I see what you are trying to do by putting Dates, Location, Events, Documents and Groups at a record level. This allows information to be attached to them. But you’ve got to be careful not to over-normalize the GEDCOM structure like this. Too much disaggregation will make it tough for programmers to support. Most existing genealogy software doesn’t have a place for this data in its databases. Vendors won’t want to add a new data structure just to support a standard or just to allow a soundex to be transferred between programs. It’s got to be something important and truly required – not just something that would be nice to have.

    Louis

    • Louis,
      I like regular contributors. Thanks! I must be brief:

      1) We agree – Evolve Gedcom (Thou shalt not lose the work of prior researchers)
      2) We agree – gedcom is isomorphic to JSON or XML (and HTML). It is also isomorphic to 3NF (Third Normal Form, a relational db modeling technique) or my preference BNF (Backus-Naur Form, a rigorous 3NF). So it is isomorphic to 3NF, BNF, UML, SQL (at least Create-Table and similar DDL). These are implementation details. I leave it to each developer to use whatever isomorphic projections you wish, but GEDCOM is the core we all start from and is the lowest common denominator, agreed?

      All these things are languages represented by grammars. Hence my obsession with Graphic Syntax Diagrams to speak to all people (USERS and DEVELOPERS, albeit I expect most users have glazed eyes by now and don’t care how we get there).

      3) We disagree – on what a data model at 3NF would do. As a data architect who started as a coder back in high school 35 years ago, I have learned that a simple and complete model makes programming simpler — not harder. You have shrewdly seen that I am after making GEDCOM conform to a model that could be easily modeled in 3NF/BNF (isomorphic to UML for Ryan Heaton / GEDCOMX) with an eye to making an end user’s data entry and maintenance minimalist and reducing possibilities for errors due to redundancies that never get cleaned up everywhere.

      Tamura chided me for my 1,020 INDIs being a small family tree (isomorphic to gedcom). I have too many redundancies to clean up; what must the people with LARGE family trees think? I also have an eye towards making a STAR model (where I denormalize the 3NF data that the users enter into a star schema that can be queried/analyzed). This is the domain of my professional career and what I understand well. The “vision” is similar to what Ancestry does or what GENI does. Mash up a bunch of family trees together in a transactional (3NF) model, cleanse data and conform dimensions and do all of the 38 steps that Ralph Kimball says we need to make/manage a star schema, then do really cool things with genealogy.

      All we need is a rich, consistent standard GEDCOM and then we also need standard APIs into the transactional model (in whatever geek tech the vendor chooses) and standard APIs into the star schema (in another geek’s dream technology) and we should be able to do miraculous things with genealogy and by projection into other fields (like history, which is isomorphic to genealogy). Every genealogy vendor that wants to tackle some problem space would be enabled to do so. But we must have a STANDARD (possibly ISO). No foundation, no building this Tower of Babel to Heaven. Actually that’s a pretty good analogy!

      Our common language has become befuddled and it keeps the engineers from continuing to build toward their goal. So I guess gedcom is isomorphic to building a Tower to Heaven too.

      Forgive this jester, I have failed again at brevity.

  3. Stanczyk:

    To be brief, I agree with what you say.

    Louis
