Meme: Exploring GEDCOM – Gedcom Lines — #Genealogy, #Technology, #Mashup

by C. Michael Eliasz-Solomon

Stanczyk wants to introduce a new meme, “Exploring GEDCOM”. I was musing upon why the state of the GEDCOM standard is so CHAOTIC. GEDCOM has languished for about a decade and a half now with no new standard, hence my article, “Is GEDCOM dead?” (2/5/2012). I was left in a perplexed state after RootsTech 2012. Why is FamilySearch working on a “standard” in a vacuum? Why is there so little communication with the existing software vendors, the purveyors of GEDCOM? And why do the end users have no voice in what is needed in a GEDCOM standard?

So I decided that GEDCOM needed an Evangelist. I believe there is already a plethora of GEDCOM Evangelists, so perhaps I will just add to the milieu (or is it the meme). To be frank, most GEDCOM Evangelists are really GEDCOM complainers. Nay, I think we are all complainers, because there are no GEDCOM complimenters, not even amongst the GEDCOM purveyors. Even FamilySearch, which “owns” GEDCOM (how can that be a standard?), wants to make their latest effort (GEDCOMX) a “clean sheet” project. No backwards compatibility even!

Is GEDCOM just an ugly baby whose parentage is in doubt?

So this meme is on Exploring GEDCOM. What is it? How can it be improved? What should a TRUE gedcom standard include? I’ll probably write one to four times a month on this meme until I have exhausted myself on this topic. My ultimate goal is to get this to be a part of RootsTech and to be an OPEN STANDARD with an open, transparent definition and process for change, which I hope to have tied to RootsTech attendees voting on this, possibly via the RootsTech App.

Allow non-attendees to vote if they register who they are, their role (genealogist, technologist, software vendor, etc.), and why they want to be a voter. I think conference attendees (genealogist, technologist, vendor-of-any-kind, or organizer) get an automatic vote, prior attendees get a vote, and gedcom software vendors get a vote. All prior voters get to vote in all future votes on the open standard (as long as their email address works, or once it is corrected again). OPEN STANDARD means that all stakeholders need to have an opportunity to influence the standard.

Let me start the Meme by revisiting graphic syntax diagrams …

I started with this railroad track (2/16/2012) to define a gedcom file. Our discussion will focus upon gedcom v5.5.1 and launch from that rocket pad into some far flung future gedcom feature(s). This diagram was derived from the standard in PDF form. I have attempted to make the standard more “grammatical” and to formalize or resolve ambiguities according to my genealogical/technological world view. We see a HEAD tag, a TRLR tag and an optional SUBN tag with a whole bunch of “gedcom lines”.

Gedcom Line

A V5.5.1 Gedcom Line

This is what a gedcom line looks like. I have added a wish for optional whitespace at the beginning of a line. That is my first proposal. The number at the beginning of each line is meant to be “an outline level”. So I wanted the option of outputting lines with leading blanks corresponding to the indentation appropriate for the outline level, to make it easier to see which inner lines belong to which outer level. Make the whitespace a checkbox on export (directed at you software vendor guys) and default it to off.
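To make the whitespace proposal concrete, here is a minimal sketch in Python. It assumes the usual v5.5.1 line shape (level, optional @XREF@ identifier, tag, optional value) and adds the proposed tolerance for leading blanks. The names `parse_line` and `indent_line` are hypothetical helpers for illustration, not part of any vendor's software or of the standard.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape: level, optional @XREF@, tag, optional value.
LINE_RE = re.compile(
    r"^\s*"                        # proposed: optional leading whitespace
    r"(?P<level>\d{1,2})"          # outline level
    r"(?: (?P<xref>@[^@\s]+@))?"   # optional cross-reference id, e.g. @I1@
    r" (?P<tag>[A-Za-z0-9_]+)"     # tag, e.g. INDI, NAME, DATE
    r"(?: (?P<value>.*))?$"        # optional line value
)


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: Optional[str]


def parse_line(text: str) -> GedcomLine:
    """Parse one gedcom line, ignoring any leading blanks."""
    m = LINE_RE.match(text.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"not a gedcom line: {text!r}")
    return GedcomLine(int(m["level"]), m["xref"], m["tag"], m["value"])


def indent_line(text: str, width: int = 2) -> str:
    """Re-emit a line with leading blanks matching its outline level."""
    line = parse_line(text)
    return " " * (width * line.level) + text.lstrip().rstrip("\r\n")
```

With this sketch, `indent_line("2 DATE 12 MAY 1850")` yields the same line preceded by four blanks, and `parse_line` reads the indented form back to the same level, tag and value, so an export checkbox could turn the blanks on or off without changing the data.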

We see that a gedcom line at its (current) core describes: families, individuals, notes, repositories, sources, submitters & their multimedia (digital documents, notes, memories, etc.). This is still a very high level discussion. We have only spoken of 3 of the 136 tags. But already this jester has a suggestion/complaint. Let me defer a discussion of Multimedia_Records to its own article as this requires many words, a lot of which are jargon. The complaint – we need more zero level tags!

So deferring multimedia, we have six types of records. A software vendor might think of six different tables (or objects) that need to be described and stored as we “parse” each gedcom line in the file that stores our family tree. Do not lose sight of the fact that these files are the family trees of some researcher, not abstract or theoretical data. This is research from current or prior genealogists and it needs to be preserved … without loss.

At its inner core is a set of individuals (INDI tag). I once wrote a PERL script to pull out all individuals with their vital data (B/M/D). Very easy thing to do. I mention this now to illustrate that these compact files are at the intersection of genealogy and technology. These gedcom files are emblematic of the technology / genealogy mashup that is RootsTech! They are also the way we can interface our genealogies with other non-family tree tools to do additional things. Let’s call those gedcom ADD-ONS (or PLUG-INS or APPs). I am hopeful that, with a standard API, they will be able to pull this info out, just like my PERL script pulled out the individuals. That is the essence of an INDIVIDUAL gedcom record.
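To show how little is needed to pull individuals out of a file, here is a rough Python sketch of the same idea. It is not the PERL script mentioned above. It assumes birth and death dates sit in a level 2 DATE line under level 1 BIRT and DEAT lines of an INDI record, and it leaves out marriages, which live in the FAM records. The function name `individuals` and the sample data are hypothetical.

```python
def individuals(lines):
    """Yield one dict per INDI record: id, name, birth date, death date."""
    person, event = None, None
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level = int(parts[0])
        if level == 0:
            if person is not None:
                yield person
            # Level 0 record lines look like: 0 @I1@ INDI
            if len(parts) == 3 and parts[2] == "INDI":
                person = {"id": parts[1], "name": None,
                          "birth": None, "death": None}
            else:
                person = None
            event = None
        elif person is not None:
            tag = parts[1]
            value = parts[2] if len(parts) == 3 else ""
            if level == 1:
                # Remember which vital event the next DATE line belongs to.
                event = tag if tag in ("BIRT", "DEAT") else None
                if tag == "NAME" and person["name"] is None:
                    person["name"] = value
            elif level == 2 and tag == "DATE" and event:
                key = "birth" if event == "BIRT" else "death"
                if person[key] is None:
                    person[key] = value
    if person is not None:
        yield person


# Made-up sample data, for illustration only.
SAMPLE = """0 HEAD
0 @I1@ INDI
1 NAME Jan /Kowalski/
1 BIRT
2 DATE ABT 1850
1 DEAT
2 DATE 1920
0 TRLR""".splitlines()

for p in individuals(SAMPLE):
    print(p["id"], p["name"], p["birth"], p["death"])
```

An add-on built on a standard API would presumably hide this line walking, but the sketch shows why the INDI record is such a natural unit to extract.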

There are also FAMILY gedcom records that are defined by FAM, FAMC, FAMS and the temple ordinance (i.e. LDS) FAMF tags. Likewise, we have NOTE (NOTE), SUBMITTER (SUBM), and REPOSITORY (REPO)/SOURCE (SOUR) records too. I mentioned the FAMC/FAMS tags in addition to FAM, which really equates to the FAMILY-RECORD, in order to point out that an individual is part of two families. S/He is part of a family where they are a child (FAMC) and they are also part of the family where they are a parent (as a SPOUSE, hence FAMS). This is evident when you realize that we are speaking of a family tree and that a tree really goes forward and backward, linking the present to the past (and logically, vice-versa).
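The two-family idea can be sketched in a few lines of Python. This is only an illustration, assuming FAMC and FAMS appear as level 1 lines inside an INDI record and point at FAM records by their @XREF@ identifier. The function name `family_links` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def family_links(lines):
    """Map each individual's id to the families it points at via FAMC/FAMS."""
    links = defaultdict(lambda: {"FAMC": [], "FAMS": []})
    current = None  # id of the INDI record being read, if any
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if parts[0] == "0":
            is_indi = len(parts) == 3 and parts[2] == "INDI"
            current = parts[1] if is_indi else None
        elif (current and len(parts) == 3 and parts[0] == "1"
              and parts[1] in ("FAMC", "FAMS")):
            links[current][parts[1]].append(parts[2])
    return dict(links)


# Made-up sample: @I1@ is a child in family @F1@ and a spouse in @F2@.
SAMPLE = [
    "0 @I1@ INDI",
    "1 FAMC @F1@",
    "1 FAMS @F2@",
    "0 @F1@ FAM",
    "1 CHIL @I1@",
]
print(family_links(SAMPLE))
```

Following FAMC walks the tree toward the past and following FAMS walks it toward the present, which is the forward and backward linking described above.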

What’s Missing? – A Proposal (the first of many)

I am still ignoring MULTIMEDIA, so that is not it. If we believe in Jay Verkler’s RootsTech 2012 vision for genealogy, then we need to conform (i.e. standardize): Dates, Locations, Names. I would also add: Events, Documents, and possibly Groups. So that is six more zero level RECORDS.

DATES I assume need to be standardized because of the many problems: missing date, partial date, estimated date, various calendars, etc.
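As a rough illustration of the date problems listed above, here is a Python sketch that sorts a gedcom DATE value into a few coarse kinds. It assumes the v5.5.1 keywords ABT, CAL and EST (approximate), BEF, AFT and BET (range), FROM and TO (period), and calendar escapes of the form @#DJULIAN@, with Gregorian assumed when no escape is present. The function name `classify_date` and the kind labels are hypothetical, and the sketch ignores interpreted dates, date phrases and dual years.

```python
import re

# Calendar escape such as @#DJULIAN@ or @#DHEBREW@ at the start of the value.
CAL_RE = re.compile(r"^@#D([A-Z ]+)@\s*")


def classify_date(value: str) -> dict:
    """Return a coarse kind and calendar for a gedcom DATE value."""
    text = value.strip().upper()
    calendar = "GREGORIAN"  # assumed default when no escape is given
    m = CAL_RE.match(text)
    if m:
        calendar = m.group(1)
        text = text[m.end():]
    words = text.split()
    if not words:
        kind = "missing"
    elif words[0] in ("ABT", "CAL", "EST"):
        kind = "approximate"
    elif words[0] in ("BEF", "AFT", "BET"):
        kind = "range"
    elif words[0] in ("FROM", "TO"):
        kind = "period"
    elif len(words) == 3:
        kind = "exact"      # day, month and year all present
    else:
        kind = "partial"    # year only, or month and year
    return {"kind": kind, "calendar": calendar}


for sample in ["12 MAY 1850", "MAY 1850", "ABT 1850", "@#DJULIAN@ 3 MAR 1700", ""]:
    print(repr(sample), classify_date(sample))
```

Even this crude sorting shows how many shapes a single DATE value can take, which is the case for conforming dates in the first place.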

NAMES are also a problem area. For example, how do I record my ancestor’s name? Do I conform his name to ENGLISH (i.e. does Piotr become Peter)? Should I record it in his context (i.e. Pawel for Paul)? Should I record it in the language of the record? My ancestors’ records come in Latin (Paulus), Polish, and Russian. Oh, and some of those names do not translate to the other language, so we have adopted names/name changes/nicknames. Then there is the Latin alphabet versus Cyrillic characters versus Hebrew characters, or even just recording diacritical letters like slashed-l (ł).

UNICODE support is a MUST in any new standard.

We also need Locations, Events, Documents, and Groups as zero level “records”, so that we can pull those out of the file, just as I pulled Individuals out of the file. Take Locations (i.e. Biechów, Busko, Kielce, Poland): that is the administrative hierarchy of one of my ancestral villages. Of course, it changed over time and with whoever occupied Poland (or should I view it as Congress Poland/Vistulaland, a part of the Russian Empire with its many gubernias?). Clearly locales have a time component.

I deferred MULTIMEDIA because it is technical and also because I want to make the case that we need EVENTS and/or DOCUMENTS instead. MULTIMEDIA are just NOTES that are not textual, and often this digital media is a representation of some document(s) that documented an event. I also propose GROUPS as a record because people want to record connections to MILITARY units, CHURCH SOCIETIES, SCHOOLS, BUSINESSES/ORGANIZATIONS, REUNIONS, or GOVERNMENTAL/HISTORICAL units that may be of historical interest or a strong emphasis within a family history. I think the GROUPS could all be user-defined, with maybe a conformed group-type (i.e. military, religious, government, historical, etc.). This does not feel like the same level of importance as the others: Names, Dates, Locations, Events or Documents.

Summary of Proposed GEDCOM Enhancements

(excluding MULTIMEDIA)

whitespace – for readability
UNICODE support so proper nouns can be recorded in their context with diacriticals or character sets (that are not Latin).
New Zero Level TAGS: NAME, DATE (not mine, but Jay Verkler’s emphasis)
New Zero Level TAGS (that Stanczyk wants): EVNT, DOCS, & LOCN (Jay also wanted LOCN).
Possibly GRUP – to support development of non-familial group memberships in trees

The new zero level tags are to support future CONFORMATION (standardization) efforts and also are the most likely to be sought after via any future API for enhanced analyses or specialized output in reports/charts.
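To illustrate why zero level tags make good handles for an API, here is a minimal Python sketch that groups a file's records by their zero level tag. The function name `records_by_tag` is hypothetical, and the proposed LOCN tag in the sample is not part of v5.5.1. It appears only to show that a new zero level record would be picked up the same way as an existing one.

```python
from collections import defaultdict


def records_by_tag(lines):
    """Group records (lists of lines) by the tag on their level 0 line."""
    groups = defaultdict(list)
    record = None
    for raw in lines:
        line = raw.strip()
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        if parts[0] == "0":
            # "0 @X1@ TAG ..." carries the tag third; "0 HEAD" carries it second.
            if len(parts) == 3 and parts[1].startswith("@"):
                tag = parts[2].split(" ")[0]
            else:
                tag = parts[1]
            record = [line]
            groups[tag].append(record)
        elif record is not None:
            record.append(line)
    return dict(groups)


# Made-up sample; LOCN is the proposed tag, not an existing v5.5.1 record.
SAMPLE = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME Jan /Kowalski/",
    "0 @I2@ INDI",
    "0 @L1@ LOCN",
    "1 NAME Example Village",
    "0 TRLR",
]
groups = records_by_tag(SAMPLE)
print({tag: len(recs) for tag, recs in groups.items()})
```

Each key of the resulting dictionary is one possible dimension to report on, and anything not promoted to zero level has to be dug out of the records one by one.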

Stanczyk views the Zero Level TAGs as possible dimensions for slicing-dicing a genealogy cube, what Data Architects see as OLAP analysis/reporting — sorry that jargon just slipped out.

The vision is cross family tree bumping or cross website bumping of gedcom data against databases to accomplish new and novel approaches to searching, merging or analyzing. This genealogy data could also be of use to historians or scientists as new sources of data to be mined for their research.

That’s the gedcom exploration for today!

P.S.

Please read the comments too. Apparently, I was wrong. There is a GEDCOM Evangelist who is not a gedcom complainer.

Posted on February 23, 2012 at 6:00 am in GEDCOM, Musings

Tags: 5.5.1, Enhancements, Exploration, GEDCOM, standard

4 Comments to “Meme: Exploring GEDCOM – Gedcom Lines — #Genealogy, #Technology, #Mashup”

Louis Kessler (@louiskessler)
February 23, 2012 at 8:14 pm

Stanczyk:

I’m enjoying all your posts.

I’m a developer and GEDCOM evangelist and I am a GEDCOM supporter. See: http://www.beholdgenealogy.com/blog/?p=803 or http://bettergedcom.wikispaces.com/message/view/What+IS+BetterGEDCOM/32188988 where I say: “I am probably alone in this view, but … GEDCOM is NOT as bad as everyone thinks.”

First, I don’t agree that GEDCOM needs whitespace. It’s only techies like you and me who look at GEDCOMs. All other people who simply use them don’t care what they look like. They only want them to store and transfer their data correctly.

For top level groups, I don’t feel the need for Dates or Names or Documents. We do need Locations (or Places if we want to call them that), and like you say possibly Groups.

With regard to Event records, I once thought they’d be useful, but my vision has since changed. What is really needed is a “Source Detail” record, so that the various specific items within each source can be identified. Currently, these are hacked into GEDCOM as subdetails under the SOUR link that is in the INDI and FAM records. This mixes conclusion information with evidence information and is bad-bad-bad. By separating out a Source Detail record, the raw evidence (just-the-facts) can be cleanly divided from the conclusion information that is with the event info under the INDI or FAM. This will finally allow source-based data entry and a proper implementation of evidence/conclusion modeling. Think about this. It’s a much better solution than Event Records, which will cause more complication and confusion than benefit.

So in summary, I say only Location, Source Detail, and possibly Groups are new needed GEDCOM records.

Dates are actually quite well defined in GEDCOM. Just a few tweaks would be needed there.

Names are okay and a bit of simple things can be done to improve them.

GEDCOM 5.5.1 already has Unicode. It supports UTF8, but they did not change the document enough to make that apparent everywhere in it.

I say the OBJE record would be a decent placeholder for your multimedia items.

Keep up the great work. And keep thinking!

Louis

stanlmitchell
February 24, 2012 at 2:36 pm

Hi Stanczyk,

I’ve been following your blog since the “Is GEDCOM Dead?” posting. I’m happy to see you are picking up the GEDCOM torch!

Am I a GEDCOM fan? Yes & No. Yes, because of the immense investment of genealogists who have stored their years of research in this format. Yes, because the format is easily comprehended and humanly readable. No, because of the lack of support for the standard. No, because software vendors have polluted the standard with custom tags or don’t follow the published specification. Despite the “No’s”, I have developed a GEDCOM reader for android (https://market.android.com/details?id=com.sourcequest.ezged).

Having spent quite a bit of time with GEDCOM recently, I thought I’d weigh-in on your proposals:

Whitespace: I suspect the number of people who handcraft or read GEDCOMs is rather small. Although I find the indentation makes the structure more obvious, I don’t think it belongs in the standard.

UNICODE: Although the specifications state that UNICODE is supported in 5.5 and UTF-8 in 5.5.1, I think there should be clarification about distinguishing UTF-16LE and UTF-16BE. Also, the specification sections on supported languages need to be updated.

Level Zero Records: I think candidates for level 0 are either shared or reusable records. They are given unique identifiers so they can be referenced elsewhere in the file. A Date applies to the information it is associated with, so I don’t see a benefit in separating it from that information – so I would keep it embedded in the related structure. Similarly, a Name is associated with an individual and it should remain embedded in the INDI structure. However, there are certainly some improvements that can be made in the Date and Name structures. Since Places tend to be shared or reused, I would agree with creating a level 0 LOCN record.

I’m not sure how your proposed DOCS record would be different than a SOUR record. I agree some facility needs to be added to support modern citation principles (e.g. Evidence Explained). The linkage I would hope to see is from source to reference note to event/fact. Events and facts are now embedded in INDI and FAM records. Splitting them out might be the way to go, but it depends on how the new records interrelate. This is the area where I would hope to see the most improvement.

I look forward to reading your future musings on GEDCOM!

-Stan

C. Michael Eliasz-Solomon
February 24, 2012 at 3:11 pm

Welcome to the blog, Louis & Stan. I appreciate both of your well-thought-out and articulate comments. As a result of your comments and those of the always cogent and gracious Tamura Jones (who sent me a private email), I will write a response article and discuss everybody’s points.

All three of you have pistol whipped me on whitespace. I am loath to capitulate, but I guess I will have to abandon my need (and laziness) and continue to edit gedcom files for my own edification, since it is evident nobody else sees the need for debugging gedcom files (or they expect programmers and other gedcom detectives to do it for themselves). I certainly do not want to bloat the GEDCOM. But then I will also expect/demand the same kind of universal scorn for XML, which would bloat a gedcom file far, far more than a few blanks at the beginning of a line (which could be turned off/on).

Please look for my response post … soon.

Tim Forsythe
March 1, 2012 at 8:05 am

Stanczyk, if you’re pushing to improve GEDCOM, here are a few non-standard features I’ve implemented on my website that I think others might find useful.

1. Parental Associations – Better defining of parental associations – I’ve blogged about this a lot, but basically, this is the most important record as it defines ancestry. GEDCOM has an ASSOciation field, but no predefined “Biological Child of X” relation type. I use this field extensively so that I can add sources to the relationship. Just adding a child to a family does not give you this ability (http://ancestorsnow.com/press/news.php?item.8.2)

2. Source Categories – All sources should be categorized by any of several attributes to establish reliability. I support 3 categories: Authority, Concurrency, and Association (http://ancestorsnow.com/press/news.php?item.9.1)

3. Certainty Assessments – Each claim made needs to have a certainty assessment field that can be set to indicate how certain a claim is especially if the claim has been disproved but remains in your database so that it is not lost (http://ancestorsnow.com/press/news.php?item.45.1).

4. Multiple Source Citations/Quotations – We need to be able to add multiple citations per source under each source. I use the NOTE field for this, but that is not its intended use.

5. Better privacy controls – we need to have a privacy tag for every record at every level, so that they can be restricted from use by displaying applications. We also need to be able to mark individuals as living and as dead. When applications hide living persons, we need to be able to selectively show records using a special tag. Likewise, we need to be able to selectively hide records when not using privacy restrictions (http://ancestorsnow.com/press/e107_plugins/forum/forum_viewtopic.php?77).

6. Marriage Order – There should be an easy way to set the marriage order separately for husbands and wives. Likewise for children’s order of birth. This is a claim so source references must be allowed (http://ancestorsnow.com/press/e107_plugins/forum/forum_viewtopic.php?81)

There are of course many others, like URLs and file PATHs for persons, sources, source quotations, and perhaps locations. I use all these on my website when displaying my data, making for a much richer experience for the visitor. My description of GREnDL 2.0 (http://ancestorsnow.com/press/news.php?item.52.1) includes these features and others.

[Editor –
Tim,
Thanks for your comments — the more the merrier! GEDCOM needs to be open to all comments. Welcome/Thanks for improving the conversation on GEDCOM.

–Stanczyk ]

