Stanczyk wants to introduce a new meme, “Exploring GEDCOM“. I was musing upon why is the state of a GEDCOM standard, … so CHAOTIC? GEDCOM has languished for about a decade and a half now with no new standard — hence my article, “Is GEDCOM dead?” (2/5/2012) . I was left in a perplexed state after RootsTech 2012. Why is FamilySearch working on a “standard” in a vacuum? Why is there so little communication with the existing software vendors — the purveyors of GEDCOM and why do the end users have no voice into what is needed in a GEDCOM standard?
So I decided that GEDCOM needed an Evangelist. I believe there are already a plethora of GEDCOM Evangelists so perhaps I will just add to the milieu (or is it the meme). To be frank, most GEDCOM Evangelists are really GEDCOM complainers — nay, I think we are all complainers, because there are no GEDCOM complimenters, not even amongst the GEDCOM purveyors. Even FamilySearch, which “owns” GEDCOM (how can that be a standard) wants to make their latest effort (GEDCOMX) a “clean sheet” project. No backwards compatibility even!
Is GEDCOM just an ugly baby whose parentage is in doubt?
So this meme is on Exploring GEDCOM. What is it? How can it be improved? What should a TRUE gedcom standard include? I’ll probably write once to three or four times a month on this meme until I have exhausted myself on this topic. My goal is ultimately, is to get this to be a part of RootsTech and to be an OPEN STANDARD with an open, transparent definition and process for change, which I hope to have tied to RootsTech attendees voting on this, possibly via the RootsTech App.
Allow non-attendees to vote if they register who they are and their role: genealogist, technologist, software vendor, etc. and why they want to be a voter. I think conference attendees (genealogists, technologist, or vendor-of-any-kind, organizer) get an automatic vote, prior attendees get a vote, gedcom software vendors get a vote. All prior voters get to vote in all future votes on the open standard (as long as their email address works or when it is corrected again). OPEN STANDARD means that all stakeholders need to have an opportunity to influence the standard.
Let me start the Meme by revisiting graphic syntax diagrams …
I started with this railroad track (2/16/2012) to define a gedcom file. Our discussion will focus upon gedcom v5.5.1 and launch from that rocket pad into some far flung future gedcom feature(s). This diagram was derived from the standard in PDF form. I have attempted to make the standard more “grammatical” and formularize/define ambiguities to my genealogical/technological world view. We see a HEAD tag, a TRLR tag and an option SUBN tag with a whole bunch of “gedcom lines”.
A V5.5.1 Gedcom Line
This is what a gedcom line looks like. I have added a wish for optional whitespace at the beginning of a line. That is my first proposal. The number at the beginning of each line is meant to be “an outline level”. So I wanted the option of outputting lines with leading blanks corresponding to the level of indentation appropriate for the outline level — to aid readability of seeing what inner outline indentations go with which outermost level. Make the whitespace a checkbox on export (directed at you software vendor guys) and default it to off.
We see that a gedcom line at its (current) core describes: families, individuals, notes, repositories, sources, submitters & their multimedia (digital documents, notes, memories, etc.). This is still a very high level discussion. We have only spoken of 3 of the 136 tags. But already this jester has a suggestion/complaint. Let me defer a discussion of Multimedia_Records to its own article as this requires many words, a lot of which are jargon. The complaint – we need more zero level tags!
So deferring multimedia, we have six types of records. A software vendor might think six different tables (or objects) that need to be described and stored as we “parse” each gedcom line in the file that stores our family tree. Do not lose sight that these files are family trees of some researcher — not abstract or theoretical data. These are research from current or prior genealogists and they need to be preserved … without loss.
At its inner core is a set of individuals (INDI tag). I once wrote a PERL script to pull out all individuals with their vital data (B/M/D). Very easy thing to do. I mention this now to illustrate that these compact files are at the intersection of genealogy and technology. These gedcom files are emblematic of the technology / genealogy mashup that is RootsTech! They are also the way we can interface our genealogies with other non-family tree tools to do additional things. Lets call those gedcom ADD-ONS (or PLUG-INS or APPs) that, I am hopeful, that with a standard API to be able pull this info out, just like my PERL script pulled out the individuals. That is the essence of an INDIVIDUAL gedcom record.
There are also FAMILY gedcom records that are defined by FAM, FAMC, FAMS and the temple ordinance (i.e. LDS) FAMF tags. Likewise, we have NOTE (NOTE), SUBMITTER (SUBM), and REPOSITORY (REPO)/SOURCE (SOUR) records too. I mentioned the FAMC/FAMS tags in addition to FAM which really equates to the FAMILY-RECORD, in order to point out that an individual is part of two families. S/He is a part of a family where they are a child(FAMC) and they are also part of the family where they are a parent (re SPOUSE, hence FAMS). This is evident when you realize that we are speaking of a family tree and that a tree really goes forward and backward linking the present to the past (and logically, vice-versa).
What’s Missing? – A Proposal (the first of many)
I am still ignoring MULTIMEDIA — so that is not it. If we believe in Jay Verkler‘s RootsTech 2012 vision for genealogy, then we need to conform (i.e. standardize): Dates, Locations, Names. I would also add: Events, Documents, and possibly Groups. So that is six more zero level RECORDS.
DATES I assume need to be standardized because of the many problems: missing date, partial date, estimated date, various calendars, etc.
NAMES are also a problem area. For example, how do I record my ancestor’s name? Do I conform his name to ENGLISH (i.e. does Piotr become Peter)? Should I record it in his context, (i.e. Pawel for Paul)? Should I record it in the language of the record (my ancestors come in Latin (Paulus), Polish, and Russian. Oh, some of those names do not translate to the other language, so we have adopted names/name changes/nicknames. Latin alphabet versus Cyrillic characters versus Hebrew characters or even just recording diacritical letters like slashed-l (ł ).
UNICODE support is a MUST in any new standard.
We also need Locations, Events, Documents, and Groups as zero level “records”, so that we can pull those out of the file, just as I pulled Individuals out of the file. Locations (i.e. Biechów, Busko, Kielce, Poland) that is the administrative hierarchy of one of my ancestral villages. Of course, it changed over time or by whoever occupied Poland (or should I view it as Congress Poland/Vistulaland as a part of the Russian Empire’s many gubernias). Clearly locales have a time component.
I deferred MULTIMEDIA because it is technical and also because I want to make the case that we need EVENTS and/or DOCUMENTS instead and that MULTIMEDIA are just NOTES that are not textual and often this is congruent with the fact that this digital media is a representation of some document(s) that documented an event. I also propose GROUPS as a record because people want to record connections to MILITARY units, CHURCH SOCIETIES, SCHOOLS, BUSINESSES/ORGANIZATIONS, REUNIONS, or GOVERNMENTAL/HISTORICAL units that may be of a historical or a strong emphasis within a family history. I think the GROUPS could all be user-defined, with maybe a conformed group-type (i.e. military, religious, government, historical, etc.). This does not feel like the same level of importance as the others: Names, Dates, Locations, Events or Documents.
Summary of Proposed GEDCOM Enhancements
- whitespace – for readability
- UNICODE support so proper nouns can be recorded in their context with diacriticals or character sets (that are not Latin).
- New Zero Level TAGS: NAME, DATE (not mine, but Jay Verkler’s emphasis)
- New Zero TAGS (that Stanczyk wants): EVNT, DOCS, & LOCN (Jay also wanted locn).
- Possibly GRUP – to support development of non-familial group memberships in trees
The new zero level tags are to support future CONFORMATION (standardization) efforts and also are the most likely to be sought after via any future API for enhanced analyses or specialized output in reports/charts.
Stanczyk views the Zero Level TAGs as possible dimensions for slicing-dicing a genealogy cube, what Data Architects see as OLAP analysis/reporting – sorry that jargon just slipped out.
The vision is cross family tree bumping or cross website bumping of gedcom data against databases to accomplish new and novel approaches to searching, merging or analyzing. This genealogy data could also be of use to historians or scientists as new sources of data to be mined for their research.
That’s the gedcom exploration for today!
Please read the comments too. Apparently, I was wrong. There is a GEDCOM Evangelist who is not a gedcom complainer.