Posts tagged ‘standard’

March 2, 2012

Diacritical Redux – Ancestry GEDCOM — #Genealogy, #Technology

by C. Michael Eliasz-Solomon

As Stanczyk has been writing about the GEDCOM standard since #RootsTech 2012, I began to pick apart my own GEDCOM file (*.ged). I did this while engaged with Tamura Jones (a favorite foil to debate Genealogy Technology with). During our tête-à-tête, I noticed that my GEDCOM lacked diacriticals???

What happened? At first I thought it was the software that Tamura had recommended I use (PAF), but that software was not the problem. So I looked at the gedcom file that I had imported, and the diacriticals were missing from there too, meaning my export software was the culprit.

I looked at the GEDCOM’s  HEAD tag and the CHAR sub-tag, and it said “ANSI” [no quotes] was the value. That is not even a valid possible value! According to the GEDCOM 5.5.1 standard [on page 44 of the FamilySearch PDF document]:

CHARACTER_SET:= {Size=1:8}
[ ANSEL |UTF-8 | UNICODE | ASCII ]
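The check described above (read the CHAR sub-tag under HEAD and compare it with the four allowed values) can be sketched in a few lines of Python. This is only an illustration: the function names `declared_charset` and `is_valid_charset` are hypothetical, and the sample lines are made up rather than copied from any real export.

```python
import re

# The four values the GEDCOM 5.5.1 grammar quoted above allows for CHAR.
VALID_CHARSETS = {"ANSEL", "UTF-8", "UNICODE", "ASCII"}

# A level-1 CHAR line, e.g. "1 CHAR ANSI"
CHAR_LINE = re.compile(r"^\s*1\s+CHAR\s+(\S+)")


def declared_charset(lines):
    """Return the value of the HEAD record's CHAR sub-tag, or None."""
    in_head = False
    for raw in lines:
        line = raw.lstrip("\ufeff").rstrip("\r\n")
        if re.match(r"^\s*0\s", line):
            # A new level-0 record starts here.
            if in_head:
                return None  # left HEAD without seeing a CHAR line
            in_head = re.match(r"^\s*0\s+HEAD\b", line) is not None
        elif in_head:
            m = CHAR_LINE.match(line)
            if m:
                return m.group(1)
    return None


def is_valid_charset(value):
    """True only for the values listed in the 5.5.1 CHARACTER_SET rule."""
    return value in VALID_CHARSETS


# Made-up sample header, for illustration only.
sample = ["0 HEAD", "1 SOUR SomeApp", "1 CHAR ANSI", "0 TRLR"]
value = declared_charset(sample)
print(value, is_valid_charset(value))
```

For the sample header this reports the declared value together with the fact that it is not one of the four allowed values.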

Who is this dastardly purveyor of substandard GEDCOM that strips out your diacriticals (that I assume you have been working so hard to add since my article on Tuesday, “Dying For Diacriticals“)? I’ll give you a HINT, it is the #1 Genealogy Website – Yes, it is ANCESTRY.COM !

Now what makes this error even more dastardly is that the website shows you the diacriticals in the User Interface (UI), but when you go to export/download, the diacriticals are not there in the gedcom. Unless you study things closely, you may be oblivious (as Stanczyk was for a long time) that these errors have crept into your research. I also found a spurious NOTE that I cannot find anywhere on anyone in my tree, and which gets attributed to my home person (uh, me). This is very alarming to me too !!!

Tim Sullivan (CEO of Ancestry.com), I expected better of you and your website. I entrusted my family tree to you and that is what you did with my gedcom? Now I did some more investigating and I found that Ancestry does not strip ALL diacriticals. My gedcom had diacriticals in the PLAC tags and in NOTE tags. But NOT (I repeat NOT) in the NAME tags.
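The tag-by-tag comparison described above (diacriticals surviving in PLAC and NOTE but not in NAME) could be sketched like this. The helper name `non_ascii_by_tag` and the sample lines are made up for illustration and are not taken from any real export.

```python
from collections import Counter


def non_ascii_by_tag(lines):
    """Count, per GEDCOM tag, how many line values contain a non-ASCII character."""
    counts = Counter()
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # no value on this line
        level, tag, value = parts
        if tag.startswith("@"):
            # Record lines look like "0 @I1@ INDI": the second field is an
            # xref id, so the real tag is the first word of what follows.
            tag, _, value = value.partition(" ")
        if any(ord(ch) > 127 for ch in value):
            counts[tag] += 1
    return counts


# Made-up sample: the NAME has lost its diacritical, PLAC and NOTE have not.
sample = [
    "0 @I1@ INDI",
    "1 NAME Pawel /Nowak/",
    "1 BIRT",
    "2 PLAC Biechów, Busko, Kielce, Poland",
    "1 NOTE Baptized in Biechów",
]
print(non_ascii_by_tag(sample))
```

A tree full of Polish names whose export shows zero non-ASCII NAME values, while PLAC and NOTE values still have them, is the pattern to be suspicious of.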

So Tim [pretend there is a shaky leaf here], if you or a reputation defender or some other minion skims the Internet (for your name), here is what I hope You/Ancestry.com will do:

  1. Do NOT strip diacriticals from the NAME tag !!!
  2.  Fix the Export GEDCOM to create a gedcom file with diacriticals in NAME tags
  3. Fix the Export GEDCOM to create a valid CHAR tag value: UNICODE, UTF-8, ASCII, ANSEL. I put them in my prioritized/preferred order [from left-to-right]. I hope you will not use ASCII or ANSEL.
  4. Run a GEDCOM validator against the gedcom file your Export GEDCOM software creates to download and fix the other “little things” too (Mystery NOTEs ???).

February 26, 2012

Responses – Exploring Gedcom — #Technology, #Genealogy

by C. Michael Eliasz-Solomon

Mail Room

The mailroom received three emails / comments from the “Exploring Gedcom” article.

Tamura Jones (Modern Software Experience), Louis Kessler (BeholdGenealogy.com),  and  Stan Mitchell (GenApps.net / ezGED Viewer).

Good solid GEDCOM experts all (unlike this jester who is only a journeyman apprentice to these fine men) and as you can see they are also bloggers themselves.

Here’s my summation:

  1. WHITESPACE – All three disputed the whitespace proposal. Even though it was an optional feature accessed via an on/off check-box, I yield to what I see as a rising tide that I cannot swim against. I assume that XML will also be treated as badly by all for EXACTLY the same reasons: too verbose, and it makes for a poor data transfer mechanism because of the bloat.
  2. UNICODE – Tamura pointed out that it was in the 5.5.1 standard. I said maybe so, but it is hardly implemented and needs to be mandatory. I also hate the two-hex-byte binary debris (the byte-order mark) that makes an otherwise TEXT file into a partial binary file. Tamura points out that this byte-order indicator is commonly hidden by PC editors (I am old school and use vi, where nothing is hidden). Besides, the HEAD tag CHAR sub-tag could be used to determine the character set and keep the file textual. Tamura said that would be a catch-22 (since you do not know the file’s encoding until you have read it). Tamura points out that everyone does as I suppose and uses ASCII (or UTF-16LE or UTF-16BE) to determine the encoding. Documentation needs to be updated too. There is really almost no support for generating UNICODE chars in apps; it should be required, otherwise data entry is limited to clever users (i.e. tech-types or their friends).
  3. DATES – nobody liked DATES (or NAMES) as a zero level. I can live without dates, as I can always create a dimension with every possible date to slice & dice my genealogy facts. Names were also not part of my vision, other than I want a bunch of AKA names for an INDI.
  4. LOCN – Everyone agreed, the PLAC tag did provide a minimalist capability. 2/3 saw a good reason to have LOCN at the zero level, as I proposed. Let’s hope this feature gets in.  We may need to keep PLAC tag for backwards compatibility until all gedcoms have been converted.
  5. EVENTS as a zero level tag seemed to interest people, as MULTI-PERSON events (aka EVENT_TYPE_FAMILY) are NOT adequately dealt with in GEDCOM 5.5.1. I also think people want to standardize these events as much as possible and leave open the ability for a user to add their own events. EVENTS was also related to GROUPS, which people seem to want in some fashion. Analyzing a social network needs better GROUP/EVENT/ROLE visibility than the current standard provides. I think we really need EVENT and EVENT_TYPE tags to keep from adding a new GEDCOM tag every time someone says we need a new event (BIRT | CHR | BAPM | BARM | BASM | BLES | SLGC). All TYPE tags should be from a standard list that a user can add to. The list should be allowed to be localized (into a native language). This keeps parsing to a minimum while allowing for expansion. OLD tags are kept for backwards compatibility until a gedcom is upgraded to a later version. I think we also need a ROLE_TYPE to replace ROLE_IN_EVENT and add more standard roles (i.e. GodMother, GodFather, Witness, Neighbor, MidWife, Rabbi, Brother, Sister, Aunt, Uncle, Cousin, Boarder, etc.), and this should also be localized and user upgradeable. Keep EVENT_TYPE_FAMILY and EVENT_TYPE_INDIVIDUAL for backwards compatibility.
  6. DOCUMENTS (deferred) – The case was not made nor was the concept adequately explained. To Stanczyk this is related to multimedia and is for making it possible to locate all documents of a certain type (i.e.  Declaration of Intent) with its date and location and a locator to the multimedia  representation and pull these documents out in their own right regardless of the individual or family or group.
  7. GROUPS (deferred) – Some interest in defining a non-family group (ex. military unit, college fraternity/sorority, religious society, etc.). These groups would be interesting in their own right to study. At RootsTech 2012, this seemed to be a novel idea that had a positive feel to the audience.

GROUPS is not intended for a non-traditional family unit which needs some thought and design in this Modern Family World.
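On the UNICODE item (2) above: the catch-22 is that a reader has to guess the encoding before it can read the CHAR line at all. One common way out, and only a sketch of it here, is to look at the first few bytes, using the byte-order mark if present and otherwise the fact that a GEDCOM file begins with "0 HEAD". The function name `sniff_encoding` and the returned labels are assumptions made for this example.

```python
import codecs


def sniff_encoding(first_bytes):
    """Guess a GEDCOM file's encoding from its first few bytes."""
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if first_bytes.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No byte-order mark: the file should start with "0 HEAD", so the byte
    # pattern of "0 " shows whether each character takes one byte or two.
    if first_bytes.startswith("0 ".encode("utf-16-le")):
        return "utf-16-le"
    if first_bytes.startswith("0 ".encode("utf-16-be")):
        return "utf-16-be"
    if first_bytes.startswith(b"0 "):
        # One byte per character: HEAD can be read as ASCII, and the CHAR
        # value inside it then says how to decode the rest of the file.
        return "ascii-compatible"
    return None


print(sniff_encoding(b"0 HEAD\r\n1 CHAR ANSEL\r\n"))
print(sniff_encoding("0 HEAD\n".encode("utf-16-le")))
print(sniff_encoding(codecs.BOM_UTF16_BE + "0 HEAD\n".encode("utf-16-be")))
```

With this approach the byte-order mark is optional for a reader, which is roughly the position argued for in item 2.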

Tamura chided me that many of my “wants” are a part of an offshoot of GEDCOM called GEDCOM 5.5EL (Extended Location GEDCOM derived from version 5.5). The only difference is I want to get rid of the need for _LOC (a custom tag under the GEDCOM definition) and use LOCN instead. Also I would want their undefined tag called NAMC (possibly renamed NAMA for NAMe Alias or NAMe Alternate) to be 0:M, meaning that you can have zero, one or many alternate/alias names for this LOCN (or INDI, why not?).

Also the NAMC (or NAMA) should have a subtype FONE and FONETYPE (soundex, DM, Beider-Morse, etc.) to aid in advanced searches or Google searches. But this is the argument for NAMES at zero level. The last names are usually where the soundex/phonetic matching needs to be stored. We do not need to repeat this data for each individual (INDI), just for each unique last name or alternate name. These things get created in Surname Index pages. How much easier if the NAMES (re surnames and alternate surnames) had a zero level with the FONE and FONETYPE, and the INDI had an XREF pointing to each NAME/NAME ALTERNATE s/he had used during their life. One might even hope that the zero level name had an XREF to each INDI too. If I were the Data Architect for GEDCOM, I would have a zero level NAMES (for SURNAMES) and their soundex/phonetic codes, plus XREFs back and forth to INDI.
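To make the zero-level NAMES idea above a little more concrete, here is a small sketch of such an index: each unique surname is stored once with a phonetic code and with cross-references back to the individuals using it. Everything here is an assumption for illustration. The names `soundex` and `build_name_index` and the `fone`/`indi` keys are hypothetical, plain American Soundex stands in for whichever FONETYPE is wanted, and this simple version ignores letters outside A-Z, so a letter such as ł is skipped rather than handled properly.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name):
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeat after it is coded again
    return (result + "000")[:4]


def build_name_index(individuals):
    """Map each unique surname to its phonetic code and the INDI xrefs using it."""
    index = {}
    for xref, surname in individuals:
        entry = index.setdefault(surname, {"fone": soundex(surname), "indi": []})
        entry["indi"].append(xref)
    return index


# Hypothetical xrefs; the surnames are spelling variants of one family name.
people = [("@I1@", "Eliasz"), ("@I2@", "Eliasz"), ("@I3@", "Elias")]
for surname, entry in build_name_index(people).items():
    print(surname, entry["fone"], entry["indi"])
```

The phonetic code is computed once per surname instead of once per individual, which is the saving the paragraph above argues for.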

I need to cut this response short, but a great thanks to all who read the article and the above three for improving my thoughts by their comments/emails.

P.S. – If you follow the GEDCOM 5.5EL link, you will notice they show their gedcom indented and then go on to say about the whitespace:

” improves readability … should (and will) not be performed at real Gedcom-output.”

My sentiments exactly. Sometimes you just need to show the gedcom (and indentation improves the presentation and understanding). I too never intended it to be processed into the gedcom output/exported to a file for transport (just for my own personal examination or writing purposes). However, I have capitulated on whitespace — so please no more email on whitespace and bloat.

February 23, 2012

Meme: Exploring GEDCOM – Gedcom Lines — #Genealogy, #Technology, #Mashup

by C. Michael Eliasz-Solomon

Stanczyk wants to introduce a new meme, “Exploring GEDCOM“. I was musing upon why the state of the GEDCOM standard is … so CHAOTIC. GEDCOM has languished for about a decade and a half now with no new standard, hence my article, “Is GEDCOM dead?” (2/5/2012). I was left in a perplexed state after RootsTech 2012. Why is FamilySearch working on a “standard” in a vacuum? Why is there so little communication with the existing software vendors (the purveyors of GEDCOM), and why do the end users have no voice in what is needed in a GEDCOM standard?

So I decided that GEDCOM needed an Evangelist. I believe there are already a plethora of GEDCOM Evangelists so perhaps I will just add to the milieu (or is it the meme). To be frank, most GEDCOM Evangelists are really GEDCOM complainers — nay, I think we are all complainers, because there are no GEDCOM complimenters, not even amongst the GEDCOM purveyors. Even FamilySearch, which “owns” GEDCOM (how can that be a standard) wants to make their latest effort (GEDCOMX) a “clean sheet” project. No backwards compatibility even!

Is GEDCOM  just an ugly baby whose parentage is in doubt?

So this meme is on Exploring GEDCOM. What is it? How can it be improved? What should a TRUE gedcom standard include? I’ll probably write one to four times a month on this meme until I have exhausted myself on this topic. My goal, ultimately, is to get this to be a part of RootsTech and to be an OPEN STANDARD with an open, transparent definition and process for change, which I hope to have tied to RootsTech attendees voting on this, possibly via the RootsTech App.

Allow non-attendees to vote if they register who they are and their role (genealogist, technologist, software vendor, etc.) and why they want to be a voter. I think conference attendees (genealogist, technologist, vendor-of-any-kind, or organizer) get an automatic vote, prior attendees get a vote, and gedcom software vendors get a vote. All prior voters get to vote in all future votes on the open standard (as long as their email address works or when it is corrected again). OPEN STANDARD means that all stakeholders need to have an opportunity to influence the standard.

Let me start the Meme by revisiting graphic syntax diagrams  …

I started with this railroad track (2/16/2012) to define a gedcom file. Our discussion will focus upon gedcom v5.5.1 and launch from that rocket pad into some far flung future gedcom feature(s). This diagram was derived from the standard in PDF form. I have attempted to make the standard more “grammatical” and formularize/define ambiguities to my genealogical/technological world view. We see a HEAD tag, a TRLR tag and an optional SUBN tag with a whole bunch of “gedcom lines”.

[Figure: A V5.5.1 Gedcom Line]

This is what a gedcom line looks like. I have added a wish for optional whitespace at the beginning of a line. That is my first proposal. The number at the beginning of each line is meant to be “an outline level”. So I wanted the option of outputting lines with leading blanks corresponding to the level of indentation appropriate for the outline level — to aid readability of seeing what inner outline indentations go  with which outermost level. Make the whitespace a checkbox on export (directed at you software vendor guys) and default it to off.
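As a rough illustration of the line layout just described (level number, optional cross-reference id, tag, optional value), here is a small parser that also tolerates the proposed leading whitespace and can produce an indented view for reading. It is a sketch, not a full 5.5.1 parser: the pattern is a simplification, and the names `parse_line` and `indent_for_display` are made up for this example.

```python
import re

# level, optional @xref@, tag, optional value. Leading whitespace is tolerated,
# which is the optional-whitespace wish described above (not part of 5.5.1).
GEDCOM_LINE = re.compile(
    r"^\s*(?P<level>\d+)"
    r"(?: (?P<xref>@[^@]+@))?"
    r" (?P<tag>\w+)"
    r"(?: (?P<value>.*))?$"
)


def parse_line(text):
    """Split one gedcom line into (level, xref, tag, value)."""
    m = GEDCOM_LINE.match(text.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"not a gedcom line: {text!r}")
    return int(m.group("level")), m.group("xref"), m.group("tag"), m.group("value")


def indent_for_display(lines, width=2):
    """Return the lines indented by outline level, for reading only."""
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            level = parse_line(stripped)[0]
            out.append(" " * (width * level) + stripped)
    return out


# Made-up sample record.
sample = ["0 @I1@ INDI", "1 NAME Pawel /Nowak/", "1 BIRT", "2 DATE 2 MAR 1850"]
print("\n".join(indent_for_display(sample)))
```

Because `parse_line` ignores leading blanks, an indented file and a flush-left file parse identically, so the indentation stays a display choice and never has to be written to the exported file.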

We see that a gedcom line at its (current) core describes: families, individuals, notes, repositories, sources, submitters & their multimedia (digital documents, notes, memories, etc.). This is still a very high level discussion. We have only spoken of 3 of the 136 tags. But already this jester has a suggestion/complaint.  Let me defer a discussion of Multimedia_Records to its own article as this requires many words, a lot of which are jargon. The complaint – we need more zero level tags!

So deferring multimedia, we have six types of records. A software vendor might think of six different tables (or objects) that need to be described and stored as we “parse” each gedcom line in the file that stores our family tree. Do not lose sight that these files are the family trees of some researcher, not abstract or theoretical data. This is research from current or prior genealogists, and it needs to be preserved … without loss.

At its inner core is a set of individuals (INDI tag). I once wrote a PERL script to pull out all individuals with their vital data (B/M/D). Very easy thing to do. I mention this now to illustrate that these compact files are at the intersection of genealogy and technology. These gedcom files are emblematic of the technology / genealogy mashup that is RootsTech! They are also the way we can interface our genealogies with other non-family tree tools to do additional things. Let’s call those gedcom ADD-ONS (or PLUG-INS or APPs). I am hopeful that, with a standard API, they will be able to pull this info out, just like my PERL script pulled out the individuals. That is the essence of an INDIVIDUAL gedcom record.
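The original PERL script is not shown, so the following is not it. It is a minimal Python sketch of the same kind of extraction, under two stated simplifications: it reads only NAME and the BIRT/DEAT dates from INDI records, and it leaves marriages out because 5.5.1 stores MARR under the FAM record rather than under INDI. The function name and the sample data are made up.

```python
def individuals_with_vitals(lines):
    """Collect NAME plus BIRT/DEAT dates for every INDI record."""
    people = {}
    current = None  # xref of the INDI record being read, if any
    event = None    # level-1 event we are inside (BIRT or DEAT), if any
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level = parts[0]
        if level == "0":
            event = None
            if len(parts) == 3 and parts[2] == "INDI":
                current = parts[1]
                people[current] = {"name": None, "BIRT": None, "DEAT": None}
            else:
                current = None  # some other record type (FAM, SOUR, ...)
        elif current is not None:
            tag = parts[1]
            value = parts[2] if len(parts) == 3 else ""
            if level == "1":
                event = tag if tag in ("BIRT", "DEAT") else None
                if tag == "NAME" and people[current]["name"] is None:
                    people[current]["name"] = value
            elif level == "2" and tag == "DATE" and event:
                people[current][event] = value
    return people


# Made-up sample file; the dates are invented for illustration.
sample = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME Pawel /Nowak/",
    "1 BIRT",
    "2 DATE 29 JUN 1850",
    "2 PLAC Biechów, Busko, Kielce, Poland",
    "1 DEAT",
    "2 DATE 1912",
    "0 @F1@ FAM",
    "1 MARR",
    "2 DATE 1875",
    "0 TRLR",
]
for xref, person in individuals_with_vitals(sample).items():
    print(xref, person)
```

Any add-on or plug-in of the sort imagined above would start from a pass like this one before doing its own work.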

There are also FAMILY gedcom records that are defined by FAM,  FAMC,  FAMS and the temple ordinance (i.e. LDS) FAMF tags. Likewise, we have NOTE (NOTE), SUBMITTER (SUBM),  and REPOSITORY (REPO)/SOURCE (SOUR) records too. I mentioned the FAMC/FAMS tags in addition to FAM which really equates to the FAMILY-RECORD, in order to point out that an individual is part of two families. S/He is a part of a family where they are a child(FAMC) and they are also part of the family where they are a parent (re SPOUSE, hence FAMS). This is evident when you realize that we are speaking of a family tree and that a tree really goes forward and backward linking the present to the past (and logically,  vice-versa).
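The two-families point above can be shown with a short sketch that gathers each individual's FAMC and FAMS pointers. The function name `family_links` and the sample xrefs are hypothetical, and the code reads only level-1 FAMC/FAMS lines under INDI records.

```python
from collections import defaultdict


def family_links(lines):
    """For each INDI xref, list the families it is a child of (FAMC)
    and the families it is a spouse/parent in (FAMS)."""
    links = defaultdict(lambda: {"FAMC": [], "FAMS": []})
    current = None  # xref of the INDI record being read, if any
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 3:
            if parts[0] == "0":
                current = None  # e.g. "0 HEAD" or "0 TRLR"
            continue
        level, second, rest = parts
        if level == "0":
            current = second if rest == "INDI" else None
        elif level == "1" and current and second in ("FAMC", "FAMS"):
            links[current][second].append(rest)
    return dict(links)


# Made-up sample: @I1@ is a child in family @F1@ and a spouse in family @F2@.
sample = [
    "0 @I1@ INDI",
    "1 FAMC @F1@",
    "1 FAMS @F2@",
    "0 @F1@ FAM",
    "1 CHIL @I1@",
    "0 @F2@ FAM",
    "1 HUSB @I1@",
    "0 TRLR",
]
print(family_links(sample))
```

Following FAMC pointers walks the tree back toward ancestors, and following FAMS pointers walks it forward toward descendants.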

What’s Missing? – A Proposal (the first of many)

I am still ignoring MULTIMEDIA — so that is not it. If we believe in Jay Verkler‘s RootsTech 2012 vision for genealogy, then we need to conform (i.e. standardize):  Dates, Locations, Names. I would also add: Events, Documents,  and possibly Groups. So that is six more zero level RECORDS.

DATES I assume need to be standardized because of the many problems: missing date, partial date, estimated date, various calendars, etc.

NAMES are also a problem area. For example, how do I record my ancestor’s name? Do I conform his name to ENGLISH (i.e. does Piotr become Peter)? Should I record it in his context (i.e. Pawel for Paul)? Should I record it in the language of the record (my ancestors come in Latin (Paulus), Polish, and Russian)? Oh, and some of those names do not translate to the other language, so we have adopted names/name changes/nicknames. Then there are Latin alphabet versus Cyrillic characters versus Hebrew characters, or even just recording diacritical letters like slashed-l (ł).

UNICODE support is a MUST in any new standard.

We also need Locations, Events, Documents, and Groups as zero level “records”, so that we can pull those out of the file, just as I pulled Individuals out of the file. Take a Location such as Biechów, Busko, Kielce, Poland: that is the administrative hierarchy of one of my ancestral villages. Of course, it changed over time or by whoever occupied Poland (or should I view it as Congress Poland/Vistulaland as a part of the Russian Empire’s many gubernias?). Clearly locales have a time component.

I deferred MULTIMEDIA because it is technical and also because I want to make the case that we need EVENTS and/or DOCUMENTS instead. MULTIMEDIA are just NOTES that are not textual, and often this digital media is a representation of some document(s) that documented an event. I also propose GROUPS as a record because people want to record connections to MILITARY units, CHURCH SOCIETIES, SCHOOLS, BUSINESSES/ORGANIZATIONS, REUNIONS, or GOVERNMENTAL/HISTORICAL units that may be of historical interest or a strong emphasis within a family history. I think the GROUPS could all be user-defined, with maybe a conformed group-type (i.e. military, religious, government, historical, etc.). This does not feel like the same level of importance as the others: Names, Dates, Locations, Events or Documents.

Summary of Proposed GEDCOM Enhancements

(excluding MULTIMEDIA)
  1. whitespace – for readability
  2. UNICODE support so proper nouns can be recorded in their context with diacriticals or character sets (that are not Latin).
  3. New Zero Level TAGS:  NAME, DATE (not mine, but Jay Verkler’s emphasis)
  4. New Zero TAGS (that Stanczyk wants):  EVNT,  DOCS, &  LOCN (Jay also wanted locn).
  5. Possibly GRUP – to support development of non-familial group memberships in trees

The new zero level tags are to support future CONFORMATION (standardization) efforts and also are the most likely to be sought after via any future API for enhanced analyses or specialized output in reports/charts.

Stanczyk views the Zero Level TAGs as possible dimensions for slicing-dicing a genealogy cube, what Data Architects see as OLAP analysis/reporting … sorry, that jargon just slipped out.

The vision is cross family tree bumping or cross website bumping of gedcom data against databases to accomplish new and novel approaches to searching, merging or analyzing. This genealogy data could also be of use to historians or scientists as new sources of data to be mined for their research.

That’s the gedcom exploration for today!

 

P.S. 

Please read the comments too. Apparently, I was wrong. There is a GEDCOM Evangelist who is not a gedcom complainer.
