Dying For Diacriticals … Beyond ASCII — #HowTo, #Genealogy, #Polish

by C. Michael Eliasz-Solomon

Stanczyk mused recently upon a few of the NAMEs in my genealogy:

Bębel, Elijasz, Guła, Leszczyński, Kędzierski, Wątroba, Wleciał, Biechów, Pacanów, Żabiec

If you want to write Elijasz (or any of its variants) you are golden. But each of the other names requires a diacritic (aka diacritical mark). Early on, I had to drop the diacritics, because I did not have computer software to generate these characters (aka glyphs). So my genealogy research and my family tree were recorded in ASCII characters. For the most part that is not a concern, unless you are like John Rys and trying to find all of the possible ways your Slavic name can be spelled, misspelled, or transliterated and eventually recorded in some document or database that you will need to search. Then the import becomes very clear. Also, letters with a diacritic sort differently than letters without one. For years, I thought Żabiec was not in a particular Gazetteer I use, until I realized there was a dot above the Z, and the dotted-Z villages came after all of the plain-Z (no dot) villages. There was Żabiec, many pages later! The dot was not recorded in the Ship Manifest, nor in a Declaration of Intent document, so I might not have found the parish that Żabiec belongs to so easily. I hope you are beginning to see the import of recording diacritics in your family tree.
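For the programmers in the audience, here is a minimal Python sketch of both points: the ASCII spelling a clerk might have recorded, and the Polish dictionary order that hid Żabiec from me. The helper names (polish_key, ascii_fold) are my own inventions for illustration, and the sort key only knows the 32 letters of the Polish alphabet, so treat it as a sketch rather than a full collation routine.

```python
import unicodedata

# Polish alphabet order: each letter with a diacritic follows its plain letter.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}

def polish_key(name):
    """Sort key ranking each letter by its position in the Polish alphabet.

    Characters outside the alphabet (q, v, x, spaces, ...) all rank last.
    """
    return [RANK.get(ch, len(POLISH_ALPHABET)) for ch in name.lower()]

def ascii_fold(name):
    """Drop diacritics to get the plain ASCII spelling of a Polish name."""
    # The slashed L has no Unicode decomposition, so it is mapped by hand.
    name = name.replace("Ł", "L").replace("ł", "l")
    decomposed = unicodedata.normalize("NFD", name)
    # Remove the combining marks (dot above, ogonek, acute) left by NFD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Żabiec", "Wleciał", "Wątroba", "Leszczyński", "Kędzierski", "Guła"]

print(sorted(names))                   # code point order: Wleciał before Wątroba
print(sorted(names, key=polish_key))   # Polish order: Wątroba before Wleciał
print([ascii_fold(n) for n in names])  # the spellings to also search for
```

Searching a database for both the original and the folded spelling is one cheap way to catch records where the diacritic was dropped.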

How?

The rest of my article today teaches you how to do this. Mostly we are in a browser, surfing the ‘net, in all its www glory. After my “liberal indoctrination” (aka #RootsTech 2012), I have switched browsers from Mozilla Firefox to Google’s Chrome. I did this to await the promised “microdata” technology that will improve my genealogical search experience. I am still waiting, Mr Google !!! But while I am waiting, I did find a new browser extension that I am rather fond of that solves my diacritical problem: Virtual Keyboard Interface 1.45. I just double-click in a text field and a keyboard pops up:

Just double-click on a text field, say at Ancestry.com. Notice the virtual keyboard has a drop down (see “Polski“), so I could have picked Русский (for Russian) if I were entering Cyrillic characters into my family tree.

But I want to keep using my browser … OK! I used to prepare an MS Word document, or maybe a Wordpad document, with just the diacriticals I need (say Polish, Russian, and Hebrew); then I could copy & paste them from that editor into my browser or computer application as needed. A bit tedious, and how did I create those diacritical characters anyway?

I use Character Map in Windows and Character Palette or Keyboard Viewer on the Mac:

Now if I use one of these apps, then I can forgo the Wordpad document (of special chars) altogether and just copy/paste from these to generate my diacritical characters.

What I would like to see from web 2.0 pages and websites is what Logan Kleinwaks did on his WONDERFUL GenealogyIndexer.org website. Give us a keyboard widget like Logan’s, please ! What does a near perfect solution look like …

Logan has thoughtfully provided ENglish, HEbrew, POlish, HUngarian, ROmanian, DEutsch (German), Slavic, and RUssian characters. Why is it only nearly perfect? Logan, may I please have a SHIFT (CAPITAL) key on the BKSP / ENTER line for uppercase characters? That’s it [I know it is probably a tedious bit of work to do this].

Beyond ASCII ?

The title said beyond ASCII. So is everything we have spoken about. ASCII is a standard that is essentially a typewriter keyboard, plus the extra keys (ex. Backspace, Enter, Ctrl-F, etc.) that do special things on a computer. So what is beyond ASCII? Hebrew characters (ℵ), Chinese/Japanese glyphs (串), Cyrillic (Я), Polish slashed-L (Ł), or Dingbats (❦ – Floral Heart). You can now enter any of these beyond-ASCII characters (UNICODE) in any program with the above suggestions.
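If you are curious where these characters live in UNICODE, a few lines of Python will tell you. This is only a small sketch using Python's standard unicodedata module; the describe helper is my own name for illustration.

```python
import unicodedata

def describe(ch):
    """Return the code point, the character, and its official UNICODE name."""
    return f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}"

# The beyond-ASCII examples from the paragraph above.
for ch in "ℵ串ЯŁ❦":
    print(describe(ch))

# ASCII stops at code point 127; anything higher is beyond ASCII.
print("Elijasz".isascii())   # every letter is below 128
print("Żabiec".isascii())    # the dotted Z is not
```

Any character whose code point is 128 or higher needs one of the tools above (or a copy/paste) to get into your family tree.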

Programmer Jargon – others proceed with caution …

The above are all UNICODE characters. UTF-8 can encode every UNICODE code point (the code space has room for about 1.1 Million, of which only a fraction are assigned so far) as a sequence of one to four 8bit bytes (called octets; because UTF-8 is a stream of single bytes, it is not concerned with big/little endianness). In fact, UTF-8‘s first 128 characters are an exact 1:1 mapping of ASCII, making any ASCII text valid UTF-8. In fact, more than half of all web pages out on the WWW (‘Net) are encoded with UTF-8. Makes sense that our gedcom files are too! In fact UTF-8 can have that byte-order-mark (BOM) at the front of our gedcom or not and it is still UTF-8. In fact the UNICODE standard neither requires nor recommends a byte order mark for UTF-8 [see Chapter 2 of UNICODE] at the beginning of a file. So please FamilySearch remove the BOM from the GEDCOM standard.
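Here is a short Python sketch of those claims, for anyone who wants to see the octets. The helper names (utf8_hex, decode_utf8) are mine, and the "0 HEAD" line is just a stand-in for the start of a gedcom file.

```python
import codecs

def utf8_hex(ch):
    """Show the UTF-8 octets of one character as hex, e.g. 'c5 bb'."""
    return ch.encode("utf-8").hex(" ")

def decode_utf8(data):
    """Decode UTF-8 bytes, silently skipping a leading BOM if one is present."""
    return data.decode("utf-8-sig")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
print("Elijasz".encode("utf-8") == "Elijasz".encode("ascii"))

# Beyond ASCII, a character takes two, three, or four octets.
for ch in "ZŻЯ串":
    print(ch, utf8_hex(ch))

# With or without the BOM, the decoded text is the same.
plain = "0 HEAD\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain
print(decode_utf8(with_bom) == decode_utf8(plain))
```

Python's "utf-8-sig" codec is handy for this: it strips the three BOM bytes when they are there and behaves like plain UTF-8 when they are not.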

If FamilySearch properly defines the newline character in the gedcom grammar [see Chapter 5, specifically 5.8 of UNICODE], then there is nothing in the HEAD tag that would be unreadable to a program written in, say, Java (whose strings are UTF-16 and can represent any UNICODE character, using surrogate pairs for those beyond U+FFFF), unless there is an invalid character, which then makes the gedcom invalid. Every character in the HEAD tag is actually defined within 8bit ascii, which can be read as UTF-8, so you could use any computer language that is at least UTF-8 compliant to read/parse the HEAD tag (which has the CHAR tag and its value that defines the character set). Everything in the HEAD tag, with the exception of the BOM, is within the 8bit ascii character set. Using UTF-8 as a default encoding to read the HEAD will work even if there is a BOM.
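To make the idea concrete, here is a rough Python sketch of reading the CHAR value out of the HEAD before deciding how to decode the rest of the file. The function name declared_charset and the sample gedcom lines are hypothetical, and I am assuming the usual "level tag value" layout of gedcom lines. To be safe with any stray bytes above 127 in the HEAD, the sketch decodes each line as Latin-1, which never fails, instead of strict UTF-8. It does not handle a gedcom saved as UTF-16.

```python
import codecs

def declared_charset(data):
    """Return the value of the CHAR line inside the HEAD record, or None.

    `data` is the raw bytes of a gedcom file (hypothetical helper, a sketch).
    """
    # Skip a UTF-8 byte-order-mark if the file starts with one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    in_head = False
    # bytes.splitlines() splits on CR, LF, and CRLF only.
    for raw in data.splitlines():
        # Latin-1 maps every byte to a character, so this cannot raise.
        parts = raw.decode("latin-1").strip().split(None, 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            if in_head:
                break            # the HEAD record is over
            in_head = (tag == "HEAD")
        elif in_head and level == "1" and tag == "CHAR":
            return parts[2].strip() if len(parts) > 2 else None
    return None

# Made-up sample file, with a BOM in front, for illustration only.
sample = codecs.BOM_UTF8 + (
    "0 HEAD\n1 SOUR Demo\n1 CHAR UTF-8\n"
    "0 @I1@ INDI\n1 NAME Jan /Wątroba/\n0 TRLR\n"
).encode("utf-8")

charset = declared_charset(sample)
print(charset)
print(sample.decode("utf-8-sig") if charset == "UTF-8" else "other encoding")
```

Once the CHAR value is known, the whole file can be decoded a second time with the encoding it names.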

Posted on February 28, 2012 at 5:51 pm in Data, Databases, GEDCOM, genealogia, Genealogy, Internet, Jewish, Musings, Names, Polish, Russian

Tags: Apps, ASCII, Genealogy, UNICODE, UTF16, UTF8, Widgets

4 Responses to “Dying For Diacriticals … Beyond ASCII — #HowTo, #Genealogy, #Polish”

Louis Kessler (@louiskessler)
February 28, 2012 at 7:54 pm

“UTF-8 can have that byte-order-mark (BOM) at the front of our gedcom or not and it is still UTF-8 ” – I didn’t know that. Thanks.

David A Knight
February 29, 2012 at 2:35 am

Small correction.

ASCII is 7 bit, not 8. UTF-8 will fail if it encounters a character > 127 in HEAD as that value indicates that the character is more than 1 octet in UTF-8. This could very easily happen with NAME, CORP, ADDR etc. in HEAD prior to the CHAR tag informing you of the actual character set in use.

The 8 bit character sets you will normally see are ANSI with one of the various codepages.

[Editor –
David,
Yes, you are correct. Thanks for helping me out. ASCII is indeed a 7 bit (0-127 decimal) ANSI standard. When I wrote 8bit Ascii, I was meaning the ISO-8859-1/2 code pages (or Windows-1250), which provide the diacriticals by using the 8th bit to map the characters to codes 128-255 decimal.

–Stanczyk ]
