20130502

GEDCOM musings and rambles (no rants)

GEDCOM


After several partially successful attempts to write a GEDCOM C++ library, I've slowly run into enough design problems to suggest my approach is flawed. Taking what I thought was the obvious direction (top-down), I created top-level objects for the major record types found in a GEDCOM file. Starting at the highest level:
  • 0 «Header»
  • 0 «Submission_Record»
  • 0 «Record»
  • 0 TRLR
Where Record expands to:
  • n «FAM_RECORD»
  • n «INDIVIDUAL_RECORD»
  • n «MULTIMEDIA_RECORD»
  • n «NOTE_RECORD»
  • n «REPOSITORY_RECORD»
  • n «SOURCE_RECORD»
  • n «SUBMITTER_RECORD»
And as an example the FAM_RECORD expands to:
  • n @<XREF:FAM>@ FAM
    • +1 RESN <RESTRICTION_NOTICE>
    • +1 «FAMILY_EVENT_STRUCTURE»
    • +1 HUSB @<XREF:INDI>@
    • +1 WIFE @<XREF:INDI>@
    • +1 CHIL @<XREF:INDI>@
    • +1 NCHI <COUNT_OF_CHILDREN>
    • +1 SUBM @<XREF:SUBM>@
    • +1 «LDS_SPOUSE_SEALING»
    • +1 REFN <USER_REFERENCE_NUMBER>
      • +2 TYPE <USER_REFERENCE_TYPE>
    • +1 RIN <AUTOMATED_RECORD_ID>
    • +1 «CHANGE_DATE»
    • +1 «NOTE_STRUCTURE»
    • +1 «SOURCE_CITATION»
    • +1 «MULTIMEDIA_LINK»
Were this all, it would have been a successful strategy. However, note the presence of the «SOMETHING» references. These are why GEDCOM is referred to as a Lineage-Linked document form. Any item shown that way links to another sub-record, which may itself be a mix of primitives and higher-level forms. As an illustration of the problem, consider the NOTE_STRUCTURE and CHANGE_DATE links. Interestingly enough, each NOTE_STRUCTURE link contains a CHANGE_DATE link. Even more fun, each CHANGE_DATE link contains a NOTE_STRUCTURE link.
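To make the cycle concrete, here is a minimal C++ sketch of what the one-class-per-structure approach runs into. The names NoteStructure, ChangeDate and nesting_depth are hypothetical, made up for this illustration, and are not taken from any existing library. Neither type can hold the other by value, so a forward declaration and pointers are needed just to get the pair to compile.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct NoteStructure;  // forward declaration: ChangeDate refers to it below

// Hypothetical class for the «CHANGE_DATE» substructure.
struct ChangeDate {
    std::string date;                                   // e.g. "2 MAY 2013"
    std::vector<std::unique_ptr<NoteStructure>> notes;  // nested «NOTE_STRUCTURE» items
};

// Hypothetical class for the «NOTE_STRUCTURE» substructure.
struct NoteStructure {
    std::string text;                    // the note text itself
    std::unique_ptr<ChangeDate> change;  // optional nested «CHANGE_DATE»
};

// Counts how many levels a note / change-date chain descends.
// A bare note counts as 1; a note with a change date adds 1 for the
// change date plus the deepest note nested inside it.
std::size_t nesting_depth(const NoteStructure& note) {
    std::size_t depth = 1;
    if (note.change) {
        std::size_t deepest = 0;
        for (const auto& inner : note.change->notes) {
            deepest = std::max(deepest, nesting_depth(*inner));
        }
        depth += 1 + deepest;
    }
    return depth;
}
```

Two structures are manageable this way, but every additional «SOMETHING» reference that loops back adds another forward declaration and another ownership decision, which is roughly where my class hierarchy started to creak.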

This ramble is a kind of thinking on paper exercise to allow stating the problem and hopefully deriving a solution.

That said, my current thinking (hopefully box escaping) lies in reconsidering the nature of the Lineage-Linked format. In my initial rush to create a hierarchy of objects, I believe I missed the obvious. The clue lies in the word 'linked'. The entire form can (and hopefully should) be thought of as a linked list. In a kind of lispish format, the highest level can be thought of this way:


  • (header)(link)
  • (submission_record)(link)
  • (record)(link)
  • (trlr)
Essentially the idea is to flatten the entire form into a single data type: a list. Each list has a type and data. Data may be terminal or a list of lists. If terminal it is a simple string. If a list of lists, it is a simple collection of the basic data type. (type)(link) all the way down.
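
As a first stab at what that single data type might look like in C++, here is a minimal sketch. The names Node and to_lisp are hypothetical, and the parenthesised output is only my guess at a lispish rendering, not anything defined by GEDCOM. One assumption to flag: a node keeps both a string value and a child list at once, because lines such as REFN above carry a value and also have sub-lines (TYPE). It also assumes C++17, which allows std::vector to be declared with a type that is still incomplete.

```cpp
#include <string>
#include <vector>

// Hypothetical single data type: every item in the file is a Node.
struct Node {
    std::string type;            // the tag, e.g. "HEAD", "FAM", "NOTE"
    std::string value;           // terminal data: a simple string (may be empty)
    std::vector<Node> children;  // a list of lists: zero or more Nodes

    // A node with no children is a terminal.
    bool is_terminal() const { return children.empty(); }
};

// Renders a node and everything beneath it in a lispish form:
// (TYPE value (CHILD ...) (CHILD ...))
std::string to_lisp(const Node& node) {
    std::string out = "(" + node.type;
    if (!node.value.empty()) {
        out += " " + node.value;
    }
    for (const Node& child : node.children) {
        out += " " + to_lisp(child);
    }
    return out + ")";
}
```

With this shape the NOTE_STRUCTURE / CHANGE_DATE cycle stops being a type problem: both are just Nodes with different type strings, and what may legally nest inside what becomes a validation question rather than something the class hierarchy has to encode.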

I'm going to wander off and play with this idea—I'll return to this white board when I learn more about what I am thinking here…