Is there a GEDCOM parser written in Python? [closed]

https://stackoverflow.com/questions/1919593

20-09-2019

Question

GEDCOM is a standard for exchanging genealogical data.

I've found parsers written in C, Perl, Ruby, and even Factor, but none so far written in Python. The closest I've come is the file libgedcom.py from the GRAMPS project, but it is so full of references to GRAMPS modules that it is not usable for me.

I just want a simple standalone GEDCOM parser library written in Python. Does this exist?

Solution

A few years ago I wrote a simplistic GEDCOM to XML translator in Python as part of a larger project. I found that dealing with the GEDCOM data in an XML format was much easier (especially when the next step involved XSLT).
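To illustrate why the XML form can be convenient, here is a minimal sketch that queries such a file with the standard library's xml.etree.ElementTree. The sample document is hand-written with invented data, and it assumes one element per GEDCOM tag, nested by level, with NAME split into forename and surname; the function name list_individuals is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (invented data), assumed to match the shape a
# GEDCOM-to-XML translator might emit: one element per tag, nested by level.
SAMPLE = """<gedcom>
<INDI id="I1"><NAME><forename>John</forename><surname>Smith</surname></NAME>
<BIRT><DATE><year>1850</year></DATE></BIRT></INDI>
<INDI id="I2"><NAME><forename>Mary</forename><surname>Jones</surname></NAME></INDI>
</gedcom>"""


def list_individuals(xml_text):
    """Return (id, forename, surname, birth_year) tuples, one per INDI element."""
    root = ET.fromstring(xml_text)
    people = []
    for indi in root.findall("INDI"):
        people.append((
            indi.get("id"),
            indi.findtext("NAME/forename"),
            indi.findtext("NAME/surname"),
            indi.findtext("BIRT/DATE/year"),  # None when no birth year is present
        ))
    return people


if __name__ == "__main__":
    for person in list_individuals(SAMPLE):
        print(person)
```

Simple path expressions like these replace the level bookkeeping that raw GEDCOM lines would otherwise need.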

I don't have the code online at the moment, so I've pasted the module into this message. This works for me; no guarantees. Hope this helps though.

import re
import sys
from xml.sax.saxutils import escape

fn = sys.argv[1]

# Input is read as cp437 (change this if your file uses another encoding);
# the XML output is written as UTF-8 next to the input file.
ged = open(fn, encoding="cp437")
xml = open(fn + ".xml", "w", encoding="utf8")
xml.write("""<?xml version="1.0"?>\n""")
xml.write("<gedcom>")
sub = []  # stack of currently open tags; its length is the current level
for s in ged:
    s = s.strip()
    # A GEDCOM line is: level, optional @id@, tag, optional data
    m = re.match(r"(\d+) (@(\w+)@ )?(\w+)( (.*))?", s)
    if m is None:
        print("Error: unmatched line:", s)
        continue  # skip the line instead of failing on m.group() below
    level = int(m.group(1))
    xref_id = m.group(3)
    tag = m.group(4)
    data = m.group(6)
    # Close every element that is at this level or deeper
    while len(sub) > level:
        xml.write("</%s>\n" % (sub[-1]))
        sub.pop()
    if level != len(sub):
        print("Error: unexpected level:", s)
    sub += [tag]
    if xref_id is not None:
        xml.write("<%s id=\"%s\">" % (tag, xref_id))
    else:
        xml.write("<%s>" % (tag))
    if data is not None:
        # Data that is a pointer to another record (@id@) is written as the bare id
        m = re.match(r"@(\w+)@", data)
        if m:
            xml.write(m.group(1))
        elif tag == "NAME":
            # Names look like "Forename /Surname/"
            m = re.match(r"(.*?)/(.*?)/$", data)
            if m:
                xml.write("<forename>%s</forename><surname>%s</surname>" % (escape(m.group(1).strip()), escape(m.group(2))))
            else:
                xml.write(escape(data))
        elif tag == "DATE":
            # Dates look like "[day] [month] year"; anything else is written verbatim
            m = re.match(r"(((\d+)?\s+)?(\w+)?\s+)?(\d{3,})", data)
            if m:
                if m.group(3) is not None:
                    xml.write("<day>%s</day><month>%s</month><year>%s</year>" % (m.group(3), m.group(4), m.group(5)))
                elif m.group(4) is not None:
                    xml.write("<month>%s</month><year>%s</year>" % (m.group(4), m.group(5)))
                else:
                    xml.write("<year>%s</year>" % m.group(5))
            else:
                xml.write(escape(data))
        else:
            xml.write(escape(data))
# Close whatever is still open at the end of the file
while len(sub) > 0:
    xml.write("</%s>" % sub[-1])
    sub.pop()
xml.write("</gedcom>\n")
ged.close()
xml.close()
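
If XML is not wanted, the same level-stack idea can be sketched as a small standalone function that builds nested dictionaries instead. This is only an illustration under assumptions, not part of the module above: the function name parse_gedcom_lines, the dictionary keys, and the sample data are all invented for the example, and it raises on malformed lines rather than printing a message.

```python
import re

# Line pattern: level, optional @id@, tag, optional data.
LINE_RE = re.compile(r"(\d+) (?:@(\w+)@ )?(\w+)(?: (.*))?$")


def parse_gedcom_lines(lines):
    """Parse GEDCOM lines into a list of nested record dicts (hypothetical layout)."""
    records = []  # top-level (level 0) records
    stack = []    # stack[i] is the most recently opened node at level i
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError("unmatched line: %r" % line)
        level = int(m.group(1))
        if level > len(stack):
            raise ValueError("unexpected level: %r" % line)
        node = {
            "id": m.group(2),
            "tag": m.group(3),
            "data": m.group(4),
            "children": [],
        }
        del stack[level:]  # close nodes at this level or deeper
        if stack:
            stack[-1]["children"].append(node)
        else:
            records.append(node)
        stack.append(node)
    return records


# Invented sample data for illustration only.
SAMPLE = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 MAR 1850
0 TRLR
"""

if __name__ == "__main__":
    for record in parse_gedcom_lines(SAMPLE.splitlines()):
        print(record["tag"], record["id"], len(record["children"]))
```

Keeping the parser as a function over an iterable of lines means it works equally on an open file object or on a list of strings in a test.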

OTHER TIPS

I've taken the code from mwhite's answer, extended it a bit (OK, more than just a bit), and posted it on GitHub: http://github.com/dijxtra/simplepyged. I welcome suggestions about what else to add :-)

I know this thread is pretty old, but I found it in my searches, along with this project: https://github.com/madprime/python-gedcom/

The source is squeaky clean and very functional.

A general-purpose GEDCOM parser in Python is linked from http://ilab.cs.byu.edu/cs460/2006w/assignments/program1.html

You could use the SWIG tool to include C libraries through the native language interface. You'll have to make calls against the C API from within Python, but the rest of your code can be pure Python.

It may sound a bit daunting, but once you get things set up, using the two together won't be bad. There may be some quirks depending on how the C library was written, but you'd have to deal with some no matter which option you used.

Another basic parser for the GEDCOM 5.5 format: https://github.com/rootsdev/python-gedcom-parser

Licensed under: CC-BY-SA with attribution
