Question

Basically, I'm looking for an efficient way (in terms of coding effort) to present a list of pairs of dicts in human-readable form, in Python 2.7.

I have two lists of OrderedDicts. Each dict is a record of book data (title, author, etc.). One list has messy data (typos etc.), the other has tidy data. I'm using difflib.SequenceMatcher to find the closest tidy title for each untidy one. That works nicely.
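For readers who want to picture the matching step, here is a minimal sketch of one way it might be done. The helper name best_match, the 'title' key, the lower-casing and the sample records are all assumptions made for illustration; they are not taken from the original code.

```python
from collections import OrderedDict
from difflib import SequenceMatcher


def best_match(untidy, tidy_records, key='title'):
    """Return [ratio, untidy record, closest tidy record], compared on `key`."""
    scored = [
        (SequenceMatcher(None, untidy[key].lower(), tidy[key].lower()).ratio(), tidy)
        for tidy in tidy_records
    ]
    # Pick the tidy record with the highest similarity ratio.
    ratio, tidy = max(scored, key=lambda pair: pair[0])
    return [ratio, untidy, tidy]


# Hypothetical sample records, for illustration only.
tidy_books = [
    OrderedDict([('title', 'Moby Dick'), ('author', 'Herman Melville')]),
    OrderedDict([('title', 'Middlemarch'), ('author', 'George Eliot')]),
]
untidy_books = [
    OrderedDict([('title', 'Mobby Dik'), ('author', 'Melville, H.')]),
]

# One [ratio, untidy, tidy] entry per untidy record.
match = [best_match(book, tidy_books) for book in untidy_books]
```

Each entry pairs an untidy record with its closest tidy record, plus the similarity ratio so a reviewer can see the weakest matches first if the list is sorted.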

It gives me a list of pairs of dicts, namely each untidy dict paired with its closest matching tidy one. Those pairs need to be reviewed, pair by pair, by humans. So I want to output each pair to the screen, showing the untidy and the tidy dict side by side, each in its own panel. Each dict may have a varying number of additional fields, e.g. co-author, publisher, date, etc.

difflib.HtmlDiff doesn't really do what I want.

Exporting to Excel (via CSV) is not ideal, because the data isn't flat (one row in Excel would have a different number of fields than another). The same goes for Google Refine, which I think is more oriented towards tabular data.

Call me lazy, but Tkinter or XML/HTML seem to be overkill. It's just a one-off exercise.

I'm not familiar at all with JSON or YAML; maybe I should look there?

Any better suggestions?

I have this hunch that I haven't found the right search terms yet.


Solution

What I had to output was a list of 3-item lists. Each 3-item list contains a match score and two ordered dicts: the record whose title needs correcting and the record with the best-matching title, both with additional info such as author, shelfmark, etc.

I went for output in YAML, because it's advertised as human-readable and human-editable. I have no user feedback on that yet, but creating the output file was really easy (once you take the time to read the PyYAML documentation).

import codecs
import yaml
.
.

with codecs.open('Lit_titles_match.yml', 'w', 'utf-8-sig') as m:
    # match is a list of lists, each holding one float and two dicts.
    # dump_all writes each item of match as its own YAML document.
    m.write(yaml.dump_all(match, default_flow_style=False))
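If installing PyYAML is not an option, a plain-text layout built with only the standard library might also do for on-screen review. The following is a rough sketch, not what I actually used: the function name side_by_side, the column width and the sample pair are assumptions made for illustration.

```python
from __future__ import print_function
from collections import OrderedDict


def side_by_side(ratio, untidy, tidy, width=36):
    """Format one matched pair as two text columns, one field per row."""
    # Keep field order: untidy keys first, then keys only the tidy dict has.
    keys = list(untidy) + [k for k in tidy if k not in untidy]
    lines = ['match ratio: %.2f' % ratio,
             '%-*s | %s' % (width, 'UNTIDY', 'TIDY')]
    for key in keys:
        # A field missing from one side leaves that column blank.
        left = '%s: %s' % (key, untidy[key]) if key in untidy else ''
        right = '%s: %s' % (key, tidy[key]) if key in tidy else ''
        lines.append('%-*s | %s' % (width, left, right))
    return '\n'.join(lines)


# Hypothetical matched pair in the same [float, dict, dict] shape as above.
pair = [
    0.78,
    OrderedDict([('title', 'Mobby Dik'), ('author', 'Melville, H.')]),
    OrderedDict([('title', 'Moby Dick'), ('author', 'Herman Melville'),
                 ('publisher', 'Harper')]),
]

report = side_by_side(*pair)
print(report)
```

Values longer than the column width are not truncated or wrapped here, so long titles will push the right-hand column out of line; that may or may not matter for a one-off review.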
Licensed under: CC-BY-SA with attribution