Question

I've been looking online for a method to sort the kind of data I have (LDIF files), but I haven't found quite what I'm after. There are already programs out there to accomplish this sorting, but they fail with extremely large data sets. Well, for me extremely large is around 2 GB worth of these blocks, and this exhausts memory when using the ldifsort.pl script even if I have 6 GB RAM available and several more GB of swap. So I'm hoping to write a program that will store the blocks of data to the hard drive, sort the keys in memory, and then reassemble the blocks in sorted order. And I'd like to use python3 as I'm trying to learn that language. So if anyone has suggestions on basic strategy or on specific ways to do this with python3, I'd really appreciate the help.

I have large text files containing LDAP data, basically in the (much simplified) form of:

dn: Subscriber=UniqueName1@domain.com;RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE

dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur

dn: Subscriber=UniqueName2@domain.com;RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE

dn: ProfileID=UniqueName2@domain.com;Subscriber=UniqueName2;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur

Each subscriber has three more blocks that are associated with it (my example code only shows one other block associated with the subscriber), and I would like to keep all four blocks together after the sorting is complete.

So if I read the dn's in this order (the data associated with the dn's is hidden for brevity):

dn: Subscriber=UniqueName2@domain.com;RestOfTree=node
dn: ProfileID=UniqueName2@domain.com;Subscriber=UniqueName2;RestOfTree=node
dn: Subscriber=UniqueName4@domain.com;RestOfTree=node
dn: ProfileID=UniqueName4@domain.com;Subscriber=UniqueName4;RestOfTree=node
dn: Subscriber=UniqueName1@domain.com;RestOfTree=node
dn: Subscriber=UniqueName3@domain.com;RestOfTree=node
dn: ProfileID=UniqueName3@domain.com;Subscriber=UniqueName3;RestOfTree=node
dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node

I would want the output to be:

dn: Subscriber=UniqueName1@domain.com;RestOfTree=node
dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node
dn: Subscriber=UniqueName2@domain.com;RestOfTree=node
dn: ProfileID=UniqueName2@domain.com;Subscriber=UniqueName2;RestOfTree=node
dn: Subscriber=UniqueName3@domain.com;RestOfTree=node
dn: ProfileID=UniqueName3@domain.com;Subscriber=UniqueName3;RestOfTree=node
dn: Subscriber=UniqueName4@domain.com;RestOfTree=node
dn: ProfileID=UniqueName4@domain.com;Subscriber=UniqueName4;RestOfTree=node

One thought I had was to use sqlite3 to store the data as python reads it, then sort the keys in python, then use queries to extract the data again from sqlite and write the data to a file. But I'm afraid the time spent searching for the keys in sqlite will be excessive. Then I thought I could sort the data in sqlite while I'm inserting the data, but it seems that sqlite doesn't support this, and I don't know if there is another database system that does.

Any help or direction is appreciated.

Thanks to Zach for his suggestion of just using GNU sort instead of a database system. Here is the solution I developed with his help.

awk -f ldifformatter.awk LDAP-data-files*.ldif | sort -t \| -k1 | sed '1d;s/|/\n/g' > sorted.txt

where ldifformatter.awk exchanges all newlines with "|" except for the top level dn's, which get used for sorting.

Thanks, Rusty

Was it helpful?

Solution 2

The command-line sort utility can sort extremely large text files without reading them entirely into memory (at least, the GNU version can). However, to use it you will have to reformat the data so that each record (everything that should be kept together) appears on one line. If the records looked something like this:

dn: Subscriber=UniqueName1@domain.com;RestOfTree=node1|groups: 1|permissions: 1|IsActive: FALSE|Barring: TRUE||dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node1|groups: 1|permissions: 1|ServiceProfile: Lemur

then sort -t \| -k1 will do the job.

You can write a program in Python that streams the data into a temporary file in the appropriate format, invokes sort using subprocess.check_call, and then restores the original format. Use tmpfile.NamedTemporaryFile to create the temporary file.

OTHER TIPS

You should not sort your data in memory. You could use merge sort.

Guido van Rossum wrote an article about the same problem — Sorting a million 32-bit integers in 2MB of RAM using Python. There are code examples in this article.

I wonder if SQLite is really not up to the task. But anyway, you could use an external sorting algorithm, e.g. Mergesort, to keep the memory usage low.

http://en.wikipedia.org/wiki/External_sorting

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top