Question

I need a program that reads the contents of a file and writes to another file only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and its contents may or may not actually match that encoding.

I know it's a mess, but that's the data I have to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.

I was thinking of reading the characters with the wide-character (wchar_t) functions, handling them as integers, and discarding all the values that fall outside some valid range. Is this the optimal solution?

Also, what's the most efficient way to read and write files in C/C++?

EDIT: The problem is not the I/O operations; that part of the question is only meant as extra help towards an even quicker program. The real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.

Solution

UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain byte patterns do NOT occur, such as C0, C1, and F5 through FF (other restrictions depend on the byte's position within a sequence).

It is reasonably simple in C (sorry, I don't speak Python) to code something that just does an fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut and paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
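
Something along these lines would do it. This is only a rough sketch, not the exact routines I use; the function names, the 1 MiB chunk size, and the carry-over handling are arbitrary choices. It follows the sequence table from RFC 3629, copying well-formed sequences and dropping everything else:

```c
/* utf8_filter.c -- sketch: copy only well-formed UTF-8 sequences from one
 * file to another, dropping invalid bytes.
 * Build: cc -O2 -o utf8_filter utf8_filter.c
 * Usage: ./utf8_filter infile outfile
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the length (1-4) of the valid UTF-8 sequence starting at p, or 0
 * if the bytes do not form one. n is how many bytes are available. */
static size_t utf8_sequence_length(const unsigned char *p, size_t n)
{
    if (n == 0) return 0;
    unsigned char b = p[0];

    if (b <= 0x7F) return 1;                                  /* ASCII */

    if (b >= 0xC2 && b <= 0xDF) {                             /* 2-byte */
        if (n >= 2 && p[1] >= 0x80 && p[1] <= 0xBF) return 2;
        return 0;
    }
    if (b >= 0xE0 && b <= 0xEF) {                             /* 3-byte */
        unsigned char lo = 0x80, hi = 0xBF;
        if (b == 0xE0) lo = 0xA0;       /* no overlong forms    */
        if (b == 0xED) hi = 0x9F;       /* no UTF-16 surrogates */
        if (n >= 3 && p[1] >= lo && p[1] <= hi &&
            p[2] >= 0x80 && p[2] <= 0xBF) return 3;
        return 0;
    }
    if (b >= 0xF0 && b <= 0xF4) {                             /* 4-byte */
        unsigned char lo = 0x80, hi = 0xBF;
        if (b == 0xF0) lo = 0x90;       /* no overlong forms     */
        if (b == 0xF4) hi = 0x8F;       /* nothing past U+10FFFF */
        if (n >= 4 && p[1] >= lo && p[1] <= hi &&
            p[2] >= 0x80 && p[2] <= 0xBF &&
            p[3] >= 0x80 && p[3] <= 0xBF) return 4;
        return 0;
    }
    return 0;   /* 0x80-0xC1 and 0xF5-0xFF can never start a sequence */
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s in out\n", argv[0]); return 1; }
    FILE *in  = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "wb");
    if (!in || !out) { perror("fopen"); return 1; }

    enum { BUFSZ = 1 << 20 };               /* 1 MiB chunks */
    unsigned char *buf = malloc(BUFSZ + 4); /* + room for carried-over bytes */
    if (!buf) { perror("malloc"); return 1; }
    size_t carry = 0;                       /* bytes kept from the last chunk */

    for (;;) {
        size_t got = fread(buf + carry, 1, BUFSZ, in);
        size_t len = carry + got;
        if (len == 0) break;

        size_t i = 0;
        while (i < len) {
            size_t sl = utf8_sequence_length(buf + i, len - i);
            if (sl > 0) {
                /* A real version would batch runs of valid bytes
                 * into one fwrite instead of writing per sequence. */
                fwrite(buf + i, 1, sl, out);
                i += sl;
            } else if (got > 0 && len - i < 4) {
                break;                      /* maybe truncated: retry next chunk */
            } else {
                i++;                        /* genuinely invalid byte: drop it */
            }
        }
        carry = len - i;
        if (carry) memmove(buf, buf + i, carry);
        if (got == 0) break;                /* EOF: leftovers were invalid */
    }
    free(buf); fclose(in); fclose(out);
    return 0;
}
```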

A C program will rapidly become I/O bound, so the suggestions about I/O will apply if you want ultimate performance, but direct byte inspection like this will be hard to beat if you do it right. UTF-8 is also nice in that you can find sequence boundaries even if you start in the middle of the file, which lends itself nicely to parallel algorithms.
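
To illustrate the boundary point: continuation bytes always have the form 10xxxxxx (0x80 to 0xBF), so a worker handed an arbitrary offset only has to skip forward past them to reach the next byte that can legally start a sequence. A sketch (the function name is made up):

```c
#include <stddef.h>

/* Advance from an arbitrary offset to the next position that can start a
 * UTF-8 sequence. In valid UTF-8 this skips at most 3 bytes; runs of stray
 * continuation bytes are skipped too and would be rejected by the
 * validator anyway. */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t offset)
{
    size_t i = offset;
    while (i < len && (buf[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

Each of N workers can then be given a nominal start of k * filesize / N, resynced with something like this, and told to stop where the next worker's resynced start begins.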

If you build your own, watch out for a BOM (byte order mark) that may appear at the start of some files.
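
The UTF-8 BOM is the byte sequence EF BB BF; a quick check at the very start of the data is enough (sketch only, whether you keep or strip it is up to you):

```c
#include <stddef.h>

/* Return nonzero if the buffer begins with a UTF-8 byte order mark. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```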

Links

http://en.wikipedia.org/wiki/UTF-8 a nice, clear overview with a table showing the valid bit patterns.

https://www.rfc-editor.org/rfc/rfc3629 the RFC describing UTF-8.

http://www.unicode.org/ homepage of the Unicode Consortium.

OTHER TIPS

In my opinion your best bet is to parallelize. If you can parallelize the cleaning and process many pieces of content simultaneously, the whole job will be much more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
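
As a rough illustration of the multithreading side (my own sketch, not tied to any framework; clean_range below is just a placeholder for the real per-chunk validation work), splitting a buffer across POSIX threads could look like this:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define NTHREADS 4

struct job { const unsigned char *start; size_t len; size_t valid; };

/* Stand-in for the real per-chunk work (validate UTF-8 and write out). */
static void clean_range(struct job *j)
{
    for (size_t i = 0; i < j->len; i++)
        if (j->start[i] < 0x80) j->valid++;   /* placeholder: count ASCII */
}

static void *worker(void *arg) { clean_range(arg); return NULL; }

int main(void)
{
    const unsigned char data[] = "example buffer standing in for a mapped file";
    size_t len = sizeof data - 1;

    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        size_t begin = len * t / NTHREADS;       /* would be resynced to a   */
        size_t end   = len * (t + 1) / NTHREADS; /* UTF-8 boundary in reality */
        jobs[t].start = data + begin;
        jobs[t].len   = end - begin;
        jobs[t].valid = 0;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    size_t total = 0;
    for (int t = 0; t < NTHREADS; t++) total += jobs[t].valid;
    printf("valid bytes: %zu of %zu\n", total, len);
    return 0;
}
```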

I would look at memory-mapped files. I know this from the Microsoft world; I'm not sure whether it exists on Unix etc., but it likely does.

Basically, you open the file and point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file you could map perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx

This, I would think, should be the fastest way to perform such big I/O, but you would need to test it in order to say for sure.
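
For what it's worth, the calls behind those two links boil down to something like the following sketch. Error handling is stripped, "input.bin" is a placeholder, and a real terabyte-sized job would map smaller views rather than the whole file at once:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open the file, create a read-only mapping, and map a view of it. */
    HANDLE file = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    const unsigned char *data =
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);   /* whole file */

    /* ... scan data[0 .. size.QuadPart-1] with the UTF-8 validator ... */
    printf("mapped %lld bytes\n", size.QuadPart);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```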

HTH, good luck!

Unix/Linux and any other POSIX-compliant OSes support memory mapping (mmap) too.
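
For completeness, the POSIX version is just as short (again only a sketch, with error handling left out and a placeholder file name):

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file read-only; it then behaves like a byte array. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* ... scan data[0 .. st.st_size-1] with the UTF-8 validator ... */
    printf("mapped %lld bytes\n", (long long)st.st_size);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```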

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow