Question

I need a program that reads the contents of a file and writes to another file only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and its contents may or may not actually match that encoding.

I know it's a mess, but that's the data I have to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.

I was thinking of reading the characters with the wide-character (wchar_t) functions, handling them as integers, and discarding all the values that fall outside some valid range. Is this the optimal solution?

Also, what's the most efficient way to read and write files in C/C++?

EDIT: The problem is not the I/O operations; that part of the question is only meant as extra help towards an even quicker program. The real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.

Solution

UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain byte patterns do NOT occur, such as C0, C1, and F5 through FF (other restrictions depend on the byte's position within a sequence).

It is reasonably simple in C (sorry, I don't speak Python) to code something that just does an fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut and paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
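
Something along these lines would do it. This is only a rough sketch, not the exact routines I use; the function names, the 1 MiB chunk size, and the carry-over handling are arbitrary choices. It follows the sequence table from RFC 3629, copying well-formed sequences and dropping everything else:

```c
/* utf8_filter.c -- sketch: copy only well-formed UTF-8 sequences from one
 * file to another, dropping invalid bytes.
 * Build: cc -O2 -o utf8_filter utf8_filter.c
 * Usage: ./utf8_filter infile outfile
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the length (1-4) of the valid UTF-8 sequence starting at p, or 0
 * if the bytes do not form one. n is how many bytes are available. */
static size_t utf8_sequence_length(const unsigned char *p, size_t n)
{
    if (n == 0) return 0;
    unsigned char b = p[0];

    if (b <= 0x7F) return 1;                                  /* ASCII */

    if (b >= 0xC2 && b <= 0xDF) {                             /* 2-byte */
        if (n >= 2 && p[1] >= 0x80 && p[1] <= 0xBF) return 2;
        return 0;
    }
    if (b >= 0xE0 && b <= 0xEF) {                             /* 3-byte */
        unsigned char lo = 0x80, hi = 0xBF;
        if (b == 0xE0) lo = 0xA0;       /* no overlong forms    */
        if (b == 0xED) hi = 0x9F;       /* no UTF-16 surrogates */
        if (n >= 3 && p[1] >= lo && p[1] <= hi &&
            p[2] >= 0x80 && p[2] <= 0xBF) return 3;
        return 0;
    }
    if (b >= 0xF0 && b <= 0xF4) {                             /* 4-byte */
        unsigned char lo = 0x80, hi = 0xBF;
        if (b == 0xF0) lo = 0x90;       /* no overlong forms     */
        if (b == 0xF4) hi = 0x8F;       /* nothing past U+10FFFF */
        if (n >= 4 && p[1] >= lo && p[1] <= hi &&
            p[2] >= 0x80 && p[2] <= 0xBF &&
            p[3] >= 0x80 && p[3] <= 0xBF) return 4;
        return 0;
    }
    return 0;   /* 0x80-0xC1 and 0xF5-0xFF can never start a sequence */
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s in out\n", argv[0]); return 1; }
    FILE *in  = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "wb");
    if (!in || !out) { perror("fopen"); return 1; }

    enum { BUFSZ = 1 << 20 };               /* 1 MiB chunks */
    unsigned char *buf = malloc(BUFSZ + 4); /* + room for carried-over bytes */
    if (!buf) { perror("malloc"); return 1; }
    size_t carry = 0;                       /* bytes kept from the last chunk */

    for (;;) {
        size_t got = fread(buf + carry, 1, BUFSZ, in);
        size_t len = carry + got;
        if (len == 0) break;

        size_t i = 0;
        while (i < len) {
            size_t sl = utf8_sequence_length(buf + i, len - i);
            if (sl > 0) {
                /* A real version would batch runs of valid bytes
                 * into one fwrite instead of writing per sequence. */
                fwrite(buf + i, 1, sl, out);
                i += sl;
            } else if (got > 0 && len - i < 4) {
                break;                      /* maybe truncated: retry next chunk */
            } else {
                i++;                        /* genuinely invalid byte: drop it */
            }
        }
        carry = len - i;
        if (carry) memmove(buf, buf + i, carry);
        if (got == 0) break;                /* EOF: leftovers were invalid */
    }
    free(buf); fclose(in); fclose(out);
    return 0;
}
```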

A C program will rapidly become I/O bound, so the suggestions about I/O will apply if you want ultimate performance, but direct byte inspection like this will be hard to beat if you do it right. UTF-8 is also nice in that you can find sequence boundaries even if you start in the middle of the file, which lends itself nicely to parallel algorithms.
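
To illustrate the boundary point: continuation bytes always have the form 10xxxxxx (0x80 to 0xBF), so a worker handed an arbitrary offset only has to skip forward past them to reach the next byte that can legally start a sequence. A sketch (the function name is made up):

```c
#include <stddef.h>

/* Advance from an arbitrary offset to the next position that can start a
 * UTF-8 sequence. In valid UTF-8 this skips at most 3 bytes; runs of stray
 * continuation bytes are skipped too and would be rejected by the
 * validator anyway. */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t offset)
{
    size_t i = offset;
    while (i < len && (buf[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

Each of N workers can then be given a nominal start of k * filesize / N, resynced with something like this, and told to stop where the next worker's resynced start begins.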

If you build your own, watch out for a BOM (byte order mark) that may appear at the start of some files.
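
The UTF-8 BOM is the byte sequence EF BB BF; a quick check at the very start of the data is enough (sketch only, whether you keep or strip it is up to you):

```c
#include <stddef.h>

/* Return nonzero if the buffer begins with a UTF-8 byte order mark. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```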

Links

http://en.wikipedia.org/wiki/UTF-8 a nice, clear overview with a table showing the valid bit patterns.

https://www.rfc-editor.org/rfc/rfc3629 the RFC describing UTF-8.

http://www.unicode.org/ homepage of the Unicode Consortium.

OTHER TIPS

In my opinion your best bet is to parallelize. If you can parallelize the cleaning and process many pieces of content simultaneously, the whole job will be much more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
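
As a rough illustration of the multithreading side (my own sketch, not tied to any framework; clean_range below is just a placeholder for the real per-chunk validation work), splitting a buffer across POSIX threads could look like this:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define NTHREADS 4

struct job { const unsigned char *start; size_t len; size_t valid; };

/* Stand-in for the real per-chunk work (validate UTF-8 and write out). */
static void clean_range(struct job *j)
{
    for (size_t i = 0; i < j->len; i++)
        if (j->start[i] < 0x80) j->valid++;   /* placeholder: count ASCII */
}

static void *worker(void *arg) { clean_range(arg); return NULL; }

int main(void)
{
    const unsigned char data[] = "example buffer standing in for a mapped file";
    size_t len = sizeof data - 1;

    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        size_t begin = len * t / NTHREADS;       /* would be resynced to a   */
        size_t end   = len * (t + 1) / NTHREADS; /* UTF-8 boundary in reality */
        jobs[t].start = data + begin;
        jobs[t].len   = end - begin;
        jobs[t].valid = 0;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    size_t total = 0;
    for (int t = 0; t < NTHREADS; t++) total += jobs[t].valid;
    printf("valid bytes: %zu of %zu\n", total, len);
    return 0;
}
```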

I would look at memory-mapped files. I know this from the Microsoft world; I'm not sure whether it exists on Unix etc., but it likely does.

Basically, you open the file and point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file you could map perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx

This, I would think, should be the fastest way to perform such big I/O, but you would need to test it in order to say for sure.
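
For what it's worth, the calls behind those two links boil down to something like the following sketch. Error handling is stripped, "input.bin" is a placeholder, and a real terabyte-sized job would map smaller views rather than the whole file at once:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open the file, create a read-only mapping, and map a view of it. */
    HANDLE file = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    const unsigned char *data =
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);   /* whole file */

    /* ... scan data[0 .. size.QuadPart-1] with the UTF-8 validator ... */
    printf("mapped %lld bytes\n", size.QuadPart);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```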

HTH, good luck!

Unix/Linux and any other POSIX-compliant OSes support memory mapping (mmap) too.
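
For completeness, the POSIX version is just as short (again only a sketch, with error handling left out and a placeholder file name):

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file read-only; it then behaves like a byte array. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* ... scan data[0 .. st.st_size-1] with the UTF-8 validator ... */
    printf("mapped %lld bytes\n", (long long)st.st_size);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```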

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow