Question

I need to convert text files' character encodings without hogging the server's memory. The input file is user-configured, and its size isn't limited.

Would it be more efficient to wrap the Unix iconv command using exec() (which I'd rather avoid, although I already use it in the application for other file operations), or should I read the file line by line and output it into another file?

I'm thinking of working this way:

$in = fopen("in.txt", "r");
$out = fopen("out.txt", "w+");
while(($line = fgets($in, 4096)) !== false) {
    $converted = iconv($charset["in"], $charset["out"], $line);
    fwrite($out, $converted);
}
rename("out.txt", "in.txt");

Is there a better approach to convert the file quickly and efficiently? I suspect this might be rather CPU intensive, but then iconv itself is an expensive operation, so I'm not sure I can keep it from taxing the server much anyway.

Thanks!


Solution

Alright, thanks for the input. I did "my homework" based on it and got the following results, working with a 50 MB sample of actual CSV data:

First, iterating over the file using PHP:

$in = fopen("a.txt", "r");
$out = fopen("p.txt", "w+");

$start = microtime(true);

while(($line = fgets($in)) !== false) {
    $converted = iconv("UTF-8", "EUC-JP//TRANSLIT", $line);
    fwrite($out, $converted);
}

$elapsed = microtime(true) - $start;
echo "<br>Iconv took $elapsed seconds\r\n";


Iconv took 2.2817220687866 seconds
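
Since the original concern was memory rather than speed, here is a hedged sketch of the same loop with a peak-memory readout added via PHP's memory_get_peak_usage(); the p2.txt output name is just an example, the rest mirrors the benchmark above:

$in = fopen("a.txt", "r");
$out = fopen("p2.txt", "w+");

$start = microtime(true);

// same streaming conversion as above
while (($line = fgets($in)) !== false) {
    fwrite($out, iconv("UTF-8", "EUC-JP//TRANSLIT", $line));
}

fclose($in);
fclose($out);

$elapsed = microtime(true) - $start;
// real peak memory of the PHP process, in MB; this should stay flat regardless of file size
$peak = memory_get_peak_usage(true) / 1048576;
echo "<br>Iconv took $elapsed seconds, peak memory " . round($peak, 2) . " MB\r\n";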

Roughly 2.3 seconds is not so bad, I guess. So I tried the exact same approach in bash, so it wouldn't have to load the whole file but would iterate over each line instead (which might not be exactly what happens, as I understand from what Lajos Veres answered). Indeed, this method wasn't exactly efficient (the CPU was under heavy load the whole time). Also, the output file is smaller than the other two, although after a quick look it seems the same, so I must have made a mistake in the bash script; that shouldn't have such an effect on performance anyway, though:

#!/bin/bash
echo "" > b.txt
time echo $(
    while read line
    do
        # note: the unquoted $line lets the shell collapse whitespace,
        # which is probably why b.txt came out smaller than the other outputs
        echo $line |iconv -f utf-8 -t EUC-JP//TRANSLIT >> b.txt
    done < a.txt
)

real    9m40.535s
user    2m2.191s
sys     3m18.993s

And then the classic approach, which I would have expected to hog the memory. However, checking the CPU/memory usage, it didn't seem to take any more memory than the other approaches, making it the winner:

#!/bin/bash
time echo $(
    iconv -f utf-8 -t EUC-JP//TRANSLIT a.txt -o b2.txt
)

real    0m0.256s
user    0m0.195s
sys     0m0.060s

I'll try to get a bigger file sample to test the two more efficient methods and make sure memory usage doesn't become significant. However, the result seems obvious enough to conclude that a single pass over the whole file in bash is the most efficient (I didn't try that in PHP, as I believe loading an entire file into an array/string in PHP is never a good idea).
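
For reference, PHP can also stream the whole file through a conversion filter without loading it into memory, using stream_copy_to_stream() together with the built-in convert.iconv.* stream filter. This is only a sketch, assuming the filter name syntax below is accepted by your PHP build; the b3.txt output name is made up for the example:

$in = fopen("a.txt", "r");
$out = fopen("b3.txt", "w");

// convert UTF-8 to EUC-JP as data is written to the output stream
stream_filter_append($out, 'convert.iconv.UTF-8/EUC-JP', STREAM_FILTER_WRITE);

// copy the input to the output in internal chunks; no full-file buffering in userland
stream_copy_to_stream($in, $out);

fclose($in);
fclose($out);

This keeps everything in a single PHP process while still avoiding a full read into a string, though it wasn't part of the benchmarks above.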

Other tips

fyi: http://sourceware.org/bugzilla/show_bug.cgi?id=6050

Anyway, the OS needs to read the whole file sooner or later, which means that while it reads, the cache's LRU-like purging logic will free up memory; LRU means that older pages will probably be thrown out first.

You can't be 100% sure how your system will tolerate this. You would have to isolate this process on separate hardware or in a virtualized environment, but those solutions can also create bottlenecks.

Prudent testing is probably the most cost-effective way. It's usually not the implementation that causes most of the headaches, but rather the expected workload.

I mean, processing lots of gigabyte-sized files in hundreds of parallel threads is totally different from a few files per day.

This is a benchmark of iconv in PHP versus iconv in a Unix bash script.

For PHP ->

<?php
// read a.txt and keep only its first line
$text = file('a.txt');
$text = $text[0];
$start = microtime(true);
// convert that single line 1000 times
for ($i = 0; $i < 1000; $i++) {
    $str = iconv("UTF-8", "EUC-JP", $text);
}
$elapsed = microtime(true) - $start;
echo "<br>Iconv took $elapsed seconds\r\n";
?>

Based on my server's results:

root@ubuntu:/var/www# php benc.php
<br>Iconv took 0.0012350082397461 seconds
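
The loop above converts only the first line of a.txt (see the comments). A variant that benchmarks the whole file instead might look like the sketch below; it loads the entire contents into a string, so it is only reasonable while the sample still fits in memory:

<?php
// benchmark iconv over the whole file contents rather than a single line
$text = file_get_contents('a.txt');
$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $str = iconv("UTF-8", "EUC-JP", $text);
}
$elapsed = microtime(true) - $start;
echo "<br>Iconv took $elapsed seconds\r\n";
?>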

For Unix Bash ->

#!/bin/bash
# note: begin_time only takes the nanosecond field of the current second,
# while end_time is the full epoch time in milliseconds, so the value printed
# below is essentially a timestamp rather than an elapsed time
begin_time=$(($(date +%N)/10000000))
# {0..1000} runs 1001 iterations
for i in {0..1000}
do
    iconv -f utf-8 -t EUC-JP a.txt -o b.txt
done
end_time=$(($(date +%s%N)/1000000))
total_time=$((end_time-begin_time))
echo ${total_time}

Based on my server's results:

root@ubuntu:/var/www# bash test.sh
1380410308211

The results clearly show that you get more performance from iconv in PHP in terms of CPU usage. They also suggest that the winner uses less memory as well as less CPU.

Note: if you run these, you should create an a.txt file in the same directory as the *.sh and *.php files.

Why don't you just do it via the system rather than reading the file chunk by chunk, given that iconv exists on your system?

system(sprintf('iconv -f %s -t %s %s > %s',
                $charset['in'], $charset['out'], "in.txt", "out.txt"));
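
Since the charsets and file names come from user configuration, a slightly more defensive sketch of the same call would escape the shell arguments first (the $charset array and file names are the same as in the question):

$cmd = sprintf('iconv -f %s -t %s %s > %s',
    escapeshellarg($charset['in']),
    escapeshellarg($charset['out']),
    escapeshellarg('in.txt'),
    escapeshellarg('out.txt'));
// $status holds the exit status of the command (0 on success)
system($cmd, $status);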