Question

I wrote a simple script that is supposed to read an entire directory, strip the HTML tags out of each file's contents, and then write the resulting plain text into one output file.

I have 8 GB of memory and plenty of available virtual memory; when I run this, more than 5 GB of RAM is free. The largest file in the directory is 3.8 GB.

The script is

file_count = 1
File.open("allscraped.txt", 'w') do |out1|
    for file_name in Dir["allParts/*.dat"] do
        puts "#{file_name}#:#{file_count}"
        file_count += 1
        File.open(file_name, "r") do |file|
            source = ""
            tmp_src = ""
            counter = 0
            file.each_line do |line|
                scraped_content = line.gsub(/<.*?\/?>/, '')
                tmp_src << scraped_content
                if (counter % 10000) == 0
                    tmp_src = tmp_src.gsub( /\s{2,}/, "\n" )
                    source << tmp_src
                    tmp_src = ""
                    counter = 0
                end
                counter += 1
            end
            source << tmp_src.gsub( /\s{2,}/, "\n" )
            out1.write(source)
            break
        end
    end
end

The full error code is:

realscraper.rb:33:in `block (4 levels) in <main>': failed to allocate memory (NoMemoryError)
        from realscraper.rb:27:in `each_line'
        from realscraper.rb:27:in `block (3 levels) in <main>'
        from realscraper.rb:23:in `open'
        from realscraper.rb:23:in `block (2 levels) in <main>'
        from realscraper.rb:13:in `each'
        from realscraper.rb:13:in `block in <main>'
        from realscraper.rb:12:in `open'
        from realscraper.rb:12:in `<main>'

Line 27 is file.each_line do |line| and line 33 is source << tmp_src. The failing file is the largest one (3.8 GB). What is the problem here? Why am I getting this error even though I have enough memory, and how can I fix it?


Solution

The problem is on these two lines:

source << tmp_src
source << tmp_src.gsub( /\s{2,}/, "\n" )

As you read the large file you keep appending to source, so it slowly grows into a multi-gigabyte string. For the 3.8 GB file, source itself ends up close to 3.8 GB, and whenever Ruby has to grow the string's buffer it may briefly need both the old buffer and the new, larger one at once, so peak usage can approach twice the string's size. That is how the allocation can fail even with more than 5 GB of RAM free.

The simplest fix is to not build up the temporary source string at all, and instead write the results directly to the output file. Just replace those two lines with these:

# source << tmp_src
out1.write(tmp_src)

# source << tmp_src.gsub( /\s{2,}/, "\n" )
out1.write(tmp_src.gsub( /\s{2,}/, "\n" ))

This way you never hold more than one 10,000-line batch in memory at a time, so memory use stays small and roughly constant, and it should also be faster since Ruby no longer has to copy a multi-gigabyte string on every append.
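
For completeness, the whole loop without the source buffer could look something like this. This is a sketch built from your original code, not a drop-in guarantee: I have also moved the counter increment so the very first line doesn't trigger an immediate flush, and dropped the break, which stopped processing after the first file and looks like a testing leftover.

file_count = 1
File.open("allscraped.txt", 'w') do |out1|
    for file_name in Dir["allParts/*.dat"] do
        puts "#{file_name}#:#{file_count}"
        file_count += 1
        File.open(file_name, "r") do |file|
            tmp_src = ""
            counter = 0
            file.each_line do |line|
                # strip the HTML tags from this line
                tmp_src << line.gsub(/<.*?\/?>/, '')
                counter += 1
                # every 10000 lines, squeeze runs of whitespace and
                # write the batch straight out instead of keeping it
                if counter == 10000
                    out1.write(tmp_src.gsub(/\s{2,}/, "\n"))
                    tmp_src = ""
                    counter = 0
                end
            end
            # write whatever is left from the last partial batch
            out1.write(tmp_src.gsub(/\s{2,}/, "\n"))
        end
    end
end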

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow