Question

I am collecting full HTML from a service that provides access to a very large collection of blogs and news websites. I am checking the HTML as it comes (in real-time) to see if it contains some keywords. If it contains one of the keywords, I am writing the HTML to a text file to store it.

I want to do this for a week. Therefore I am collecting a large amount of data. Testing the program for 3 minutes yielded a text file of 100MB. I have 4 TB of space, and I can't use more than this.

Also, I don't want the text files to become too large, because I assume they'll become un-openable.

What I am proposing is to open a text file, and write HTML to it, frequently checking its size. If it becomes bigger than, let's say 200MB, I close the text file and open another. I also need to keep a running log of how much space I've used in total, so that I can make sure that I don't get close to 4 TB.

The question I have at this point is how to check the size of the text file before the file has been closed (using FileWriter.close()). Is there a function for this or should I count the number of characters written to the file and use that to estimate the file size?

A separate question: are there ways of minimising the amount of space my text files take up? I am working in Java.


Solution

Create a writer which counts the number of characters written and use that to wrap your OutputStreamWriter.

[EDIT] Note: The correct way to save text to a file is:

new BufferedWriter( new OutputStreamWriter( new FileOutputStream( file ), encoding ) );

The encoding is important; it's usually "UTF-8".

This chain gives you two places where you can inject your wrapper: You can wrap the writer to get the number of characters or the inner OutputStream to get bytes written.
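For checking disk usage, the byte count is the relevant one, because in encodings such as UTF-8 a single character can take more than one byte. A minimal sketch of the byte-counting wrapper follows; `ByteCountingOutputStream` and `CountingDemo` are hypothetical names made up for illustration, not JDK classes, and the demo writes to an in-memory stream where real code would use a `FileOutputStream`.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the JDK): counts every byte passed through.
class ByteCountingOutputStream extends FilterOutputStream {
    private long count;

    ByteCountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate the whole array at once instead of byte by byte.
        out.write(b, off, len);
        count += len;
    }

    long getCount() {
        return count;
    }
}

class CountingDemo {
    // Returns the number of bytes the text occupies once encoded as UTF-8.
    // The ByteArrayOutputStream stands in for a FileOutputStream.
    static long encodedSize(String text) {
        ByteCountingOutputStream counter =
                new ByteCountingOutputStream(new ByteArrayOutputStream());
        try (Writer writer = new BufferedWriter(
                new OutputStreamWriter(counter, StandardCharsets.UTF_8))) {
            writer.write(text);
            // Push buffered characters through the encoder so the count is current.
            writer.flush();
            // The count is readable while the writer is still open.
            return counter.getCount();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Five characters, but six bytes: the accented letter needs two bytes in UTF-8.
        System.out.println(encodedSize("h\u00e9llo"));
    }
}
```

Without the `flush()`, the count can lag behind what has been written, because both the `BufferedWriter` and the encoder hold data back in their own buffers.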

Other tips

To minimize space, you could zip your text files with Java. Why not add each file to a zip archive after closing it? After zipping, you can check the size of the archive to track your cumulative storage consumption.

HTML compresses well, with a high compression ratio. Consider using a GZIPOutputStream to "minimise the amount of space" your text files take up.
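As a rough illustration of the GZIPOutputStream suggestion, the sketch below compresses a string of repetitive markup in memory and compares the sizes. `GzipDemo` and the sample markup are assumptions made up for the example; real HTML will compress differently, so treat any ratio you see as indicative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipDemo {
    // Encodes the text as UTF-8 and gzips it; returns the compressed bytes.
    // In real use the ByteArrayOutputStream would be a FileOutputStream.
    static byte[] gzip(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(sink), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the writer finished the gzip stream, so the output is complete.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            html.append("<div class=\"post\"><p>Some repeated markup</p></div>\n");
        }
        int raw = html.toString().getBytes(StandardCharsets.UTF_8).length;
        int packed = gzip(html.toString()).length;
        System.out.println("raw bytes=" + raw + ", gzip bytes=" + packed);
    }
}
```

If you also track disk usage, count the bytes on the compressed side (between the GZIPOutputStream and the file), since that is what actually lands on disk. The gzip stream buffers internally, so that count is only exact after a flush or close.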

In continuation of Aaron's answer: you can use CountingOutputStream (available in libraries such as Apache Commons IO and Guava; it is not part of the JDK). Just wrap your FileOutputStream in a CountingOutputStream and you will be able to tell how many bytes you have already written.
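Once bytes are being counted, the rotation scheme proposed in the question (close the file at a size limit, stop at a total budget) could look roughly like the sketch below. `RotatingHtmlStore`, `RotationDemo`, the `part-N.html` file names and the tiny limits are all assumptions chosen for illustration. It counts the encoded bytes itself rather than relying on a library class, and the demo leaves its files in a temporary directory.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class: rolls over to a new file at maxFileBytes and
// refuses further writes once maxTotalBytes would be exceeded.
class RotatingHtmlStore implements Closeable {
    private final Path dir;
    private final long maxFileBytes;
    private final long maxTotalBytes;
    private long fileBytes;   // bytes written to the current file
    private long totalBytes;  // bytes written to all files so far
    private int fileCount;
    private OutputStream out;

    RotatingHtmlStore(Path dir, long maxFileBytes, long maxTotalBytes) {
        this.dir = dir;
        this.maxFileBytes = maxFileBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    // Returns false, writing nothing, if the total budget would be exceeded.
    boolean append(String html) throws IOException {
        byte[] data = html.getBytes(StandardCharsets.UTF_8);
        if (totalBytes + data.length > maxTotalBytes) {
            return false;
        }
        // Start a new file if none is open or this record would overflow the
        // current one; a record larger than the limit still gets its own file.
        if (out == null || (fileBytes > 0 && fileBytes + data.length > maxFileBytes)) {
            rotate();
        }
        out.write(data);
        fileBytes += data.length;
        totalBytes += data.length;
        return true;
    }

    private void rotate() throws IOException {
        if (out != null) {
            out.close();
        }
        fileCount++;
        out = new BufferedOutputStream(
                Files.newOutputStream(dir.resolve("part-" + fileCount + ".html")));
        fileBytes = 0;
    }

    int getFileCount() { return fileCount; }

    long getTotalBytes() { return totalBytes; }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}

class RotationDemo {
    // Writes up to ten 40-byte records and returns {files created, bytes written}.
    static long[] run(long maxFileBytes, long maxTotalBytes) {
        String record = "<html><body>keyword match</body></html>\n";
        try {
            Path dir = Files.createTempDirectory("html-store");
            try (RotatingHtmlStore store =
                    new RotatingHtmlStore(dir, maxFileBytes, maxTotalBytes)) {
                for (int i = 0; i < 10; i++) {
                    if (!store.append(record)) {
                        break; // total budget reached
                    }
                }
                return new long[] { store.getFileCount(), store.getTotalBytes() };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = run(100, 1000);
        System.out.println("files=" + result[0] + ", bytes=" + result[1]);
    }
}
```

For the real job the limits would be something like 200 MB per file and a total comfortably below 4 TB. Checking the budget before each write, rather than after, keeps the total from overshooting by one record.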

Did it occur to you to count how many bytes you write to the file?

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class TestFileWriter {

    public static void main(String[] args) throws IOException {
        File file = new File("test.txt");
        // try-with-resources closes the writer even if a write fails
        try (FileWriter fileWriter = new FileWriter(file)) {
            for (int i = 0; i < 1000; i++) {
                fileWriter.write("a very long string, a very long string, a very long string, a very long string, a very long string\n");
                if ((i % 100) == 0) {
                    // FileWriter buffers internally; flush so that
                    // length() reflects everything written so far
                    fileWriter.flush();
                    System.out.println("file size=" + file.length());
                }
            }
        }
        // Final size, after the writer has been closed
        System.out.println("file size=" + file.length());
    }

}

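This example demonstrates that, when using a FileWriter, you can obtain the file's size in real time while the writer is still open, by calling File.length(). Note that File.length() only reports bytes that have actually reached the file, so flush the writer first if you need an exact figure. If you want to save space you can zip the stream.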

Apologies for being slightly off-topic:

Does it have to be in Java? Depending on how you get your feed data, this sounds like a job for a fairly simple shell script to me (grep or fgrep for checking for keywords, gzip for compressing...)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow