Question

I'm trying to tokenize a large amount of text in Java. When I say large, I mean entire chapters of books at a time. I wrote the first draft of my code by using a single page from a book and everything worked fine. Now that I'm trying to process entire chapters, things aren't working. It processes part of the chapter correctly and then it just stops.

Below is all of the relevant code:

    File folder = new File(Constants.rawFilePath("eng"));
    FileHelper fileHelper = new FileHelper();
    BPage firstChapter = new BPage();
    BPage firstChapterSpanish = new BPage();
    File[] allFiles = folder.listFiles();
    //read the files into memory
    ArrayList<ArrayList<String>> allPages = new ArrayList<ArrayList<String>>();

    //for the english
    for(int i=0;i<allFiles.length;i++)
    {
        String filePath = Constants.rawFilePath("/eng/metamorph_eng_"+String.valueOf(i)+".txt");
        ArrayList<String> pageToAdd = fileHelper.readFileToMemory(filePath);
        allPages.add(pageToAdd);
    }

    String allPagesAsString = "";

    for(int i=0;i<allPages.size();i++)
    {
        allPagesAsString = allPagesAsString+fileHelper.turnListToString(allPages.get(i));
    }

    firstChapter.setUnTokenizedPage(allPagesAsString);
    firstChapter.tokenize(Languages.ENGLISH);

    folder = new File(Constants.rawFilePath("spa"));
    allFiles = folder.listFiles();
    //for the spanish
    for(int i=0;i<allFiles.length;i++)
    {
        String filePath = Constants.rawFilePath("/eng/metamorph_eng_"+String.valueOf(i)+".txt");
        ArrayList<String> pageToAdd = fileHelper.readFileToMemory(filePath);
        allPages.add(pageToAdd);
    }

    allPagesAsString = "";

    for(int i=0;i<allPages.size();i++)
    {
        allPagesAsString = allPagesAsString+fileHelper.turnListToString(allPages.get(i));
    }

    firstChapterSpanish.setUnTokenizedPage(allPagesAsString);
    firstChapterSpanish.tokenize(Languages.SPANISH);

    fileHelper.writeFile(firstChapter.getTokenizedPage(), Constants.partiallyprocessedFilePath("eng_ch_1.txt"));
    fileHelper.writeFile(firstChapterSpanish.getTokenizedPage(), Constants.partiallyprocessedFilePath("spa_ch_1.txt"));
}

Even though I'm reading all of the files in the directory where I expect my text to be, only the first couple of files are being added to the string that I'm processing. It seems like after a while the code will still run, but it only adds characters to my string up to a certain point.

What do I have to change so that I can process all of my files at once?

Solution

This part

String allPagesAsString = "";

for(int i=0;i<allPages.size();i++)
{
    allPagesAsString = allPagesAsString+
       fileHelper.turnListToString(allPages.get(i));
}

will be really slow if you're concatenating larger strings: every pass through the loop copies everything appended so far into a brand-new String, so the cost grows roughly quadratically with the size of the chapter.

Using a StringBuilder will speed things up a bit:

int expectedBookSize = 10000;
StringBuilder allPagesAsString = new StringBuilder(expectedBookSize);
for(int i=0;i<allPages.size();i++)
{
    allPagesAsString.append(fileHelper.turnListToString(allPages.get(i)));
}
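
Note that expectedBookSize is only a capacity hint; the builder grows automatically if the estimate is too low, so a wrong guess just costs an extra reallocation. Also remember to call allPagesAsString.toString() when you pass the result to setUnTokenizedPage, since a StringBuilder is not itself a String.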

Can't you process one page at a time? That would be the best solution.
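
If you do go that route, here is a minimal sketch of page-at-a-time processing for the English chapter, reusing the helper classes from your question (FileHelper, BPage, Constants, Languages). It assumes getTokenizedPage() returns a String and that writeFile accepts one; adjust the calls to match your actual API.

FileHelper fileHelper = new FileHelper();
File[] engFiles = new File(Constants.rawFilePath("eng")).listFiles();
StringBuilder tokenizedChapter = new StringBuilder();

for (int i = 0; i < engFiles.length; i++)
{
    String filePath = Constants.rawFilePath("/eng/metamorph_eng_" + i + ".txt");

    // read, tokenize and discard one page at a time instead of
    // building the whole untokenized chapter in memory first
    BPage page = new BPage();
    page.setUnTokenizedPage(fileHelper.turnListToString(fileHelper.readFileToMemory(filePath)));
    page.tokenize(Languages.ENGLISH);

    // assumes getTokenizedPage() returns the tokenized text as a String
    tokenizedChapter.append(page.getTokenizedPage());
}

fileHelper.writeFile(tokenizedChapter.toString(), Constants.partiallyprocessedFilePath("eng_ch_1.txt"));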
