Question

I tried searching for sentences with ICEpdf and got the right results most of the time. But I am now facing these problems:

  • The search fails for sentences that start on one line and end on the next. Is there any way to find them? I tried splitting those sentences and searching for the parts separately, but that may cause more problems.

  • Finally, is there any method that tells me the line numbers on which the search key matched? Please help.

Solution

Loop through all the lines in the document and create a list of sentences. Each sentence can be a list of WordText objects. Then search through the list of lists you have created to find your sentence.

Here is some example code (untested) to build the list of lists of WordText objects.

import java.util.ArrayList;
import java.util.List;
import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.graphics.text.LineText;
import org.icepdf.core.pobjects.graphics.text.PageText;
import org.icepdf.core.pobjects.graphics.text.WordText;

List<List<WordText>> sentences = new ArrayList<List<WordText>>();
List<WordText> currentSentence = new ArrayList<WordText>();
Document document = new Document();
// Load the PDF into the document here before reading any text.

// Build sentences
// (depending on the ICEpdf version, getPageText may throw a checked exception)
for (int pageNumber = 0, max = document.getNumberOfPages();
     pageNumber < max; pageNumber++) {
  PageText pageText = document.getPageText(pageNumber);
  List<LineText> pageLines = pageText.getPageLines();
  for (LineText pageLine : pageLines) {
    List<WordText> words = pageLine.getWords();
    for (WordText word : words) {
      // If this is a word, and the last word was not a space,
      // start a new sentence
      if (!word.getText().equals(" ") && currentSentence.size() > 0 &&
          !currentSentence.get(currentSentence.size() - 1).getText().equals(" ")) {
        sentences.add(currentSentence);
        currentSentence = new ArrayList<WordText>();
      }
      // Add word to current sentence
      currentSentence.add(word);
    }
  }
}
// Add the last sentence in, once, after every page has been read
if (!currentSentence.isEmpty()) {
  sentences.add(currentSentence);
}
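
For the two points in the question (matches that wrap onto the next line, and the line numbers of a match), one possible approach is to join the words of all lines into a single string while remembering which line each character came from. The sketch below is only an illustration: it uses plain strings as a stand-in for WordText objects, and the class name CrossLineSearch and the method findPhrase are made up for this example, not part of ICEpdf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ICEpdf. Each inner list holds the
// words of one line, e.g. the getText() value of every WordText in it.
class CrossLineSearch {

    // Returns one {startLine, endLine} pair (1-based) per match of phrase.
    static List<int[]> findPhrase(List<List<String>> lines, String phrase) {
        StringBuilder text = new StringBuilder();
        List<Integer> lineOfChar = new ArrayList<>(); // line number of each char in text

        for (int i = 0; i < lines.size(); i++) {
            for (String word : lines.get(i)) {
                String w = word.trim();
                if (w.isEmpty()) {
                    continue; // skip whitespace-only tokens
                }
                if (text.length() > 0) {
                    // exactly one space between words, also across line breaks
                    text.append(' ');
                    lineOfChar.add(i + 1);
                }
                for (int c = 0; c < w.length(); c++) {
                    text.append(w.charAt(c));
                    lineOfChar.add(i + 1);
                }
            }
        }

        // Normalize whitespace in the search key the same way.
        String needle = String.join(" ", phrase.trim().split("\\s+"));
        List<int[]> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits;
        }
        int from = 0;
        int at;
        while ((at = text.indexOf(needle, from)) >= 0) {
            int end = at + needle.length() - 1;
            hits.add(new int[] { lineOfChar.get(at), lineOfChar.get(end) });
            from = at + 1;
        }
        return hits;
    }
}
```

With the lines "The quick brown" and "fox jumps", searching for "brown fox" would report a match starting on line 1 and ending on line 2. Note that this is a plain substring match, so a key can also match inside a longer word, and words hyphenated across a line break are not rejoined.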

If you need to sort your WordText lists, you can compare the WordText objects by their y values first and then by their x values.
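
A comparator for that ordering might look like the sketch below. It is an assumption-laden illustration: the Word class is a hypothetical stand-in that carries only a text and the x and y of its bounding box (with real WordText objects these values would have to come from the word's bounds), and it assumes the usual PDF coordinate system in which y grows upward, so larger y values come first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in types, not part of ICEpdf.
class WordOrder {

    static final class Word {
        final String text;
        final float x; // left edge of the bounding box
        final float y; // vertical position of the bounding box

        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Top of the page first (y descending, assuming y grows upward),
    // then left to right (x ascending).
    static final Comparator<Word> READING_ORDER =
            Comparator.comparingDouble((Word w) -> w.y)
                    .reversed()
                    .thenComparingDouble(w -> w.x);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Word> sorted(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        copy.sort(READING_ORDER);
        return copy;
    }
}
```

Because the comparison on y is exact, words on the same visual line whose y values differ slightly would be ordered by y rather than by x; rounding y or comparing with a small tolerance may be needed for real documents.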

Licensed under: CC-BY-SA with attribution