Question

I need to determine which pages of a Word document a keyword occurs on. I have some tools that can get me the text of the document, but nothing that tells me which pages the text occurs on. Does anyone have a good starting place for me? I'm using .NET.

Thanks!

edit: Additional constraint: I can't use any of the Interop stuff.

edit2: If anybody knows of stable libraries that can do this, that'd also be helpful. I use Aspose, but as far as I know that doesn't have anything.

Solution

This is how I get the text out. I believe you can set the selection range to a page and then test that text. It might be a little backwards from what you need, but it could be a place to start.

using System;
using System.Windows.Forms; // Clipboard, IDataObject, DataFormats (clipboard access needs an STA thread)

Microsoft.Office.Interop.Word.Application wordApplication = new Microsoft.Office.Interop.Word.Application();
object missing = Type.Missing;
object fileName = @"c:\file.doc";
object objFalse = false;

// Suppress dialogs, then open the file without showing a window (the 12th argument is Visible).
wordApplication.DisplayAlerts = Microsoft.Office.Interop.Word.WdAlertLevel.wdAlertsNone;
Microsoft.Office.Interop.Word.Document doc = wordApplication.Documents.Open(ref fileName, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref objFalse, ref missing, ref missing, ref missing, ref missing);

// I believe you can define a selection range (for example, a single page) here instead of selecting everything.
doc.ActiveWindow.Selection.WholeStory();
doc.ActiveWindow.Selection.Copy();

// Read the copied text back from the clipboard.
IDataObject data = Clipboard.GetDataObject();
string text = data.GetData(DataFormats.Text).ToString();

// Close the document and shut Word down.
doc.Close(ref missing, ref missing, ref missing);
doc = null;

wordApplication.Quit(ref missing, ref missing, ref missing);
wordApplication = null;

OTHER TIPS

How are you defining a page?

If you only count section/hard page breaks, it is complex but doable. If you want to count soft page breaks, the task becomes very difficult and somewhat meaningless. The positions of soft page breaks are computed dynamically at run time and are not stored in the file itself. They depend on a huge number of factors, including the active printer driver (yes, the same file can paginate differently on a different computer), fonts, kerning, line spacing, margins, etc.
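For the hard-break case, here is a minimal sketch of one way to go about it. It assumes your text-extraction tool preserves hard page and section breaks as form feed characters ('\f'), which is how Word represents them in plain text; check that your tool really does this. The class and method names (HardPageSearch, PagesContaining) are made up for illustration. Note that a section break that does not start a new page would be counted as a page boundary here too, so treat the result as an approximation.

```csharp
using System;
using System.Collections.Generic;

public static class HardPageSearch
{
    // Returns the 1-based numbers of the hard-break-delimited "pages" that contain the keyword.
    // Assumption: hard page/section breaks appear in documentText as form feed characters ('\f').
    public static List<int> PagesContaining(string documentText, string keyword)
    {
        var pages = new List<int>();
        if (string.IsNullOrEmpty(documentText) || string.IsNullOrEmpty(keyword))
            return pages;

        // Each chunk is the text between two consecutive hard breaks.
        string[] chunks = documentText.Split('\f');
        for (int i = 0; i < chunks.Length; i++)
        {
            // Case-insensitive substring match within this chunk.
            if (chunks[i].IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                pages.Add(i + 1);
        }
        return pages;
    }
}
```

For example, HardPageSearch.PagesContaining("intro\fkeyword here\fend", "keyword") returns a list containing only 2. Soft page breaks are invisible to this approach, for the reasons given above.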

One crappy way to do this with Aspose is to convert the Word file to a PDF and then grab text on each page.

I don't know anything about the Aspose internals or how they define their soft pages when converting, but this is the best I've got so far.
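A rough sketch of that convert-then-search idea is below. It assumes both Aspose.Words and Aspose.Pdf are referenced; the answer above does not say which library reads the PDF, so using Aspose.Pdf's TextAbsorber is my assumption, and the names PdfPageSearch and PagesContaining are made up for illustration. The page numbers reflect Aspose's own layout, which may not match what Word shows on a given machine.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfPageSearch
{
    // Returns the 1-based PDF page numbers whose extracted text contains the keyword.
    // Assumption: Aspose.Words and Aspose.Pdf are both available and licensed.
    public static List<int> PagesContaining(string wordPath, string keyword)
    {
        var pages = new List<int>();
        using (var pdfStream = new MemoryStream())
        {
            // Aspose.Words lays out the document and renders it to PDF in memory.
            var wordDoc = new Aspose.Words.Document(wordPath);
            wordDoc.Save(pdfStream, Aspose.Words.SaveFormat.Pdf);
            pdfStream.Position = 0;

            using (var pdfDoc = new Aspose.Pdf.Document(pdfStream))
            {
                // Aspose.Pdf page collections are 1-based.
                for (int pageNumber = 1; pageNumber <= pdfDoc.Pages.Count; pageNumber++)
                {
                    // Extract the text of this page only.
                    var absorber = new Aspose.Pdf.Text.TextAbsorber();
                    pdfDoc.Pages[pageNumber].Accept(absorber);

                    if (absorber.Text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                        pages.Add(pageNumber);
                }
            }
        }
        return pages;
    }
}
```

A keyword that is split across a line or page boundary in the rendered PDF may be missed by a plain substring search like this one.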

Thank you for using Aspose.Words.

In the public API we currently have only the "flow-document" information, e.g. paragraphs, tables, lists, etc. Internally, we build a page layout model that has classes like page, block of text, line of text and so on. There are, of course, internal links between the document model and the layout model, so it is possible to find out which page ends where and so on. Making this information available via the public API is (well, still) high on our priority list.

Have you logged your request in the Aspose.Words support forums? We use this info to maintain a voting system and will work on features that get more votes first.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow