Question

I'm trying to open a .doc file and read its content, but I can't find a way to do this without launching MS Word.

I currently have the following code:

Microsoft.Office.Interop.Word.Application app = new Microsoft.Office.Interop.Word.Application();
object nullObject = System.Reflection.Missing.Value;
object file = @"C:\doc.doc";
Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(ref file, ref nullObject, ref nullObject,
         ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject,
         ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject,
         ref nullObject);
doc.ActiveWindow.Selection.WholeStory();
doc.ActiveWindow.Selection.Copy();
IDataObject data = Clipboard.GetDataObject();
string text = data.GetData(DataFormats.Text).ToString();
doc.Close(ref nullObject, ref nullObject, ref nullObject);
app.Quit(ref nullObject, ref nullObject, ref nullObject);

But this launches MS Word. Is there any way to do it without launching Word?


Solution

Two possibilities: either use Microsoft's spec to write your own parser for the .doc format, or use an existing library for the purpose (e.g., from Aspose). Unless you have a couple of spare years to spend on the task, the latter is clearly the correct choice.

OTHER TIPS

Last time I did this (via COM from C++), I recall a 'Visible' property in the Application interface (true=visible).

However, it seems to me that the default was false, so you had to set it to true to make Word appear.

Regardless of whether or not the user can see Word, you will still see winword.exe (or whatever it's called today) in your task manager. I don't think there's a way to access Word through this interface without it launching Word (behind the scenes or not).

If you don't want Word to launch at all, you may have to find another solution.
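As a rough sketch of the hidden-instance approach described above: the code below sets Visible to false explicitly rather than relying on the default, and reads the text through doc.Content.Text instead of going through the clipboard. The class and method names (HiddenWordReader, ReadText) are made up for illustration. It assumes Word is installed and the project references Microsoft.Office.Interop.Word; winword.exe will still be started in the background.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of any library.
static class HiddenWordReader
{
    public static string ReadText(string path)
    {
        object missing = Type.Missing;
        Word.Application app = new Word.Application();
        Word.Document doc = null;
        try
        {
            // Keep the Word window hidden; the process still runs.
            app.Visible = false;

            object file = path;
            object readOnly = true;
            doc = app.Documents.Open(ref file, ref missing, ref readOnly,
                ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing);

            // Range.Text returns the document body as plain text,
            // so the clipboard is left untouched.
            return doc.Content.Text;
        }
        finally
        {
            // Always close the document and quit, otherwise
            // winword.exe is left running after an exception.
            if (doc != null)
            {
                doc.Close(ref missing, ref missing, ref missing);
            }
            app.Quit(ref missing, ref missing, ref missing);
        }
    }
}
```

Calling HiddenWordReader.ReadText(@"C:\doc.doc") should return the same text as the clipboard version in the question, with paragraphs separated by carriage returns.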

Add a reference to the library via Add Reference --> Browse --> Code7248.word_reader.dll

Download the DLL from:

sourceforge.net/p/word-reader/wiki/Home

(A simple .NET Library compatible with .NET 2.0, 3.0, 3.5 and 4.0 for C#. It can currently extract only the raw text from a .doc or .docx file.)

Sample code for a simple C# console application:

using System;
using System.Collections.Generic;
using System.Text;
//add extra namespaces
using Code7248.word_reader;


namespace testWordRead
{
    class Program
    {
        private void readFileContent(string path)
        {
            TextExtractor extractor = new TextExtractor(path);
            string text = extractor.ExtractText();
            Console.WriteLine(text);
        }
        static void Main(string[] args)
        {
            Program cs = new Program();
            // Verbatim string (@) so the backslashes are not treated as escape sequences
            string path = @"D:\Test\testdoc1.docx";
            cs.readFileContent(path);
            Console.ReadLine();
        }
    }
}

It is working fine.
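For .docx files only (not the binary .doc format the question asks about), the raw text can also be extracted without any third-party DLL, because a .docx file is a ZIP archive whose main body is stored in word/document.xml. The following is a minimal sketch, assuming .NET Framework 4.5 or later with references to System.IO.Compression and System.Xml.Linq; the DocxText class is a made-up name, and this is not how the word_reader library above is implemented.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; handles .docx only.
static class DocxText
{
    // WordprocessingML namespace used by elements such as w:p and w:t.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return Extract(fs);
        }
    }

    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found");
            }
            using (Stream s = entry.Open())
            {
                XDocument xml = XDocument.Load(s);
                // One output line per paragraph (w:p); each line joins
                // the text runs (w:t) found inside that paragraph.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

Usage would be along the lines of Console.WriteLine(DocxText.Extract(@"D:\Test\testdoc1.docx")). Headers, footers, footnotes and formatting are ignored, and text inside text boxes may appear twice because those paragraphs are nested inside another paragraph.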

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow