Convert Word doc or docx files into text files?

https://stackoverflow.com/questions/1110409

12-09-2019

Question

I need a way to convert .doc or .docx extensions to .txt without installing anything. Obviously, I also don't want to have to open Word manually to do this; it should run automatically.

I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.

Any suggestions?

Solution

Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.

Here is an example using Win32::OLE:

#!/usr/bin/perl

use strict;
use warnings;

use File::Spec::Functions qw( catfile );

use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;

my $word = get_word();
$word->{Visible} = 0;

my $doc = $word->{Documents}->Open(catfile $ENV{TEMP}, 'test.docx');

$doc->SaveAs(
    catfile($ENV{TEMP}, 'test.txt'),
    wdFormatTextLineBreaks
);

$doc->Close(0);

sub get_word {
    my $word;
    eval {
        $word = Win32::OLE->GetActiveObject('Word.Application');
    };

    die "$@\n" if $@;

    unless(defined $word) {
        $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
            or die "Oops, cannot start Word: ",
                   Win32::OLE->LastError, "\n";
    }
    return $word;
}
__END__

OTHER TIPS

A simple Perl-only solution for docx:

Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)
Use XML::LibXML to parse it.
Then use XML::LibXSLT to transform it into text or HTML format. Search the web to find a nice docx2txt.xsl file :)
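
The same unzip-then-parse idea can be sketched in Python using only the standard library. This is a minimal illustration, not the Perl modules named above: the function name docx_to_text is made up for this example, and it simply joins the w:t text runs of each w:p paragraph instead of applying an XSLT stylesheet. Documents with text boxes (paragraphs nested inside paragraphs) may produce repeated text.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml in a .docx
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the plain text of a .docx given a path or a file-like object.

    Hypothetical helper for illustration: one output line per w:p paragraph.
    """
    # A .docx is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate every text run (w:t) found inside this paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text("input.docx") and writing the result to a .txt file would complete the conversion; formatting, headers, footers, and footnotes are ignored in this sketch.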

Cheers !

For .doc, I've had some success with the Linux command-line tool antiword. It extracts the text from .doc files very quickly and renders indentation well. You can then redirect its output to a text file in bash.
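
For batch use, the antiword call can be wrapped in a small script. The sketch below assumes antiword is installed and on PATH, and that it prints the extracted text to standard output when given a file name; the helper names txt_path_for and doc_to_txt are invented for this example.

```python
import shutil
import subprocess
from pathlib import Path


def txt_path_for(doc_path):
    """Return the sibling .txt path for a document (hypothetical helper)."""
    return Path(doc_path).with_suffix(".txt")


def doc_to_txt(doc_path):
    """Run antiword on doc_path and save its output next to the source file."""
    if shutil.which("antiword") is None:
        raise FileNotFoundError("antiword was not found on PATH")
    # antiword writes the extracted text to stdout; capture it as bytes.
    result = subprocess.run(
        ["antiword", str(doc_path)], capture_output=True, check=True
    )
    out_path = txt_path_for(doc_path)
    out_path.write_bytes(result.stdout)
    return out_path
```

Looping over Path(".").glob("*.doc") and calling doc_to_txt on each entry would convert a whole folder.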

For .docx, I've used the OOXML SDK, as some other users mentioned. It is just a .NET library that makes it easier to work with the OOXML zipped up in a .docx file. There is a lot of metadata that you will want to discard if you are only interested in the text. I see that other people have already written the code: DocXToText.

I have found that Aspose.Words also has a very simple API with great support.

There is also this bash command from commandlinefu.com which works by unzipping the .docx:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

I strongly recommend Aspose.Words if you can use Java or .NET. It can convert between all major text file types without Word installed.

If you have some flavour of Unix installed, you can use the 'strings' utility to find and extract all readable strings from the document. There will be some mess before and after the text you are looking for, but the results will be readable.
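
To show what a strings-style scan does, here is a rough Python approximation. It is not the real utility: the function name readable_strings and the default minimum length of 4 are choices made for this example. It also looks for UTF-16LE runs, on the assumption that binary .doc files often store their text as 16-bit characters, which a plain 8-bit scan would miss.

```python
import re


def readable_strings(data, min_len=4):
    """Return printable runs found in a bytes object, in file order.

    Approximates a strings-style scan for both 8-bit ASCII text and
    UTF-16LE text (each printable character followed by a zero byte).
    """
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    found = [
        (m.start(), m.group().decode("ascii")) for m in ascii_run.finditer(data)
    ]
    found += [
        (m.start(), m.group().decode("utf-16-le")) for m in utf16_run.finditer(data)
    ]
    # Sort by byte offset so the output follows the order in the file.
    return [text for _, text in sorted(found)]
```

Reading a file with open(path, "rb").read() and joining the result with newlines gives output comparable to the command-line tool, including the surrounding mess mentioned above.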

Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet etc. conversions on both Windows and *nix platforms.

You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.

On the OpenOffice::UNO page you will also find a sample Perl scriptlet which opens a document; all you then need to do is export it to txt using the document.storeToURL() method. See a Python example, which can easily be adapted to your Perl needs.
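
If scripting UNO is more than you need, a different route is the office suite's command-line converter. The sketch below swaps in that technique instead of storeToURL(): it assumes a LibreOffice or OpenOffice build whose soffice binary is on PATH and accepts --headless, --convert-to and --outdir. The function names are invented for this example.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the argument list for a headless text export (hypothetical helper)."""
    return [
        soffice,
        "--headless",               # do not open a window
        "--convert-to", "txt:Text", # export with the plain-text filter
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_txt(doc_path, out_dir="."):
    """Run the converter and return the path where the .txt is expected."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".txt")
```

Keeping the command construction separate from the subprocess call makes it easy to point the soffice argument at a full installation path on Windows.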

.doc files saved as WordprocessingML and .docx files (which use an XML format) can have their XML parsed to retrieve the actual text of the document. You'll have to read their specifications to figure out which tags contain readable text.
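
As a starting point, the sketch below treats three run-level elements as readable content: t (text), tab and br. It matches elements by local name so that the same code can be tried on both the Word 2003 XML namespace and the .docx one; that shortcut, the function names, and the choice of elements are assumptions of this example, so check them against the specification for your files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def wordml_text(xml_source):
    """Extract text from a WordprocessingML string, one line per paragraph."""
    root = ET.fromstring(xml_source)
    lines = []
    for para in root.iter():
        if local_name(para.tag) != "p":
            continue
        parts = []
        for run in para.iter():
            if local_name(run.tag) != "r":
                continue
            # Only direct children of a run are inspected, so tab-stop
            # definitions in the paragraph properties are not mistaken for text.
            for child in run:
                name = local_name(child.tag)
                if name == "t":
                    parts.append(child.text or "")
                elif name == "tab":
                    parts.append("\t")
                elif name == "br":
                    parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)
```

Because matching is by local name only, elements named p or r from unrelated namespaces (embedded drawings, for instance) would also be picked up; filtering on the full namespace is the stricter option.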

The method of Sinan Ünür works well.
However, I got some crashes with the files I was converting.

Another method is to use Win32::OLE and Win32::Clipboard as follows:

Open the Word document
Select all the text
Copy it to the clipboard
Write the clipboard content to a txt file
Empty the clipboard and close the Word document

Based on the script given by Sigvald Refsu in http://computer-programming-forum.com/53-perl/c44063de8613483b.htm, I came up with the following script.

Note: I chose to save the txt file with the same basename as the .docx file and in the same folder, but this can easily be changed.

########################################### 
use strict; 
use File::Spec::Functions qw( catfile );
use FindBin '$Bin';
use Win32::OLE qw(in with); 
use Win32::OLE::Const 'Microsoft Word'; 
use Win32::Clipboard; 

my $monitor_word=0; #set 1 to watch MS Word being opened and closed

sub docx2txt {
    ##Note: the path shall be in the form "C:\dir\ with\ space\file.docx"; 
    my $docx_file=shift; 

    #MS Word object
    my $Word = Win32::OLE->new('Word.Application', 'Quit') or die "Couldn't run Word"; 
    #Monitor what happens in MS Word 
    $Word->{Visible} = 1 if $monitor_word; 

    #Open file 
    my $Doc = $Word->Documents->Open($docx_file); 
    with ($Doc, ShowRevisions => 0); #Turn off revision marks 

    #Select the complete document
    $Doc->Select(); 
    my $Range = $Word->Selection();
    with ($Range, ExtendMode => 1);
    $Range->SelectAll(); 

    #Copy selection to clipboard 
    $Range->Copy();

    #Create txt file 
    my $txt_file=$docx_file; 
    $txt_file =~ s/\.docx$/.txt/;
    open(my $text_fh, '>', $txt_file) or die "Error while trying to write to $txt_file ($!)"; 
    print {$text_fh} Win32::Clipboard::Get(), "\n"; 
    close $text_fh; 

    #Empty the Clipboard (to prevent warning about "huge amount of data in clipboard")
    Win32::Clipboard::Set("");

    #Close Word file without saving 
    $Doc->Close({SaveChanges => wdDoNotSaveChanges});

    # Disconnect OLE 
    undef $Word; 
}

Hope this helps.

You can't do it in VBA if you don't want to start Word (or another Office application). Even if you meant VB, you'd still have to start a (hidden) instance of Word to do the processing.

"I need a way to convert .doc or .docx extensions to .txt without installing anything"

for I in *.doc *.docx; do mv "$I" "$(echo "$I" | sed -E 's/\.docx?$/.txt/')"; done

Just joking.

You could use antiword for the older versions of Word documents, and try to parse the xml of the new ones.

With docxtemplater, you can easily get the full text of a Word document (works with docx only).

Here's the code (Node.js):

DocxTemplater=require('docxtemplater');
doc=new DocxTemplater().loadFromFile("input.docx");
result=doc.getFullText();

This is just three lines of code and doesn't depend on any Word instance (it is all plain JS).

Licensed under: CC-BY-SA with attribution
