Question

As required, I am trying to convert doc or docx (Microsoft Word) files to HTML with Apache Tika.

I ended up with the following code, which works fine, but it does not add any style sheet to the resulting HTML.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.StringWriter;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;

    public class DocxConvert {

        public static void main(String[] args) {
            // try-with-resources closes the stream even if opening or parsing fails,
            // which avoids a NullPointerException from calling close() on a null stream
            try (InputStream input = new FileInputStream("f:\\file.doc")) {
                StringWriter sw = new StringWriter();

                // Serialize Tika's SAX events as indented HTML into the StringWriter
                SAXTransformerFactory factory =
                        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
                TransformerHandler handler = factory.newTransformerHandler();
                handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
                handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
                handler.setResult(new StreamResult(sw));

                // Let Tika detect the document type and pick the matching parser
                DefaultDetector detector = new DefaultDetector();
                Metadata metadata = new Metadata();
                Parser parser = new AutoDetectParser(detector);
                parser.parse(input, handler, metadata, new ParseContext());

                System.out.print(sw.toString());
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

Is there any way to add or generate a style sheet for the output? Kindly help!

Solution

I used version 1.6 of Tika and that worked fine for me. Here are the pom dependencies I used.

http://tika.apache.org/download.html

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.6</version>
        </dependency>
    </dependencies>
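
If the generated HTML still has no style sheet, one possible workaround is to post-process the string the question's code prints and insert a reference to your own CSS file. This is only a sketch, not something Tika provides: the class name `StylesheetInjector`, the method `injectStylesheet`, and the file name `style.css` are all made up for illustration, and it assumes the output contains a closing `</head>` tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StylesheetInjector {

    // Matches the closing head tag regardless of case (</head>, </HEAD>, ...)
    private static final Pattern HEAD_CLOSE =
            Pattern.compile("</head>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: inserts a <link> to the given CSS file just before </head>.
    // If no </head> is found, the HTML is returned unchanged.
    static String injectStylesheet(String html, String href) {
        String link = "<link rel=\"stylesheet\" type=\"text/css\" href=\"" + href + "\">";
        Matcher m = HEAD_CLOSE.matcher(html);
        if (!m.find()) {
            return html;
        }
        return html.substring(0, m.start()) + link + html.substring(m.start());
    }

    public static void main(String[] args) {
        // Stand-in for the sw.toString() result of the conversion code
        String html = "<html><head><title>Doc</title></head><body><p>Hello</p></body></html>";
        System.out.println(injectStylesheet(html, "style.css"));
    }
}
```

In the conversion code you would call `injectStylesheet(sw.toString(), "style.css")` before writing the result out. The CSS file itself has to be written by hand, since this does not recover any formatting from the Word document.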

Other tips

You can use unoconv, which requires OpenOffice or LibreOffice. Once installed, it converts doc, docx, xls, and similar files to PDF from the command line on your server. If you want to display the resulting PDF embedded in a page served by Apache or Apache Tomcat, I think pdf.js is a good solution.
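
Since the question is in Java, here is a rough sketch of calling unoconv from Java with `ProcessBuilder`. It assumes `unoconv` is installed and on the `PATH`; the class name `UnoconvExample` and its helper methods are made up for illustration. The `-f` option selects the output format.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class UnoconvExample {

    // Builds the command line: unoconv -f <format> <inputPath>
    static List<String> buildCommand(String format, String inputPath) {
        return Arrays.asList("unoconv", "-f", format, inputPath);
    }

    // Runs unoconv and returns its exit code (0 normally means success).
    // Assumes the unoconv executable can be found on the PATH.
    static int convert(String format, String inputPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(format, inputPath))
                .inheritIO() // show unoconv's own messages in this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java UnoconvExample <file.doc|file.docx>");
            return;
        }
        int exitCode = convert("pdf", args[0]);
        System.out.println("unoconv exited with code " + exitCode);
    }
}
```

unoconv writes the converted file next to the input file by default, so `file.doc` would become `file.pdf` in the same directory.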

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow