Question

With reference to this question, I would like to send an MS Word (.doc) file to a Tika application running as a service. How can I do this?

There is this guide to running Tika in server mode: http://mimi.kaktusteam.de/blog-posts/2013/02/running-apache-tika-in-server-mode/

But I am not sure how to access it from Python: should I use sockets, urllib, or something else?


Solution

For remote access to Tika, there are basically two methods available. One is the Tika JAXRS Server, which provides a full RESTful interface. The other is the simple Tika-App --server mode, which works as a plain network pipe: you write raw document bytes to a TCP port and read the output back.

For production use, you'll probably want the Tika JAXRS Server, as it's more fully featured. For simple testing and getting started, the Tika App in server mode ought to be fine.
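
For the Tika JAXRS Server route, a minimal Python 3 sketch using only urllib might look like the following. It assumes the server is listening on port 9998 and that a PUT of the raw document bytes to the /tika resource returns the extracted content; the function name tika_extract, the URL, and the header values are illustrative assumptions, so check them against the Tika version you are running.

```python
import urllib.request

# Assumed endpoint: adjust host, port and path to match your Tika server.
TIKA_URL = "http://127.0.0.1:9998/tika"

def tika_extract(path, url=TIKA_URL, accept="text/plain", timeout=60):
    """PUT the file at `path` to a Tika server and return the decoded response body."""
    with open(path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            # Generic type, intended to leave format detection to the server
            # (urllib would otherwise label the body as form-encoded).
            "Content-Type": "application/octet-stream",
            "Accept": accept,
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

With a server running, calling tika_extract("test.doc") should return the extracted text as a string; urlopen raises urllib.error.HTTPError if the server rejects the request.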

For the latter, just connect to the port that the Tika App is listening on, stream it your document data, and read the HTML back. For example, in one terminal run:

$ java -jar tika-app-1.3.jar --server --port 1234

Then, in another, do

$ nc 127.0.0.1 1234 < test.pdf

You'll then see the HTML that Tika generated for your test PDF.

From Python, you just need a simple socket call, much as netcat does there: send over the binary data, then read back your result. For example, try something like:

#!/usr/bin/env python3
import socket
import sys

# Where to connect
host = '127.0.0.1'
port = 1234

if len(sys.argv) < 2:
    print("Must give filename")
    sys.exit(1)

filename = sys.argv[1]
print("Sending %s to Tika on port %d" % (filename, port), flush=True)

# Connect to Tika; the socket is closed when the block exits
with socket.create_connection((host, port)) as s:
    # Stream the file to Tika in chunks
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                # EOF
                break
            s.sendall(chunk)

    # Tell Tika we have sent everything
    s.shutdown(socket.SHUT_WR)

    # Read the response until Tika closes the connection
    while True:
        chunk = s.recv(65536)
        if not chunk:
            # EOF
            break
        # recv() returns bytes; write them through unmodified
        sys.stdout.buffer.write(chunk)
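
If you would rather have the result in a variable than on stdout, the same exchange can be wrapped in a function. This is only a sketch under the same assumptions as the script above (Tika-App started with --server --port 1234); the name tika_pipe and the timeout value are hypothetical.

```python
import socket

def tika_pipe(data, host="127.0.0.1", port=1234, timeout=60):
    """Send `data` (bytes) to a Tika-App --server port and return the reply as bytes."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(data)
        # Half-close: signals end of input while keeping the read side open
        s.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = s.recv(65536)
            if not chunk:
                break  # the server closed the connection
            chunks.append(chunk)
    return b"".join(chunks)
```

For a Word document, that would be used along the lines of reading the file in binary mode and passing its bytes to tika_pipe, then decoding the returned bytes before parsing the HTML.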

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow