Are you sure that you absolutely need unoconv for your use case? It is powerful, but since it needs a full-fledged LibreOffice to run, it: 1) is somewhat slow to convert files; 2) is slow to start; 3) uses a lot of RAM; 4) does not scale well.
Why don't you try Apache Tika (which uses Apache POI to parse Microsoft Office formats)? It is somewhat more lightweight and more than good enough for most day-to-day tasks.
You can run PDF files through Tika too, or use python-magic to distinguish between file types and hand PDFs to a separate utility such as pdftotext. Here's a simplified version of what you can use to convert office files to, let's say, text:
import subprocess

import magic  # https://github.com/ahupp/python-magic
from django.db import models

PDFTOTEXT_COMMAND = '/usr/bin/pdftotext'
JAVA_COMMAND = '/usr/bin/java'
TIKA_PATH = '/path/to/tika.jar'

# A trailing '-' makes pdftotext write to stdout instead of a file.
PDFTOTEXT_OPTIONS = ['-']
JAVA_OPTIONS = ['-jar', TIKA_PATH, '--text']

mime = magic.Magic(mime=True)


class UploadedFileModel(models.Model):
    file = models.FileField(upload_to='files/')

    def get_txt(self):
        """Return the extracted text as bytes, or None if the converter failed."""
        path = self.file.path
        # PDFs go to pdftotext; everything else goes to Tika.
        if 'application/pdf' in mime.from_file(path):
            option_list = [PDFTOTEXT_COMMAND, path] + PDFTOTEXT_OPTIONS
        else:
            option_list = [JAVA_COMMAND] + JAVA_OPTIONS + [path]
        pipe = subprocess.Popen(option_list, stdout=subprocess.PIPE)
        txt = pipe.communicate()[0]
        # A non-zero exit code means the conversion failed.
        if pipe.returncode:
            return None
        return txt
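If you want to try the same dispatch logic outside of Django, here is a rough standalone sketch. The function names `build_command` and `extract_text`, the `timeout` parameter, and the paths are my own placeholders, not part of any library. It also assumes the converters emit UTF-8, which may not hold for Tika on every platform.

```python
import subprocess

# Placeholder paths for illustration; adjust them to your system.
PDFTOTEXT_COMMAND = "/usr/bin/pdftotext"
JAVA_COMMAND = "/usr/bin/java"
TIKA_PATH = "/path/to/tika.jar"


def build_command(path, mime_type):
    """Return the argument list for the converter that suits the MIME type."""
    if "application/pdf" in mime_type:
        # The trailing '-' tells pdftotext to write to stdout.
        return [PDFTOTEXT_COMMAND, path, "-"]
    return [JAVA_COMMAND, "-jar", TIKA_PATH, "--text", path]


def extract_text(path, mime_type, timeout=60):
    """Return the extracted text as str, or None if the conversion failed."""
    try:
        result = subprocess.run(
            build_command(path, mime_type),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary is missing or not executable, or the converter hung.
        return None
    if result.returncode:
        return None
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command construction in its own function makes it easy to check which converter would be chosen without actually launching Java or pdftotext.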
P.S.
The error "unoconv: Cannot find a suitable pyuno library and python binary combination" can be related to a broad number of issues. It is impossible to tell for sure without you providing additional information. For example, it could be a problem with paths.
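As a first check on the paths theory, something like the sketch below shows which of the relevant executables the current environment can see. The helper name `locate_tools` and the list of binary names are my guesses at what is worth checking, and I am assuming the `UNO_PATH` environment variable is relevant to your unoconv version, so treat the output as a hint rather than a diagnosis. Run it as the same user and in the same environment as your Django process.

```python
import os
import shutil


def locate_tools(names=("unoconv", "soffice", "libreoffice", "python3")):
    """Map each executable name to its absolute path on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
    # Assumption: unoconv consults UNO_PATH to locate LibreOffice.
    print("UNO_PATH =", os.environ.get("UNO_PATH", "(not set)"))
```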
Be sure to check out the relevant unoconv troubleshooting guides.