Question

I am using the following technologies:

  1. JRuby 1.7.4
  2. Rails 3.2.13
  3. Ubuntu 13.04
  4. DB2 C-Express
  5. Torquebox server 2.3.0

My goal is to make a simple controller which implements the following functions:

  1. Upload text files (MS Word format, Open Office or Libre Office formats)
  2. Perform a full text search on the uploaded files
  3. Display the text files in the browsers as PDFs

I have searched for gems that can help me to achieve that and have the following questions:

  1. What should the column type be for the field that stores the text file? I assume it should be a binary type.
  2. Is it possible to perform full text search using Sunspot? As I have read, it seems to work with fields of type text, not binary.
  3. I read about two gems that would allow me to generate PDFs: the Prawn gem, which offers more flexibility, and PDFKit, which generates PDFs from HTML pages. Can either of these be used to display the text file? I assume I would first have to render the file as HTML somehow, and then use the PDF gem.

Has anyone done something like this and could you point me in the right direction?

Solution

I haven't ever done most of the things in your requirements, but I work quite heavily with a text parser that converts MS Word documents into XML documents. Perhaps I can at least get you started in the right direction for that.

We use a Java library called Apache POI, which makes the DOC -> XML conversion a simple process. Since you're using JRuby, I'd imagine it will be much easier for you to integrate it into your project than it was for us. We're on MRI Ruby, and that was a PITA because we had to include lots of bridges and other junk just to be able to use the .jar files.
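
As a rough illustration of what the JRuby side could look like, here is a minimal sketch that pulls the plain text out of a legacy .doc file with POI's WordExtractor. The method name extract_doc_text and the poi_jar_dir argument are made up for this example, and it assumes the POI jars (including the scratchpad jar, which contains the HWPF classes) have been downloaded into that directory. It only runs under JRuby.

```ruby
# Hypothetical helper: extract plain text from a .doc file via Apache POI.
# Assumes JRuby and a directory holding the POI jars (poi + poi-scratchpad).
def extract_doc_text(path, poi_jar_dir)
  require 'java'
  # Under JRuby, requiring a .jar file adds it to the classpath.
  Dir[File.join(poi_jar_dir, '*.jar')].each { |jar| require jar }

  stream = java.io.FileInputStream.new(path)
  begin
    extractor = org.apache.poi.hwpf.extractor.WordExtractor.new(stream)
    extractor.getText
  ensure
    stream.close
  end
end
```

The returned string could then be stored in a plain text column for searching. Newer .docx files use a different POI class (XWPFWordExtractor), so this sketch covers the old binary format only.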

Personally, I've used the CarrierWave gem to handle file uploading. It's a snap to upload files & attach them to models. You simply use the CarrierWave generator to generate an Uploader class that attaches to a field in a model, configure it to store & process the file based on your specifications, and PROFIT! The docs are great, but I'm happy to help you if you need it. If you need multi-file uploading, I explained in detail how I accomplished it in a different SO post.

Hope that helps!

Other suggestions

To answer your questions:

  1. I would use two columns. One is binary (BLOB) and stores the original document (MS Word or LibreOffice); this is what you would convert into a PDF. The other is for the full-text search; it would be of type TEXT and contain only plain text.
  2. I wouldn't use a gem for full-text search; I would rather use the SQL LIKE operator.
  3. As far as I know, Prawn is the best. You could also check whether some gem converts MS Word or LibreOffice documents directly into PDF.
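
To make the LIKE suggestion a bit more concrete, here is a small sketch of building a "contains" pattern from a user's search term. The names like_pattern, Document and body_text are hypothetical, and the choice of '!' as the escape character is just an assumption for this example. The helper escapes the LIKE wildcards so that a term such as "100%" is matched literally.

```ruby
# Hypothetical helper: turn a raw search term into a LIKE "contains" pattern.
# '!' is used as the escape character (an arbitrary choice for this sketch).
LIKE_ESCAPE = '!'

def like_pattern(term)
  # Escape the escape character itself plus the two LIKE wildcards.
  escaped = term.gsub(/[!%_]/) { |ch| LIKE_ESCAPE + ch }
  "%#{escaped}%"
end

# Possible use in a Rails model (assumes a Document model with a
# plain-text column named body_text):
#
#   Document.where("body_text LIKE ? ESCAPE '!'", like_pattern(params[:q]))
```

Keep in mind that LIKE only does substring matching; it offers no ranking or stemming, which is the trade-off against a search engine such as Sunspot.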

Finally, LibreOffice text documents are simply compressed (ZIP) archives in which the text is stored in an XML file. To extract it, do:

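require 'nokogiri'
require 'shellwords'

# Print content.xml from the archive to stdout; escape the path for the shell.
content = `unzip -cq #{Shellwords.escape(file_path)} content.xml`
doc = Nokogiri::XML(content)
# Nokogiri registers the namespaces declared on the root element,
# so the text: prefix can be used directly in the XPath query.
paragraphs = doc.xpath('//text:p').map(&:content)
text = paragraphs.join(' ')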
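
The paragraphs array extracted above could then be handed to Prawn to build the PDF. The following is only a sketch under that assumption: paragraphs_to_pdf is a made-up helper name, the 8-point spacing is arbitrary, and all original formatting (fonts, tables, images) is lost because only plain text is rendered.

```ruby
# Hypothetical helper: render an array of plain-text paragraphs as a PDF
# and return the PDF data as a string. Requires the prawn gem.
def paragraphs_to_pdf(paragraphs)
  require 'prawn'
  pdf = Prawn::Document.new
  paragraphs.each do |paragraph|
    pdf.text paragraph
    pdf.move_down 8 # small gap between paragraphs
  end
  pdf.render
end

# Possible use in a Rails controller action, to show the PDF in the browser:
#
#   send_data paragraphs_to_pdf(paragraphs),
#             :type => 'application/pdf', :disposition => 'inline'
```

Prawn's built-in fonts cover a limited character set, so documents containing other scripts would likely need a TTF font registered first.
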
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow