Sunspot Solr – Indexing PDF documents

Indexing PDF documents with Sunspot / Solr seemed like a difficult task at first. There are a number of tutorials on the topic, but they seemed overly complicated. In the end, I decided the simplest way to provide search over PDF documents was to:

1) parse the documents locally
2) store the raw PDF text in a field on an ActiveRecord object
3) index that raw text field and make it searchable

This allows us to skip the unnecessary step of having Solr index the PDF document itself. Instead, Solr indexes the raw PDF text, which is really the only thing we need.

The configuration is as follows:

Gemfile

gem 'sunspot_rails'
gem 'pdf-reader'

# sunspot_solr is only needed in development; we're using websolr on Heroku
group :development do
  gem 'sunspot_solr'
end

 

Model Configuration

searchable do
  text :pdf_contents
end

 

Parsing and automatic indexing

require "open-uri"
require "pdf-reader"

# If you are downloading the file (URI.open needs Ruby 2.5+);
# otherwise skip this and pass a local path to PDF::Reader.new
file = URI.open(url_to_pdf_document)
reader = PDF::Reader.new(file)
# Collapse newlines and extraneous whitespace on each page into single spaces
contents = reader.pages.map { |page| page.text.gsub(/\s+/, " ").strip }.join(" ")
# Saving the record triggers Sunspot's automatic indexing
object.pdf_contents = contents
object.save
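
To make the whitespace cleanup concrete, here is a small standalone sketch of just that step. The helper name normalize_pdf_text and the sample strings are my own placeholders, not part of pdf-reader; in the real code the strings would come from page.text.

```ruby
# Hypothetical helper (not part of pdf-reader): takes the text of each
# page as an array of strings and returns one searchable string.
def normalize_pdf_text(page_texts)
  page_texts
    .map { |text| text.gsub(/\s+/, " ").strip } # newlines, tabs, repeated spaces -> one space
    .reject(&:empty?)                           # drop pages that had no extractable text
    .join(" ")                                  # keep a space between pages so words don't run together
end

# Made-up sample input standing in for two pages of extracted text
sample_pages = ["Annual  report\n2013 ", "\n\nRevenue grew\tby 10%"]
puts normalize_pdf_text(sample_pages)
# prints "Annual report 2013 Revenue grew by 10%"
```

Replacing whitespace with a single space, rather than deleting newlines outright, keeps the last word of one line from being glued to the first word of the next, which would otherwise hurt search matches.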

 

The PDF document is now fully indexed and searchable within the Rails app.
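
As a rough sketch of what querying might look like, something along these lines should work with Sunspot's search DSL. The helper name search_pdf_contents is hypothetical, and it assumes the model passed in declares the searchable block shown earlier; search, fulltext and results are the Sunspot calls doing the work.

```ruby
# Hypothetical helper: `model` is assumed to be a class with a Sunspot
# `searchable` block that includes `text :pdf_contents`.
def search_pdf_contents(model, query)
  search = model.search do
    fulltext query # keyword search across the model's indexed text fields
  end
  search.results   # the matching records, loaded from the database
end
```

With a model called Document (an assumed name), search_pdf_contents(Document, "invoice") would return the records whose PDF text mentions "invoice". If you are trying this from the console, you may need to call Sunspot.commit before newly saved records show up in results.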

This post does not cover making the PDFs accessible through your app, since that will depend on your specific situation. The steps above are storage-agnostic and will work with any configuration (S3, local, other cloud service, etc.).

2 thoughts on “Sunspot Solr – Indexing PDF documents”

  1. Jason Perrone

    I ended up doing the same thing. Tried using Yomu (an Apache Tika wrapper), but 50% of my PDFs didn’t parse well. pdf-reader seemed to be fine with any PDF I threw at it. So, I’ll probably use pdf-reader for PDFs and Tika for Word and other docs.
