solr, sunspot, bad request, illegal character

Question 1

Put the following in an initializer to automatically strip UTF-8 control characters from every value Sunspot extracts for indexing:

# config/initializers/sunspot.rb
module Sunspot
  # 
  # DataExtractors present an internal API for the indexer to use to extract
  # field values from models for indexing. They must implement the #value_for
  # method, which takes an object and returns the value extracted from it.
  #
  module DataExtractor #:nodoc: all
    # 
    # AttributeExtractors extract data by simply calling a method on the object.
    #
    class AttributeExtractor
      def initialize(attribute_name)
        @attribute_name = attribute_name
      end

      def value_for(object)
        Filter.new( object.send(@attribute_name) ).value
      end
    end

    # 
    # BlockExtractors extract data by evaluating a block in the context of the
    # object instance, or if the block takes an argument, by passing the object
    # as the argument to the block. Either way, the return value of the block is
    # the value returned by the extractor.
    #
    class BlockExtractor
      def initialize(&block)
        @block = block
      end

      def value_for(object)
        Filter.new( Util.instance_eval_or_call(object, &@block) ).value
      end
    end

    # 
    # Constant data extractors simply return the same value for every object.
    #
    class Constant
      def initialize(value)
        @value = value
      end

      def value_for(object)
        Filter.new(@value).value
      end
    end

    # 
    # A Filter to allow easy value cleaning
    #
    class Filter
      def initialize(value)
        @value = value
      end

      def value
        strip_control_characters @value
      end

      # Removes ASCII control characters (code points 0-31 and 127).
      # Non-String values are returned unchanged.
      def strip_control_characters(value)
        return value unless value.is_a? String

        value.chars.inject("") do |str, char|
          unless char.ascii_only? && (char.ord < 32 || char.ord == 127)
            str << char
          end
          str
        end
      end
    end

  end
end

Source (Sunspot GitHub Issues): Sunspot Solr Reindexing failing due to illegal characters
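
To see what the filter does without loading Sunspot, here is a minimal standalone sketch of the same character test. The top-level method and the sample string are only for illustration and are not part of Sunspot. Note that tabs and newlines are stripped too, because their code points are below 32.

```ruby
# Standalone copy of the filter logic, for experimenting in irb.
# Drops ASCII control characters (code points 0-31 and 127) and keeps
# everything else, including non-ASCII characters such as "é".
def strip_control_characters(value)
  return value unless value.is_a?(String)

  value.chars.reject { |char| char.ascii_only? && (char.ord < 32 || char.ord == 127) }.join
end

# Illustrative input: a NUL, a unit separator, a tab, and a DEL mixed into text.
dirty = "Caf\u00E9\u0000 menu\u001F\tfor today\u007F"
clean = strip_control_characters(dirty)

puts clean.inspect
```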

Question 2

I tried the solution @thekingoftruth proposed, but it did not solve the problem. I found an alternative version of the Filter class in the same GitHub thread he links to, and that one solved it.

The main difference in my case is that I use nested models through HABTM relationships.

This is my search block in the model:

searchable do
  text :name, :description, :excerpt
  text :venue_name do
    venue.name if venue.present?
  end
  text :artist_name do
    artists.map { |a| a.name if a.present? } if artists.present?
  end
end

Here is the initializer that worked for me (in config/initializers/sunspot.rb):

module Sunspot
  #
  # DataExtractors present an internal API for the indexer to use to extract
  # field values from models for indexing. They must implement the #value_for
  # method, which takes an object and returns the value extracted from it.
  #
  module DataExtractor #:nodoc: all
    #
    # AttributeExtractors extract data by simply calling a method on the object.
    #
    class AttributeExtractor
      def initialize(attribute_name)
        @attribute_name = attribute_name
      end

      def value_for(object)
        Filter.new( object.send(@attribute_name) ).value
      end
    end

    #
    # BlockExtractors extract data by evaluating a block in the context of the
    # object instance, or if the block takes an argument, by passing the object
    # as the argument to the block. Either way, the return value of the block is
    # the value returned by the extractor.
    #
    class BlockExtractor
      def initialize(&block)
        @block = block
      end

      def value_for(object)
        Filter.new( Util.instance_eval_or_call(object, &@block) ).value
      end
    end

    #
    # Constant data extractors simply return the same value for every object.
    #
    class Constant
      def initialize(value)
        @value = value
      end

      def value_for(object)
        Filter.new(@value).value
      end
    end

    #
    # A Filter to allow easy value cleaning
    #
    class Filter
      def initialize(value)
        @value = value
      end

      def value
        if @value.is_a? String
          strip_control_characters_from_string @value
        elsif @value.is_a? Array
          @value.map { |v| strip_control_characters_from_string v }
        elsif @value.is_a? Hash
          @value.inject({}) do |hash, (k, v)|
            hash.merge( strip_control_characters_from_string(k) => strip_control_characters_from_string(v) )
          end
        else
          @value
        end
      end

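      # Removes ASCII control characters (code points 0-31 and 127).
      # Non-String values (for example nil entries in an Array) are
      # returned unchanged.
      def strip_control_characters_from_string(value)
        return value unless value.is_a? String

        value.chars.inject("") do |str, char|
          unless char.ascii_only? && (char.ord < 32 || char.ord == 127)
            str << char
          end
          str
        end
      end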
    end

  end
end
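
Presumably the String-only filter did not help here because the artist_name block returns an Array, and a filter that only cleans Strings hands any other value back untouched. The sketch below reproduces the dispatch outside Sunspot to show the difference. The method name `filtered` and the sample data are hypothetical, chosen only for this illustration.

```ruby
# Same character test as the Filter class, as a standalone method.
def strip_control_characters_from_string(value)
  return value unless value.is_a?(String)

  value.chars.reject { |c| c.ascii_only? && (c.ord < 32 || c.ord == 127) }.join
end

# Hypothetical stand-in for Filter#value: cleans Strings, each element of
# an Array, and both keys and values of a Hash. Anything else is returned as-is.
def filtered(value)
  case value
  when String
    strip_control_characters_from_string(value)
  when Array
    value.map { |v| strip_control_characters_from_string(v) }
  when Hash
    value.each_with_object({}) do |(k, v), hash|
      hash[strip_control_characters_from_string(k)] = strip_control_characters_from_string(v)
    end
  else
    value
  end
end

# A block like `artists.map { |a| a.name if a.present? }` can yield this shape:
artist_names = ["Bj\u00F6rk\u0007", nil, "Moby\u0000"]
cleaned_names = filtered(artist_names)
cleaned_hash = filtered({ "ti\u0001tle" => "a\u0002b" })

puts cleaned_names.inspect
puts cleaned_hash.inspect
```

Only one level of nesting is handled, which matches the Filter class above. An Array of Arrays would still pass through with its inner Strings uncleaned.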

Question 3

You need to strip control characters from your UTF-8 content before saving it. Solr cannot index them and responds with this error.
http://en.wikipedia.org/wiki/UTF-8#Codepage_layout

You can use something like this:

name.gsub!(/\p{Cc}/, "")
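
Two details are worth keeping in mind with that one-liner: `gsub!` modifies the string in place and returns nil when nothing was replaced, and it raises NoMethodError if name is nil. A minimal sketch of a non-mutating, nil-safe variant follows. The method name `without_control_characters` is made up for this example.

```ruby
# Returns a copy of text with every Unicode control character
# (general category Cc: U+0000-U+001F, U+007F, U+0080-U+009F) removed.
# nil is passed through so the helper is safe on blank attributes.
def without_control_characters(text)
  return text if text.nil?

  text.gsub(/\p{Cc}/, "")
end

name = "Jazz\u0000 Night\u0085"
cleaned = without_control_characters(name)

# gsub! signals "no change" with nil, so its return value should not be
# used as the cleaned string.
unchanged = "plain".dup.gsub!(/\p{Cc}/, "")

puts cleaned.inspect
puts unchanged.inspect
```

Unlike the `char.ord < 32 || char.ord == 127` test used in the initializers, `\p{Cc}` also matches the C1 control range U+0080 to U+009F.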

edit: If you want to apply this globally, I think it should be possible by overriding the value_for methods in AttributeExtractor and, if needed, BlockExtractor: https://github.com/sunspot/sunspot/blob/master/sunspot/lib/sunspot/data_extractor.rb. I have not tested this. If you manage to put together a global patch, please let me know. I ran into the same issue recently.
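
One way such a global override could look is to wrap value_for with Module#prepend instead of reopening the class. The sketch below runs against a simplified stand-in class, not the real Sunspot::DataExtractor::AttributeExtractor, and the module name ControlCharacterStripping is hypothetical. It has not been tested against Sunspot itself.

```ruby
# Simplified stand-in for Sunspot's AttributeExtractor (an assumption made
# so the example runs without the gem).
class AttributeExtractor
  def initialize(attribute_name)
    @attribute_name = attribute_name
  end

  def value_for(object)
    object.send(@attribute_name)
  end
end

# Hypothetical module: calls the original value_for via super, then removes
# control characters from String results. Other values pass through.
module ControlCharacterStripping
  def value_for(object)
    value = super
    value.is_a?(String) ? value.gsub(/\p{Cc}/, "") : value
  end
end

AttributeExtractor.prepend(ControlCharacterStripping)

# Illustrative model object.
Event = Struct.new(:name, :capacity)
event = Event.new("Gala\u0000 Night", 200)

extracted_name = AttributeExtractor.new(:name).value_for(event)
extracted_capacity = AttributeExtractor.new(:capacity).value_for(event)

puts extracted_name.inspect
puts extracted_capacity.inspect
```

Because the original method is still called through super, this approach does not need to copy the extractor's body into the initializer.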