For the first one, you could write a custom TokenFilter and hook it into your analyzers. It's not that hard; take a look at org.apache.lucene.analysis.ASCIIFoldingFilter for a simple example.
The second one could possibly be solved with PatternReplaceCharFilterFactory: http://docs.lucidworks.com/display/solr/CharFilterFactories
You would have to remove the '-' character from numbers and then index and search on the digits only. Similar question: Solr PatternReplaceCharFilterFactory not replacing with specified pattern
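Before wiring a pattern into the char filter, it can help to try it in plain Java, since the factory uses Java regular expressions. The sketch below is only an illustration: the class name, method name and the pattern itself are my own assumptions, not something from the question. It removes a '-' only when it sits between two digits.

```java
import java.util.regex.Pattern;

final class HyphenStripDemo {
    // Hypothetical pattern: a '-' with a digit directly before and after it.
    private static final Pattern DIGIT_HYPHEN = Pattern.compile("(?<=\\d)-(?=\\d)");

    // Removes hyphens that separate digits; everything else is left alone.
    static String stripHyphens(String text) {
        return DIGIT_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripHyphens("call 123-45-678 now")); // prints "call 12345678 now"
        System.out.println(stripHyphens("well-known"));          // prints "well-known"
    }
}
```

If the pattern behaves the way you want, the same expression would go into the pattern attribute of the char filter, with an empty replacement. Remember that the '<' in the lookbehind has to be escaped as &lt; inside the XML schema.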
Here is a simple example that removes "gatan" from the end of each token:
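package gatanizer; // same package as gatanizer.GatanizerFactory in the schema

import java.io.IOException;

// Assumes Commons Lang 3; on 2.x the class lives in org.apache.commons.lang
import org.apache.commons.lang3.StringUtils;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Final because Lucene expects TokenStream implementations to be final
public final class Gatanizer extends TokenFilter {

    private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);

    /**
     * Construct a token stream filtering the given input.
     */
    public Gatanizer(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false; // no more tokens
        }
        // Copy the current term out of the shared buffer, strip the suffix, write it back
        String term = new String(termAttribute.buffer(), 0, termAttribute.length());
        termAttribute.setEmpty().append(StringUtils.removeEnd(term, "gatan"));
        return true;
    }
}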
I registered the TokenFilter on a Solr field type:
<fieldtype name="gatan" stored="false" indexed="false" multiValued="true" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="gatanizer.GatanizerFactory"/>
    </analyzer>
</fieldtype>
You'll also need a simple GatanizerFactory (a token filter factory in the gatanizer package) whose create(TokenStream) method returns a new Gatanizer.
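If you would rather not pull in Commons Lang just for removeEnd, the suffix check can be done directly on the term buffer. This is only a sketch in plain Java so it can be run without Lucene; the class and method names are made up for illustration.

```java
final class SuffixStripDemo {
    // Returns the length the term should have once a trailing suffix is dropped,
    // or the original length if the term does not end with that suffix.
    static int lengthWithoutSuffix(char[] buffer, int length, String suffix) {
        int start = length - suffix.length();
        if (start < 0) {
            return length; // term is shorter than the suffix
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (buffer[start + i] != suffix.charAt(i)) {
                return length; // mismatch, keep the term as it is
            }
        }
        return start;
    }

    public static void main(String[] args) {
        char[] term = "storgatan".toCharArray();
        int newLength = lengthWithoutSuffix(term, term.length, "gatan");
        System.out.println(new String(term, 0, newLength)); // prints "stor"
    }
}
```

Inside incrementToken() you would then pass termAttribute.buffer() and termAttribute.length() to this helper and hand the result to termAttribute.setLength(...), which avoids creating a new String for every token.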