Question

I have problems indexing item names that contain numbers and symbols. A sample of my data is shown below:

ANGLE BARS   ORANGE - 4.0MM 2 - 1/2"
B.I SQUARE TUBING     2" X 3"
B.I. PIPE S-40   10MM 3/8"
B.I SQUARE TUBING     1" X 2"
PLYWOOD   MARINE 3/4X4X8
PLYWOOD   STA. CLARA 1/8X4X8
PLYWOOD   STA. CLARA 3/16X4X8

I want to tokenize my data on whitespace (ignoring leading and trailing spaces) without dropping the symbols, because these symbols are essential. That way, a search for "plywood sta. clara", "b.i square 2" X 3"", or "angle orange 2 - 1/2" should return a result. I tried WhitespaceAnalyzer, but the symbols are dropped. I also tried StandardAnalyzer, but stop words and symbols are dropped as well. What is the best analyzer to use instead?


Solution

You can use PatternAnalyzer with a regular expression that defines the token delimiters, or create a custom Analyzer.

OTHER TIPS

Try using an org.apache.lucene.analysis.miscellaneous.PatternAnalyzer. You can supply a regular expression that defines the token delimiters.
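
To see what a whitespace-only delimiter does to the sample data, here is a minimal sketch that uses plain java.util.regex instead of Lucene. The class name WhitespaceTokenSketch, the tokenize helper, and the lowercasing step are assumptions made for illustration and are not part of any Lucene API. The delimiter regex \s+ is the kind of pattern you might hand to a pattern-based analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not a Lucene class.
class WhitespaceTokenSketch {
    // Delimiter: one or more whitespace characters. Punctuation is not a
    // delimiter, so tokens such as 1/2" or sta. are kept whole.
    static final Pattern DELIMITER = Pattern.compile("\\s+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // trim() removes leading and trailing spaces before splitting.
        for (String piece : DELIMITER.split(text.trim())) {
            if (!piece.isEmpty()) {
                // Lowercasing is an assumption, so that "plywood" matches "PLYWOOD".
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("B.I SQUARE TUBING     2\" X 3\""));
        System.out.println(tokenize("PLYWOOD   STA. CLARA 3/16X4X8"));
    }
}
```

With this delimiter, a token like 3/16x4x8 stays a single unit, so a query would have to contain that exact token to match it. Whatever analyzer you settle on, applying the same one at query time keeps the query tokens consistent with the indexed ones.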

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow