How to configure Hibernate Search to find words with accents

https://stackoverflow.com/questions/22176323

03-06-2023

Question

I need to implement a global search on a website that I am building with Spring (4.0.2), Hibernate (4.3.1) and MySQL. I have decided to use Hibernate Search (4.5.0) for this.

This seems to be working fine, but only when I do a search for an exact pattern.

Imagine I have the following text on an indexed field: "A história do Capuchinho e do Lobo Mau"

1) If I search for "história" or "lobo mau", the query will retrieve the corresponding indexed entity, as I would have expected.

2) If I search for "historia" or "lobos maus" the search will not retrieve the entity.

As far as I have read, it should be possible to configure Hibernate Search to perform a much smarter search than this. Can anyone point me in the right direction to achieve this? The key parts of my implementation are shown below. Thanks!

This is the "parent" indexed entity

@Entity
@Table(name="NEWS_HEADER")
@Indexed
public class NewsHeader implements Serializable {

static final long serialVersionUID = 20140301L;

private int                 id;
private String              articleHeader;
private String              language;
private Set<NewsParagraph>  paragraphs = new HashSet<NewsParagraph>();

/**
 * @return the id
 */
@Id
@Column(name="ID")
@GeneratedValue(strategy=GenerationType.AUTO)
@DocumentId
public int getId() {
    return id;
}
/**
 * @param id the id to set
 */
public void setId(int id) {
    this.id = id;
}
/**
 * @return the articleHeader
 */
@Column(name="ARTICLE_HEADER")
@Field(index=Index.YES, analyze=Analyze.YES, store=Store.NO)
public String getArticleHeader() {
    return articleHeader;
}
/**
 * @param articleHeader the articleHeader to set
 */
public void setArticleHeader(String articleHeader) {
    this.articleHeader = articleHeader;
}
/**
 * @return the language
 */
@Column(name="LANGUAGE")
public String getLanguage() {
    return language;
}
/**
 * @param language the language to set
 */
public void setLanguage(String language) {
    this.language = language;
}
/**
 * @return the paragraphs
 */
@OneToMany(mappedBy="newsHeader", fetch=FetchType.EAGER, cascade=CascadeType.ALL)
@IndexedEmbedded
public Set<NewsParagraph> getParagraphs() {
    return paragraphs;
}
// Other standard getters/setters go here

And this is the IndexedEmbedded entity

@Entity
@Table(name="NEWS_PARAGRAPH")
public class NewsParagraph implements Serializable {

static final long serialVersionUID = 20140302L;

private int         id;
private String      content;
private NewsHeader  newsHeader;

/**
 * @return the id
 */
@Id
@Column(name="ID")
@GeneratedValue(strategy=GenerationType.AUTO)
public int getId() {
    return id;
}
/**
 * @param id the id to set
 */
public void setId(int id) {
    this.id = id;
}
/**
 * @return the content
 */
@Column(name="CONTENT")
@Field(index=Index.YES, analyze=Analyze.YES, store=Store.NO)
public String getContent() {
    return content;
}
// Other standard getters/setters go here

This is my search method, implemented on my SearchDAOImpl

public class SearchDAOImpl extends DAOBasics implements SearchDAO {
    ...
    public List<NewsHeader> searchParagraph(String patternStr) {

        Session session = null;
        Transaction tx = null;
        List<NewsHeader> result = null;

        try {
            session = sessionFactory.getCurrentSession();
            FullTextSession fullTextSession = Search.getFullTextSession(session);
            tx = fullTextSession.beginTransaction();

            // Create native Lucene query using the query DSL
            QueryBuilder queryBuilder = fullTextSession.getSearchFactory()
                .buildQueryBuilder().forEntity(NewsHeader.class).get();

            org.apache.lucene.search.Query luceneSearchQuery = queryBuilder
                .keyword()
                .onFields("articleHeader", "paragraphs.content")
                .matching(patternStr)
                .createQuery();

            // Wrap Lucene query in a org.hibernate.Query
            org.hibernate.Query hibernateQuery =
                fullTextSession.createFullTextQuery(luceneSearchQuery, NewsHeader.class, NewsParagraph.class);

            // Execute search
            result = hibernateQuery.list();

            // Commit so the transaction opened above is not left dangling
            tx.commit();

        } catch (Exception xcp) {
            logger.error(xcp);
            // Roll back if the search failed before the commit
            if (tx != null && tx.isActive()) {
                tx.rollback();
            }
        } finally {
            if ((session != null) && (session.isOpen())) {
                session.close();
            }
        }
        return result;
    }
...
}

Solution 2

You could configure your own analyzer, or you can use a standard language analyzer, such as PortugueseAnalyzer. I'd recommend starting with the existing analyzer and, if necessary, creating your own from it by tweaking its filter chain.

You can set this using the analyzer attribute of @Field, which takes an @Analyzer annotation:

@Field(index=Index.YES, analyze=Analyze.YES, store=Store.NO, analyzer = @Analyzer(impl = org.apache.lucene.analysis.pt.PortugueseAnalyzer.class))

Or you can set that analyzer as the default for the class, if you place an @Analyzer annotation at the head of the class instead.
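A minimal sketch of the class-level variant, reusing the NewsHeader entity from the question. The imports are my assumption about where these types live for the versions mentioned in the question, and the class body is left out. I have not checked whether the embedded NewsParagraph picks up this default or needs the same annotation itself, so treat that as something to verify.

```java
import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.Table;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.Indexed;

@Entity
@Table(name = "NEWS_HEADER")
@Indexed
// Class-level default: intended for every analyzed @Field that does not name its own analyzer
@Analyzer(impl = PortugueseAnalyzer.class)
public class NewsHeader implements Serializable {
    // fields and accessors as in the question; @Field needs no analyzer attribute here
}
```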

OTHER TIPS

This is what I have ended up doing, to resolve my problem.

Configure an AnalyzerDef at the entity level. Within it, use LowerCaseFilterFactory, ASCIIFoldingFilterFactory and SnowballPorterFilterFactory to achieve the type of filtering I needed.

@AnalyzerDef(name = "customAnalyzer",
  tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
  filters = {
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    @TokenFilterDef(factory = SnowballPorterFilterFactory.class)
})
public class NewsHeader implements Serializable {
...
}

Add these annotations to each field I want indexed, either in the parent entity or in its IndexedEmbedded counterpart, so that the field uses the analyzer defined above.

@Field(index=Index.YES, store=Store.NO)
@Analyzer(definition = "customAnalyzer")

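You will need to either re-index or re-insert your entities for the analyzer to take effect, because documents already in the index were tokenized with the old analyzer. Re-indexing can be done with Hibernate Search's MassIndexer, for example fullTextSession.createIndexer().startAndWait().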
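One detail worth checking in the definition above: as far as I know, SnowballPorterFilterFactory falls back to its English stemmer unless it is given a language parameter, so for Portuguese text it may be worth passing one explicitly. Here is a sketch of that variation. The org.apache.solr.analysis package names are an assumption based on the Lucene 3.x line that Hibernate Search 4.x builds on (later versions ship these factories under org.apache.lucene.analysis), and the class body is left out.

```java
import org.apache.solr.analysis.ASCIIFoldingFilterFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Assumption: the package names above match your Hibernate Search version; adjust if they do not.
@AnalyzerDef(name = "customAnalyzer",
  tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
  filters = {
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    // "language" selects the Snowball stemmer to apply
    @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = {
      @Parameter(name = "language", value = "Portuguese")
    })
})
public class NewsHeader implements java.io.Serializable {
    // entity annotations, fields and accessors as in the question
}
```

Because the folding filter runs before the stemmer here, the stemmer only ever sees unaccented tokens. Whether that order or the reverse gives better matches for your data is something to test after re-indexing.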

If you want a search for an accented word to also match its unaccented form (and vice versa), you must include ASCIIFoldingFilterFactory in the analyzer's filter chain, like this:

@AnalyzerDef(name = "customAnalyzer",
  tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
  filters = {
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    @TokenFilterDef(factory = StopFilterFactory.class, params = {
      @Parameter(name = "words", value = "com/ik/resource/stoplist.properties"),
      @Parameter(name = "ignoreCase", value = "true")
    })
})

Then apply @Analyzer(definition = "customAnalyzer") to the entity or to individual fields.
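To see why lowercasing plus accent folding makes "historia" match "história", here is a plain-Java sketch with no Lucene dependency. It is not Lucene's implementation: the class AccentFoldingSketch and its methods fold and analyze are hypothetical names, and the folding is approximated with java.text.Normalizer. Accented letters are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a stand-in for a lowercase filter followed by an accent-folding filter.
class AccentFoldingSketch {

    // Lowercase, decompose accented letters (NFD), then drop the combining marks.
    public static String fold(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Split on anything that is not a letter or digit, then fold each token.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(fold(raw));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escape \u00f3 is the letter o with an acute accent, as in the question's sample text.
        List<String> indexed = analyze("A hist\u00f3ria do Capuchinho e do Lobo Mau");
        System.out.println(indexed);                            // [a, historia, do, capuchinho, e, do, lobo, mau]
        System.out.println(indexed.contains(fold("historia"))); // true
        System.out.println(indexed.contains(fold("lobos")));    // false: plurals need a stemmer
    }
}
```

The same normalization has to be applied to the indexed text and to the query text for the two to meet, which is why the index must be rebuilt after changing the analyzer. Folding alone does not cover "lobos maus"; that part is what a stemming filter is for.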

Licensed under: CC-BY-SA with attribution