Question

I am working with GATE (a Java-based NLP framework) and want to find words that partially match entries in a dictionary. For example, I have a disease dictionary with the following terms:

Congestive cardiac failure
Congestive Heart Failure
Colon Cancer
...
(thousands more terms)

Let's assume I have the string "Father had cardiac failure last year". From this string I want to identify "cardiac failure" as a partial match, because it occurs as part of a term in the dictionary.

I have seen discussions of similar problems for Python, JS, and C#, but I am not sure what would help in this case. I wonder whether I could use Aho-Corasick here.


Solution

The UIMA Concept Mapper annotator add-on includes functionality similar to what you are looking for, so it may be worth considering.

Other tips

Maybe you should use Lucene. Treat each line of the dictionary as a document, and each sentence in the text as a query.

One question that arises is which substrings you want to include in the search. If you included all substrings, just "Heart" would also be a match, but that is not really a disease. Maybe all right-aligned (word) substrings (perhaps with length > 1) would be acceptable.
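To make the idea of right-aligned word substrings concrete, here is a minimal Java sketch. The class name `SuffixExpander`, the method `suffixes`, and the `minWords` parameter are hypothetical names chosen for illustration, and "length > 1" is read here as "at least two words", which is an assumption about what was meant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SuffixExpander {
    /**
     * Returns every right-aligned word suffix of the term that has at least
     * minWords words, longest first. The full term itself is included.
     */
    static List<String> suffixes(String term, int minWords) {
        String[] words = term.trim().split("\\s+");
        int min = Math.max(1, minWords); // never emit an empty suffix
        List<String> out = new ArrayList<>();
        for (int start = 0; words.length - start >= min; start++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, start, words.length)));
        }
        return out;
    }
}
```

With `minWords = 2`, `SuffixExpander.suffixes("Congestive cardiac failure", 2)` yields "Congestive cardiac failure" and "cardiac failure", but not the single word "failure".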

So one thing you could do is build the Aho-Corasick pattern matcher from the substrings you want to include. To keep track of which dictionary term each substring came from, you probably need to modify the algorithm a bit (if keeping that information is important) or build another data structure to look it up afterwards.
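One possible shape for such a modified matcher is sketched below in plain Java, without GATE or any external library. It is an illustration under several assumptions, not a reference implementation: the automaton works on lower-cased word tokens rather than characters (so matches always fall on word boundaries), each pattern node stores the dictionary terms it came from, and the names `PartialTermMatcher`, `addTerm`, `find`, and `minWords` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class PartialTermMatcher {
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();     // transitions keyed by word
        final Set<String> sources = new LinkedHashSet<>();  // dictionary terms whose suffix ends here
        Node fail;                                          // Aho-Corasick failure link
        int depth;                                          // number of words from the root
    }

    private final Node root = new Node();
    private boolean built = false;

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Registers every right-aligned suffix of the term that has at least minWords words. */
    void addTerm(String term, int minWords) {
        String[] words = tokenize(term);
        for (int start = 0; words.length - start >= Math.max(1, minWords); start++) {
            Node node = root;
            for (int i = start; i < words.length; i++) {
                Node parent = node;
                node = parent.next.computeIfAbsent(words[i], w -> new Node());
                node.depth = parent.depth + 1;
            }
            node.sources.add(term); // remember which dictionary term this suffix came from
        }
        built = false;
    }

    // Breadth-first computation of the failure links.
    private void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<String, Node> e : node.next.entrySet()) {
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                e.getValue().fail = (target != null) ? target : root;
                queue.add(e.getValue());
            }
        }
        built = true;
    }

    /** Maps each matched (lower-cased) phrase in the text to the dictionary terms it belongs to. */
    Map<String, Set<String>> find(String text) {
        if (!built) build();
        String[] words = tokenize(text);
        Map<String, Set<String>> hits = new LinkedHashMap<>();
        Node node = root;
        for (int i = 0; i < words.length; i++) {
            while (node != root && !node.next.containsKey(words[i])) node = node.fail;
            node = node.next.getOrDefault(words[i], root);
            // Walk the failure chain so shorter patterns ending at this word are reported too.
            for (Node n = node; n != root; n = n.fail) {
                if (!n.sources.isEmpty()) {
                    String phrase = String.join(" ", Arrays.copyOfRange(words, i + 1 - n.depth, i + 1));
                    hits.computeIfAbsent(phrase, p -> new LinkedHashSet<>()).addAll(n.sources);
                }
            }
        }
        return hits;
    }
}
```

After calling `addTerm` with `minWords = 2` for the three example terms from the question, `find("Father had cardiac failure last year")` returns a single entry that maps "cardiac failure" to "Congestive cardiac failure". Storing the source terms directly on the trie nodes avoids the separate lookup structure, at the cost of some extra memory per pattern.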

In any case, I would convert the disease list and the documents you want to search to lower case before building the matcher and matching. If there is a chance of misspellings, there are also papers on fuzzy Aho-Corasick automata.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow