Partial match on a dictionary
14-04-2021
Question
I am working with GATE (a Java-based NLP framework) and want to find words that partially match entries in a dictionary. For example, I have a disease dictionary with the following terms:
Congestive cardiac failure
Congestive Heart Failure
Colon Cancer
...
(thousands more terms)
Let's assume I have a string "Father had cardiac failure last year".
From this string I want to identify "cardiac failure" as a partial match, because it occurs as part of a term in the dictionary.
I have seen some discussion of similar problems in Python, JS and C#, but I am not sure what would help here. I wonder if I can use Aho-Corasick for this.
Solution
The UIMA Concept Mapper annotator add-on includes functionality similar to what you are looking for. You may consider:
- using UIMA inside GATE: http://gate.ac.uk/userguide/chap:uima
- developing a similar component using the main ideas from the add-on
Other tips
Maybe you should use Lucene. Treat each line of the dictionary as a document, and each sentence in the text as a query.
One question that arises is which substrings you want to include in the search. If you included all substrings, just "Heart" would also be a match, but that is not really a disease. Maybe all right-aligned word substrings (that is, word-level suffixes of a term, perhaps with length > 1) would be acceptable.
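To make the "right-aligned substrings" idea concrete, here is a minimal sketch in plain Java. It assumes terms are split on whitespace and that only suffixes of two or more words are kept. The class name `SuffixIndex` and the method `build` are hypothetical, not part of GATE or any library. The map also records which dictionary terms each suffix came from.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: enumerates right-aligned word suffixes of dictionary terms.
class SuffixIndex {

    // Assumption: a suffix must contain at least this many words to be kept.
    static final int MIN_WORDS = 2;

    /** Maps each lower-cased word suffix to the original terms that end with it. */
    static Map<String, Set<String>> build(List<String> terms) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String term : terms) {
            String[] words = term.trim().toLowerCase(Locale.ROOT).split("\\s+");
            // start = 0 yields the full term; larger values drop leading words.
            for (int start = 0; start + MIN_WORDS <= words.length; start++) {
                String suffix = String.join(" ",
                        Arrays.copyOfRange(words, start, words.length));
                index.computeIfAbsent(suffix, k -> new TreeSet<>()).add(term);
            }
        }
        return index;
    }
}
```

For the three example terms from the question, `SuffixIndex.build(...)` would contain "cardiac failure" (from "Congestive cardiac failure") and "heart failure" (from "Congestive Heart Failure"), but neither "heart" nor "failure" on its own.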
So one thing you could do is build the Aho-Corasick pattern matcher from the substrings you want to include. To keep track of which dictionary term each substring came from, you probably need to modify the algorithm a bit (if keeping that information is important) or build another data structure to look it up afterwards.
In any case, I would convert the disease list and the documents you want to search to lower case before building the automaton and matching. If there is a chance of misspellings, there are also papers on fuzzy Aho-Corasick automata.
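As a rough illustration of that approach, the following is a small Aho-Corasick automaton written from scratch in plain Java. It is only a sketch under some assumptions: it works on word tokens rather than characters, so a pattern can only match on word boundaries; it lower-cases everything; and the class and method names (`WordAhoCorasick`, `search`, `tokenize`) are made up for this example. It does not use GATE or any existing Aho-Corasick library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical word-level Aho-Corasick matcher (tokens are the alphabet).
class WordAhoCorasick {
    private final List<Map<String, Integer>> next = new ArrayList<>(); // trie edges per state
    private final List<Integer> fail = new ArrayList<>();              // failure links
    private final List<List<String>> out = new ArrayList<>();          // patterns ending at a state

    WordAhoCorasick(Collection<String> patterns) {
        newState(); // state 0 is the root
        for (String p : patterns) {
            List<String> words = tokenize(p);
            int s = 0;
            for (String w : words) {
                Integer t = next.get(s).get(w);
                if (t == null) { t = newState(); next.get(s).put(w, t); }
                s = t;
            }
            String norm = String.join(" ", words);
            if (s != 0 && !out.get(s).contains(norm)) out.get(s).add(norm);
        }
        // Breadth-first pass: depth-1 states keep fail = 0, deeper states are derived.
        Deque<Integer> queue = new ArrayDeque<>(next.get(0).values());
        while (!queue.isEmpty()) {
            int s = queue.poll();
            for (Map.Entry<String, Integer> e : next.get(s).entrySet()) {
                int child = e.getValue();
                int f = fail.get(s);
                while (f != 0 && !next.get(f).containsKey(e.getKey())) f = fail.get(f);
                fail.set(child, next.get(f).getOrDefault(e.getKey(), 0));
                // Inherit matches that end at the failure target (shorter suffix patterns).
                out.get(child).addAll(out.get(fail.get(child)));
                queue.add(child);
            }
        }
    }

    /** Returns every pattern occurrence in the text, in the order the matches end. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        int s = 0;
        for (String w : tokenize(text)) {
            while (s != 0 && !next.get(s).containsKey(w)) s = fail.get(s);
            s = next.get(s).getOrDefault(w, 0);
            hits.addAll(out.get(s));
        }
        return hits;
    }

    private int newState() {
        next.add(new HashMap<>());
        fail.add(0);
        out.add(new ArrayList<>());
        return next.size() - 1;
    }

    /** Lower-cases and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}
```

Built from the patterns "cardiac failure", "heart failure", "colon cancer" and the full terms, `search("Father had cardiac failure last year")` returns a single hit, "cardiac failure". The matcher only reports the matched pattern; mapping it back to the originating dictionary term is left to a separate lookup structure, such as a map from pattern to terms.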