Enlarging a text corpus with classes

https://stackoverflow.com/questions/22936547

29-06-2023

Question

I have a text corpus of many sentences, with some named entities marked within it. For example, the sentence:

what is the best restaurant in wichita texas?

which is tagged as:

what is the best restaurant in <location>?

I want to expand this corpus by taking or sampling all the sentences already in it and replacing the named entities with other similar entities of the same type, e.g. replacing "wichita texas" with "new york", so that the corpus will be bigger (more sentences) and more complete (more entities covered). I have lists of similar entities, including ones that do not appear in the corpus, and I would like those to have some probability of being inserted in my replacements.

Can you recommend a method or point me to a paper on this?

Solution

For your specific question: assuming you have an organized list of named entities (a separate list for 'places', 'people', etc.), this type of work generally starts with manually removing potentially ambiguous names (for example, 'jersey' could be removed from your places list to avoid instances where it refers to the garment). Once you're confident you have removed the most ambiguous names, select an appropriate tag for each group of terms ("location" or "person", for instance), and in each sentence containing one of these words, replace the word with the tag. Then you can perform some basic expansion in the programming language of your choice, so that each sentence containing 'location' is repeated with every location name, each sentence containing 'person' is repeated with every person name, and so on.
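The tag-then-expand steps described above could be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the `ENTITIES` dictionary, its contents, and the function names `tag_sentence`, `expand`, and `sample` are all hypothetical, and the entity lists are assumed to have been cleaned of ambiguous names by hand already.

```python
import itertools
import random
import re

# Hypothetical entity lists (assumption): ambiguous names such as
# 'jersey' are assumed to have been removed manually beforehand.
ENTITIES = {
    "location": ["wichita texas", "new york"],
    "person": ["john smith", "jane doe"],
}

# Matches tags of the form <location>, <person>, ...
TAG_RE = re.compile(r"<(\w+)>")


def tag_sentence(sentence, entities=ENTITIES):
    """Replace each known entity name with its <tag>, longest names first."""
    pairs = [(name, tag) for tag, names in entities.items() for name in names]
    for name, tag in sorted(pairs, key=lambda p: -len(p[0])):
        pattern = r"\b" + re.escape(name) + r"\b"
        sentence = re.sub(pattern, "<" + tag + ">", sentence)
    return sentence


def expand(template, entities=ENTITIES):
    """Yield the template once for every combination of entity names."""
    tags = TAG_RE.findall(template)
    for combo in itertools.product(*(entities[t] for t in tags)):
        names = iter(combo)
        # Fill the tags left to right with the names of this combination.
        yield TAG_RE.sub(lambda m: next(names), template)


def sample(template, entities=ENTITIES, rng=random):
    """Fill each tag with one uniformly chosen name instead of enumerating all."""
    return TAG_RE.sub(lambda m: rng.choice(entities[m.group(1)]), template)


if __name__ == "__main__":
    template = tag_sentence("what is the best restaurant in wichita texas?")
    print(template)
    for sentence in expand(template):
        print(sentence)
```

Full expansion grows multiplicatively when a sentence contains several tags, so for large entity lists drawing a few sentences with `sample` may be more practical. If some entities (for example, ones not yet in the corpus) should be inserted with a different probability, `rng.choice` could be swapped for `random.choices` with a `weights` argument.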

For a general overview of clustering using word classes, check out the seminal Brown et al. paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9919&rep=rep1&type=pdf

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow