Extract the main entity from a body of text for text categorisation?

https://softwareengineering.stackexchange.com/questions/309692

12-12-2020

Question

I'm trying to categorise products based on various fields of data. I've had some success just matching search terms in the product names, but this naive approach doesn't work when it comes to larger bodies of text such as descriptions since a description tends to contain a lot of additional info that isn't relevant to the category.

My thoughts to solve this problem were to extract the entities and predicates from the text, then use a process of elimination to work out which ones have to be the subject. If there's a better approach to this though, please let me know.

So as an example, take the following product description:

A classic sweatshirt with the dolman sleeves providing a modern twist, it is super-versatile and perfect for every day. Wear with our harem trousers or with jeans or a fitted skirt to balance the relaxed shape. Wide neck with ribbed dolman sleeves, rib neck and hem; and V insert at front.

I won't go through the whole thing, but here are some examples of what I would expect to extract from it:

E1. a classic sweatshirt

P1. ... has dolman sleeves

P2. ... is super versatile

P3. ... wear with ... (is an instruction a predicate?)

E2. harem trousers

...etc

So using the above I'd guess you can work out the main entity the paragraph is focusing on is "a classic sweatshirt" since the rest of the sentences start with predicates, and some weighting could be applied to it since it's in the first sentence. After that I could go back to my original approach of matching the extracted text against an index of terms and synonyms.
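The weighting idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `CATEGORY_TERMS` index, the naive sentence splitter and the `first_sentence_weight` value are all hypothetical, not part of any existing system.

```python
import re
from collections import defaultdict

# Hypothetical index of categories to terms and synonyms (assumption).
CATEGORY_TERMS = {
    "sweatshirts": ["sweatshirt", "jumper", "sweater"],
    "trousers": ["trousers", "jeans", "pants"],
    "skirts": ["skirt"],
}

DESCRIPTION = (
    "A classic sweatshirt with the dolman sleeves providing a modern twist, "
    "it is super-versatile and perfect for every day. Wear with our harem "
    "trousers or with jeans or a fitted skirt to balance the relaxed shape. "
    "Wide neck with ribbed dolman sleeves, rib neck and hem; and V insert at front."
)

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def score_categories(text, first_sentence_weight=3.0):
    # Count term matches per category; matches in the first sentence count more.
    scores = defaultdict(float)
    for i, sentence in enumerate(split_sentences(text)):
        weight = first_sentence_weight if i == 0 else 1.0
        words = re.findall(r"[a-z]+", sentence.lower())
        for category, terms in CATEGORY_TERMS.items():
            for term in terms:
                scores[category] += weight * words.count(term)
    return dict(scores)

def main_category(text):
    # Return the best-scoring category, or None if nothing matched.
    scores = score_categories(text)
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

With equal weights the second sentence would win for "trousers" (two matches against one for "sweatshirts"), which is the failure of plain term matching described earlier; the first-sentence weight is what tips the result towards "sweatshirts" here.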

Is there a formal approach/algorithm that solves this problem? Or do you think the approach I've outlined is doomed to fail and I should try something else? ;)

What technology/algorithm should be used to abstract meaning or keywords from a passage of text?

Solution

NER (Named Entity Recognition) should automate most of this if you can build a large enough training dataset. For example, with one of the toolkits (Apache OpenNLP), the training data would look like this:

A classic sweatshirt with the dolman sleeves providing a modern twist, it is super-versatile and perfect for every day. Wear with our harem trousers or with jeans or a fitted skirt to balance the relaxed shape. Wide neck with ribbed dolman sleeves, rib neck and hem; and V insert at front.

This training text would enable OpenNLP to break the text into tokens and estimate the probability of each label (start_of_product_name, end_of_product_name, no_op) for each sequence of consecutive tokens.
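For reference, the OpenNLP name finder expects one sentence per line, with whitespace-separated tokens and each entity wrapped in `<START:type> ... <END>` markers. A minimal sketch of producing such a line from token spans might look like the following; the `annotate` helper and the `product` type name are assumptions made for illustration.

```python
def annotate(tokens, spans, entity_type="product"):
    """Wrap each (start, end) token span (end exclusive) in OpenNLP-style markers."""
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close a span that ended before this token
        if i in starts:
            out.append(f"<START:{entity_type}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # close a span that runs to the end of the sentence
    return " ".join(out)

tokens = "A classic sweatshirt with the dolman sleeves providing a modern twist".split()
line = annotate(tokens, [(0, 3)])  # tokens 0..2 are "A classic sweatshirt"
```

Here `line` marks "A classic sweatshirt" as the product entity, matching E1 in the question.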

This approach would require a significant amount of tagged text, so that the toolkit can build a model that relates token sequences to the probability of each label.
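To give a feel for what the model learns from tagged text, here is a toy sketch that converts annotated lines into per-token labels and computes raw label frequencies. It is not how OpenNLP works internally (a real toolkit trains a statistical model over contextual features rather than counting single tokens), and the label names `start`, `cont` and `other` are assumptions chosen for this example.

```python
from collections import Counter, defaultdict

def token_labels(line):
    """Turn one '<START:type> ... <END>' annotated line into (token, label) pairs."""
    pairs, inside, first = [], False, False
    for item in line.split():
        if item.startswith("<START:") and item.endswith(">"):
            inside, first = True, True
        elif item == "<END>":
            inside = False
        else:
            label = "start" if first else ("cont" if inside else "other")
            pairs.append((item, label))
            first = False
    return pairs

def label_frequencies(lines):
    """Relative frequency of each label per lower-cased token."""
    counts = defaultdict(Counter)
    for line in lines:
        for token, label in token_labels(line):
            counts[token.lower()][label] += 1
    return {
        token: {label: n / sum(c.values()) for label, n in c.items()}
        for token, c in counts.items()
    }
```

With only a handful of tagged lines these frequencies are meaningless, which is why a large tagged corpus is needed before the estimates become useful.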

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange
