Question

We have two (MySQL) databases: one with about 200,000 products (like "Samsung Galaxy S4", db size 200 MB) and one with about 10 million articles (plain text, db size 20 GB), each of which can contain zero, one, or many of the product names from the product database. We want to find the product names in the article texts and store them as facets of the articles while indexing them in Elasticsearch. Using regular expressions to find the products is pretty slow. We looked at Apache OpenNLP and the Stanford Named Entity Recognizer; for both we would have to train our own models, and while there are some projects on GitHub for integrating those NER tools into Elasticsearch, they don't seem ready for production use.

Products and articles are added every day, so we have to run a complete recognition pass every day. Is NER the way to go, or are there other ideas? We don't have to understand the grammar of the text; we only have to find the product name strings as fast as possible. We can't do the calculation in real time because that's way too slow, so we have to pre-calculate the connection between articles and products and store it as facets, so we can query them quickly in our application.

So what's your recommendation to find so many product names in so many articles?

Was it helpful?

Solution

One of the issues you'll run into the most is consistency: new articles and new product names are always coming in, so you'll have an "eventual consistency" problem. Three approaches come to mind that I have used to tackle this kind of problem.

  1. As suggested, use full-text search in MySQL: loop over your products table and, for each product name, run a MATCH ... AGAINST query, inserting the product key and article key into a tie table. This is fast; I used to run a system in SQL Server with over 90,000 items being searched against 1B sentences. If you had a multithreaded Java program that chunked up the categories and executed the full-text queries, you may be surprised how fast this is. Be aware, though, that this can hammer your DB server.
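A minimal sketch of that loop, assuming the schema above (product and tie tables are made up for illustration). MySQL's MATCH ... AGAINST needs a running server, so SQLite's FTS5 full-text index stands in here to keep the example self-contained; the loop-per-product pattern is the same:

```python
import sqlite3

# In-memory stand-in for the articles DB; FTS5 plays the role of
# MySQL's full-text index (MATCH ... AGAINST).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE articles USING fts5(body);
    CREATE TABLE product_article (product TEXT, article_id INTEGER);
""")
conn.executemany("INSERT INTO articles (body) VALUES (?)", [
    ("Review of the Samsung Galaxy S4 camera.",),
    ("An article about gardening.",),
])

products = ["Samsung Galaxy S4"]  # would come from the product table
for product in products:
    # Quote the name so multi-word products match as a phrase.
    rows = conn.execute(
        "SELECT rowid FROM articles WHERE articles MATCH ?",
        ('"%s"' % product,),
    ).fetchall()
    # Record each (product, article) hit in the tie table.
    conn.executemany(
        "INSERT INTO product_article VALUES (?, ?)",
        [(product, rowid) for (rowid,) in rows],
    )

print(conn.execute("SELECT * FROM product_article").fetchall())
```

The per-product queries are independent, which is what makes chunking them across threads straightforward.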

  2. Use regex. Put all the products in an in-memory collection and run a regex search with that list against every document. This CAN be fast if your docs live in something like Hadoop, where the work can be parallelized. You could run the job at night and have it populate a MySQL table. This approach means you will have to start storing your docs in HDFS or some NoSQL solution, or import them from MySQL into Hadoop daily, etc.
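One way to sketch the in-memory regex approach: compile all product names into a single alternation (longest names first, so a longer product wins over a shorter prefix of it). The product names here are just examples:

```python
import re

# Hypothetical product list; in practice this is loaded from MySQL.
products = ["Samsung Galaxy S4", "Samsung Galaxy S4 Mini", "iPhone 5"]

# One combined pattern: escape each name, sort longest-first so
# "Samsung Galaxy S4 Mini" is tried before "Samsung Galaxy S4".
pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(
        re.escape(p) for p in sorted(products, key=len, reverse=True)
    )
)

def find_products(text):
    return sorted(set(pattern.findall(text)))

print(find_products("Comparing the Samsung Galaxy S4 Mini with the iPhone 5."))
# ['Samsung Galaxy S4 Mini', 'iPhone 5']
```

Each document needs only one pass over the combined pattern, which is what you would parallelize per document in a Hadoop job.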

  3. You can try doing it "at index time": when a record is indexed in Elasticsearch, the extraction happens then and your facets are built. I have only used Solr for this kind of thing. The problem here is that when you add new products you will have to reprocess in batch again anyway, because the previously indexed docs will not have had the new products extracted from them.
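A rough sketch of the index-time hook: enrich each article with a facet field just before it is sent to the search engine. The field name product_facets and the plain-regex extraction are assumptions for illustration; in a real pipeline this would run in whatever code performs the bulk indexing:

```python
import re

# Hypothetical product list and facet field name.
products = ["Samsung Galaxy S4", "iPhone 5"]
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, products)))

def enrich(article):
    """Add extracted product names as a facet field before indexing."""
    article["product_facets"] = sorted(set(pattern.findall(article["body"])))
    return article

doc = enrich({"id": 1, "body": "Hands-on with the Samsung Galaxy S4."})
print(doc["product_facets"])  # ['Samsung Galaxy S4']
```

Note the drawback described above still applies: documents enriched before a new product was added will lack that facet until they are reprocessed.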

There may be better options, but the one that scales infinitely (if you can afford the machines) is option 2, the Hadoop job... but that means a big change.

These are just my thoughts, so I hope others come up with more clever ideas.

EDIT: As for using NER, I have used it extensively, mainly OpenNLP, and the problem is that what it extracts will not be normalized; to put it another way, it may extract only pieces and parts of a product name, leaving you with fuzzy string matching to align the NER results to your table of products. OpenNLP 1.6 trunk has a component called the EntityLinker, which is designed for exactly this type of thing (linking NER results to authoritative databases). Also, NER/NLP will not solve the consistency problem, because every time you change your NER model you will have to reprocess everything.

OTHER TIPS

I'd suggest a preprocessing step: tokenization. If you tokenize both the product list and the incoming articles, you won't need a per-product search: the product list becomes an automaton where each transition is a given token.

That gives us a trie that you'll use to match products against texts; searching will look like:

products = []
availableNodes = [dictionary.root]
foreach token in text:
    nextAvailableNodes = [dictionary.root]
    foreach node in availableNodes:
        childNode = node.getChildren(token)
        if childNode:
            if childNode.productName:
                products.append(childNode.productName)
            nextAvailableNodes.append(childNode)
    availableNodes = nextAvailableNodes

As far as I can tell, this algorithm is quite efficient, and it lets you fine-tune the node.getChildren() function (e.g. to handle capitalization or diacritics issues). Loading the product list into a trie may take some time; in that case you could cache it as a binary file.
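A minimal runnable version of the pseudocode above might look like this; whitespace-plus-lowercase tokenization and the node layout are assumptions, and a real tokenizer should also strip punctuation:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.product_name = None  # set on the node that ends a product

def build_trie(product_names):
    """One trie path per product, one token per transition."""
    root = Node()
    for name in product_names:
        node = root
        for token in name.lower().split():
            node = node.children.setdefault(token, Node())
        node.product_name = name
    return root

def find_products(root, text):
    products = []
    available = [root]          # partial matches currently in progress
    for token in text.lower().split():
        next_available = [root]  # always allow a match to start here
        for node in available:
            child = node.children.get(token)
            if child:
                if child.product_name:
                    products.append(child.product_name)
                next_available.append(child)
        available = next_available
    return products

trie = build_trie(["Samsung Galaxy S4", "iPhone 5"])
print(find_products(trie, "We compared the Samsung Galaxy S4 with the iPhone 5"))
# ['Samsung Galaxy S4', 'iPhone 5']
```

Each text is scanned once regardless of how many products are in the trie, which is the whole point compared with a per-product search.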

This simple method can easily be distributed using Hadoop or another MapReduce approach, either over the texts or over the product list; see for instance this article (but you'll probably need more recent / accurate ones).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow