Question

I am trying to harvest scientific publication data from different online sources such as Core, PMC, arXiv, etc. From these sources I keep the articles' metadata (title, authors, abstract, etc.) and the full text (only from the sources that provide it).

However, I don't want to harvest the same article's data from several sources. That is, I want a mechanism that tells me whether an article I am about to harvest is already present in the dataset of articles I have harvested.

The first thing I tried was to check whether the article I want to harvest has a DOI, and then search the collection of metadata I have already harvested for that DOI. If it is found there, the article has already been harvested. This approach, however, is very time-consuming, since it requires a serial search over a collection of ~10 million article metadata records (in XML format), and it gets even slower for articles that don't have a DOI, where I have to compare other metadata (title, authors, date of publication, etc.).

import os
from os import listdir
import xml.etree.ElementTree as ET

def core_pmc_sim(core_article):
    if core_article.doi is not None:  # only try DOI matching if the Core article has a DOI
        for xml_file in listdir('path_of_the_metadata_files'):  # go over all PMC XML metadata files
            # iterate over every element without loading the whole file into memory
            for event, elem in ET.iterparse(os.path.join('path_of_the_metadata_files', xml_file)):
                if elem.tag == 'hasDOI':
                    print(xml_file, elem.text, core_article.doi)
                    if elem.text == core_article.doi:  # same DOI => same article
                        return True
                elem.clear()
    return False

What is the fastest and most memory-efficient way to achieve this?

(Would a Bloom filter be a good approach for this problem?)


Solution

If you knew that every article had a DOI, you could just store a hashtable of the DOIs of the papers that you've already harvested. Of course, in practice, many papers don't have a DOI or don't list their DOI.
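In Python, a minimal sketch of that idea: collect the DOIs you already have into an in-memory set once, and then each lookup is an average O(1) hash probe instead of a scan over ~10 million XML files. The names here (harvested_dois, normalize_doi) are only illustrative, not part of any existing code.

harvested_dois = set()

def normalize_doi(doi):
    # DOIs are case-insensitive; strip common URL/prefix forms before comparing
    doi = doi.strip().lower()
    for prefix in ('https://doi.org/', 'http://dx.doi.org/', 'doi:'):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def mark_harvested(doi):
    harvested_dois.add(normalize_doi(doi))

def already_harvested(doi):
    return normalize_doi(doi) in harvested_dois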

If you knew that the paper would be listed with the correct title and authors, you could use a hash of the title and authors and store that in a hashtable. Of course, in practice it is common for titles to be misspelled, or for there to be variations in how they are listed (e.g., changes in capitalization, whether you list each author's full first name or just their first initial, and so on).
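If the listings did happen to match exactly, a sketch of that hashing idea might look like the following (the choice of SHA-1 and the helper names are mine, and as noted above this breaks as soon as the listings vary):

import hashlib

def title_authors_key(title, authors):
    # authors is assumed to be an iterable of author-name strings
    payload = title + '|' + '|'.join(authors)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

seen_keys = set()

def is_duplicate(title, authors):
    key = title_authors_key(title, authors)
    if key in seen_keys:
        return True
    seen_keys.add(key)
    return False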

If you knew the list of all variations, you could try to canonicalize the title and authors (e.g., lowercase the title, use only the last names of the authors, and so on). Of course, in practice we probably can't know all the variations.
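One possible canonicalization, as a sketch only; the specific rules here (lowercasing, stripping punctuation, keeping only last names) are assumptions and certainly won't cover every variation:

import re

def canonical_title(title):
    title = title.lower()
    title = re.sub(r'[^a-z0-9 ]+', ' ', title)   # drop punctuation
    return ' '.join(title.split())               # collapse whitespace

def canonical_authors(authors):
    # assume each author string ends with the last name, e.g. "Jane P. Doe" -> "doe"
    return tuple(sorted(name.split()[-1].lower() for name in authors if name.strip()))

def canonical_key(title, authors):
    return (canonical_title(title), canonical_authors(authors))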

If you figured that all variations would be at small edit distance, you could use fuzzy matching and some kind of database that allows approximate match lookups (e.g., Efficient map data structure supporting approximate lookup, How fast can we identify almost-duplicates in a list of strings?, How to compare/cluster millions of strings?). However, I suspect that in practice edit distance might not be enough to find all matches. Also, there is a risk that you might end up conflating two different papers (for instance, I have seen a series of papers with titles like "Catching the Bugblatter, Part I" and "Catching the Bugblatter, Part II" with the same authors).
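A rough sketch of fuzzy title matching with a crude blocking step, so that each new title is compared against a small bucket rather than all ~10 million records (the prefix bucketing and the 0.93 threshold are arbitrary choices for illustration; a real system would use a proper similarity index such as an n-gram index or a BK-tree):

from difflib import SequenceMatcher

buckets = {}   # block key -> list of canonical titles already harvested

def block_key(canon_title):
    return canon_title[:8]   # very crude blocking; misses variations in the first characters

def add_title(canon_title):
    buckets.setdefault(block_key(canon_title), []).append(canon_title)

def probably_seen(canon_title, threshold=0.93):
    for candidate in buckets.get(block_key(canon_title), []):
        if SequenceMatcher(None, canon_title, candidate).ratio() >= threshold:
            return True
    return False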

Hopefully all of this conveys that in practice the problem is messy and there is no single clean answer. Perhaps a pragmatic solution is to use some of the above techniques to find obvious matches, with the idea that this will take care of most duplicates. For the remainder, just accept that you might store the same article multiple times, but hopefully this won't be too common. Ultimately, storage space is extremely cheap, so storing the same article a few times probably isn't so bad.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange