Python - query for inverted index

https://stackoverflow.com/questions/13259903

27-11-2021

Question

This is my first post on SO, and I apologize in advance if my question turns out to be a bit trivial. I'm relatively new to programming, and I chose Python as my first "serious" OOP language. I searched the SO archive but couldn't find a question that fully matches mine. Okay, long story short, here's the problem:

I'm working on an inverted index. I found a couple of tutorials and tips on the net to follow, and I did the following:

class Document, which stems the words and returns them with their start and end positions, thanks to the finditer function.
class Inverted_Index, which takes a collection of documents (a list of lists), tokenizes them and puts them in an inverted index of the form

{'word': {document_id: [(start_pos, end_pos)]}}

like {'cloud': {0: [(5, 10)]}, 'document': {1: [(11, 19)], 2: [(22, 30)]} ...}. (I got document_id with the help of an SO topic, by iterating through the enumerated collection of documents. As for the nested dictionaries, I built them amateurishly, like:

if nested_dict not in existing_dict:
    existing_dict[nested_dict] = {}

While reading Stack Overflow I noticed that the "defaultdict" datatype is a much better way of doing that, but I haven't figured out the "collections" module yet.)
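For reference, the membership checks above can be replaced with collections.defaultdict. This is only a minimal sketch: it assumes each document is already tokenized into (word, start, end) tuples, and build_index and the sample data are hypothetical names made up for illustration.

```python
from collections import defaultdict

def build_index(tokenized_docs):
    """Build {word: {doc_id: [(start, end), ...]}} without manual key checks."""
    # The outer defaultdict creates an inner defaultdict(list) for each new word;
    # the inner one creates an empty list for each new doc_id.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(tokenized_docs):
        for word, start, end in tokens:
            index[word][doc_id].append((start, end))
    return index

# Hypothetical sample: each document is a list of (word, start, end) tuples.
docs = [
    [("cloud", 5, 10)],
    [("document", 11, 19)],
    [("document", 22, 30), ("cloud", 0, 5)],
]
index = build_index(docs)
print(dict(index["document"]))
```

One caveat with this approach: reading a missing key with index[word] silently creates an empty entry, so lookups at query time are safer with index.get(word, {}) or an "in" check.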

Back on track: inside Inverted_Index I wrote a query method (just a version of the OR operator) that takes a string as a query and, if that string matches a key/term in my inverted index, returns the document_id with the start and end positions of the term, like:

[(1, [(0, 4), (11, 19)]), ...]

And after that I was... stuck. I want to make a query output that prints the found word in a document along with its surroundings, but I don't know how to connect the result of the query method (document_id with start and end positions) with the inverted index, and I don't have a clue how to highlight the matched query inside its surroundings. That is why I stored the start and end positions, but I have no idea how to accentuate the match in Python. Bold it?

I thought of a result something like:

###################
Your query:'chocolate pudding'
Results:
########
In a document with id: 1
yaddi yaddi yadda chocolate bla bla bla pudding
hocolate bla bla bla pudding yaddi yaddi yadda bla

I mean, I was reading http://docs.python.org/2/library/string.html#string.center and thinking that aligning the found words/queries in the same column would do the trick. But I don't know how to get there, so any kind of hint would be great, because I'm not so much stuck in my program as stuck in understanding the logic behind Python, and in that case tutorials don't do it justice. (Yes, I have some Python books, but they take an extended approach to this kind of matter, possibly assuming it is not for beginners, and I don't know where to begin or what programs to write that I can make use of. The thing is, we learn linguistic theory and IR theory in college, but we do very little in practice.)

Thanks!

And sorry about this story-of-my-life end :D

I forgot: here is the code, so that this topic isn't vague:

import re

class inverted_index(dict):

    def __init__(self,collection_of_docs):
        for doc_id,document in enumerate(collection_of_docs):
            for word,start,end in document.tokenize(): #form: [('sky', 0, 4)]
                if word not in self:
                    self[word]={}
                if doc_id not in self[word]:
                    self[word][doc_id]=[]
                self[word][doc_id].append((start,end))


    def query(self,query_string):
        result={}
        for query_term in re.findall(r'\w+',query_string.lower(),re.UNICODE):
            for doc_id in self.get(query_term,{}):
                if doc_id not in result:
                    result[doc_id]=self[query_term][doc_id]
                else:
                    result[doc_id]=result[doc_id]+self[query_term][doc_id]
        return sorted(result.items(),key=lambda e:-len(e[1]))

Solution

You will need a 'get_with_surroundings' method on your text.

It could look like this:

class inverted_index(dict):
    def __init__(self,collection_of_docs):
        self.collection_of_docs = collection_of_docs #to store those
        # ... rest of your code

    def get_with_surroundings(self, document_id, position_tuple):
        start, end = position_tuple
        # clamp at 0: a negative slice start would count from the end of the string
        return self.collection_of_docs[document_id].text[max(0, start-10):end+10]

Here +10 and -10 can be changed depending on how much surrounding text you want to display. I assume that your Document class has a 'text' attribute holding the document as a plain Python string.

Calling this method with a document id and one of the position tuples from your query results will more or less achieve what you need.
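To connect a whole query result to its snippets, one option is to loop over every (doc_id, positions) pair and pad the left context so that all matches line up in the same column, as the question suggests. This is a rough sketch, not the only way to do it: snippets, texts and width are hypothetical names, the documents are assumed to be available as plain strings indexed by document id, and end is assumed to be exclusive (as with match.end() from re).

```python
def snippets(texts, query_result, width=10):
    """Yield (doc_id, snippet) for every hit in a query result.

    texts: list of plain strings indexed by document id (an assumption).
    query_result: [(doc_id, [(start, end), ...]), ...] as returned by query().
    """
    for doc_id, positions in query_result:
        text = texts[doc_id]
        for start, end in positions:
            left = text[max(0, start - width):start]
            right = text[end:end + width]
            # Right-align the left context so every match starts in the same column.
            yield doc_id, left.rjust(width) + text[start:end] + right

# Hypothetical sample data in the shape produced by query().
texts = ["yaddi yadda chocolate bla bla", "chocolate pudding yaddi"]
result = [(0, [(12, 21)]), (1, [(0, 9), (10, 17)])]
for doc_id, line in snippets(texts, result):
    print(doc_id, line)
```

Because the left context is always padded to width characters, each matched word begins at the same offset in every printed line.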

The question "How do I print bold text in Python?" could be helpful for making the match bold.
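As an illustration of the bold idea, ANSI escape codes can be wrapped around the matched span. This is a sketch under the assumption that the output goes to a terminal that understands ANSI codes (most Unix terminals do; some older Windows consoles do not). highlight and width are hypothetical names.

```python
BOLD = "\033[1m"   # ANSI escape: start bold
RESET = "\033[0m"  # ANSI escape: reset formatting

def highlight(text, start, end, width=10):
    """Return the match with up to `width` characters of context, match in bold."""
    left = text[max(0, start - width):start]
    right = text[end:end + width]
    return left + BOLD + text[start:end] + RESET + right

print(highlight("yaddi yadda chocolate bla bla", 12, 21))
```

If the output is written to a file or a console without ANSI support, the escape codes will show up as literal characters, so a plain marker such as asterisks around the match may be a safer fallback.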

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow