Question

I am developing a plagiarism detection framework. We first preprocess the documents by stemming, replacing synonyms, and removing stop words, so the preprocessed document differs somewhat from the original document.

When we pass the preprocessed document to our plagiarism function, it returns the similar sentences.

In our GUI, we then have to display the two documents and highlight the similar sentences.

To highlight text in Java, we need the indices of the words to be highlighted.

The problem is that the preprocessed text differs from the original document, so it is difficult to locate the similar sentences in the original document.

Can anyone help me with this problem?

Solution

You'll have to store some sort of metadata with the preprocessed document that lets you map its content back to the original document, such as a list of all gaps that result from stop word removal, or a record of where you replaced words with synonyms.

If you record every change made during preprocessing (its location and the text it replaced), you should be able to find the corresponding phrase in the original document.
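One way to keep that metadata is to attach the original character offsets to every token that survives preprocessing, so nothing has to be reconstructed afterwards. The following is only a minimal sketch under stated assumptions: the class `OffsetMapping`, the `Token` type, the tiny stop word set, and the synonym map are all hypothetical placeholders, and the stemming step is left as a comment where your own stemmer would be called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetMapping {

    /** A preprocessed token plus the character span it came from in the original text. */
    static final class Token {
        final String text; // normalized form that is fed to the detector
        final int start;   // inclusive offset in the original document
        final int end;     // exclusive offset in the original document

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical placeholder data; a real system would plug in its own resources.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "is", "of");
    static final Map<String, String> SYNONYMS = Map.of("automobile", "car");

    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Tokenizes the original text and normalizes each word while remembering its offsets. */
    static List<Token> preprocess(String original) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(original);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(word)) {
                continue; // dropped word: no token is emitted, other offsets are unaffected
            }
            word = SYNONYMS.getOrDefault(word, word); // synonym replacement
            // A stemmer of your choice would be applied to `word` here.
            tokens.add(new Token(word, m.start(), m.end()));
        }
        return tokens;
    }

    /**
     * Maps a match expressed as token indices (first..last, inclusive) in the
     * preprocessed sequence to {start, end} character offsets in the original text.
     */
    static int[] originalSpan(List<Token> tokens, int first, int last) {
        return new int[] { tokens.get(first).start, tokens.get(last).end };
    }
}
```

This assumes the plagiarism function can report its matches as token positions in the preprocessed sequence. For the text `"The automobile is fast"`, the surviving tokens are `car` and `fast`, and `originalSpan(tokens, 0, 1)` yields the offsets of `automobile is fast` in the original string. If the GUI happens to be built with Swing, those two offsets could be passed to `Highlighter.addHighlight`.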

Licensed under: CC-BY-SA with attribution