Question

I am trying to implement a defacement detector for websites. To achieve this, I should develop a tool in Java that compares the similarity between two HTML files. I intend to strip URLs and JS and treat them separately.

I am looking for a tool/library/algorithm that I could use to calculate a similarity metric (ideally a percentage) in order to detect significant changes in websites.

Thank you for your help.


Solution

Since HTML is, in essence, just text-based markup, the easiest way to go is the Levenshtein distance. This algorithm measures the difference between two input strings by charging one point for every insertion, deletion, or substitution of a single character, and finds the 'shortest' edit sequence that turns one string into the other.

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

A sample implementation for Java can be found here.
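In case that link is unavailable, here is a minimal dynamic-programming sketch of the algorithm in Java (the class and method names are just illustrative, not from any particular library):

```java
// Minimal two-row dynamic-programming implementation of Levenshtein distance.
public final class Levenshtein {

    /** Returns the minimum number of single-character edits needed to turn a into b. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning an empty prefix of a into the first j characters of b costs j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn the first i characters of a into an empty string
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,       // insertion
                                 prev[j] + 1),          // deletion
                        prev[j - 1] + substitutionCost  // substitution (or free match)
                );
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```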

By dividing the Levenshtein distance by the length of the longer input string, you get a difference ratio between the two strings; subtracting that from 1 and multiplying by 100 gives a similarity percentage.
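For example, assuming `htmlBefore` and `htmlAfter` hold the two stripped HTML snapshots, the percentage could be computed like this:

```java
// Hypothetical usage: compare two HTML snapshots and report similarity as a percentage.
int distance = Levenshtein.distance(htmlBefore, htmlAfter);
int maxLength = Math.max(htmlBefore.length(), htmlAfter.length());

// 100% means identical, 0% means completely different; guard against two empty strings.
double similarityPercent = (maxLength == 0)
        ? 100.0
        : (1.0 - (double) distance / maxLength) * 100.0;
```

You can then flag a page as potentially defaced whenever the similarity drops below a threshold you choose for your detector.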
