Can similarity detection techniques be applied to a text document formatted as a big ASCII encoded integer byte array?

StackOverflow https://stackoverflow.com/questions/14395441

Question

I would like to detect similarities between files. One way of doing this is to encode the file in order to reduce the input space to the similarity algorithm and second to have more accurate results for the result. This is done by taking into account only informative features of the document. One way of doing this is to transform the file into a vector space transformation following the tf-idf frequence which scales up terms that are very informative and scales down frequent terms. My question is whether this can be done in a document where its text representation is not preserved. Suppose for instance that first the document is transformed into a big integer array where its character is represented as its ASCII value.

Was it helpful?

Solution

The encoding of your document in byte array is not a problem as @Ivan Koblik pointed out since texts are always encoded with numeric data. Your task is a standard document similarity detection problem. I suggest the below steps:

Pre-processing

  • decoding
  • tokenization;
  • case folding;
  • removing stop words;
  • stemming.

Generating features

  • as you pointed out, tf-idf might be a good one.
  • a set of weighted features constitutes a high-dimensional vector.

Fingerprinting

Using simhash or , you can transform the high-dimensional vector into an f-bit fingerprint, e.g., f=64. This is exactly the technique used by Google for near duplicate web page detection.

We maintain an f-dimensional vector V , each of whose dimensions is initialized to zero. A feature is hashed into an f-bit hash value. These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.

However, in your case, you might found other similarity functions like cosine distance or euclidean distance perform better. Try to find the best similarity function for your problem if simhash does not work well for you.

Query for similar documents

After you generate the fingerprint for each document, similar documents should have similar fingerprints (similar means their fingerprints' hamming distance is small). For more details, please refer to near duplicate web page detection.

Edit

If you cannot do the first step, i.e., decoding, simply counting the occurrences of each unique integer is still doable. You can apply tf-idf to those unique integers and follow the subsequence steps.

OTHER TIPS

hmm interesting question. The answer AFAIK is simply NO

text documents means words e.i. synonyms, collocation, suffixes, prefixes (stemming)...

big integer byte array lacks a possibility to cover all those things. Therefore comparing files transformed into byte arrays won't tell you whether there are similar texts.

Take this answer as a answer to your caption. Body of your question is rather complicated and it seems you are mixing more things together (it is hard to understand) - maybe making more simple questions would help ...

Bloom filter fits your case if you don't mind the TF-IDF approach: http://research.microsoft.com/en-us/um/people/navendu/mypapers/webdb-167.pdf
This way you can make vector arbitrary small or large depending on your tolerance to false positives.

Anyway looking at the NLP toolkit most of the algorithms can be adapted for the task, question is how do you specify similarity?

If you can specify different classes of documents then it's worth looking into Naive Bayes or if you want to define similarity differently MaxEnt model might work for you: http://www.kamalnigam.com/papers/maxent-ijcaiws99.pdf

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top