Can similarity detection techniques be applied to a text document formatted as a big ASCII encoded integer byte array?

Question 1

The encoding of your document in byte array is not a problem as @Ivan Koblik pointed out since texts are always encoded with numeric data. Your task is a standard document similarity detection problem. I suggest the below steps:

Pre-processing

decoding
tokenization;
case folding;
removing stop words;
stemming.

Generating features

as you pointed out, tf-idf might be a good one.
a set of weighted features constitutes a high-dimensional vector.

Fingerprinting

Using simhash or locality-sensitive-hash, you can transform the high-dimensional vector into an f-bit fingerprint, e.g., f=64. This is exactly the technique used by Google for near duplicate web page detection.

We maintain an f-dimensional vector V , each of whose dimensions is initialized to zero. A feature is hashed into an f-bit hash value. These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.

However, in your case, you might found other similarity functions like cosine distance or euclidean distance perform better. Try to find the best similarity function for your problem if simhash does not work well for you.

Query for similar documents

After you generate the fingerprint for each document, similar documents should have similar fingerprints (similar means their fingerprints' hamming distance is small). For more details, please refer to near duplicate web page detection.

Edit

If you cannot do the first step, i.e., decoding, simply counting the occurrences of each unique integer is still doable. You can apply tf-idf to those unique integers and follow the subsequence steps.

Question 2

hmm interesting question. The answer AFAIK is simply NO

text documents means words e.i. synonyms, collocation, suffixes, prefixes (stemming)...

big integer byte array lacks a possibility to cover all those things. Therefore comparing files transformed into byte arrays won't tell you whether there are similar texts.

Take this answer as a answer to your caption. Body of your question is rather complicated and it seems you are mixing more things together (it is hard to understand) - maybe making more simple questions would help ...

Question 3

Bloom filter fits your case if you don't mind the TF-IDF approach: http://research.microsoft.com/en-us/um/people/navendu/mypapers/webdb-167.pdf
This way you can make vector arbitrary small or large depending on your tolerance to false positives.

Anyway looking at the NLP toolkit most of the algorithms can be adapted for the task, question is how do you specify similarity?

If you can specify different classes of documents then it's worth looking into Naive Bayes or if you want to define similarity differently MaxEnt model might work for you: http://www.kamalnigam.com/papers/maxent-ijcaiws99.pdf