hashes of ngrams: document fingerprinting

https://stackoverflow.com/questions/8104719

hash
r
text-mining
fingerprinting

27-02-2021

Question

I am trying to implement the winnowing algorithm for document fingerprinting in R.

Here is the reference: http://www.ida.liu.se/~TDDC03/oldprojects/2005/final-projects/prj10.pdf

My question:

How do I get the hashes of the n-grams, and how do I select which of those hashes to keep?

nGrams <- c("adoru", "dorun", "orunr", "runru", "unrun", "nrunr", "runru",
  "unrun", "nruna", "runad", "unado", "nador", "adoru", "dorun", "orunr", "runru",
  "unrun")

Solution

It seems as though

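library(digest)
# hash the characters of each n-gram (serialize=FALSE skips R's serialization header)
v <- sapply(nGrams, digest, algo="crc32", serialize=FALSE)
uv <- unique(v)
# parse the hex as a double: CRC32 values above 2^31 - 1 do not fit in an R integer
(as.numeric(paste0("0x", uv)) - 1) %% 4 == 0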

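would be a good start. The last line flags the hashes that equal 1 mod 4, so it keeps roughly one hash in four. (The 1 is subtracted on the assumption that the CRC32 values are odd. That is not true of CRC32 in general: the CRC32 of "123456789" is cbf43926, which is even. Any fixed remainder works as the selection rule.)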
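Note that a fixed "mod 4" rule is a simpler selection scheme than winnowing proper. Winnowing, as published by Schleimer, Wilkerson and Aiken, slides a window of w consecutive hashes over the document and keeps the minimum hash of each window, taking the rightmost one on ties. A minimal sketch of that selection step follows. The function name winnow, the argument names, and the small hash values are made up for illustration; in practice the hashes would come from something like the CRC32 values above, converted to numbers.

```r
# Winnowing selection: a sketch under the assumptions stated above,
# not a reference implementation.
# hashes: numeric vector of n-gram hashes, in document order
# w:      window size (number of consecutive hashes per window)
winnow <- function(hashes, w) {
  n <- length(hashes)
  if (n < w) stop("need at least w hashes")
  picked <- integer(0)
  for (start in seq_len(n - w + 1L)) {
    window <- hashes[start:(start + w - 1L)]
    # position (in the full vector) of the rightmost minimum in this window
    pos <- start + max(which(window == min(window))) - 1L
    # consecutive windows often select the same position; record it once
    if (length(picked) == 0 || picked[length(picked)] != pos) {
      picked <- c(picked, pos)
    }
  }
  data.frame(position = picked, hash = hashes[picked])
}

# made-up illustrative hashes, one per n-gram
hashes <- c(77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98)
fp <- winnow(hashes, w = 4)
fp  # positions 4, 7, 9, 12, 16 with hashes 17, 17, 8, 39, 17
```

Keeping the position next to each hash is a design choice: it lets a match between two documents be traced back to where the n-gram occurs. With this rule, every run of w consecutive hashes contributes at least one fingerprint, which the "mod 4" rule does not guarantee.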

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow