Find "near duplicate" strings in R

https://stackoverflow.com/questions/13714893

04-12-2021

Question

I am using R to build a sentiment analysis tool and I am having some problems with duplicates. The main source of data is Twitter, and it looks like many accounts are bypassing Twitter's own spam filter by adding some random text at the end of each tweet. For example:

Click xxxxx to buy the amazing xxxxx for FREE ugjh

I get tons of those exact tweets, each with a different random string at the end. They come either from the same user or from different users.

Is there any function like duplicated or unique that returns how close two strings are, so that I can dismiss them when their similarity is above a certain percentage?

I know that doing this will eventually delete real tweets from people saying exactly the same thing, like

I love xxxx !

but I will deal with that in the future.

Any tip in the right direction will be much appreciated!

Solution

I mentioned agrep above. Here's an example based on what you've described. By varying max.distance we can adjust what gets kicked out:

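# Reference tweet to compare the others against
comp <- "Click xxxxx to buy the amazing xxxxx for FREE ugjh"
w <- "I love xxxx !"
x <- "Click xxxxx to purchase the awesome xxxxx for FREE bmf"

# A fractional max.distance is relative to the pattern length; larger values match more loosely
agrep(comp, c(x, w), max.distance = 0.4, value = TRUE)
agrep(comp, c(x, w), max.distance = 0.9, value = TRUE)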
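If you want something closer to a fuzzy version of unique, one possible approach is to compare every tweet against the ones already kept and drop it when it is too similar. The sketch below is an assumption on my part, not part of the agrep example: it swaps in adist (base R, utils package) to get pairwise edit distances, and the helper name drop_near_duplicates and the 0.2 threshold are hypothetical choices you would need to tune.

```r
# Hypothetical helper (illustration only): keep the first occurrence of each
# tweet and drop later ones whose normalized edit distance to an already
# kept tweet is at or below `threshold` (an assumed, tunable cutoff).
drop_near_duplicates <- function(texts, threshold = 0.2) {
  n <- length(texts)
  if (n < 2) return(texts)

  # Pairwise generalized Levenshtein distances
  d <- adist(texts)

  # Normalize by the longer string of each pair so scores fall in [0, 1]
  len <- nchar(texts)
  norm <- d / pmax(outer(len, len, pmax), 1)

  keep <- rep(TRUE, n)
  for (i in seq_len(n)[-1]) {
    earlier <- which(keep[seq_len(i - 1)])
    if (any(norm[i, earlier] <= threshold)) keep[i] <- FALSE
  }
  texts[keep]
}

tweets <- c(
  "Click xxxxx to buy the amazing xxxxx for FREE ugjh",
  "Click xxxxx to buy the amazing xxxxx for FREE qwpz",
  "I love xxxx !"
)
drop_near_duplicates(tweets)
```

Note that adist builds a full distance matrix, so this only suits modest batches of tweets; for larger volumes you would probably need to compare within smaller groups.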

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow