Question

Does anyone know of an R package that solves the longest common substring problem? I am looking for something fast that could work on vectors.


Solution

Check out the "Rlibstree" package on omegahat: http://www.omegahat.org/Rlibstree/.

This uses http://www.icir.org/christian/libstree/.

OTHER TIPS

You should look at the LCS function of the qualV package. It is implemented in C and is therefore quite fast.
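A minimal sketch of using it, assuming qualV is installed. Note two things: qualV's LCS() computes the longest common *subsequence* (not substring), and it operates on vectors of tokens, so each string is split into characters first.

```r
library(qualV)

# LCS() expects token vectors, so split the strings into characters.
a <- strsplit("substring", "")[[1]]
b <- strsplit("string", "")[[1]]

res <- LCS(a, b)
res$LLCS                        # length of the longest common subsequence (6)
paste(res$LCS, collapse = "")   # the subsequence itself ("string")
```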

The question here is not totally clear on the intended application of the solution to the longest common substring problem. A common application that I encounter is matching between names in different datasets. The stringdist package has a useful function amatch() which I find suitable for this task.

In brief, amatch() takes two vectors as input: the first is x, the strings you want to find matches for (this can also be a single string); the second is table, the strings you want to compare against, from which the match with the longest common substring is chosen. amatch() then returns a vector the same length as x - each element of the result is the index in table of the best match.

Details: amatch() takes a method argument, which you set to 'lcs' if you want matching on the longest common substring. There are many other options for different string-matching techniques (e.g. Levenshtein distance). There is also a mandatory maxDist argument: if every string in table is at a greater "distance" than maxDist from a given string in x, then amatch() returns NA for that element of its output. "Distance" is defined differently depending on the string-matching method you choose; for 'lcs', it (more or less) counts the unmatched characters. See the documentation for details.

Parallelization: another nice feature of amatch() is that it automatically parallelizes the operation for you, making reasonable guesses about which system resources to use. If you want more control over this, set the nthread argument.
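To illustrate the maxDist behaviour described above, a small sketch with made-up strings (assuming stringdist is installed):

```r
library(stringdist)

# With a finite maxDist, amatch() returns NA when no candidate in
# table is close enough to the query string.
amatch("hello", c("apple", "pear"), method = "lcs", maxDist = 2)  # NA
amatch("hello", c("hell", "pear"), method = "lcs", maxDist = 2)   # 1
```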

Example application:

library(stringdist)

Names1 = c(
"SILVER EAGLE REFINING, INC. (SW)",
"ANTELOPE REFINING",
"ANTELOPE REFINING (DOUGLAS FACILITY)"
)

Names2 = c(
"Mobile Concrete, Inc.",
"Antelope Refining, LLC. ",
"Silver Eagle Refining Inc."
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)
Match_Idx
# [1] 3 2 2

Matches = data.frame(Names1, Match = Names2[Match_Idx])
Matches

#                                 Names1                      Match
# 1     SILVER EAGLE REFINING, INC. (SW) Silver Eagle Refining Inc.
# 2                    ANTELOPE REFINING   Antelope Refining, LLC. 
# 3 ANTELOPE REFINING (DOUGLAS FACILITY)   Antelope Refining, LLC. 

### Compare Matches:

Matches$Distance = stringdist(tolower(Matches$Names1), tolower(Matches$Match), method = 'lcs')

Also, unlike functions such as LCS from qualV, this will not favor "subsequence" matches that skip intermediate characters in order to form a match (as discussed here). For instance, see this:

Names1 = c(
"hello"
)

Names2 = c(
"hel123l5678o",
"hell"
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)

Matches = data.frame(Names1, Match = Names2[Match_Idx])
Matches

#   Names1 Match
# 1  hello  hell

I don't know R, but I used to implement Hirschberg's algorithm (for the longest common subsequence), which is fast and doesn't consume much space.

As I remember, it is only two or three short, recursively called functions.

Here is a link: http://wordaligned.org/articles/longest-common-subsequence

So don't hesitate to implement it in R; it is worth the effort, since it is a very interesting algorithm.
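For anyone who wants to try, here is a rough base-R sketch following the structure described in the linked article: a score helper that keeps only the last row of the LCS length table, plus the divide-and-conquer recursion. The function names are my own; treat this as an illustration, not production code.

```r
# Last row of the LCS length table for x vs y, in O(length(y)) space.
nw_score <- function(x, y) {
  prev <- integer(length(y) + 1)
  for (i in seq_along(x)) {
    curr <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      curr[j + 1] <- if (x[i] == y[j]) prev[j] + 1
                     else max(prev[j + 1], curr[j])
    }
    prev <- curr
  }
  prev
}

# Hirschberg's recursion: split x in half, find the best split point
# of y by combining prefix and suffix scores, then recurse on both halves.
hirschberg <- function(x, y) {
  if (length(x) == 0 || length(y) == 0) return(character(0))
  if (length(x) == 1) return(if (x %in% y) x else character(0))
  mid <- length(x) %/% 2
  top <- nw_score(x[1:mid], y)
  bot <- rev(nw_score(rev(x[(mid + 1):length(x)]), rev(y)))
  split <- which.max(top + bot) - 1     # best split point in y
  left  <- if (split > 0) y[1:split] else character(0)
  right <- if (split < length(y)) y[(split + 1):length(y)] else character(0)
  c(hirschberg(x[1:mid], left),
    hirschberg(x[(mid + 1):length(x)], right))
}

x <- strsplit("hello", "")[[1]]
y <- strsplit("hel123l5678o", "")[[1]]
paste(hirschberg(x, y), collapse = "")   # "hello"
```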

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow