Question

Assume that I have a data structure similar to the following, where doc_id is the document identifier, text_id is the unique text/version identifier, and text is a character string:

df <- cbind(doc_id=as.numeric(c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)), 
                text_id=as.numeric(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), 
                  text=as.character(c("string1", "str2ing", "3string", 
                                      "string6", "s7ring", "string8", 
                                      "string9", "string10")))

In the loop below I am attempting string edit-distance comparisons, but only between different versions of the same document. In short, I want to find matching doc_ids and compare, pairwise, only the different versions (text_ids) of each document.

#Results matrix
result <- matrix(ncol=10, nrow=10)

#Loop
i=1
for (j in 1:length(df[,2])) {
  for (i in 1:length(df[,2])) {
#Conditional Statements
    if(df[i,1]==df[j,1]){
      result[i,j]<-levenshteinDist(df[j,3], df[i,3])}
    else(result[i,j]<-"Not Compared")
  }
  print(result[i,j])
  flush.console()
}

Returns:

[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "0"

The levenshteinDist() function can be found in the RecordLinkage package; a similar function, adist(), is also bundled in the utils package.

My question is: why is my first conditional statement (if) being ignored, and only the else portion being returned?

Any further advice on coding or processing time efficiency gains will be greatly appreciated.


Solution

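Your conditional is not being ignored; the results are just printed from the wrong place. print(result[i, j]) sits outside the inner loop, so it runs once per j, after i has reached 10, and only shows the last row of the matrix. Run this version to see the comparisons as they happen. Comment out the message() call once you are satisfied that everything works.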

library(RecordLinkage)

df <- structure(c("1", "1", "2", "2", "3", "4", "4", "4", "5", "6", 
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "string1", 
"str2ing", "3string", "string6", "s7ring", "string8", "string9", 
"string10", "string1", "str2ing"), .Dim = c(10L, 3L), .Dimnames = list(
    NULL, c("doc_id", "text_id", "text")))

result <- matrix(ncol = 10, nrow = 10)
# nrow() and ncol() are more elegant ways of getting row/column counts.
for(j in 1:nrow(df)) {
    for(i in 1:nrow(df)) {
        # message() appends its own newline; i and j are reported in that order.
        message(sprintf("comparing i=%s (%s), j=%s (%s)", i, df[i, 1], j, df[j, 1]))
        if(identical(df[i, 1], df[j, 1])) {
            result[i, j] <- levenshteinDist(df[j, 3], df[i, 3])
        } else {
            result[i, j] <- "Not Compared"
        }
        # Print inside the inner loop so that every comparison is shown.
        print(result[i, j])
    }

}
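
On the efficiency part of the question, the double loop can probably be avoided altogether. The following is a minimal sketch, not the original poster's method: it assumes adist() from utils is an acceptable stand-in for levenshteinDist(), and that NA is an acceptable marker in place of "Not Compared" so the matrix stays numeric. The names doc_id, text, dist_all, same_doc and result_num are illustrative.

```r
# Illustrative data, matching the matrix used in the answer above.
doc_id <- c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)
text <- c("string1", "str2ing", "3string", "string6", "s7ring",
          "string8", "string9", "string10", "string1", "str2ing")

# adist() with a single argument returns the full pairwise distance matrix.
dist_all <- adist(text)

# TRUE wherever row i and row j belong to the same document.
same_doc <- outer(doc_id, doc_id, "==")

# Keep only within-document distances; NA marks pairs that were not compared.
result_num <- dist_all
result_num[!same_doc] <- NA
```

This computes every pairwise distance and then discards most of them, which is fine for ten strings; for a large corpus it may be cheaper to split the texts by doc_id and call adist() once per document.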

Other tips

For starters, if I understand your objective, the if-statement should stay as if (df[i,1]==df[j,1]), so that you are comparing doc_id values with each other and only pairing versions of the same document.

The problem here isn't that your conditional is being ignored, but rather that you are outputting your results incorrectly. result is a 10x10 matrix, but print(result[i,j]) sits in the outer loop over j, after the inner loop has finished, so you only ever see the last row (i = 10). I think the code should look more like this:

for (i in 1:length(df[,2])) {
    for (j in 1:length(df[,2])) {

        if(df[i,1]==df[j,1]) {
            result[i,j]<-adist(df[j,3], df[i,3])
        } else {
            result[i,j]<-"Not Compared"
        }
    }
}

This will build the matrix of results, and you can then view the results of all 100 comparisons as you desire.
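
One caveat when viewing the results: cbind() coerces every column to character, and storing "Not Compared" turns the whole result matrix into character as well. The sketch below is one possible way around that, under the assumption that NA is an acceptable marker for skipped pairs and that a version should not be compared with itself. The names docs, pairs and compared are illustrative, not from the original post.

```r
# A data frame keeps doc_id and text_id numeric and text as character.
docs <- data.frame(
  doc_id  = c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6),
  text_id = 1:10,
  text    = c("string1", "str2ing", "3string", "string6", "s7ring",
              "string8", "string9", "string10", "string1", "str2ing"),
  stringsAsFactors = FALSE
)

n <- nrow(docs)
result <- matrix(NA_real_, nrow = n, ncol = n)

for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # Same document, but a different version.
    if (docs$doc_id[i] == docs$doc_id[j] && i != j) {
      result[i, j] <- adist(docs$text[i], docs$text[j])[1, 1]
    }
  }
}

# List each compared pair once (upper triangle only).
pairs <- which(!is.na(result) & upper.tri(result), arr.ind = TRUE)
compared <- data.frame(
  text_id_a = docs$text_id[pairs[, "row"]],
  text_id_b = docs$text_id[pairs[, "col"]],
  distance  = result[pairs]
)
print(compared)
```

The compared data frame then holds one row per within-document pair, which is usually easier to scan than a mostly empty 10x10 matrix.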

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow