Вопрос

Let D be a data.frame, with D$x containing real numbers and D$y containing booleans, among other fields.

The problem is to sort the rows of D so that D$x is non-decreasing, while breaking ties in a way that minimizes the number of discontinuities in the resulting D$y.

Is there a simple fast way to accomplish this in R?

More Information

In a language like C I would first sort by x, then pass over the result sequentially with a 2-state FSM to iron out the discontinuities as far as possible. But in R, I expect iteration to carry unnecessary overhead if there are thousands of rows to process sequentially.

Example correct result:

D$x  D$y
1    FALSE
1    FALSE
1    TRUE
1    TRUE
1.2  TRUE
1.5  TRUE
1.5  FALSE

Example incorrect result:

D$x  D$y
1    TRUE
1    FALSE
1    TRUE
1    FALSE
1.2  TRUE
1.5  FALSE
1.5  TRUE

In the example, the correct result has 2 discontinuities while the incorrect result has 6.

EDIT: We can assume the data is such that the density of discontinuities in the result will be low: Less than 1 discontinuity per 1000 rows, say.

Это было полезно?

Решение 3

Much faster solution than sequential iteration (if discontinuities are sparse):

sortForMaxContY <- function(D,initialY){
    n <- nrow(D)
    D <- D[order(D$x),]

    xChanges <- D$x[-1]!=D$x[-n]
    isLastOfXVal <- c(xChanges,T)
    rankOfXVal <- cumsum(c(T,xChanges))

    oldFinalYs <- NA
    finalYs <- D$y[isLastOfXVal]
    while(!identical(finalYs,oldFinalYs)){
        finalYOfPrecedingXVal <- c(initialY,finalYs)[rankOfXVal]
        oldFinalYs <- finalYs
        D <- D[order(D$x,xor(finalYOfPrecedingXVal,D$y)),]
        finalYs <- D$y[isLastOfXVal]
    }
    return(D)
}

One of sortForMaxContY(D,T) and sortForMaxContY(D,F) is the optimum, and the other usually is too, depending on the data.

Другие советы

This will not give you perfect results, if there is an optimal rearranging of y, but otherwise will work

D[order(D$x, D$y), ]

Brute force solution:

sortForMaxContY <- function(D,initialY){
    n <- nrow(D)

    D <- D[order(D$x),]

    x <- c(D$x,Inf)
    whichT <- c(which(D$y),n+1)
    whichF <- c(which(!D$y),n+1)

    finalOrder <- rep(0,n) # allocate space
    lastY <- initialY
    iT <- 1
    iF <- 1
    for(i in 1:n){
        wT <- whichT[iT]
        wF <- whichF[iF]
        chooseT <- sign(x[wF]-x[wT])+lastY-0.5>0
        w <- ifelse(chooseT, wT, wF)
        finalOrder[i] <- w
        lastY <- D$y[w]
        iT <- iT + chooseT
        iF <- iF + !chooseT
    }

    return(D[finalOrder,])
}

One of sortForMaxContY(D,T) and sortForMaxContY(D,F) is the optimum, and the other usually is too, depending on the data.

Doesn't R have a way to do this faster?

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top