Question

I have lots of SiLK flow data that I would like to do some data mining on. It looks like the destination IP column matches the source IP column of a row of data further down. The rows (with many more columns) look like this:

UID SIP DIP PROTOCOL    SPORT   DPORT
720107626538    1207697420  3232248333  17  53  7722
720108826800    3232248333  1207697420  17  47904   53

I have never programmed in R or SPSS and am having trouble figuring out how to turn 2 rows of 27 columns of data into 1 row of 54 columns of data.


Solution

You can get corresponding SIP and DIP records on the same line through merge:

df <- data.frame(
  "UID" = c(720107626538, 720108826800),
  "SIP" = c(1207697420, 3232248333),
  "DIP" = c(3232248333, 1207697420),
  "PROTOCOL" = c(17, 17),
  "SPORT" = c(53, 47904),
  "DPORT" = c(7722, 53),
  stringsAsFactors = FALSE)

df_merged <- merge(
  df[,setdiff(colnames(df), "DIP")],
  df[,setdiff(colnames(df), "SIP")],
  by.x = "SIP",
  by.y = "DIP",
  all = FALSE,
  suffixes = c("_SIP", "_DIP"))

After that, you can use the UID fields to remove duplicates:

# keep one row per matched pair: a row is dropped when its
# (UID_SIP, UID_DIP) pair has already been seen in either order
seen <- character(0)
keep <- logical(nrow(df_merged))
for (i in seq_len(nrow(df_merged))) {
  key <- paste(sort(c(df_merged$UID_SIP[i], df_merged$UID_DIP[i])), collapse = "-")
  keep[i] <- !(key %in% seen)
  seen <- c(seen, key)
}
df_merged <- df_merged[keep, ]

df_merged

         SIP      UID_SIP PROTOCOL_SIP SPORT_SIP DPORT_SIP      UID_DIP PROTOCOL_DIP SPORT_DIP DPORT_DIP
1 1207697420 720107626538           17        53      7722 720108826800           17     47904        53

Because the de-duping relies on a loop, the whole thing could get very time-consuming if your dataset is large.
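If the loop does become a bottleneck, the de-duplication can be vectorized instead: build an order-independent key from each row's two UIDs with pmin/pmax and drop duplicated keys. A minimal sketch using the same toy data (df_deduped is a name chosen for this sketch, not from the original answer):

```r
# Rebuild the merged data as above, self-contained
df <- data.frame(
  UID = c(720107626538, 720108826800),
  SIP = c(1207697420, 3232248333),
  DIP = c(3232248333, 1207697420),
  PROTOCOL = c(17, 17),
  SPORT = c(53, 47904),
  DPORT = c(7722, 53))

df_merged <- merge(
  df[, setdiff(colnames(df), "DIP")],
  df[, setdiff(colnames(df), "SIP")],
  by.x = "SIP", by.y = "DIP",
  suffixes = c("_SIP", "_DIP"))

# order-independent key: smaller UID first, larger UID second,
# so the mirror of a pair produces the same key
key <- paste(pmin(df_merged$UID_SIP, df_merged$UID_DIP),
             pmax(df_merged$UID_SIP, df_merged$UID_DIP))
df_deduped <- df_merged[!duplicated(key), ]
```

Because duplicated() keeps the first occurrence, this yields the same single row as the loop above, without revisiting the data frame row by row.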

OTHER TIPS

In SPSS, I would tackle this (from what I can gather in your comments and questions) by making a new ID variable that marks the cases where the lagged values of SIP and DIP correspond to one another, and then using CASESTOVARS to reshape the data from long to wide.

******************************************************************.
*Fake data that looks like yours.
data list free / UID SIP DIP PROTOCOL  SPORT.
begin data
1 1207697420  3232248333  17  53
2 3232248333  1207697420  17 47904
3 1 2 5 6
4 2 1 3 2
5 1 3 0 1
6 1 4 8 9
end data.

*Can make our own new id to reshape.
DO IF $casenum = 1.
    compute new_id = 1.
ELSE IF SIP = lag(DIP) and DIP = lag(SIP).
    compute new_id = lag(new_id).
ELSE.
    compute new_id = lag(new_id) + 1.
END IF.

*then reshape from long to wide.
CASESTOVARS
/ID new_id.
LIST. 
******************************************************************.

This is assuming, as you said in your comment, that "The DIP in one dataset is to be matched to the SIP in the second dataset, but only the very next match, sorted by UID". The end result then looks like this (with periods representing missing data).

new_id UID.1 UID.2 SIP.1 SIP.2 DIP.1 DIP.2 PROTOCOL.1 PROTOCOL.2 SPORT.1 SPORT.2

1.00     1.00     2.00 1.2E+009 3.2E+009 3.2E+009 1.2E+009     17.00      17.00     53.00 47904.00
2.00     3.00     4.00     1.00     2.00     2.00     1.00      5.00       3.00      6.00     2.00
3.00     5.00      .       1.00      .       3.00      .         .00        .        1.00      .
4.00     6.00      .       1.00      .       4.00      .        8.00        .        9.00      .
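If you end up doing this in R rather than SPSS, the same lag-based pairing can be sketched in base R with a vectorized comparison against the previous row (using the fake data above; new_id, occ, and wide are names invented for this sketch):

```r
df <- data.frame(
  UID = 1:6,
  SIP = c(1207697420, 3232248333, 1, 2, 1, 1),
  DIP = c(3232248333, 1207697420, 2, 1, 3, 4),
  PROTOCOL = c(17, 17, 5, 3, 0, 8),
  SPORT = c(53, 47904, 6, 2, 1, 9))

n <- nrow(df)
# a row belongs to the previous row's pair when its SIP/DIP mirror that row
same_pair <- c(FALSE, df$SIP[-1] == df$DIP[-n] & df$DIP[-1] == df$SIP[-n])
df$new_id <- cumsum(!same_pair)

# position of each row within its pair (1 or 2), then reshape long to wide
df$occ <- ave(seq_len(n), df$new_id, FUN = seq_along)
wide <- reshape(df, idvar = "new_id", timevar = "occ", direction = "wide")
```

This reproduces the SPSS result: four wide rows, with NA (SPSS's system-missing) in the `.2` columns for the unmatched flows.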

It isn't clear from your initial question what counts as a duplicate, but if you don't want duplicates you will want to get rid of them before the CASESTOVARS, I imagine. If a duplicate means having the same values on the other variables, just with SIP and DIP interchanged, one thing I have done in the past is to make two new variables, placing the smaller value in the first new field and the larger value in the second field. E.g.

DO IF SIP >= DIP.
    compute ID1 = DIP.
    compute ID2 = SIP.
ELSE.
    compute ID1 = SIP.
    compute ID2 = DIP.
END IF.

Then you can use the two new ID variables to identify duplicates irrespective of the order of the original SIP and DIP values.
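For completeness, the same min/max trick is a one-liner per field in R with pmin/pmax (flows, ID1, and ID2 are names made up for this sketch):

```r
# flows where the same connection appears once in each direction
flows <- data.frame(
  SIP = c(1207697420, 3232248333),
  DIP = c(3232248333, 1207697420),
  PROTOCOL = c(17, 17))

flows$ID1 <- pmin(flows$SIP, flows$DIP)  # smaller of the two addresses
flows$ID2 <- pmax(flows$SIP, flows$DIP)  # larger of the two addresses

# rows with the same (ID1, ID2, PROTOCOL) are duplicates regardless of direction
dup <- duplicated(flows[, c("ID1", "ID2", "PROTOCOL")])
```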

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow