R: Create a new column with multiple categories (levels) based on the criteria of two other columns

StackOverflow https://stackoverflow.com/questions/19406190

  •  30-06-2022

Question

My data looks like this:

> head(CPUE)
    Lon.rect Lat.rect         q1          q4
    1     13.5    54.25  0.1930234  1.76096260
    2     13.5    54.75 11.6866331 19.06265440
    3     13.5    55.25 24.2532215 33.64530930
    4     13.5    55.75  0.2113688  0.05731537
    5     14.5    54.25  2.5600818  8.72482876
    6     14.5    54.75 85.8657297 34.08524869

Now I would like to create a new column with multiple categories (levels) based on the combination of values in the columns "Lon.rect" and "Lat.rect". Each category should be named according to the values in those two columns. For example, for Lon.rect = 13.5 and Lat.rect = 54.25 the category in the new column would be "1A", while in row 2 it would be "1B", because Lat.rect contains a different value. Row 5 would be "2A", and so on.

"Lon.rect" and "Lat.rect" contain coordinates (if that matters to anyone) and have several more combinations, with Lon from 13.5 to 22.5 and Lat from 54.25 to 58.75.

I created a new column called "subdiv" by:

CPUE["subdiv"] <- NA

The whole dataset now looks like this:

   > head(CPUE)
      Lon.rect Lat.rect         q1          q4 subdiv
    1     13.5    54.25  0.1930234  1.76096260     NA
    2     13.5    54.75 11.6866331 19.06265440     NA
    3     13.5    55.25 24.2532215 33.64530930     NA
    4     13.5    55.75  0.2113688  0.05731537     NA
    5     14.5    54.25  2.5600818  8.72482876     NA
    6     14.5    54.75 85.8657297 34.08524869     NA

I know I could enter everything as below, but that would take ages since there is a lot of data.

CPUE$subdiv[CPUE$Lon.rect>=13 & CPUE$Lon.rect<=14 & CPUE$Lat.rect>=54.0 & CPUE$Lat.rect<=54.5] <- "1A"
CPUE$subdiv[CPUE$Lon.rect>=13 & CPUE$Lon.rect<=14 & CPUE$Lat.rect>=54.5 & CPUE$Lat.rect<=55.0] <- "1B"
CPUE$subdiv[CPUE$Lon.rect>=13 & CPUE$Lon.rect<=14 & CPUE$Lat.rect>=55.0 & CPUE$Lat.rect<=55.5] <- "1C"

I hope I made my description quite clear, otherwise don't hesitate to contact me! If anyone has a good solution to any of the steps, please write back! Thanks! /Filip

EDIT:

Further information about my problem

The names above ("1A", "1B" and "2A") are just examples to make clear how the new column should relate to the source columns. I really want to name the categories something else; I already got some nice help below, if anyone is interested.

In my case I would like to name the Lat.rect values with integers starting at 37. Lon.rect is a bit trickier: its name is composed of one letter and one digit, starting at G3 (in this case). The highest digit for each letter is 9, and the next letter then starts at 0, so the name after G9 is H0.

If it helps, I do not need a script that builds this combination for the whole alphabet. Across all my data sets (not needed currently), the minimum possible name is F9 and the maximum is H9.

I would also like the lat name to come first and the lon name second. If it is easier to swap the columns in the data.frame before creating the name, that would be fine.

The finished combination of the first row would be "37G3", and then the second row "38G3". Row 5 would be "37G4".

If anyone would be able to help me with this second part, I would be grateful!


Solution 2

More generally, in case your data is not sorted like this (by lon and then by lat) and you want subdiv to include all levels of lon and lat, you could do the following:

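    # Example data: 40 random lon/lat pairs drawn from the grid values
    CPUE <- data.frame(lon = as.vector(replicate(4, sample(13.5:22.5, 10, TRUE))),
                       lat = as.vector(replicate(4, sample(seq(54, 56.25, 0.25), 10, TRUE))))

    # Position of each value among the sorted unique values of its column
    num <- findInterval(CPUE$lon, sort(unique(CPUE$lon)))
    lett <- findInterval(CPUE$lat, sort(unique(CPUE$lat)))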

    CPUE$subdiv <- paste(num, LETTERS[lett], sep = "")

    CPUE
        lon   lat subdiv
    1  13.5 54.50     1C #this is the first possible "lon" and the third possible "lat"
    2  15.5 54.50     3C
    3  20.5 55.25     8F #this is the eighth possible "lon" and the sixth possible "lat"
    4  19.5 54.00     7A
    5  16.5 55.75     4H

NOTE: This approach won't work if (1) you don't want to include all possible levels of "lon" and "lat", and (2) your data is not sorted as posted.
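
To make the first caveat concrete, here is a small sketch using the six rows from the question. The names `demo`, `gappy` and `label_by_rank` are made up for illustration. Because the labels are ranks among the values actually present in the data, dropping one latitude shifts every later letter.

```r
# Six rows from the question; `demo` is a hypothetical name for illustration.
demo <- data.frame(
  lon = c(13.5, 13.5, 13.5, 13.5, 14.5, 14.5),
  lat = c(54.25, 54.75, 55.25, 55.75, 54.25, 54.75)
)

# Rank each value among the sorted unique values present in the data.
label_by_rank <- function(d) {
  num  <- findInterval(d$lon, sort(unique(d$lon)))
  lett <- findInterval(d$lat, sort(unique(d$lat)))
  paste0(num, LETTERS[lett])
}

full_labels <- label_by_rank(demo)
print(full_labels)   # "1A" "1B" "1C" "1D" "2A" "2B"

# Remove every row with lat 54.75: 55.25 is now the second latitude present,
# so it is labelled "B" instead of "C".
gappy <- demo[demo$lat != 54.75, ]
gappy_labels <- label_by_rank(gappy)
print(gappy_labels)  # "1A" "1B" "1C" "2A"
```

So the labels are only stable across data sets if every data set contains the same set of lon and lat values.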

EDIT

Maybe something like this:

    CPUE <- data.frame(lon = sort(rep(13.5:22.5, 13)),
                       lat = rep(seq(54.25, 60.25, 0.5), 10))

    lat_names <- findInterval(CPUE$lat, sort(unique(CPUE$lat))) + 36

    lon_names <- as.vector(sapply(LETTERS, paste, 0:9, sep = ""))
    lon_names <- lon_names[match("G3", lon_names):length(lon_names)]
    lon_names <- lon_names[findInterval(CPUE$lon, sort(unique(CPUE$lon)))]

    CPUE$subdiv <- paste(lat_names, lon_names, sep = "")

    > CPUE
         lon   lat subdiv
    1   13.5 54.25   37G3
    2   13.5 54.75   38G3
    3   13.5 55.25   39G3
    4   13.5 55.75   40G3
    5   13.5 56.25   41G3
    6   13.5 56.75   42G3
    7   13.5 57.25   43G3
    8   13.5 57.75   44G3
    9   13.5 58.25   45G3
    10  13.5 58.75   46G3
    11  13.5 59.25   47G3
    12  13.5 59.75   48G3
    13  13.5 60.25   49G3
    14  14.5 54.25   37G4
    15  14.5 54.75   38G4
    16  14.5 55.25   39G4
    17  14.5 55.75   40G4
    18  14.5 56.25   41G4
    19  14.5 56.75   42G4
    20  14.5 57.25   43G4
    ....
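
If the labels must not depend on which rectangles happen to be present, one option is to compute the index from the coordinate itself. The following is only a sketch, and it assumes things the answer does not state: that the grid is regular, with 1-degree steps in longitude and 0.5-degree steps in latitude, and that the cell at lon 13.5, lat 54.25 is "37G3" as in the question. The function name `rect_name` and all of its arguments are invented here.

```r
# Hypothetical helper: name a rectangle from its coordinates alone.
# Assumes a regular grid (1 degree in lon, 0.5 degree in lat) anchored so that
# lon 13.5 / lat 54.25 maps to "37G3", as in the question.
rect_name <- function(lon, lat,
                      lon0 = 13.5, lat0 = 54.25,
                      lon_step = 1, lat_step = 0.5,
                      first_lat = 37, first_lon = "G3") {
  # All letter/digit codes in order: A0..A9, B0..B9, ...
  codes <- paste0(rep(LETTERS, each = 10), 0:9)

  # Number of grid steps away from the anchor cell;
  # round() guards against floating-point noise in the division.
  lat_idx <- round((lat - lat0) / lat_step)
  lon_idx <- round((lon - lon0) / lon_step)

  paste0(first_lat + lat_idx, codes[match(first_lon, codes) + lon_idx])
}

rect_name(c(13.5, 13.5, 14.5, 20.5), c(54.25, 54.75, 54.25, 58.75))
# "37G3" "38G3" "37G4" "46H0"
```

The last value shows the rollover from G9 to H0 described in the question. Coordinates below the anchor would need a different `lon0`/`first_lon` pair (for example one starting at F9), which this sketch does not handle.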

Other answers

Using interaction would be one way to get the levels from unique combinations of factors in your columns. However, I use match on the first two columns to find the position of each element in a table of the unique elements. I can then paste these values together and use as.factor to coerce the result to a factor variable. I find this makes renaming the levels more intuitive, and it does not rely on the data.frame being sorted.

a <- match( df[,1] , unique( df[,1] ) )
b <- letters[ match( df[,2] ,  unique( df[,2] ) ) ]

df$new <- as.factor( paste0( a , b ) )
#  Lon.rect Lat.rect         q1          q4 new
#1     13.5    54.25  0.1930234  1.76096260  1a
#2     13.5    54.75 11.6866331 19.06265440  1b
#3     13.5    55.25 24.2532215 33.64530930  1c
#4     13.5    55.75  0.2113688  0.05731537  1d
#5     14.5    54.25  2.5600818  8.72482876  2a
#6     14.5    54.75 85.8657297 34.08524869  2b
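
For comparison, here is a short sketch of the interaction route mentioned above, plus one behaviour of match worth knowing: match(x, unique(x)) numbers values in order of first appearance, not in sorted order. The data frame `d` is a made-up, deliberately unsorted example, and `by_appearance` / `by_value` are names chosen only for this illustration.

```r
# Made-up, deliberately unsorted data for illustration.
d <- data.frame(lon = c(14.5, 13.5, 13.5, 14.5),
                lat = c(54.75, 54.25, 54.75, 54.25))

# interaction() builds one factor level per combination; drop = TRUE removes
# combinations that never occur in the data.
combo <- interaction(d$lon, d$lat, drop = TRUE)
print(levels(combo))

# match() against unique() numbers values by first appearance ...
by_appearance <- paste0(match(d$lon, unique(d$lon)),
                        letters[match(d$lat, unique(d$lat))])
# ... while matching against the sorted unique values follows coordinate order.
by_value <- paste0(match(d$lon, sort(unique(d$lon))),
                   letters[match(d$lat, sort(unique(d$lat)))])

print(by_appearance)  # "1a" "2b" "2a" "1b"
print(by_value)       # "2b" "1a" "1b" "2a"
```

Both variants give each combination its own label; they differ only in which combination gets which label, which matters if "1a" is expected to mean the smallest lon and lat.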
Licensed under: CC-BY-SA with attribution