Question

I have a table where one of the variables is country of registration.

table(df$reg_country)

returns:

   AR    BR    ES    FR    IT
  123   202   578   642   263

Now, if I subset the original table to exclude one of the countries

df_subset <- subset(df, reg_country != 'AR')
table(df_subset$reg_country)

returns:

   AR    BR    ES    FR    IT
    0   202   578   642   263

This second result is very surprising to me, as R seems to somehow magically know that I have removed the entries for AR.

Why does that happen?

Does it affect the size of the second data frame (df_subset)? If so, is there a more efficient way to subset in order to minimize the size?


Solution

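df$reg_country is a factor, and a factor stores all of its possible levels in its levels attribute. Subsetting removes rows but leaves that attribute unchanged, so AR is still a level of df_subset$reg_country, and table() reports a count for every level, including unused ones. Check levels(df_subset$reg_country).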
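
To see this on a small scale, here is a minimal sketch. The data frame below is a made-up stand-in for df (the real data is not shown in the question), and reg_country is built explicitly with factor() so the example behaves the same regardless of R version.

```r
# Toy data standing in for df; the values are invented for illustration.
df <- data.frame(reg_country = factor(c("AR", "BR", "BR", "ES", "FR", "IT")))

# Same subsetting call as in the question.
df_subset <- subset(df, reg_country != "AR")

# The AR rows are gone, but "AR" is still listed as a level ...
levels(df_subset$reg_country)

# ... so table() still shows an AR column, with a count of zero.
table(df_subset$reg_country)
```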

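Factor levels only have a significant impact on data size if you have a huge number of them, and I wouldn't expect that to be the case here. However, you can remove unused levels with droplevels(). It returns a new factor rather than modifying the existing one, so assign the result back: df_subset$reg_country <- droplevels(df_subset$reg_country). Calling droplevels(df_subset) does the same for every factor column in the data frame.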
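
As a rough sketch of dropping the unused level, again using an invented toy data frame in place of the real df:

```r
# Toy data standing in for df; the values are invented for illustration.
df <- data.frame(reg_country = factor(c("AR", "BR", "BR", "ES", "FR", "IT")))
df_subset <- subset(df, reg_country != "AR")

# droplevels() returns a new factor, so the result has to be assigned back.
df_subset$reg_country <- droplevels(df_subset$reg_country)

# Alternative: drop unused levels in every factor column at once.
# df_subset <- droplevels(df_subset)

# "AR" no longer appears among the levels or in the table.
levels(df_subset$reg_country)
table(df_subset$reg_country)

# If size is a concern, object.size() reports the memory used by the object.
object.size(df_subset)
```

With only a handful of levels, the difference in object.size() before and after dropping is likely to be negligible.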

Licensed under: CC-BY-SA with attribution