Question

I have the following series of commands:

my_data = read.csv(file='r-stats.out', sep='\t', skip=1)
histsub = subset(my_data, my_data[,10] != "Invalid")
hist(as.numeric(histsub[,10]))

r-stats.out is a file with 10 columns. Column 10 (the one I am trying to plot) contains numbers ranging from -2000 to 10000, or the word "Invalid", which I try to filter out first. For some reason, my histogram only covers the range 0 to 2500 and ignores everything else. Why? What is happening? I ran

print(histsub)

and everything looks okay: those numbers are present in histsub, but not on the plot. Please help.

EDIT: Adding a few lines from the printed my_data and from histsub.

my_data:

38    629345  1  633201  0   -41 Invalid    0   g    0     -37
39    633201  0  628727  0  4496     323    0   g    0    4629
40    628727  0  631371  1  7835     202    0   g    0 Invalid
41    631371  1  625871  1  7317     112    0   g    0    7379
42    625871  1  633427  1  1351     348    0   g    0    1321

histsub:

38    629345  1  633201  0  -41 Invalid    0   g    0   -37
39    633201  0  628727  0 4496     323    0   g    0  4629
41    631371  1  625871  1 7317     112    0   g    0  7379
42    625871  1  633427  1 1351     348    0   g    0  1321

Solution

Try my_data[,10] = as.numeric(as.character(my_data[,10])). All the "Invalid" entries will then be converted to NA (with a coercion warning) and won't show up in the histogram anyway.
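To illustrate why the as.character() step matters, here is a minimal sketch. It assumes column 10 was read in as a factor; the vector `x` is a hypothetical stand-in built from the sample rows in the question, not the real file.

```r
# Hypothetical stand-in for column 10, assumed to have been read as a factor
x <- factor(c("-37", "4629", "Invalid", "7379", "1321"))

# as.numeric() on a factor returns the internal level codes (1..nlevels),
# not the numbers that the labels spell out
codes <- as.numeric(x)

# Converting to character first recovers the real values;
# "Invalid" becomes NA (the coercion warning is silenced here)
values <- suppressWarnings(as.numeric(as.character(x)))

print(codes)
print(values)
```

Passing `values` to hist() would then plot the real numbers, since hist() drops NA entries before binning.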

Other suggestions

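Because the column contains the string "Invalid", it is read as text rather than numbers, so it is probably being converted implicitly to a factor with ~2500 unique levels. as.numeric then returns the level codes (1 to ~2500) instead of the values, which would explain the range of your histogram. Try passing the argument stringsAsFactors = FALSE to read.csv.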
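A rough sketch of this approach follows. The inline `sample_text` and its column names `id` and `value` are made up for illustration and only mimic the shape of r-stats.out. Note that since R 4.0.0 stringsAsFactors already defaults to FALSE, so the explicit argument mainly matters on older versions.

```r
# Hypothetical tab-separated sample standing in for r-stats.out
sample_text <- "id\tvalue\n38\t-37\n39\t4629\n40\tInvalid\n41\t7379\n"

# Keep strings as plain character vectors instead of factors
df <- read.csv(text = sample_text, sep = "\t", stringsAsFactors = FALSE)

# The column is character, so as.numeric() parses the text itself
is_char <- is.character(df$value)

# Filter out the "Invalid" rows, then convert what remains
kept <- subset(df, value != "Invalid")
nums <- as.numeric(kept$value)

print(nums)
```

With the real file, the same call shape would be read.csv(file='r-stats.out', sep='\t', skip=1, stringsAsFactors=FALSE), after which the original subset and hist lines should work on the actual values.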

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow