Question

I have been using R to retrieve data from the NCBI about a list of genetic polymorphisms (the rs numbers in the leftmost column below), and as you can see, the table returned contains rows that lack data (essentially tab-spaced nothing). The rows with data in every column (for example rs1968866) are those for which gene symbols were found, and I would like to keep these and filter out those that lack data.

The command I am familiar with for reading in tables is read.table(file, header = TRUE), which is not working in this instance, as there are rows that R reads as not matching the headers (like rs11710684).

Does anyone have a trick up their sleeve to read in only the rows that match the column headers for format (data in every column)? This would be handy as it would simultaneously allow me to discard the data that I do not need.

Here is an example of the table I retrieve from the NCBI:

marker genesymbol locusID chr chrpos fxn_class species dupl_loc current.rsid flag
rs11710684   3 166516497  Homo sapiens  rs11710684 1
rs1968866 PTRF 284119 17 40566240 intron-variant Homo sapiens  rs1968866 1
rs2309920   2 101329860  Homo sapiens  rs2309920 1
rs2384319 KIF3C 3797 2 26206255 upstream-variant-2KB Homo sapiens  rs2384319 1
rs3128894   6 29839360  Homo sapiens  rs3128894 1
rs2277329 SPRYD3 84926 12 53468419 intron-variant Homo sapiens  rs2277329 1
rs7785249 DGKB 1607 7 14327966 intron-variant Homo sapiens  rs7785249 1

Solution

As far as I know, read.table cannot skip incomplete rows on its own. But have a look at ?read.table. There you will find the fill argument, which pads incomplete rows with blank fields; these become NA in numeric columns and empty strings in character columns.

r <- read.table(file, header=TRUE, fill=TRUE)

Afterwards you could simply remove the incomplete rows:

r <- r[complete.cases(r), ]
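
To make the two steps above concrete, here is a minimal sketch run on the rows from the question. It assumes the file is whitespace-delimited exactly as pasted (the real NCBI download may well be tab-delimited), and the names sample_lines and complete are only illustrative. Note the comma after the row index, so that rows rather than columns are selected.

```r
# Sample rows copied from the question. The whitespace layout is an assumption;
# the real NCBI download may be tab-delimited.
sample_lines <- c(
  "marker genesymbol locusID chr chrpos fxn_class species dupl_loc current.rsid flag",
  "rs11710684   3 166516497  Homo sapiens  rs11710684 1",
  "rs1968866 PTRF 284119 17 40566240 intron-variant Homo sapiens  rs1968866 1",
  "rs2309920   2 101329860  Homo sapiens  rs2309920 1",
  "rs2384319 KIF3C 3797 2 26206255 upstream-variant-2KB Homo sapiens  rs2384319 1",
  "rs3128894   6 29839360  Homo sapiens  rs3128894 1",
  "rs2277329 SPRYD3 84926 12 53468419 intron-variant Homo sapiens  rs2277329 1",
  "rs7785249 DGKB 1607 7 14327966 intron-variant Homo sapiens  rs7785249 1"
)

# fill = TRUE pads the short rows with blank fields instead of raising an error.
r <- read.table(text = sample_lines, header = TRUE, fill = TRUE)

# Note the comma: index rows, keep all columns.
complete <- r[complete.cases(r), ]

# Columns that received shifted values from the short rows were read as
# character; redo the type conversion now that only complete rows remain.
complete[] <- lapply(complete, type.convert, as.is = TRUE)
```

On this sample the short rows are dropped because their padded flag column is NA. Since whitespace splitting treats Homo and sapiens as separate fields, they end up in the species and dupl_loc columns, so check the column alignment before relying on the names.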

OTHER TIPS

If your data is tab-delimited, you can use read.delim. It defaults to sep = "\t" and fill = TRUE, so empty fields and short rows are read without an error.
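
As a rough sketch of the tab-delimited case: the rows below are rebuilt by hand with tabs, which is an assumption about how the original file looks, and the names rows, tsv, d and with_gene are illustrative. Passing na.strings = "" turns empty fields into NA in character columns too. Because dupl_loc is empty in every row of the sample, complete.cases would discard everything here, so the sketch filters on genesymbol instead.

```r
# Hypothetical tab-delimited version of three rows from the question.
# Empty strings stand for the fields that NCBI left blank (an assumption).
rows <- list(
  c("marker", "genesymbol", "locusID", "chr", "chrpos", "fxn_class",
    "species", "dupl_loc", "current.rsid", "flag"),
  c("rs11710684", "", "", "3", "166516497", "",
    "Homo sapiens", "", "rs11710684", "1"),
  c("rs1968866", "PTRF", "284119", "17", "40566240", "intron-variant",
    "Homo sapiens", "", "rs1968866", "1"),
  c("rs2309920", "", "", "2", "101329860", "",
    "Homo sapiens", "", "rs2309920", "1")
)
tsv <- vapply(rows, paste, character(1), collapse = "\t")

# read.delim splits on tabs, so "Homo sapiens" stays a single field;
# na.strings = "" makes empty fields NA in every column type.
d <- read.delim(text = tsv, na.strings = "")

# Keep only the rows for which a gene symbol was found.
with_gene <- d[!is.na(d$genesymbol), ]
```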

If your data is space-delimited, you can use either read.delim(*, sep=" ") or read.table(*, header=TRUE, sep=" "). Either one will read your data using single spaces as delimiters, so two consecutive spaces mark an empty (missing) field. Looking at the extract you provided, you'll have to decide if Homo sapiens is meant to be one field or two -- the latter is fine, but the former will be problematic if your data really is delimited by spaces.

Using read.delim(sep=" ") on your data imported without a hitch though, so I'm guessing Homo sapiens is meant to be two fields.
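
If you are unsure which delimiter applies, one way to check is to count the fields per line with count.fields before importing. The sketch below is only an illustration: field_counts is a hypothetical helper, and the three sample lines are copied from the question with their whitespace as pasted.

```r
# Hypothetical helper: number of fields on each line for a given separator.
# sep = "" means "any run of whitespace", as in read.table's default.
field_counts <- function(lines, sep = "") {
  con <- textConnection(lines)
  on.exit(close(con))
  count.fields(con, sep = sep)
}

sample_lines <- c(
  "marker genesymbol locusID chr chrpos fxn_class species dupl_loc current.rsid flag",
  "rs11710684   3 166516497  Homo sapiens  rs11710684 1",
  "rs1968866 PTRF 284119 17 40566240 intron-variant Homo sapiens  rs1968866 1"
)

# Lines whose count differs from the header's are the ones read.table
# complains about when fill = FALSE.
n_fields <- field_counts(sample_lines)
short_rows <- which(n_fields < n_fields[1])
```

Running the same helper with sep = "\t" or sep = " " on the real file shows how many fields each delimiter choice produces per line.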

One way or the other, do read the documentation for your file. That's the only way to be sure what it contains.

Licensed under: CC-BY-SA with attribution