Why is R prefixing my imported dataset names with an X [duplicate]

https://stackoverflow.com/questions/15754554

31-03-2022

Question

I can't tell why the header names get an "X." prefix when I import using quote="". Here is the code:

xhead = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="", nrows = 1000)

Which gives me:

names(xhead)
 [1] "X.userId."             "X.fullName."           "X.email."              "X.password."          
 [5] "X.activated."          "X.registrationDate."   "X.locale."             ...

Whereas:

yhead = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", nrows = 1000)
names(yhead)
 [1] "userId"             "fullName"           "email"              "password"          
 [5] "activated"          "registrationDate"   "locale"            ...

I use quote="" because records were being truncated, presumably because a stray quote is buried somewhere in my 15000 records.

Here's what my data file looks like:

"userId", "fullName","email","password","activated","registrationDate","locale","notifyOnUpdates","lastSyncTime","plan_id","plan_period_months","plan_price","plan_exp_date","plan_is_trial","plan_is_trial_used","q_hear","q_occupation","pp_subid","pp_payments","pp_since","pp_cancelled","apikey"
"2","Adam Smith","a@mail.com","*****","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"

Solution

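read.csv runs the column names through make.names before returning them (check.names = TRUE is the default). With quote="", the quote characters in the header line are not stripped, so they stay part of each name, and quotes are not valid characters in column names. You can see the difference by running: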

make.names(c('"userId"', "fullName"))
[1] "X.userId." "fullName"

From the make.names help:

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. ... The character "X" is prepended if necessary. All invalid characters are translated to ".".
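
To see the renaming in isolation, here is a minimal sketch that writes a tiny quoted file to a temporary path and reads it with quote = "". The file contents are made up for illustration. It also shows check.names = FALSE, a read.csv argument that skips make.names, followed by gsub to strip the literal quotes; this is an alternative to consider, not something the original answer proposes.

```r
# Hypothetical two-column file standing in for users.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c('"userId","fullName"', '"2","Adam Smith"'), tmp)

# quote = "" keeps the quote characters, so make.names() rewrites the header
mangled <- read.csv(tmp, quote = "")
names(mangled)
# [1] "X.userId."   "X.fullName."

# check.names = FALSE skips make.names(); the quotes stay in the names
raw <- read.csv(tmp, quote = "", check.names = FALSE)

# Strip the literal quote characters from the names
names(raw) <- gsub('"', "", names(raw), fixed = TRUE)
names(raw)
# [1] "userId"   "fullName"
```

Note that with quote = "" the data values keep their quote characters as well, so they may need the same gsub treatment.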

One suggestion is to call read.csv with the first line skipped and no header, to get the bulk of the data:

dd <- read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", 
         quote="", nrows = 1000, header = FALSE, skip = 1)

You can then read in the column names using scan (which is what read.csv calls under the hood). With a comma separator, scan strips the surrounding quotes by default:

names(dd) <- scan("~/Desktop/dbdump/users.txt", what = character(), nlines = 1, sep = ",")
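
The two steps above can be tried end to end on a throwaway file. This is a sketch under assumptions: the sample rows, including the stray quote in the second record, are invented, and the final quote-stripping step is an optional extra that is not part of the answer.

```r
# Hypothetical sample file; the stray quote in the second record is made up
tmp <- tempfile(fileext = ".txt")
writeLines(c('"userId","fullName","locale"',
             '"2","Adam Smith","en_US"',
             '"3","Bob "Bobby Jones","en_GB"',
             '"4",\\N,"fr_FR"'), tmp)

# Step 1: read the data only, with quoting disabled and no header
dd <- read.csv(tmp, na.strings = "\\N", quote = "", header = FALSE,
               skip = 1, stringsAsFactors = FALSE)

# Step 2: read the header line separately; scan() strips the quotes
names(dd) <- scan(tmp, what = character(), nlines = 1, sep = ",", quiet = TRUE)
names(dd)
# [1] "userId"   "fullName" "locale"

# Optional: remove the literal quotes left in the character columns.
# This also removes quotes inside a value, such as the stray one above.
dd[] <- lapply(dd, function(col) {
  if (is.character(col)) gsub('"', "", col, fixed = TRUE) else col
})
```

After the optional step every column is still character, so numeric columns such as userId would need an explicit conversion, for example with as.integer.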

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow