EDIT: It seems that the file you provided uses a different encoding than your system's native one.
An (experimental) encoding detection performed by the stri_enc_detect function from the stringi package gives:
library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
##
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
##
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02
So the file is most likely in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file; it can simply mark the strings with an encoding other than the default (native) one. You can load the file with:
PlayerData <- read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec=".", sep=",",
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
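To illustrate what "marking" (as opposed to re-encoding) means, here is a minimal base R sketch. It does not touch your file: the four bytes below are my own stand-in for how "Özil" would be stored in a latin1 file, and the variable names are just for illustration.

```r
# Bytes of "Özil" as stored in an ISO-8859-1 (latin1) file: 0xD6 is "Ö".
bytes <- as.raw(c(0xd6, 0x7a, 0x69, 0x6c))
x <- rawToChar(bytes)            # no encoding declared yet
Encoding(x)                      # "unknown"

# Declare the encoding; the bytes themselves are left untouched.
Encoding(x) <- "latin1"
identical(charToRaw(x), bytes)   # TRUE

# With the mark in place, R counts and extracts characters correctly.
n_chars <- nchar(x)
first <- substr(x, 1, 1)

# An actual conversion happens only on request, e.g. to UTF-8,
# where "Ö" takes two bytes (c3 96).
first_utf8 <- charToRaw(enc2utf8(first))
```

The encoding='latin1' argument of read.table should have the same effect on the strings it reads: it declares how the bytes are to be interpreted rather than converting them.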
Now you may access individual characters correctly, e.g. with the stri_sub function:
Test<-PlayerData[c(33655:33656),]
Test
## T Away H.A Home Player Year
## 33655 33654 CrystalPalace 1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace 1 Arsenal Özil 2013
stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"
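As for comparing strings, here is the result of an equality test with accented characters "flattened" (in recent stringi releases the collator-aware test is stri_cmp_equiv; older releases accepted the collator options directly in stri_cmp_eq):
stri_cmp_equiv("Özil", "Ozil", opts_collator=stri_opts_collator(strength=1))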
## [1] TRUE
You may also get rid of accented characters by using iconv's transliterator (I am not sure whether it is available on Windows, though):
iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"
Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):
stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"