Digits being neglected while performing N-gram in R

https://stackoverflow.com/questions/17744566

03-06-2022

Question

I want to get the counts of all character-level n-grams present in a text file. I wrote a small piece of R code for this, but it ignores all the digits in the text. Could anyone help me fix this?

Here is the code:

library(tau)
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = TRUE, stringsAsFactors = FALSE)
r <- textcnt(temp, method = "ngram", n = 4L, decreasing = TRUE)
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b

Here is the contents of the input file:

abcd123
appl2345e
coun56ry
live123
names3423bsdf
coun56ryas

This is the output:

$`1`
  counts size
_     18    1
a      3    1
e      3    1
n      3    1
s      3    1
c      2    1
l      2    1
o      2    1
p      2    1
r      2    1
u      2    1
y      2    1
b      1    1
d      1    1
f      1    1
i      1    1
m      1    1
v      1    1

$`2`
   counts size
_c      2    2
_r      2    2
co      2    2
e_      2    2
n_      2    2
ou      2    2
ry      2    2
s_      2    2
un      2    2
_a      1    2
_b      1    2
_e      1    2
_l      1    2
_n      1    2
am      1    2
ap      1    2
as      1    2
bs      1    2
df      1    2
es      1    2
f_      1    2
iv      1    2
l_      1    2
li      1    2
me      1    2
na      1    2
pl      1    2
pp      1    2
sd      1    2
ve      1    2
y_      1    2
ya      1    2

$`3`
    counts size
_co      2    3
_ry      2    3
cou      2    3
oun      2    3
un_      2    3
_ap      1    3
_bs      1    3
_e_      1    3
_li      1    3
_na      1    3
ame      1    3
app      1    3
as_      1    3
bsd      1    3
df_      1    3
es_      1    3
ive      1    3
liv      1    3
mes      1    3
nam      1    3
pl_      1    3
ppl      1    3
ry_      1    3
rya      1    3
sdf      1    3
ve_      1    3
yas      1    3

$`4`
     counts size
_cou      2    4
coun      2    4
oun_      2    4
_app      1    4
_bsd      1    4
_liv      1    4
_nam      1    4
_ry_      1    4
_rya      1    4
ames      1    4
appl      1    4
bsdf      1    4
ive_      1    4
live      1    4
mes_      1    4
name      1    4
ppl_      1    4
ryas      1    4
sdf_      1    4
yas_      1    4

Could anyone tell me what I am missing or where I went wrong? Thanks in advance.

Solution

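The default value of the split argument in textcnt is "[[:space:][:punct:][:digit:]]+". Because that pattern includes [:digit:], digits are treated as delimiters and never appear in any n-gram. Pass a split pattern without [:digit:], for example split = "[[:space:][:punct:]]+", and the digits will be counted.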
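
As a minimal sketch of that fix, the snippet below first uses base R's strsplit to show how each pattern tokenizes one sample line, then calls textcnt with the digit-free pattern. The variable names (x, default_split, no_digit_split) are made up for illustration, the input lines are typed in directly rather than read from the CSV, and the textcnt call only runs if the tau package is installed.

```r
# Sample lines from the question's input file, entered directly
x <- c("abcd123", "appl2345e", "coun56ry", "live123",
       "names3423bsdf", "coun56ryas")

# Default split pattern of textcnt, as given in the tau documentation:
# whitespace, punctuation and digits all act as delimiters
default_split <- "[[:space:][:punct:][:digit:]]+"

# Same pattern without [:digit:], so digits stay inside the tokens
no_digit_split <- "[[:space:][:punct:]]+"

# Base-R illustration of what each pattern does to a single line
tokens_default <- strsplit("appl2345e", default_split, perl = TRUE)[[1]]
tokens_fixed   <- strsplit("appl2345e", no_digit_split, perl = TRUE)[[1]]
print(tokens_default)  # digits removed, line broken into two pieces
print(tokens_fixed)    # line kept whole, digits included

# The actual fix: override split in textcnt (skipped if tau is missing)
if (requireNamespace("tau", quietly = TRUE)) {
  r <- tau::textcnt(x, method = "ngram", n = 4L,
                    split = no_digit_split, decreasing = TRUE)
  a <- data.frame(counts = unclass(r), size = nchar(names(r)))
  b <- split(a, a$size)
  print(b$`1`)  # the single-character counts should now list digits too
}
```

Separately, the output in the question contains no n-grams from the first line (abcd123). That is consistent with read.csv(..., header = TRUE) consuming the first line as a column name; if the file has no header row, header = FALSE would likely be needed as well.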

Licensed under: CC-BY-SA with attribution