Question

The Perl module Text::Ngrams performs n-gram analysis. It provides the following function to retrieve an array of n-grams and their frequencies:

get_ngrams(orderby=>'ngram|frequency|none',onlyfirst=>NUMBER,out=>filename|handle,normalize=>1)

But it returns only the n-grams of the largest size. For example, the following code returns only bigrams, not both unigrams and bigrams:

use strict;
use warnings;
use Text::Ngrams;

my $ng3 = Text::Ngrams->new( windowsize => 2, type => 'byte' );
my $text = "test teXT TESTtexT";

$text =~ s/ +/ /g; # collapse runs of spaces into a single space
$text = uc $text;  # uppercase everything

$ng3->process_text($text);
# The returned list alternates n-gram and frequency
my @ngramsarray = $ng3->get_ngrams( orderby => 'frequency', onlyfirst => 10, normalize => 0 );
foreach (@ngramsarray) {
    print "$_\n";
}

output:

T E
4
E X
2
_ T
2
E S
2
S T
2
X T
2
T _
2
T T
1

However, the function

to_string(orderby=>'ngram|frequency|none',onlyfirst=>NUMBER,out=>filename|handle,normalize=>1,spartan=>1)

shows both sizes of n-grams. But it only prints the result, and I need the result in an array.

How can I get all n-grams (unigrams and bigrams) at the same time in this array?

Solution

You can't get all the different sizes of n-grams in a single call, but you can get them all with multiple calls to get_ngrams. There is an undocumented parameter n to get_ngrams that specifies the size of the n-grams you want listed.

In your code, if you say

my @ngramsarray = $ng3->get_ngrams(
  n => 1,
  orderby => 'frequency',
  onlyfirst => 10,
  normalize => 0);

you get this list

('T', 8, 'E', 4, 'X', 2, '_', 2, 'S', 2)
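
To collect every size in one structure, you could loop over the sizes and call get_ngrams once per size. The following is only a sketch: it assumes Text::Ngrams is installed from CPAN and that the undocumented n parameter behaves as described above. The hash name %ngrams_by_size is made up for this illustration.

```perl
use strict;
use warnings;
use Text::Ngrams;    # CPAN module, not part of core Perl

my $ng = Text::Ngrams->new( windowsize => 2, type => 'byte' );
$ng->process_text('TEST TEXT TESTTEXT');

# Hypothetical container: n-gram size => flat list (ngram, count, ngram, count, ...)
my %ngrams_by_size;
for my $n ( 1 .. 2 ) {
    # 'n' is the undocumented parameter mentioned in this answer
    my @flat = $ng->get_ngrams(
        n         => $n,
        orderby   => 'frequency',
        onlyfirst => 10,
        normalize => 0,
    );
    $ngrams_by_size{$n} = \@flat;
}

# Walk each flat list two items at a time
for my $n ( sort { $a <=> $b } keys %ngrams_by_size ) {
    my @flat = @{ $ngrams_by_size{$n} };
    while ( my ( $ngram, $count ) = splice( @flat, 0, 2 ) ) {
        print "$n-gram '$ngram': $count\n";
    }
}
```

Here $ngrams_by_size{1} holds the unigrams and $ngrams_by_size{2} the bigrams, each in the same alternating n-gram/frequency layout that get_ngrams returns.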
Licensed under: CC-BY-SA with attribution