Question

The Perl module Text::Ngrams performs n-gram analysis. It provides the following method to retrieve the n-grams and their frequencies as an array:

get_ngrams(orderby=>'ngram|frequency|none',onlyfirst=>NUMBER,out=>filename|handle,normalize=>1)

However, it returns only the n-grams of the largest size (the window size). For example, the following code returns only bigrams, not both unigrams and bigrams:

use strict;
use warnings;
use Text::Ngrams;

my $ng3 = Text::Ngrams->new( windowsize => 2, type => 'byte' );
my $text = "test teXT TESTtexT";

$text =~ s/ +/ /g;  # collapse runs of spaces into a single space
$text = uc $text;   # uppercase everything

$ng3->process_text($text);
my @ngramsarray = $ng3->get_ngrams( orderby => 'frequency', onlyfirst => 10, normalize => 0 );
foreach (@ngramsarray)
{
    print "$_\n";
}

output:

T E
4
E X
2
_ T
2
E S
2
S T
2
X T
2
T _
2
T T
1

However, the method

to_string(orderby=>'ngram|frequency|none',onlyfirst=>NUMBER,out=>filename|handle,normalize=>1,spartan=>1)

shows n-grams of both sizes, but it only prints the result. I need the result in an array.

How can I get all the n-grams (unigrams and bigrams) in an array at the same time?


Solution

You can't get n-grams of all sizes in a single call, but you can collect them all with multiple calls to get_ngrams. It accepts an undocumented parameter, n, that specifies the size of the n-grams to return.

In your code, if you say

my @ngramsarray = $ng3->get_ngrams(
  n => 1,
  orderby => 'frequency',
  onlyfirst => 10,
  normalize => 0);

you get this list

('T', 8, 'E', 4, 'X', 2, '_', 2, 'S', 2)
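
To collect every size at once, one option is to loop over n from 1 up to the window size and store each result. The following is only a sketch under the assumption that the undocumented n parameter behaves as described above; the names %ngrams_by_size and $max_n are illustrative and not part of Text::Ngrams.

```perl
use strict;
use warnings;
use Text::Ngrams;

my $max_n = 2;    # same window size as in the question
my $ng = Text::Ngrams->new( windowsize => $max_n, type => 'byte' );
$ng->process_text( uc "test teXT TESTtexT" );

# Illustrative structure: size => [ [ngram, count], [ngram, count], ... ]
my %ngrams_by_size;

for my $n ( 1 .. $max_n ) {
    # 'n' is the undocumented parameter mentioned in the answer
    my @flat = $ng->get_ngrams(
        n         => $n,
        orderby   => 'frequency',
        onlyfirst => 10,
        normalize => 0,
    );

    # get_ngrams returns a flat list: ngram, count, ngram, count, ...
    my @pairs;
    while ( my ( $ngram, $count ) = splice( @flat, 0, 2 ) ) {
        push @pairs, [ $ngram, $count ];
    }
    $ngrams_by_size{$n} = \@pairs;
}

# Print each size in turn
for my $n ( sort { $a <=> $b } keys %ngrams_by_size ) {
    print "$n-grams:\n";
    printf "  %-5s %d\n", @$_ for @{ $ngrams_by_size{$n} };
}
```

After the loop, $ngrams_by_size{1} holds the unigrams and $ngrams_by_size{2} the bigrams, each as a list of [ngram, count] pairs in the order get_ngrams returned them.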
Licensed under: CC-BY-SA with attribution