Corpus file format for Moses

https://stackoverflow.com/questions/14367364

linux
moses

16-01-2022

Question

I'm using Moses to build a language model.

I followed the instructions from this link: Baseline System: Moses

I have a Google 1-gram file that looks like this:

</S>    95119665584
<S>     95119665584
,       30578667846
.       22077031422
<UNK>   21594821357
the     19401194714
-       16337125274
of      12765289150
and     12522922536

That means that the word "of" appeared 12,765,289,150 times.

Now I want to build a language model from this file (the "Build Language Model" step), but I don't know if this file format will work with Moses.

The tutorial works with "europarl-v6.en", but I can't find that file on the web to check its format.

LAST EDIT:

I need to represent each letter as a word, so "hello" becomes "h e l l o".
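As a minimal sketch of that transformation (the function name `to_char_tokens` is hypothetical and not part of Moses or SRILM), joining the characters of a word with single spaces is enough:

```python
def to_char_tokens(word: str) -> str:
    """Return the word with its characters separated by single spaces."""
    # Iterating over a str yields its characters one by one.
    return " ".join(word)


print(to_char_tokens("hello"))
```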

After splitting each word this way, which format should I use?

Should it be:

o f
o f
o f
a n d
a n d

Or like the original format:

o f       12765289150
a n d     12522922536

Or maybe some other format?
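For illustration only: the first layout would need each line repeated as many times as its count (12,765,289,150 lines for "of" alone), so the sketch below produces the count-annotated layout instead. It assumes one token and one count per input line, separated by whitespace. Keeping `<S>`, `</S>` and `<UNK>` whole, writing a tab before the count, and the names `SPECIAL_TOKENS` and `split_count_line` are all assumptions made here. Whether a given toolkit accepts this layout as input is not settled by the sketch.

```python
# Assumption: sentence and unknown-word markers should stay single tokens.
SPECIAL_TOKENS = {"<S>", "</S>", "<UNK>"}


def split_count_line(line: str) -> str:
    """Turn 'of   12765289150' into 'o f<TAB>12765289150'."""
    token, count = line.split()  # exactly one token and one count expected
    chars = token if token in SPECIAL_TOKENS else " ".join(token)
    return f"{chars}\t{count}"


sample = [
    "of      12765289150",
    "and     12522922536",
    "<UNK>   21594821357",
]
for row in sample:
    print(split_count_line(row))
```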

Does it still count as a Google n-gram file?

I followed the link "How can I use the Google Web N-gram corpus to build an LM", as @MukundKRoy suggested, but I don't know how to apply it in my case (1-gram, 2-gram, ...: after splitting, the lines in my new file no longer have a constant number of tokens).

I'd be glad if someone could tell me, as simply as possible, what format this file should have so I can use it with SRILM. Thanks.

Solution

SRILM takes care of the 1-, 2-, 3-grams and so on by itself; you don't need to handle that.
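To make concrete what counting n-grams of several orders means for character-level tokens, here is a rough sketch. It is not SRILM's code; the function `count_ngrams` and its `max_order` parameter are hypothetical names used only for this illustration.

```python
from collections import Counter


def count_ngrams(tokens, max_order=3):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams("h e l l o".split(), max_order=2)
for ngram, freq in sorted(counts.items()):
    print(" ".join(ngram), freq)
```

Once each letter is a token, a word like "hello" contributes unigrams such as "l" and bigrams such as "l l", which is why the split text no longer corresponds to a fixed n-gram order of the original word-level file.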

I've done something similar; take a look here:

Moses Installation and Training Run-Through

In PART II - Build a Model, section Build Language Model, it works perfectly with Google n-grams.

Let me know if that works for you.

OTHER TIPS

You can use the CMU-Cambridge Statistical Language Modeling Toolkit to build your language model. See wfreq2vocab and text2wngram. I think an LM in this format will work fine with Moses.

Licensed under: CC-BY-SA with attribution
