Question

I want to train an n-gram language model.

Let's say I have the following corpus:

The sliding cat is not able to dance
He is only able to slide
Because obviously he is the sliding cat

I am planning to use tf.data.Dataset to feed my model, which is fine

But I don't know whether it is better to use a sliding window to iterate through my corpus or to simply feed my corpus n words at a time.

Using a sliding window, my model (assuming a bigram) will see:

The sliding
sliding cat
cat is
is not
...
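Here is a minimal sketch of how I imagine building the sliding-window variant with tf.data (assuming the corpus is already split into a flat list of words; `n` is the n-gram order):

```python
import tensorflow as tf

# Toy corpus, pre-tokenized on whitespace (an assumption for this sketch)
corpus = ("The sliding cat is not able to dance "
          "He is only able to slide "
          "Because obviously he is the sliding cat").split()

n = 2  # bigram

tokens = tf.data.Dataset.from_tensor_slices(corpus)

# shift=1 -> consecutive windows overlap by n-1 words:
# "The sliding", "sliding cat", "cat is", ...
sliding = tokens.window(n, shift=1, drop_remainder=True)
sliding = sliding.flat_map(lambda w: w.batch(n))

for ngram in sliding.take(4):
    print(ngram.numpy())
```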

Going n words at a time:

The sliding
cat is
not able
...
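And the non-overlapping version, which I think is just the same pipeline with shift=n (or, equivalently, a plain batch):

```python
# Same tokens dataset as above; stepping n words at a time means
# consecutive windows don't overlap: "The sliding", "cat is", "not able", ...
chunked = tokens.window(n, shift=n, drop_remainder=True)
chunked = chunked.flat_map(lambda w: w.batch(n))
# equivalently: chunked = tokens.batch(n, drop_remainder=True)

for ngram in chunked.take(4):
    print(ngram.numpy())
```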

I'd appreciate any recommendation, thanks.

No correct solution
