Question

We have extracted features from search engine query log data, and the feature file (in Vowpal Wabbit's input format) amounts to 90.5 GB. The reason for this huge size is necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle TBs of data in a matter of a few hours, and it uses a hash function that takes almost no RAM. But when we run logistic regression with VW on our data, it uses up all of the RAM within a few minutes and then stalls. This is the command we use:

vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
   --compressed --loss_function logistic --adaptive --invariant \
   --l2 0.8e-8 --invert_hash train.model

train_output is the input file we want to train VW on; data.model is the trained model written by -f, and train.model is the human-readable model produced by --invert_hash.

Any help is welcome!

Solution

I've found the --invert_hash option to be extremely costly; try running without it. To write a model keyed by the original feature names, VW has to keep all of those names in memory, which is very expensive in both RAM and time when the feature set is large. You can also try turning on --l1 regularization to reduce the number of non-zero coefficients in the model.
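
A minimal sketch of the adjusted command, assuming the rest of your flags stay as they were; the --l1 value shown (1e-8) is only an illustrative placeholder you would tune for your data:

# same command as before, but without --invert_hash and with an illustrative --l1 term
vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
   --compressed --loss_function logistic --adaptive --invariant \
   --l2 0.8e-8 --l1 1e-8

If you still want an inspectable model afterwards, --readable_model writes the weights keyed by hash index rather than by original feature name, which is much cheaper than --invert_hash.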

How many features do you have in your model? How many features per row are there?
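
If you're not sure, a quick way to estimate both numbers is to train on a small sample and read VW's end-of-run summary, which reports "number of examples" and "total feature number". A rough sketch, assuming train_output is plain text (if it is gzipped, pipe it through zcat instead) and using an arbitrary sample size and file name:

# take a small sample (the 100000-line count and the sample_output name are arbitrary)
head -n 100000 train_output > sample_output
# train on the sample; the "finished run" summary printed at the end includes
# "number of examples" and "total feature number"
vw -d sample_output --loss_function logistic

Dividing the reported total feature number by the number of examples gives the average number of features per row.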
