I wrote a Node.js module just for this purpose called 'unfluff':
https://github.com/ageitgey/node-unfluff
Hopefully that will solve your problem.
Unfluff is based on the popular "python-goose" and "goose" (Scala) page extraction libraries, in case you are familiar with those.
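
For illustration, here is a minimal usage sketch. It assumes the module has been installed with `npm install unfluff` and that it exports a single function taking an HTML string plus an optional language code, returning an object with `title` and `text` fields, as the project README describes. The helper name `extractArticle` is hypothetical and only wraps that call; check the README for the full list of returned fields.

```javascript
// Hypothetical helper: wraps the unfluff extractor and keeps only two fields.
// The extractor is a parameter so that it can be swapped out (e.g. in tests);
// by default it is loaded from the 'unfluff' package (assumed to be installed).
function extractArticle(html, extractor = require('unfluff')) {
  // Assumption: unfluff accepts (html, language) and returns a plain object.
  const data = extractor(html, 'en');
  return {
    title: data.title, // page title
    text: data.text,   // main article body with markup removed
  };
}

// Example call (requires unfluff to be installed):
//   const article = extractArticle(myHtmlString);
//   console.log(article.title);
//   console.log(article.text);
```

You would fetch the page HTML yourself (with whatever HTTP client you already use) and pass the resulting string in as `myHtmlString`.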