I'm pulling articles from specific URLs and splitting them into sentences, but the extracted text body randomly loses the whitespace between some sentences, resulting in:

Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.

Some of my text contains stock symbols such as (AZ.GAN), so I can't simply insert a space after every period that has no adjacent whitespace.

Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.

Applying that naive fix to the example above would break the stock symbol.

Does anyone know the cause of this? I have tried several HTML and DOM parsers. I currently use Simple_DOM to grab the plaintext, but I get the same result if I do it manually or with any other parsing engine.


Solution

Unfortunately I don't have a fix for your specific question, but is it possible that the missing space between sentences is actually a line break (e.g. \n) that your text viewer, whatever it is, isn't showing you?

Perhaps try something like this, just to make sure:

var articleContent = ... // get content
articleContent = articleContent.replace(/\n/g, ' NEW LINE ');
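A slightly broader version of the same check is sketched below. It marks carriage returns, tabs, and non-breaking spaces as well as line feeds, since any of these could be sitting between the sentences without showing up in a viewer. The helper name `revealWhitespace` and the marker strings are made up for illustration; nothing here comes from a library.

```javascript
// Hypothetical helper: replace invisible characters with visible markers
// so they can be spotted in the extracted text.
function revealWhitespace(text) {
  const markers = {
    '\n': '[LF]',
    '\r': '[CR]',
    '\t': '[TAB]',
    '\u00a0': '[NBSP]',
  };
  return text.replace(/[\n\r\t\u00a0]/g, (ch) => markers[ch]);
}

// Sample input (assumed): a line feed hides between the two sentences.
const sample = 'Jane went to the store.\nShe bought a dog.';
console.log(revealWhitespace(sample));
```

If markers appear between the sentences, the separator is present but invisible, and it can be replaced with a space instead of guessing where sentences end.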

Other tips

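Try inserting a space after every period and then removing it again inside parenthesized symbols. Note that this doubles the space after periods that already had one: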

$str = trim(preg_replace('~([(].+?[.])\s(.+?[)])~', '$1$2', str_replace('.', '. ', $str)));
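A more conservative heuristic is sketched below in JavaScript, as an alternative rather than a replacement for the one-liner above. It only inserts a space when sentence punctuation is followed directly by an uppercase letter (optionally behind a quote mark), and it skips anything inside parentheses. The function name `addMissingSpaces` is hypothetical, and the sketch assumes stock symbols always appear in parentheses, as they do in the question's examples.

```javascript
// Hypothetical helper: restore missing spaces between sentences while
// leaving parenthesized text such as (TY.JPN) untouched.
function addMissingSpaces(text) {
  // First alternative: a whole parenthesized group, returned unchanged.
  // Second alternative: . ! or ? followed directly by an uppercase letter,
  // or by a quote mark and then an uppercase letter.
  const pattern = /\([^)]*\)|([.!?])(?=[A-Z]|["'][A-Z])/g;
  return text.replace(pattern, (match, punct) =>
    punct === undefined ? match : punct + ' '
  );
}

const first =
  'Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.';
const second =
  'Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.';

console.log(addMissingSpaces(first));
console.log(addMissingSpaces(second));
```

This remains a heuristic: a symbol written without parentheses, or an abbreviation such as U.S.A, would still be split, and a closing quote followed directly by a capital letter would be treated as an opening quote.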
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow