Splitting HTML Content Into Sentences, But Keeping Subtags Intact

Question

Soapbox

We could craft a regex to match your specific case, but this is HTML parsing, and your use case hints that any number of tags could appear in the content. You'd be better off using the DOM or a parser library such as HTML Agility Pack (free).
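To give a feel for the parser-based route, here is a minimal sketch in Python using the standard-library html.parser module rather than HTML Agility Pack (which is a .NET library). The class name ParagraphCollector is made up for illustration, and the sketch assumes paragraphs are not nested; character entities are decoded by the parser rather than preserved verbatim.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the inner HTML of each <p>, keeping subtags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self._buf is None:
            self._buf = []  # entering a paragraph
        elif self._buf is not None:
            # Keep the subtag exactly as it appeared in the source.
            self._buf.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as a single token.
        if self._buf is not None:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._buf is None:
            return
        if tag == "p":
            self.paragraphs.append("".join(self._buf))
            self._buf = None  # leaving the paragraph
        else:
            self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


html = ('<img> not this stuff either</img>'
        '<p class=SuperCoolStuff>This is a sample of a '
        '<a href="#">link</a> getting chewed up.</p><a> other stuff</a>')

collector = ParagraphCollector()
collector.feed(html)
collector.close()  # flush any buffered text
print(collector.paragraphs)
```

Because the parser tracks tags as events, stray `>` characters inside attribute values or unusual spacing do not trip it up the way they can with a regex.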

However

If you only want the inner text and don't need to retain any of the tag data, you could use this regex and replace every match with an empty string:

(<[^>]*>)
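As a quick illustration of the tag-stripping pattern, here is a sketch in Python (chosen only as an example language; the variable names are made up). It assumes no attribute value contains a literal `>`, which this simple pattern cannot handle.

```python
import re

html = ('<img> not this stuff either</img>'
        '<p class=SuperCoolStuff>This is a sample of a '
        '<a href="#">link</a> getting chewed up.</p><a> other stuff</a>')

# Replace every tag with an empty string, leaving only the text.
text_only = re.sub(r'(<[^>]*>)', '', html)
print(text_only)
```

Note that text from every element runs together, including text outside the paragraph, so this only suits cases where the tag structure really does not matter.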

Retain sentence as is including subtags

((?:<p(?:\s[^>]*)?>).*?</p>) - retain the paragraph tags and the entire sentence, but not any data outside the paragraph
(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>) - retain just the paragraph's inner text, including all subtags, and store the sentence in group 1
(<p(?:\s[^>]*)?>)(.*?)(</p>) - capture the open and close paragraph tags and the inner text, including any subtags, in groups 1 to 3
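To show what each of the three patterns captures, here is a small sketch in Python (an example language only; variable names are made up). The re.DOTALL flag is an addition not in the patterns above: it lets `.` match newlines so a paragraph spanning several lines is still captured.

```python
import re

html = ('<img> not this stuff either</img>'
        '<p class=SuperCoolStuff>This is a sample of a '
        '<a href="#">link</a> getting chewed up.</p><a> other stuff</a>')

# 1. Whole paragraph, tags included, in group 1.
whole = re.search(r'((?:<p(?:\s[^>]*)?>).*?</p>)', html, re.DOTALL)
# 2. Inner text only (subtags kept) in group 1.
inner = re.search(r'(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', html, re.DOTALL)
# 3. Open tag, inner text, and close tag in groups 1, 2 and 3.
parts = re.search(r'(<p(?:\s[^>]*)?>)(.*?)(</p>)', html, re.DOTALL)

print(whole.group(1))
print(inner.group(1))
print(parts.groups())
```

The `(?:\s[^>]*)?` part is what keeps the pattern from matching other tags that merely start with "p", such as `<pre>`.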

These examples are in PowerShell, but the regex and the replace call should look similar in other languages.

$string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'

Write-Host "replace p tags with a new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'

Write-Host
Write-Host "insert p tag's inner text into a new span tag and return the entire thing including the p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'

Yields

replace p tags with a new span tag
<img> not this stuff either</img><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span><a> other stuff</a>

insert p tag's inner text into a new span tag and return the entire thing including the p tags
<img> not this stuff either</img><p class=SuperCoolStuff><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>
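For comparison, the same two replacements might look like this in Python (picked only as an example of another language; variable names are made up). The main difference is that backreferences in the replacement string are written `\1` rather than `$1`.

```python
import re

html = ('<img> not this stuff either</img>'
        '<p class=SuperCoolStuff>This is a sample of a '
        '<a href="#">link</a> getting chewed up.</p><a> other stuff</a>')

# Replace the p tags with a new span tag.
as_span = re.sub(r'(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)',
                 r'<span class=sentence>\1</span>', html)

# Keep the p tags and wrap the inner text in a new span tag.
wrapped = re.sub(r'(<p(?:\s[^>]*)?>)(.*?)(</p>)',
                 r'\1<span class=sentence>\2</span>\3', html)

print(as_span)
print(wrapped)
```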