Concept to differentiate between html tags and angle brackets

https://softwareengineering.stackexchange.com/questions/308575

html5

11-12-2020

Question

I have an issue with a client's requirement that wants to import a string of html text within a csv document.

For example, a sanitized version of one import line:

"IDNumber,TextIdentifierNumber,<p><strong>Hello, this text is **>** this text. 32 < 64</strong></p>"

The issue here is not importing this text, but the angle brackets in it. These are a part of their everyday business practice and are needed to indicate a lesser or greater denomination.

Background: At this time our client is using a .NET web application and a batch load application (console), both written in Visual Basic .NET 4.0. Our web application uses a WYSIWYG editor for entering such text and we handle such angle brackets by their named entities and encoding.

Our issue is discerning an angle bracket among an HTML rich input string.

What we have done to date:

We use HTMLAgilityPack to strictly parse the HTML and weed out tags we don't allow. Unfortunately, HTMLAgilityPack strips out a bare angle bracket along with any text that follows it up to what looks like a closing tag. This mangles the HTML string badly and causes issues in our reports.

We have kicked around a few options, such as text replacement (sending in [LESSTHAN]) by our customer and then our code converts it to proper angle bracket direction. Unfortunately, this most definitely will not work due to their source data coming from another system.

Solution 2

I know it has been a while since I asked this question, but we have found a solution to this problem in March of this year. I am just now filling all of you in.

Our C# code handles this scenario by using a simple regex pattern divided into 4 stages: 2 stages handle opening tags and 2 stages handle closing tags. The regular expression finds all angle brackets that belong to a known list of HTML tag names our customer normally uses and substitutes them with non-encodable placeholder characters. What is left are the angle brackets in text. Those are encoded and the substitution is then reversed, so both the HTML and the angle brackets are properly encoded when saved to a database.
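The original C# code is not shown, so the following is only a rough Python sketch of the same idea, not the actual implementation. The tag whitelist `ALLOWED_TAGS`, the placeholder characters, and the function name `encode_stray_brackets` are all assumptions made for illustration, and the four regex stages are collapsed into a single pattern here.

```python
import re

# Hypothetical whitelist; the post does not list the tags the customer uses.
ALLOWED_TAGS = ("p", "strong", "em", "b", "i", "u", "br", "ul", "ol", "li")

# An opening or closing tag whose name is on the whitelist,
# optionally with attributes and/or a self-closing slash.
TAG_RE = re.compile(
    r"<(/?)(" + "|".join(ALLOWED_TAGS) + r")(\s[^<>]*)?(/?)>",
    re.IGNORECASE,
)

# Private-use code points stand in for the brackets of recognized tags.
# Assumption: these never occur in the incoming data.
OPEN, CLOSE = "\ue000", "\ue001"


def encode_stray_brackets(text: str) -> str:
    # Step 1: swap the brackets of whitelisted tags for placeholders.
    protected = TAG_RE.sub(lambda m: OPEN + m.group(0)[1:-1] + CLOSE, text)
    # Step 2: whatever brackets remain are plain text, so encode them.
    encoded = protected.replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: reverse the substitution to restore the real tags.
    return encoded.replace(OPEN, "<").replace(CLOSE, ">")


line = "<p><strong>Hello, this text is > this text. 32 < 64</strong></p>"
print(encode_stray_brackets(line))
```

Anything that looks like a whitelisted tag is still treated as one, so text such as `a <b and c> d` would be kept as a `b` tag. Tags outside the whitelist are encoded rather than removed.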

Questions or comments welcome.

OTHER TIPS

Well, your client does not want to import HTML text. He may think that he wants to import HTML, but it's not what he actually wants. HTML text cannot contain a bare & or <, and " ' > need escaping in some contexts, for good reason. So the text that you showed is just not HTML.

Since HTML tools will not be able to handle this, I'd have a manual pass first where you look at the < and > symbols, decide which ones are part of tags, and replace all others with &lt; or &gt;.

This cannot be solved in a general way. Take this example:

q<p>r

Without knowing the intention of the user, there is no way to tell if the brackets represent a html tag or inequality signs.

If you know some additional constraints (e.g. if there are always spaces around inequality signs or if the operands are always numeric) then it may be possible to solve. But as described the problem is unsolvable.
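As a minimal sketch of what such a constraint buys you, assume the data guarantees whitespace on both sides of every inequality sign. The pattern and the function name `encode_inequalities` below are illustrative only, and the rule is a heuristic: a tag written with a space before its closing bracket and followed by whitespace (e.g. `<p >` then a space) would be misread.

```python
import re

# Assumed constraint: an inequality sign always has whitespace on both sides.
# A tag's "<" is followed directly by a name or "/", so it never matches.
INEQUALITY_RE = re.compile(r"(?<=\s)[<>](?=\s)")


def encode_inequalities(text: str) -> str:
    # Encode only the brackets that satisfy the whitespace rule.
    return INEQUALITY_RE.sub(
        lambda m: "&lt;" if m.group(0) == "<" else "&gt;", text
    )


print(encode_inequalities("<p>32 < 64 and 7 > 5</p>"))
print(encode_inequalities("q<p>r"))  # ambiguous input is left untouched
```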

I have an issue with a client's requirement that wants to import a string of html text within a csv document.

Yet another case where you should not give the Users what they want.
You must give them what they need.

They need a way of entering formatted text so that "downstream" document processing comes out looking right. You stated it quite succinctly here:

The ability to enter in batch, their specific text with HTML format ...

If they really want to be able to enter "HTML" then they must be made to enter HTML. The "<" character is not valid HTML (except to start a new Element).
They should be using the HTML entity &lt; instead or, rather, whatever software they are using to enter that text should do this encoding for them. This would also protect them against cross-site scripting attacks (by preventing them from entering "script" elements into the text).
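For illustration, this is roughly what "doing the encoding for them" looks like at the point where plain text is entered. The sketch uses Python's standard `html.escape`; the wrapper name `encode_text` is made up for this example, and it assumes the input is plain text with no markup that should survive.

```python
import html


def encode_text(fragment: str) -> str:
    # Escapes &, < and >; quote=False leaves " and ' alone,
    # which is enough for text that is not inside an attribute value.
    return html.escape(fragment, quote=False)


print(encode_text("32 < 64"))
print(encode_text("<script>alert(1)</script>"))
```

The second call shows the cross-site scripting point: the script element comes out as inert text rather than markup.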

There are many WYSIWYG HTML editing components out there that could take care of this.

Alternatively, you could go "Old School" and get them to use something akin to BBCode instead. BBCode is basically a very limited subset of HTML elements, written with square brackets instead of angle brackets:

IDNumber,TextIdentifierNumber,[EM]Hello, this text is [B]>[/B] this text. 32 < 64[/EM]
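Because the markup no longer uses angle brackets, every < and > in such a line can be encoded unconditionally before the square-bracket tags are turned into HTML. A small sketch, assuming a hypothetical tag mapping `BBCODE_TO_HTML` (the real set of supported tags would be agreed with the customer):

```python
import html
import re

# Hypothetical mapping from BBCode-style tag names to HTML element names.
BBCODE_TO_HTML = {"b": "strong", "em": "em", "i": "em", "u": "u"}

BB_TAG_RE = re.compile(
    r"\[(/?)(" + "|".join(BBCODE_TO_HTML) + r")\]", re.IGNORECASE
)


def bbcode_to_html(text: str) -> str:
    # Every angle bracket is plain text here, so escape them all first.
    escaped = html.escape(text, quote=False)
    # Then turn the whitelisted square-bracket tags into HTML tags.
    return BB_TAG_RE.sub(
        lambda m: "<" + m.group(1) + BBCODE_TO_HTML[m.group(2).lower()] + ">",
        escaped,
    )


field = "[EM]Hello, this text is [B]>[/B] this text. 32 < 64[/EM]"
print(bbcode_to_html(field))
```

Unknown square-bracket tags are simply left in the text as written.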

Licensed under: CC-BY-SA with attribution
