Question

We are using Perl version 5.8.8. I believe it has some Unicode (UTF-8) support, but I am not convinced that it's reliable. What is the best option in Perl 5.8.8 to process and preserve data? What about HTML entities versus actually processing Unicode? We process very large documents. In order to get many features working, we currently filter or replace some Unicode, encode some of it inconsistently as HTML entities, and pass some through unchanged, where it escapes matching and results in many bugs that must be fixed one by one. Some are probably overlooked, and we live with diminished typography. I am the type that is a bit peeved by this.
My thoughts so far are that it's a hassle to type Unicode characters, and the extended punctuation characters are harder to differentiate visually than entities. Finally, I've read about dealing with Unicode and think it might be good for a new project using a contemporary Perl version but difficult to retrofit, so normalizing into HTML entities with a script seems like a better option. On the other hand, the border code or script would need to use Unicode anyway. I don't think it will affect functionality in JavaScript. I believe that these entities are quickly translated into Unicode characters and become regular parts of the text nodes of the DOM.

Is there a library or script that would consistently normalize the use of Unicode and HTML entities? If it normalizes to entities, it should use a short lexicon of named entities and default to numeric entities for the rest. That would be a separate step, and comparatively easy. Other steps would be to modify the input scripts to help normalize the Perl code, and to create some idioms for matching elements such as dashes and quotes that have more than one form.


Solution

Perl 5.8.8 had no problem storing strings of Unicode chars. (The same string storage format is still used today in 5.18.)

Perl 5.8.8 had no problems encoding strings of Unicode chars to UTF-8. (A newer version of Encode than the one included with 5.18 is found on CPAN, and I bet it installs perfectly fine on 5.8.8.)
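
As a rough illustration of the two points above, the sketch below decodes UTF-8 bytes into a character string at the input boundary, works on characters, and encodes back to UTF-8 on the way out. It only uses the Encode module; the sample string and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket (sample data):
# "it's" with U+2019 RIGHT SINGLE QUOTATION MARK, then U+2014 EM DASH,
# both stored as UTF-8.
my $bytes = "it\xE2\x80\x99s \xE2\x80\x94 ok";

# Decode once at the input boundary: bytes in, characters out.
my $text = decode('UTF-8', $bytes);

# Inside the program, length and regexes now operate on characters.
my $char_count = length $text;                 # counts characters, not bytes
my $has_dash   = $text =~ /\x{2014}/ ? 1 : 0;  # match by code point

# Encode once at the output boundary: characters in, bytes out.
my $out = encode('UTF-8', $text);
```

The idea is to keep every string inside the program decoded, so that one pattern such as `/\x{2014}/` matches a dash no matter how it was written in the input.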

HTML::Entities's encode_entities will encode the code points you want into entities, using named entities when they exist and numbered entities otherwise.
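
A minimal sketch of that call follows, assuming HTML::Entities is installed from CPAN and the input is already a decoded character string. The second argument restricts encoding to non-ASCII characters so existing markup is left alone. The `%short_lexicon` hash and `to_short_entities` helper are hypothetical, not part of the module; they only show one way to get the "short list of names, numeric for the rest" behaviour asked about in the question.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Sample character string: e-acute, em dash, right single quote, Cyrillic Zhe.
my $text = "caf\x{E9} \x{2014} it\x{2019}s \x{416}";

# The second argument is a character class of what to encode. Here it is
# everything outside ASCII, so <, > and & in existing markup are untouched.
# Names are used where HTML::Entities knows one, numeric entities otherwise.
my $all_named = encode_entities($text, '^\x00-\x7F');

# Hypothetical short lexicon: only these code points get a named entity.
my %short_lexicon = (
    0x2018 => 'lsquo', 0x2019 => 'rsquo',
    0x201C => 'ldquo', 0x201D => 'rdquo',
    0x2013 => 'ndash', 0x2014 => 'mdash',
);

# Hypothetical helper: named entity if in the lexicon, hex numeric otherwise.
sub to_short_entities {
    my ($str) = @_;
    $str =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        exists $short_lexicon{$cp}
            ? "&$short_lexicon{$cp};"
            : sprintf('&#x%X;', $cp);
    }ge;
    return $str;
}

my $short = to_short_entities($text);
```

With this input, `$all_named` should come out as `caf&eacute; &mdash; it&rsquo;s &#x416;`, while `$short` keeps only the lexicon names and becomes `caf&#xE9; &mdash; it&rsquo;s &#x416;`.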

Licensed under: CC-BY-SA with attribution