Question

I have an HTML file with a few non-ASCII characters, say encoded in UTF-8 or UTF-16. To save the file in ASCII, I would like to replace them with their (SGML/HTML/XML) entity codes. So for example, every ë should become &euml; and every ◊ should become &loz;. How do I do that?

I use Emacs as an editor. I'm sure it has a function to do the replacement, but I cannot find it. What am I missing? Or how do I implement it myself?

Solution 2

There is a character class which includes exactly the ASCII character set. You can use a regexp that matches its complement to find occurrences of non-ASCII characters, and then replace them with their codes using elisp:

M-x replace-regexp RET
[^[:ascii:]] RET
\,(concat "&#" (number-to-string (string-to-char \&)) ";") RET

So when, for example, á is matched: \& is "á", string-to-char converts it to ?á (= the number 225), and number-to-string converts that to "225". Then, concat concatenates "&#", "225" and ";" to get "&#225;", which replaces the original match.
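The same conversion can be tried outside of replace-regexp. The following is only a sketch: the function name my-numeric-entity is made up for illustration and is not part of Emacs. It wraps the replacement expression so the individual steps can be evaluated with C-x C-e or in M-x ielm.

```elisp
;; Hypothetical helper (the name is an assumption, not an Emacs built-in):
;; the replacement expression from above, wrapped in a function.
(defun my-numeric-entity (match)
  "Return the decimal character reference for the first character of MATCH."
  (concat "&#"
          (number-to-string (string-to-char match)) ; character code in decimal
          ";"))

(my-numeric-entity "á")   ; => "&#225;"
(my-numeric-entity "◊")   ; => "&#9674;"
```

Since string-to-char only looks at the first character, the helper relies on the regexp matching exactly one character at a time, which [^[:ascii:]] does.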

Surround these commands with C-x ( and C-x ), and apply C-x C-k n and M-x insert-kbd-macro as usual to make a function out of them.


To see the elisp equivalent of calling this function interactively, run the command and then press C-x M-: (Repeat complex command).

A simpler version, which doesn't take into account the active region, could be:

(while (re-search-forward "[^[:ascii:]]" nil t)
  (replace-match (concat "&#"
                         (number-to-string (string-to-char (match-string 0)))
                         ";")))

(This uses the recommended way to do search + replace programmatically.)
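If the region should be respected after all, one possible approach is to narrow the buffer before running the same loop. This is a sketch under that assumption, and the command name my-nonascii-to-numeric-entities is hypothetical:

```elisp
;; Hypothetical command (name chosen for illustration only).
(defun my-nonascii-to-numeric-entities (start end)
  "Replace non-ASCII characters between START and END with &#N; references."
  (interactive "r")
  (save-excursion
    (save-restriction
      ;; Narrowing keeps the search inside the region even though each
      ;; replacement makes the text longer.
      (narrow-to-region start end)
      (goto-char (point-min))
      (while (re-search-forward "[^[:ascii:]]" nil t)
        ;; FIXEDCASE and LITERAL are both t, so the replacement text is
        ;; inserted exactly as given.
        (replace-match (format "&#%d;" (string-to-char (match-string 0)))
                       t t)))))
```

With a region selected, M-x my-nonascii-to-numeric-entities converts only that region. From Lisp, (my-nonascii-to-numeric-entities (point-min) (point-max)) converts the whole accessible buffer.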

OTHER TIPS

I searched high and low but it seems Emacs (or at least version 24.3.1) doesn't have such a function. Nor could I find one anywhere else.

Based on a similar (but different) function I did find, I implemented it myself:

(defun html-nonascii-to-entities (string)
  "Return STRING with non-ASCII characters as HTML (actually SGML) entity codes."
  (mapconcat
   (lambda (char)
     ;; Character codes 8 through 126 pass through unchanged; everything
     ;; else becomes a decimal character reference such as &#235;.
     (if (and (<= 8 char)
              (<= char 126))
         (char-to-string char)
       (format "&#%02d;" char)))
   string
   ""))
(defun html-nonascii-to-entities-region (region-begin region-end)
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (interactive "r")
  (save-excursion
    (let ((escaped (html-nonascii-to-entities (buffer-substring region-begin region-end))))
      (delete-region region-begin region-end)
      (goto-char region-begin)
      (insert escaped))))

I'm no Elisp guru at all, but this works!

I also found find-next-unsafe-char to be of value.

Edit: an interactive version!

(defun query-replace-nonascii-with-entities ()
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (interactive)
  (perform-replace "[^[:ascii:]]"
                   `((lambda (data count)
                       (format "&#%02d;" ; Hex: "&#x%x;"
                               (string-to-char (match-string 0)))))
                   t t nil))
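All of the code above emits numeric references, whereas the question asked for named entities. One way to get those is a lookup table with a numeric fallback. The sketch below is illustrative only: the variable and function names are hypothetical, and the table holds just three entries rather than the full HTML entity list.

```elisp
;; Hypothetical, deliberately tiny table; a real one would cover the
;; complete list of HTML entity names.
(defvar my-html-entity-names
  '((?ë . "euml") (?á . "aacute") (?◊ . "loz"))
  "Alist mapping characters to HTML entity names (illustrative subset).")

(defun my-char-to-entity (char)
  "Return a named entity for CHAR if the table has one, else a decimal reference."
  (let ((name (cdr (assq char my-html-entity-names))))
    (if name
        (format "&%s;" name)
      (format "&#%d;" char))))

(defun my-nonascii-to-entities-string (string)
  "Return STRING with each non-ASCII character replaced by an entity."
  (replace-regexp-in-string
   "[^[:ascii:]]"
   ;; The function receives the matched text, which is one character long.
   (lambda (match) (my-char-to-entity (string-to-char match)))
   string t t))

(my-nonascii-to-entities-string "noël ◊ ü")   ; => "no&euml;l &loz; &#252;"
```

Characters missing from the table, such as ü here, fall back to the decimal form, so the result is still pure ASCII.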

I think you are looking for iso-iso2sgml

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow