Highlight a bunch of words?

https://stackoverflow.com/questions/11990076

26-06-2021

Question

I'm trying to highlight a set of words, so I've written a Pygments extension. It basically works, but not to my satisfaction.

Here's a simple idea that should work: highlight the listed words with their own token types, and emit all other text that doesn't match these words as plain Text. But this hangs:

from pygments.lexer import RegexLexer
from pygments.token import *

class HotKeyPoetry(RegexLexer):
    name = 'HotKeyPoetry'
    aliases = ['HotKeyPoetry']
    filenames = ['*.hkp']

    tokens = {
        'root': [

            (r'\bAlt\b', Generic.Traceback),
            (r'\bShft\b', Name.Variable),
            (r'\bSpc\b', Operator),
            (r'\bCtrl\b', Keyword.Type),
            (r'\bRet\b', Name.Label),
            (r'\bBkSpc\b', Generic.Inserted),
            (r'\bTab\b', Keyword.Type),
            (r'\bCpsLk\b', String.Char),
            (r'\bNmLk\b', Generic.Output),
            (r'\bScrlLk\b', String.Double),
            (r'\bPgUp\b', Name.Attribute),
            (r'\bPgDwn\b', Name.Builtin),
            (r'\bHome\b', Number.Oct),
            (r'\bEnd\b', Name.Constant),
            (r'\bDel\b', Name.Decorator),
            (r'\bIns\b', Number.Integer.Long),
            (r'\bWin\b', Name.Builtin.Pseudo),
            (r'\bF1?[1-9]\b', Name.Function),

            (r'(?!\b(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F5)\b)', Text),

        ]
    }

Maybe I should use a different lexer for the job?

Edit 1

r"(.+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"

is the excluding regexp I've been looking for.

Now I'm trying to make # a comment character, so that everything after it on a line is a comment. I've tried:

r"(.+?)(?:$|#.*$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"

and

r"([^#]+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F[12]?[1-9])\b))"

followed by

 (r'#.*$', Comment),

I've also tried adding a second state:

'comment': [ 
      (r'#.*$', Comment),
],

-- but nothing works.

Edit 2

The complete working Pygments extension package is here. You can download it and run

python setup.py build
python setup.py install --user

to register it with Pygments. You can then test it with:

pygmentize -f html -O full -o test.html test.hkp

or specify a language:

pygmentize -f html -O full -l HotKeyPoetry -o test.html test.hkp

Here's a sample test.hkp:

Ctrl-Alt-{Home/End} ⇒ {beginning/end}-of-visual-line
Ctrl-Alt-{b/↓/↑} ⇒ {set/goto next/goto previous} bookmark # I have it in okular and emacs
Alt-{o/O} ⇒ switch-to-buffer{/-other-window}
Ctrl-{o/O} ⇒ find-file{/-other-window}
Ctrl-x o ⇒ ergo-undo-close-buffer # it uses ergoemacs' recently-closed-buffers
Ctrl-Alt-O ⇒ find-alternate-file

(comments are not really useful for Hot Keys -- but I need them for PyMOL).

Soluzione

Yes, the final regex isn't actually matching any characters. I tried this code:

import re

regexes = {
  # Plain text: the shortest run that ends at the end of the input or just before a keyword.
  "text": re.compile(r"(.+?)(?:$|\b(?=(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F1?[1-9])\b))"),
  # A keyword at the current position.
  "kwd": re.compile(r"(Alt|Shft|Spc|Ctrl|Ret|BkSpc|Tab|CpsLk|NmLk|ScrlLk|PgUp|PgDwn|Home|End|Del|Ins|Win|F1?[1-9])\b")
}

def tokenise(state):
  # Alternate between the two patterns, consuming the source from the front.
  while state["src"]:
    state["tok"] = "text" if state["tok"] == "kwd" else "kwd"

    m = regexes[state["tok"]].match(state["src"])
    if m:
      match = m.group(0)
      state["src"] = state["src"][m.end():]
      yield "TOKEN({0}, {1!r})".format(state["tok"], match)


samples = [
  "A thing that, Tab, is AltCps or 'Win'. F8 is good, as is: F13.",
  "Alt thing that, Tab, is AltCps or 'Win'. F8 is good, as is: F13.",
  "Alt thing that, Tab, is AltCps or 'Win'. F8 is good, as is: F11",
]

for src in samples:
  state = {"src": src, "tok": "text"}
  print(repr(state["src"]))
  print("\n".join(tokenise(state)))
  print()

And it works for the cases I tested; the text regex in your code looks good :)

Other suggestions

1) You misunderstand how (?!...) works: it doesn't match text. Your last RE (in the original code block) matches at any position that is not followed by one of the words you list. But it matches zero characters of text, so there's nothing to color and the lexer never moves forward, which is why it hangs.

What you really meant is this: \b(?!(?:Alt|Shft|etc)\b)\w+\b. (Match any word \w+ between \bs, but not if the first \b is followed by any of the keywords)
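To see the difference outside Pygments, here is a small sketch using only the standard re module. The three-word keyword list and the helper name matched_text are made up for illustration; substitute your full list.

```python
import re

# Shortened keyword list, for illustration only.
KEYWORDS = r'(?:Alt|Ctrl|Tab)'

# Lookahead alone: asserts "no keyword starts here" but consumes nothing.
LOOKAHEAD_ONLY = re.compile(r'(?!\b' + KEYWORDS + r'\b)')

# Lookahead followed by \w+: consumes one whole word that is not a keyword.
NON_KEYWORD_WORD = re.compile(r'\b(?!' + KEYWORDS + r'\b)\w+\b')


def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = pattern.match(text)
    return m.group(0) if m else None


print(repr(matched_text(LOOKAHEAD_ONLY, 'hello Alt')))    # ''
print(repr(matched_text(NON_KEYWORD_WORD, 'hello Alt')))  # 'hello'
print(repr(matched_text(NON_KEYWORD_WORD, 'Alt hello')))  # None
print(repr(matched_text(NON_KEYWORD_WORD, 'AltCps')))     # 'AltCps'
```

The first pattern succeeds but returns an empty string, so a lexer using it has nothing to emit and stays at the same position.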

2) About matching comments: Based on the pygments documentation, your expression (r'#.*$', Comment) ought to work. Or, in the style used in the examples:

(r'#.*\n', Comment),
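The two comment patterns differ only in whether the line break is part of the comment token. As far as I know, RegexLexer compiles its rules with re.MULTILINE unless the lexer sets its own flags, which is what makes the $ form work per line. A quick check with plain re (the sample text and the helper first_comment are illustrative):

```python
import re

SAMPLE = 'Ctrl-x o => undo # ergoemacs\nAlt-o => switch\n'


def first_comment(pattern, flags=0):
    """Return the first comment found in SAMPLE, or None if the pattern fails."""
    m = re.search(pattern, SAMPLE, flags)
    return m.group(0) if m else None


# With MULTILINE, '$' matches just before each newline.
print(repr(first_comment(r'#.*$', re.MULTILINE)))  # '# ergoemacs'
# The '\n' form needs no flag and swallows the line break.
print(repr(first_comment(r'#.*\n')))               # '# ergoemacs\n'
# Without MULTILINE, '$' only matches at the very end of the input.
print(repr(first_comment(r'#.*$')))                # None
```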

3) You only need one state, so add the comment rule to the root state. Multiple states are for when you have different syntax in different places, e.g. if you have mixed HTML and PHP, or if you want to highlight the SQL inside a Python string.

4) Your rules need to match everything in your input. Rules are tried in order until one works, so instead of trying to write a rule that does not match keywords, you can put this wildcard as your last rule:

(r'(?s).', Text),

It will advance one character at a time until you get to something your other rules can match. To repeat: Remove your long rule that matches non-keywords, and use the above instead.
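Putting points 2 to 4 together, a cut-down lexer might look like the sketch below. It is not the asker's full HotKeyPoetry lexer: the class name HotKeySketchLexer, the three-keyword subset, and the lex helper are assumptions made to keep the example short. It needs Pygments installed.

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Keyword, Name, Text


class HotKeySketchLexer(RegexLexer):
    """Hypothetical cut-down variant of HotKeyPoetry with three keywords."""
    name = 'HotKeySketch'
    aliases = ['hotkeysketch']

    tokens = {
        'root': [
            (r'#.*$', Comment),           # comment runs to the end of the line
            (r'\bCtrl\b', Keyword.Type),
            (r'\bAlt\b', Name.Variable),
            (r'\bHome\b', Name.Constant),
            (r'(?s).', Text),             # fallback: consume one character
        ]
    }


def lex(source):
    """Return (token type, text) pairs, merging adjacent tokens of one type."""
    merged = []
    for ttype, value in HotKeySketchLexer().get_tokens(source):
        if merged and merged[-1][0] == ttype:
            merged[-1] = (ttype, merged[-1][1] + value)
        else:
            merged.append((ttype, value))
    return merged


if __name__ == '__main__':
    for ttype, value in lex('Ctrl-Alt-b => bookmark # okular\n'):
        print(ttype, repr(value))
```

The fallback rule emits one Text token per character; the merging in lex is only there to make the printed output readable, since formatters cope with the single-character tokens on their own.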

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow