Can you suggest a more elegant way to 'tokenize' C# code for HTML formatting?

https://stackoverflow.com/questions/228605

04-07-2019
Question

(This question about refactoring F# code got me one down vote, but also some interesting and useful answers. And 62 F# questions out of the 32,000+ on SO seems pitiful, so I'm going to risk more disapproval!)

Yesterday I was trying to post a bit of code on a Blogger blog and turned to this site, which I had found useful in the past. However, the Blogger editor ate all the style declarations, so that turned out to be a dead end.

So (like any hacker), I thought "How hard can it be?" and rolled my own in <100 lines of F#.

Here is the "meat" of the code, which turns an input string into a list of "tokens". Note that these tokens are not to be confused with lexing/parsing-style tokens. I looked at those briefly and, although I hardly understood anything, I did understand that they would give me only tokens, whereas I want to keep my original string.

The question is: is there a more elegant way of doing this? I don't like the n redefinitions of s needed to remove each token string from the input string, but it is difficult to split the string into potential tokens in advance because of things like comments, strings and the #region directive (which contains a non-word character).

//Types of tokens we are going to detect
type Token = 
    | Whitespace of string
    | Comment of string
    | Strng of string
    | Keyword of string
    | Text of string
    | EOF

//turn a string into a list of recognised tokens
let tokenize (s:String) = 
    //this is the 'parser' - should we look at compiling the regexs in advance?
    let nexttoken (st:String) = 
        match st with
        | st when Regex.IsMatch(st, "^\s+") -> Whitespace(Regex.Match(st, "^\s+").Value)
        | st when Regex.IsMatch(st, "^//.*?\r?\n") -> Comment(Regex.Match(st, "^//.*?\r?\n").Value) //this is double slash-style comments
        | st when Regex.IsMatch(st, "^/\*(.|[\r?\n])*?\*/") -> Comment(Regex.Match(st, "^/\*(.|[\r?\n])*?\*/").Value) // /* */ style comments http://ostermiller.org/findcomment.html
        | st when Regex.IsMatch(st, @"^""([^""\\]|\\.|"""")*""") -> Strng(Regex.Match(st, @"^""([^""\\]|\\.|"""")*""").Value) // unescaped = "([^"\\]|\\.|"")*" http://wordaligned.org/articles/string-literals-and-regular-expressions
        | st when Regex.IsMatch(st, "^#(end)?region") -> Keyword(Regex.Match(st, "^#(end)?region").Value)
        | st when st <> "" -> 
                match Regex.Match(st, @"^[^""\s]*").Value with //all text until next whitespace or quote (this may be wrong)
                | x when iskeyword x -> Keyword(x)  //iskeyword uses Microsoft.CSharp.CSharpCodeProvider.IsValidIdentifier - a bit fragile...
                | x -> Text(x)
        | _ -> EOF

    //tail-recursive use of next token to transform string into token list
    let tokeneater s = 
        let rec loop s acc = 
            let t = nexttoken s
            match t with
            | EOF -> List.rev acc //return accumulator (have to reverse it because built backwards with tail recursion)
            | Whitespace(x) | Comment(x) 
            | Keyword(x) | Text(x) | Strng(x) -> 
                loop (s.Remove(0, x.Length)) (t::acc)  //tail recursive
        loop s []

    tokeneater s

(If anyone is really interested, I'll be happy to post the rest of the code.)

EDIT: Using the excellent suggestion of active patterns by kvb, the central part looks like this, which is much better!

let nexttoken (st:String) = 
    match st with
    | Matches "^\s+" s -> Whitespace(s)
    | Matches "^//.*?\r?(\n|$)" s -> Comment(s) //this is double slash-style comments
    | Matches "^/\*(.|[\r?\n])*?\*/" s -> Comment(s)  // /* */ style comments http://ostermiller.org/findcomment.html
    | Matches @"^@?""([^""\\]|\\.|"""")*""" s -> Strng(s) // unescaped regexp = ^@?"([^"\\]|\\.|"")*" http://wordaligned.org/articles/string-literals-and-regular-expressions
    | Matches "^#(end)?region" s -> Keyword(s) 
    | Matches @"^[^""\s]+" s ->   //all text until next whitespace or quote (this may be wrong)
            match s with
            | IsKeyword x -> Keyword(s)
            | _ -> Text(s)
    | _ -> EOF
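
The IsKeyword pattern used in the edited code is not shown in the question; the original iskeyword is described only as relying on Microsoft.CSharp.CSharpCodeProvider.IsValidIdentifier. As a rough illustration, a partial active pattern of that shape could be sketched as below. The keyword set, its contents, and the classify helper are assumptions made up for this example, not the asker's code.

```fsharp
// Deliberately incomplete sample of C# keywords (an assumption for illustration only).
let csharpKeywords =
    set [ "class"; "public"; "private"; "static"; "void"; "int"; "string"
          "using"; "namespace"; "return"; "if"; "else"; "new" ]

// Partial active pattern: succeeds with the word when it is in the set above.
let (|IsKeyword|_|) (word: string) =
    if csharpKeywords.Contains word then Some word else None

// Hypothetical helper showing the pattern in a match expression.
let classify (word: string) =
    match word with
    | IsKeyword k -> "keyword: " + k
    | other -> "text: " + other
```

A fixed set is easier to reason about than the identifier-validity check, at the cost of having to maintain the keyword list by hand.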

Solution

I would use an active pattern to encapsulate the Regex.IsMatch and Regex.Match pairs, like so:

// Partial active pattern: returns the matched text when the regex matches
let (|Matches|_|) (re: string) (s: string) =
  let m = Regex(re).Match(s)
  if m.Success then
    Some(m.Value)
  else
    None
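
To make the mechanics concrete, here is a minimal, self-contained sketch of a pattern of this kind being used in a match expression. The describe function and its output strings are hypothetical and exist only for the demonstration; the static Regex.Match call is used in place of constructing a Regex object.

```fsharp
open System.Text.RegularExpressions

// Partial active pattern: yields the matched text when `re` matches `s`.
let (|Matches|_|) (re: string) (s: string) =
    let m = Regex.Match(s, re)
    if m.Success then Some m.Value else None

// Hypothetical helper, only to show the pattern in use.
let describe (input: string) =
    match input with
    | Matches @"^\s+" ws -> sprintf "whitespace of length %d" ws.Length
    | Matches @"^//.*" c -> sprintf "comment: %s" c
    | _ -> "something else"
```

Each case runs the regex once and binds the matched text directly, which is what removes the duplicated IsMatch/Match calls.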

Then your nexttoken function can look like this:

let nexttoken (st:String) =         
  match st with        
  | Matches "^\s+" s -> Whitespace(s)
  | Matches "^//.*?\r?\n" s -> Comment(s)
  ...
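
The question also objects to re-slicing the input string after every token. One possible way around that, which is not part of the answer above, is to keep a position into the original string and anchor each regex at that position with \G. The sketch below assumes a hypothetical MatchAt pattern and a reduced token set (no keywords, no block comments or #region) to stay short.

```fsharp
open System.Text.RegularExpressions

type Token =
    | Whitespace of string
    | Comment of string
    | Strng of string
    | Text of string

// Hypothetical variant of Matches: tries `re` at position `pos` of `input`.
// \G anchors the match to the start position passed to Regex.Match,
// whereas ^ would still refer to the start of the whole string.
let (|MatchAt|_|) (re: string) (input: string, pos: int) =
    let m = Regex(@"\G(?:" + re + ")").Match(input, pos)
    if m.Success && m.Length > 0 then Some m.Value else None

let tokenize (input: string) =
    let rec loop pos acc =
        if pos >= input.Length then List.rev acc
        else
            let token, len =
                match (input, pos) with
                | MatchAt @"\s+" s -> Whitespace s, s.Length
                | MatchAt @"//[^\r\n]*" s -> Comment s, s.Length
                | MatchAt @"""(?:[^""\\]|\\.)*""" s -> Strng s, s.Length
                | MatchAt @"[^""\s]+" s -> Text s, s.Length
                // e.g. an unterminated quote: emit one character so the loop always advances
                | _ -> Text (string input.[pos]), 1
            loop (pos + len) (token :: acc)
    loop 0 []
```

For example, tokenize "a  // hi" yields [Text "a"; Whitespace "  "; Comment "// hi"], and concatenating the token strings gives back the original input.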

Licensed under: CC-BY-SA with attribution