Regex to pick commas outside of quotes

https://stackoverflow.com/questions/632475

regex
quotes

08-07-2019
Question

I'm not quite sure if this is possible, so I turn to you.

I would like to find a regex that will pick out all commas that fall outside quote sets.

For example:

'foo' => 'bar',
'foofoo' => 'bar,bar'

This would pick out the single comma on line 1, after 'bar'.

I don't really care about single vs. double quotes.

Has anyone got any thoughts? I feel like this should be possible with lookaheads, but my regex-fu is too weak.

Solution

This will match any string up to and including the first non-quoted ",". Is that what you are looking for?

/^([^"]|"[^"]*")*?(,)/

If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:

/(,)(?=(?:[^"]|"[^"]*")*$)/

which will match all of them. Thus

'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')

replaces all the commas not inside quotes with semicolons, and produces:

'test; a "comma,"; bob; ",sam,";here'

If you need it to work across line breaks, just add the m (multiline) flag.
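The same lookahead can be tried almost unchanged outside Ruby. Below is a minimal sketch in JavaScript, assuming the input uses only double quotes and that every quote is closed. The constant name UNQUOTED_COMMA is made up for illustration.

```javascript
// Matches a comma only if everything after it, up to the end of the
// string, is made of non-quote characters and complete "..." blocks.
const UNQUOTED_COMMA = /,(?=(?:[^"]|"[^"]*")*$)/g;

const input = 'test, a "comma,", bob, ",sam,",here';
const output = input.replace(UNQUOTED_COMMA, ';');

console.log(output); // test; a "comma,"; bob; ",sam,";here
```

The g flag plays the role of Ruby's gsub here: without it, only the first unquoted comma would be replaced.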

Other suggestions

The regex below would match all the commas that are present outside the double quotes,

,(?=(?:[^"]*"[^"]*")*[^"]*$)
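The lookahead succeeds only when the comma is followed by an even number of double quotes, which is what "outside the quotes" means for well-formed input. A small sketch of using it to split a line in JavaScript, assuming balanced double quotes (the variable names are illustrative):

```javascript
// A comma followed by zero or more complete pairs of double quotes,
// then only non-quote characters up to the end of the string.
const commaOutsideQuotes = /,(?=(?:[^"]*"[^"]*")*[^"]*$)/;

const fields = 'buz,"bar,foo",qux'.split(commaOutsideQuotes);

console.log(fields); // [ 'buz', '"bar,foo"', 'qux' ]
```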

OR (PCRE only)

"[^"]*"(*SKIP)(*F)|,

"[^"]*" matches every double-quoted block. That is, in the input buz,"bar,foo" this part of the regex would match only "bar,foo". The following (*SKIP)(*F) then makes that match fail, and the engine does not retry inside the text it has just consumed. It then moves on to the pattern after the | symbol and tries to match it against the remaining characters. That is, in our input the , after the | will match only the comma that comes right after buz. Note that this won't match the comma inside the double quotes, because the double-quoted part has already been skipped.
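Engines without backtracking control verbs, JavaScript among them, can approximate the same idea by letting the quoted block match first and then ignoring that match in code. This is a sketch of that substitute technique, not of (*SKIP)(*F) itself; the function name replaceUnquotedCommas is hypothetical.

```javascript
// The first alternative consumes a whole "..." block, so commas inside it
// are never seen by the second alternative. Only the second alternative
// fills capture group 1.
function replaceUnquotedCommas(input, replacement) {
  return input.replace(/"[^"]*"|(,)/g, (match, comma) =>
    // comma is undefined when the quoted-block alternative matched:
    // put the block back untouched.
    comma === undefined ? match : replacement
  );
}

console.log(replaceUnquotedCommas('buz,"bar,foo"', ';')); // buz;"bar,foo"
```

Because each quoted block is consumed exactly once, this variant does not need to look ahead to the end of the string for every comma.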

The regex below would match all the commas that are present inside the double quotes,

,(?!(?:[^"]*"[^"]*")*[^"]*$)
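Turning the lookahead into a negative one inverts the selection: a comma now matches only when an odd number of double quotes follows it. A quick check in JavaScript, again assuming balanced double quotes (the variable names are illustrative):

```javascript
// Negative lookahead: the rest of the string must NOT be made of
// complete quote pairs, so the comma sits inside an open "..." block.
const commaInsideQuotes = /,(?!(?:[^"]*"[^"]*")*[^"]*$)/g;

const marked = 'buz,"bar,foo"'.replace(commaInsideQuotes, ';');

console.log(marked); // buz,"bar;foo"
```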

While it's possible to hack this with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes and escaped quotes.

This function splits a string at commas, but not at those that are inside a single- or double-quoted string. It can easily be extended with additional characters to use as quotes (though character pairs like «» would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
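To make the escaped-quote problem concrete, here is a short JavaScript sketch that applies the lookahead regex from the accepted answer to a string containing backslash-escaped quotes. The sample string is invented for illustration.

```javascript
// The regex only counts " characters; it has no notion of \" being an
// escaped quote that should not open or close a string.
const unquotedCommaRegex = /,(?=(?:[^"]|"[^"]*")*$)/g;

// Text seen by the regex:  a,"he said \"hi, there\"",b
const sample = 'a,"he said \\"hi, there\\"",b';
const result = sample.replace(unquotedCommaRegex, ';');

// The comma after "hi" is inside the quoted string, yet it is replaced,
// because an even number of " characters follows it.
console.log(result); // a;"he said \"hi; there\"";b
```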

function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0

  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}

splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
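The extension to asymmetric pairs such as «» mentioned above could look roughly like the following. This is a separate sketch rather than a patch to splitNotStrings: the names QUOTE_PAIRS and splitOutsidePairedQuotes are made up, a backslash is assumed to escape whatever single character follows it, and unlike the function above it keeps a trailing empty field.

```javascript
// Hypothetical table: each opening quote maps to its closing quote.
const QUOTE_PAIRS = { '"': '"', "'": "'", '«': '»' };

function splitOutsidePairedQuotes(str, pairs = QUOTE_PAIRS) {
  const parts = [];
  let closer = null;   // closing character we are waiting for, or null
  let escaped = false; // true right after an unescaped backslash
  let start = 0;       // index where the current field begins

  for (let i = 0; i < str.length; i++) {
    const c = str[i];
    if (escaped) { escaped = false; continue; } // escaped character: ignore
    if (c === '\\') { escaped = true; continue; }
    if (closer !== null) {
      if (c === closer) closer = null;          // quoted section ends
    } else if (Object.prototype.hasOwnProperty.call(pairs, c)) {
      closer = pairs[c];                        // quoted section starts
    } else if (c === ',') {
      parts.push(str.slice(start, i));
      start = i + 1;
    }
  }
  if (closer !== null) throw new SyntaxError('expected matching ' + closer);
  parts.push(str.slice(start));
  return parts;
}

console.log(splitOutsidePairedQuotes('a,«b,c»,"d,e"'));
// [ 'a', '«b,c»', '"d,e"' ]
```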

Try this regular expression:

(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,

This also allows strings like “'foo\'bar' => 'bar\\',”.

MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters in total. In Java, like this:

        String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);

Here's my extremely inelegant replacement that doesn't overflow the stack:

private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);                
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
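Another way to sidestep the lookahead is to match the cells themselves instead of the separators, so the line is walked once from left to right rather than re-scanned to the end for every tab or comma. The sketch below shows that idea in JavaScript (the language of the earlier splitter) rather than Java. The function name extractCells is hypothetical, and only double quotes are treated as quoting.

```javascript
function extractCells(line) {
  // One cell: any run of complete "..." blocks and characters that are
  // neither a separator nor a quote. The y (sticky) flag anchors each
  // match at lastIndex.
  const cellPattern = /(?:"[^"]*"|[^\t,"])*/y;
  const cells = [];
  let pos = 0;

  while (true) {
    cellPattern.lastIndex = pos;
    const cell = cellPattern.exec(line)[0]; // may be the empty string
    cells.push(cell);
    pos += cell.length;
    if (pos >= line.length) break;
    if (line[pos] !== ',' && line[pos] !== '\t') {
      // The only other way to stop is a quote that is never closed.
      throw new SyntaxError('unbalanced quote at index ' + pos);
    }
    pos++; // step over the separator
  }
  return cells;
}

console.log(extractCells('a,"b,c"\td,')); // [ 'a', '"b,c"', 'd', '' ]
```

Like split with a limit of -1 in the Java version, this keeps empty cells, including a trailing one.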

@SocialCensus, the example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above it if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']*")*$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear about what you mean by quoting either with " or with '. For example, is nesting allowed or not? If so, to how many levels? If only 1 nested level, what happens to a comma outside the inner nested quotation but inside the outer nesting quotation? You should also consider that single quotes occur by themselves as apostrophes (i.e., like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on a par with double quotes, since it assumes the last type of quotation mark is necessarily a double quote. Replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though I suppose we could probably assume all quotes are correctly delimited.
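The apostrophe problem is easy to reproduce. The regex in this JavaScript sketch is not the one from the comment: it is a hypothetical variant that treats '...' and "..." symmetrically, written only to show that even such a variant trips over a lone apostrophe.

```javascript
// Hypothetical variant: a comma counts as unquoted when the rest of the
// string is made of plain characters, "..." blocks, and '...' blocks.
const eitherQuote = /,(?=(?:[^"']|"[^"]*"|'[^']*')*$)/g;

// Apostrophe before the comma: harmless, the lookahead never sees it.
const before = "it's here, there".replace(eitherQuote, ';');
console.log(before); // it's here; there

// Apostrophe after the commas: it looks like a quote that never closes,
// so no comma matches even though none of them is quoted.
const after = "one, two, it's three".replace(eitherQuote, ';');
console.log(after); // one, two, it's three
```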

MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after them (i.e., are outside double quotes) and disregard all commas that have an odd number of double quotes after them (i.e., are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off, since it might not be clear whether the missing one belongs at the end or at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except at the end of the very last paragraph. This means that over the span of those paragraphs, the regex will fail in some places and succeed in others.
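The first anomaly can be seen with a two-comma input whose closing quote is missing. In this JavaScript sketch, the helper hasBalancedDoubleQuotes is a hypothetical guard added for illustration. It is not part of any answer above.

```javascript
const unquotedCommaPattern = /,(?=(?:[^"]|"[^"]*")*$)/g;

// Hypothetical pre-check: an odd number of " means some quote is unclosed.
function hasBalancedDoubleQuotes(text) {
  return (text.match(/"/g) || []).length % 2 === 0;
}

const broken = 'a,"b,c'; // the closing quote after c is missing
const replaced = broken.replace(unquotedCommaPattern, ';');

// The comma outside the quote is skipped (one quote follows it) and the
// comma inside the open quote is replaced (zero quotes follow it).
console.log(replaced);                        // a,"b;c
console.log(hasBalancedDoubleQuotes(broken)); // false
```

Running such a check first at least lets the caller reject the line instead of silently picking the wrong commas.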

Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow