Best way to strip punctuation from a string

https://stackoverflow.com/questions/265960

06-07-2019

Question

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

Solution

From an efficiency perspective, in Python 2 you're not going to beat

s.translate(None, string.punctuation)

For later versions of Python (Python 3), use the following code:

s.translate(str.maketrans('', '', string.punctuation))

It's performing raw string operations in C with a lookup table; there's not much that will beat that short of writing your own C code.
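As a rough illustration of the lookup-table idea on Python 3: the three-argument form of `str.maketrans` builds a dictionary keyed by code point, and `translate` drops every character whose entry is `None`. The names `TABLE` and `strip_punctuation` are made up for this sketch.

```python
import string

# Build the table once; each ASCII punctuation code point maps to None (delete).
TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(TABLE)

print(strip_punctuation("string. With. Punctuation?"))  # string With Punctuation
```

Building the table outside the function means repeated calls only pay for the `translate` step.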

If speed isn't a concern, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as you can see from the timings below. For this type of problem, doing the work at as low a level as possible pays off.

Timing code:

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

Other answers

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

In the code above, we substitute (re.sub) every character that is NOT [an alphanumeric character (\w) or whitespace (\s)] with an empty string.
Hence the . and ? punctuation won't be present in the variable s after running it through the regex.
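One detail worth checking with a quick sketch: `\w` also matches the underscore, so this pattern leaves underscores in place. The function name `strip_non_word` is only for illustration.

```python
import re

def strip_non_word(text):
    # Remove every character that is neither \w (letters, digits, underscore)
    # nor \s (whitespace).
    return re.sub(r'[^\w\s]', '', text)

print(strip_non_word("snake_case, it's ok!"))  # snake_case its ok
```

If underscores should go too, adding `_` to the character class, as in `r'[^\w\s]|_'`, is one way to do it.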

For convenience, I summarize how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for the detailed description.

Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans({key: None for key in string.punctuation})
new_s = s.translate(table)                          # Output: string without punctuation

myString.translate(None, string.punctuation)

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
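The snippet above is written for Python 2. On Python 3 the same idea might look like the sketch below; the function name `strip_unicode_punctuation` is an assumption made for this example.

```python
from unicodedata import category

def strip_unicode_punctuation(text):
    # Every Unicode punctuation category starts with 'P' (Pd, Po, Pi, Pf, ...).
    return ''.join(ch for ch in text if not category(ch).startswith('P'))

s = 'String — with -  «punctuation »...'
print('stripped', strip_unicode_punctuation(s))
```

Symbols such as `$` belong to the `S` categories, so this version keeps them.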

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; code points (integers) are looked up in that mapping and anything mapped to None is removed.

To remove (some?) punctuation then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.
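A small sketch of what that mapping contains and how it behaves, assuming Python 3 and a sample string chosen for this example:

```python
import string

# Keys are code points (integers); every value defaults to None.
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))

print(remove_punct_map[ord('!')])                   # None
print("Hello, world!".translate(remove_punct_map))  # Hello world
```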

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
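For completeness, here is a sketch of how such a table might be applied. It repeats the construction so that it runs on its own, and uses `sys.maxunicode + 1` so that the range also covers the last code point; the sample string is made up.

```python
import sys
import unicodedata

# Map every code point in a punctuation category ('P*') to None.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

s = 'String — with «Unicode» punctuation…'
print(s.translate(remove_punct_map))
```

Building the table walks more than a million code points, so it is worth doing once and reusing the result.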

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non？ASCII。 「（Punctuation）」?"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

Personally, I believe this is the best way to remove punctuation from a string in Python because:

It removes all Unicode punctuation
It's easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation but keep symbols like $.
You can get really specific about what you want to keep and what you want to remove, for example \p{Pd} will only remove dashes.
This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.
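The category codes behind properties such as \p{P}, \p{S} and \p{Pd} can be inspected with the standard library's unicodedata module. The sketch below is not the regex-module approach itself; it only approximates the same kind of filtering with `unicodedata.category`, and `strip_categories` is a hypothetical helper.

```python
import unicodedata

# Show the general category of a few characters.
for ch in ['.', '-', '—', '«', '$', '+']:
    print(repr(ch), unicodedata.category(ch))

def strip_categories(text, prefixes=('P',)):
    """Drop characters whose category starts with any given prefix (hypothetical helper)."""
    prefixes = tuple(prefixes)
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )

print(strip_categories('a-b$c'))               # ab$c
print(strip_categories('a-b$c', ('P', 'S')))   # abc
print(strip_categories('a-b.c', ('Pd',)))      # ab.c
```

Passing a narrower prefix such as `'Pd'` mirrors the point about removing only dashes.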

This might not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

Here is a function I wrote. It's not very efficient, but it is simple and you can add or remove any punctuation that you like:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # Rebuild the list with this punctuation character removed from every word
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

Here's a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

I haven't seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d), or a whitespace character (\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)

Here's a solution without regex.

import string

input_text = "!where??and!!or$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))    
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

Output>> where and or then

Replaces the punctuation with spaces
Replaces multiple spaces between words with a single space
Removes trailing spaces, if any, with strip()
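The snippet above relies on Python 2's `string.maketrans` and the `print` statement. A possible Python 3 sketch of the same three steps follows; the function name `punctuation_to_spaces` is an assumption for this example.

```python
import string

def punctuation_to_spaces(text):
    # Two-argument maketrans: map each punctuation character to a space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # split() with no arguments collapses runs of whitespace and drops
    # leading and trailing whitespace, so no separate strip() is needed.
    return ' '.join(text.translate(table).split())

print(punctuation_to_spaces("!where??and!!or$then:)"))  # where and or then
```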

Just as an update, I rewrote the @Brian example in Python 3 and moved the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and can't share regex objects between your workers, so each worker needs its own re.compile step. Also, I was curious to time two different implementations of maketrans for Python 3:

table = str.maketrans({key: None for key in string.punctuation})

table = str.maketrans('', '', string.punctuation)

I also added another method that uses a set, where I take advantage of the intersection function to reduce the number of iterations.

This is the complete code:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565
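The two maketrans forms timed above build the same mapping, so the gap between them presumably comes from constructing the table rather than from `translate` itself. A short check, with variable names chosen for this sketch:

```python
import string

# Dict form: single-character keys are converted to code points.
table_from_dict = str.maketrans({key: None for key in string.punctuation})
# Three-argument form: the third argument lists characters to delete.
table_from_args = str.maketrans('', '', string.punctuation)

print(table_from_dict == table_from_args)  # True

s = "string. With. Punctuation"
print(s.translate(table_from_dict) == s.translate(table_from_args))  # True
```

When the table can be shared, building it once outside the function removes that construction cost from every call.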

>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)


['string', 'With', 'Punctuation']

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

A one-liner might be helpful in not very strict cases:

''.join([c for c in s if c.isalnum() or c.isspace()])
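To make the "not very strict" caveat concrete, here is a sketch wrapped in a hypothetical function `keep_alnum_and_space`. In Python 3, `str.isalnum()` is Unicode-aware, so accented letters survive, while underscores and symbols are dropped along with punctuation.

```python
def keep_alnum_and_space(text):
    # Keep letters, digits and whitespace; everything else is removed,
    # including underscores and symbols such as $.
    return ''.join(c for c in text if c.isalnum() or c.isspace())

print(keep_alnum_and_space("héllo_wörld, 42!"))  # héllowörld 42
```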

#FIRST METHOD
#Storing all punctuations in a variable    
punctuation='!?,.:;"\')(_-'
newstring='' #Creating empty string
word=raw_input("Enter string: ")
for i in word:
     if(i not in punctuation):
                  newstring+=i
print "The string without punctuation is",newstring

#SECOND METHOD
word=raw_input("Enter string: ")
punctuation='!?,.:;"\')(_-'
newstring=word.translate(None,punctuation)
print "The string without punctuation is",newstring


#Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage

with open('one.txt', 'r') as myFile:
    str1 = myFile.read()

print(str1)

punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

# Replace each punctuation character with a space
for i in punctuation:
    str1 = str1.replace(i, " ")

print(str1)

# Split on single spaces and print one item per line
myList = str1.split(" ")
for i in myList:
    print(i, end='\n')
    print("____________")

Remove stop words from a text file using Python

print('====THIS IS HOW TO REMOVE STOP WORDS====')

with open('one.txt','r')as myFile:

    str1=myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"

    myList=[]

    myList.extend(str1.split(" "))

    for i in myList:

        if i not in stop_words:

            print ("____________")

            print(i,end='\n')

This is how to change our documents to upper case or lower case.

print('@@@@This is lower case@@@@')

with open('students.txt', 'r') as myFile:
    str1 = myFile.read()

# str.lower() returns a new string; the original is left unchanged
print(str1.lower())

print('*****This is upper case****')

with open('students.txt', 'r') as myFile:
    str1 = myFile.read()

# str.upper() likewise returns a new string
print(str1.upper())

I like to use a function like this:

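import string

def scrub(abc):
    # Trim punctuation from the end, then from the start; stop if the string becomes empty
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc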
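The built-in `str.strip` accepts a set of characters to remove from both ends, so trimming punctuation only at the edges can probably be done without explicit loops. A minimal sketch, with `strip_edges` as a made-up name:

```python
import string

def strip_edges(text):
    # strip() with an argument removes any of those characters from both
    # ends and leaves punctuation in the middle untouched.
    return text.strip(string.punctuation)

print(strip_edges("...hello, world!?"))  # hello, world
```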

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow