An elegant way to get hashtags out of a string in Python?

https://stackoverflow.com/questions/6331497

27-10-2019
Question

I am looking for a clean way to get a set (list, array, whatever) of the words that start with # in a given string.

In C#, I would write

var hashtags = input
    .Split (' ')
    .Where (s => s[0] == '#')
    .Select (s => s.Substring (1))
    .Distinct ();

What is a reasonably elegant way to do this in Python?

EDIT

Example input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]

Solution

Building on @inspectorG4dget's answer: if you don't want duplicates, you can use a set comprehension instead of a list comprehension.

>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])

Note that the { } syntax for set comprehensions only works from Python 2.7 onward.
If you are working with an older version, pass a list comprehension ([ ]) to the set function, as suggested by @Bertrand.
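
A set does not keep insertion order, while the expected output in the question lists the tags in order of first appearance. If that order matters, one possible approach is sketched below. It assumes Python 3.7+ (where dict preserves insertion order), and the function name is hypothetical:

```python
def unique_hashtags(text):
    """Return hashtags without the leading '#', in order of first appearance."""
    # Hypothetical helper, not part of the original answer.
    tags = (word[1:] for word in text.split() if word.startswith("#"))
    # dict.fromkeys keeps only the first occurrence of each key (Python 3.7+).
    return list(dict.fromkeys(tags))


print(unique_hashtags("Hey guys! #stackoverflow really #rocks #rocks #announcement"))
# ['stackoverflow', 'rocks', 'announcement']
```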

Other tips

[i[1:] for i in line.split() if i.startswith("#")]

This version avoids the empty-string problem raised in the comments, because split() with no arguments never produces empty strings. A token that is just "#" still passes the filter and yields an empty string, so add a length check if that matters. Also, as in Bertrand Marron's code, it is better to turn the result into a set, as follows (to avoid duplicates and to get O(1) lookup time):

set([i[1:] for i in line.split() if i.startswith("#")])
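
One detail worth checking: a token consisting only of "#" passes startswith("#") and produces an empty string. A minimal sketch with an extra length check follows; the function name is hypothetical and the set comprehension assumes Python 2.7 or later:

```python
def extract_hashtags(line):
    """Return the set of hashtags in line, skipping tokens that are just '#'."""
    # Hypothetical helper; len(word) > 1 filters out a bare "#".
    return {word[1:] for word in line.split()
            if word.startswith("#") and len(word) > 1}


print(extract_hashtags("a # b #c #c"))  # {'c'}
```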

The findall method of regular expression objects can get them all at once:

>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
>>>
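
Unlike the split-based versions, this pattern also stops at punctuation glued to a tag, and wrapping the result in set removes duplicates. A small sketch, with the question's sample string extended with punctuation for illustration:

```python
import re

HASHTAG = re.compile(r"#(\w+)")

text = "Hey guys! #stackoverflow really #rocks, #rocks! #announcement."

print(HASHTAG.findall(text))
# ['stackoverflow', 'rocks', 'rocks', 'announcement']

# Deduplicate; note that a set does not preserve order.
unique = set(HASHTAG.findall(text))
```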

I'd say

hashtags = [word[1:] for word in input.split() if word[0] == '#']

Edit: this will create a set without duplicates.

set(hashtags)

Another option is regex:

import re

inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

re.findall(r'(?i)\#\w+', inputLine) # results include the leading #
re.findall(r'(?i)(?<=\#)\w+', inputLine) # results do not include the #
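
For reference, the sketch below shows what these calls return on the sample input. The (?i) flag and the backslash before # are left out here, since \w already matches both cases and # needs no escaping in this pattern. On this input the lookbehind form returns the same list as a plain capture group:

```python
import re

input_line = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

with_hash = re.findall(r"#\w+", input_line)
without_hash = re.findall(r"(?<=#)\w+", input_line)
captured = re.findall(r"#(\w+)", input_line)

print(with_hash)     # ['#stackoverflow', '#rocks', '#rocks', '#announcement']
print(without_hash)  # ['stackoverflow', 'rocks', 'rocks', 'announcement']
print(without_hash == captured)  # True
```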

There are some problems with the answers presented here.

1. The following two

{tag.strip("#") for tag in tags.split() if tag.startswith("#")}

[i[1:] for i in line.split() if i.startswith("#")]

will not work if you have hashtags like "#one#two#".
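
A quick check of the two split-based expressions on hashtags that are not cleanly separated by whitespace. The wrapper function names are hypothetical; the expressions themselves are the ones quoted above:

```python
def via_strip(text):
    # Set-comprehension version quoted above.
    return {tag.strip("#") for tag in text.split() if tag.startswith("#")}


def via_slice(text):
    # List-comprehension version quoted above.
    return [i[1:] for i in text.split() if i.startswith("#")]


# Adjacent hashtags are treated as one whitespace-delimited token.
print(via_strip("#one#two#"))      # {'one#two'}
print(via_slice("#one#two#"))      # ['one#two#']

# A '#' separated from its word yields empty strings instead of tags.
print(via_strip("# one # two #"))  # {''}
print(via_slice("# one # two #"))  # ['', '', '']
```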

2. re.compile(r"#(\w+)") will not work for many Unicode languages (even with re.UNICODE).
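
One way this shows up, assuming Python 3's re module: \w does not match combining marks, so a tag is cut off at the first one. The sample string below is made up for illustration:

```python
import re

# "café" written as "cafe" followed by U+0301 (combining acute accent).
text = "#cafe\u0301"

# The combining mark is not a word character, so the match stops before it.
plain = re.findall(r"#(\w+)", text)
with_flag = re.findall(r"#(\w+)", text, re.UNICODE)

print(plain)          # ['cafe']
print(with_flag)      # ['cafe']
print(len(text) - 1)  # 5 characters follow the '#', only 4 were captured
```

Scripts that rely heavily on combining marks are affected in the same way.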

I have seen several ways to extract hashtags, but I did not find one that handles every case, so I wrote a small piece of Python code that handles most of them. It works for me.

def get_hashtagslist(string):
    # Characters that end a hashtag, so that only the prefix is kept
    # (e.g. '#happy,but i..' yields only 'happy').
    delimiters = ' .,():{}'
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            # A new '#' closes any hashtag in progress and starts the next one.
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue

        if hashtag and char in delimiters:
            # Stop collecting; a bare '#' followed by a delimiter yields nothing.
            if s:
                ret.append(s)
                s = ''
            hashtag = False
            continue

        if hashtag:
            s += char

    if s:
        ret.append(s)

    return set(ret)
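
A more compact sketch of the same idea uses a negated character class instead of a hand-written scanner. The delimiter set is an assumption: it takes the characters listed in the function above and adds any whitespace. The names are hypothetical:

```python
import re

# Assumed delimiters: '#', whitespace, and . , ( ) { } :
HASHTAG = re.compile(r"#([^#\s.,(){}:]+)")


def get_hashtags(text):
    """Hypothetical regex-based variant; returns a set of hashtags without '#'."""
    return set(HASHTAG.findall(text))


print(get_hashtags("#happy,but i.."))     # {'happy'}
print(sorted(get_hashtags("#one#two#")))  # ['one', 'two']
```

Because the class excludes delimiters rather than listing allowed word characters, combining marks and other non-\w characters stay inside the tag.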

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow