Question

I would like to extract just the Twitter handles from the following page: http://twitaholic.com/top100/followers/

All the Twitter handles start with an @.

So something like wget twitaholic.com/top100/followers/ | grep -oh "@" should print just the handles, but that doesn't work (it only prints the @). What's wrong?


Solution

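You are using grep's -o option, which prints only the part of each line that matches the pattern, and your pattern is the single character @, so @ is all you get. You also don't need the -h option: it suppresses file name prefixes, which grep does not print anyway when it reads from a pipe.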
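
To see the effect of -o in isolation, here is a minimal sketch that needs no network access. The sample line is an assumption modeled on the page's markup, and the variable names are purely illustrative.

```shell
# Hypothetical sample line, modeled on the page's markup (not fetched live).
sample=';@BarackObama<br'

# Without -o, grep prints the whole matching line.
whole_line=$(printf '%s\n' "$sample" | grep '@')
echo "$whole_line"

# With -o, grep prints only the matched text, and the pattern matches just "@".
only_at=$(printf '%s\n' "$sample" | grep -o '@')
echo "$only_at"
```

The pattern therefore has to cover the whole handle, not only its first character.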

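Try this (the -qO- options make wget write the page quietly to standard output, instead of saving it to a file, so that it can be piped into grep):

wget -qO- twitaholic.com/top100/followers/ | grep -o "@[^<]*"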

What we are telling grep here is to look for an @ symbol and capture everything after it up to, but not including, the next < symbol. This is because the line that carries the handle looks like this:

;@BarackObama<br

So you effectively need to extract the text starting at @ and stopping just before the <.
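
The pattern can be checked offline against a line of that shape. This is only a sketch: the sample string is an assumption based on the line quoted above, and the stricter variant (letters, digits, and underscores only) is an alternative not taken from the original answer.

```shell
# Hypothetical sample line, based on the markup quoted above (not fetched live).
sample=';@BarackObama<br'

# "@[^<]*" matches an @ followed by any run of characters that are not "<".
handle=$(printf '%s\n' "$sample" | grep -o '@[^<]*')
echo "$handle"

# Alternative (an assumption, not the original answer): accept only letters,
# digits, and underscores after the @.
strict=$(printf '%s\n' "$sample" | grep -oE '@[A-Za-z0-9_]+')
echo "$strict"
```

The stricter form may be preferable if an @ can be followed by other text before the next tag, since "[^<]*" would capture that text as well.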

Output:

$ wget -qO- twitaholic.com/top100/followers/ | grep -o "@[^<]*" | head -10
@katyperry
@justinbieber
@BarackObama
@ladygaga
@YouTube
@taylorswift13
@britneyspears
@rihanna
@jtimberlake
@instagram
Licensed under: CC-BY-SA with attribution