Question

I can't seem to write a regex that matches either a hashtag #, an @, or a word boundary. The goal is to break a string into Twitter-like entities and topics, like so:

input = "Hello @world, #ruby anotherString" 
input.scan(entitiesRegex) 
# => ["Hello", "@world", "#ruby", "anotherString"]

Getting just the words, excluding "anotherString" (which is too long), is simple:

/\b\w{3,12}\b/

will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:

/[\b@#]\w{3,12}\b/

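but that returns ["@world", "#ruby"]. This made me realize that a word boundary is a zero-width assertion, not a character, so it can't be a member of a character class, which always matches exactly one character. Inside the brackets, Ruby reads \b as a backspace character instead, so plain words never match. A few more attempts: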
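
The behavior of \b inside brackets can be checked directly. This is a minimal sketch; the constant name CHAR_CLASS and the backspace test string are illustrative and not part of the original question:

```ruby
# Inside [...], \b means the backspace character (0x08), not a word boundary.
CHAR_CLASS = /[\b@#]\w{3,12}\b/

input = "Hello @world, #ruby anotherString"

# Only words preceded by a literal @ or # are found.
p input.scan(CHAR_CLASS)
# => ["@world", "#ruby"]

# A word preceded by a real backspace character matches too.
p "\bword".scan(CHAR_CLASS)
# => ["\bword"]
```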

/\b|[@#]\w{3,12}\b/

returns ["", "", "@world", "", "#ruby", "", "", ""].

/(\b|[@#])\w{3,12}\b/

matches the right things, but returns [[""], ["@"], ["#"]], as expected, because when a pattern contains a capture group, scan returns only what the groups captured.

/((\b|[@#])\w{3,12}\b)/

mostly works. It returns [["Hello", ""], ["@world", "@"], ["#ruby", "#"]]. So now all the correct items are there; they are just sitting in the first element of each subarray. The following snippet technically works:

input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)

Is it possible to simplify this so that the regular expression alone matches and returns the correct substrings, without the collect post-processing?

Solution

You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match an @ or #, then a word boundary, then one or more word characters. In #ruby the boundary falls between # and r; in a plain word it falls at the start of the word.

p "Hello @world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "@world", "#ruby", "anotherString"]
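
To see how the one pattern covers each case, it can help to try it on individual tokens. The sketch below wraps the pattern in a hypothetical helper named entities (the helper and the ENTITY constant are illustrative, not from the answer). The last call shows a possible caveat: text such as an email-like "a@b" is split into two entities.

```ruby
ENTITY = /[@#]?\b\w+\b/

# Hypothetical helper, for illustration only.
def entities(text)
  text.scan(ENTITY)
end

p entities("plain")         # no prefix; \b matches at the start of the word
# => ["plain"]
p entities("#ruby, @a_b!")  # punctuation is skipped; \w includes underscores
# => ["#ruby", "@a_b"]
p entities("a@b")           # the @ attaches to whatever word follows it
# => ["a", "@b"]
```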

You can also restrict the length of a matching word with a quantifier. In a comment on a now-deleted answer, you gave the example of matching only #ruby by using {3,4}:

p "Hello @world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
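
Applying the same idea to the 3-to-12 character limit from the question might look like the sketch below. The second pattern is an alternative not mentioned in the answer: it keeps the question's alternation but uses a non-capturing group (?:...), so scan returns whole matches rather than subarrays. Both constant names are illustrative.

```ruby
input = "Hello @world, #ruby anotherString"

# Optional prefix, then a word of 3 to 12 characters.
OPTIONAL_PREFIX = /[@#]?\b\w{3,12}\b/

# Same alternation as in the question, but without a capture group.
NON_CAPTURING = /(?:\b|[@#])\w{3,12}\b/

p input.scan(OPTIONAL_PREFIX)
# => ["Hello", "@world", "#ruby"]
p input.scan(NON_CAPTURING)
# => ["Hello", "@world", "#ruby"]
```

Neither pattern needs the collect(&:first) step, because neither contains a capture group.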
Licensed under: CC-BY-SA with attribution