Hive query: Match string with list of keywords

https://stackoverflow.com/questions/22530463

18-06-2023
|

Domanda

I have a hive table with tweets about movies and a table with keywords mapped to movie titles

keyword example:

title                 keyword
------                -------
3 Days to Kill        3daystokill
3 Days to Kill        3 days to kill
12 Years a Slave      12YearsASlave

tweets example:

id              text
------          -------
125675146       3daystokill sucks!
125673498       3 days to kill is awesome!
239873985       I like 12 Years a Slave :)

I would like to be able to find the tweets matching the keywords for a certain movie title. For example, I want to find all the tweets that mention keywords from 3 Days to Kill (3daystokill and 3 days to kill).

I thought something like this, but the results are empty :(

SELECT k.keyword, t.text
FROM keywords k JOIN tweets t
ON t.text = CONCAT('%',k.keyword,'%')
WHERE k.title = "3 Days to Kill";

Soluzione

You are looking for like, not =:

SELECT k.keyword, t.text
FROM keywords k JOIN
     tweets t
     ON t.text like CONCAT('%', k.keyword, '%')
WHERE k.title = '3 Days to Kill';

EDIT:

I was not aware that HiveQL limited like to only wildcards at the beginning or end. One option is rlike. Another is instr():

SELECT k.keyword, t.text
FROM keywords k JOIN
     tweets t
     ON instr(t.text, k.keyword) > 0
WHERE k.title = '3 Days to Kill';

Altri suggerimenti

I am not sure if the below helps.. :( But have just given a try.

select case when
replace(keyword,' ',null) = '3daystokill'
then keyword
end
from keywords

You Can Use Hive StatisticsAndDataMining Fuction context_ngrams()

SELECT context_ngrams(sentences(lower(tweet)), 2, 100 [, 1000]) FROM twitter;

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow