find all two word phrases that appear in more than one row in a dataset

Question 1

Good news: BigQuery now supports SPLIT(). Check https://stackoverflow.com/a/24172995/132438.

This is a hack, but a hack I happen to like :).

In its current form, it only works for sentences with more than 2 words, and it only extracts the 6 first pairs. You can extend and test from here.

Try it on your data, and please report back.

SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000

Results might contain NSFW words. The sample dataset comes from an online community that has not been "cleaned up". Abstain from running query if you are sensitive to some words.

Question 2

A very useful hack which inspired me to solve my problem, thanks.

My data is a combination of passengers and their age where age is a string of numbers:

adults ages
------ -------------
  4    "53,67,65,68"       
  4    "44,45,69,65" 
  3    "20,21,20"
  3    "30,32,62"

I wanted to add a column on each row containing the difference in age between the highest and lowest value

adults ages          agediff
------ ------------- -------
   4   "53,67,65,68" 15       
   4   "44,45,69,65" 25
   3   "20,21,20"    1
   3   "30,32,62"    32

This was done by the following, heavily inspired by the hack:

SELECT adults, ages, SUBTRACT(INTEGER(maxage),INTEGER(minage)) agediff FROM 
 (SELECT adults, ages, max(age) maxage, min(age) minage FROM
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3"))
  ),
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4"))
  )

)