BigQuery Query Syntax getting path of URL
20-12-2019
Question
I have a table with a URL column.
I would like to split each URL into its domain and path. I can get the domain with Domain(URL) in BigQuery syntax.
My question is: how do I get the path of the URL?
e.g. http://www.somedomain.com/X/Y/abc
I want to get X, Y and abc as separate columns.
Solution
You can use REGEXP_EXTRACT to pull out each part you need:
SELECT
  -- Everything after the host
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(.*)') AS full_path,
  -- {N} skips N leading path segments, then the last group captures the next one
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){0}([^/]*)') AS full_path0,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){1}([^/]*)') AS full_path1,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){2}([^/]*)') AS full_path2,
  -- NULL for the sample URL, which has only three path segments
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){3}([^/]*)') AS full_path3
FROM
  (SELECT 'http://www.somedomain.com/X/Y/abc' AS URL)
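If the query is run with the standard SQL dialect rather than legacy SQL, the same split could be sketched with SPLIT instead of one regex per level. This is only an illustration under that assumption: the names urls, paths and path_level1 to path_level4 are made up here, and the pattern is loosened to accept http or https and any host.

```sql
#standardSQL
-- Hypothetical one-row input; replace it with your own table.
WITH urls AS (
  SELECT 'http://www.somedomain.com/X/Y/abc' AS url
),
paths AS (
  SELECT
    url,
    -- Everything after the first '/' that follows the host.
    REGEXP_EXTRACT(url, r'^https?://[^/]+/(.*)') AS full_path
  FROM urls
)
SELECT
  NET.HOST(url) AS host,
  full_path,
  -- SAFE_OFFSET returns NULL instead of raising an error when a segment is missing.
  SPLIT(full_path, '/')[SAFE_OFFSET(0)] AS path_level1,
  SPLIT(full_path, '/')[SAFE_OFFSET(1)] AS path_level2,
  SPLIT(full_path, '/')[SAFE_OFFSET(2)] AS path_level3,
  SPLIT(full_path, '/')[SAFE_OFFSET(3)] AS path_level4
FROM paths
```

For the sample URL this should give X, Y and abc in the first three level columns and NULL in the fourth. Query strings are not stripped, so a URL ending in ?a=b would keep that suffix in its last segment.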
As for the comparison with MS Log Parser:
- Log Parser runs directly on the flat log files, while in BQ you need to load them first.
- Log Parser runs on a dedicated machine, while BQ runs as a cloud service (many machines, and you don't need to care how many).
- Performance-wise, you'll find that BQ does things faster, and you don't have to worry about the resources available for processing. (Log Parser can only run as many threads as there are CPU units available, and it consumes a lot of the cache of the machine it runs on.)
- The regex functions in BQ give you all the flexibility you need to extract any pattern of data from the logs.
Enjoy
OTHER TIPS
ga_sessions has leaf fields under hits that break up your URL automatically.
With your example of
http://www.somedomain.com/X/Y/abc
hits.page.pagePathLevel1 will have 'www.somedomain.com/'
hits.page.pagePathLevel2 will have '/X/'
hits.page.pagePathLevel3 will have '/Y/'
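A minimal standard SQL sketch for reading those fields, assuming a Google Analytics export table. The table name below is hypothetical, so substitute your own project, dataset and date suffix. Whether the hostname shows up in pagePathLevel1 depends on how the property is configured, so the query also selects hits.page.hostname and hits.page.pagePath for comparison.

```sql
#standardSQL
-- Hypothetical table name; point this at your own Google Analytics export.
SELECT
  h.page.hostname,
  h.page.pagePath,
  h.page.pagePathLevel1,
  h.page.pagePathLevel2,
  h.page.pagePathLevel3,
  h.page.pagePathLevel4
FROM `my-project.my_dataset.ga_sessions_20191220` AS s,
  UNNEST(s.hits) AS h  -- one output row per hit
WHERE h.type = 'PAGE'  -- keep pageview hits only
LIMIT 100
```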
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow