Question

I have a table with a URL column.

I would like to break the URL into domain and path. I can get the domain by using Domain(URL) in BigQuery syntax.

My question is: how do I get the path of the URL?

e.g. http://www.somedomain.com/X/Y/abc

I want to get X, Y and abc as separate columns.

Solution

You can use REGEXP_EXTRACT to pull out what you need:

SELECT
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(.*)') AS full_path,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){0}([^/]*)') AS full_path0,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){1}([^/]*)') AS full_path1,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){2}([^/]*)') AS full_path2,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){3}([^/]*)') AS full_path3
FROM
  (SELECT 'http://www.somedomain.com/X/Y/abc' AS URL)
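
If you are on BigQuery standard SQL rather than the legacy dialect that Domain(URL) belongs to, the same idea can be sketched by stripping the scheme and host once and then splitting what is left on '/'. Treat this as a sketch, not the only way to do it: it assumes standard SQL is enabled, and the names urls, parts, path and path0..path3 are made up for the illustration. It also accepts https and hosts without www, which the patterns above do not.

```sql
-- Sketch for BigQuery standard SQL (assumption: not the legacy dialect).
WITH urls AS (
  SELECT 'http://www.somedomain.com/X/Y/abc' AS url
),
parts AS (
  SELECT
    url,
    NET.HOST(url) AS host,
    -- Everything after the first '/' that follows the host.
    REGEXP_EXTRACT(url, r'^https?://[^/]+/(.*)') AS path
  FROM urls
)
SELECT
  host,
  path,
  -- SAFE_OFFSET yields NULL when the segment does not exist.
  SPLIT(path, '/')[SAFE_OFFSET(0)] AS path0,
  SPLIT(path, '/')[SAFE_OFFSET(1)] AS path1,
  SPLIT(path, '/')[SAFE_OFFSET(2)] AS path2,
  SPLIT(path, '/')[SAFE_OFFSET(3)] AS path3
FROM parts
```

For the sample URL this gives X, Y and abc in path0, path1 and path2, with path3 left NULL. Note that a query string, if present, stays attached to the last segment unless you strip it first.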

As for the comparison with MS Log Parser:

  • Log Parser runs directly on the flat log files, while in BigQuery you need to load them first.
  • Log Parser runs on a dedicated machine, while BigQuery runs as a cloud service (many machines; you don't need to care how many).
  • Performance-wise, you'll find that BigQuery is faster, and you don't have to worry about the resources available for processing. (Log Parser can run only as many threads as there are available CPU units, and it consumes a lot of the cache of the machine it runs on.)
  • The regex functions in BigQuery give you the flexibility to extract any pattern of data from the logs.

Enjoy

OTHER TIPS

ga_sessions has hits leaf fields that break up your URL automatically.

With your example of

http://www.somedomain.com/X/Y/abc

hits.page.pagePathLevel1 will have 'www.somedomain.com/'
hits.page.pagePathLevel2 will have '/X/'
hits.page.pagePathLevel3 will have '/Y/'
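
To check what each level actually holds for your own property, you can select the fields side by side. The sketch below assumes BigQuery standard SQL and reads from Google's public Google Analytics sample dataset purely as a stand-in; swap in your own ga_sessions table. The alias h is arbitrary.

```sql
-- Sketch: inspect the pre-split path fields of the GA export.
-- Assumption: the public sample table below stands in for your own ga_sessions_* table.
SELECT
  h.page.hostname,
  h.page.pagePath,
  h.page.pagePathLevel1,
  h.page.pagePathLevel2,
  h.page.pagePathLevel3,
  h.page.pagePathLevel4
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
  UNNEST(hits) AS h  -- hits is a repeated record, so flatten it first
LIMIT 10
```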
Licensed under: CC-BY-SA with attribution