Question

Given the following data, how do I pull out the numbers between the <physical-blocks> tags?

Raw data:

"6917: <physical-blocks> 573653840</physical-blocks>"
"8954: <physical-blocks>573653841</physical-blocks>"
"8991: <physical-blocks>573653842</physical-blocks>"
"9028: <physical-blocks>573653843</physical-blocks>"
"9065: <physical-blocks>573653844</physical-blocks>"
"9102: <physical-blocks>573653845</physical-blocks>"

Desired output (as an array):

573653840 573653841 573653842 573653843 573653844 573653845 

I simply want to extract the data between <physical-blocks> and </physical-blocks>. Note: the full dataset includes many strings with angle brackets; I specifically need the data between this particular pair of tags.

Solution

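An awk version. It deletes everything up to the first >, deletes everything from the next < onward, and reassigns $1 to trim leftover whitespace: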

awk '{sub(/[^>]*>/,"");sub(/<.*/,"");$1=$1}1' file
573653840
573653841
573653842
573653843
573653844
573653845
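
The one-liner above is not tied to a particular tag name, so on the full dataset it would also return the contents of other tags. Since the question asks specifically for <physical-blocks>, a tag-specific variant may be safer. The following is only a minimal sketch, assuming POSIX awk and at most one tag pair per line; the function name extract_blocks is made up for illustration.

```shell
# Hypothetical helper: print the value inside <physical-blocks>...</physical-blocks>
# for each input line that contains the pair; lines with other tags are skipped.
extract_blocks() {
    awk '
    match($0, /<physical-blocks>[^<]*<\/physical-blocks>/) {
        # Drop the opening tag (17 characters) and the closing tag (18 characters).
        val = substr($0, RSTART + 17, RLENGTH - 35)
        gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim stray spaces, as in the first sample row
        print val
    }' "$@"
}
```

Called as "extract_blocks file", this prints one number per line; piping the result through "paste -sd' ' -" should join them onto a single line if that is preferred.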

OTHER TIPS

With GNU awk:

gawk 'RT=="</physical-blocks>"' RS='</?physical-blocks>' ORS=' ' file

If you want a newline after the output, use the one below:

$ cat file
"6917: <physical-blocks>573653840</physical-blocks>"
"8954: <physical-blocks>573653841</physical-blocks>"
"8991: <physical-blocks>573653842</physical-blocks>"
"9028: <physical-blocks>573653843</physical-blocks>"
"9065: <physical-blocks>573653844</physical-blocks>"
"9102: <physical-blocks>573653845</physical-blocks>"

$ gawk 'RT=="</physical-blocks>";END{printf "\n"}' RS='</?physical-blocks>' ORS=' ' file
573653840 573653841 573653842 573653843 573653844 573653845

You can use a simple lookbehind and lookahead:

(?<=\>)(\s*)(\d*)(?=\<)
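
As written, the pattern matches digits between any > and <, so it would also pick up values from other tags. A tag-specific version might look like the sketch below. It assumes GNU grep built with PCRE support (the -P option), which is not available everywhere, and the function name extract_blocks_pcre is hypothetical.

```shell
# Hypothetical helper; assumes GNU grep with PCRE support (-P).
extract_blocks_pcre() {
    # (?<=...) requires the opening tag just before the match,
    # \s*\K skips optional whitespace and drops it from the reported match,
    # (?=...) requires the closing tag right after the digits.
    grep -oP '(?<=<physical-blocks>)\s*\K\d+(?=</physical-blocks>)' "$@"
}
```

Run as "extract_blocks_pcre file", it prints each matched number on its own line.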
Licensed under: CC-BY-SA with attribution