Question

I have a 130k-line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line structure intact (including blank lines). My main problem is that I can't find a way to identify and keep multiple occurrences of a date in the same line while deleting all other information.

Original file structure:

US20110228428A1 | US |   | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC
US20120026629A1 | US |   | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US |   | EXAMINER | 2010-11-24 |   | US20120147501A1 | US |   | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC

Desired file structure:

2010-03-19 
2010-07-28 2010-11-24 2010-12-09 

Thank you for your help!

Solution

Search for

.*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)

And replace with

" $1"

Leave out the quotes; they are only there to show that there is a space before the $1. Note that this also puts a space before the first date in each line.

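The lazy .*? matches as few characters as possible until it reaches either a date or the end of the line (the $). If a date is found, it is captured in $1 because of the parentheses around the date pattern. Everything matched before the date is discarded, so the replacement only needs a space to separate the dates, followed by the captured date from $1. When the match ends at the end of the line instead, $1 is empty.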
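
For anyone who wants to try the pattern outside an editor, here is a minimal Python sketch of the same search and replace. The function name `keep_dates` and the `lines` sample are only illustrative, and it assumes Python 3.5 or later, where an unmatched group in the replacement is substituted as an empty string. The `strip()` call is an addition that is not part of the answer above: it removes the leading space and any spaces left behind by the end-of-line matches.

```python
import re

# The pattern from the answer: lazily skip text until a date (group 1)
# or the end of the line is reached.
PATTERN = re.compile(r".*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)")


def keep_dates(line: str) -> str:
    """Reduce one line to its dates, separated by single spaces."""
    # r" \1" is Python's spelling of the replacement " $1".
    # strip() is an extra step (not in the answer) that drops the leading
    # space and the spaces produced when only the end of line matched.
    return PATTERN.sub(r" \1", line).strip()


# Sample input taken from the question, with a blank line in between.
lines = [
    "US20110228428A1 | US |   | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC",
    "",
    "US20120026629A1 | US |   | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US |   | EXAMINER | 2010-11-24 |   | US20120147501A1 | US |   | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC",
]

for line in lines:
    print(keep_dates(line))
```

Because each line is processed on its own and always yields exactly one output line, blank lines stay blank and the line count does not change.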

Licensed under: CC-BY-SA with attribution