There are lots of ways to do this. Here is one way.
If the data is in a file called data:
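import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """
    http://stackoverflow.com/a/17508761/190597
    readfunc(chunksize) should return a string.
    Yields the pieces obtained by splitting the stream on the regex delimiter.
    Caveat: matches are only found within remainder + chunk, so a pattern
    such as PMID.* can be cut short if a chunk ends in the middle of that line.
    """
    remainder = ''
    # iter(callable, sentinel) keeps calling readfunc until it returns '' (end of file)
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        # The last piece may be incomplete, so hold it back for the next read
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

with open('data', 'r') as infile:
    # The capturing group makes re.split return the PMID lines as pieces too
    chunks = open_chunk(infile.read, delimiter=r'(PMID.*)')
    # zip(*[chunks]*2) pairs each text piece with the PMID line that follows it
    for i, (chunk, delim) in enumerate(zip(*[chunks]*2)):
        chunk = chunk+delim
        chunk = chunk.strip()
        if chunk:
            print(chunk)
            print('-'*80)
            # uncomment this if you want to save the chunk to a file named dataXXX
            # with open('data{:03d}'.format(i), 'w') as outfile:
            #     outfile.write(chunk)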
prints
1. Ann Intern Med. 2013 Dec 3;159(11):721-8. doi:10.7326/0003-4819-159-11-201312030-00004.
text text text texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext text
PMID: 24297188 [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
2. Am J Cardiol. 2013 Sep 1;112(5):688-93. doi: 10.1016/j.amjcard.2013.04.048. Epub
2013 May 24.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23711805 [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
3. Am J Cardiol. 2013 Aug 15;112(4):513-9. doi: 10.1016/j.amjcard.2013.04.015. Epub
2013 May 11.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23672989 [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
Uncomment the last two lines to save the chunks to separate files.
Why so complicated?
For short files, you could simply read the entire file into a string and split the string using a regex. The solution above is an adaptation of that idea which can handle large files. It reads the file in chunks, finds where to split each chunk, and yields the pieces as it finds them.
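For comparison, here is a minimal sketch of the simple whole-string approach for short files. The helper name split_records and the sample text are made up for illustration; only the (PMID.*) pattern comes from the code above.

```python
import re

def split_records(text, delimiter=r'(PMID.*)'):
    """Split text into records, each ending with its delimiter line.

    Hypothetical helper for illustration; it expects the whole file
    to be in memory as a single string.
    """
    # With one capturing group, re.split alternates:
    # text, delimiter, text, delimiter, ..., trailing text
    pieces = re.split(delimiter, text)
    records = []
    for body, delim in zip(pieces[0::2], pieces[1::2]):
        record = (body + delim).strip()
        if record:
            records.append(record)
    return records

# Made-up sample data standing in for the contents of a short file
sample = (
    "1. Journal A. 2013.\n"
    "text text\n"
    "PMID: 111 [PubMed]\n"
    "\n"
    "2. Journal B. 2013.\n"
    "more text\n"
    "PMID: 222 [PubMed]\n"
)

records = split_records(sample)
for record in records:
    print(record)
    print('-' * 80)
```

For a real file you would pass infile.read() instead of sample, which is exactly the step that stops scaling once the file no longer fits comfortably in memory.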
This problem of processing files in chunks separated by a delimiter regex pattern comes up often. So instead of writing a bespoke solution for each case, it is easier to use a utility function like open_chunk, which can handle all such problems, no matter what the delimiter, and in a manner that can handle large files as well as small.
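To illustrate the reuse, here is a small sketch that applies the same open_chunk function to a different, made-up format: records separated by a line of four dashes. The input string, the dash delimiter, and the tiny chunksize are all assumptions chosen for the demonstration, and io.StringIO stands in for a real file.

```python
import io
import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """Copy of the open_chunk generator from the answer above."""
    remainder = ''
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

# Hypothetical input: records separated by a line of four dashes
text = "alpha\nbeta\n----\ngamma\n----\ndelta\n"
stream = io.StringIO(text)

# A deliberately small chunksize forces several reads, so the
# delimiter lands on chunk boundaries during this run
records = list(open_chunk(stream.read, delimiter=r'\n-{4}\n', chunksize=8))
print(records)

# For this input, the chunked result matches splitting the whole string at once
assert records == re.split(r'\n-{4}\n', text)
```

Only the delimiter argument changed; the reading loop is untouched. A fixed delimiter like this one survives being cut by a chunk boundary because the unfinished tail is carried over in remainder until the next read completes it.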