Question

I have a problem using Python. I have a txt file that contains 500 abstracts from 500 papers, and what I want to do is split this txt file into 500 files, with each txt file containing only one abstract. So far I have noticed that each abstract ends with one line starting with "PMID", so I'm thinking of splitting the file on that line. But I'm really new to Python. Any ideas? Thanks in advance.

The txt file looks like this:

1. Ann Intern Med. 2013 Dec 3;159(11):721-8. doi:10.7326/0003-4819-159-11-201312030-00004.  
text text text texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext text
PMID: 24297188  [PubMed - indexed for MEDLINE]

2. Am J Cardiol. 2013 Sep 1;112(5):688-93. doi: 10.1016/j.amjcard.2013.04.048. Epub 
2013 May 24.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23711805  [PubMed - indexed for MEDLINE]

3. Am J Cardiol. 2013 Aug 15;112(4):513-9. doi: 10.1016/j.amjcard.2013.04.015. Epub 
2013 May 11.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23672989  [PubMed - indexed for MEDLINE]

and so on.

Solution

There are lots of ways to do this. Here is one way. If the data is in a file called data:

import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """
    Yield the pieces of a stream split on the regex `delimiter`,
    reading lazily via readfunc(chunksize) so that large files can be
    processed without loading them into memory all at once.
    Based on http://stackoverflow.com/a/17508761/190597
    readfunc(chunksize) should return a string.
    """
    remainder = ''
    # iter(..., '') keeps calling readfunc(chunksize) until it returns
    # the empty string (end of file)
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        # every piece but the last is complete, so yield it
        for piece in pieces[:-1]:
            yield piece
        # the last piece may be cut off mid-delimiter; carry it over
        # into the next chunk
        remainder = pieces[-1]
    if remainder:
        yield remainder

with open('data', 'r') as infile:
    # the capturing group in the delimiter makes open_chunk yield the
    # text pieces and the matched PMID lines alternately;
    # zip(*[chunks]*2) pairs each text piece with the PMID line after it
    chunks = open_chunk(infile.read, delimiter=r'(PMID.*)')
    for i, (chunk, delim) in enumerate(zip(*[chunks]*2)):
        chunk = (chunk + delim).strip()
        if chunk:
            print(chunk)
            print('-'*80)
            # uncomment this if you want to save the chunk to a file named dataXXX
            # with open('data{:03d}'.format(i), 'w') as outfile:
            #     outfile.write(chunk)

prints

1. Ann Intern Med. 2013 Dec 3;159(11):721-8. doi:10.7326/0003-4819-159-11-201312030-00004.  
text text text texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext text
PMID: 24297188  [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
2. Am J Cardiol. 2013 Sep 1;112(5):688-93. doi: 10.1016/j.amjcard.2013.04.048. Epub 
2013 May 24.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23711805  [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
3. Am J Cardiol. 2013 Aug 15;112(4):513-9. doi: 10.1016/j.amjcard.2013.04.015. Epub 
2013 May 11.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23672989  [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------

Uncomment the last two lines to save the chunks to separate files.
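
Two idioms in the code above deserve a note: because the delimiter pattern (PMID.*) contains a capturing group, re.split also returns the matched PMID lines, interleaved with the text between them; and zip(*[chunks]*2) pairs consecutive items from the same iterator, so each piece of text gets paired with the PMID line that follows it. A small self-contained demonstration of both:

import re

pieces = re.split(r'(PMID.*)', 'one\nPMID: 111\ntwo\nPMID: 222\n')
print(pieces)
# -> ['one\n', 'PMID: 111', '\ntwo\n', 'PMID: 222', '\n']

# zip(*[it]*2) advances the same iterator twice per loop, giving
# (text, delimiter) pairs; a trailing odd piece is dropped
it = iter(pieces)
for text, delim in zip(*[it]*2):
    print(repr(text + delim))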


Why so complicated?

For short files, you could simply read the entire file into one string and split that string with a regex. The solution above adapts that idea so it can also handle large files: it reads the file in chunks, finds the split points within each chunk, and yields pieces as it finds them.
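
A minimal sketch of that simpler whole-file approach (assuming, as above, the input file is named data; fine here, since 500 abstracts easily fit in memory):

import re

with open('data', 'r') as infile:
    text = infile.read()  # the whole file as one string

# each abstract runs from just after the previous PMID line (or the
# start of the file) up to and including its own PMID line
abstracts = re.findall(r'.*?PMID[^\n]*', text, flags=re.DOTALL)

for i, abstract in enumerate(abstracts):
    with open('data{:03d}'.format(i), 'w') as outfile:
        outfile.write(abstract.strip() + '\n')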

This problem of processing a file in pieces separated by a delimiter regex comes up often. So instead of writing a bespoke solution each time, it is easier to reuse a utility function like open_chunk, which works no matter what the delimiter is, and handles large files as well as small ones.
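
For instance, the same generator could split a log file wherever a line starts with an ISO date, just by changing the pattern (a hypothetical reuse; app.log is a made-up name):

# the lookahead keeps the date line with the entry that follows it, and
# re.split returns no delimiter pieces since (?=...) is not a capturing group
with open('app.log', 'r') as infile:
    for i, entry in enumerate(open_chunk(infile.read,
                                         delimiter=r'\n(?=\d{4}-\d{2}-\d{2})')):
        print('entry', i, ':', entry.strip())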

OTHER TIPS

You could try:

with open("txtfile.txt", "r") as f:  # read file
    ss = f.read(-1)

bb = ss.split("\nPMID:")  # split in blocks

# Reinsert the `PMID;`, if nedded:
bb1 = bb[:1] + [ "PMID:" + b  for b in bb]

Note that the newline in front of each PMID line is consumed by the split, and that with this delimiter each PMID line attaches to the start of the following block rather than to the end of its own abstract. The blocks can then be written into separate files.
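
A minimal sketch of that last step, assuming the corrected bb1 from above (the output file names are made up):

for i, block in enumerate(bb1):
    with open('abstract{:03d}.txt'.format(i), 'w') as out:  # hypothetical names
        out.write(block.strip() + '\n')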

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow