Question

I have a csv file in which I've arranged the IDs of the DNA samples that I sent out to be sequenced in a 96-well plate. Keeping track of this layout is important because, when we get the plate back from the sequencing facility, the chromatogram files have bare names such as 5-3-13-G-Templates_A01_Primer-G.ab1.

The csv is tab-delimited and looks like this (96 wells, 12 columns [1-12], 8 rows [A-H]):

1   2   3   4   5   6   7   8   9   10  11  12
A01 A02 A03 A04 A05_Grammatophyllum_scriptum_ITS1   A06_Eulophia_euglossa_ITS1  A07_Grammatophyllum_scriptum_17SE   A08_Graphorkis_lurida_X502F A09_Cymbidium_kanran_X502F  A10_Claderia_viridiflora_X502F  A11_Grammatophyllum_scriptum_X502F  A12_Eulophia_euglossa_X502F
B01 B02 B03 B04 B05_Grammatophyllum_scriptum_ITS4   B06_Eulophia_euglossa_ITS4  B07_Grammatophyllum_scriptum_1229R  B08_Graphorkis_lurida_X1599R    B09_Cymbidium_kanran_X1599R B10_Claderia_viridiflora_X1599R B11_Grammatophyllum_scriptum_X1599R B12_Eulophia_euglossa_X1599R
C01 C02 C03 C04 C05_Acriopsis_ridleyi_ITS1  C06_Cyrtopodium_polyphyllum_ITS1    C07_Cyrtopodium_polyphyllum_17SE    C08_Graphorkis_scripta_X502F    C09_Dipodium_conduplicatum_X502F    C10_Dipodium_5431_X502F C11_Cyrtopodium_polyphyllum_X502F   C12_Oeceoclades_gracillima_X502F
D01 D02 D03 D04 D05_Acriopsis_ridleyi_641R  D06_Cyrtopodium_polyphyllum_ITS4    D07_Cyrtopodium_polyphyllum_1229R   D08_Graphorkis_scripta_X1599R   D09_Dipodium_conduplicatum_X1599R   D10_Dipodium_5431_X1599R    D11_Cyrtopodium_polyphyllum_X1599R  D12_Oeceoclades_gracillima_X1599R
E01 E02 E03 E04_Dipodium_6052_ITS1  E05_Dipodium_5431_ITS1  E06_Bromheadia_finlaysoniana_ITS1   E07_Dressleria_dilecta_X502F    E08_Cyrtopodium_falciobum_X502F E09_Acriopsis_ridleyi_X502F E10_Dipodium_6052_X502F E11_Thecostele_alata_28_X502F   E12_Thecostele_alata_32_X502F
F01 F02 F03 F04_Dipodium_6052_ITS4  F05_Dipodium_5431_ITS4  F06_Bromheadia_finlaysoniana_641R   F07_Dressleria_dilecta_X1599R   F08_Cyrtopodium_falciobum_X1599R    F09_Acriopsis_ridleyi_X1599R    F10_Dipodium_6052_X1599R    F11_Thecostele_alata_28_X1599R  F12_Thecostele_alata_32_X1599R
G01 G02 G03 G04_Dipodium_6055_ITS1  G05_Dipodium_conduplicatum_ITS1 G06_Claderia_viridiflora_ITS1   G07_Ansellia_africana_X502F G08_Grammangis_ellisii_X502F    G09_Bromheadia_finlaysoniana_X502F  G10_Dipodium_6055_X502F G11_Grammatophyllum_stapeliiflorum_X502F    G12
H01 H02 H03 H04_Dipodium_6055_ITS4  H05_Dipodium_conduplicatum_ITS4 H06_Claderia_viridiflora_641R   H07_Ansellia_africana_X1599R    H08_Grammangis_ellisii_X1599R   H09_Bromheadia_finlaysoniana_X1599R H10_Dipodium_6055_X1599R    H11_Grammatophyllum_stapeliiflorum_X1599R   H12

Rather than renaming 96 files by hand every time I get a plate back, I'd like to reuse this file, which I prepare in advance anyway to guide me in loading the plate (so I don't put the wrong DNA in the wrong well). The idea is to identify each position by its prefix (e.g. A06 ... H06) and match it to the file names in a directory, since they share the same well location. The script should iterate over the entire csv file and rename all of the files, so that 5-3-13-G-Templates_A06_Primer-G.ab1 becomes A06_Eulophia_euglossa_ITS1.ab1.

I've written part of the Python script but I'm having difficulty envisioning the next step:

import csv

with open('Template.csv', newline='') as f:
    data = csv.DictReader(f, delimiter='\t')
    for row in data:
        # Each cell starts with its well ID, so sorting gives
        # the values by row in order from left to right
        values = sorted(row.values())

This is where I'm stuck. What do I do next now that I have these lists? For loops? I'm just having problems envisioning the solution.

I suppose a part of the solution would be a bit of the following code, modified from another answer I found:

folder = r"/home/ryan/Desktop/MMEE/plateG" #Make sure only the .ab1 files are in this directory
import os
for root, dirs, filenames in os.walk(folder):
    for filename in filenames:
        fullpath = os.path.join(root, filename)
        filename_split = os.path.splitext(fullpath)
        filename_zero, fileext = filename_split
        os.rename(fullpath, SOMEVARIABLE + fileext)

In the part above where I rename the file with os.rename, "SOMEVARIABLE" is where I think the name from the list above should be fed into the file name. But how to get it there is beyond my skill level at the moment. Or maybe I'm just tired.

Any help would be appreciated. I hope this is sufficiently clear but I can provide clarification if necessary. Cheers!

Edited to add: The old filename and new filename share only the location ID, e.g. A01, B06, H12. The new filenames will be taken from the csv file, so a file named 5-3-13-G-Templates_F08_Primer-G.ab1 will pull its name from column 8, but only from the cell with "F08" in it. The rows are A through H. Essentially I want to pick out the text at row F, column 8 (though I don't have row headings at the moment) and apply that text to the file with F08 in its name. I thought there might be a way to take each substring A01 through H12 from the generated values list and use the matching cell's text in place of the old filename, since the old filenames contain the same substrings A01 through H12.

I want the files renamed this way: (NB - A01 to D04 were blank wells so they have no other label than the ID)

5-3-13-G-Templates_E04_Primer-G.ab1 > E04_Dipodium_6052_ITS1.ab1
5-3-13-G-Templates_F04_Primer-G.ab1 > F04_Dipodium_6052_ITS4.ab1
5-3-13-G-Templates_G04_Primer-G.ab1 > G04_Dipodium_6055_ITS1.ab1
5-3-13-G-Templates_H04_Primer-G.ab1 > H04_Dipodium_6055_ITS4.ab1
5-3-13-G-Templates_A05_Primer-G.ab1 > A05_Grammatophyllum_scriptum_ITS1.ab1
5-3-13-G-Templates_B05_Primer-G.ab1 > B05_Grammatophyllum_scriptum_ITS4.ab1
...

Solution

  1. Process the CSV file, collect all the new filenames, and build a map from sample ID to new name.

  2. Walk through the directory, find all the files, extract the sample ID from each basename, and look up the new name in the id_map created in step 1. Rename each file accordingly.

import csv
import os
import re

# Matches a well ID such as A01 or H12
WELL_ID = re.compile(r'[A-H][0-9]{2}')

# First: map each well ID to the new name taken from the CSV
id_map = {}
with open('csv.csv', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        for name in row.values():
            # find all sample IDs as a list in the cell, should only get 1 ID
            ids = WELL_ID.findall(name or '')
            if len(ids) != 1:
                print("Confused at " + repr(name))
                continue  # skip cells without exactly one ID
            id_map[ids[0]] = name

# Second: rename every file whose basename contains a known well ID
folder = 'files/'
for root, dirs, files in os.walk(folder):
    for filename in files:
        fullname = os.path.join(root, filename)
        basename, extension = os.path.splitext(filename)
        # find all sample IDs in the basename, should only get 1 ID
        ids = WELL_ID.findall(basename)
        if len(ids) != 1:
            print("Confused at " + fullname)
            continue  # skip files without exactly one ID
        if ids[0] in id_map:
            new_name = id_map[ids[0]] + extension
            os.rename(fullname, os.path.join(root, new_name))
        else:
            print("New name for " + fullname + " not found")
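
Before renaming real chromatogram files, it may help to preview the mapping first. What follows is a minimal Python 3 sketch of the same two steps as pure functions, not part of the original answer. The helper names build_id_map and plan_renames are hypothetical, and the stricter regular expression rests on an assumption: well IDs run from A01 to H12 and are set off from the rest of the name by non-alphanumeric characters such as underscores.

```python
import os
import re

# Assumption: a well ID is a row letter A-H plus a column 01-12,
# not embedded in a longer run of letters or digits.
WELL_ID = re.compile(r'(?<![A-Za-z0-9])[A-H](?:0[1-9]|1[0-2])(?![0-9])')


def build_id_map(cells):
    """Map each well ID to the full cell text that contains it."""
    id_map = {}
    for cell in cells:
        ids = WELL_ID.findall(cell)
        if len(ids) == 1:  # ignore cells with zero or several IDs
            id_map[ids[0]] = cell
    return id_map


def plan_renames(filenames, id_map):
    """Return (old, new) name pairs without touching the filesystem."""
    plan = []
    for filename in filenames:
        basename, extension = os.path.splitext(filename)
        ids = WELL_ID.findall(basename)
        if len(ids) == 1 and ids[0] in id_map:
            plan.append((filename, id_map[ids[0]] + extension))
    return plan


# Hypothetical sample data taken from the question's plate layout
cells = ["E04_Dipodium_6052_ITS1", "F04_Dipodium_6052_ITS4", "G12"]
files = [
    "5-3-13-G-Templates_E04_Primer-G.ab1",
    "5-3-13-G-Templates_G12_Primer-G.ab1",
    "5-3-13-G-Templates_H01_Primer-G.ab1",  # no entry for H01 in cells
]

for old, new in plan_renames(files, build_id_map(cells)):
    print(old, ">", new)
```

Once the printed pairs look right, each pair could be passed to os.rename (joined with the directory path) to perform the actual renaming. Files whose well ID is missing from the map are simply left out of the plan.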
Licensed under: CC-BY-SA with attribution