Question

I have researched this simple problem extensively but can't find an answer. I am trying to merge two files using pandas' pd.merge based on a common column named "JN". I believe it is treating my 'joined' (os.path.join) filename as a string instead of a dataframe/csv file. After I call the pd.merge function the error says "string indices must be integers, not str".

import pandas as pd
import os

path = r"C:/Users/St/Documents/House/m2"

dirs = os.listdir(path)

for file in dirs:
    if file.endswith("J.csv"):
        J = file
        if len(J) is 12: #some filenames are 12 chars others 11
            jroot = J[:7]
        else:
            jroot = J[:6]

for file in dirs:
    if file.endswith("2.csv"):
        W = file
        if len(W) is 12:
            root2 = W[:7]
        else:
            root2 = W[:6]

JJ = os.path.join(path, J)
WW = os.path.join(path, W)

if jroot == root2:          # if the first 7 (or 6) characters match, then merge
    JW = pd.merge(JJ, WW, on="JN")

In associated with the above pd.merge function call, I am getting this error:

TypeError: string indices must be integers, not str

I am wondering how to make it read my filename string as an actual file or dataframe. JJ and WW are the equivalent to full paths when printed out. I tried make these 'filenames' dataFrames using pd.DataFrame but wasn't able to do so.

Was it helpful?

Solution

You cannot merge two strings. I think you're confused about what os.path.join returns. It returns a string. You have to actually read in the DataFrames from the files named JJ and WW, then perform the merge.

Here's a full example of writing 2 DataFrames, reading them back with read_csv and then merging them on a column group:

In [49]: df1 = DataFrame(randn(10, 1), columns=['a'])

In [50]: df1['group'] = np.random.choice(['b', 'c'], size=len(df1))

In [51]: df2 = DataFrame(randn(10, 1), columns=['b'])

In [52]: df2['group'] = np.random.choice(['b', 'c'], size=len(df1))

In [53]: df1.to_csv('df1.csv', index=False)

In [54]: cat df1.csv
a,group
-1.590035935931282,b
0.5496398501891229,c
-0.6484689548035797,b
0.19162302248253205,b
-0.9852064283582675,c
0.5975155551821989,b
0.29443634291217047,b
-0.7929994157215382,b
-1.9546460886048795,b
0.19195457928475546,c

In [55]: df2.to_csv('df2.csv', index=False)

In [56]: cat df2.csv
b,group
-1.2874060006117918,c
1.1037959548210117,b
0.47172389260467507,c
0.12802538607490285,c
-0.8753708425917293,b
-0.09187827793091947,b
1.140204215271196,c
0.4862940170888638,b
-1.1080430563137758,b
-1.3698112665693232,c

In [57]: df1_csv = read_csv('df1.csv', index_col=None)

In [58]: df2_csv = read_csv('df2.csv', index_col=None)

In [59]: df1_csv
Out[59]:
       a group
0 -1.590     b
1  0.550     c
2 -0.648     b
3  0.192     b
4 -0.985     c
5  0.598     b
6  0.294     b
7 -0.793     b
8 -1.955     b
9  0.192     c

In [60]: df2_csv
Out[60]:
       b group
0 -1.287     c
1  1.104     b
2  0.472     c
3  0.128     c
4 -0.875     b
5 -0.092     b
6  1.140     c
7  0.486     b
8 -1.108     b
9 -1.370     c

In [61]: df3 = pd.merge(df1_csv, df2_csv, on='group')

In [62]: df3
Out[62]:
        a group      b
0  -1.590     b  1.104
1  -1.590     b -0.875
2  -1.590     b -0.092
3  -1.590     b  0.486
4  -1.590     b -1.108
5  -0.648     b  1.104
6  -0.648     b -0.875
7  -0.648     b -0.092
8  -0.648     b  0.486
9  -0.648     b -1.108
10  0.192     b  1.104
11  0.192     b -0.875
12  0.192     b -0.092
13  0.192     b  0.486
14  0.192     b -1.108
15  0.598     b  1.104
16  0.598     b -0.875
17  0.598     b -0.092
18  0.598     b  0.486
19  0.598     b -1.108
20  0.294     b  1.104
21  0.294     b -0.875
22  0.294     b -0.092
23  0.294     b  0.486
24  0.294     b -1.108
25 -0.793     b  1.104
26 -0.793     b -0.875
27 -0.793     b -0.092
28 -0.793     b  0.486
29 -0.793     b -1.108
30 -1.955     b  1.104
31 -1.955     b -0.875
32 -1.955     b -0.092
33 -1.955     b  0.486
34 -1.955     b -1.108
35  0.550     c -1.287
36  0.550     c  0.472
37  0.550     c  0.128
38  0.550     c  1.140
39  0.550     c -1.370
40 -0.985     c -1.287
41 -0.985     c  0.472
42 -0.985     c  0.128
43 -0.985     c  1.140
44 -0.985     c -1.370
45  0.192     c -1.287
46  0.192     c  0.472
47  0.192     c  0.128
48  0.192     c  1.140
49  0.192     c -1.370

Couple of other things:

Don't use is to compare objects for equality, use ==. Only in the case of small integers will this work reliably, and even then you shouldn't rely on it because that's an implementation detail of CPython.

Instead of checking the file name with str.endswith, just iterate over what you want by first globbing:

import glob

for f in glob.glob(os.path.join(path, '*J.csv')):
    if len(f) == 12:
        # do all the thingz!
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top