Question

I am trying to analyse some data, but it contains letters that need standardising. For every datatable in the data (this CSV contains 3 datatables), I would like to replace the letter T, or any other letter for that matter, with the next highest integer for that table. The first table contains no errors, the second contains one letter (a T) and the third contains two (a Q and a T).

DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T

I expected this to be relatively easy to code. However, while I know how to replace every T with a number within a particular column or row, I do not know how to replace each T with a different number depending on the datatable it is in. Essentially I am looking to produce the following from the above:

DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,7
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,6
DatatableC,5
DatatableC,6

Here nothing changed in DatatableA. In DatatableB the only T was replaced with the next highest integer, in this case 7. In DatatableC there were two anomalous data points, which were both replaced with the next highest integer, 6.

If anyone can point me in the right direction or provide a snippet, it would be greatly appreciated. As always, constructive comments are also appreciated.

Edit in reply to elyase

I attempted to run the code:

import pandas as pd

df = pd.read_csv('test.csv', sep=',', header=None, names=['datatable', 'col'])

def replace_letter(group):
    letters = group.isin(['T', 'Q'])              # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group


df['col'] = df.groupby('datatable').transform(replace_letter)
print df

and I received the traceback:

Traceback (most recent call last):
  File "C:/test.py", line 11, in <module>
    df['col'] = df.groupby('datatable').transform(replace_letter)
  File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1981, in transform
    res = path(group)
  File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 2006, in <lambda>
    slow_path = lambda group: group.apply(lambda x: func(x, *args, **kwargs), axis=self.axis)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4416, in apply
    return self._apply_standard(f, axis)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
    raise e
ValueError: ("invalid literal for int() with base 10: 'col'", u'occurred at index col')

Is there something I have used incorrectly? I could use AEA's answer, but I have been meaning to use pandas more, as the library seems so useful for data manipulation.

Solution

Pandas is ideal for this kind of task:

Read your csv:

>>> import pandas as pd

>>> df = pd.read_csv('data.csv', sep=',', header=None, names=['datatable', 'col'])
>>> df.head()
     datatable  col
0    DatatableA  1
1    DatatableA  2
2    DatatableA  3
3    DatatableA  4
4    DatatableA  5

Group, select and replace max:

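def replace_letter(group):
    group = group.copy()                              # work on a copy of the group
    letters = group.isin(['T', 'Q'])                  # select letters
    # compare as integers (a string max would rank '9' above '10'), then replace by next max
    group[letters] = group[~letters].astype(int).max() + 1
    return group


>>> df['col'] = df.groupby('datatable')['col'].transform(replace_letter)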
>>> df

     datatable   col
0    DatatableA  1
1    DatatableA  2
2    DatatableA  3
3    DatatableA  4
4    DatatableA  5
5    DatatableB  1
6    DatatableB  6
7    DatatableB  7
8    DatatableB  3
9    DatatableB  4
10   DatatableB  5
11   DatatableB  2
12   DatatableC  3
13   DatatableC  4
14   DatatableC  2
15   DatatableC  1
16   DatatableC  6
17   DatatableC  5
18   DatatableC  6

Write to csv:

df.to_csv('result.csv', index=False, header=False)
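
The list `['T', 'Q']` above only covers the letters that appear in the sample. As a rough sketch of how this could be extended to any non-numeric entry, one option is `pd.to_numeric(..., errors='coerce')`, which turns anything it cannot parse into NaN, followed by filling each group's NaNs with that group's maximum plus one. The data is inlined through `io.StringIO` so the snippet runs on its own, and the helper name `fill_next_max` is made up for this illustration.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the snippet is self-contained.
raw = """DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T
"""

df = pd.read_csv(io.StringIO(raw), header=None, names=['datatable', 'col'])

# Anything that is not a number (T, Q, or any other letter) becomes NaN.
numeric = pd.to_numeric(df['col'], errors='coerce')


def fill_next_max(group):
    # Hypothetical helper: fill a group's NaNs with that group's max + 1.
    # Assumes every group has at least one numeric value.
    return group.fillna(group.max() + 1)


df['col'] = numeric.groupby(df['datatable']).transform(fill_next_max).astype(int)
result = df['col'].tolist()
print(df)
```

A side effect of this variant is that the column ends up with an integer dtype instead of a mix of strings and numbers.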

Other tips

I suppose I have to answer the question asked by my own alter ego. Seriously, does StackExchange not sanitize usernames?

Here's a solution. I can't guarantee that it's efficient, but the logic is pretty simple. First you iterate over your dataset, check for anything that's not an integer string, and record the largest integer value for each table. Then you iterate again and replace the non-integer strings.

I am using StringIO as a stand-in for a file, purely for convenience.

import csv
import string
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3


raw = """DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T"""

fp = StringIO()
fp.write(raw)
fp.seek(0)

reader = csv.reader(fp)

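data = []
mapping = {}  # table name -> largest integer seen in that table
for row in reader:
    if row[0] not in mapping:
        mapping[row[0]] = float("-inf")
    # isdigit() checks the whole string; `in string.digits` would only
    # match substrings of "0123456789" and so miss values such as "10"
    if row[1].isdigit():
        x = int(row[1])
        if x > mapping[row[0]]:
            mapping[row[0]] = x
    data.append(row)

for row in data:
    if not row[1].isdigit():
        # every non-integer in a table gets the same value: that table's max + 1
        row[1] = str(mapping[row[0]] + 1)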

fp.close()
fp = StringIO()
writer = csv.writer(fp)
writer.writerows(data)

print(fp.getvalue())
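
For reuse on a real file, the same two-pass idea could be wrapped in a function. This is only a sketch: the name `replace_non_integers` is invented here, it assumes two-column rows, it gives every non-integer entry in a table the same value (that table's maximum plus one, as in the expected output in the question), and it falls back to 1 for a table with no integers at all, which is an assumption rather than something the question specifies.

```python
def replace_non_integers(rows):
    """Return a copy of rows with non-integer values replaced by max + 1 per table."""
    rows = [list(row) for row in rows]  # copy so the caller's data is untouched

    # Pass 1: record the largest integer seen in each table.
    largest = {}
    for table, value in rows:
        if value.isdigit():
            largest[table] = max(largest.get(table, 0), int(value))

    # Pass 2: replace anything that is not an integer string.
    for row in rows:
        table, value = row
        if not value.isdigit():
            row[1] = str(largest.get(table, 0) + 1)
    return rows


sample = [["DatatableC", v] for v in ["3", "4", "2", "1", "Q", "5", "T"]]
cleaned = replace_non_integers(sample)
print(cleaned)
```

The rows passed in can come straight from `csv.reader`, and the returned list can be handed to `csv.writer(...).writerows(...)`.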
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow