Question

I have some lines in a CSV file like this:

1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2

If you notice, some numbers are enclosed in " " and has a thousand separator ",". I want to remove the thousand separator and the double quote enclosure. For the qoute enclosure, I'm thinking of using string.replace() but how about the comma inside the quote marks?

What's the best way of doing this in Python?

Was it helpful?

Solution

You could simply parse the CSV, make the necessary changes and then write it again.

(I haven't tested this code but it should be something like this)

import csv
reader = csv.reader(open('IN.csv', 'r'))
writer = csv.writer(open('OUT.csv', 'w')
for row in reader:
 # do stuff to the row here
 # row is just a list of items
 writer.writerow(row)

OTHER TIPS

Here is a bit of regular expression fiddling that will do the trick:

>>> import re
>>> p = re.compile('["]([^"]*)["]')
>>> x = """1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2"""
>>> p.sub(lambda m: m.groups()[0].replace(',',''), x)
'1000001234,Account Name,0,0,3711.32,0,0,18629.64,22340.96,COD,20000.00,Some string,Some string 2'

Removes the commas from the parts of the string that is between pairs of quotes.

If all you want is to remove double quotes and commas from a string, a couple of replaces will do it:

s = s.replace('"','').replace(',','')

A faster way is to use s.translate, but that requires a minimum of preparation:

import string
identity = string.maketrans('', '')

...

s = s.translate(identity, '",')

This removes any occurrence of double quotes or commas, and does it pretty fast too. In general, the .translate method of string objects is the best way to remove certain kinds of characters from a string (as well as possibly performing some character-to-character translation, but, by using a translate table such as the identity one I show here, the translation part may in fact be easily bypassed). Note that .translate works a bit differently for Unicode objects (and therefore for Python 3 strings, too) -- I'm giving the approach that's suitable for plain Python 2 string objects.

Here is something I just tested, you may not need pprint, I just want to use for clear output.

test.csv

1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2
1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2

Code, use csv reader, and pass each item to parseNum function to check valid digit or not.

from pprint import pprint
import csv

def parseNum(x):
    xx=x.replace(",","")
    if not xx.replace(".","").isdigit(): return x
    return "." in xx and float(xx) or int(xx)

x=[map(parseNum,line) for line in csv.reader(open("test.csv"))]

pprint(x)

Output

[[1000001234,
  'Account Name',
  0,
  0,
  3711.3200000000002,
  0,
  0,
  18629.639999999999,
  22340.959999999999,
  'COD',
  20000.0,
  'Some string',
  'Some string 2'],
 [1000001234,
  'Account Name',
  0,
  0,
  3711.3200000000002,
  0,
  0,
  18629.639999999999,
  22340.959999999999,
  'COD',
  20000.0,
  'Some string',
  'Some string 2']]

Note: If you need good precision on float numbers, replace float with Decimal

Use the csv module. It has all sorts of constants and parameters to help you set the delimiters, quotes, and everything else for the type of file you are working with. It even has a Sniffer that can help you identify the csv format of the file. In fact this is the only module I have found that can properly and easily work with csv files.

http://docs.python.org/library/csv.html

You should absolutely use the csv module. If you use a csv.reader, you only have one very small problem: testing fields to see if they're numbers, and stripping commas if they are. I've packaged it as a generator:

import csv

def read_and_fix_numbers(f):
    """Iterate over a file object that returns CSV data, stripping commas out of numbers."""
    for row in csv.reader(f):
        for field in row:
            try:
                x = float(field)
                field.replace(",", "")
            except ValueError:
                pass
            fixed.append(field)
        yield fixed

Usage:

>>> data = '1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2'
>>> import StringIO
>>> f = StringIO.StringIO(data)
>>> for row in read_and_fix_numbers(f):
        print row
['1000001234', 'Account Name', '0', '0', '3711.32', '0', '0', '18629.64', '22340.96', 'COD', '20000.00', 'Some string', 'Some string 2']
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top