Question

I have a CSV file which is formatted somewhat like this:

name    subname value1  value2
a       a       1       21  
a       a       2       22  
a       a       3       23  
a       a       4       24  

b       a       5       25  
b       a       6       26  
b       a       7       27  
b       a       8       28  

c       c       9       29  
c       c       10      30  
c       c       11      31  
c       c       12      32
....
etc

Using a simple CSV-to-JSON script, I have managed to output each row as a valid JSON entry; however, this is very redundant, since there are so many repeated values.

I am trying to read this file and output it to a form that looks just like this:

[
{
   "name":"a", 
   "subname":"a", 
   "data": {
      "attr1":{"name":"value1", "values":[1,2,3,4]},
      "attr2":{"name":"value2", "values":[21,22,23,24]}
   }
},
{
   "name":"b", 
   "subname":"a", 
   "data": {
      "attr1":{"name":"value1", "values":[5,6,7,8]},
      "attr2":{"name":"value2", "values":[25,26,27,28]}
   }
},
{
   "name":"c", 
   "subname":"c", 
   "data": {
      "attr1":{"name":"value1", "values":[9,10,11,12]},
      "attr2":{"name":"value2", "values":[29,30,31,32]}
   }
},
....
etc
]

I know that the script should work something like this:

loop until no more rows:
skip row 1
for the next 4 rows
    {
        "name":row 1, column 1 , 
        "subname":row 1, column 2 , 
        "data": {
            "attr1":{"name":"value1", "values":[row 1 to 4, column 3]}
            "attr2":{"name":"value2", "values":[row 1 to 4, column 4]}
        }
    }

With this particular dataset there will always be this pattern (however, the actual data has many more entries and columns). I know what I would like for output, but I am not exactly sure how to implement it.

How would I do this with Python? Any suggestions and solutions are greatly appreciated.

Edit: here is the solution in plain JavaScript using underscore.js

var headers = this.get('headers')
var grid = this.get('grid')
var transposed = grid.transpose()
var tables = [];
var rows = []
var keys = ["name", "subname"]

var numberOfEntries = grid.length - 2;
_(numberOfEntries).times(function(n) {keys.push("attr" + (n+1) ) } )

_.each(transposed, function(row) {
  rows.push(_.object(keys, row))
})

var names = _.uniq(grid[0])

_.each(names, function(name) {
  var entries = _.where(rows, {name: name})

  _.each(entries, function(entry) {
    var exists = _.where(tables, {name: entry.name, subname: entry.subname})
    var obj = {};
    if(exists.length > 0) {
      obj = exists[0]
    } 
    else {
      obj = {name: entry.name, subname: entry.subname, data: {}}
      tables.push(obj)        
    }

    _(numberOfEntries).times(function(n) {
      var i = n + 1;
      if( !obj.data["attr" + i] ) {
        obj.data["attr" + i ] = {"name":headers[n+2], "values": []};
      }
      // always push, so the first value of each group is not dropped
      obj.data["attr" + i].values.push(entry["attr" + i])
    })
  })
})

Solution

I would iterate over each row of the CSV and keep a dictionary of the groups already seen, keyed by the combination of name and subname (I am assuming that combination identifies a group):

import csv

data = {}
with open("data.csv") as f:  # the file name is a placeholder
    for row in csv.DictReader(f):
        key = row["name"] + "-" + row["subname"]
        if key not in data:
            # first row of this name/subname combination: create its entry
            data[key] = {
                "name": row["name"],
                "subname": row["subname"],
                "data": {
                    "attr1": {"name": "value1", "values": []},
                    "attr2": {"name": "value2", "values": []}
                }
            }
        data[key]["data"]["attr1"]["values"].append(row["value1"])
        data[key]["data"]["attr2"]["values"].append(row["value2"])
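
To get from this dictionary to the JSON array shown in the question, one option is to take the dictionary's values and serialise them with json.dumps. The following is only a sketch under a few assumptions: it reads a small in-memory, comma-separated sample instead of the real file, it keys the groups by a (name, subname) tuple rather than a joined string, and it converts the values with int, which only works if every value column holds integers. The function name group_rows is made up for the example.

```python
import csv
import io
import json

# Small in-memory stand-in for the real file (assumed comma-separated).
SAMPLE = """name,subname,value1,value2
a,a,1,21
a,a,2,22

b,a,5,25
b,a,6,26
"""

def group_rows(rows):
    """Collect value1/value2 per (name, subname) pair, in first-seen order."""
    groups = {}
    for row in rows:
        key = (row["name"], row["subname"])
        if key not in groups:
            groups[key] = {
                "name": row["name"],
                "subname": row["subname"],
                "data": {
                    "attr1": {"name": "value1", "values": []},
                    "attr2": {"name": "value2", "values": []},
                },
            }
        # Assumption: the value columns hold integers.
        groups[key]["data"]["attr1"]["values"].append(int(row["value1"]))
        groups[key]["data"]["attr2"]["values"].append(int(row["value2"]))
    # Dictionaries keep insertion order, so groups come out in file order.
    return list(groups.values())

# csv.DictReader skips the blank separator lines on its own.
result = group_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(json.dumps(result, indent=2))
```

Unlike a groupby-based approach, this does not require the rows of one group to be next to each other in the file.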

OTHER TIPS

My approach, which I find very readable, would be as follows:

import csv
import pprint
from itertools import groupby

with open('tsv.csv') as f:
    values = []
    reader = csv.DictReader(f)
    # groupby only merges adjacent rows, so rows with the same name must be contiguous
    for group in (list(g) for k, g in groupby(reader, lambda r: r["name"])):  # group by the name column of each row
        # group looks like [{'subname': 'a', 'value2': '25', 'value1': '5', 'name': 'b'}, ...]
        groupRep = {"name": group[0]["name"],        # take the name from the first row of the group
                    "subname": group[0]["subname"],  # take the subname from the first row of the group
                    "data": {
                        "attr1": {"name": "value1", "values": []},  # place to store the value1 entries
                        "attr2": {"name": "value2", "values": []}   # place to store the value2 entries
                        }
                    }
        for row in group:
            # each row is a dictionary like {'subname': 'a', 'value2': '25', 'value1': '5', 'name': 'b'}
            groupRep["data"]["attr1"]["values"].append(row["value1"])
            groupRep["data"]["attr2"]["values"].append(row["value2"])
        # store the representation of the group in values
        values.append(groupRep)

Pretty Printing:

PP = pprint.PrettyPrinter()       
PP.pprint(values)

Gets:

[{'data': {'attr1': {'name': 'value1', 'values': ['1', '2', '3', '4']},
           'attr2': {'name': 'value2', 'values': ['21', '22', '23', '24']}},
  'name': 'a',
  'subname': 'a'},
 {'data': {'attr1': {'name': 'value1', 'values': ['5', '6', '7', '8']},
           'attr2': {'name': 'value2', 'values': ['25', '26', '27', '28']}},
  'name': 'b',
  'subname': 'a'},
 {'data': {'attr1': {'name': 'value1', 'values': ['9', '10', '11', '12']},
           'attr2': {'name': 'value2', 'values': ['29', '30', '31', '32']}},
  'name': 'c',
  'subname': 'c'}]
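
Note that the values come back as strings, because the csv module does no type conversion. The question also mentions that the real data has more columns. Here is a sketch of one way to handle both, assuming that the first two columns are always name and subname, that every remaining column holds integers, and that the rows of a group are adjacent in the file (which groupby requires). The helper name rows_to_groups and the in-memory sample are only illustrative.

```python
import csv
import io
from itertools import groupby

# Small in-memory stand-in for the real file (assumed comma-separated).
SAMPLE = """name,subname,value1,value2,value3
a,a,1,21,41
a,a,2,22,42
b,a,5,25,45
"""

def rows_to_groups(f):
    reader = csv.DictReader(f)
    # Assumption: everything after the first two columns is a value column.
    value_cols = reader.fieldnames[2:]
    out = []
    # Group adjacent rows that share both name and subname.
    for (name, subname), rows in groupby(
            reader, key=lambda r: (r["name"], r["subname"])):
        rows = list(rows)
        data = {
            # attr1, attr2, ... numbered in header order
            f"attr{i}": {"name": col, "values": [int(r[col]) for r in rows]}
            for i, col in enumerate(value_cols, start=1)
        }
        out.append({"name": name, "subname": subname, "data": data})
    return out

groups = rows_to_groups(io.StringIO(SAMPLE))
```

If some columns are not integers, the int call would need to be replaced by whatever conversion fits the data.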
Licensed under: CC-BY-SA with attribution