Scalding: parsing comma-separated data with header

https://stackoverflow.com//questions/25000142

20-12-2019
Question

I have data in format:

"header1","header2","header3",...
"value11","value12","value13",...
"value21","value22","value23",...
....

What is the best way to parse it in Scalding? I have over 50 columns altogether, but I am only interested in some of them. I tried importing it with Csv("file"), but that doesn't work.

The only solution that comes to mind is to parse it manually with TextLine and disregard the line with offset == 0. But I'm sure there must be a better solution.

Solution

In the end I solved it by parsing each line manually as follows:

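def tipPipe = TextLine("tip").read.mapTo('line -> ('field1, 'field5)) {
  line: String =>
    // Splitting on "," leaves a stray quote at the start of the first column.
    val arr = line.split("\",\"")
    // Use the fifth column only when the row has all 88 columns.
    (arr(0).replace("\"", ""), if (arr.size >= 88) arr(4) else "unknown")
}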
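The splitting logic above can also be pulled out into a plain function, which makes it easy to check outside a Scalding job. The following is only a sketch: `QuotedLine`, `columns` and `pick` are hypothetical names, not part of Scalding, and it assumes that no value contains the sequence `","` itself.

```scala
// Hypothetical helper, not part of Scalding.
object QuotedLine {
  // Trim the outer quotes, then split on the "," separator.
  // The limit -1 keeps trailing empty columns.
  def columns(line: String): Array[String] =
    line.trim.stripPrefix("\"").stripSuffix("\"").split("\",\"", -1)

  // Returns (first column, fifth column). The fifth column falls back to
  // "unknown" when the line has fewer than `expected` columns
  // (88 in the data described in the question).
  def pick(line: String, expected: Int = 88): (String, String) = {
    val arr = columns(line)
    (arr(0), if (arr.length >= expected) arr(4) else "unknown")
  }
}
```

Inside the job, the body of the mapTo closure would then reduce to QuotedLine.pick(line). Note that this approach still treats the header line as data unless it is filtered out separately.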

Other tips

It looks like your data set has 88 fields, which is well over the 22-element limit of Scala tuples. Have a read of:

https://github.com/twitter/scalding/wiki/Frequently-asked-questions#what-if-i-have-more-than-22-fields-in-my-data-set

See text from above link here:

What if I have more than 22 fields in my data-set?

Many of the examples (e.g. in the tutorial/ directory) show that the fields argument is specified as a Scala Tuple when reading a delimited file. However Scala Tuples are currently limited to a maximum of 22 elements. To read-in a data-set with more than 22 fields, you can use a List of Symbols as fields specifier. E.g.

 val mySchema = List('first, 'last, 'phone, 'age, 'country)
 val input = Csv("/path/to/file.txt", separator = ",", fields = mySchema)
 val output = Tsv("/path/to/out.txt")
 input.read
      .project('age, 'country)
      .write(output)

Another way to specify fields is using Scala Enumerations, which is available in the develop branch (as of Apr 2, 2013), as demonstrated in Tutorial 6:

object Schema extends Enumeration {
   val first, last, phone, age, country = Value // arbitrary number of fields
}

import Schema._

Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
  .read
  .project(first, age)
  .write(Tsv("tutorial/data/output6.tsv"))
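To see why an Enumeration can stand in for a list of field names, it helps to look at what it carries: each Value has a name and an id, both in declaration order. The sketch below uses only the Scala standard library. `SchemaDemo` is a hypothetical name, and how Scalding turns the Enumeration into Cascading fields internally is not shown here.

```scala
object Schema extends Enumeration {
  val first, last, phone, age, country = Value
}

// Hypothetical demo object, only for inspecting the Enumeration.
object SchemaDemo {
  // Names of all values, ordered by id (that is, declaration order).
  val names: List[String] = Schema.values.toList.map(_.toString)

  // Zero-based position of a value within the schema.
  def position(v: Schema.Value): Int = v.id
}
```

Because the order is fixed by the declaration, the Enumeration must list the columns in the same order as they appear in the file.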

So when reading your file, supply a schema that names all 88 fields, using either a List of Symbols or an Enumeration as described in the quote above.

To skip the header line, additionally pass skipHeader = true to the Csv constructor.

Csv("tutorial/data/phones.txt", fields = Schema, skipHeader = true)
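For the 88-column file from the question, a full schema and skipHeader might be combined roughly as follows. This is a sketch under assumptions: the names field1 to field88 are placeholders for the real header names, `TipSchema` and `TipJob` are hypothetical, and the input and output paths are taken from hypothetical --input and --output job arguments. It relies on Csv's default quote character being the double quote.

```scala
import com.twitter.scalding._

// Placeholder column names field1..field88; substitute the real header names.
object TipSchema {
  val names: List[Symbol] = (1 to 88).map(i => Symbol("field" + i)).toList
}

class TipJob(args: Args) extends Job(args) {
  // skipHeader = true drops the header line; the List of Symbols
  // sidesteps the 22-element limit of Scala tuples.
  Csv(args("input"), fields = TipSchema.names, skipHeader = true)
    .read
    .project('field1, 'field5) // keep only the columns of interest
    .write(Tsv(args("output")))
}
```

Generating the names programmatically is just a convenience for the example; spelling out meaningful names makes the later project call easier to read.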

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow