Question

I'm working on a project to learn Cascading, and I'm stumped on this problem. Cascading doesn't seem to offer a way to read only the first line of each file in a directory, which I need in order to infer each file's content type from its text. I've spent several hours in the Cascading API documentation without finding anything useful, and searching Google hasn't turned up a similar question on other forums.

I'd like to avoid running the jar for each file individually. So, instead of:

hadoop jar myApp.jar this.package.myApp inputPath/file.txt outputPath/

I'd like this:

hadoop jar myApp.jar this.package.myApp inputPath/ outputPath/

Solution

I'll answer my own question here. It turns out that reading the first line of each file in a directory, one file at a time, is not what Cascading is best suited for. I ended up using the org.apache.hadoop API directly and wrote the following code (all of it inside my main method):

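// Needs BufferedReader and InputStreamReader from java.io, Configuration from
// org.apache.hadoop.conf, and FileSystem, FileStatus, Path from org.apache.hadoop.fs.
String inputPath = args[0];
Path inputDir = new Path(inputPath);
FileSystem lfs = FileSystem.get(new Configuration());
// List the entries directly under the input directory.
FileStatus[] files = lfs.listStatus(inputDir);
for (int x = 0; x < files.length; x++) {
    BufferedReader br = new BufferedReader(new InputStreamReader(lfs.open(files[x].getPath())));
    String line = br.readLine(); // first line of this file, or null if the file is empty
    br.close(); // release the underlying stream before moving to the next file
}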
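
One thing the loop has to take care of is closing each stream it opens, including when readLine() throws. A minimal sketch of a helper that handles this with try-with-resources is below. The class name FirstLineReader, the method name readFirstLine, and the choice of UTF-8 are my own assumptions for illustration and are not part of Cascading or Hadoop. It takes a plain java.io.InputStream, so it uses only the JDK, and the demo feeds it an in-memory stream instead of a real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Cascading or Hadoop.
class FirstLineReader {

    // Returns the first line of the stream, or null if the stream is empty.
    // try-with-resources closes the reader (and the wrapped stream) even if
    // readLine() throws. UTF-8 is an assumption; adjust it to match your files.
    static String readFirstLine(InputStream in) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file's contents.
        InputStream demo = new ByteArrayInputStream(
                "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFirstLine(demo)); // prints "id,name"
    }
}
```

Since the stream returned by lfs.open(...) is an InputStream, the body of the loop could then become String line = FirstLineReader.readFirstLine(lfs.open(files[x].getPath()));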
Licensed under: CC-BY-SA with attribution