I'm working on a project to learn Cascading, and I'm stumped on this problem. Cascading doesn't seem to offer anything that reads just the first line of each individual file in a directory, which I need in order to discover the content type through text analysis. I've looked through the Cascading API documentation for several hours and nothing has jumped out at me as useful, and Google doesn't turn up a similar question on any other forum.

I'd like to avoid running the jar for each file individually. So, instead of:

hadoop jar myApp.jar this.package.myApp inputPath/file.txt outputPath/

I'd like this:

hadoop jar myApp.jar this.package.myApp inputPath/ outputPath/
Any help?

Solution

I'll answer my own question here. It turns out that reading the first line of each file in a directory, one file at a time, is not what the Cascading libraries are best at. I ended up switching to the org.apache.hadoop APIs and wrote the following code (this all lives inside my main method):

// Requires: java.io.BufferedReader, java.io.InputStreamReader,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.Path
String inputPath = args[0];
Path inputDir = new Path(inputPath);
FileSystem lfs = FileSystem.get(new Configuration());
FileStatus[] files = lfs.listStatus(inputDir);
for (int x = 0; x < files.length; x++) {
    // open each file and read only its first line
    BufferedReader br = new BufferedReader(new InputStreamReader(lfs.open(files[x].getPath())));
    String line = br.readLine();
    br.close();
}
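
For reference, here is a slightly tidied-up variant of the same loop. It's a minimal sketch assuming Java 7+ (for try-with-resources) and Hadoop's FileStatus.isFile(); it skips subdirectories and closes each reader automatically. The detectContentType helper is hypothetical, standing in for whatever text analysis you run on the first line.

for (FileStatus status : lfs.listStatus(inputDir)) {
    if (!status.isFile()) {
        continue; // skip subdirectories in the input path
    }
    try (BufferedReader br = new BufferedReader(new InputStreamReader(lfs.open(status.getPath())))) {
        String firstLine = br.readLine();                   // may be null if the file is empty
        String contentType = detectContentType(firstLine);  // hypothetical helper for the text analysis
        // ... then run the appropriate Cascading flow for this file based on contentType
    }
}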