Question

I tried to reproduce the simple example of using segue from https://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/

Cluster creation was successful:

> cl <- createCluster(numInstances=2)
STARTING - 2012-05-27 14:02:08
STARTING - 2012-05-27 14:02:39
STARTING - 2012-05-27 14:03:10
STARTING - 2012-05-27 14:03:42
STARTING - 2012-05-27 14:04:13
STARTING - 2012-05-27 14:04:44
STARTING - 2012-05-27 14:05:15
STARTING - 2012-05-27 14:05:46
STARTING - 2012-05-27 14:06:17
BOOTSTRAPPING - 2012-05-27 14:06:48
BOOTSTRAPPING - 2012-05-27 14:07:19
BOOTSTRAPPING - 2012-05-27 14:07:50
BOOTSTRAPPING - 2012-05-27 14:08:21
BOOTSTRAPPING - 2012-05-27 14:08:52
BOOTSTRAPPING - 2012-05-27 14:09:23
BOOTSTRAPPING - 2012-05-27 14:09:55
WAITING - 2012-05-27 14:10:26
Your Amazon EMR Hadoop Cluster is ready for action. 
Remember to terminate your cluster with stopCluster().
Amazon is billing you!

Local simulation was OK, but running it on the cluster returned an error each time.

> myList <- NULL
> set.seed(1)
> for (i in 1:10){
  +   a <- c(rnorm(999), NA)
  +   myList[[i]] <- a
  + }
> outputLocal  <- lapply(myList, mean, na.rm=T)
> outputEmr   <- emrlapply(cl, myList, mean,  na.rm=T)
RUNNING - 2012-05-27 14:11:58
RUNNING - 2012-05-27 14:12:29
RUNNING - 2012-05-27 14:13:00
WAITING - 2012-05-27 14:13:31
Error in lines[[i]] : subscript out of bounds
> stopCluster(cl)

I like the idea of this package and I hope it will be useful in my work, but I cannot figure out how to solve this basic problem.

segue version: 0.02

OS: Ubuntu 11.10

UPDATE: I tried another example, the Pi estimation test case, and emrlapply returned the same error message.

UPDATE2: I updated to version 0.03 and now I cannot connect to the cluster. After a successful start, the instances tried to shut down, with no effect, so I terminated them via the AWS console. So the old problem was solved, but a new one appeared:

> cl <- createCluster(numInstances=2)
STARTING - 2012-06-01 22:36:10
STARTING - 2012-06-01 22:36:41
STARTING - 2012-06-01 22:37:12
STARTING - 2012-06-01 22:37:43
STARTING - 2012-06-01 22:38:14
STARTING - 2012-06-01 22:38:46
SHUTTING_DOWN - 2012-06-01 22:39:17
SHUTTING_DOWN - 2012-06-01 22:39:48
...
SHUTTING_DOWN - 2012-06-01 22:48:05
SHUTTING_DOWN - 2012-06-01 22:48:36
FAILED - 2012-06-01 22:49:07
>

Solution

It appears that Amazon changed the EMR service to default to version 1.0 of the EMR AMI when no specific version is requested. Since Jan 1, the behavior had been to default to the latest version. When I changed Segue to default to a recent version, I then ran into the current incarnation of Hadoop wanting output to be written to a sub-bucket on S3.
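To illustrate how a changed output location could surface as the indexing error in the question, here is a minimal sketch in plain R. It assumes (this is a guess, not taken from the Segue source) that the results are read back into a list and indexed by task number; the `lines` variable below is a hypothetical stand-in for that list.

```r
# Hypothetical stand-in: no result lines were read back from the cluster,
# e.g. because the output was written somewhere other than expected.
lines <- list()

# Indexing past the end of a list with [[ ]] raises an error in R.
msg <- tryCatch(
  lines[[1]],
  error = function(e) conditionMessage(e)
)
print(msg)
```

If that assumption holds, the error says nothing about the function passed to emrlapply; it only means fewer results came back than were expected.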

I had to upgrade the Java AWS API code to the latest version in order to make these changes.

The new version of the tarball is here: http://code.google.com/p/segue/downloads/list, or you can clone the source and build it yourself, if you're into that sort of thing.

I've bumped the Segue version to 0.03 with this change.

EDIT: I just found that m1.small is a problem (it is 32 bit), so I've changed the default instance type and changed the behavior so that users can no longer specify m1.small. The new version is 0.04.
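The kind of guard described in the edit above can be sketched as follows. This is only an illustration: `checkInstanceType` is a hypothetical helper, not a function in Segue, and "m1.large" is an arbitrary example value rather than a statement of Segue's new default.

```r
# Hypothetical helper, not part of Segue's API: reject instance types
# that are known not to work (here, the 32 bit m1.small).
checkInstanceType <- function(instanceType, disallowed = c("m1.small")) {
  if (instanceType %in% disallowed) {
    stop("instance type '", instanceType, "' is not supported (32 bit)")
  }
  invisible(instanceType)
}

# An allowed type passes silently and is returned invisibly.
ok <- checkInstanceType("m1.large")

# A disallowed type stops with an error; capture the message for inspection.
msg <- tryCatch(
  checkInstanceType("m1.small"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Failing before any cluster is requested avoids starting instances that would only shut down again.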

License: CC-BY-SA with attribution
Not affiliated with StackOverflow