Question

I have a Hadoop process that connects to a Cassandra keyspace in the reduce part; data is saved through playORM. I am running the Hadoop process and Cassandra on the same machine, so playORM simply connects to Cassandra on localhost. When I process a small amount of data the job runs completely fine, but with larger amounts (just 500,000 records in this case) I get the exception below. I wonder if it could be a problem in the Astyanax pool configuration (which is handled by playORM, so I don't know how to change those settings), in playORM itself, or even in my Cassandra config. Everything is running on a single host for now, and I suspect things may get worse once we configure the cluster and many Hadoop machines connect to many Cassandra machines.

Any hint of what might be wrong?

CF=[tablename=Localization] persist rowkey=1bd9b46a-5b66-41ae-9756-dd91f44194ea
CF=User index persist(cf=[tablename=User])=[rowkey=/User/id] (table found, colmeta not found)
CF=[tablename=User] persist rowkey=1bd9b46a-5b66-41ae-9756-dd91f44194ea
java.lang.RuntimeException: com.netflix.astyanax.connectionpool.exceptions.ConnectionAbortedException: ConnectionAbortedException: [host=localhost(127.0.0.1):9160, latency=611(611), attempts=1] org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at com.alvazan.orm.layer9z.spi.db.cassandra.CassandraSession.sendChanges(CassandraSession.java:110)
        at com.alvazan.orm.logging.NoSqlRawLogger.sendChanges(NoSqlRawLogger.java:50)
        at com.alvazan.orm.layer5.nosql.cache.NoSqlWriteCacheImpl.flush(NoSqlWriteCacheImpl.java:125)
        at com.alvazan.orm.layer5.nosql.cache.NoSqlReadCacheImpl.flush(NoSqlReadCacheImpl.java:178)
        at com.alvazan.orm.layer0.base.BaseEntityManagerImpl.flush(BaseEntityManagerImpl.java:182)
        at com.s1mbi0se.dmp.da.dao.UserDao.insertOrUpdateUser(UserDao.java:24)
        at com.s1mbi0se.dmp.da.dao.UserDao.insertOrUpdateUserLocalization(UserDao.java:75)
        at com.s1mbi0se.dmp.da.service.DataAccessService.insertLocalizationForUser(DataAccessService.java:44)
        at com.s1mbi0se.dmp.module.LocalizationModule.persistData(LocalizationModule.java:218)
        at com.s1mbi0se.dmp.processor.mapred.SelectorReducer.reduce(SelectorReducer.java:60)
        at com.s1mbi0se.dmp.processor.mapred.SelectorReducer.reduce(SelectorReducer.java:1)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: com.netflix.astyanax.connectionpool.exceptions.ConnectionAbortedException: ConnectionAbortedException: [host=localhost(127.0.0.1):9160, latency=611(611), attempts=1] org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:193)
        at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
        at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:27)
        at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$1.execute(ThriftSyncConnectionFactoryImpl.java:131)
        at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:52)
        at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:229)
        at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:455)
        at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$400(ThriftKeyspaceImpl.java:62)
        at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:115)
        at com.alvazan.orm.layer9z.spi.db.cassandra.CassandraSession.sendChangesImpl(CassandraSession.java:131)
        at com.alvazan.orm.layer9z.spi.db.cassandra.CassandraSession.sendChanges(CassandraSession.java:108)
        ... 14 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:913)
        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:899)
        at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:121)
        at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:118)
        at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:55)
        ... 23 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 36 more

Solution

NOTE: I think I ran into this once as well; I upped the timeouts or connection pool sizes in Astyanax and it went away, so try that too (though a connection reset is GENERALLY the far server's fault, i.e. Cassandra's).
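
For reference, this is roughly what tuning those knobs looks like against the raw Astyanax API. It is only a sketch: the cluster/keyspace names and numbers below are placeholders, and since playORM builds its own Astyanax context, how (or whether) you can pass these values through playORM is exactly the open question here.

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxPoolSketch {
    // Placeholder names and values; tune them for your own setup.
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")
                .forKeyspace("MyKeyspace")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.NONE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                        .setPort(9160)
                        .setSeeds("127.0.0.1:9160")
                        .setMaxConnsPerHost(20)      // more headroom if several threads write at once
                        .setConnectTimeout(10000)    // ms
                        .setSocketTimeout(60000))    // ms; generous for large batch_mutate calls
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();                  // getEntity() on older Astyanax releases
    }
}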

A connection reset is typically because the other end (Cassandra) closed the connection on you. To be 100% sure, capture the traffic with Wireshark and you should see which end is closing the socket.

Be careful what you read in this post here:

java.net.SocketException: Connection reset

Basically, I wrote channelmanager on SourceForge before MINA, Netty, etc. existed. Mostly, you get -1 when the other end closes the socket PROPERLY, i.e. it sends the proper teardown packets. If it just disappears, you can end up with neat exceptions like Connection reset.
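
A tiny illustration of that difference from the reading side (not from the original post; it assumes something is listening on the port):

import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketException;

public class ResetVsEof {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 9160)) {   // assumes a listener on 9160
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[1024];
            try {
                int n = in.read(buf);
                if (n == -1) {
                    // Peer sent a proper FIN: the read simply returns -1.
                    System.out.println("orderly close");
                }
            } catch (SocketException e) {
                // Peer vanished or sent an RST: the read throws instead,
                // typically with the message "Connection reset".
                System.out.println("abrupt close: " + e.getMessage());
            }
        }
    }
}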

I suggest fiddling with the Astyanax connection pool. Look at Wireshark too, google how a TCP teardown is supposed to happen, and see whether Cassandra failed to tear it down properly.

If you are on Linux, try netstat -anp | grep {pid} so you can see which ports your client process is using, and then look for packets on those ports in Wireshark. Also, do a test to make sure Astyanax is keeping its pool intact: run that netstat command a few times while the job runs to make sure Astyanax is not creating sockets, deleting them, and creating them again (if it deleted one and you then wrote to it, you could get the above error). A client-side sketch of the same check is below.
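
If you can get a handle on the Astyanax connection pool monitor (playORM creates the context internally, so reaching it may not be possible without patching), its counters give a client-side version of that churn check. Again, just a sketch:

import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;

public class PoolChurnCheck {
    // Call this a few times while the job runs, same idea as re-running netstat.
    public static void report(CountingConnectionPoolMonitor monitor) {
        long created = monitor.getConnectionCreatedCount();
        long closed = monitor.getConnectionClosedCount();
        System.out.println("connections created=" + created + ", closed=" + closed);
        // If both numbers keep climbing during the job, sockets are being torn
        // down and rebuilt rather than reused, which is the case where a write
        // can land on a connection that was just closed underneath it.
    }
}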

The Java NIO stuff was never completely reliable under the covers; to this day I still have unit tests demonstrating bugs in the NIO libraries on different OSes.

Out of curiosity, how much are you flushing down the pipe? I notice you are doing a write, and the read that failed was basically trying to get the status of whether that write succeeded.
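
If a large chunk of those 500,000 records is going out in a single flush(), one thing to try is flushing in smaller batches so each Thrift batch_mutate stays modest. This is a sketch, not the poster's actual code: only put()/flush() from playORM's NoSqlEntityManager are assumed (flush() is visible in the stack trace above), and the batch size is an arbitrary starting point.

import com.alvazan.orm.api.base.NoSqlEntityManager;

public class BatchingWriter {
    private static final int BATCH_SIZE = 500;   // arbitrary starting point; tune it
    private final NoSqlEntityManager em;
    private int pending;

    public BatchingWriter(NoSqlEntityManager em) {
        this.em = em;
    }

    public void save(Object entity) {
        em.put(entity);
        if (++pending >= BATCH_SIZE) {
            flush();
        }
    }

    public void flush() {
        em.flush();      // one moderate batch_mutate per BATCH_SIZE entities
        pending = 0;     // remember a final flush() from the reducer's cleanup()
    }
}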

In the coming months, we hope to have a generic map/reduce layer that feeds your map/reduce code the actual entities. We finally found, and are sending an offer to, a new developer who will join us soon to help with the workload.

Another good post to read is this:

http://kb.realvnc.com/questions/75/I%27m+receiving+the+error+%22Connection+reset+by+peer+%2810054%29%22.+

Wireshark can really show you the detail of what happened at the TCP layer. I have been meaning to look in more detail into whether it was Astyanax's or Cassandra's fault, but have not had time.

Dean

Licensed under: CC-BY-SA with attribution