WebLogic EJB calls begin to fail under moderate load with OptionalDataException

https://stackoverflow.com/questions/2454234

20-09-2019

Question

Our system configuration consists of two WebLogic 10.3 servers: one hosts the presentation layer and the other hosts the EJBs. The system works fine under moderate load for some time (one to several days), after which EJB method calls from the presentation server to the EJB server start to fail with the following error:

java.rmi.RemoteException: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is: java.io.OptionalDataException

Stack Trace:

java.io.OptionalDataException
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
    at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
    at weblogic.jndi.internal.RootNamingNode_WLSkel.invoke(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.invoke(BasicServerRef.java:589)
    at weblogic.rmi.cluster.ClusterableServerRef.invoke(ClusterableServerRef.java:230)
    at weblogic.rmi.internal.BasicServerRef$1.run(BasicServerRef.java:477)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
    at weblogic.security.service.SecurityManager.runAs(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.handleRequest(BasicServerRef.java:473)
    at weblogic.rmi.internal.wls.WLSExecuteRequest.run(WLSExecuteRequest.java:118)

Once the first OptionalDataException is encountered, all subsequent calls fail with the same result. Some sources suggest that this can be related to clustering with a misconfigured multicast port. However, these servers do not belong to a cluster.

Restarting the EJB server always resolves the problem temporarily, but it recurs after some time.
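For context, java.io.OptionalDataException is what ObjectInputStream throws when readObject() finds primitive data (or the end of an object's custom data) where it expected an object, which usually means the writer and the reader disagree about what is in the stream. The following is a minimal standalone sketch of that condition using only the JDK; the class name OptionalDataDemo and the method provoke() are made up for illustration and have nothing to do with the WebLogic internals in the stack trace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OptionalDataException;

public class OptionalDataDemo {

    /**
     * Writes a single primitive int and then tries to read an object
     * from the same bytes. Returns the resulting exception, or null
     * if no OptionalDataException was thrown.
     */
    static OptionalDataException provoke() throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(42); // primitive data, not an object
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject(); // the reader expects an object here
            return null;
        } catch (OptionalDataException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        OptionalDataException e = provoke();
        // e.length reports how many bytes of primitive data are readable,
        // e.eof reports whether the end of the optional data was reached.
        System.out.println("eof=" + e.eof + ", length=" + e.length);
    }
}
```

In a remote call the same mismatch can appear when the bytes produced on one side do not match what the deserializer on the other side expects.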

Update: it seems that the problem is not related to an overflow in the number of socket connections after all (see my own answer below). After disabling network class loading, we ran very steadily for a week, after which we again started receiving OptionalDataExceptions on the presentation server (stack trace below). It is very strange that the system works fine for a week and then starts to fail.

javax.naming.CommunicationException [Root exception is java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
    java.io.OptionalDataException]
    at weblogic.jndi.internal.ExceptionTranslator.toNamingException(ExceptionTranslator.java:74)
    at weblogic.jndi.internal.WLContextImpl.translateException(WLContextImpl.java:439)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:395)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:380)
    at javax.naming.InitialContext.lookup(InitialContext.java:392)
    ...
Caused by: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
    java.io.OptionalDataException
    at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.java:234)
    at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:348)
    at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:259)
    at weblogic.jndi.internal.ServerNamingNode_1030_WLStub.lookup(Unknown Source)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:392)  
    ... 38 more
Caused by: java.io.OptionalDataException
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
    at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
    at weblogic.jndi.internal.RootNamingNode_WLSkel.invoke(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.invoke(BasicServerRef.java:589)
    at weblogic.rmi.cluster.ClusterableServerRef.invoke(ClusterableServerRef.java:230)
    at weblogic.rmi.internal.BasicServerRef$1.run(BasicServerRef.java:477)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
    at weblogic.security.service.SecurityManager.runAs(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.handleRequest(BasicServerRef.java:473)
    at weblogic.rmi.internal.wls.WLSExecuteRequest.run(WLSExecuteRequest.java:118)
    ... 2 more

We obtain the initial context in the standard way:

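import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
// serverPath holds the provider URL
Properties p = new Properties();
p.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
p.put(Context.PROVIDER_URL, serverPath);
Context context = new InitialContext(p);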

Calls on all references obtained this way also fail with a similar OptionalDataException. Restarting the presentation server resolves the problem only temporarily.

Solution

Finally, the OptionalDataExceptions are history. In short, a complex value object in our application code (used as a return value for remote method invocations) had a HashMap as an internal field. After we changed the type of this field to SynchronizedMap, the OptionalDataExceptions stopped occurring. It seems that somewhere in the legacy code this Map is handled in a non-thread-safe way.

What is strange is that this caused no problems with WLS 8.1, but somehow caused WLS 10 to enter a state in which all subsequent remote method invocations (including JNDI lookups) started to fail.
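The change described in this answer could look roughly like the sketch below. It is an illustration under assumptions, not the original code: the class ReportValue, its field attributes, and the helper roundTrip() are hypothetical names, and the wrapper is obtained through Collections.synchronizedMap, which is the usual way to get a synchronized map from the JDK. One plausible mechanism for the failure is that HashMap writes its entry count followed by its entries during serialization, so a concurrent modification can leave the two inconsistent in the stream. The JDK documents that the synchronized wrapper is serializable when the backing map is; in OpenJDK the wrapper also holds its lock while it serializes itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical value object returned from a remote method. */
public class ReportValue implements Serializable {
    private static final long serialVersionUID = 1L;

    // Before the fix (assumed): private final Map<String, String> attributes = new HashMap<String, String>();
    // After the fix: every access goes through a synchronized wrapper.
    private final Map<String, String> attributes =
            Collections.synchronizedMap(new HashMap<String, String>());

    public void put(String key, String value) {
        attributes.put(key, value);
    }

    public String get(String key) {
        return attributes.get(key);
    }

    /** Serializes and deserializes the value, as a remote call would do. */
    static ReportValue roundTrip(ReportValue value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ReportValue) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ReportValue original = new ReportValue();
        original.put("status", "ok");
        ReportValue copy = roundTrip(original);
        System.out.println(copy.get("status"));
    }
}
```

Note that the wrapper only protects individual calls. Code that iterates over the map still has to synchronize on the map object itself, as the Collections.synchronizedMap documentation requires.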

Other tips

Finally we found the solution to this (Edit: we later found out that this was not the root cause of the issue, but a separate serious issue. For the final solution, please see the accepted solution above). Once we started to receive the following exception, we got on the track of the cause:

<BEA-000403> <IOException occurred on socket: Socket[addr=/x.x.x.x,port=3266,localport=7001]
 java.net.SocketException: Connection refused.
java.net.SocketException: Connection refused
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at weblogic.socket.SocketMuxer.readReadySocketOnce(SocketMuxer.java:887)
at weblogic.socket.SocketMuxer.readReadySocket(SocketMuxer.java:859)
at weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:120)
at weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)

On the presentation server, which runs on a different host than the EJB server, we had the option

-Dweblogic.NetworkClassLoadingEnabled=true

to enable class loading from the EJB server. What we did not know is that this option can result in a huge number of network sockets being opened. Using netstat we found that several thousand sockets were in either the CLOSE_WAIT or the FIN_WAIT_2 state. It seems that, in addition to the classes, all the elements in the web UI were loaded from the EJB server, even though the war file on the presentation server contained all of them. The huge number of sockets did not produce "too many files" error messages because WebLogic removes the ulimit for files in its startup script. Using a test server we found that a single click in the web UI opened around 30 sockets between the two servers.
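A per-state socket count like the one described above can be produced by summarizing netstat output. The sketch below is one way to do that and is not the tooling used in the original investigation. It assumes whitespace-separated rows that start with "tcp" and carry the connection state in the last column, as Linux `netstat -ant` prints them; other platforms use a different layout and spelling (Linux, for example, prints FIN_WAIT2). The class SocketStateCounter and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SocketStateCounter {

    /**
     * Counts connections per TCP state.
     * Assumes each TCP row starts with "tcp" and ends with the state name.
     */
    static Map<String, Integer> countStates(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("tcp")) {
                continue; // skip headers and non-TCP rows
            }
            String[] columns = trimmed.split("\\s+");
            String state = columns[columns.length - 1];
            Integer previous = counts.get(state);
            counts.put(state, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not captured from the original servers.
        List<String> sample = Arrays.asList(
            "Proto Recv-Q Send-Q Local Address Foreign Address State",
            "tcp 0 0 10.0.0.1:7001 10.0.0.2:3266 CLOSE_WAIT",
            "tcp 0 0 10.0.0.1:7001 10.0.0.2:3267 CLOSE_WAIT",
            "tcp 0 0 10.0.0.1:7001 10.0.0.2:3268 FIN_WAIT2",
            "tcp 0 0 10.0.0.1:7001 10.0.0.2:3269 ESTABLISHED");
        System.out.println(countStates(sample));
    }
}
```

Running such a count at intervals makes it easy to see whether the number of sockets in CLOSE_WAIT or FIN_WAIT_2 keeps growing between the two servers.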

We removed this option and repackaged the war on the presentation server to contain all the needed classes, which removed the need for network class loading. This reduced the number of socket connections between the two servers from thousands to 1.

In summary, avoid network class loading in WebLogic if at all possible.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow