Question

I am trying the example from "http://thrift-tutorial.readthedocs.org/en/latest/usage-example.html". The example just calculates the product of two numbers. Server: Java, client: Python.

If I fetch the product via Thrift 3000 times, the elapsed time is ~4.8 s. If I write a simple multiply function in Python and call it directly 3000 times, the elapsed time is ~0.007 s (about 686x faster).
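
The timed loop looks essentially like this (a minimal sketch using the MultiplicationService client generated from the tutorial's .thrift file; names may differ slightly):

    import time

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol

    # generated by `thrift --gen py` from the tutorial's .thrift file
    from tutorial import MultiplicationService

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = MultiplicationService.Client(protocol)
    transport.open()

    start = time.time()
    for i in range(3000):
        client.multiply(i, i + 1)   # one blocking round trip per call
    print('elapsed: %.3f s' % (time.time() - start))

    transport.close()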

So how can I improve the performance? I want to build an application and split it into several sub-applications. They could be implemented in different languages and would communicate with each other via Thrift, but with performance this poor, should I consider combining them into a single application instead?

App-A (Java)                   App-B (Python)
     |                                 |
     |------------ App-C (C++) --------|

or

App-A+C (Java)                   App-B+C (Python)
(implement C in Java)            (implement C in Python)

Solution

Two key optimizations you can set as goals:

  • Send all the data you already have before waiting.
  • Don't send a computed result across the channel if the only thing done with it is to send it straight back.

What you have described in your question is an extreme case of a "chatty protocol". The network has latency (delay). If you wait for each result before starting the next computation, most of the time is spent waiting for the network transfer, not for the actual computation. By sending another computation before receiving the first result, you can improve throughput dramatically.

So the simplest thing is to allow overlapping requests. The product of the second pair of values doesn't depend on the first result, so don't wait for the first result to arrive.
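
One way to get that overlap from the Python client without touching the service is to keep several connections open and fan the calls out over them, for example with a small thread pool (a rough sketch, assuming the MultiplicationService client generated from the tutorial's .thrift file):

    import queue
    from concurrent.futures import ThreadPoolExecutor

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from tutorial import MultiplicationService   # generated client, as in the question

    POOL_SIZE = 8

    def make_client():
        transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
        transport.open()
        return MultiplicationService.Client(TBinaryProtocol.TBinaryProtocol(transport))

    # Thrift clients are not thread-safe, so keep one connection per in-flight call.
    clients = queue.Queue()
    for _ in range(POOL_SIZE):
        clients.put(make_client())

    def multiply(pair):
        client = clients.get()          # borrow a connection
        try:
            return client.multiply(*pair)
        finally:
            clients.put(client)         # hand it back for the next request

    pairs = [(i, i + 1) for i in range(3000)]
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        results = list(pool.map(multiply, pairs))

Each connection still handles its own calls one at a time, but with several requests in flight the network latency overlaps instead of adding up.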

When you are dealing with local IPC, that doesn't help as much. The cost of communication isn't delay, it's message processing and thread synchronization, which depend on the number of requests but not so much on their order.

A bigger change, with a larger payoff, is to make each request represent a complex algorithm. For example, instead of a remote call that multiplies two numbers, try a remote call for an entire filtering operation, where the argument is a whole data vector or matrix and the server performs the FFT, multiplication, inverse FFT, and scaling before passing the result back. This satisfies both of the original goals: all available data is sent together instead of one value at a time, reducing the time spent waiting, and total network traffic is reduced because intermediate results never have to be exchanged.
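
As a concrete illustration, a coarse-grained interface could look something like this (the service, method, and generated module names here are hypothetical, not part of the tutorial):

    # Hypothetical coarse-grained Thrift interface (sketch, not the tutorial's IDL):
    #
    #   service FilterService {
    #     list<double> filter_signal(1: list<double> samples, 2: double scale)
    #   }

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from filtering import FilterService   # would be generated by `thrift --gen py`

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = FilterService.Client(protocol)
    transport.open()

    samples = [float(i) for i in range(4096)]      # one whole data vector
    filtered = client.filter_signal(samples, 0.5)  # FFT, multiply, inverse FFT and scaling
                                                   # all run server-side; only the input
                                                   # and the final result cross the wire
    transport.close()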


A final alternative is to link code from all three languages into a single process, so that data access and function calls are direct. Many languages allow building shared libraries that export plain "C" functions and data.
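
From Python, for instance, a compiled library that exports plain C functions can be loaded and called in-process with ctypes (a sketch; the library name and signature below are made up):

    import ctypes

    # Hypothetical shared library built from the C/C++ number-crunching code,
    # exporting a plain C entry point such as:
    #     extern "C" double multiply(double a, double b);
    lib = ctypes.CDLL('./libcalc.so')
    lib.multiply.argtypes = [ctypes.c_double, ctypes.c_double]
    lib.multiply.restype = ctypes.c_double

    result = lib.multiply(3.0, 7.0)   # a direct in-process call: no serialization,
                                      # no socket, no thread hand-off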

Also, virtual machines such as .NET run an intermediate language that can be generated by compiling different source languages. With .NET you have C# (Java-like), C++/CLI (which supports full C++ plus extensions for working with .NET data), and IronPython, which covers the diagram in your question. Plus F#, JavaScript, a Ruby variant, and so on. The Java virtual machine was supposed to be language-specific, but people have written Clojure and other languages that compile to its bytecode.

The advantage of the virtual machine technique is that it enables some cross-language optimization (.NET JIT does cross-module inlining). The disadvantage is that your performance is dictated by JIT optimizations, which generally are the lowest common denominator. C++/CLI actually is really good for bridging this gap, because it supports fully-optimized native code (including SIMD), .NET intermediate language (MSIL), and the lowest overhead layer for communicating between them (C++ "It Just Works" interop).

But you could accomplish roughly the same thing on the Java VM by using JNI to call into fully-optimized C++ code (including SIMD) for the intense number crunching.

Other tips

Your comparison is based on an incorrect assumption: namely, that a cross-process call is (at least) as fast as an in-process call, which is simply not true.

This is one of the famous eight network fallacies, originated by Peter Deutsch and later extended by others, and it applies not only to networks but also to IPC on a single machine: contrary to what you assume, transport cost is NOT zero.

From what I can tell based on the limited information, your roughly 1.6 ms per IPC round trip (4.8 s / 3000 calls) doesn't sound bad to me at all.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow