Question

Which is faster: a boost::multi_array or a std::vector? I will have roughly 17,179,869 elements (the count is not constant) stored in 3 dimensions, which need to be accessed very rapidly and very often inside a for loop. Which would perform best: a std::vector or a boost::multi_array?

(I don't expect it to finish within a second, but I would like it to be as efficient as possible, because even a nanosecond difference per access adds up to a lot of time at this scale.)

Was it helpful?

Solution

The best advice is to benchmark it yourself.

In any case, since you seem to have a constant size, there are other options:

  • plain C arrays (e.g. `int data[X][Y][Z]`)
  • a plain one-dimensional C array in which you compute the indices yourself, e.g. `x*W*H + y*W + z`; this can be handy in some situations
  • std::array, which is basically a C array with some syntactic sugar borrowed from the STL containers
  • std::vector, which I guess is the first solution worth trying
  • boost::multi_array, which is meant to support N-dimensional arrays, so it may be overkill for your purpose, but it probably has better data locality than nested vectors.

OTHER TIPS

Those library vector classes are designed to be easy to use and relatively fail-safe. They are as fast as they can be within their design, but nothing can beat doing it yourself (except maybe hand-coded assembly). For the size you're talking about (2e10 elements), I would be more concerned with efficiency than with user-friendliness. If your innermost loop does very little computation per element, you're going to find the indexing calculations dominant, which suggests doing some unrolling and pointer-stepping. (Maybe you can count on the compiler to do some unrolling, but I don't care for maybes.)

The only way to know for sure is to try both and profile the code. However, here are some ideas about what I think you'll find.

  1. For the large number of elements you are dealing with (2e10+), accessing individual elements is not going to matter as much as the cache pressure of loading those elements into the CPU cache. The prefetcher is going to be sitting there trying to load those elements, and that is going to take a large proportion of the time.
  2. Accessing non-contiguous 2D (or 3D) C arrays means the CPU has to go around fetching things from different parts of memory. boost::multi_array solves that somewhat by storing the data as a contiguous block behind the scenes, but it has its own overheads for doing so. As @Jack said, plain 1D arrays with computed indices are best, and even then you can do things to keep the indexing arithmetic minimal (e.g. memoization).
  3. The work you do within the loop is going to affect your timings significantly. The branch predictor is going to be the biggest contributor. If it's a simple math operation with no if/else statements, you'll likely get the best performance, and the compiler will likely optimise it to SSE instructions. If you have composite types (rather than int/float/char), then you're going to have to lay them out right to optimise access.
  4. I would almost suggest trying both, then coming back with a new SO question that includes your loop and asks how to optimise that part. Almost always, the compiler can be given hints to ensure it knows your intentions.

At the end of the day, try it and see.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow