High throughput vs low latency in HDFS

Question 1

I think what you've described is more like the difference between optimizing for different access patterns (sequential, batch vs random access) than the difference between throughput and latency in the purest sense.

When I think of a high-latency system, I'm not thinking about which record I'm accessing, but rather that accessing any record at all carries a high fixed overhead. Reading even just the first byte of a file from HDFS can take around a second or more.

If you're more quantitatively inclined, you can think about the total time required to access a number of records N as T(N) = aN + b. Here, a is the per-record cost (the inverse of throughput, so a low a means high throughput), and b is the fixed overhead paid on every request (the latency). With a system like HDFS, N is often so large that b becomes irrelevant, and tradeoffs favoring a low a are beneficial. Contrast that with a low-latency data store, where each read often accesses only a single record, so optimizing for a low b is better.
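To make the T(N) = aN + b model concrete, here is a minimal sketch in Python. The two stores and all of their numbers are hypothetical, picked only to show the shape of the tradeoff (they are not measurements of HDFS or of any real database); a is taken as seconds per record and b as seconds of fixed overhead.

```python
def total_time(n, per_record_cost, fixed_overhead):
    """T(N) = a*N + b, in seconds."""
    return per_record_cost * n + fixed_overhead


def crossover(batch, random_access):
    """Record count N at which both stores take the same total time."""
    return (batch["fixed_overhead"] - random_access["fixed_overhead"]) / (
        random_access["per_record_cost"] - batch["per_record_cost"]
    )


# Hypothetical numbers, for illustration only.
BATCH_STORE = {"per_record_cost": 1e-6, "fixed_overhead": 1.0}     # low a, high b
RANDOM_STORE = {"per_record_cost": 1e-3, "fixed_overhead": 0.005}  # high a, low b

for n in (1, 1_000, 1_000_000):
    t_batch = total_time(n, **BATCH_STORE)
    t_random = total_time(n, **RANDOM_STORE)
    print(f"N={n:>9,}  batch={t_batch:10.3f}s  random={t_random:10.3f}s")

print(f"crossover at N ~ {crossover(BATCH_STORE, RANDOM_STORE):.0f} records")
```

With these made-up values, the low-b store wins for a single record and the low-a store wins for a million, with the break-even point at roughly a thousand records.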

With that said, your statement isn't incorrect. It is often the case that batch access stores have high latency and high throughput, whereas random access stores have low latency and low throughput, but this is not strictly always the case.

Question 2

I'll take a swing at this one.

Low-latency data access: I hit the enter key (or submit button) and I expect results within seconds at most. My database query time should be sub-second.

High throughput of data: I want to scan millions of rows of data and count or sum some subset. I expect this to take a few minutes (or much longer, depending on complexity) to complete. Think batch-style jobs.

Caveats: This is really a map/reduce issue as well. Setting up and processing an M/R job carries a fair amount of overhead of its own, and that fixed cost is paid no matter how little data the job touches. There are a couple of projects working now to move toward lower-latency data access.

Also, HDFS stores data in blocks and distributes them across many nodes. This means that there will (almost) always be some network data transfer required to get the final answer, and that "slows" things down a bit, depending on network throughput and various other factors.
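As a rough illustration of the block layout, the sketch below counts how many blocks a file of a given size would be split into. It assumes a 128 MiB block size, which is the default `dfs.blocksize` in recent Hadoop releases but is configurable per cluster and per file; the function name and the sample file sizes are made up for the example.

```python
MIB = 1024 * 1024
ASSUMED_BLOCK_SIZE = 128 * MIB  # assumption: cluster uses the common 128 MiB default


def block_count(file_size_bytes, block_size_bytes=ASSUMED_BLOCK_SIZE):
    """Number of blocks needed to hold the file (integer ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


# Hypothetical file sizes, for illustration only.
for label, size in [("1 KiB", 1024), ("1 GiB", 1024 * MIB), ("1 TiB", 1024 * 1024 * MIB)]:
    print(f"{label:>6} file -> {block_count(size):>5} block(s)")
```

A large file ends up as thousands of blocks spread over the cluster, which is great for scanning in parallel, while a tiny read still has to locate and contact at least one remote block.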

Hope that helps. :)