Why speedup reduces with increase in number of pipeline stages?

Question 1

The latches themselves do have a small delay (they are after all "doing work", i.e., switching). By itself, this would only produce an asymptotic approach to a fixed peak performance value. E.g., starting from the (already unrealistically) tiny actual work time of each stage equaling the latch delay, a doubling of pipeline depth (excluding other, real-world constraints) would reduce cycle time to latch delay plus 1/2 latch delay (increasing clock speed by just over 33%) but doubling pipeline depth again would only reduce cycle time to latch delay plus 1/4 latch delay.

Even at an infinite number of pipeline stages with each stage (somehow) doing infinitesimal work, the minimum cycle time would equal one latch delay, doubling the clock speed relative to a pipeline depth where latch delay equals actual work time. On a slightly practical level, one transistor switch delay of real work is a relatively hard constraint.

However, before latch delay itself, prevents further improvement, other real-world factors limit the benefit of increased pipeline depth.

On the more physical level, excluding area and power/thermal density constraints, getting the clock signal to transition uniformly at very high precision across an entire design is challenging at such high clock speeds. Clock skew and jitter become more significant when there is less margin in work time to absorb variation. (This is even excluding variation in manufacturing or environmental conditions like temperature.)

Besides such more physical constraints, dependence constraints tend to prevent a deeper pipeline from increasing performance. While control dependencies (e.g., condition evaluation of a branch) can often be hidden by prediction, as Gabe notes in his answer, a branch misprediction can require a pipeline flush. Even at 99% prediction accuracy and one branch every ten instructions (95% and one branch every five instructions are more likely), a thousand-stage branch resolution delay (i.e., excluding stages after branch resolution and assuming branch target is available no later than branch direction) would mean half of the performance is taken by branch mispredictions.

Instruction cache misses would also be a problem. If one had perfect control flow prediction, one could use prefetching to hide the delay. This effectively becomes a part of the branch prediction problem. Also, note that increasing cache size to reduce miss rate (or branch predictor size to reduce misprediction rate) increases access latency (pipeline stage count).

Data value dependencies are more difficult to handle. If the execution takes two cycles, then two sequential instructions with a data dependence will not be able to execute back-to-back. While value prediction could, theoretically, help in some cases, it is most helpful in relatively limited cases. It is also possible for some operations to be width-pipelined (e.g., add, subtract, bitwise logical operations, and left shifts). However, such tricks have limits.

Data cache misses becoming part of this data dependence issue. Data memory addresses tend to be much more difficult to predict than instruction addresses.

This Google Scholar search provides some more detailed (and technical) reading on this subject.

Question 2

Without watching the hour-long video, I would say that the actual problem when there are a large number of stages is that pipeline stalls are worse. If you have a 14-stage pipeline and mispredict a branch, that's 14 stages that you have to fill up again before issuing another instruction.