I don't know Unity well, but I know their base layer, and if they can map over, say, D3D9, D3D10 and OpenGL, then their abstraction has to target the lowest common denominator.
In that case, D3D10 is the most restrictive: you cannot share a depth surface between render targets of different sizes. If your screen and render targets all have the same size, then you can indeed bind a single depth buffer to different render targets.
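As a minimal D3D10 sketch (error handling omitted; `device`, `width`, `height` and the two color views `rtvA`/`rtvB` are assumed to exist already), one depth-stencil view shared by two same-sized render targets looks like this:

```cpp
#include <d3d10.h>

// One depth buffer, reused across render targets of the SAME size.
ID3D10Texture2D* depthTex = nullptr;
D3D10_TEXTURE2D_DESC dd = {};
dd.Width = width;  dd.Height = height;      // must match the render targets
dd.MipLevels = 1;  dd.ArraySize = 1;
dd.Format = DXGI_FORMAT_D24_UNORM_S8_UINT;  // 24-bit depth + 8-bit stencil
dd.SampleDesc.Count = 1;
dd.Usage = D3D10_USAGE_DEFAULT;
dd.BindFlags = D3D10_BIND_DEPTH_STENCIL;
device->CreateTexture2D(&dd, nullptr, &depthTex);

ID3D10DepthStencilView* dsv = nullptr;
device->CreateDepthStencilView(depthTex, nullptr, &dsv);

// The same DSV bound together with two different color targets
// (rtvA and rtvB are assumed created elsewhere, same width/height):
device->OMSetRenderTargets(1, &rtvA, dsv);  // ... draw pass A ...
device->OMSetRenderTargets(1, &rtvB, dsv);  // ... draw pass B ...
```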
The depth buffer is not strictly necessary: as you have observed, you can render without one, but then the result is simply whatever order the draw commands are issued in (a draw call being DrawPrimitive in D3D, or glDrawArrays/glDrawElements and the like in OpenGL). The spec even guarantees that this order is respected down to the triangle level: however parallel graphics cards are, they are not allowed to emit primitives out of order, so that a draw call produces consistent results across different runs.
Without a depth buffer, an object whose draw call comes after objects that were drawn closer to the camera will overwrite those closer objects (in view space) and give wrong results. The depth buffer lets the hardware discard, pixel by pixel, the fragments of an object whose depth (in view space) is greater than whatever has already been rendered at that pixel with a closer depth.
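A short OpenGL sketch of both behaviors (a working context is assumed, and `drawNearQuad`/`drawFarQuad` are hypothetical helpers that each issue one draw call for a quad at the indicated depth):

```cpp
// Without a depth test: the LAST draw call wins, whatever its depth.
glDisable(GL_DEPTH_TEST);
drawNearQuad();   // issued first
drawFarQuad();    // issued second: WRONGLY covers the near quad

// With a depth test: farther fragments are discarded per pixel.
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
drawNearQuad();   // writes its (smaller) depth values
drawFarQuad();    // its fragments fail the test wherever the near quad is
```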
Binding the depth buffer also helps performance: if everything already stored in a block of the depth buffer is closer than the nearest a primitive can get in that block, the rasterizer knows, right after the vertex stage, that this whole part of your primitive can never pass the depth test, and it discards the block altogether. This is called early-Z (or hierarchical-Z) culling and is a huge performance win. So it is very preferable to keep the depth buffer bound.
The camera has no existence in low-level graphics: it is just represented by a view matrix, an inverse transformation applied to the whole world that moves vertices from world space to view space as part of every vertex transformation. That is why, in a classic vertex shader, the position is taken in object space (from the vertex buffer stream), then multiplied by the object (model) matrix, then by the view matrix, then by the projection matrix; the rasterizer then divides everything by 'w' to perform the perspective divide.
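Here is that chain written out as a self-contained C++ sketch (the `Vec4`/`Mat4` types are toy stand-ins, not a real math library):

```cpp
#include <cstdio>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major in this sketch

Vec4 mul(const Mat4& a, const Vec4& v) {
    return {
        a.m[0][0]*v.x + a.m[0][1]*v.y + a.m[0][2]*v.z + a.m[0][3]*v.w,
        a.m[1][0]*v.x + a.m[1][1]*v.y + a.m[1][2]*v.z + a.m[1][3]*v.w,
        a.m[2][0]*v.x + a.m[2][1]*v.y + a.m[2][2]*v.z + a.m[2][3]*v.w,
        a.m[3][0]*v.x + a.m[3][1]*v.y + a.m[3][2]*v.z + a.m[3][3]*v.w
    };
}

// object space -> world -> view -> clip: the chain a classic vertex
// shader performs. The "camera" is nothing more than the view matrix.
Vec4 toClip(Vec4 objPos, const Mat4& model, const Mat4& view, const Mat4& proj) {
    return mul(proj, mul(view, mul(model, objPos)));
}

int main() {
    // Identity matrices stand in for real transforms in this sketch.
    Mat4 I = {{{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}}};
    Vec4 clip = toClip({1, 2, 3, 1}, I, I, I);
    // The perspective divide, done by the rasterizer in hardware:
    std::printf("ndc = (%f, %f, %f)\n",
                clip.x / clip.w, clip.y / clip.w, clip.z / clip.w);
}
```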
It is by exploiting this pipeline behavior that you "create" the concept of a camera. Unity has to abstract all of this by exposing a camera class; that class may even have a "texture target" member that says where the rendering from this camera will be stored.
And yes, the rendering is done DIRECTLY to the render texture that is specified; there is no intermediary front buffer or anything of the sort. A render target is a hardware-supported feature and needs no copy at the end of a rendering.
The render target is a whole configuration in itself: because of hardware multisampling it may actually bind several buffers at the required resolution. There is the color buffer, which under MSAA 4x amounts to 4 RGBA samples per pixel, and the depth buffer, generally a fixed-point 24-bit depth surface with 8 bits of stencil. Together, these surfaces make up the render target configuration and are all needed for rendering.
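As a sketch, that full configuration in D3D10 (again assuming `device`, `width`, `height`, that the hardware supports MSAA 4x for these formats, and with error handling omitted):

```cpp
#include <d3d10.h>

// MSAA 4x color buffer.
D3D10_TEXTURE2D_DESC cd = {};
cd.Width = width;  cd.Height = height;
cd.MipLevels = 1;  cd.ArraySize = 1;
cd.Format = DXGI_FORMAT_R8G8B8A8_UNORM;     // RGBA color
cd.SampleDesc.Count = 4;                    // MSAA 4x: 4 samples per pixel
cd.Usage = D3D10_USAGE_DEFAULT;
cd.BindFlags = D3D10_BIND_RENDER_TARGET;
ID3D10Texture2D* colorTex = nullptr;
device->CreateTexture2D(&cd, nullptr, &colorTex);

// Matching depth-stencil buffer: same size, same sample count.
D3D10_TEXTURE2D_DESC dd = cd;
dd.Format = DXGI_FORMAT_D24_UNORM_S8_UINT;  // 24-bit depth + 8-bit stencil
dd.BindFlags = D3D10_BIND_DEPTH_STENCIL;
ID3D10Texture2D* depthTex = nullptr;
device->CreateTexture2D(&dd, nullptr, &depthTex);

ID3D10RenderTargetView* rtv = nullptr;
ID3D10DepthStencilView* dsv = nullptr;
device->CreateRenderTargetView(colorTex, nullptr, &rtv);
device->CreateDepthStencilView(depthTex, nullptr, &dsv);
device->OMSetRenderTargets(1, &rtv, dsv);   // the complete configuration
```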
I hope this helps.