ID3v2 Specification

Question 1

I think this might actually be a ~~mistake~~ case of bad wording in the spec. I found two diagrams in the ID3v2 Chapter Frame Addendum showing examples of complete headers. That document describes two newly introduced frame types, which are not relevant to the question at hand. Fortunately, it also contains examples of an embedded 'Title/Songname/Content description' frame (TIT2) and a 'Subtitle/Description refinement' frame (TIT3), which are both text frames*:


According to the diagram, the Title frame (ID: TIT2) has the following structure. First comes the frame header:

Frame ID       $xx xx xx xx (four characters)
Size           $xx xx xx xx
Flags          $xx xx

which is then directly followed by ID-dependent fields:

Text encoding  $xx
Information    <text string according to encoding>

This layout makes the most sense to me. If you still have doubts about the correct layout, you could check out the source of one of the existing implementations.
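The layout above can be sketched in a few lines of Python. This is only an illustration, not taken from any existing implementation: it assumes ID3v2.3, where the frame size is a plain 32-bit big-endian integer (ID3v2.4 uses a synchsafe integer instead), and the helper names `parse_frame` and `split_text_frame` are made up for this example.

```python
import struct

def parse_frame(data: bytes, offset: int = 0):
    """Split one frame into its 10-byte header fields and its body.

    Assumption: ID3v2.3 layout, i.e. the size is a plain big-endian
    32-bit integer that does not include the header itself.
    """
    # 4 bytes frame ID, 4 bytes size, 2 bytes flags = 10 bytes total
    frame_id, size, flags = struct.unpack_from(">4sIH", data, offset)
    body_start = offset + 10
    body = data[body_start:body_start + size]
    return frame_id.decode("ascii"), size, flags, body

def split_text_frame(body: bytes):
    """Return (text encoding byte, raw text bytes) for a text frame body."""
    return body[0], body[1:]

# A hand-built TIT2 frame: header, then the encoding byte, then the text.
text = b"Hello"
frame = b"TIT2" + struct.pack(">IH", 1 + len(text), 0) + b"\x00" + text

frame_id, size, flags, body = parse_frame(frame)
encoding, raw = split_text_frame(body)
print(frame_id, size, flags, encoding, raw)
```

Note that the encoding byte is counted in the frame size, because it belongs to the frame body and not to the header.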

Sidenote: in the ID3v2.4.0 specification they changed the confusing sentence to:

Frames that allow different types of text encoding contains a text encoding description byte.

* Only frames that allow different types of text encoding have a text encoding description byte. Unsurprisingly, most of these are text frames.

Question 2

A frame header is 10 bytes long: 4 bytes for the frame ID, 4 bytes for the length of the frame (header excluded), and 2 bytes for flags. Any other information is found in the frame itself, not in its header.

The wording sure is confusing.

What is meant is that wherever you expect to read a string, the first byte tells you how it is encoded: $00 means ISO-8859-1 (one byte per character) and $01 means Unicode (two bytes per character). A $01 string begins with a byte order mark, either FF FE or FE FF, which tells you the byte order: FF FE means the least significant byte comes first, FE FF means the most significant byte comes first.
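As a rough sketch of that rule, the following Python function decodes the text part of a frame from its encoding byte. The name `decode_id3_text` is hypothetical, and only the two encodings mentioned above are handled. It relies on Python's standard `utf-16` codec, which reads the byte order mark and picks the byte order accordingly.

```python
def decode_id3_text(encoding: int, raw: bytes) -> str:
    """Decode the text bytes that follow the text encoding byte.

    Assumption: only $00 (ISO-8859-1) and $01 (Unicode with a byte
    order mark) are supported in this sketch.
    """
    if encoding == 0x00:
        text = raw.decode("latin-1")  # ISO-8859-1, one byte per character
    elif encoding == 0x01:
        # The "utf-16" codec consumes the FF FE / FE FF mark and uses it
        # to choose little-endian or big-endian decoding.
        text = raw.decode("utf-16")
    else:
        raise ValueError(f"unsupported text encoding byte: {encoding:#04x}")
    # Strings may be followed by a null terminator; drop it if present.
    return text.rstrip("\x00")

print(decode_id3_text(0x00, b"Caf\xe9"))              # ISO-8859-1
print(decode_id3_text(0x01, b"\xff\xfeH\x00i\x00"))   # FF FE: little-endian
print(decode_id3_text(0x01, b"\xfe\xff\x00H\x00i"))   # FE FF: big-endian
```

Both Unicode calls decode to the same string, which shows that the byte order mark alone decides how the following bytes are paired.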

I'd advise you to open some MP3 files in a hex editor and dissect them.