RFC 6184, "RTP Payload Format for H.264 Video", answers both questions. Both arrangements are allowed: a single RTP packet can carry two or more NAL units (aggregation), and a single NAL unit can be fragmented across two or more RTP packets.
See the quotes from the RFC below:
5.7.1. Single-Time Aggregation Packet (STAP)
A single-time aggregation packet (STAP) SHOULD be used whenever NAL units are aggregated that all share the same NALU-time.
and
5.8. Fragmentation Units (FUs)
This payload type allows fragmenting a NAL unit into several RTP packets. Doing so on the application layer instead of relying on lower-layer fragmentation (e.g., by IP) has the following advantages:
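To make the two cases concrete, here is a minimal sketch of how a receiver might handle them. It assumes the common non-interleaved variants, STAP-A (NAL unit type 24) and FU-A (type 28), and it assumes the FU-A fragments arrive in order with none lost. The function names are hypothetical and not taken from any library; the byte layouts follow RFC 6184 sections 5.7.1 and 5.8.

```python
import struct

STAP_A = 24  # single-time aggregation packet, type A
FU_A = 28    # fragmentation unit, type A


def nal_type(payload: bytes) -> int:
    """Low 5 bits of the first payload byte give the NAL unit type."""
    return payload[0] & 0x1F


def split_stap_a(payload: bytes) -> list[bytes]:
    """Return the NAL units aggregated in one STAP-A payload."""
    nalus = []
    offset = 1  # skip the one-byte STAP-A NAL header
    while offset + 2 <= len(payload):
        # Each aggregated NAL unit is preceded by a 16-bit big-endian size.
        (size,) = struct.unpack_from("!H", payload, offset)
        offset += 2
        if offset + size > len(payload):
            raise ValueError("malformed STAP-A payload")
        nalus.append(payload[offset:offset + size])
        offset += size
    return nalus


def reassemble_fu_a(payloads: list[bytes]) -> bytes:
    """Rebuild one NAL unit from its FU-A fragments (assumed in order)."""
    indicator = payloads[0][0]      # FU indicator: F, NRI, type = 28
    first_header = payloads[0][1]   # FU header: S, E, R, original type
    last_header = payloads[-1][1]
    if not first_header & 0x80 or not last_header & 0x40:
        raise ValueError("missing start (S) or end (E) fragment")
    # Original NAL header: F and NRI from the FU indicator,
    # NAL unit type from the FU header.
    nal_header = (indicator & 0xE0) | (first_header & 0x1F)
    return bytes([nal_header]) + b"".join(p[2:] for p in payloads)
```

A depacketizer would check `nal_type(payload)` first: values 1 to 23 mean the payload is a single complete NAL unit, 24 means it should be split with something like `split_stap_a`, and 28 means it should be buffered until the fragment with the E bit set arrives. A real implementation would also use RTP sequence numbers to detect lost fragments, which this sketch does not do.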