You should have 512 dimensional vectors at the first step, one such vector per frame, or equivalently a 512 x n matrix.
Then in the second step I don't think they use the plain built-in hierarchical clustering - which is not time aware, and will not produce intervals, plus it will scale O(n^3) which is really bad - but instead they use a customized clustering algorithm, inspired by hierarchical clustering and using Ward's linkage, but which operates on time intervals; starting with single frames, but only joining neighboring intervals, not arbitrary intervals like regular hierarchical clustering would.