Question

I'm trying to implement the code from this article: "Improving Cluster Selection and Event Modeling in Unsupervised Mining for Automatic Audiovisual Video Structuring".
A part of it is about video clustering:
"The video stream is segmented into shots based on color histograms to detect abrupt changes and progressive transitions. Each of the resulting shot is summarized by a key frame, taken in the middle of the shot, in turn represented as a RGB histogram with 8 bins per color. Bottom-up clustering relies on the Euclidean distance between the 512-dimension color histograms using Ward’s linkage."
I've done this and ended up with an array of numbers like this:
1.0e+03 *

3.8334
3.9707
3.8887
2.1713
2.5616
2.3764
2.4533

After performing the dendrogram part, the result became:

  174.0103
  175.0093
  176.0093
  177.0093
  178.0093
  178.0093
  179.0093

But according to a toy example given by the authors of the article, the result should be intervals like:
{47000, 50000}, {143400, 146400}, {185320, 187880}, {228240, 231240}, {249440, 252000}, {346000, 349000}. What is wrong here?


Solution

You should have 512-dimensional vectors at the first step, one such vector per frame, or equivalently a 512 x n matrix. Judging from the output you posted, you currently seem to have a single number per frame instead.
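
As a quick sanity check, here is a minimal MATLAB sketch of that first step, assuming keyframes is a cell array of uint8 RGB images (one per frame, or one per shot key frame as in the paper); the variable names are mine, not the paper's. Note that MATLAB's linkage expects one observation per row, i.e. an n-by-512 matrix:

    % Minimal sketch, assuming keyframes is a cell array of uint8 RGB images,
    % one per frame or per shot key frame. Illustrative only, not the paper's code.
    n = numel(keyframes);
    X = zeros(n, 512);                                  % n-by-512 feature matrix
    for k = 1:n
        I   = double(keyframes{k});
        b   = floor(I / 32);                            % per-channel bin index 0..7
        lin = b(:,:,1)*64 + b(:,:,2)*8 + b(:,:,3) + 1;  % joint RGB bin index, 1..512
        h   = accumarray(lin(:), 1, [512 1]);           % 8*8*8 = 512-bin histogram
        X(k,:) = h.';                                   % one histogram per row
    end
    % Plain Ward linkage on the 512-dim histograms (Euclidean distance implied):
    Z = linkage(X, 'ward');
    dendrogram(Z);

This reproduces the distance and linkage choice, but as explained below, the plain linkage call still won't give you time intervals.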

Then, in the second step, I don't think they use the plain built-in hierarchical clustering: it is not time-aware, it will not produce intervals, and it scales as O(n^3), which is really bad. Instead, they use a customized clustering algorithm, inspired by hierarchical clustering and using Ward's linkage, that operates on time intervals: it starts with single frames but only joins neighbouring intervals, not arbitrary intervals like regular hierarchical clustering would.
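
To make that concrete, here is a rough MATLAB sketch of the idea: greedy bottom-up merging with Ward's criterion, restricted to temporally adjacent segments. This is only an illustration under my own assumptions, not the authors' exact algorithm; in particular the stopping rule (a target number of segments, nTarget) is a placeholder, and X is an n-by-512 per-frame histogram matrix like the one built above.

    % Illustrative sketch, not the paper's algorithm: bottom-up merging with
    % Ward's criterion, restricted to temporally adjacent segments.
    % X is the n-by-512 histogram matrix; nTarget is a placeholder stopping rule.
    function segs = timeConstrainedWard(X, nTarget)
        n     = size(X, 1);
        segs  = [(1:n)' (1:n)'];       % [firstFrame lastFrame] of each segment
        sizes = ones(n, 1);            % number of frames in each segment
        mu    = X;                     % centroid of each segment
        while size(segs, 1) > nTarget
            m    = size(segs, 1);
            cost = zeros(m - 1, 1);    % Ward cost of merging neighbours i and i+1
            for i = 1:m-1
                d       = mu(i,:) - mu(i+1,:);
                cost(i) = sizes(i)*sizes(i+1) / (sizes(i)+sizes(i+1)) * (d*d');
            end
            [~, i] = min(cost);        % cheapest merge of adjacent segments
            w         = sizes(i) + sizes(i+1);
            mu(i,:)   = (sizes(i)*mu(i,:) + sizes(i+1)*mu(i+1,:)) / w;
            sizes(i)  = w;
            segs(i,2) = segs(i+1,2);   % extend segment i to cover segment i+1
            mu(i+1,:) = []; sizes(i+1) = []; segs(i+1,:) = [];
        end
    end

Each row of segs is a [firstFrame lastFrame] interval; converting those frame indices to timestamps with the video's frame rate gives intervals of the same kind as the toy example.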

Licensed under: CC-BY-SA with attribution