See section 2.2.5 in the Intel optimization guide -
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
(note that this applies for Sandy-Bridge, but doesn't appear as changed for Ivy-Bridge, which has only minor micro-architectural changes over the previous generation).
So regarding your questions:
- For the L1 there's no question of inclusiveness as it doesn't have upper level caches to be inclusive-of
- The L2 cache is not inclusive, meaning that there's no guarantee that a line residing in the L1 would have to be in the L2 as well. However on most cases it's likely to be there since it was probably filled into the L2 when originally requested by the core, and has a good chance to survive longer in the L2 since it's bigger (and therefore the evictions are better spread over more sets), and filtered by the L1 (meaning less evictions usually)
Also note that if your benchmark is accessing a data-set larger than the L2, it will probably fail to sit in the L2 (especially if you access it serially and exceed the L2 by more than the size of a single way), and you'd have to fetch it from the L3.