Question

I am using bytecode analysis (with BCEL) to get all imported classes of a class file. When I read the constant pool, not all imported classes appear as CONSTANT_Class entries (see spec); some appear only as CONSTANT_Utf8. My question: can I not rely solely on the CONSTANT_Class entries in the constant pool to find the imported classes? Do I really have to look at every entry and guess whether it is a class name? That does not seem correct in every situation either. Or do I have to read through the whole bytecode? Regards

Solution

No, it is not correct to use CONSTANT_Class_info entries alone to discover dependencies on other classes/interfaces. If you're parsing input files you trust, or you can tolerate incorrect information, you can get away with parsing only the constant pool, with one corner case to watch for. To get precise information on arbitrary input you need to parse the whole class file. (I assume by "dependencies" you mean those classes or interfaces without which loading or linking a class may result in exceptions, as described in JVMS chapter 5. This doesn't include classes obtained via Class.forName or other reflective means.)

Consider the following class.

public class Main {
    public static void main(String[] args) {
        identity(null);
    }
    public static Object identity(Foo x) {
        return x;
    }
}

javap -p -v Main.class prints:

Classfile /C:/Users/jbosboom/Documents/stackoverflow/build/classes/Main.class
  Last modified Jul 2, 2014; size 346 bytes
  MD5 checksum 2237cda2a15a58382b0fb98d6afacc7e
  Compiled from "Main.java"
public class Main
  SourceFile: "Main.java"
  minor version: 0
  major version: 52
  flags: ACC_PUBLIC, ACC_SUPER
Constant pool:
   #1 = Methodref          #3.#17         //  java/lang/Object."<init>":()V
   #2 = Class              #18            //  Main
   #3 = Class              #19            //  java/lang/Object
   #4 = Utf8               <init>
   #5 = Utf8               ()V
   #6 = Utf8               Code
   #7 = Utf8               LineNumberTable
   #8 = Utf8               LocalVariableTable
   #9 = Utf8               this
  #10 = Utf8               LMain;
  #11 = Utf8               identity
  #12 = Utf8               (LFoo;)Ljava/lang/Object;
  #13 = Utf8               x
  #14 = Utf8               LAAA;
  #15 = Utf8               SourceFile
  #16 = Utf8               Main.java
  #17 = NameAndType        #4:#5          //  "<init>":()V
  #18 = Utf8               Main
  #19 = Utf8               java/lang/Object
  #20 = Class              #22            //  java/lang/Thread
  #21 = Utf8               (LBar;)LFakename;
  #22 = Utf8               java/lang/Thread
{
  public Main();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1                  // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 6: 0
      LocalVariableTable:
        Start  Length  Slot  Name   Signature
            0       5     0  this   LMain;

  public static java.lang.Object identity(Foo);
    descriptor: (LFoo;)Ljava/lang/Object;
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: areturn
      LineNumberTable:
        line 11: 0
      LocalVariableTable:
        Start  Length  Slot  Name   Signature
            0       2     0     x   LAAA;
}

The class Foo, referenced as a parameter to the method identity, does not appear in the constant pool as a CONSTANT_Class_info entry. It does appear in the method descriptor for identity (entry #12). Field descriptors may also reference classes not appearing as CONSTANT_Class_info entries. Thus to find all the dependencies from the constant pool alone, you need to look at all UTF8 entries.
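
To make the descriptor point concrete, here is a minimal sketch of pulling class names out of a descriptor string such as entry #12. The class and method names (DescriptorClasses, classNamesIn) are made up for illustration, and the code assumes its input really is a well-formed field or method descriptor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects the class names mentioned in a field or method descriptor. */
final class DescriptorClasses {

    /** Returns internal names (for example "java/lang/Object") in order of appearance. */
    static List<String> classNamesIn(String descriptor) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < descriptor.length()) {
            if (descriptor.charAt(i) == 'L') {
                // An object type is written L<internal name>; so read up to the ';'.
                int end = descriptor.indexOf(';', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated class name in " + descriptor);
                }
                names.add(descriptor.substring(i + 1, end));
                i = end + 1;
            } else {
                // '(' , ')' , '[' or a primitive type code such as I, J, Z or V.
                i++;
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(classNamesIn("(LFoo;)Ljava/lang/Object;")); // [Foo, java/lang/Object]
        System.out.println(classNamesIn("([[LBar;IJ)V"));              // [Bar]
    }
}
```

Run on an arbitrary UTF8 entry that is not a descriptor, this will happily return nonsense, which is exactly the guessing problem the question describes.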

The corner case: Some UTF8 entries may exist to be referenced by CONSTANT_String_info entries. Duplicate UTF8 entries will be merged, so one UTF8 entry might be a method descriptor, a string literal, or both. If you're only parsing the constant pool, you must live with this ambiguity (probably by overapproximating and treating it as a dependency).
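
A rough sketch of a constant-pool-only scan that at least separates the ambiguous UTF8 entries from the rest might look like the following. It uses plain java.io rather than BCEL so that it stays self-contained; ConstantPoolScan and its field names are invented for this example, and the tag numbers and entry sizes are taken from the constant pool tables in JVMS 4.4.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: classifies constant pool entries without reading the rest of the file. */
final class ConstantPoolScan {
    /** Names behind CONSTANT_Class_info entries (may be array descriptors such as "[LFoo;"). */
    final Set<String> classNames = new TreeSet<>();
    /** UTF8 entries referenced by a CONSTANT_String_info: literals, but possibly descriptors too. */
    final Set<String> stringBacked = new TreeSet<>();
    /** All remaining UTF8 entries: class and member names, descriptors, attribute names. */
    final Set<String> otherUtf8 = new TreeSet<>();

    static ConstantPoolScan read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        data.readInt(); // minor_version and major_version
        int count = data.readUnsignedShort();
        String[] utf8 = new String[count];
        int[] classRef = new int[count];  // name_index of each Class entry, 0 otherwise
        int[] stringRef = new int[count]; // string_index of each String entry, 0 otherwise
        for (int i = 1; i < count; i++) {
            int tag = data.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = data.readUTF(); break;                // Utf8
                case 7: classRef[i] = data.readUnsignedShort(); break;  // Class
                case 8: stringRef[i] = data.readUnsignedShort(); break; // String
                case 5: case 6: data.readLong(); i++; break;            // Long, Double: two slots
                case 15: data.readFully(new byte[3]); break;            // MethodHandle
                case 16: case 19: case 20: data.readUnsignedShort(); break;
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                    data.readInt(); break;                              // all four-byte entries
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        // Resolve references only after the loop, because an entry may point forward.
        ConstantPoolScan scan = new ConstantPoolScan();
        boolean[] isStringBacked = new boolean[count];
        for (int i = 1; i < count; i++) {
            if (classRef[i] != 0) {
                scan.classNames.add(utf8At(utf8, classRef[i]));
            }
            if (stringRef[i] != 0) {
                scan.stringBacked.add(utf8At(utf8, stringRef[i]));
                isStringBacked[stringRef[i]] = true;
            }
        }
        for (int i = 1; i < count; i++) {
            if (utf8[i] != null && !isStringBacked[i]) {
                scan.otherUtf8.add(utf8[i]);
            }
        }
        return scan;
    }

    private static String utf8At(String[] utf8, int index) throws IOException {
        if (index >= utf8.length || utf8[index] == null) {
            throw new IOException("index " + index + " is not a UTF8 entry");
        }
        return utf8[index];
    }

    public static void main(String[] args) throws IOException {
        // Demo on a JDK class; any class file stream would do.
        try (InputStream in = Object.class.getResourceAsStream("/java/lang/Thread.class")) {
            if (in == null) {
                throw new IOException("class file resource not found");
            }
            ConstantPoolScan scan = read(in);
            System.out.println("class entries:       " + scan.classNames.size());
            System.out.println("string-backed UTF8:  " + scan.stringBacked.size());
            System.out.println("other UTF8 entries:  " + scan.otherUtf8.size());
        }
    }
}
```

Anything that lands in stringBacked is where the ambiguity lives: from the constant pool alone you cannot tell whether such an entry is only a literal or is also shared with a descriptor, so overapproximating means treating it as a possible dependency.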

If you trust the input to have been produced by a well-behaved Java compiler under your control, you can parse all UTF8 entries, mindful of the string corner case, and stop reading here. If you need to defend against an attacker feeding your tool handcrafted class files (e.g., you're writing a decompiler and the attacker wants to prevent decompilation), you need to parse the entire class file. Here are a few examples of the potential problems.

  • Entry #20 names a class not used by Main. The JVM may or may not try to resolve this reference (JVMS 5.4 permits both lazy and eager loading). As the class exists, either way, no error will be raised, so this extra entry is harmless, but it will fool tools looking at the constant pool into thinking Thread is a dependency.
  • Entry #21 is an unused method descriptor referring to two fictitious classes. As this descriptor is not used, no error will be raised, but again, tools trusting the constant pool will parse it.
  • Entry #14 is a field descriptor referring to a fictitious class. This entry is actually used by the LocalVariableTable attribute, but this debugging information is not checked by the JVM, so the reference is harmless but may fool tools.
  • I don't have an example for this one, but the InnerClasses attribute refers to CONSTANT_Class_info entries, and is not checked for consistency with other class files (per JVMS 4.7.6, albeit in a non-normative note). These references won't prevent loading or linking, but would confuse a tool examining the constant pool.

That's just what I came up with off the top of my head. A clever attacker going through the JVMS with a fine-tooth comb could probably find more places to add entries to the constant pool that look used but aren't. If you need precise information even in the face of an attacker, you need to parse the whole class file and understand how a JVM will use it.
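
As one small step in that direction, the sketch below walks past the constant pool into the field and method tables and reports only the descriptors that declared members actually point at. It is deliberately incomplete: it skips all attributes (so it never looks inside Code), and it ignores descriptors reached through Fieldref/Methodref entries used by instructions. MemberDescriptors is a made-up name, and the layout it assumes is the ClassFile structure from JVMS 4.1.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: descriptors that declared fields and methods actually use. */
final class MemberDescriptors {

    static Set<String> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        data.readInt(); // minor_version and major_version
        String[] utf8 = readUtf8Entries(data);
        data.readUnsignedShort(); // access_flags
        data.readUnsignedShort(); // this_class
        data.readUnsignedShort(); // super_class
        int interfaces = data.readUnsignedShort();
        for (int i = 0; i < interfaces; i++) {
            data.readUnsignedShort();
        }
        Set<String> descriptors = new TreeSet<>();
        readMembers(data, utf8, descriptors); // fields
        readMembers(data, utf8, descriptors); // methods
        return descriptors;
    }

    /** Keeps only the UTF8 strings; every other entry is skipped by size. */
    private static String[] readUtf8Entries(DataInputStream data) throws IOException {
        int count = data.readUnsignedShort();
        String[] utf8 = new String[count];
        for (int i = 1; i < count; i++) {
            int tag = data.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = data.readUTF(); break;
                case 5: case 6: data.readLong(); i++; break; // Long, Double: two slots
                case 15: data.readFully(new byte[3]); break;
                case 7: case 8: case 16: case 19: case 20: data.readUnsignedShort(); break;
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                    data.readInt(); break;
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        return utf8;
    }

    /** field_info and method_info share the same layout, so one reader serves both tables. */
    private static void readMembers(DataInputStream data, String[] utf8, Set<String> out)
            throws IOException {
        int members = data.readUnsignedShort();
        for (int m = 0; m < members; m++) {
            data.readUnsignedShort(); // access_flags
            data.readUnsignedShort(); // name_index
            int descriptorIndex = data.readUnsignedShort();
            if (descriptorIndex >= utf8.length || utf8[descriptorIndex] == null) {
                throw new IOException("descriptor index " + descriptorIndex + " is not a UTF8 entry");
            }
            out.add(utf8[descriptorIndex]);
            int attributes = data.readUnsignedShort();
            for (int a = 0; a < attributes; a++) {
                data.readUnsignedShort(); // attribute_name_index
                int length = data.readInt();
                if (length < 0) {
                    throw new IOException("attribute too large for this sketch");
                }
                data.readFully(new byte[length]); // skipped, including Code and debug tables
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo on a JDK class; any class file stream would do.
        try (InputStream in = Object.class.getResourceAsStream("/java/lang/Thread.class")) {
            if (in == null) {
                throw new IOException("class file resource not found");
            }
            read(in).forEach(System.out::println);
        }
    }
}
```

Going by the javap dump above, a walk like this should report only ()V and (LFoo;)Ljava/lang/Object; for that class file and never look at entries #14 or #21. A precise tool would continue from here into the Code attributes and the class-level attributes.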

OTHER TIPS

See JVMS 4.2, The Internal Form of Fully Qualified Class and Interface Names.

In a nutshell: a CONSTANT_Class_info structure does not hold the name itself, only an index to a UTF8 entry that does.

(Or are you instead saying that not all referenced classes are represented by a class and name entry?)


FWIW, be wary of relying solely on this information to determine dependencies as classes can be loaded dynamically and may not appear at all.
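
A tiny illustration of the dynamic case (DynamicLoadDemo is a made-up name): the class to load is only known as a string at run time, so the compiler has no class reference to record for it.

```java
/** Hypothetical demo: a class loaded reflectively leaves no class reference in the caller. */
final class DynamicLoadDemo {

    static Class<?> load(String binaryName) throws ClassNotFoundException {
        // The name arrives as data, so nothing in this method names the target class.
        return Class.forName(binaryName);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Even when the name is a literal, it is stored as a string constant in dotted
        // form ("java.util.ArrayList"), not as a CONSTANT_Class_info entry.
        Class<?> loaded = load(args.length > 0 ? args[0] : "java.util.ArrayList");
        System.out.println(loaded.getName());
    }
}
```

If the name comes from a command-line argument, a config file or the network, it does not appear in the class file in any form, so no amount of class file parsing will find it.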

Licensed under: CC-BY-SA with attribution