To blog |

Troubleshooting verifiers: the rabbit hole goes deep

December 2, 2015 by Ago Allikmaa Filed under: Java

Java bytecode verificationThis is a story about my recent debugging experience. It all started with a bug report where our Agent seemed to fail in certain environments.

The common denominator seemed to be the Oracle JDBC Driver – in all the problematic environments the Oracle drivers seemed to have a role to play.

After an hour I had a reproducible test case at my fingertips. Instrumentation of code in a similar way to the following example:


InitialContext cxt = new InitialContext();
DataSource ds = (DataSource) cxt.lookup( "java:/comp/env/jdbc/MyDataSource" );
Connection c = ds.getConnection();
Statement s = c.createStatement("SELECT 1 FROM DUAL");

by our agent failed with a NoClassDefFoundError for certain JVM/driver version combinations:

Exception in thread "thread-1" java.lang.NoClassDefFoundError: oracle/security/pki/OraclePKIProvider
        at oracle.jdbc.pool.OracleDataSource.(OracleDataSource.java:103)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at java.lang.Class.newInstance(Class.java:442)
        at eu.plumbr.demo.io.jdbc.TestDataSource.makeDataSource(TestDataSource.java:275)
        ... rest of the stack trace skipped for brevity ...

So far so good. Experience has taught me that reducing the problem to a reproducible test case means that most of the work should be done. Oh boy was I wrong. The rabbit hole I was going to fall into was way deeper than it seemed.

Diving in

After initial research I discovered that the symptoms I had (a java.lang.NoClassDefFoundError about oracle.security.pki.OraclePKIProvider class being missing) did not help me one bit. After zooming in some more it became clear that the problem was triggered by an innocent-looking part of the oracle.jdbc.driver.OracleDriver source code

import oracle.security.pki.OraclePKIProvider
  
  ... skipped for brevity ...
  
  try {
    new OraclePKIProvider();
  } catch (Throwable t) {
     ;
  }
  
  ... skipped for brevity ...

Without our instrumentation in place, the code ran just fine – as the application was not using custom security providers and the oraclepki.jar embedding the OraclePKIProvider class was not present, the NoClassDefFoundError was caught and the initialization continued just fine. Apparently something went horribly wrong after instrumenting such a simple code block.

To explain what was going on, we first need to take a step back to see what the Plumbr Agent is doing inside the JVM. Our promise stands that we will link slow and failed transactions with the actual root cause in the source code. Examples for such root causes would for example be slow JDBC statements executing poorly-crafted SQL operations or Exceptions being thrown breaking the end user usage flow.

To expose such operations, we instrument certain calls, such as calls to the JDBC API or throw/catch clauses for Exceptions during the classloading time. Having our instrumentation in place allows us to expose the details about the root cause, such as the offending SQL and the line in source code it was executed in case of a slow JDBC statement or an Exception stack trace in case of a transaction failing with a 500-series HTTP response code.

Digging deeper

So, somehow instrumenting the Exceptions managed to break the simple try/catch block. Digging deeper revealed that when our instrumentations were added, the JVM went through the following steps:

  1. Plumbr Agent instrumented try-catch blocks with our own code. This instrumentation inserted our new bytecode right in-between the “catch” clause and the (stack map) frame associated with that catch.
  2. Now the bytecode verifiers kicked in. In this case, three verifiers were applied to verify whether the bytecode generated is valid:
    1. First, ASM own verifier was applied programmatically by the Plumbr Agent to make sure we have not generated invalid bytecode. ASM’s bytecode instrumentation works using the visitor pattern. It calls the corresponding visit* method on every instruction, a label (a place where the code can jump to), and a stack map frame (extra data stored for every scope in a method). In our case, we used a label as an indicator of the beginning of a catch block. Anyhow, the ASM verifier considered the generated class to be okay.
    2. Next, the Hotspot JVM launched the main verifier. The reason why JVM has two layers of verifiers is outside the scope of this post, but we can say that having this verifier present allows JIT to do its work faster during runtime. The main verifier threw a java.lang.VerifyError because it expected a stack frame starting exactly at the catch block.
    3. Having the main verifier failing, a fallback verifier was called by Hotspot. Fallback verifier would normally had considered the class to be OK. However, the verifier is built to require all the imported classes to be available, but recalling that the imported oracle.security.pki.OraclePKIProvider was not available resulted in the fallback verifier throwing a NoClassDefFoundError
  3. Since no Java code was visited between the two Hotspot verifiers, the NoClassDefFoundError thrown by the fallback verifier just “ate” the actual “java.lang.VerifyError: Expecting a stackmap frame at branch target 230” error, since in native mode only the last throwable was remembered.
  4. NoClassDefFoundError was thrown while linking the OracleDriver class, not inside the static initializer of OracleDriver where it would be caught

How come is it so complicated?

Sounds straightforward? Might be, when the truth is staring at you in the form of the four steps above. For me, it took a whole day to come to this. Why was it so hard?

First, classes rarely depend on being able to function without its imports being loadable, OracleDriver does that because they were probably well aware that Hotspot is able to handle this.

Next, the whole mess with three verifiers is just that, a mess. When you consider that

  • ASM verifier behavior is completely different from HotSpot verifier and is qualifying the generated bytecode as OK
  • Hotspot main verifier considered the code to be invalid. As expected, the main verifier threw the VerifyError
  • Hotspot fallback verifier tries to load a missing class (which the main verifier does not try to do) and thanks to our changes in bytecode throws the ClassDefNotFoundError

you see that situations like this are indeed a nightmare to debug.

But what was truly nasty was the way fallback verifier just swallowed the original error. Finding out the original error message now involved debugging native code, since it was only referenced inside the JVM in the time between running the two separate verifiers.

Solving the root cause

The solution itself was not too tough –  all we needed was to make sure that if the label visit was originally followed by a frame visit, we would insert our instructions after the frame visit, not between the frame and the label.


private static class ThrowableMethodVisitor extends MethodVisitor {
   private final List

Take-Away

This was a day in my life I am never going to get back. And all thanks to the two layers of verifiers swallowing the original java.lang.VerifyError and packaging it as a completely unrelated java.lang.NoClassDefFound. Hopefully with the help of search engines this blog post will save someone a day of their lives. Or, if not, then I guess you just learned something not-too-practical about verifiers and bytecode generation.

ADD COMMENT