Disclaimer: Since this post was written, the Spark team has released a newer version, which we are also running at the moment. While the post describes issues we had with an earlier upgrade, most likely all of the points mentioned below still apply, and in fact the topic of class loaders applies to anything that runs on the JVM.
At AppLift we use Apache Spark for quite a few data pipelines, and we are big fans of both Spark and Scala. That said, recently we were a little bit puzzled after getting exceptions while trying to upgrade to the recently released version 1.5.2, even more so because the exceptions stemmed not from misusing the Spark API itself but rather from something more fundamental: JDBC.
Here's an example:
Usually, when you get exceptions like this, people will tell you that you need to load your JDBC driver by calling Class.forName("...") to dynamically load the class, which will then register itself with the DriverManager according to the JDBC standard.
This is what it looks like inside the MySQL driver, for example: java.sql.DriverManager.registerDriver(new Driver());
...and of course our code was doing that already.
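To make the registration mechanism concrete, here is a minimal sketch that uses only the JDK. DemoDriver and the "jdbc:demo:" URL prefix are made up for illustration, and the driver is registered by hand instead of from a static initialiser triggered by Class.forName, so treat it as an approximation of what a real driver does rather than the code we ran.

```scala
import java.sql.{Connection, Driver, DriverManager, DriverPropertyInfo, SQLException, SQLFeatureNotSupportedException}
import java.util.Properties
import java.util.logging.Logger

// Hypothetical no-op driver that only recognises the made-up "jdbc:demo:" URL prefix.
class DemoDriver extends Driver {
  override def acceptsURL(url: String): Boolean = url != null && url.startsWith("jdbc:demo:")
  override def connect(url: String, info: Properties): Connection =
    if (acceptsURL(url)) throw new SQLException("DemoDriver was found, but it cannot really connect")
    else null // JDBC contract: return null for URLs this driver does not handle
  override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    Array.empty[DriverPropertyInfo]
  override def getMajorVersion(): Int = 1
  override def getMinorVersion(): Int = 0
  override def jdbcCompliant(): Boolean = false
  override def getParentLogger(): Logger = throw new SQLFeatureNotSupportedException("no logging")
}

object DriverLookupDemo {
  // Returns the message of the SQLException that DriverManager raises for `url`.
  def failureMessage(url: String): String =
    try { DriverManager.getConnection(url).close(); "connected" }
    catch { case e: SQLException => e.getMessage }

  def main(args: Array[String]): Unit = {
    val url = "jdbc:demo:test"
    println(failureMessage(url))                 // no driver registered for this URL yet
    DriverManager.registerDriver(new DemoDriver) // what a real driver does when its class is initialised
    println(failureMessage(url))                 // the lookup now reaches DemoDriver
  }
}
```

Before registration the DriverManager has nothing that accepts the URL; afterwards the failure comes from the demo driver itself, which shows that the lookup found it.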
To fully understand what is going on, we need to take a quick detour into Java class loaders.
Class loaders are organised in a hierarchy. The root of this hierarchy is the bootstrap class loader written in native code. It loads the core of the Java standard library when you start the JVM. You can't "see" it from inside the program: if you try to get the CL of the classes it loaded, you will get back null. The DriverManager is a part of Java, and as such is loaded by this CL.
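The null result is easy to check for yourself. The following is a small sketch with a hypothetical helper name (loaderName). On the Java 7/8 runtimes of the Spark 1.5 era the DriverManager line should report the bootstrap loader as well; newer JVMs may report a different, platform-level loader, so the sketch prints the result instead of assuming it.

```scala
object BootstrapLoaderDemo {
  // Name of the loader that defined `c`; the bootstrap loader shows up as null.
  def loaderName(c: Class[_]): String =
    Option(c.getClassLoader).map(_.getClass.getName).getOrElse("null (bootstrap)")

  def main(args: Array[String]): Unit = {
    println("String        -> " + loaderName(classOf[String]))
    println("DriverManager -> " + loaderName(classOf[java.sql.DriverManager]))
    println("this object   -> " + loaderName(getClass))
  }
}
```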
Then below there's the extension class loader, which loads... well, Java extensions, but it's not relevant to this problem. Its child, though, the sun.misc.Launcher$AppClassLoader, is the main loader, which usually loads most of the application classes on the main classpath.
Usually, but not when you use spark-submit. Spark prepares another class loader for you, an instance of org.apache.spark.util.MutableURLClassLoader by default, and it will be this one to load all your classes, including the JDBC driver if you include it in the UberJar.
The manoeuvre of having a separate class loader for user classes is an attempt to separate what your application is loading from what Spark is bringing in, as there might be conflicts in dependency versions etc.
Now, the DriverManager has a check to see if the class that is calling the getConnection method, or more specifically its class loader, will be able to load a given driver. If not, you won't find a suitable driver even though it's registered.
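The idea behind that check can be paraphrased as follows. This is a sketch of the principle, not the JDK source: the helper name isVisibleFrom is ours, and it works on any class rather than only on drivers. A class counts as visible from a loader only if that loader resolves the class name to the very same Class object.

```scala
object DriverVisibility {
  // True if `loader` resolves the class name to the very same Class object.
  // A null loader stands for the bootstrap class loader, as in Class.forName.
  def isVisibleFrom(cls: Class[_], loader: ClassLoader): Boolean =
    try Class.forName(cls.getName, true, loader) eq cls
    catch { case _: ClassNotFoundException | _: LinkageError => false }

  def main(args: Array[String]): Unit = {
    val own = getClass                              // defined by the application's loader
    println(isVisibleFrom(own, own.getClassLoader)) // its own loader can see it
    println(isVisibleFrom(own, null))               // the bootstrap loader cannot
  }
}
```

The second call plays the role of a parent loader asking for a class that only a child loader knows about, which is the situation described next.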
Since your code is loaded with Spark's CL, and so is the driver, a direct call to getConnection from your own code will work without issues. The problem is that when you submit your jobs with spark-submit from version 1.5.0 onwards, the Spark assembly jar contains the classes of Apache Commons DBCP (DataBase Connection Pool), which are loaded by AppClassLoader, and thus the pool can't load your driver, because AppClassLoader knows nothing about your classes loaded with MutableURLClassLoader.
Why does that even matter? Because if you use ScalikeJDBC, like we do, to do SQL in Scala – well they use DBCP. So Scalike will not see your driver even though it's actually registered in the manager. I know.
By the way, to see the current class loader hierarchy and the classes loaded in each CL (warning: it's a lot of classes), you can use this bit of reflective Scala code:
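A minimal sketch of such a helper follows. It relies on the private `classes` field of java.lang.ClassLoader, which is an implementation detail of Java 7/8; on newer JVMs the reflective access is likely to be refused, in which case the sketch simply reports no classes. The object and method names are our own choice.

```scala
import scala.util.Try

object ClassLoaderDump {
  // The chain of loaders from `start` upwards; the bootstrap loader (null) is not included.
  def hierarchy(start: ClassLoader): List[ClassLoader] =
    Iterator.iterate(start)(_.getParent).takeWhile(_ != null).toList

  // Best effort: reads ClassLoader's private `classes` field (a Java 7/8 implementation detail).
  // Returns an empty list if the field is missing or not accessible.
  def loadedClasses(cl: ClassLoader): List[String] = Try {
    val field = classOf[ClassLoader].getDeclaredField("classes")
    field.setAccessible(true)
    val vector = field.get(cl).asInstanceOf[java.util.Vector[Class[_]]]
    vector.toArray().toList.map(_.asInstanceOf[Class[_]].getName)
  }.getOrElse(Nil)

  def main(args: Array[String]): Unit =
    hierarchy(Thread.currentThread().getContextClassLoader).foreach { cl =>
      println(cl)
      loadedClasses(cl).foreach(name => println("  " + name))
    }
}
```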
Example output (truncated):
and if you want to see all the (usable) drivers with their class loaders:
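One way to do this with nothing but the JDK is sketched below (the object name is hypothetical). DriverManager.getDrivers only returns the drivers that are visible to the caller's class loader, which is why the list contains the usable drivers only.

```scala
import java.sql.DriverManager

object DriverDump {
  // One line per driver that DriverManager considers visible to the calling class.
  def describe(): List[String] = {
    val drivers = DriverManager.getDrivers
    val lines = List.newBuilder[String]
    while (drivers.hasMoreElements) {
      val d = drivers.nextElement()
      lines += s"Driver loaded with ${d.getClass.getClassLoader}"
    }
    lines.result()
  }

  def main(args: Array[String]): Unit = describe().foreach(println)
}
```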
Like so:
Driver loaded with org.apache.spark.util.MutableURLClassLoader@13eb8acf
Fine, but how can one overcome this problem? There are a couple of solutions.
•bring in the necessary driver classes through the --jars flag, as they are loaded by AppClassLoader. This is only viable if you are willing to ship a separate jar for the JDBC driver and will only work if you use the database in the Spark driver process. In executors the problem is still there.
•load the driver by using the spark.executor.extraClassPath option and the analogous spark.driver.extraClassPath. The same caveats apply, and you have to make the JDBC driver jar accessible to both the Spark driver and all executors.
•use spark.driver.userClassPathFirst and the analogous executor option. This directs all class loading calls (including those from Spark internals) to look in your jar first. It basically nullifies the separation between Userland and Sparkland and is not a good idea on so many levels. You will probably be hit by a wave of cryptic exceptions from everywhere if your code is a bit more complicated than HelloWorld and uses the same libs as Spark (and Spark uses a lot of common libs, as you may suspect by now).
Okay, but let's say that you want to load this driver from your UberJar, not a separate file, without breaking the classpath. Is there no way to do this? Turns out there is. In fact, it was a problem for Spark developers too, and they made a nifty wrapper that is loaded from the Spark assembly. See here: https://issues.apache.org/jira/browse/SPARK-6913 and the PR here: https://github.com/apache/spark/pull/5782/files
Turns out you can use the same DriverRegistry wrapper to load your drivers!
It's private to the org.apache.spark.sql package so you need to put your wrapper class there:
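A minimal sketch of such a wrapper is shown below. It needs Spark on the compile classpath, and the import path of DriverRegistry is an assumption based on the Spark 1.5.x source layout; the object has moved between releases, so check the location in the version you run.

```scala
// Placed in Spark's own package on purpose, so that the package-private DriverRegistry is accessible.
package org.apache.spark.sql

// Assumed location for Spark 1.5.x; verify it against your Spark version.
import org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry

object DriverRegistryWrapper {
  // Registers the JDBC driver class through Spark's wrapper instead of a plain Class.forName.
  def register(className: String): Unit = DriverRegistry.register(className)
}
```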
Then all you need to do is call DriverRegistryWrapper.register(...) instead of Class.forName(...). It works on both the driver and the executors, as they both get your UberJar. Neat!
We use Apache Parquet as the storage format for our historical data for purposes of BI queries. But as our needs are pretty specific we have our own Hadoop FileOutputFormat that uses some Hadoop-Parquet interop classes that are private to package org.apache.parquet.hadoop. Now that they are included in the Spark assembly too, they are already loaded by the AppClassLoader before your code gets executed. And we found out we can't access them at this point because of the principle of class loader separation:
java.lang.IllegalAccessError: tried to access class org.apache.parquet.hadoop.CodecFactory from class org.apache.parquet.hadoop.OurCustomParquetOutputFormat
There are two solutions to this problem:
•build your own Spark with your custom classes baked in the assembly
•do a bit of class loader hacking
Because we are lazy and like hacking, we decided to try the second way. It might not be the optimal one for you, and depending on what you want to do, your mileage may vary. But at least you'll see how to make class loaders work for you.
Now, the goal is the following: we essentially want to load those classes again with the same class loader that will load our custom class, so that they all live in the same CL and become accessible to each other. However, we only want to re-load the classes that are necessary, and not, say, the protobuf Message class. That's because when your newly loaded classes try to talk to the others, they will disagree about the Message interface, even though it's actually the same class loaded twice (sick, I know. Makes me miss Go and implicit interfaces)!
But by default, when you ask a typical CL to load a class, it will first delegate the loading to its parents, and that's exactly what we don't want. We want a CL that loads a given list of classes itself, regardless of the parent, but delegates everything else. Fortunately, Spark comes with a class that does a similar thing: ChildFirstURLClassLoader, and it requires just a bit of extending:
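The selective delegation can be sketched as follows. To stay independent of Spark internals, this stand-in extends the plain java.net.URLClassLoader rather than Spark's ChildFirstURLClassLoader, and the class name and the ownPrefixes parameter are hypothetical. Classes whose names start with one of the given prefixes are defined by this loader from its own URLs; everything else goes through the usual parent-first path.

```scala
import java.net.{URL, URLClassLoader}

// Stand-in for the approach, built on plain URLClassLoader rather than Spark's class.
class SelectiveChildFirstClassLoader(urls: Array[URL], parent: ClassLoader, ownPrefixes: Seq[String])
    extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    getClassLoadingLock(name).synchronized {
      if (ownPrefixes.exists(p => name.startsWith(p))) {
        try {
          var c: Class[_] = findLoadedClass(name) // already defined by this loader?
          if (c == null) c = findClass(name)      // define it from our own URLs, skipping the parent
          if (resolve) resolveClass(c)
          c
        } catch {
          // Not present in our URLs after all: fall back to normal delegation.
          case _: ClassNotFoundException => super.loadClass(name, resolve)
        }
      } else super.loadClass(name, resolve)       // parent-first for everything else
    }
}

object SelectiveLoaderDemo {
  def main(args: Array[String]): Unit = {
    // No URLs and a made-up prefix: every lookup ends up being delegated to the parent.
    val cl = new SelectiveChildFirstClassLoader(
      Array.empty[URL], ClassLoader.getSystemClassLoader, Seq("com.example."))
    println(cl.loadClass("java.lang.String") eq classOf[String])
  }
}
```

In the Parquet case the prefix list would name only the package-private classes that have to live next to the custom output format, and the URL array would point at the UberJar.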
Then use it like this (you are in the executor so you can get SparkContext like this safely) to add your UberJar:
Now all you might need to do is wrap this format and its instantiation above in another FileOutputFormat and just forward calls to getRecordWriter.
Then pass the wrapper format normally as you would in e.g. calling RDD.saveAsNewAPIHadoopFile.