Building a Java Virtual Machine Part 2: A method to the madness
Part 2 of a series discussing the joys and pitfalls (mostly pitfalls) of hacking together a minimal JVM in JavaScript. The live code base (a very rough work in progress!) is on GitHub. Subscribe for further updates! ;-)
I finished the last article by walking through the simple Java class file for Hello World. This time I’m going to talk more about JavaScript, and what structures are needed for using that class data.
I called my project Kopiluwak. All self-respecting Java-adjacent projects use stupid coffee puns for their names, and this is no exception. I figured a (s)crappy effort deserved a (s)crappy name. Here’s the reference, feel free to be grossed out on your own time!
So here’s a little data structure for holding the loaded raw class data. (The KL is for kopi luwak):
function KLRawClassData(...) {
    this.name = ...; // string
    this.superclassName = ...; // string
    this.accessFlags = ...; // bit flags
    this.constantPool = ...; // 1-indexed array of tagged constants
    this.interfaces = ...; // array of string names
    this.fields = ...; // array of "field infos"
    this.methods = ...; // array of "method infos"
    this.attributes = ...; // array of "attribute infos"
}
I did it this way because I was never smart enough to really grok the prototypal object stuff in JavaScript and I didn’t want to learn new things either. I know this is a constructor form, so I can say let rawClassData = new KLRawClassData(with, some, params) and that’s how I’m gonna roll for now. Obviously I omitted the param list in the snippet above.
This structure captures the stuff I parsed out of HelloWorld.class but doesn’t do much in the way of transformation. It effectively holds a deserialized version of the class file, not a runtime-ready representation of the class. For that we’ll need to do a bit more work.
I know my HelloWorld class doesn’t have any interfaces or fields, which is handy, and really all I care about for right now is having the code for the main method. So I’ll start building out a runtime class object, called KLClass. For now the basics seem to be:
- The class name, in string form.
- A reference to the superclass (or null if none).
- The full constant pool, because this is needed for runtime access.
- The name and code for each method.
The superclass reference may not be necessary at this point, since I’m only going to load one class, but it will be necessary soon and it’s useful to think of these class objects as existing in a walkable hierarchy.
I can just copy the first three things from my “raw” class structure but the method information needs some teasing out.
There are three parts to the method that matter for the time being: the name, the descriptor, and the bytecode itself, which we need so we can try to execute it.
Interlude: What’s in a name?
The name of a Java method is just a string, like “main” or “println” or “doSomethingCool”. The name itself is stored within the constant pool, and pointed to via index from the method info. The name itself is easy.
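To make the "pointed to via index" part concrete, here is a minimal sketch of resolving a method name through the constant pool. The entry shape ({ tag, value }), the nameIndex field, and the utf8At helper are all assumptions for illustration, not necessarily how Kopiluwak stores things. The tag value 1 for a UTF-8 string constant does come from the class file format.

```javascript
// Hypothetical shapes: each constant pool entry is { tag, value },
// and each raw method info carries a nameIndex into the pool.
const CONSTANT_UTF8 = 1; // class file tag for a UTF-8 string constant

// The pool is 1-indexed, so slot 0 is left empty.
const constantPool = [
  null,
  { tag: CONSTANT_UTF8, value: "main" },
  { tag: CONSTANT_UTF8, value: "println" },
];

// Hypothetical helper: fetch the string stored at a pool index.
function utf8At(pool, index) {
  const entry = pool[index];
  if (!entry || entry.tag !== CONSTANT_UTF8) {
    throw new Error("Constant pool index " + index + " is not a UTF-8 string");
  }
  return entry.value;
}

const rawMethodInfo = { nameIndex: 1 };
const resolvedName = utf8At(constantPool, rawMethodInfo.nameIndex);
console.log(resolvedName); // main
```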
But Java allows you to overload method names: multiple methods within a class that share an identical name but differ in the types or number of their arguments, for example this silly hypothetical:
int addNumbers(int a, int b, int c);
float addNumbers(float[] vals);
long addNumbers(String numstr1, String numstr2);
All three of these methods can exist on the same class at the same time, and all of them are called addNumbers. The way that the JVM distinguishes these internally is via their descriptors, a simple text shorthand that encodes the method signatures.
Here are the three descriptors for the hypothetical methods above:
(III)I
([F)F
(Ljava/lang/String;Ljava/lang/String;)J
Primitive types (int, float, long above, and also char, byte, double, short, boolean) have single-letter shorthands: I for int, F for float, etc. The only ones that are hard to guess are J for long and Z for boolean.
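For reference, the full set of primitive shorthands fits in a tiny lookup table. The letters are the standard JVM ones; the table name PRIMITIVE_DESCRIPTORS is just a label made up for this sketch.

```javascript
// Single-letter descriptor codes for the eight Java primitive types.
// (PRIMITIVE_DESCRIPTORS is a hypothetical name for illustration.)
const PRIMITIVE_DESCRIPTORS = {
  B: "byte",
  C: "char",
  D: "double",
  F: "float",
  I: "int",
  J: "long",    // L is taken by object references
  S: "short",
  Z: "boolean", // B is taken by byte
};

console.log(PRIMITIVE_DESCRIPTORS["J"]); // long
console.log(PRIMITIVE_DESCRIPTORS["Z"]); // boolean
```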
Object references (to classes or interfaces) use type shorthands that shove the fully-qualified name of the class or interface, with slashes in place of the usual dots, between an L and a ;. Thus, Ljava/lang/String; is a reference to an object of class String, as seen above.
With me so far? We’ll see these descriptor shorthands used to describe the type for individual fields and values in other contexts. But here, we see multiple ones concatenated together to describe a method signature. The form of the method signature descriptors is also easy: it’s the string of argument types inside parentheses, followed by the type of the return value. No commas and no actual names, just type information jammed together.
So, (III)I should now make sense. This is a method that takes three int parameters, and returns an int value. Simple, right?
The third one looks long and complicated but it’s actually easy too. It takes two Strings in a row as parameters, and returns a long (the J).
The middle descriptor only has one unexpected character in it: the left bracket. That’s the way a descriptor indicates an array, by preceding the type contained in the array with one left bracket for each dimension of the array. So an array of floats looks like [F. A two-dimensional array of integers used to store a simple game board might look like [[I. So now we can easily see that the middle descriptor accepts an array of floating point values and returns one floating point value.
These encoded type descriptors for individual values and for method signatures don’t really show up in Java-the-language, but they’re everywhere in the JVM. So much so that I created two JS objects to parse and express these, so I don’t need to deal in the raw text for them: JType and KLMethodDescriptor.
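To show how little machinery the descriptor grammar needs, here is a rough sketch of a parser. It is not the JType or KLMethodDescriptor code from the project; parseFieldType and parseMethodDescriptor are hypothetical names, and the output shape (plain strings like "int[]") is an assumption chosen for readability. Class names in a real class file arrive with slashes (java/lang/String), which the sketch converts to dots.

```javascript
const PRIMITIVES = {
  B: "byte", C: "char", D: "double", F: "float",
  I: "int", J: "long", S: "short", Z: "boolean",
};

// Parse a single type starting at pos. Returns the readable type
// name and the position just past it.
function parseFieldType(desc, pos) {
  let dimensions = 0;
  while (desc[pos] === "[") { // one bracket per array dimension
    dimensions++;
    pos++;
  }
  const c = desc[pos];
  let name;
  if (c === "L") {
    const end = desc.indexOf(";", pos);
    if (end < 0) throw new Error("Unterminated class name in " + desc);
    name = desc.substring(pos + 1, end).replace(/\//g, ".");
    pos = end + 1;
  } else if (PRIMITIVES[c]) {
    name = PRIMITIVES[c];
    pos++;
  } else {
    throw new Error("Unexpected character '" + c + "' in " + desc);
  }
  return { type: name + "[]".repeat(dimensions), next: pos };
}

// Parse "(args)return" into { args: [...], returnType }.
function parseMethodDescriptor(desc) {
  if (desc[0] !== "(") throw new Error("Expected '(' at start of " + desc);
  const args = [];
  let pos = 1;
  while (desc[pos] !== ")") {
    if (pos >= desc.length) throw new Error("Missing ')' in " + desc);
    const parsed = parseFieldType(desc, pos);
    args.push(parsed.type);
    pos = parsed.next;
  }
  pos++; // skip the ')'
  let returnType;
  if (desc[pos] === "V") { // V marks a void return; only legal here
    returnType = "void";
    pos++;
  } else {
    const parsed = parseFieldType(desc, pos);
    returnType = parsed.type;
    pos = parsed.next;
  }
  if (pos !== desc.length) throw new Error("Trailing characters in " + desc);
  return { args, returnType };
}

console.log(JSON.stringify(parseMethodDescriptor("(III)I")));
// {"args":["int","int","int"],"returnType":"int"}
console.log(JSON.stringify(parseMethodDescriptor("([F)F")));
// {"args":["float[]"],"returnType":"float"}
```

Walking the string one character at a time works because every type is self-delimiting: primitives are one letter, object types end at the semicolon, and array brackets simply prefix whatever follows.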
Setting the methods table
Now that I have a way of capturing the possible distinctions in methods, I can uniquely identify method implementations. I’m going to glom the name and descriptor together with a hash symbol separator, which gives me a unique key to use for lookup in a map of methods within the class.
I can pull the bytecode for the method out of the loaded class’s method information (it’s stored as a so-called attribute of the method), and it’s just an array of bytes, which is to say an array of integer values 0–255.
So let’s give my KLClass a methods object, keyed by the combination of name and descriptor, and set it up like this, for each method in the class:
let methodId = methodName + '#' + methodDescriptor;
this.methods[methodId] = {
    "code": bytecodeArray,
    "name": methodName,
    "descriptor": new KLMethodDescriptor(methodDescriptor),
    "accessFlags": accessFlag
};
Note that I’ve repeated the name and descriptor in the method value object, because I may still need to reference these independently after a lookup. I’ve also carried through the access flags that were specified (this indicates whether the method is public, private, static, etc).
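Here is one way the whole loop might look, as a hedged sketch. It assumes the raw method infos have already had their name and descriptor resolved to strings and that the bytecode sits in an attribute named "Code" with a code array; those field names and the buildMethodsTable function are hypothetical. The descriptor is kept as a plain string here rather than wrapped in KLMethodDescriptor, and the bytecode values are placeholders.

```javascript
// Hypothetical helper: build the name#descriptor map from raw method infos.
function buildMethodsTable(rawMethods) {
  const methods = {};
  for (const raw of rawMethods) {
    // The bytecode lives in the method's "Code" attribute.
    const codeAttr = raw.attributes.find((attr) => attr.name === "Code");
    const methodId = raw.name + "#" + raw.descriptor;
    methods[methodId] = {
      // Abstract and native methods carry no Code attribute.
      code: codeAttr ? codeAttr.code : null,
      name: raw.name,
      descriptor: raw.descriptor,
      accessFlags: raw.accessFlags,
    };
  }
  return methods;
}

// Two overloads with the same name land under different keys.
const table = buildMethodsTable([
  {
    name: "addNumbers",
    descriptor: "(III)I",
    accessFlags: 0x0001,
    attributes: [{ name: "Code", code: [0x03, 0xac] }], // placeholder bytes
  },
  {
    name: "addNumbers",
    descriptor: "([F)F",
    accessFlags: 0x0001,
    attributes: [{ name: "Code", code: [0x0b, 0xae] }], // placeholder bytes
  },
]);

console.log(Object.keys(table));
// [ 'addNumbers#(III)I', 'addNumbers#([F)F' ]
```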
Because I know the JVM needs to start executing the special main method, and because I know what its descriptor must be, I can add a method onto KLClass that does a special lookup for it:
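this.findMainEntrypoint = function() {
    let mainId = "main#([Ljava/lang/String;)V";
    let method = this.methods[mainId];
    if (method) {
        // ACC_PUBLIC (0x0001) and ACC_STATIC (0x0008) must both be set.
        const PUBLIC_STATIC = 0x0001 | 0x0008;
        if ((method.accessFlags & PUBLIC_STATIC) === PUBLIC_STATIC) {
            return method;
        }
    }
    return null;
};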
The descriptor for the entry point should accept an array of String and return void. The void return is a special case in method descriptors. Void is not a type, but the V character is used to mark a void return value as seen above.
If the class has a main method that was declared as the Java standard entry point of public static void main(String[] args) then this little function will find it.
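The "public, static" check in findMainEntrypoint comes down to testing bits in the access flags. A small sketch follows, using the flag values from the class file format; the isPublicStatic helper name is made up for illustration.

```javascript
// Method access flag bits as defined by the class file format.
const ACC_PUBLIC = 0x0001;
const ACC_STATIC = 0x0008;
const ACC_FINAL = 0x0010;

// Hypothetical helper: true only when both bits are set.
function isPublicStatic(accessFlags) {
  const required = ACC_PUBLIC | ACC_STATIC;
  return (accessFlags & required) === required;
}

console.log(isPublicStatic(ACC_PUBLIC | ACC_STATIC));             // true
console.log(isPublicStatic(ACC_PUBLIC));                          // false
console.log(isPublicStatic(ACC_PUBLIC | ACC_STATIC | ACC_FINAL)); // true
```

Comparing the masked value against the full mask matters: a plain truthiness test on (accessFlags & required) would pass a method that is public but not static.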
The simple runtime class object
I’ve sketched out an object I can create within the VM that will hold basic information about a class (and can be expanded). It has enough within it now to store and look up the methods in the class.
function KLClass(rawClassData) {
    this.className = rawClassData.name;
    this.superclass = null; // Not yet set up.
    this.accessFlags = rawClassData.accessFlags;
    this.constantPool = rawClassData.constantPool;
    this.methods = {};

    // Set up methods
    for (let i = 0; i < rawClassData.methods.length; i++) {
        // ... extract code and method info ...
        // ... set into this.methods as shown above
    }

    // Utility routines
    this.methodWithNameAndDescriptor = function(name, desc) {...}
    this.findMainEntrypoint = function() { ... }; // as above
}
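The body of methodWithNameAndDescriptor is elided above. One plausible shape for it, written here as a standalone function over a methods map, is a straight key lookup. This is a guess at the idea, not the project's actual code, and the findMethod name is hypothetical.

```javascript
// Hypothetical standalone version: look up a method by name and
// descriptor string in a map keyed as name#descriptor.
function findMethod(methods, name, desc) {
  const methodId = name + "#" + desc;
  // hasOwnProperty avoids accidentally matching inherited Object keys.
  if (Object.prototype.hasOwnProperty.call(methods, methodId)) {
    return methods[methodId];
  }
  return null;
}

const sampleMethods = {
  "addNumbers#(III)I": { name: "addNumbers", code: [0x03, 0xac] }, // placeholder bytes
};

console.log(findMethod(sampleMethods, "addNumbers", "(III)I") !== null); // true
console.log(findMethod(sampleMethods, "addNumbers", "([F)F"));           // null
```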
Now my overall algorithm looks like this:
- Load the class file for HelloWorld into a data object. ✅
- Create the KLClass using the data object. ✅
- Find the main method reference, which includes the bytecode. ✅
- Start executing the bytecode!
Step four coming up! Just as an appetizer, here’s the bytecode for the entirety of the main method, which I’m now able to easily access from the class:
B2 00 07 12 0D B6 00 0F B1
What does it all mean?!?! Next time.
I have a project on GitHub called Kopiluwak that reflects the current state of the attempt; if you’re interested you can check it out. You can also contact me at bzotto at gmail dot com (as we say in the sneaky 90s way) with your feedback. All text is copyright © Ben Zotto, 2021, with all rights reserved