Building a Java Virtual Machine Part 3: Following instructions
Part 3 of a series discussing the joys and pitfalls (mostly pitfalls) of hacking together a minimal JVM in Javascript. The live code base — a very rough work in progress! — is on GitHub. Subscribe for further updates! ;-)
OK, in the last installment I set myself up to get the bytecode for my Java Hello World main method. All that I have to do now is execute it! (Easy.) In this episode, I will start to parse “by hand” the instructions for the method, so we can see how some basic concepts work and make sure we’re automating them correctly. We will also make a fun, brief aside into the world of text encoding.
Anyway, here is the full method, all nine bytes of it:
B2 00 07 12 0D B6 00 0F B1
Somehow, this ends up printing “Hello World” to the console (standard output). How? Let’s decode it.
Java bytecode is composed of variable-length instructions. The first byte of each instruction is the opcode, so there can be at most 256 of them (actually not quite even that many). Starting from the beginning, 0xB2 is the opcode for the instruction called getstatic. If we look up this instruction we find the following in Oracle’s docs:
This is a high level instruction but it’s also pretty straightforward. This instruction is going to retrieve the value of a static field on a class. The documentation suggests to us a few important things:
- The format of each instruction is different, and this one is three bytes long.
- It impacts something called the operand stack, which is where the resulting value is pushed.
- This instruction does a lot of stuff: it looks into the constant pool, resolves several things, maybe does some initialization, and is able to throw all sorts of exceptions, right from the instruction itself!
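To make the variable-length point concrete, here is a minimal sketch that splits the nine bytes into instructions. The opcode values and lengths come from the JVM instruction set; the table only covers the four opcodes in this method, and the helper name splitInstructions is my own invention, not something from the project’s code.

```javascript
// Opcode table limited to the four opcodes that appear in this method.
// "length" counts the opcode byte plus its operand bytes.
const OPCODES = {
  0xB2: { name: "getstatic", length: 3 },
  0x12: { name: "ldc", length: 2 },
  0xB6: { name: "invokevirtual", length: 3 },
  0xB1: { name: "return", length: 1 },
};

// Hypothetical helper: walk the code bytes and cut them into instructions.
function splitInstructions(code) {
  const instructions = [];
  let pc = 0;
  while (pc < code.length) {
    const info = OPCODES[code[pc]];
    if (!info) {
      throw new Error("Unknown opcode 0x" + code[pc].toString(16) + " at offset " + pc);
    }
    instructions.push({
      offset: pc,
      name: info.name,
      operands: Array.from(code.slice(pc + 1, pc + info.length)),
    });
    pc += info.length; // the next opcode starts right after this instruction
  }
  return instructions;
}

const helloMain = [0xB2, 0x00, 0x07, 0x12, 0x0D, 0xB6, 0x00, 0x0F, 0xB1];
const decoded = splitInstructions(helloMain);
```

Here decoded holds four entries, starting at offsets 0, 3, 5, and 8. A real VM needs the full opcode table, of course; this only shows why the opcode has to be read before you know where the next instruction begins.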
I’m going to eventually need to handle all of these things, but setting aside all the error conditions and edge cases, what’s the minimum I need to do?
- Take the two bytes after the opcode and combine them to create a 16-bit index into the constant pool for the current class.
- We expect to find at that constant index a “field reference” which will indicate both a class name and a field name/descriptor.
- Resolve the field. Basically, find the named class object in the VM, find the named field within that object.
- Take the value of that named field, and push it onto the operand stack.
Step one: Figure out the constant pool index
This one is easy. The two bytes following the opcode are 00 07, so we need to look at index 7. (Notice that this number is a 16-bit big-endian integer. We don’t technically need to know that because the spec tells us how to combine the two bytes anyway, but it does make the hex easier to look at.)
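Combining the two bytes can be sketched like this. The function name readU16 is just a name I picked for illustration.

```javascript
// Combine two consecutive bytes into an unsigned 16-bit big-endian value:
// the first byte is the high half, the second byte is the low half.
function readU16(bytes, offset) {
  return (bytes[offset] << 8) | bytes[offset + 1];
}

const getstaticBytes = [0xB2, 0x00, 0x07];
// Skip the opcode at offset 0 and read the two operand bytes.
const fieldRefIndex = readU16(getstaticBytes, 1);
```

For the bytes 00 07 this yields 7, and it keeps working once a class has more than 255 constants and the high byte is no longer zero.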
Step two: Look up the field reference
Stick with me here, this gets a little tricky, but it’s helpful to keep in mind what we’re really looking for here: a reference to a field, which has a name and a type, and exists in some class. So we need three pieces of information (class, name, type). I have to traverse the constant pool a bit to collect them all. It’s like a really dull adventure game.
In short, the constant pool entry #7 as indicated by the instruction is a Fieldref constant. That itself contains two indexes, referring to two other constant pool entries. One is a Class and the other is a NameAndType.
The Class entry contains yet another index, to a Utf8 type constant, which is, finally, some raw string data (encoded as UTF8 bytes, more on this shortly). This string data reads java/lang/System. There’s our class name.
The NameAndType contains two indexes, to two Utf8 type constants for (you guessed it) the name and type of the field we care about. They are, respectively, out and Ljava/io/PrintStream;.
So now we have our three bits of information that form the field reference: a class (java.lang.System), the name of a field (out), and the descriptor of that field (which tells us it’s an object reference of type java.io.PrintStream).
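The traversal can be sketched in a few lines. Everything about the data layout here is an assumption for illustration: the pool is a plain object keyed by index, the property names (classIndex, nameIndex, and so on) are made up, and only index 7 is known from the bytecode. The other index numbers are invented.

```javascript
// Hypothetical in-memory constant pool. Only entry 7 comes from the
// bytecode; the other index numbers are invented for this sketch.
const constantPool = {
  7: { tag: "Fieldref", classIndex: 8, nameAndTypeIndex: 9 },
  8: { tag: "Class", nameIndex: 10 },
  9: { tag: "NameAndType", nameIndex: 11, descriptorIndex: 12 },
  10: { tag: "Utf8", value: "java/lang/System" },
  11: { tag: "Utf8", value: "out" },
  12: { tag: "Utf8", value: "Ljava/io/PrintStream;" },
};

// Fetch an entry and check that it is the kind of constant we expect.
function getEntry(pool, index, expectedTag) {
  const entry = pool[index];
  if (!entry || entry.tag !== expectedTag) {
    throw new Error("Expected " + expectedTag + " at constant pool index " + index);
  }
  return entry;
}

// Follow Fieldref -> Class -> Utf8 and Fieldref -> NameAndType -> Utf8 x2.
function lookupFieldRef(pool, index) {
  const fieldRef = getEntry(pool, index, "Fieldref");
  const cls = getEntry(pool, fieldRef.classIndex, "Class");
  const nameAndType = getEntry(pool, fieldRef.nameAndTypeIndex, "NameAndType");
  return {
    className: getEntry(pool, cls.nameIndex, "Utf8").value,
    fieldName: getEntry(pool, nameAndType.nameIndex, "Utf8").value,
    descriptor: getEntry(pool, nameAndType.descriptorIndex, "Utf8").value,
  };
}

const ref = lookupFieldRef(constantPool, 7);
```

With this pool, ref carries the class name java/lang/System, the field name out, and the descriptor Ljava/io/PrintStream;. Checking the tag at each hop is cheap and catches a wrong index early instead of producing garbage three lookups later.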
Sidebar: dots and slashes
You’ll notice that in the strings I retrieved from the constant pool, the class names use slashes instead of the traditional dot characters as namespace separators. That was confusing to me at first, but it’s just a historical quirk in the class file format that never got corrected. Basically, you just convert these to dots as you extract them.
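As a minimal sketch of that conversion (the function names are hypothetical, and the descriptor helper only handles the simple object form L...; and nothing else):

```javascript
// Convert a class-file internal name such as "java/lang/System"
// into the familiar dotted form.
function internalNameToDotted(internalName) {
  return internalName.replace(/\//g, ".");
}

// Handle only object descriptors of the form "Lsome/class/Name;".
// Primitive and array descriptors are deliberately left out of this sketch.
function objectDescriptorToClassName(descriptor) {
  if (!descriptor.startsWith("L") || !descriptor.endsWith(";")) {
    throw new Error("Not a plain object descriptor: " + descriptor);
  }
  return internalNameToDotted(descriptor.slice(1, -1));
}

const systemName = internalNameToDotted("java/lang/System");
const outType = objectDescriptorToClassName("Ljava/io/PrintStream;");
```

This gives java.lang.System and java.io.PrintStream respectively.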
A Fieldref in graphical form
Because I found the multiple indirection super confusing at first, it’s useful to see it in diagram form:
See, that’s clearer! They’ve set up the constant pool to just make a lot of cross references. It’s sort of silly in this case because the class is tiny and only does one thing but you can imagine a large and complicated class needing to reference the same strings or classes over and over in different ways.
Step three: Resolve the field
So where was I? Oh, right, the getstatic instruction! I followed the index given by the instruction through the jungle of constant entries and emerged unscathed holding the complete reference to the static field I need to go get the value from! There is a class called (java.lang.)System, on which I’m told to expect to find some static field called out, which will contain a reference to a (java.io.)PrintStream. And this sounds familiar, right? Remember that single operative line of code in the HelloWorld.java program:
System.out.println("Hello World");
See what happened? The System.out part got compiled into this getstatic instruction, and it’s going to grab us an instance of type PrintStream to use next.
OK, so all that I need to do is go find the System class, and get the out value that’s set on it.
Wait a minute. I only have this one class, the HelloWorld class. Crap. What is the System class and where does it come from? Well, the System class is part of the standard JDK, or Java Development Kit. It’s a framework class, in other words: part of a bundle of code that is expected to be accessible to the JVM even though it’s not “bundled” with HelloWorld itself.
That makes this field reference an external reference, and my VM clearly already has presented me with a new design requirement. Without being able to do something to retrieve this field, I can’t even complete the first instruction!
Oracle ships the JDK as a big opaque chunk alongside their standard JVM installation. They also sponsor the OpenJDK project on GitHub that has the code for this stuff, but I don’t want to get too far ahead of myself, because the JDK is massive and complicated. I just want to print Hello World!
So: I’m going to fake it. And next time we’ll do that and finish steps 3 and 4! But before that…
Intermission: UTF8 and UTF16 and Strings, oh my!
Up above, we found constant string data embedded in the constant pool as UTF8 encoded data. What is that? Well, all strings need to be encoded into bytes in some form when you store them, and there have historically been loads of different ways to do this. If this topic is unfamiliar to you, please stop right now and go read Joel Spolsky’s highly entertaining and informative backgrounder.
UTF8 is my favorite string encoding, and in fact it’s one of my favorite bits of programming design of any kind. The sort of system that is so elegant it feels more discovered than invented. (But it was invented, in a diner in New Jersey in 1992.)
All strings I’ll be dealing with in the JVM are Unicode strings (as in, they can represent the full range of Unicode characters) but have different encodings:
- Strings stored in the class file are a series of bytes in the UTF8 encoding.
- Strings within Java-land (i.e., objects of the java.lang.String class) are internally kept in a UTF16 encoding.
- Finally, strings within the VM itself are, of course, just normal Javascript strings. And these are actually (also) stored in a UTF16 encoding. This isn’t usually important to be aware of, unless you need to convert strings to and from native Javascript strings.
In Javascript, the String.fromCharCode function can be used to build a native Javascript string from constituent characters. I might be tempted to write a conversion function like this:
// INCORRECT
function stringFromUtf8BytesButDumbly(bytes) {
  let jsstring = "";
  for (let i = 0; i < bytes.length; i++) {
    // Treats each UTF8 byte as if it were a whole UTF16 code unit.
    jsstring += String.fromCharCode(bytes[i]);
  }
  return jsstring;
}
And strangely enough, for the UTF8 strings we have encountered so far, it would work! But this is only a lucky coincidence — the standard ASCII character set looks the same in ASCII, UTF8, and UTF16 encodings, by design, and the strings we’ve seen so far are all in that range.
But it’s not true that all of the strings we’ll find in the class file are limited to ASCII characters! It’s also important to understand that the parameter to fromCharCode is a UTF16 code unit. This snippet will break for any character outside the ASCII range because it will neither interpret the UTF8 input correctly nor transform it correctly into semantically equivalent UTF16 code units.
Doing this correctly is nontrivial in the sense that you need to do some actual bitwise processing to deal with the inbound and outbound surrogate pairs, so it requires some real logic. I found an excellent blog post and Javascript function by Rogue Amoeba that does the conversion correctly. That’s what I have used to get strings out of the constant pool accurately.
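To give a flavor of the bitwise work involved, here is a minimal sketch of my own. It is not the function from the post mentioned above. It assumes the “modified UTF8” variant that the JVM spec defines for class file strings, where every sequence is one, two, or three bytes long and characters beyond the 16-bit range are stored as two three-byte surrogate halves. Under that assumption, each decoded sequence is already exactly one UTF16 code unit. The function name is hypothetical.

```javascript
// Sketch: decode class-file (modified UTF8) bytes into a Javascript string.
// Assumes only 1-, 2-, and 3-byte sequences occur, as in the class file format.
function stringFromClassFileUtf8(bytes) {
  // Read a continuation byte (10xxxxxx) and return its six payload bits.
  function continuation(index) {
    const b = bytes[index];
    if ((b & 0xC0) !== 0x80) {
      throw new Error("Bad continuation byte at offset " + index);
    }
    return b & 0x3F;
  }

  const units = [];
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if ((b0 & 0x80) === 0x00) {
      // 0xxxxxxx: plain ASCII, one byte.
      units.push(b0);
      i += 1;
    } else if ((b0 & 0xE0) === 0xC0) {
      // 110xxxxx 10xxxxxx: two bytes, 11 payload bits.
      units.push(((b0 & 0x1F) << 6) | continuation(i + 1));
      i += 2;
    } else if ((b0 & 0xF0) === 0xE0) {
      // 1110xxxx 10xxxxxx 10xxxxxx: three bytes, 16 payload bits.
      units.push(((b0 & 0x0F) << 12) | (continuation(i + 1) << 6) | continuation(i + 2));
      i += 3;
    } else {
      throw new Error("Unexpected lead byte at offset " + i);
    }
  }
  // Each decoded value is one UTF16 code unit, which is what fromCharCode takes.
  return units.map((u) => String.fromCharCode(u)).join("");
}

const ascii = stringFromClassFileUtf8([0x6F, 0x75, 0x74]);
const accented = stringFromClassFileUtf8([0xC3, 0xA9]);
```

Here ascii is the string out, and accented is the single character é (U+00E9), which the byte-at-a-time version above would have turned into two wrong characters. A decoder for standard UTF8 additionally has to handle four-byte sequences and split them into surrogate pairs itself.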
We’ll get to the Java-land UTF16 string handling later on.
I have a project on GitHub called Kopiluwak that reflects the current state of the attempt; if you’re interested you can check it out. You are encouraged to contact me at bzotto at gmail dot com (as we say in the sneaky 90s way) with your feedback. All text is copyright © Ben Zotto, 2021, with all rights reserved