Building a Java Virtual Machine Part 4: Fake Objects, Fake Machines.

Part 4 of a series discussing the joys and pitfalls (mostly pitfalls) of hacking together a minimal JVM in JavaScript. The live code base (a very rough work in progress!) is on GitHub. Subscribe for further updates! ;-)

Last time we got only halfway through our very first Java instruction (getstatic) but don’t feel bad! I simply had the bad fortune of writing a program whose very first instruction is one of the more complex ones in the whole JVM. Oh, sure, it could have been some sort of addition or subtraction or something easy, but the cool thing is that once we get through this one, a lot of others will seem simpler.

Getting static! (get it?). Boston Museum of Science ❤. Photo: Brendan Dolan-Gavitt/Flickr (CC BY-SA 2.0)

Let’s refresh what we’re trying to do to execute getstatic:

1. Create the 16-bit index into the constant pool for the current class.

2. Build the “field reference” of the class and field names.

3. Resolve the field. Basically, find the named class object in the VM, then find the named field within that object.

4. Take the value of that named field, and push it onto the operand stack.

Steps 1 and 2: already done. For step three, we know we need to resolve java.lang.System’s out field, and we know that’s an external reference outside the single class (HelloWorld) that we loaded. We discussed how the System class lives inside Oracle’s JDK library, and eventually we’ll need to load the real stuff, but for now I am going to cheat.

I’ll introduce a global map of loaded classes in my JVM:

let KLLoadedClasses = {};

This will hold a map of class names to KLClass instances. So I’ll add the first one:

let klclass = LoadClassFile(myInputClassFileBytes);
KLLoadedClasses[klclass.className] = klclass;

And now I’ll just, uh, cook up some mock objects to shove in there too. Here’s a faked-up System class, created as part of the initialization work at the same time as the HelloWorld.class data gets loaded:

let systemRawFakeData = new KLRawClassData(
    "java.lang.System", // class name
    "java.lang.Object", // superclass name
    0,                  // access flags (worry about this later)
    [],                 // constant pool (empty)
    [],                 // no interfaces
    [                   // fields!
        {
            "name": "out",
            "descriptor": "Ljava.io.PrintStream;",
            "access": ACC_PUBLIC | ACC_STATIC
        }
    ],
    [],                 // no methods
    []                  // no attributes
);
let systemClass = new KLClass(systemRawFakeData);
KLLoadedClasses[systemClass.className] = systemClass;

I’ve barely even begun to code and this is already a bit cumbersome, but since I built the KLClass object to be constructed with loaded raw data, it’s simplest to fake the loaded raw data here. There’s really nothing in here, except we are specifically saying that this class has one field on it called out and that the descriptor (type) of that field is going to be a reference to an object of type java.io.PrintStream. We also note that it is a static field. This accords with the docs, and with what the compiled bytecode expects to find.

My original sketch for KLClass didn’t include fields, just methods, so I need to add capacity to deal with fields. Two things are important to do here:

  1. Keep the field descriptors around for all fields of the class.
  2. For static fields (vs instance ones), I’m going to keep the actual contents of those fields within the class object.

So I’ll revise it to look something like this:

function KLClass(rawClassData) {
    this.className = rawClassData.className;
    this.superclass = null; // Not yet setup.
    this.accessFlags = rawClassData.accessFlags;
    this.constantPool = rawClassData.constantPool;
    this.fields = {};    // Field descriptors, keyed by field name
    this.methods = {};
    this.fieldVals = {}; // Values for static fields

    // Setup fields
    for (let i = 0; i < rawClassData.fields.length; i++) {
        let rawField = rawClassData.fields[i];
        this.fields[rawField.name] = rawField;
        // Static fields live on the class itself; start them out as null.
        if ((rawField.access & ACC_STATIC) !== 0) {
            this.fieldVals[rawField.name] = null;
        }
    }
    // Setup methods...
    // Utility routines...
}

OK!! Now we’ve cooked up a stand-in for java.lang.System, and it’s got an out field. Like we want.

Working back towards our goal, let’s add some simple functions to resolve a field (and thus resolve a class):

function KLResolveClass(className) {
    let klclass = KLLoadedClasses[className];
    if (klclass) {
        return klclass;
    }
    // We will eventually need to dynamically load things...
    return null;
}

function KLGetStaticFieldValue(className, fieldName) {
    let klclass = KLResolveClass(className);
    if (klclass) {
        return klclass.fieldVals[fieldName];
    }
    return null;
}
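
As a quick sanity check, here is a self-contained sketch of how those two lookups behave. The registry below holds a plain object rather than a real KLClass instance, and the class name java.lang.Missing is made up for illustration; both are assumptions, not part of the actual code base.

```javascript
// Stand-in registry: a plain object instead of a real KLClass instance (assumption).
let KLLoadedClasses = {
    "java.lang.System": { className: "java.lang.System", fieldVals: { out: null } }
};

function KLResolveClass(className) {
    let klclass = KLLoadedClasses[className];
    return klclass ? klclass : null;
}

function KLGetStaticFieldValue(className, fieldName) {
    let klclass = KLResolveClass(className);
    if (klclass) {
        return klclass.fieldVals[fieldName];
    }
    return null;
}

// The field exists but was initialized to null.
let outValue = KLGetStaticFieldValue("java.lang.System", "out");
// The class was never loaded, so the lookup falls through to null as well.
let missingValue = KLGetStaticFieldValue("java.lang.Missing", "out");
```

Both calls come back null, so a caller can't tell "field is null" apart from "class not found." A real JVM reports failed resolution by throwing (NoClassDefFoundError, NoSuchFieldError), which is probably where this code will need to go eventually.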

And now I have enough logic to string together our full getstatic operation:

let fieldRefIndex = ((byte2 << 8) | byte3) >>> 0;
// Eventually some function will produce this from fieldRefIndex (we did it by hand):
let fieldClassAndName = { className: "java.lang.System", fieldName: "out" };
let value = KLGetStaticFieldValue(
    fieldClassAndName.className,
    fieldClassAndName.fieldName
);
...

Recall that the fieldRefIndex is produced by combining the second and third bytes of the getstatic instruction as shown in the last article, and that we figured out the field class and name manually from the constant pool, but of course, in reality we’ll have some function for that, too.
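
For the curious, that "some function" might look roughly like the sketch below. Everything about the constant pool's in-memory shape here (the tag strings, the classIndex/nameAndTypeIndex/nameIndex/descriptorIndex property names) is an assumption for illustration, as are the helper names indexFromBytes and fieldRefFromPool. The indices are invented too and are not the ones in the real HelloWorld.class. What does come from the class file format is the chain itself: a Fieldref entry points to a Class entry and a NameAndType entry, which in turn point to Utf8 strings.

```javascript
// Hypothetical constant pool layout (assumption): an array of tagged entries.
const pool = [
    null, // index 0 is unused; constant pool indices start at 1
    { tag: "Fieldref", classIndex: 2, nameAndTypeIndex: 3 },
    { tag: "Class", nameIndex: 4 },
    { tag: "NameAndType", nameIndex: 5, descriptorIndex: 6 },
    { tag: "Utf8", value: "java/lang/System" },
    { tag: "Utf8", value: "out" },
    { tag: "Utf8", value: "Ljava/io/PrintStream;" }
];

// Combine the two operand bytes (big-endian) into a 16-bit index.
function indexFromBytes(highByte, lowByte) {
    return ((highByte << 8) | lowByte) >>> 0;
}

// Hypothetical helper: follow a Fieldref entry to its class, field name and descriptor.
function fieldRefFromPool(constantPool, index) {
    const fieldRef = constantPool[index];
    const classEntry = constantPool[fieldRef.classIndex];
    const nameAndType = constantPool[fieldRef.nameAndTypeIndex];
    return {
        // Class files store names with slashes; convert to the dotted form used here.
        className: constantPool[classEntry.nameIndex].value.replace(/\//g, "."),
        fieldName: constantPool[nameAndType.nameIndex].value,
        // The descriptor is returned exactly as stored.
        descriptor: constantPool[nameAndType.descriptorIndex].value
    };
}

const exampleIndex = indexFromBytes(0x00, 0x01);
const exampleRef = fieldRefFromPool(pool, exampleIndex);
```

With this made-up pool, exampleRef.className and exampleRef.fieldName are exactly the two strings that KLGetStaticFieldValue needs.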

Anyway, nice!! value should now contain the value of System.out which is… let me look back at that code… well, it’s null because that’s what we set for all static field values in the setup part of KLClass. But we don’t have to worry about that right now, because we have completed the requirements of the instruction! 🥳🥳🥳🥳

…Or have we? 🤨🤨🤨

The last thing we need to do is push that value onto the operand stack.

Do we know what that is yet? Well, no, we don’t, and we can’t go farther without it. And so we have to talk about the whole dang execution architecture of the Java Virtual Machine itself, which we have, to this point, conveniently just sort of ignored.

Refresher: How does a “real” computer work?

This is a suuuuuper generic diagram (by a noted artist) showing how a normal computer, like an x86 PC or an ARM smartphone, works. The CPU has a bunch of registers inside of it, which are like scratch workspace, including a special “program counter” register that contains the memory address of the currently executing instruction. It also has a “stack pointer” that points to the top of the current stack in memory.

Over on the right side, you have your RAM (memory) which contains the code, static and dynamic data, and the stack.

The computer works by fetching code (instructions) from RAM, and executing those instructions: loading data into registers from RAM, doing stuff with the data, and storing data back out to RAM. That’s like the whole thing right there. The CPU helps you maintain the stack as a convenience to you, and that stack is one thing shared by all function calls.

Notice that I haven’t mentioned any particular programming language here. That’s because a normal computer doesn’t care. All programming languages get compiled down into the basic native instruction set of the CPU, whether you’re writing assembler code (!), C, or C++, or Objective-C or Swift, or Visual Basic Excel Macros, or whatever. High-level constructs like “classes” and “enums” and “buffers” and “arrays” and whatnot are just 100% not the computer architecture’s concern. The compiler you’re using emits (and links to) all the magic glue code required to make all those fancy things work, and the computer itself is very low-level and quite simple.

This is — and I cannot emphasize this enough — not what the JVM is like.

The JVM is, in a sense, the specification for a hypothetical, fictional computer architecture. But it moves the abstraction layer waaaay higher, and although parts of the JVM feel like a real sort of computer, it is very savvy about the Java language and its execution needs. We touched on all of this in the first installment of this series but I think it’s useful to revisit it in more detail now that we have seen some of the guts of JVM bytecode.

The execution environment of the JVM

Another masterpiece.

Here’s what the guts of the JVM I’m building really look like! There’s no CPU and RAM, as such, and anyway the whole thing lives in software to begin with so the distinction is meaningless. There is no “raw” empty memory; all of the live objects are sort of floating around in space there at the bottom containing their instance data. We also see the loaded classes floating around in the upper-right. That’s our KLLoadedClasses map full of KLClass instances: each class holds the code for its methods and static data (as we’ve seen above).

This machine knows all about the high level constructs: objects, fields, arrays, the whole thing.

Now, the rest of the diagram is what I’ll call the execution environment. There is a big stack (at left), but this isn’t the same as the “stack” in a C program on an Intel chip. Each frame on the stack is one self-contained method invocation. And by self-contained, I mean that each stack frame has its very own internal stack; that’s the operand stack! It’s the workspace for that method’s instructions.

Now is a good time to mention that the JVM is what you would describe in Computer-Sciencey terms as a “stack machine.” This means that its instructions that do the real work generally operate on values at the top of the operand stack, and it has other instructions to push and pop values there. This is in contrast to x86, ARM, and other normal computers, which you might call “register machines” and which generally operate on values in their registers. (Stack machine designs, it should be noted, are rare in the physical hardware world.)
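
To make "stack machine" concrete, here is a toy sketch that runs the bytecode for 2 + 3. The opcode values for iconst_2, iconst_3 and iadd are the real ones from the JVM specification; the runOnOperandStack function is a made-up mini-interpreter that handles only those three instructions and is not how the real thing will be structured.

```javascript
// Opcode values from the JVM specification.
const ICONST_2 = 0x05; // push the int constant 2
const ICONST_3 = 0x06; // push the int constant 3
const IADD = 0x60;     // pop two ints, push their sum

// Hypothetical mini-interpreter: handles only the three opcodes above.
function runOnOperandStack(bytecode) {
    const operandStack = [];
    for (let pc = 0; pc < bytecode.length; pc++) {
        switch (bytecode[pc]) {
            case ICONST_2:
                operandStack.push(2);
                break;
            case ICONST_3:
                operandStack.push(3);
                break;
            case IADD: {
                const b = operandStack.pop();
                const a = operandStack.pop();
                operandStack.push((a + b) | 0); // wrap to a 32-bit int, like Java
                break;
            }
            default:
                throw new Error("Unsupported opcode: " + bytecode[pc]);
        }
    }
    return operandStack;
}

const stackAfterAdd = runOnOperandStack([ICONST_2, ICONST_3, IADD]);
```

Notice that iadd names no operands at all. It just assumes the two values it needs are sitting on top of the operand stack, and it leaves its result there for whoever comes next.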

OK, so our method frame contains its own operand stack. It also contains local variables, a numbered list (array) of them. This is a misleading name and doesn’t reflect local variables as you see in code. They act instead something like traditional registers, holding data that can get pushed onto the stack as needed. Arguments to methods are “passed” in these local variables.

Finally, the method’s frame contains a program counter that is relative to that method only. It’s simply the zero-based index into the bytecode of the method. All methods start their pc at zero.

In a traditional CPU, when a function calls another function, a bunch of CPU state ends up getting pushed onto the one big stack, including the return address and other stuff, because there’s only one stack and only one program counter. In the JVM, the whole method frame just stays frozen in time where it is when you call another method, with its operand stack and local variables and program counter remaining unchanged until you return to continue that method.
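
Here is a small sketch of that "frozen in time" idea. The makeFrame helper, the method names and the values are all invented for illustration, and the "invoke" and "return" steps are done by hand rather than by decoding real instructions.

```javascript
// Hypothetical frame shape, loosely following the description above.
function makeFrame(methodName, args) {
    return {
        methodName: methodName,
        pc: 0,
        operandStack: [],
        localVariables: args.slice() // arguments arrive in the local variable slots
    };
}

const frameStack = [];

// Caller: a made-up method that is partway through its bytecode.
const callerFrame = makeFrame("main", []);
callerFrame.pc = 7;
callerFrame.operandStack.push(40, 2); // two arguments waiting to be passed
frameStack.push(callerFrame);

// "Invoke": pop the arguments off the caller's operand stack into a fresh frame.
const arg1 = callerFrame.operandStack.pop();
const arg0 = callerFrame.operandStack.pop();
const calleeFrame = makeFrame("add", [arg0, arg1]);
frameStack.push(calleeFrame);

// The callee works entirely within its own frame.
const result = calleeFrame.localVariables[0] + calleeFrame.localVariables[1];

// "Return": discard the callee's frame and hand the result to the caller.
frameStack.pop();
callerFrame.operandStack.push(result);
```

Throughout the call, callerFrame.pc stays at 7 and nothing of the callee's ever lands in the caller's frame except the return value. (For instance methods, local variable slot 0 holds the this reference and the arguments follow it; the sketch models a static call.)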

Let’s sketch out a method frame

In order to support the execution environment above, I’ll need a method frame object. It can be pretty simple. Let’s try this:

function KLMethodFrame() {
    this.pc = 0;
    this.operandStack = [];
    this.localVariables = [];

    this.method = ...; // reference to this frame's method.
}

Now we’ll keep a global (for now) big stack (as at left in the diagram above):

let KLMethodStack = [];

Now I have the final piece of the puzzle to [huge gulp of air] complete that first getstatic instruction.

A stack. Photo: IHOP. You can order these online! (This is not sponcon. They’re good pancakes.)

Push it onto the stack, baby

Recall from above that we (once again) found the external field reference that we wanted, then we resolved that class and that field, and then we got the value of the field, and now,

now,

we can push that value onto the operand stack of the current method frame.

Boom. This, of course, assumes that the frame we push onto is the frame of the method we are executing. And it will, itself, exist on the KLMethodStack. All fine to assume for our first, bespoke, hand-decoded, hand-executed instruction.
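
Put together, the tail end of getstatic might look something like the sketch below. The variable name currentFrame, the method parameter on KLMethodFrame, and the plain-object stand-in for the System class are assumptions made so the snippet runs on its own; they are not taken from the actual code base.

```javascript
// Stand-ins for the pieces built in this article (assumptions, simplified).
let KLMethodStack = [];
let KLLoadedClasses = {
    "java.lang.System": { fieldVals: { out: null } }
};

function KLMethodFrame(method) {
    this.pc = 0;
    this.operandStack = [];
    this.localVariables = [];
    this.method = method; // reference to this frame's method
}

// Hypothetical frame for HelloWorld's main method.
let currentFrame = new KLMethodFrame({ name: "main" });
KLMethodStack.push(currentFrame);

// Look up the static value, push it, then step past the instruction.
let value = KLLoadedClasses["java.lang.System"].fieldVals["out"];
currentFrame.operandStack.push(value);
currentFrame.pc += 3; // getstatic is one opcode byte plus two index bytes
```

The pushed value is still null, since the fake System class has no real PrintStream behind it yet, but the mechanics are all there.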

But we can’t do everything by hand. How does that all fit in with the execution of all the other instructions? We’ll do that next time!

I have a project on GitHub called Kopiluwak that reflects the current state of the attempt; if you’re interested you can check it out. You can also contact me at bzotto at gmail dot com (as we say in the sneaky 90s way) with your feedback. All text is copyright © Ben Zotto, 2021, with all rights reserved.
