Building a Java Virtual Machine: How hard could it be?

Part 1 of a series discussing the joys and pitfalls (mostly pitfalls) of hacking together a minimal JVM in JavaScript. The live code base (a very rough work in progress!) is on GitHub. Subscribe for further updates! ;-)

This past summer, I had a conversation with a friend about his need to run small Java programs in a web page. It was a pandemic summer and I had some time, so this made me wonder how hard it would be to put together a small, simple Java Virtual Machine (JVM) in JavaScript. I have done some hobby work on CPU emulation, but… I haven’t actually worked in Java or JavaScript since about 2010. So ha ha ha, I was completely unqualified.

My goal was to get Hello World to run, which is to say, this bit of code:

class HelloWorld extends Object {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

If you compile and run this HelloWorld.java file in a terminal, you get this:

% javac HelloWorld.java 
% java HelloWorld
Hello World
%

The first step invokes the compiler (javac) which produces a file called HelloWorld.class. The second step executes the compiled class’s main method.

I was interested in the second half, executing the compiled class file. I wasn’t sure how the output part was going to work but figured I’d plow in and see how far I got.

So I opened up a scratch HTML page and a scratch JS file and figured I’d try to load this class file.

[Image: Buster says Java is where it’s at]

What are we working with here?

Might as well start with a look at this HelloWorld.class file. It’s 425 bytes long and it looks like this:

CAFEBABE 0000003B 001D0A00 02000307 00040C00 05000601 00106A61 76612F6C 616E672F 4F626A65 63740100 063C696E 69743E01 00032829 56090008 00090700 0A0C000B 000C0100 106A6176 612F6C61 6E672F53 79737465 6D010003 6F757401 00154C6A 6176612F 696F2F50 72696E74 53747265 616D3B08 000E0100 0B48656C 6C6F2057 6F726C64 0A001000 11070012 0C001300 14010013 6A617661 2F696F2F 5072696E 74537472 65616D01 00077072 696E746C 6E010015 284C6A61 76612F6C 616E672F 53747269 6E673B29 56070016 01000A48 656C6C6F 576F726C 64010004 436F6465 01000F4C 696E654E 756D6265 72546162 6C650100 046D6169 6E010016 285B4C6A 6176612F 6C616E67 2F537472 696E673B 29560100 0A536F75 72636546 696C6501 000F4865 6C6C6F57 6F726C64 2E6A6176 61002000 15000200 00000000 02000000 05000600 01001700 00001D00 01000100 0000052A B70001B1 00000001 00180000 00060001 00000001 00090019 001A0001 00170000 00250002 00010000 0009B200 07120DB6 000FB100 00000100 18000000 0A000200 00000300 08000400 01001B00 00000200 1C

Somehow I have to process this file and bring it to life. The first thing I notice is that first 4-byte hex value, CAFEBABE. Now I remember this bit of nerd trivia — this is the magic identifier for all Java class files. Hilarious!

Here’s what the file looks like in HexFiend, showing both the hex values and ASCII equivalents over on the right:

[Image: HelloWorld.class open in HexFiend, hex values on the left and ASCII on the right]

Most of the file seems to be taken up with string data. I can see the names of the class and method that I wrote in there. I can also see references to the objects and methods that are invoked from the code. Looks like maybe some debug stuff (“SourceFile”, “LineNumberTable”) too. The bottom few lines don’t look like text; maybe that’s the code part?

Well, there’s no need to reverse-engineer the contents of this file. Oracle (née Sun) has the whole JVM specification online, so I go right to the class files part of it which has the complete documentation of this file format.

Here’s the overall structure of a class file:

ClassFile {
    u4             magic;
    u2             minor_version;
    u2             major_version;
    u2             constant_pool_count;
    cp_info        constant_pool[constant_pool_count-1];
    u2             access_flags;
    u2             this_class;
    u2             super_class;
    u2             interfaces_count;
    u2             interfaces[interfaces_count];
    u2             fields_count;
    field_info     fields[fields_count];
    u2             methods_count;
    method_info    methods[methods_count];
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

There’s that 4-byte magic number at the top, then some version stuff, a thing called the constant pool, some identifying stuff, and then what look like tables of interfaces, fields, methods, and attributes. OK. I don’t see anything that says “code” here, but we can figure it out. Presumably that’s part of the methods.
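Here’s a minimal sketch of reading those first few fields, assuming the file’s bytes are already in memory. The names hexToBytes, ByteReader, and readHeader are made up for illustration and are not taken from the actual project. The u2 and u4 types in the spec are unsigned big-endian values, which happens to be DataView’s default byte order.

```javascript
// Convert a hex dump (whitespace allowed) into a byte array.
function hexToBytes(hex) {
  const clean = hex.replace(/\s+/g, "");
  const bytes = new Uint8Array(clean.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(clean.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// Hypothetical helper: a cursor that reads big-endian unsigned values in order.
class ByteReader {
  constructor(bytes) {
    this.view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    this.offset = 0;
  }
  u2() {
    const value = this.view.getUint16(this.offset); // big-endian by default
    this.offset += 2;
    return value;
  }
  u4() {
    const value = this.view.getUint32(this.offset);
    this.offset += 4;
    return value;
  }
}

// Read the fixed-size fields that come before the constant pool.
function readHeader(reader) {
  const magic = reader.u4();
  if (magic !== 0xCAFEBABE) {
    throw new Error("Not a Java class file");
  }
  const minorVersion = reader.u2();
  const majorVersion = reader.u2();
  const constantPoolCount = reader.u2();
  return { magic, minorVersion, majorVersion, constantPoolCount };
}

// The first 10 bytes of the HelloWorld.class dump shown earlier.
const header = readHeader(new ByteReader(hexToBytes("CAFEBABE 0000003B 001D")));
console.log(header.majorVersion, header.constantPoolCount); // 59 29
```

Major version 59 corresponds to Java 15. A constant_pool_count of 29 means 28 actual entries, numbered 1 through 28, which is what the constant_pool[constant_pool_count-1] line in the structure is saying.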

Now is a good time to remember what a Java class is, exactly.

Java is an object-oriented programming (OOP) language. It wasn’t the first by a mile, but it was commercially successful and really cranked the dial pretty far over towards “pure” OOP. Whereas C++ has a strong lineage back to C, and allows you to mess around with memory and write procedural functions if you want to, Java is all classes and methods and polymorphism and encapsulation, oh my! There’s no such thing as free-floating Java code that’s not inside a method, inside of a class. No naked variables in the global namespace. And with the exception of primitive types (like integers), all values in the language are object references to instances of classes.

So. A class is the definition of an object. Like a stamp pattern to make individual instances (objects) of that type. A class contains methods, which are functions specific to that class and that operate on data which lives on a particular instance. The data that lives within an object is stored in fields (instance variables). Both fields and methods can also be static, meaning they are not part of any particular instance but accessed through the class itself. Our main method, for example, is a static method of the class HelloWorld.

Each class (with one exception) has a superclass. This is a parent class from which it inherits fields and methods, in a chain of inheritance all the way up to Object, which is the “root” of the class hierarchy and is the one class that doesn’t have a superclass. A class can also optionally implement one or more interfaces, which just means that it advertises support for specific sets of methods declared elsewhere.

This is not a comprehensive description of the Java type system, but it’s enough to explain why the basic unit of a compiled Java program is a binary encoding of a single class. There’s no “global” code or storage in Java, so all programs of any complexity are just a bunch of class files. My program consists of only one, which should contain everything I need to know to bring the class HelloWorld to life. Let’s look at it again.

Back to the bits: a class file in detail

Here’s how HelloWorld.class breaks down:

[Image: byte-by-byte breakdown of HelloWorld.class into its sections]

The first few bytes are the magic number and some file version numbers. The lion’s share of the data is taken up by the constant pool, which corresponds to all the string data, plus other stuff. We’ll get back to that. The few bytes of metadata indicate which class this is, and what its superclass is. There are no interfaces or fields contained in this file (the short string of zeroes after the metadata indicates zero-length tables), and that makes sense because I didn’t say that it implements any interfaces, and I also didn’t create any fields within the class. After those empty tables follows the methods section, then a small attributes table, and that’s the whole thing.

What’s a constant pool?

[Image: Hello world!]

The constant pool holds bits of constant data: string literals used in your code (our “Hello World” string is in there) as well as others used internally for the loading and execution of the class (you can see the name of the “main” method in there, too). There can be non-string data in there too, for initial values of variables, etc. Basically it’s just a sequential table, and it’s used both at load time (the name of the loaded class and its superclass are indexes into the constant pool, for example) and at runtime (loading constant values, references to other classes and methods). It’s a “pool” because the same index can be reused to pool together multiple references to the same piece of data.

OK, so I parse through the constant pool. Each entry has a tag that identifies what kind of constant it is, and these different kinds have different formats. To begin with, it’s enough to just add handling for each kind that I hit in the one class file I care about. Eventually I’ll handle all of them.
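As a rough sketch of that “handle only what this file needs” approach, here’s a parser for the six tag values that actually occur in HelloWorld.class, with layouts as given in the class file spec. The function name and the shape of the returned objects are my own invention for illustration, not necessarily how the real project does it.

```javascript
// The full HelloWorld.class hex dump from earlier in the post.
const CLASS_HEX = `
CAFEBABE 0000003B 001D0A00 02000307 00040C00 05000601 00106A61 76612F6C
616E672F 4F626A65 63740100 063C696E 69743E01 00032829 56090008 00090700
0A0C000B 000C0100 106A6176 612F6C61 6E672F53 79737465 6D010003 6F757401
00154C6A 6176612F 696F2F50 72696E74 53747265 616D3B08 000E0100 0B48656C
6C6F2057 6F726C64 0A001000 11070012 0C001300 14010013 6A617661 2F696F2F
5072696E 74537472 65616D01 00077072 696E746C 6E010015 284C6A61 76612F6C
616E672F 53747269 6E673B29 56070016 01000A48 656C6C6F 576F726C 64010004
436F6465 01000F4C 696E654E 756D6265 72546162 6C650100 046D6169 6E010016
285B4C6A 6176612F 6C616E67 2F537472 696E673B 29560100 0A536F75 72636546
696C6501 000F4865 6C6C6F57 6F726C64 2E6A6176 61002000 15000200 00000000
02000000 05000600 01001700 00001D00 01000100 0000052A B70001B1 00000001
00180000 00060001 00000001 00090019 001A0001 00170000 00250002 00010000
0009B200 07120DB6 000FB100 00000100 18000000 0A000200 00000300 08000400
01001B00 00000200 1C
`;

// Strip whitespace, then read two hex digits per byte.
const classBytes = Uint8Array.from(
  CLASS_HEX.replace(/\s+/g, "").match(/../g),
  (pair) => parseInt(pair, 16)
);

// Hypothetical function: handles only the tags that occur in this one file.
function parseConstantPool(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const count = view.getUint16(8); // constant_pool_count follows magic + versions
  const pool = [null];             // entries are numbered from 1; slot 0 is unused
  let pos = 10;
  for (let index = 1; index < count; index++) {
    const tag = view.getUint8(pos);
    switch (tag) {
      case 1: { // CONSTANT_Utf8: u2 length, then that many bytes of text
        const length = view.getUint16(pos + 1);
        const raw = bytes.subarray(pos + 3, pos + 3 + length);
        // Simplification: class files use "modified UTF-8". Plain UTF-8
        // decoding gives the same result for the ASCII strings in this file.
        pool.push({ kind: "Utf8", text: new TextDecoder().decode(raw) });
        pos += 3 + length;
        break;
      }
      case 7: // CONSTANT_Class: u2 name_index
        pool.push({ kind: "Class", nameIndex: view.getUint16(pos + 1) });
        pos += 3;
        break;
      case 8: // CONSTANT_String: u2 string_index
        pool.push({ kind: "String", stringIndex: view.getUint16(pos + 1) });
        pos += 3;
        break;
      case 9:  // CONSTANT_Fieldref
      case 10: // CONSTANT_Methodref (same layout: u2 class, u2 name-and-type)
        pool.push({
          kind: tag === 9 ? "Fieldref" : "Methodref",
          classIndex: view.getUint16(pos + 1),
          nameAndTypeIndex: view.getUint16(pos + 3),
        });
        pos += 5;
        break;
      case 12: // CONSTANT_NameAndType: u2 name_index, u2 descriptor_index
        pool.push({
          kind: "NameAndType",
          nameIndex: view.getUint16(pos + 1),
          descriptorIndex: view.getUint16(pos + 3),
        });
        pos += 5;
        break;
      default:
        throw new Error(`Unhandled constant pool tag ${tag} at offset ${pos}`);
    }
  }
  return { pool, endOffset: pos };
}

const { pool, endOffset } = parseConstantPool(classBytes);
console.log(pool[14].text);                 // Hello World
console.log(pool[pool[21].nameIndex].text); // HelloWorld
```

That second lookup shows the indirection at work: entry 21 is a Class constant that holds nothing but the index of the Utf8 entry with the class name in it. Throwing on an unknown tag is deliberate, since guessing an entry’s size wrong would silently misalign everything after it.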

Code and the rest of it

I wrote just one method in that original Java snippet: public static void main(String[] args). But in the class file, there are two methods, the one called main and another method called <init>. I won’t go into the structure of the methods table, but it contains the name and descriptor (method signature) of each method, the code itself for each method, of course, and other metadata like access flags and exception handling information. I’ll get to those other bits down the road.

So the main method is in there, and so is this <init>. As you have probably guessed, the latter is a special name for a constructor. Constructors have no names in the Java language itself, so the compiler uses this special name <init> for an instance constructor. It seems a little odd that there would be one here, since I didn’t write one. I guess I’ll figure that out later.

The final thing in this file is the table of attributes, which is to say attributes which apply to the class as a whole. In this case, there’s one attribute, it’s called SourceFile and indicates that the file name containing the source code that was compiled into this class was called HelloWorld.java. This is meant for debugging so that you can construct useful stack traces later if you want to. (Notice that here, too, both the strings “SourceFile” and “HelloWorld.java” exist within the constant pool, and the attributes table refers to those string constants by index.)
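To make the back half of the file concrete, here’s a sketch that skips over the constant pool (keeping only its Utf8 strings) and then reads the class metadata, the two methods, and the SourceFile attribute. All of the function names are hypothetical. It assumes a class with no fields and leans on two layouts from the spec: every attribute starts with a u2 name index and a u4 length, and a Code attribute’s body starts with max_stack (u2), max_locals (u2), and code_length (u4), followed by the bytecode.

```javascript
// The full HelloWorld.class hex dump from earlier in the post.
const CLASS_HEX = `
CAFEBABE 0000003B 001D0A00 02000307 00040C00 05000601 00106A61 76612F6C
616E672F 4F626A65 63740100 063C696E 69743E01 00032829 56090008 00090700
0A0C000B 000C0100 106A6176 612F6C61 6E672F53 79737465 6D010003 6F757401
00154C6A 6176612F 696F2F50 72696E74 53747265 616D3B08 000E0100 0B48656C
6C6F2057 6F726C64 0A001000 11070012 0C001300 14010013 6A617661 2F696F2F
5072696E 74537472 65616D01 00077072 696E746C 6E010015 284C6A61 76612F6C
616E672F 53747269 6E673B29 56070016 01000A48 656C6C6F 576F726C 64010004
436F6465 01000F4C 696E654E 756D6265 72546162 6C650100 046D6169 6E010016
285B4C6A 6176612F 6C616E67 2F537472 696E673B 29560100 0A536F75 72636546
696C6501 000F4865 6C6C6F57 6F726C64 2E6A6176 61002000 15000200 00000000
02000000 05000600 01001700 00001D00 01000100 0000052A B70001B1 00000001
00180000 00060001 00000001 00090019 001A0001 00170000 00250002 00010000
0009B200 07120DB6 000FB100 00000100 18000000 0A000200 00000300 08000400
01001B00 00000200 1C
`;
const bytes = Uint8Array.from(
  CLASS_HEX.replace(/\s+/g, "").match(/../g), (pair) => parseInt(pair, 16));
const view = new DataView(bytes.buffer);

// Walk the constant pool, keeping Utf8 strings (tag 1) and skipping the rest.
function readUtf8Entries() {
  const fixedSizes = { 7: 3, 8: 3, 9: 5, 10: 5, 12: 5 }; // only tags in this file
  const utf8 = {};
  let pos = 10;
  for (let index = 1; index < view.getUint16(8); index++) {
    const tag = view.getUint8(pos);
    if (tag === 1) {
      const length = view.getUint16(pos + 1);
      utf8[index] = new TextDecoder().decode(bytes.subarray(pos + 3, pos + 3 + length));
      pos += 3 + length;
    } else if (fixedSizes[tag]) {
      pos += fixedSizes[tag];
    } else {
      throw new Error(`Unhandled constant pool tag ${tag}`);
    }
  }
  return { utf8, pos };
}

// Attribute table: u2 count, then (u2 name_index, u4 length, body) per entry.
function readAttributes(pos, utf8) {
  const count = view.getUint16(pos);
  pos += 2;
  const attrs = [];
  for (let i = 0; i < count; i++) {
    const length = view.getUint32(pos + 2);
    attrs.push({ name: utf8[view.getUint16(pos)], start: pos + 6, length });
    pos += 6 + length;
  }
  return { attrs, pos };
}

function readClassBody() {
  const { utf8, pos: poolEnd } = readUtf8Entries();
  const accessFlags = view.getUint16(poolEnd);
  const thisClass = view.getUint16(poolEnd + 2);  // index of a Class constant
  const superClass = view.getUint16(poolEnd + 4); // likewise
  const interfacesCount = view.getUint16(poolEnd + 6);
  let pos = poolEnd + 8 + 2 * interfacesCount;
  if (view.getUint16(pos) !== 0) throw new Error("Fields are not handled in this sketch");
  const methodsCount = view.getUint16(pos + 2);
  pos += 4;
  const methods = [];
  for (let i = 0; i < methodsCount; i++) {
    const flags = view.getUint16(pos);
    const name = utf8[view.getUint16(pos + 2)];
    const descriptor = utf8[view.getUint16(pos + 4)];
    const table = readAttributes(pos + 6, utf8);
    const codeAttr = table.attrs.find((attr) => attr.name === "Code");
    // Skip max_stack and max_locals, read code_length, then slice the bytecode.
    const code = codeAttr
      ? bytes.subarray(codeAttr.start + 8, codeAttr.start + 8 + view.getUint32(codeAttr.start + 4))
      : null;
    methods.push({ flags, name, descriptor, code });
    pos = table.pos;
  }
  const sourceAttr = readAttributes(pos, utf8).attrs.find((attr) => attr.name === "SourceFile");
  const sourceFile = sourceAttr ? utf8[view.getUint16(sourceAttr.start)] : null;
  return { accessFlags, thisClass, superClass, methods, sourceFile };
}

const classBody = readClassBody();
console.log(classBody.methods.map((m) => `${m.name}${m.descriptor}`).join(", "));
// <init>()V, main([Ljava/lang/String;)V
console.log(classBody.sourceFile); // HelloWorld.java
```

In this file the code for <init> comes out at 5 bytes and the code for main at 9 bytes. Everything else in the methods section is bookkeeping around those two little byte arrays.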

Anyway, that’s it. There’s no magic to this, at least not so far. We have a meager 425 bytes in front of us which includes the full definition of the class we wrote, the code for the method(s) contained in it, and some debug data.

Now we just have to dig out the code and run it… right??

Intermission: What actually is a Java Virtual Machine?

Traditional emulation of real hardware CPUs works, in a basic sense, by loading up a binary stream of compiled machine instructions and then running a loop: fetch an instruction, execute the bit of logic it requires, move to the next instruction. And so on. Real hardware CPUs have instructions that each do a small, precisely-defined task, and don’t know anything about the program or the environment they’re executing in. Management of memory, to say nothing of higher-level object structures, is left to the compiled application.

The designers of Java, at Sun Microsystems in the 1990s, used this idea as a loose inspiration. And then they riffed on it like smooth jazz, and came up with something sort of like that, but sort of not: the JVM does not reflect a real machine, but instead hypothesizes a mythical “computer” which is designed specifically to support Java (the language). A JVM executes Java bytecode, which is a series of encoded instructions as in traditional machine language. So a JVM has instructions for comparisons, jumps, and arithmetic. But it also includes object lifecycle management, and instructions as specific as accessing fields within objects, with complex logic to determine e.g. which method override is the correct one to invoke.
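To give a taste of what those encoded instructions look like, here’s a tiny disassembler sketch for the 9 bytes of main’s code, which appear near the end of the hex dump (B2 00 07 12 0D B6 00 0F B1). The opcode values and mnemonics come from the JVM spec’s instruction set; the table covers only the six opcodes that occur in HelloWorld.class, and the disassemble function is a made-up name for illustration.

```javascript
// Only the six opcodes that appear in HelloWorld.class, with operand sizes.
const OPCODES = {
  0x12: { name: "ldc", operandBytes: 1 },
  0x2a: { name: "aload_0", operandBytes: 0 },
  0xb1: { name: "return", operandBytes: 0 },
  0xb2: { name: "getstatic", operandBytes: 2 },
  0xb6: { name: "invokevirtual", operandBytes: 2 },
  0xb7: { name: "invokespecial", operandBytes: 2 },
};

// Walk the bytecode, turning each instruction into "offset: mnemonic #operand".
// Operands here are all big-endian indexes into the constant pool.
function disassemble(code) {
  const lines = [];
  let pc = 0;
  while (pc < code.length) {
    const op = OPCODES[code[pc]];
    if (!op) {
      throw new Error(`Unknown opcode 0x${code[pc].toString(16)} at offset ${pc}`);
    }
    let operand = 0;
    for (let i = 1; i <= op.operandBytes; i++) {
      operand = (operand << 8) | code[pc + i];
    }
    lines.push(op.operandBytes > 0 ? `${pc}: ${op.name} #${operand}` : `${pc}: ${op.name}`);
    pc += 1 + op.operandBytes;
  }
  return lines;
}

// The code bytes of main, copied from the hex dump.
const mainCode = [0xb2, 0x00, 0x07, 0x12, 0x0d, 0xb6, 0x00, 0x0f, 0xb1];
console.log(disassemble(mainCode).join("\n"));
// 0: getstatic #7
// 3: ldc #13
// 5: invokevirtual #15
// 8: return
```

Those # numbers are constant pool indexes. In this file, #7 is the field reference for System.out, #13 is the “Hello World” string, and #15 is the method reference for println. So the whole program is: fetch a field, load a constant, call a method, return. Those are exactly the kinds of object-aware instructions a real CPU doesn’t have.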

Other high-level compiled languages (e.g. C++, Swift) put this complexity inside the compiler and some sort of native runtime library, which all gets glommed together into the executable. In Java, the VM executes the instructions but also plays the role of the runtime library, so the bytecode is quite compact. The JVM itself needs to be built to run on each desired platform, of course, but once it’s there, every Java application exists in a single platonic form.

What this means is that the specification for the Java Virtual Machine is necessarily pretty complicated. I bought a paperback copy off of Amazon, because owning physical computer books makes you look serious (and also, old). This book is 584 pages long and contains no pictures. This is, understand, the spec just for the virtual machine. The Java Language itself has its own spec, which is 792 pages long.

[Image: It turns out this JVM thing may not be just a single-afternoon project. ¯\_(ツ)_/¯]

The dream of the 90s is alive in Austin

(This section title is a reference to both the foundational Portlandia sketch and the recent announcement that Oracle, which acquired Sun, is relocating to Texas. Ha ha!)

[Image: “Sun Ray” thin-client computer. (Mark Boyce/CC BY-SA 3.0)]

If you’re young and have never known a world without cell phones and influencers, you may be puzzled by why this virtual machine idea would be compelling. I swear to you, it was all the rage in the 1990s. Sun called it Write-Once-Run-Anywhere. As the internet emerged from academic obscurity to something with broad potential, one of the first key use cases for Java was providing interactivity within web pages using something called “applets”. For the first few years after the web’s invention, “pages” were just static affairs rendered with HTML. Connections were slow and the web was used on a variety of computer platforms.

If you wanted to do anything interactive, Java seemed like a great solution: your browser would have a JVM bundled with it, so it could download the (quite compact!) Java code from a web site in one quick go and then execute it locally in a frame within the web page. Java applets had a brief couple years of utility, but it turned out that Adobe’s Flash emerged as the more popular solution to complex web interaction, until eventually JavaScript and HTML5 obsoleted Flash, too, and today all of the stuff that applets were designed to do just gets done within the actual web page itself.

Paging Alanis Morissette! We are building a Java Virtual Machine, using JavaScript to run Java code inside a web page, which was the whole point of applets in the first place, but without applets. (Alanis is, also, a 90s reference.)

That’s a wrap for part 1, folks. I’ve looked through and parsed the class file, which is the only real input the project should need to accept… I think. We know what the JVM is, more or less, and so next time I’ll dive in and get ready to run the code.

This is the first in what I hope will be a quickly written and published series about implementing a JVM in JavaScript. I have a project on GitHub called Kopiluwak that reflects the current state of the attempt (currently well beyond just loading class files); if you’re interested you can check it out. You can also contact me at bzotto at gmail dot com (as we say in the sneaky 90s way) with your feedback. All text is copyright © Ben Zotto, 2021, with all rights reserved.
