3.3. Java Processors
performance speedup of the JIT compiler can be combined. Such a processor must support the architectural features specified for the JVM. A Java processor is an execution model that implements the JVM in silicon to directly execute Java bytecodes. Java processors can be tailored specifically to the Java environment by providing hardware support for such features as stack processing, multithreading, and garbage collection. Thus, a Java processor can potentially deliver much better performance for Java applications than a general-purpose processor. Java processors appear to be particularly well-suited for cost-sensitive embedded computing applications.
Some of the design goals behind the JVM definition were to provide portability, security, and small code size for the executable programs. In addition, it was designed to simplify the task of writing an interpreter or a JIT compiler for a specific target processor and operating system. However, these design goals result in architectural features for the JVM that pose significant challenges in developing an effective implementation of a Java processor [O’Connor and Tremblay 1997]. In particular, there are certain common characteristics of Java programs that are different from traditional procedural programming languages. For instance, Java processors are stack-based and must support the multithreading and unique memory management features of the JVM. These unique characteristics suggest that the architects of a Java processor must take into consideration the dynamic frequency counts of the various instruction types to achieve high performance.
The JVM instructions fall into several categories, namely, local-variable loads and stores, memory loads and stores, integer and floating-point computations, branches, method calls and returns, stack operations, and new object creation. Figure 3 shows the dynamic frequencies of the various JVM instruction categories measured in the Java benchmark programs LinPack, CaffeineMark, Dhrystone, Symantec, JMark2.0, and JavaWorld (benchmark program details are provided in Section 5). These dynamic instruction frequencies were obtained by instrumenting the source code of the Sun 1.1.5 JVM interpreter to count the number of times each type of bytecode was executed.
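This kind of instrumentation can be sketched as a small profiler hooked into an interpreter's dispatch loop. The class and category names below are illustrative only, not Sun's actual instrumentation code:

```java
import java.util.Map;
import java.util.TreeMap;

public class OpcodeProfiler {
    // Dynamic execution counts, keyed by bytecode category.
    private final Map<String, Long> counts = new TreeMap<>();

    // Called from the interpreter's dispatch loop each time a bytecode executes.
    void record(String category) {
        counts.merge(category, 1L, Long::sum);
    }

    // Fraction of all executed instructions that fell into this category.
    double frequency(String category) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        return total == 0 ? 0.0 : (double) counts.getOrDefault(category, 0L) / total;
    }
}
```

Accumulating counts this way and normalizing at the end is how per-category percentages like those in Figure 3 are typically produced.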
As seen in this figure, the most frequent instructions are local variable loads (about 30–47% of the total instructions executed), which move operands from the local variables area of the stack to the top of the stack. The high frequency of these loads suggests that optimizing local loads could substantially enhance performance. Method calls and returns are also quite common in Java programs. As seen in Figure 3, they constitute about 7% of the total instructions executed. Hence, optimizing the method call and return process is expected to have a relatively large impact on the performance of Java codes. Since method calls occur through the stack, a Java processor also should have an efficient stack implementation.
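The stack discipline behind these numbers can be illustrated with a tiny interpreter sketch. The opcode constants and encoding below are simplified stand-ins for the real JVM bytecodes, assumed here only for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StackMachine {
    // Simplified opcodes loosely modeled on JVM bytecodes.
    static final int ILOAD = 0, IADD = 1, ISTORE = 2;

    // Executes a bytecode stream against a local-variable array.
    static int[] run(int[] code, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int pc = 0; pc < code.length; ) {
            switch (code[pc++]) {
                case ILOAD:  stack.push(locals[code[pc++]]); break;       // local -> stack top
                case IADD:   stack.push(stack.pop() + stack.pop()); break; // consume two, push sum
                case ISTORE: locals[code[pc++]] = stack.pop(); break;      // stack top -> local
            }
        }
        return locals;
    }

    public static void main(String[] args) {
        // c = a + b compiles, conceptually, to: iload a; iload b; iadd; istore c
        int[] locals = {3, 4, 0};
        run(new int[]{ILOAD, 0, ILOAD, 1, IADD, ISTORE, 2}, locals);
        System.out.println(locals[2]); // prints 7
    }
}
```

Note that even this one-line addition needs two local-variable loads before the arithmetic can happen, which is why such loads dominate the dynamic instruction mix.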
Another common operation supported by the JVM is the concurrent execution of multiple threads. As with any multithreading execution, threads will often need to enter synchronized critical sections of code to provide mutually exclusive access to shared objects. Thus, a Java processor should provide architectural support for this type of synchronization. A Java processor must also provide support for memory management via the garbage collection process, such as hardware tagging of memory objects.
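The critical sections in question map directly onto Java's synchronized construct, which the compiler brackets with monitorenter/monitorexit bytecodes. A minimal example (the class name is hypothetical):

```java
public class SyncCounter {
    private int value = 0;

    // The synchronized block compiles to monitorenter/monitorexit bytecodes;
    // a Java processor can back these with hardware locking support.
    public void increment() {
        synchronized (this) {   // monitorenter
            value++;
        }                       // monitorexit
    }

    public synchronized int get() { return value; }

    public static void main(String[] args) throws InterruptedException {
        SyncCounter c = new SyncCounter();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.increment();
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(c.get()); // prints 40000: no increments lost
    }
}
```

Without the monitor, concurrent increments would race on the read-modify-write of `value`; the speed of acquiring and releasing that monitor is exactly what hardware synchronization support targets.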
PicoJava-I [McGhan and O’Connor 1998; O’Connor and Tremblay 1997] is a configurable processor core that supports the JVM specification. It includes a RISC-style pipeline executing the JVM instruction set. However, only the most common instructions that most directly impact the program execution are implemented in hardware. Some moderately complicated but performance critical instructions are implemented through microcode. The remaining instructions are trapped and emulated in software by the processor core. The hardware design is thus simplified since the complex, but infrequently executed, instructions do not need to be directly implemented in hardware.
In the Java bytecodes, it is common for an instruction that copies data from a local variable to the top of the stack to precede an instruction that uses it. The picoJava core folds these two instructions into one by accessing the local variable directly and using it in the operation. This folding operation accelerates Java bytecode execution by taking advantage of the single-cycle random access to the stack cache. Hardware support for synchronization is provided to the operating system by using the low-order 2 bits in shared objects as flags to control access to the object.
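The folding idea can be sketched as a decode-stage scan for load-then-ALU-op pairs. The opcode constants and the single-pair pattern below are simplified assumptions for illustration, not the picoJava decoder's actual logic:

```java
public class FoldingDecoder {
    // Simplified opcodes; ILOAD and ISTORE take one operand byte each.
    static final int ILOAD = 0, IADD = 1, ISTORE = 2;

    // Counts load+ALU pairs that could issue as a single folded operation
    // instead of two separate stack operations.
    static int countFoldable(int[] code) {
        int folded = 0;
        for (int pc = 0; pc < code.length; ) {
            int op = code[pc++];
            if (op == ILOAD) {
                pc++; // skip the local-variable index operand
                if (pc < code.length && code[pc] == IADD) {
                    folded++; // iload + iadd: read the local directly into the ALU
                }
            } else if (op == ISTORE) {
                pc++; // skip the local-variable index operand
            }
        }
        return folded;
    }
}
```

In the sequence `iload a; iload b; iadd`, the second load and the add form one such pair; folding removes the intermediate push through the top of the stack.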
The picoJava-2 core [Turley 1997], which is the successor to the picoJava-I processor, augments the bytecode instruction set with a number of extended instructions to manipulate the caches, control registers, and absolute memory addresses. These extended instructions are intended to be useful for non-Java application programs that are run on this Java execution core. All programs, however, still need to be compiled to Java bytecodes first, since these bytecodes are the processor’s native instruction set. The pipeline is extended to 6 stages compared to the 4 stages in the picoJava-I pipeline. Finally, the folding operation is extended to include two local variable accesses, instead of just one.
Another Java processor is the Sun microJava 701 microprocessor [Sun Microsystems. MicroJava-701 Processor]. It is based on the picoJava-2 core and is supported by a complete set of software and hardware development tools. The Patriot PSC1000 microprocessor [Patriot Scientific Corporation] is a general-purpose 32-bit, stack-oriented architecture. Since its instruction set is very similar to the JVM bytecodes, the PSC1000 can efficiently execute Java programs.
Another proposed Java processor [Vijaykrishnan et al. 1998] provides architectural support for direct object manipulation, stack processing, and method invocations to enhance the execution of Java bytecodes. This architecture uses a virtual address object cache for efficient manipulation and relocation of objects. Three cache-based schemes—the hybrid cache, the hybrid polymorphic cache, and the two-level hybrid cache—have been proposed to efficiently implement virtual method invocations. The processor uses extended folding operations similar to those in the picoJava-2 core. Also, simple, frequently executed instructions are directly implemented in the hardware, while more complex but infrequent instructions are executed via a trap handler.
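The motivation for caching virtual-method dispatch can be shown with a generic monomorphic inline-cache sketch. This is not one of the three hybrid schemes above; the class name and the string used as a stand-in for a resolved method address are hypothetical:

```java
public class InlineCache {
    private Class<?> cachedClass;  // receiver class seen at this call site last time
    private String cachedTarget;   // stand-in for the resolved method's address
    private int slowPathCount = 0;

    // Resolve the target of a virtual call: hit the cache when the receiver's
    // class matches the last one seen, otherwise take the full lookup path.
    String dispatch(Object receiver) {
        Class<?> cls = receiver.getClass();
        if (cls != cachedClass) {
            slowPathCount++;                              // full method-table lookup
            cachedClass = cls;
            cachedTarget = cls.getName() + "::toString";  // illustrative target only
        }
        return cachedTarget;
    }

    int slowPaths() { return slowPathCount; }
}
```

Because most call sites see the same receiver class repeatedly, a small per-site cache turns most virtual dispatches into a single compare, which is the effect the hybrid cache schemes pursue in hardware.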
The Java ILP processor [Ebcioglu et al. 1997] executes Java applications on an ILP machine with a Java JIT compiler hidden within the chip architecture. The first time a fragment of Java code is executed, the JIT compiler transparently converts the Java bytecodes into optimized RISC primitives for a Very Long Instruction Word (VLIW) parallel architecture. The VLIW code is saved in a portion of the main memory not visible to the Java architecture. Each time a fragment of Java bytecode is accessed for execution, the processor’s memory is checked to see if the corresponding ILP code is already available. If it is, then the execution jumps to the location in memory where the ILP code is stored. Otherwise, the compiler is invoked to compile the new Java bytecode fragment into the code for the target processor, which is then executed.
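This check-then-compile flow amounts to a translation cache keyed by code fragment. The fragment identifiers and the stand-in "compiled" function below are assumptions for illustration, not the actual bytecode-to-VLIW translation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

public class TranslationCache {
    // Maps a bytecode-fragment identifier to its already-translated form.
    private final Map<Integer, IntUnaryOperator> compiled = new HashMap<>();
    private int compileCount = 0;

    // "Compile" a fragment the first time it is seen; reuse the translation
    // on every later execution of the same fragment.
    IntUnaryOperator lookup(int fragmentId) {
        return compiled.computeIfAbsent(fragmentId, id -> {
            compileCount++;               // slow path: invoke the compiler once
            return x -> x + id;           // stand-in for the generated VLIW code
        });
    }

    int compiles() { return compileCount; }
}
```

The key property is that compilation cost is paid once per fragment; every subsequent execution is a cache hit that jumps straight to the translated code.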
While Java processors can deliver significant performance speedups for Java applications, they cannot be used efficiently for applications written in any other language. If the goal is better performance for Java applications while retaining the ability to execute applications written in other languages, Java processors will be of limited use. With so many applications written in other languages already available, it may be desirable to have general-purpose processors with enhanced architectural features to support faster execution of Java applications.