Monday, August 28, 2023

Java 22: Panama FFM Provides Massive Performance Improvements for Native Strings

 

The Panama Foreign Function and Memory (FFM) API is slated to be finalized in Java 22 and will then be a part of the public Java API. What is perhaps less well known is that FFM in Java 22 also brings significant performance improvements in certain areas. In this short article, we benchmark string conversion with FFM in Java 21 and Java 22 and compare it to the old JNI calls.

C and Java Strings

The Java Native Interface (JNI) has been used historically as a means to bridge Java to native calls before FFM was available. Both schemes entail converting strings back and forth between C and Java String data structures. As you might remember, C strings are just a bunch of bytes that are zero-terminated whereas Java Strings use a backing array with a known length.
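The difference between the two representations can be illustrated in plain Java, without any native code. The following is only a sketch: the class and method names (`CStringSketch`, `toCString`, `toJavaString`) are made up for this illustration and are not part of JNI or FFM.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CStringSketch {

    // Java String -> C-style bytes: the UTF-8 payload followed by a single 0 byte.
    public static byte[] toCString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // copyOf pads with zeros, so the extra byte is the terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    // C-style bytes -> Java String: the length is unknown up front,
    // so we have to scan for the terminating 0 byte first.
    public static String toJavaString(byte[] cString) {
        int len = 0;
        while (len < cString.length && cString[len] != 0) {
            len++;
        }
        return new String(cString, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] c = toCString("Hello");
        System.out.println(Arrays.toString(c)); // [72, 101, 108, 108, 111, 0]
        System.out.println(toJavaString(c));    // Hello
    }
}
```

Both directions involve a copy, and the C-to-Java direction also needs a scan for the terminator, which is why conversion cost grows with string length.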

When calling a native function that takes one or more strings and/or returns a string (both are relatively common), the performance of converting strings back and forth becomes important.

Benchmarks

We have run some string benchmarks (see the source code here) on an AMD Ryzen 9 3900X 12-Core Processor machine, and preliminary results indicate blistering performance for FFM string conversion in Java 22:

Benchmark                           (size)  Mode  Cnt    Score   Error  Units
ToJavaStringTest.jni_readString          5  avgt   30   86.520 ± 1.842  ns/op
ToJavaStringTest.jni_readString         20  avgt   30   97.151 ± 1.459  ns/op
ToJavaStringTest.jni_readString        100  avgt   30  143.853 ± 1.287  ns/op
ToJavaStringTest.jni_readString        200  avgt   30  189.867 ± 2.337  ns/op
ToJavaStringTest.panama_readString       5  avgt   30   21.380 ± 0.351  ns/op
ToJavaStringTest.panama_readString      20  avgt   30   36.250 ± 0.520  ns/op
ToJavaStringTest.panama_readString     100  avgt   30   43.368 ± 0.544  ns/op
ToJavaStringTest.panama_readString     200  avgt   30   53.442 ± 2.048  ns/op


Benchmark                         (size)  Mode  Cnt    Score   Error  Units
ToCStringTest.jni_writeString          5  avgt   30   47.450 ± 0.832  ns/op
ToCStringTest.jni_writeString         20  avgt   30   56.208 ± 0.422  ns/op
ToCStringTest.jni_writeString        100  avgt   30  108.341 ± 0.459  ns/op
ToCStringTest.jni_writeString        200  avgt   30  157.119 ± 1.669  ns/op
ToCStringTest.panama_writeString       5  avgt   30   45.361 ± 0.717  ns/op
ToCStringTest.panama_writeString      20  avgt   30   47.742 ± 0.554  ns/op
ToCStringTest.panama_writeString     100  avgt   30   47.580 ± 0.673  ns/op
ToCStringTest.panama_writeString     200  avgt   30   49.060 ± 0.694  ns/op

The ToJavaStringTest runs convert a C string to a Java string, whereas the ToCStringTest runs convert a Java string to a C string. The size column gives the number of bytes in the original string. Strings were encoded in UTF-8.

As can be seen, FFM in Java 22 converts C strings to Java strings roughly 2.7 to 4 times faster than JNI, depending on the string size. In the other direction, performance is about the same for small strings, but for larger strings (where it matters more) the speedup grows with the string length. For example, for strings of length 200, FFM is more than three times faster.
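The speedup factors can be recomputed from the scores in the two benchmark tables as the JNI time divided by the FFM time. The snippet below is just that arithmetic; the class name `SpeedupSketch` and its field names are made up for this illustration.

```java
public class SpeedupSketch {
    public static final int[] SIZES = {5, 20, 100, 200};

    // Scores in ns/op, copied from the benchmark tables.
    public static final double[] JNI_READ  = {86.520, 97.151, 143.853, 189.867};
    public static final double[] FFM_READ  = {21.380, 36.250, 43.368, 53.442};
    public static final double[] JNI_WRITE = {47.450, 56.208, 108.341, 157.119};
    public static final double[] FFM_WRITE = {45.361, 47.742, 47.580, 49.060};

    // A factor above 1 means FFM needs less time per operation than JNI.
    public static double speedup(double jniNs, double ffmNs) {
        return jniNs / ffmNs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            System.out.printf("size %3d: C->Java %.2fx, Java->C %.2fx%n",
                    SIZES[i],
                    speedup(JNI_READ[i], FFM_READ[i]),
                    speedup(JNI_WRITE[i], FFM_WRITE[i]));
        }
    }
}
```

Note that these factors compare whole benchmark operations, so they include the JNI call overhead and not only the conversion itself.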

Note

It should be noted that the benchmarks are not purely about string conversion, as a JNI call also incurs a small state transition penalty for each call. The Java-to-C string conversion also performs a memory allocation (for the string bytes). While this could be avoided, it was included in the benchmark as that is what happens with JNI’s GetStringUTFChars.

Diagrams

Here are two diagrams outlining the performance benefits of FFM in comparison with JNI. The diagrams also include FFM in Java 21 to highlight the recent performance improvements made in 22 (lower is better):



Diagram 1 shows the performance of converting a C string to a Java String.



Diagram 2 shows the performance of converting a Java string to a C String.

Future Improvements

FFM allows us to use custom allocators and so, if we make several calls, we can reuse memory segments thereby improving performance further. This is not possible with JNI.
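As a rough sketch of what such reuse could look like, the example below allocates one buffer up front and then converts several strings into that same buffer via a prefix allocator. It assumes JDK 22 or later, where the FFM API is final. The class name `ReuseSegmentSketch`, the method name `roundTrip` and the buffer size of 256 bytes are arbitrary choices for this illustration, and this is not the code used in the benchmarks above.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;

public class ReuseSegmentSketch {

    // Converts each input to a C string inside ONE reused buffer and reads it back.
    public static String[] roundTrip(String... inputs) {
        String[] result = new String[inputs.length];
        try (Arena arena = Arena.ofConfined()) {
            // Allocated once. Inputs must fit in 255 bytes of UTF-8 plus the terminator,
            // otherwise allocateFrom below throws an exception.
            MemorySegment buffer = arena.allocate(256);

            // A prefix allocator answers every request with a slice starting at
            // offset 0 of 'buffer', so no new native memory is allocated per call.
            SegmentAllocator reusing = SegmentAllocator.prefixAllocator(buffer);

            for (int i = 0; i < inputs.length; i++) {
                // Writes the UTF-8 bytes followed by a 0 terminator into the buffer.
                MemorySegment cString = reusing.allocateFrom(inputs[i]);
                // A real program would pass 'cString' to a native function here.
                result[i] = cString.getString(0); // reads up to the terminator
            }
        } // the buffer is freed when the arena is closed
        return result;
    }

    public static void main(String[] args) {
        for (String s : roundTrip("alpha", "be", "gamma delta")) {
            System.out.println(s);
        }
    }
}
```

Because every conversion overwrites the previous contents, a segment obtained this way is only valid until the next allocation, which is fine when the native function does not keep the pointer after the call returns.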

It is also possible that we will see even better FFM performance in future Java versions once the Vector API becomes a final feature.

JDK Early-Access Builds

Run your own code on an early access JDK today by downloading a JDK Early-Access Build.

Note

At the time of writing this article, the performance improvements have not yet been merged into the Java 22 mainline. You can, however, build your own snapshot version with the performance improvements mentioned above by cloning github.com/openjdk/panama-foreign

Resources

Acknowledgments

This article was written by me (Per Minborg) and Maurizio Cimadamore.

Wednesday, August 2, 2023

Java: New Draft JEP: "Computed Constants"

Java: JEP Draft: "Computed Constants"

We finally made the draft JEP "Computed Constants" public and I can’t wait to tell you more about it! ComputedConstant objects are superfast immutable value holders that can be initialized independently of when they are created. As an added benefit, these objects may in the future be even more optimized via "condensers" that eventually might become available through project Leyden.

Background

Oftentimes, we use static fields to hold objects that are only initialized once:

// ordinary static initialization
private static final Logger LOGGER = Logger.getLogger("com.foo.Bar");
...
LOGGER.log(...);

The LOGGER variable will be unconditionally initialized as soon as the class where it is declared is initialized (which happens when the class is first used, for example when one of its static members is first accessed).

One way to prevent all static fields in a class from being initialized at the same time is to use the class holder idiom allowing us to defer initialization until we actually need the variable:

// Initialization-on-demand holder idiom
Logger logger() {
    class Holder {
         static final Logger LOGGER = Logger.getLogger("com.foo.Bar");
    }
    return Holder.LOGGER;
}
...
logger().log(...);

While this works well in theory, there are significant drawbacks: 

  • Each constant that needs to be decoupled would need its own holding class (adding static footprint overhead) 
  • Only works if the decoupled constants are independent 
  • Only works for static variables, not for instance variables and objects
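The difference between an ordinary static field and the holder idiom can be observed with a small experiment that records when each value is created. This is only a sketch with made-up names (`StaticInitDemo`, `Eager`, `Lazy`). It uses a nested holder class rather than a local one so that it also compiles on older JDKs.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticInitDemo {
    // Records what happens, in order.
    public static final List<String> EVENTS = new ArrayList<>();

    static Object create(String name) {
        EVENTS.add("init " + name);
        return new Object();
    }

    // Ordinary static field: created as soon as Eager is initialized.
    static class Eager {
        static final Object EXPENSIVE = create("Eager.EXPENSIVE");
        static void unrelated() { EVENTS.add("Eager.unrelated()"); }
    }

    // Holder idiom: the value lives in its own class, initialized on first access.
    static class Lazy {
        private static class Holder {
            static final Object EXPENSIVE = create("Lazy.Holder.EXPENSIVE");
        }
        static Object expensive() { return Holder.EXPENSIVE; }
        static void unrelated() { EVENTS.add("Lazy.unrelated()"); }
    }

    public static List<String> trace() {
        Eager.unrelated();  // first use of Eager: EXPENSIVE is created although nobody asked for it
        Lazy.unrelated();   // first use of Lazy: nothing expensive happens
        Lazy.expensive();   // first use of Lazy.Holder: only now is the value created
        return EVENTS;
    }

    public static void main(String[] args) {
        System.out.println(trace());
        // On a fresh JVM this prints:
        // [init Eager.EXPENSIVE, Eager.unrelated(), Lazy.unrelated(), init Lazy.Holder.EXPENSIVE]
    }
}
```

The price for the deferred initialization is visible in the code: one extra class per lazily initialized constant.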

Another way to defer initialization is the double-checked locking idiom. This works for static variables as well as for instance variables and objects:

// Double-checked locking idiom
class Foo {

    private volatile Logger logger;

    public Logger logger() {
        Logger v = logger;
        if (v == null) {
            synchronized (this) {
                v = logger;
                if (v == null) {
                    logger = v = Logger.getLogger("com.foo.Bar");
                }
            }
        }
        return v;
    }
}
...
foo.logger().log(...);

There is no way for the (current) JVM to determine that the logger field is monotonic, in the sense that it can only change from null to a value once and then keeps that value forever. So, the JVM is unable to apply constant folding and other optimizations. Also, because logger needs to be declared volatile, there is a small performance penalty for each access.

The ComputedConstant class comes to the rescue here and offers the best of two worlds: Flexible initialization and good performance!

Computed Constant

Here is how ComputedConstant can be used with the logger example:

class Bar {
    // 1. Declare a computed constant value
    private static final ComputedConstant<Logger> LOGGER =
            ComputedConstant.of( () -> Logger.getLogger("com.foo.Bar") );

    static Logger logger() {
        // 2. Access the computed value
        //    (evaluation made before the first access)
        return LOGGER.get();
    }
}

This is similar in spirit to the class-holder idiom, and offers the same performance, constant-folding, and thread-safety characteristics, but is simpler and incurs a lower static footprint since no additional class is required.
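Since ComputedConstant is not available in a released JDK, its observable behavior (the supplier runs at most once, and only on first access) can be imitated with a small stand-in. The `LazyConstant` class below is a hypothetical helper written for this illustration. It is not the class from the draft JEP, and because it uses double-checked locking internally it does not get the constant-folding benefits described above.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyConstantSketch {

    // Hypothetical stand-in, NOT the ComputedConstant class from the draft JEP.
    public static final class LazyConstant<V> {
        private Supplier<? extends V> supplier; // only touched while holding the lock
        private volatile V value;               // null means "not computed yet"

        private LazyConstant(Supplier<? extends V> supplier) {
            this.supplier = Objects.requireNonNull(supplier);
        }

        public static <V> LazyConstant<V> of(Supplier<? extends V> supplier) {
            return new LazyConstant<>(supplier);
        }

        public V get() {
            V v = value;
            if (v == null) {
                synchronized (this) {
                    v = value;
                    if (v == null) {
                        // This sketch does not support null as a computed value.
                        v = Objects.requireNonNull(supplier.get());
                        value = v;
                        supplier = null; // the supplier is no longer needed
                    }
                }
            }
            return v;
        }
    }

    static final AtomicInteger CALLS = new AtomicInteger();

    static final LazyConstant<String> GREETING = LazyConstant.of(() -> {
        CALLS.incrementAndGet();
        return "hello";
    });

    public static void main(String[] args) {
        System.out.println(CALLS.get());    // 0, nothing is computed at declaration
        System.out.println(GREETING.get()); // hello
        System.out.println(GREETING.get()); // hello
        System.out.println(CALLS.get());    // 1, the supplier ran exactly once
    }
}
```

The point of the proposed API is that the JVM can give the same semantics without the volatile read on every access that this stand-in pays for.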

Benchmarks

I’ve run some benchmarks on my Mac Pro M1 ARM-based machine and preliminary results indicate excellent performance for static ComputedConstant fields:

Benchmark      Mode  Cnt  Score   Error  Units
staticHolder   avgt   15  0.561 ± 0.002  ns/op
doubleChecked  avgt   15  1.122 ± 0.003  ns/op
constant       avgt   15  0.563 ± 0.002  ns/op // static ComputedConstant

As can be seen, a ComputedConstant has the same performance as the static holder (but with no extra class footprint) and is about twice as fast as a double-checked locking variable.

Collections of ComputedConstant

So far so good. However, the hidden gem in the JEP is the ability to obtain Collections of ComputedConstant elements. This is achieved using a factory method that provides not a single ComputedConstant (with its provider) but a whole List of ComputedConstant elements that is handled by a single providing mapper that can initialize all the elements in the list. This allows a large number of variables to be handled via a single list, thereby saving space compared to having many single constants and initialization lambdas (for example).

Like a ComputedConstant<V> variable, a List<ComputedConstant<V>> variable is created by providing an element mapper - typically in the form of a lambda expression, which is used to compute the value associated with the i-th element of the List when the element value is first accessed:

class Fibonacci {
    static final List<ComputedConstant<Integer>> FIBONACCI =
            ComputedConstant.of(1_000, Fibonacci::fib);

    static int fib(int n) {
        return (n < 2)
                ? n
                : FIBONACCI.get(n - 1).get() + FIBONACCI.get(n - 2).get();
    }

    int[] fibs = IntStream.range(0, 10)
            .map(Fibonacci::fib)
            .toArray(); // { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34 }

}

Note how there’s only one field of type List<ComputedConstant<Integer>> to initialize - every other computation is performed on-demand when the corresponding element of the List FIBONACCI is accessed.

When a computation depends on more sub-computations, it induces a dependency graph, where each computation is a node in the graph, and has zero or more edges to each of the sub-computation nodes it depends on. For instance, the dependency graph associated with fib(5) is given below:

               ___________fib(5)___________
              /                            \
        ____fib(4)____                ____fib(3)____
       /              \              /              \
     fib(3)         fib(2)         fib(2)          fib(1)
    /      \       /      \       /      \
  fib(2)  fib(1) fib(1)  fib(0) fib(1)  fib(0)

The Computed Constant API allows modeling this cleanly, while still preserving good constant-folding guarantees and integrity of updates in the case of multi-threaded access.
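To make the effect on the dependency graph concrete, the sketch below imitates a list of lazily computed elements and counts how often the mapper runs. `LazyInts` is a hypothetical stand-in written for this illustration, not the List<ComputedConstant<Integer>> from the draft JEP. It uses a simple reentrant lock instead of whatever the real implementation does, and its `get` returns the int value directly.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

public class LazyListSketch {

    // Hypothetical stand-in for a list of computed constants (NOT the draft JEP API).
    public static final class LazyInts {
        private final Integer[] values; // null means "not computed yet"
        private final IntFunction<Integer> mapper;

        public LazyInts(int size, IntFunction<Integer> mapper) {
            this.values = new Integer[size];
            this.mapper = mapper;
        }

        // synchronized is reentrant, so the mapper may call get() recursively.
        public synchronized int get(int index) {
            Integer v = values[index];
            if (v == null) {
                v = mapper.apply(index); // runs at most once per index
                values[index] = v;
            }
            return v;
        }
    }

    public static final AtomicInteger MAPPER_CALLS = new AtomicInteger();

    // 47 elements (indices 0..46): fib(46) is the largest value that fits in an int.
    public static final LazyInts FIBONACCI = new LazyInts(47, LazyListSketch::fib);

    static int fib(int n) {
        MAPPER_CALLS.incrementAndGet();
        return (n < 2)
                ? n
                : FIBONACCI.get(n - 1) + FIBONACCI.get(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(FIBONACCI.get(5));   // 5
        System.out.println(MAPPER_CALLS.get()); // 6, one mapper call each for fib(0) .. fib(5)
    }
}
```

The graph for fib(5) shown above has 15 nodes, but since each element is computed only once, the repeated nodes such as fib(3) and fib(2) are served from the already computed elements and the mapper runs only six times.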

Benchmarks Collections

These benchmarks were run on the same platform as above and show that collections of ComputedConstant elements enjoy the same performance benefits as single ones:

Benchmark      Mode  Cnt  Score   Error  Units
staticHolder   avgt   15  0.570 ± 0.005  ns/op // int[] in a holder class
doubleChecked  avgt   15  1.124 ± 0.044  ns/op
constant       avgt   15  0.562 ± 0.005  ns/op // List<ComputedConstant>

Again, the ComputedConstant clocks in at native static array speed while providing much better flexibility as to when initialization takes place.

Instance Performance

The performance for instance variables and objects is superior to holders using the double-checked idiom shown above, as can be seen in the benchmarks below:

Benchmark      Mode  Cnt  Score   Error  Units
doubleChecked  avgt   15  1.259 ± 0.023  ns/op
constant       avgt   15  0.728 ± 0.022  ns/op // ComputedConstant

So, on my machine, ComputedConstant needs more than 40% less time per access than the double-checked holder class (0.728 ns versus 1.259 ns).

Note: Instance performance is subject to review.

Where is it?

At the time of writing this article, ComputedConstant is not yet available in the mainline JDK repository. Check out the next section for a link to the proposed source code.

Acknowledgments

Parts of the text in this article were written by Maurizio Cimadamore.