Monday, October 9, 2023

My Presentations at Devoxx BE 2023

Presentations at Devoxx BE October 2023

I did two presentations at Devoxx and also participated in the "Ask the Java Architects" presentation.



Here are the videos of the presentations:

Ask the Java Architects By Sharat Chander, Alan Bateman, Stuart Marks, Viktor Klang, Brian Goetz, and Per Minborg

Here is the code I used in my presentations:

Thanks to all who provided feedback and those who attended the presentations. See you next year! 

Thursday, September 14, 2023

Java Records are "Trusted" and Consequently Faster

 


Did you know Java records are trusted by the HotSpot VM in a special way? This makes them faster than regular Java classes in some respects. In this short article, we will take a look at constant folding of instance fields and how this can bolster the performance of your Java application.

Background

Suppose we want to model an immutable point:

public interface Point {
    int x();
    int y();
}

Before record classes were introduced in Java, data classes had to be "manually" coded using a regular Java class like this:

public final class RegularPoint implements Point {

    private final int x;
    private final int y;

    public RegularPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public int x() {
        return x;
    }

    @Override
    public int y() {
        return y;
    }

    // Implementations of toString(), hashCode() and equals()
    // omitted for brevity

}

With records, it became much easier, and we also got reasonable default implementations of the methods toString(), hashCode() and equals():

public record RecordPoint(int x, int y) implements Point {}
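
To see those generated methods in action, here is a minimal sketch. The class name RecordDefaultsDemo is made up for illustration, and the record is nested (and does not implement Point) only to keep the example self-contained:

```java
public class RecordDefaultsDemo {

    // Same components as the RecordPoint above, nested to keep the sketch self-contained
    record RecordPoint(int x, int y) {}

    public static void main(String[] args) {
        RecordPoint a = new RecordPoint(1, 2);
        RecordPoint b = new RecordPoint(1, 2);

        // toString() is generated from the record name and its components
        System.out.println(a);                             // RecordPoint[x=1, y=2]
        // equals() and hashCode() are based on the component values
        System.out.println(a.equals(b));                   // true
        System.out.println(a.hashCode() == b.hashCode());  // true
    }
}
```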

As an extra bonus, there is an emerging property for records that makes them eligible for constant folding optimizations if used in a static context. Read more about this in the following sections.

Setup

Suppose we keep track of the unique origin point in a static variable like this:

public static final Point ORIGIN = new RecordPoint(0, 0);

Further, assume we have a method that determines if a given Point is at the origin point:

public static boolean isOrigin(Point point) {
    return point.x() == ORIGIN.x() &&
           point.y() == ORIGIN.y();
}

We could then write a small program that demonstrates the principles:

public class Demo {

    public static final Point ORIGIN = new RecordPoint(0, 0);

    public static void main(String[] args) {
        analyze(new RegularPoint(0, 0));
        analyze(new RegularPoint(1, 1));
    }

    public static void analyze(Point point) {
        // The trailing space lives in "not " so that the positive case
        // does not print a double space
        System.out.format("The point %s is %sat the origin.%n",
                point, isOrigin(point) ? "" : "not ");
    }

    public static boolean isOrigin(Point point) {
        return point.x() == ORIGIN.x() &&
               point.y() == ORIGIN.y();
    }

}

When run, the code above will produce the following output:

The point RegularPoint{x=0, y=0} is at the origin.
The point RegularPoint{x=1, y=1} is not at the origin.

We could easily replace the use of new RegularPoint(…​) with new RecordPoint(…​) in the code above, and we would get a similar output:

The point RecordPoint[x=0, y=0] is at the origin.
The point RecordPoint[x=1, y=1] is not at the origin.

It appears the two implementation variants of the interface Point work as expected. But how is the performance of code affected by switching from regular Java classes to records?

Benchmarks

Here is a benchmark that can be used to measure the effects of using records over regular classes:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value=3)
public class Bench {

    private static final RegularPoint REGULAR_ORIGIN = new RegularPoint(0, 0);
    private static final RecordPoint RECORD_ORIGIN = new RecordPoint(0, 0);

    private List<RegularPoint> regularPoints;
    private List<RecordPoint> recordPoints;

    @Setup
    public void setup() {
        regularPoints = IntStream.range(0, 16)
                .mapToObj(i -> new RegularPoint(i, i))
                .toList();

        recordPoints = IntStream.range(0, 16)
                .mapToObj(i -> new RecordPoint(i, i))
                .toList();
    }

    @Benchmark
    public void regular(Blackhole bh) {
        for (RegularPoint point: regularPoints) {
            if (point.x() == REGULAR_ORIGIN.x() && point.y() == REGULAR_ORIGIN.y()) {
                bh.consume(1);
            } else {
                bh.consume(0);
            }
        }
    }

    @Benchmark
    public void record(Blackhole bh) {
        for (RecordPoint point: recordPoints) {
            if (point.x() == RECORD_ORIGIN.x() && point.y() == RECORD_ORIGIN.y()) {
                bh.consume(1);
            } else {
                bh.consume(0);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }

}

When run on a Mac M1 laptop, the following results emerged (lower is better):

Benchmark      Mode  Cnt   Score   Error  Units
Bench.regular  avgt   15  10.424 ± 0.257  ns/op
Bench.record   avgt   15   9.412 ± 0.181  ns/op

As can be seen, records are about 10% faster than regular classes in this benchmark.

Here is what it looks like in a graph:

Graph 1 shows the performance of regular and record classes.

Under the Hood

A clue as to why records can be faster in cases like the above can be found in the HotSpot source file ciField.cpp, which tells the compiler which instance fields should be "trusted" when performing constant folding. The file also reveals other Java classes that benefit from the same optimization. One example is the Foreign Function & Memory API, which is slated to be finalized in Java 22 and where, for example, all classes implementing the various MemoryLayout variants are eligible for constant folding optimizations.

The C++ file above is internal to the VM and not available to regular Java programs but, by switching to records, we can directly reap the benefits of constant folding for instance fields.
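
As a purely conceptual sketch of what constant folding means here: once the fields of the static final ORIGIN record are trusted, the compiler is free to treat ORIGIN.x() and ORIGIN.y() as the constant 0. The method isOriginFolded below is hypothetical and hand-written for illustration only. It is not something the JIT emits as Java code, and the machine code actually generated may differ:

```java
public class FoldingSketch {

    record RecordPoint(int x, int y) {}

    static final RecordPoint ORIGIN = new RecordPoint(0, 0);

    // What the source code says: two field loads from ORIGIN per call
    static boolean isOrigin(RecordPoint point) {
        return point.x() == ORIGIN.x() && point.y() == ORIGIN.y();
    }

    // Hand-written illustration of the folded form: the trusted field
    // values of ORIGIN are replaced by the constants they hold
    static boolean isOriginFolded(RecordPoint point) {
        return point.x() == 0 && point.y() == 0;
    }

    public static void main(String[] args) {
        RecordPoint p = new RecordPoint(1, 1);
        // Both forms always agree; only the amount of work differs
        System.out.println(isOrigin(p) == isOriginFolded(p));           // true
        System.out.println(isOrigin(ORIGIN) == isOriginFolded(ORIGIN)); // true
    }
}
```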

As a final note, it should be said that modifying record fields (which are private and final) using, for example, Unsafe is … well … unsafe and would produce an undefined result. Don’t do that!

UPDATE:
Unsafe and reflection provide special protection against tampering with records, making it very difficult to update record fields via backdoors. For example, trying to obtain the field offset for a record component via Unsafe will result in an UnsupportedOperationException being thrown.
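
The reflection side of this protection can be observed with a small sketch. It assumes Java 16 or later, and the class name RecordTamperDemo is made up for illustration. Field.setInt refuses to write a record field even after setAccessible(true) has succeeded:

```java
import java.lang.reflect.Field;

public class RecordTamperDemo {

    record RecordPoint(int x, int y) {}

    // Returns true if reflection refuses to overwrite the record field
    static boolean writeIsRejected() {
        RecordPoint point = new RecordPoint(0, 0);
        try {
            Field x = RecordPoint.class.getDeclaredField("x");
            x.setAccessible(true);   // succeeds, and would allow reading the field
            x.setInt(point, 42);     // write access is never granted for record fields
            return false;
        } catch (IllegalAccessException e) {
            return true;             // the write was rejected
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeIsRejected()); // true
    }
}
```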

Conclusion

Records offer a convenient way of expressing data carriers. As an added benefit, they also provide improved performance compared to regular Java classes in some applications.

Monday, August 28, 2023

Java 22: Panama FFM Provides Massive Performance Improvements for Native Strings

 


The Panama Foreign Function and Memory (FFM) API is slated to be finalized in Java 22 and will then be a part of the public Java API. One thing that is perhaps less known is the significant performance improvements made by FFM in certain areas in 22. In this short article, we will be looking at benchmarking string conversion in FFM for Java 21 and Java 22 compared to using old JNI calls.

C and Java Strings

The Java Native Interface (JNI) has been used historically as a means to bridge Java to native calls before FFM was available. Both schemes entail converting strings back and forth between C and Java String data structures. As you might remember, C strings are just a bunch of bytes that are zero-terminated whereas Java Strings use a backing array with a known length.
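
The difference between the two representations can be sketched in plain Java, without touching FFM or JNI at all. The class CStringSketch and its two helper methods are hypothetical and only mimic the layout of a C string in a byte array:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CStringSketch {

    // Java -> "C": encode as UTF-8 and append the terminating zero byte
    static byte[] toCString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1); // the extra byte is 0
    }

    // "C" -> Java: the length is not stored, so scan for the terminator first
    static String toJavaString(byte[] cString) {
        int length = 0;
        while (length < cString.length && cString[length] != 0) {
            length++;
        }
        return new String(cString, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] c = toCString("hello");
        System.out.println(c.length);        // 6 (five characters plus the terminator)
        System.out.println(toJavaString(c)); // hello
    }
}
```

Both directions involve a pass over the bytes (a copy or a scan for the terminator), which is why the cost of conversion grows with the string size.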

When calling a native function that takes one or more strings and/or returns a string (both are relatively common), the performance of converting strings back and forth becomes important.

Benchmarks

We have run some string benchmarks (see the source code here) on an AMD Ryzen 9 3900X 12-Core Processor machine and preliminary results indicate blistering performance for FFM string conversion in Java 22:

Benchmark                           (size)  Mode  Cnt    Score   Error  Units
ToJavaStringTest.jni_readString          5  avgt   30   86.520 ± 1.842  ns/op
ToJavaStringTest.jni_readString         20  avgt   30   97.151 ± 1.459  ns/op
ToJavaStringTest.jni_readString        100  avgt   30  143.853 ± 1.287  ns/op
ToJavaStringTest.jni_readString        200  avgt   30  189.867 ± 2.337  ns/op
ToJavaStringTest.panama_readString       5  avgt   30   21.380 ± 0.351  ns/op
ToJavaStringTest.panama_readString      20  avgt   30   36.250 ± 0.520  ns/op
ToJavaStringTest.panama_readString     100  avgt   30   43.368 ± 0.544  ns/op
ToJavaStringTest.panama_readString     200  avgt   30   53.442 ± 2.048  ns/op


Benchmark                         (size)  Mode  Cnt    Score   Error  Units
ToCStringTest.jni_writeString          5  avgt   30   47.450 ± 0.832  ns/op
ToCStringTest.jni_writeString         20  avgt   30   56.208 ± 0.422  ns/op
ToCStringTest.jni_writeString        100  avgt   30  108.341 ± 0.459  ns/op
ToCStringTest.jni_writeString        200  avgt   30  157.119 ± 1.669  ns/op
ToCStringTest.panama_writeString       5  avgt   30   45.361 ± 0.717  ns/op
ToCStringTest.panama_writeString      20  avgt   30   47.742 ± 0.554  ns/op
ToCStringTest.panama_writeString     100  avgt   30   47.580 ± 0.673  ns/op
ToCStringTest.panama_writeString     200  avgt   30   49.060 ± 0.694  ns/op

Needless to say, the ToJavaStringTest runs convert a C string to a Java string whereas the ToCStringTest runs convert a Java string to a C string. The size indicates the number of bytes of the original string. Java strings were encoded in UTF-8.

As can be seen, FFM in Java 22 converts C strings to Java strings between roughly 2.7 and 4 times faster than JNI in these runs. In the other direction, performance is about the same for small strings but, for larger strings (where it matters more), the speedup factor keeps growing with the string size. For example, for strings of length 200, FFM is more than three times faster.
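
The speedup factors can be derived directly from the scores in the two tables above by dividing the JNI score by the FFM score for each size. The small helper class below (SpeedupTable is a made-up name) does just that, using only the numbers already listed:

```java
import java.util.Locale;

public class SpeedupTable {

    // Speedup factor: how many times faster FFM is compared to JNI
    static double speedup(double jniNanos, double ffmNanos) {
        return jniNanos / ffmNanos;
    }

    public static void main(String[] args) {
        int[] sizes = {5, 20, 100, 200};
        // Scores in ns/op, copied from the benchmark tables above
        double[] jniRead  = {86.520, 97.151, 143.853, 189.867};
        double[] ffmRead  = {21.380, 36.250, 43.368, 53.442};
        double[] jniWrite = {47.450, 56.208, 108.341, 157.119};
        double[] ffmWrite = {45.361, 47.742, 47.580, 49.060};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf(Locale.ROOT,
                    "size %3d: C-to-Java %.2fx, Java-to-C %.2fx%n",
                    sizes[i],
                    speedup(jniRead[i], ffmRead[i]),
                    speedup(jniWrite[i], ffmWrite[i]));
        }
    }
}
```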

Note

It should be noted that the benchmarks are not purely about string conversion, as a JNI call also incurs a small state transition penalty for each call. The Java-to-C string conversion performs a memory allocation (for the string bytes). While this could be avoided, it was included in the benchmark as that is what happens with JNI’s GetStringUTFChars.

Diagrams

Here are two diagrams outlining the performance benefits of FFM in comparison with JNI. The diagrams also include FFM in Java 21 to highlight the recent performance improvements made in 22 (lower is better):



Diagram 1 shows the performance of converting a C string to a Java String.



Diagram 2 shows the performance of converting a Java string to a C String.

Future Improvements

FFM allows us to use custom allocators and so, if we make several calls, we can reuse memory segments thereby improving performance further. This is not possible with JNI.

It is also possible that we will see even better FFM performance in future Java versions once the Vector API becomes a final feature.

JDK Early-Access Builds

Run your own code on an early access JDK today by downloading a JDK Early-Access Build.

Note

At the time of writing this article, the performance improvements are not yet merged into the Java 22 mainline. You can, however, build your own snapshot version with the performance improvements mentioned above by cloning github.com/openjdk/panama-foreign

Resources

Acknowledgments

This article was written by me (Per Minborg) and Maurizio Cimadamore.

Wednesday, August 2, 2023

Java: New Draft JEP: "Computed Constants"

Java: JEP Draft: "Computed Constants"

We finally made the draft JEP "Computed Constants" public and I can’t wait to tell you more about it! ComputedConstant objects are superfast immutable value holders that can be initialized independently of when they are created. As an added benefit, these objects may in the future be even more optimized via "condensers" that eventually might become available through project Leyden.

Background

Oftentimes, we use static fields to hold objects that are only initialized once:

// ordinary static initialization
private static final Logger LOGGER = Logger.getLogger("com.foo.Bar");
...
LOGGER.log(...);

The LOGGER variable will be unconditionally initialized as soon as the class where it is declared is initialized (which occurs when the class is first used, for example when one of its static methods is invoked).

One way to prevent all static fields in a class from being initialized at the same time is to use the class holder idiom allowing us to defer initialization until we actually need the variable:

// Initialization-on-demand holder idiom
Logger logger() {
    class Holder {
         static final Logger LOGGER = Logger.getLogger("com.foo.Bar");
    }
    return Holder.LOGGER;
}
...
logger().log(...);

While this works well in theory, there are significant drawbacks: 

  • Each constant that needs to be decoupled would need its own holding class (adding static footprint overhead) 
  • Only works if the decoupled constants are independent 
  • Only works for static variables and not for instance variables and objects
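
The deferral provided by the holder idiom can be observed with a small sketch. The class HolderDemo, its counter and the "expensive" value are hypothetical and exist only to make the initialization visible:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class HolderDemo {

    // Counts how many times the deferred value has been computed
    static final AtomicInteger INITIALIZATIONS = new AtomicInteger();

    // The holder class is initialized only when VALUE is first accessed
    private static class Holder {
        static final String VALUE = compute();
    }

    private static String compute() {
        INITIALIZATIONS.incrementAndGet();
        return "expensive";
    }

    static String value() {
        return Holder.VALUE;
    }

    public static void main(String[] args) {
        System.out.println(INITIALIZATIONS.get()); // 0, nothing computed yet
        value();
        value();
        System.out.println(INITIALIZATIONS.get()); // 1, computed once on first access
    }
}
```

Note that every deferred constant needs a class like Holder of its own, which is the footprint overhead mentioned in the first bullet above.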

Another way to defer initialization is to use the double-checked locking idiom. This works for static variables, instance variables and objects alike:

// Double-checked locking idiom
class Foo {

    private volatile Logger logger;

    public Logger logger() {
        Logger v = logger;
        if (v == null) {
            synchronized (this) {
                v = logger;
                if (v == null) {
                    logger = v = Logger.getLogger("com.foo.Bar");
                }
            }
        }
        return v;
    }
}
...
foo.logger().log(...);

There is no way for the (current) JVM to determine that the logger is monotonic in the sense that it can only change from null to a value once and then never changes again. So, the JVM is unable to apply constant folding and other optimizations. Also, because logger needs to be declared volatile, there is a small performance penalty paid for each access.

The ComputedConstant class comes to the rescue here and offers the best of two worlds: Flexible initialization and good performance!

Computed Constant

Here is how ComputedConstant can be used with the logger example:

class Bar {
    // 1. Declare a computed constant value
    private static final ComputedConstant<Logger> LOGGER =
            ComputedConstant.of( () -> Logger.getLogger("com.foo.Bar") );

    static Logger logger() {
        // 2. Access the computed value
        //    (evaluation made before the first access)
        return LOGGER.get();
    }
}

This is similar in spirit to the class-holder idiom, and offers the same performance, constant-folding, and thread-safety characteristics, but is simpler and incurs a lower static footprint since no additional class is required.
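
Since ComputedConstant is not available in any released JDK, the observable contract (the supplier runs at most once, upon first access) can only be illustrated with a stand-in. The Lazy class below is hypothetical and is not the class proposed in the draft JEP: it uses double-checked locking internally, does not support null values, and therefore gets none of the constant-folding treatment described above:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyConstantSketch {

    // Hypothetical stand-in, NOT the ComputedConstant from the draft JEP
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<? extends T> supplier;
        private volatile T value; // null means "not computed yet"

        Lazy(Supplier<? extends T> supplier) {
            this.supplier = supplier;
        }

        @Override
        public T get() {
            T v = value;
            if (v == null) {
                synchronized (this) {
                    v = value;
                    if (v == null) {
                        value = v = supplier.get(); // runs at most once
                    }
                }
            }
            return v;
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Lazy<String> greeting = new Lazy<>(() -> {
            calls.incrementAndGet();
            return "hello";
        });

        System.out.println(calls.get());    // 0, creation does not trigger the supplier
        System.out.println(greeting.get()); // hello
        System.out.println(greeting.get()); // hello
        System.out.println(calls.get());    // 1, the supplier ran only once
    }
}
```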

Benchmarks

I’ve run some benchmarks on my Mac Pro M1 ARM-based machine and preliminary results indicate excellent performance for static ComputedConstant fields:

Benchmark      Mode  Cnt  Score   Error  Units
staticHolder   avgt   15  0.561 ± 0.002  ns/op
doubleChecked  avgt   15  1.122 ± 0.003  ns/op
constant       avgt   15  0.563 ± 0.002  ns/op // static ComputedConstant

As can be seen, a ComputedConstant has the same performance as the static holder (but with no extra class footprint) and much better performance than a double-checked locking variable.

Collections of ComputedConstant

So far so good. However, the hidden gem in the JEP is the ability to obtain Collections of ComputedConstant elements. This is achieved using a factory method that provides not a single ComputedConstant (with its provider) but a whole List of ComputedConstant elements that is handled by a single providing mapper that can initialize all the elements in the list. This allows a large number of variables to be handled via a single list, thereby saving space compared to having many single constants and initialization lambdas (for example).

Like a ComputedConstant<V> variable, a List<ComputedConstant<V>> variable is created by providing an element mapper - typically in the form of a lambda expression, which is used to compute the value associated with the i-th element of the List when the element value is first accessed:

class Fibonacci {
    static final List<ComputedConstant<Integer>> FIBONACCI =
            ComputedConstant.of(1_000, Fibonacci::fib);

    static int fib(int n) {
        return (n < 2)
                ? n
                // List.get(i) yields a ComputedConstant; its get() yields the value
                : FIBONACCI.get(n - 1).get() + FIBONACCI.get(n - 2).get();
    }

    int[] fibs = IntStream.range(0, 10)
            .map(Fibonacci::fib)
            .toArray(); // { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34 }

}

Note how there’s only one field of type List<ComputedConstant<Integer>> to initialize - every other computation is performed on-demand when the corresponding element of the List FIBONACCI is accessed.

When a computation depends on other sub-computations, it induces a dependency graph, where each computation is a node in the graph, and has zero or more edges to each of the sub-computation nodes it depends on. For instance, the dependency graph associated with fib(5) is given below:

               ___________fib(5)___________
              /                            \
        ____fib(4)____                ____fib(3)____
       /              \              /              \
     fib(3)         fib(2)         fib(2)          fib(1)
    /      \       /      \       /      \
  fib(2)  fib(1) fib(1)  fib(0) fib(1)  fib(0)

The Computed Constant API allows modeling this cleanly, while still preserving good constant-folding guarantees and integrity of updates in the case of multi-threaded access.
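
The way each node of such a graph is computed only once can be sketched with a hypothetical stand-in. LazyList below is not the List returned by the draft API: it is a simple synchronized, memoizing container without any constant-folding benefits, and it uses long values so that larger Fibonacci numbers fit:

```java
import java.util.function.IntFunction;

public class LazyListSketch {

    // Hypothetical stand-in for a list of computed constants
    static final class LazyList<V> {
        private final Object[] values; // null means "not computed yet"
        private final IntFunction<? extends V> mapper;

        LazyList(int size, IntFunction<? extends V> mapper) {
            this.values = new Object[size];
            this.mapper = mapper;
        }

        // synchronized is re-entrant, so the mapper may call get()
        // recursively for the elements it depends on
        @SuppressWarnings("unchecked")
        synchronized V get(int index) {
            Object v = values[index];
            if (v == null) {
                values[index] = v = mapper.apply(index); // computed once per index
            }
            return (V) v;
        }
    }

    static final LazyList<Long> FIBONACCI = new LazyList<>(91, LazyListSketch::fib);

    static long fib(int n) {
        return (n < 2)
                ? n
                : FIBONACCI.get(n - 1) + FIBONACCI.get(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(10)); // 55
        System.out.println(fib(50)); // 12586269025
    }
}
```

Because every element is stored after its first computation, the shared sub-trees in the graph above (for example fib(3) and fib(2)) are evaluated only once.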

Benchmarks Collections

These benchmarks were run on the same platform as above and show collections of ComputedConstant elements enjoy the same performance benefits as the single ones do:

Benchmark      Mode  Cnt  Score   Error  Units
staticHolder   avgt   15  0.570 ± 0.005  ns/op // int[] in a holder class
doubleChecked  avgt   15  1.124 ± 0.044  ns/op
constant       avgt   15  0.562 ± 0.005  ns/op // List<ComputedConstant>

Again, the ComputedConstant clocks in at native static array speed while providing much better flexibility as to when its elements are initialized.

Instance Performance

The performance for instance variables and objects is superior to holders using the double-checked idiom shown above, as can be seen in the benchmarks below:

Benchmark      Mode  Cnt  Score   Error  Units
doubleChecked  avgt   15  1.259 ± 0.023  ns/op
constant       avgt   15  0.728 ± 0.022  ns/op // ComputedConstant

So, ComputedConstant is more than 40% faster than the double-checked holder class tested on my machine.

Note: Instance performance is subject to review.

Where is it?

At the time of writing this article, ComputedConstant is not yet available in the mainline JDK repository. Check out the next section for a link to the proposed source code.

Acknowledgments

Parts of the text in this article were written by Maurizio Cimadamore.