Thursday, January 26, 2023

Java 21: Performance Improvements Revealed

 In Java 21, old code might run significantly faster due to recent internal performance optimizations made in the Java Core Libraries. In this article, we will take a closer look at some of these changes and see how much faster your favorite programming language has become. Buckle up, for we are about to run at full speed!

Background

When converting primitive values such as int and long values back and forth to certain external representations, such as a file, the internal class java.io.Bits is used. In previous Java versions, conversion in this class was made using explicit bit shifting as shown hereunder:

static long getLong(byte[] b, int off) {
    return ((b[off + 7] & 0xFFL)      ) +
           ((b[off + 6] & 0xFFL) <<  8) +
           ((b[off + 5] & 0xFFL) << 16) +
           ((b[off + 4] & 0xFFL) << 24) +
           ((b[off + 3] & 0xFFL) << 32) +
           ((b[off + 2] & 0xFFL) << 40) +
           ((b[off + 1] & 0xFFL) << 48) +
           (((long) b[off])      << 56);
}

A closer look reveals that the code extracts a long from a backing byte array by successively extracting each byte value, left-shifting it a different number of steps, and summing the results.

As the lowest-index byte is the most significant (i.e. it is shifted to the left the most), extraction is made in big-endian order (also called “network order”). There are eight similar steps in the algorithm, where each step is on a separate line, and each step comprises six sub-operations:

  1. Add a constant to the provided off parameter
  2. Extract a byte value at an index from the provided b array (including an index-bounds check)
  3. Convert the byte value to a long (as an AND operation with another long on the LHS is imminent)
  4. Perform an AND operation with the long value 0xFF
  5. Shift the result to the left a number of steps
  6. Accumulate the resulting value (via the + operation)

Hence, there are eight times six operations in total (= 48 operations) that need to be performed. In reality, Java is able to optimize these operations slightly, for example by leveraging CPU instructions that can perform several operations in a single step.

Calling getLong() from an outer loop entails checking index bounds many times as it is difficult to hoist boundary checking outside the outer loop due to the method’s complexity.
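The shift-and-sum extraction can be cross-checked against the JDK's long-standing big-endian reader in ByteBuffer, which also uses network order by default. A minimal sketch (the class name BigEndianDemo and the test values are mine):

```java
import java.nio.ByteBuffer;

public class BigEndianDemo {

    // The same shift-and-sum extraction as the pre-Java 21 java.io.Bits
    static long getLong(byte[] b, int off) {
        return ((b[off + 7] & 0xFFL)      ) +
               ((b[off + 6] & 0xFFL) <<  8) +
               ((b[off + 5] & 0xFFL) << 16) +
               ((b[off + 4] & 0xFFL) << 24) +
               ((b[off + 3] & 0xFFL) << 32) +
               ((b[off + 2] & 0xFFL) << 40) +
               ((b[off + 1] & 0xFFL) << 48) +
               (((long) b[off])      << 56);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x01, 0x23, 0x45, 0x67,
                        (byte) 0x89, (byte) 0xAB, (byte) 0xCD, (byte) 0xEF};
        long manual = getLong(bytes, 0);
        // ByteBuffer reads big-endian ("network order") by default
        long viaBuffer = ByteBuffer.wrap(bytes).getLong();
        System.out.println(manual == viaBuffer);   // prints true
        System.out.println(Long.toHexString(manual));
    }
}
```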

Improvements in Java 21

In Java 21, conversions are made with VarHandle constructs instead and the class java.io.Bits was moved and renamed to jdk.internal.util.ByteArray so that other classes from various packages could benefit from it too. Here is what the ByteArray::getLong method looks like in Java 21:

private static final VarHandle LONG =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

static long getLong(byte[] b, int off) {
    return (long) LONG.get(b, off);
}

Here, it looks like only one operation is performed. However, in reality, there are several things going on under the covers of the VarHandle::get operation. On platforms using little-endian byte order (which is almost 100% of the user base), the byte order needs to be swapped. Also, index bounds must be checked.

The (long) cast is needed in order to prevent auto-boxing/un-boxing of the return value of the LONG VarHandle. The inner workings of VarHandle objects and their coordinates are otherwise beyond the scope of this article.

As VarHandles are first-class citizens of the Java language, significant effort has been put into making them efficient. One can only assume the byte-swapping operations are optimized for the platform at hand. Also, the array bounds checking can be hoisted outside the many sub-steps so only one check is needed.

In addition to internal boundary-check hoisting, the VarHandle construct makes it easier for Java to further hoist boundary checks outside an outer loop compared to the older, more complex implementation used prior to Java 21.
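The same VarHandle technique is available outside the JDK via the public MethodHandles API. A self-contained sketch (the class, the sum helper, and the test values are mine, not the JDK's internals):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleDemo {

    // A big-endian long view over a byte[], in the style of jdk.internal.util.ByteArray
    private static final VarHandle LONG =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    static long getLong(byte[] b, int off) {
        // The (long) cast selects the primitive-returning invocation;
        // without it, the polymorphic get would return a boxed Object
        return (long) LONG.get(b, off);
    }

    // Reading many values in a loop: this is the shape of code where
    // the JIT can hoist bounds checks outside the loop
    static long sum(byte[] b, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += getLong(b, i * Long.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] b = new byte[16];
        b[7] = 1;   // big-endian: least significant byte of the first long
        b[15] = 2;  // least significant byte of the second long
        System.out.println(sum(b, 2)); // prints 3
    }
}
```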

Almost all methods in Bits/ByteArray were rewritten, not only getLong(). So, both reading and writing short, int, float, long, and double values are now much faster.

Affected Classes and Impact

The improved jdk.internal.util.ByteArray class is used directly by the following Core Library classes:

  • ObjectInputStream
  • ObjectOutputStream
  • ObjectStreamClass
  • RandomAccessFile


Even though it appears the direct usage of ByteArray is limited, there is an enormous transitive use of these classes. For example, the three Object Stream classes above are used extensively in conjunction with serialization. 

This means, in many cases, Java serialization is much faster now!

The RandomAccessFile class is used internally in the JDK for graphics and sound input/output as well as Zip and directory handling. 

More importantly, there are a large number of third-party libraries that rely on these improved classes. They, and all applications that use them, will automatically benefit from these improvements in speed. No change in your application code is needed. It just runs faster!


Raw Benchmarks

The details of the first benchmarks shown hereunder are described in this pull request. The actual change of Bits was made via this pull request.

I have run the benchmarks under Linux x64, Windows x64, and Mac aarch64. Note that this implied running them on different hardware, so these results can’t be compared across operating systems. In other words, Mac aarch64 is not necessarily faster than Linux x64.

I’ve run the tests using the above ByteArray::getLong method, with an outer loop of two iterations reading long values from an array. The more iterations in the outer loop, the more pronounced the advantage of the VarHandle access becomes. One likely reason for this is that the C2 compiler is able to hoist boundary checks outside the outer loop.
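The benchmarks referenced above use JMH, which is the right tool for serious measurements. Purely as a side-by-side illustration of the two access styles (a hand-rolled timing loop, so the numbers are noisy and non-authoritative; all names here are mine):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Random;

public class RoughTiming {

    private static final VarHandle LONG =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    // Pre-Java 21 style: explicit shift-and-sum
    static long shiftGetLong(byte[] b, int off) {
        return ((b[off + 7] & 0xFFL)      ) +
               ((b[off + 6] & 0xFFL) <<  8) +
               ((b[off + 5] & 0xFFL) << 16) +
               ((b[off + 4] & 0xFFL) << 24) +
               ((b[off + 3] & 0xFFL) << 32) +
               ((b[off + 2] & 0xFFL) << 40) +
               ((b[off + 1] & 0xFFL) << 48) +
               (((long) b[off])      << 56);
    }

    // Java 21 style: a single VarHandle access
    static long varHandleGetLong(byte[] b, int off) {
        return (long) LONG.get(b, off);
    }

    public static void main(String[] args) {
        byte[] data = new byte[8 * 1_024];
        new Random(42).nextBytes(data);

        long sum1 = 0, sum2 = 0;
        long t0 = System.nanoTime();
        for (int r = 0; r < 10_000; r++)
            for (int i = 0; i < data.length; i += 8) sum1 += shiftGetLong(data, i);
        long t1 = System.nanoTime();
        for (int r = 0; r < 10_000; r++)
            for (int i = 0; i < data.length; i += 8) sum2 += varHandleGetLong(data, i);
        long t2 = System.nanoTime();

        // Identical checksums prove the two readers agree; the timings are indicative only
        System.out.println("shift-and-sum: " + (t1 - t0) / 1_000_000 + " ms, sum=" + sum1);
        System.out.println("VarHandle:     " + (t2 - t1) / 1_000_000 + " ms, sum=" + sum2);
    }
}
```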



Graph 1 shows the improvement in speed in Bits for various platforms.


Serialization Benchmarks

So, given the performance increase in ByteArray looks awesome, what will be the practical effect on serialization given all the other things that need to happen during the serialization process?

Consider the following classes that contain all the primitive types (except boolean):

static final class MyData implements Serializable {

    byte b;
    char c;
    short s;
    int i;
    float f;
    long l;
    double d;

    public MyData(byte b, char c, short s, int i, float f, long l, double d) {
        this.b = b;
        this.c = c;
        this.s = s;
        this.i = i;
        this.f = f;
        this.l = l;
        this.d = d;
    }

}


record MyRecord(byte b,
                char c,
                short s,
                int i,
                float f,
                long l,
                double d) implements Serializable {}



where the complete PrimitiveFieldSerializationBenchmark is available here. Running these benchmarks that serialize instances of the classes above on my laptop (macOS 12.6.1, MacBook Pro (16-inch, 2021) M1 Max) produced the following result:


Baseline (20-ea+30-2297)

Benchmark                           Mode  Cnt  Score   Error  Units
SerializeBenchmark.serializeData    avgt    8  7.283 ± 0.070  ns/op
SerializeBenchmark.serializeRecord  avgt    8  7.275 ± 0.201  ns/op

Java 21

Benchmark                           Mode  Cnt  Score   Error  Units
SerializeBenchmark.serializeData    avgt    8  6.793 ± 0.132  ns/op
SerializeBenchmark.serializeRecord  avgt    8  6.733 ± 0.032  ns/op


This is good news! Our classes now serialize more than 5% faster.

Graph 2 shows the improvement in serialization for two classes.
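For reference, the serialization round-trip that such a benchmark times can be sketched with a plain harness (the SerializeDemo class and its helper methods are mine, not part of the benchmark linked above):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializeDemo {

    record MyRecord(byte b, char c, short s, int i,
                    float f, long l, double d) implements Serializable {}

    static byte[] serialize(Object o) throws IOException {
        var bos = new ByteArrayOutputStream();
        try (var oos = new ObjectOutputStream(bos)) {
            // The primitive fields are written via the internal Bits/ByteArray machinery
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (var ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        var original = new MyRecord((byte) 1, 'c', (short) 3, 4, 5.0f, 6L, 7.0);
        var copy = (MyRecord) deserialize(serialize(original));
        System.out.println(original.equals(copy)); // prints true
    }
}
```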


Future Improvements

There are several other classes in the JDK that look similar and that might benefit from the same type of performance improvements once they are optimized with VarHandle access. 

Caring for old code is a trait of good stewardship!


Actual Application Performance Increase

How much faster will your applications run under Java 21 in reality if you use one or more of these improved classes (directly or indirectly)? There is only one way to find out: Run your own code on Java 21 today by downloading a JDK 21 Early-Access Build.


Wednesday, January 18, 2023

Java 20: An Almost Infinite Memory Segment Allocator

Wouldn’t it be cool if you could allocate an infinite amount of memory? In a previous article, I elaborated a bit on how to create memory-mapped files which could be sparse. In this article, we will learn how this can be leveraged as an under-carriage for providing a memory-allocating arena that can return an almost infinite amount of native memory without ever throwing an OutOfMemoryError.


Arena

An Arena controls the lifecycle of native memory segments, providing both flexible allocation and timely deallocation.


There are two built-in Arena types in Java 20:


  • A Confined Arena (available via the Arena::openConfined factory)
  • A Shared Arena (available via the Arena::openShared factory)


As the names imply, memory segments obtained from a Confined Arena can only be used by the thread that initially created the Arena, whereas memory segments from a Shared Arena can be used by any thread. Both types will allocate pure unmapped native memory.


The InfiniteArena Class

By creating a new class as shown hereunder, we can provide an implementation that differs from the built-in Arena types in the way that it will provide memory-mapped memory instead of pure native memory.


The class is using sparse files to reduce required file space on platforms where this is supported.


...

import static java.nio.channels.FileChannel.MapMode.READ_WRITE;
import static java.nio.file.StandardOpenOption.*;
import static java.util.Objects.requireNonNull;

public final class InfiniteArena implements Arena {

    private static final Set<OpenOption> OPTS =
            Set.of(CREATE_NEW, SPARSE, READ, WRITE);

    private final String fileName;
    private final AtomicLong cnt;
    private final Arena delegate;

    public InfiniteArena(String fileName) {
        this.fileName = requireNonNull(fileName);
        this.cnt = new AtomicLong();
        this.delegate = Arena.openShared();
    }

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        try {
            try (var fc = FileChannel.open(
                    Path.of(fileName + "-" + cnt.getAndIncrement()), OPTS)) {
                return fc.map(READ_WRITE, 0, byteSize, delegate.scope());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public SegmentScope scope() {
        return delegate.scope();
    }

    @Override
    public void close() {
        delegate.close();
    }

    @Override
    public boolean isCloseableBy(Thread thread) {
        return delegate.isCloseableBy(thread);
    }

}


As seen above, the byteAlignment parameter of Arena::allocate is ignored, on the assumption that mapped memory is highly aligned by default on all supporting platforms and that byteAlignment is relatively low. Obviously, in a production system, this would have to be handled more strictly.


On my machine (macOS 12.6.1) and using the examples below, mapped memory addresses are always aligned to at least 2^14 = 16,384-byte boundaries.


Using the InfiniteArena

Here is an example of how the InfiniteArena can be used in an application:


public static void main(String[] args) {

    try (Arena arena = new InfiniteArena("my-mapped-memory")) {
        MemorySegment s0 = arena.allocate(1L << 40);
        // Do nothing with s0

        MemorySegment s1 = arena.allocate(1L << 40);
        // Fill the region 1024 to 1024+256-1 with the value 2
        s1.asSlice(1024, 256)
                .fill((byte) 2);

        MemorySegment s2 = arena.allocate(16);
        // Write a String to the segment
        s2.setUtf8String(0, "Hello World");
    }
}


In the try-with-resources block, we create an InfiniteArena with the base file name "my-mapped-memory" to be used for the backing mapped files. The Arena is then used to allocate three native MemorySegment instances, where the first two are of size 1 TiB and the last is only 16 bytes. Note that, since we never touch the first segment s0 and only use a small portion of the second segment s1, the required physical disk space for these segments is minimal.


Lastly, a small segment s2 of 16 bytes is created in which we put the all-familiar “Hello World” string.


Inspecting the Files

After the code completes, we can inspect the lingering files:


% ls -lart

...

-rw-r--r--   1 pminborg  staff  1099511627776 Jan  9 17:18 my-mapped-memory-0

-rw-r--r--   1 pminborg  staff  1099511627776 Jan  9 17:18 my-mapped-memory-1

-rw-r--r--   1 pminborg  staff             16 Jan  9 17:18 my-mapped-memory-2



As can be seen, the file names are created as my-mapped-memory-X where X is the sequence number of created MemorySegment instances; 0, 1, …  The first two are large (1 TiB as expected). We can inspect the actual disk usage of all the files:


% du -h my-mapped-memory-*

  0B    my-mapped-memory-0

 16K    my-mapped-memory-1

4.0K    my-mapped-memory-2


Here is how they look in detail:


% hexdump -C my-mapped-memory-0 

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|



% hexdump -C my-mapped-memory-1

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

*

00000400  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|

*

00000500  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|



Note 1: “*” indicates the lines are the same (except the address) until there is another address indication.

Note 2: The hexdump command takes a long time to complete for 1 TiB files so only the first lines of output are shown above.


Here is what the smaller my-mapped-memory-2 file looks like:


% hexdump -C my-mapped-memory-2

00000000  48 65 6c 6c 6f 20 57 6f  72 6c 64 00 00 00 00 00  |Hello World.....|

00000010


Cool! Being able to inspect used memory post-mortem could provide invaluable insights in many cases.


Drawbacks and Future Improvements

As a file needs to be created upon every allocation, allocation is slower than with the built-in arenas. Also, if all the space is actually used, a sparse file requires more resources than a non-sparse file would. It would be trivial to modify the code above to use non-sparse files.


In the implementation above, the allocated files remain after the Arena has been closed, and indeed even after the JVM exits. This allows allocated memory segments to be inspected and traced afterwards, as exemplified above. It would be a small addition to clean up lingering files where needed, for example so that the program can be re-run without complaining about files that already exist. It is also possible to use a unique name for each application run to avoid file-name collisions.


Mapped files can be much larger than the physical RAM, but that comes at the price of potentially swapping virtual memory in and out should the accessed memory not be resident.


What’s Next?

Try out the InfiniteArena today by downloading a JDK 20 Early-Access Build. Do not forget to pass the --enable-preview JVM flag or your code will not run. 



Monday, January 9, 2023

Java 20: Colossal Sparse Memory Segments

Did you know you can allocate memory segments that are larger than the physical size of your machine’s RAM and indeed larger than the size of your entire file system? Read this article and learn how to make use of mapped memory segments that may or may not be “sparse” and how to allocate 64 terabytes of sparse data on a laptop.


Mapped Memory

Mapped memory is virtual memory that has been assigned a one-to-one mapping to a portion of a file. The term “file” is quite broad here and may be represented by a regular file, a device, shared memory or any other thing that the operating system may refer to via a file descriptor.


Accessing files via mapped memory is often much faster than accessing a file via the standard file operations like read and write. Because mapped memory is operated on directly, some interesting solutions can also be constructed via atomic memory operations such as compare-and-set operations, allowing very efficient inter-thread and inter-process communication channels. 


Because not all parts of the mapped virtual memory must reside in real memory at the same time, a mapped memory segment can be much larger than the physical RAM of the machine it is running on. If a portion of the mapped memory is not available when accessed, the operating system will temporarily suspend the current thread and load the missing page, after which the operation may resume.


Other advantages of mapped files are that they can be shared across processes running in different JVMs, and that the files remain persistent and can be inspected using any file tool, such as hexdump.


Setting up a Mapped Memory Segment


The new Foreign Function and Memory feature that previews for the second time in Java 20 allows large memory segments to be mapped to a file. Here is how you can create a memory segment of size 4 GiB backed by a file.


Set<OpenOption> opts = Set.of(CREATE, READ, WRITE);
try (FileChannel fc = FileChannel.open(Path.of("myFile"), opts);
     Arena arena = Arena.openConfined()) {

    MemorySegment mapped =
            fc.map(READ_WRITE, 0, 1L << 32, arena.scope());

    use(mapped);

} // Resources allocated by "mapped" are released here via TwR



Sparse Files

A sparse file is a file where information can be stored in an efficient way if not all portions of the file are actually used. A file with large unused “holes” is an example of such a file whereby only the used sections are actually stored in the underlying physical file. In reality, however, the unused holes also consume some resources albeit much less than their used counterparts.

Figure 1 illustrates a logical sparse file where only actual data elements are stored in the physical file.


As long as the sparse file is not filled with too much data, it is possible to allocate a sparse file that is much larger than the available physical disk space. For example, it is possible to allocate an empty 10 TB memory segment backed by a sparse file on a filesystem with very little available capacity. 


It should be noted that not all platforms support sparse files.
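Whether a platform honors the SPARSE hint can be probed with the long-standing NIO file API, with no preview features required. A sketch (the SparseProbe class and its helper are mine); on file systems without sparse support, the SPARSE option is silently ignored and the file is materialized in full:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

import static java.nio.file.StandardOpenOption.*;

public class SparseProbe {

    // Creates a file whose logical size is `size` bytes while writing only one byte.
    // On file systems with sparse support, the "hole" before the last byte
    // takes (almost) no disk space.
    static long createSparse(Path path, long size) throws Exception {
        try (FileChannel fc = FileChannel.open(path, Set.of(CREATE_NEW, SPARSE, WRITE))) {
            fc.write(ByteBuffer.wrap(new byte[] {1}), size - 1);
        }
        return Files.size(path);
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempDirectory("sparse-demo").resolve("sparse");
        long logicalSize = createSparse(path, 1L << 32); // a logical 4 GiB
        System.out.println(logicalSize);                 // prints 4294967296
        Files.delete(path);
    }
}
```

Comparing `ls -l` (logical size) with `du -h` (actual blocks used) on the resulting file shows whether the hole was stored sparsely.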


Setting up a Sparsely Mapped Memory Segment 


Here is an example of how to create and access the contents of a file via a memory-mapped MemorySegment whose contents are sparse, meaning the real underlying data in the file is expanded automatically as needed:


Set<OpenOption> sparse = Set.of(CREATE_NEW, SPARSE, READ, WRITE);
try (var fc = FileChannel.open(Path.of("sparse"), sparse);
     var arena = Arena.openConfined()) {

    MemorySegment mapped =
            fc.map(READ_WRITE, 0, 1L << 32, arena.scope());

    use(mapped);

} // Resources allocated by "mapped" are released here via TwR


Note: The file will appear to consist of 4 GiB of data but in reality the file does not use any (apparent) file-system space at all:


pminborg@pminborg-mac ntive % ll sparse 

-rw-r--r--  1 pminborg  staff  4294967296 Nov 14 16:12 sparse


pminborg@pminborg-mac ntive % du -h sparse 

  0B sparse



Going Colossal

The implementation of sparse files varies across the many platforms that are supported by Java and consequently, various sparse-file properties will vary depending on where an application is deployed.


I am using a Mac M1 under macOS Monterey (12.6.1) with 32 GiB RAM and 1 TiB storage (of which 900 GiB are available).


I was able to map a single sparse file of up to 64 TiB using a single mapped memory segment on my machine (using its standard settings):


  4 GiB -> ok as demonstrated above

  1 TiB -> ok

 32 TiB -> ok

 64 TiB -> ok

128 TiB -> failed with OutOfMemoryError


It is possible to increase the amount of mappable memory, but that is out of scope for this article. In real applications, it is better to map smaller portions of a sparse file into memory rather than mapping the entire sparse file in one chunk. These smaller mappings then act as “windows” into the larger underlying file.


 Anyhow, this looks pretty colossal:


-rw-r--r--   1 pminborg  staff  70368744177664 Nov 22 13:34 sparse


Creating the empty 64 TiB sparse file took about 200 ms on my machine.


Unrelated Observations on Thread Confinement

As can be seen above, with file mapping it is possible to access the same underlying physical memory from different threads (and indeed even from different processes), even though it is viewed through several distinct thread-confined MemorySegment instances.


What’s Next?

Try out mapped segments today by downloading a JDK 20 Early-Access Build. Do not forget to pass the --enable-preview JVM flag or your code will not run.