Minborg

Minborg
Minborg

Thursday, December 30, 2021

Java: Creating Terabyte Sized Queues with Low-Latency

Queues are often fundamental components in software design patterns. But, what if there are millions of messages received every second and multi-process consumers need to be able to read the complete ledger of all messages? Java can only hold so much information before the heap becomes a limiting factor with high-impacting garbage collections as a result, potentially preventing us from fulfilling targeted SLAs or even halting the JVM for seconds or even minutes.


This article covers how to create huge persisted queues while retaining predictable and consistent low latency using open-source Chronicle Queue.

The Application

In this article, the objective is to maintain a queue of objects from market data feeds (e.g. the latest price for securities traded on an exchange). Other business areas such as sensory input from IOT devices or reading crash-recording information within the automotive industry could have been chosen as well. The principle is the same.


To start with, a class holding market data is defined:


public class MarketData extends SelfDescribingMarshallable {

    int securityId;

    long time;

    float last;

    float high;

    float low;


    // Getters and setters not shown for brevity


}


Note: In real-world scenarios, great care must be taken when using float and double for holding monetary values as this could otherwise cause rounding problems [Bloch18, Item 60]. However, in this introductory article, I want to keep things simple.


There is also a small utility function MarketDataUtil::create that will create and return a new random MarketData object when invoked:


static MarketData create() {

    MarketData marketData = new MarketData();

    int id = ThreadLocalRandom.current().nextInt(1000);

    marketData.setSecurityId(id);

    float nextFloat = ThreadLocalRandom.current().nextFloat();

    float last = 20 + 100 * nextFloat;


    marketData.setLast(last);

    marketData.setHigh(last * 1.1f);

    marketData.setLow(last * 0.9f);

    marketData.setTime(System.currentTimeMillis());


    return marketData;

}


Now, the objective is to create a queue that is durable, concurrent, low-latency, accessible from several processes and that can hold billions of objects.

The Naïve Approach

Armed with these classes, the naïve approach of using a ConcurrentLinkedQueue can be explored:


public static void main(String[] args) {

    final Queue<MarketData> queue = new ConcurrentLinkedQueue<>();

    for (long i = 0; i < 1e9; i++) {

        queue.add(MarketDataUtil.create());

    }

}


This will fail for several reasons:


  1. The ConcurrentLinkedQueue will create a wrapping Node for each element added to the queue. This will effectively double the number of objects created.

  2. Objects are placed on the Java heap, contributing to heap memory pressure and garbage collection problems. On my machine, this led to my entire JVM becoming unresponsive and the only way forward was to kill it forcibly using “kill -9”.

  3. The queue cannot be read from other processes (i.e. other JVMs).

  4. Once the JVM terminates, the content of the queue is lost. Hence, the queue is not durable.


Looking at various other standard Java classes, it can be concluded that there is no support for large persisted queues.

Using Chronicle Queue

Chronicle Queue is an open-source library and is designed to meet the requirements set forth above.  Here is one way to set it up and use it:


public static void main(String[] args) {

    final MarketData marketData = new MarketData();

    final ChronicleQueue q = ChronicleQueue

            .single("market-data");

    final ExcerptAppender appender = q.acquireAppender();


    for (long i = 0; i < 1e9; i++) {

        try (final DocumentContext document =

                     appender.acquireWritingDocument(false)) {

             document

                    .wire()

                    .bytes()

                    .writeObject(MarketData.class, 

                            MarketDataUtil.recycle(marketData));


        }

    }

}



Using a MacBook Pro 2019 with a 2.3 GHz 8-Core Intel Core i9, north of 3,000,000 messages per second could be inserted using only a single thread. The queue is persisted via a memory-mapped file in the given directory “market-data”. One would expect a MarketData object to occupy 4 (int securityId) + 8 (long time) + 4*3 (float last, high and low) = 24 bytes at the very least.


In the example above, 1 billion objects were appended causing the mapped file to occupy 30,148,657,152 bytes which translates to about 30 bytes per message. In my opinion, this is very efficient indeed.


As can be seen, a single MarketData instance can be reused over and over again because Chronicle Queue will flatten out the content of the current object onto the memory mapped file, allowing object reuse. This reduces memory pressure even more. This is how the recycle method works:

static MarketData recycle(MarketData marketData) {

    final int id = ThreadLocalRandom.current().nextInt(1000);

    marketData.setSecurityId(id);

    final float nextFloat = ThreadLocalRandom.current().nextFloat();

    final float last = 20 + 100 * nextFloat;


    marketData.setLast(last);

    marketData.setHigh(last * 1.1f);

    marketData.setLow(last * 0.9f);

    marketData.setTime(System.currentTimeMillis());


    return marketData;

}

Reading from a Chronicle Queue

Reading from a Chronicle Queue is straightforward. Continuing the example from above, the following shows how the first two MarketData objects can be read from the queue:


public static void main(String[] args) {

    final ChronicleQueue q = ChronicleQueue

            .single("market-data");

    final ExcerptTailer tailer = q.createTailer();


    for (long i = 0; i < 2; i++) {

        try (final DocumentContext document =

                     tailer.readingDocument()) {

            MarketData marketData = document

                    .wire()

                    .bytes()

                    .readObject(MarketData.class);

            System.out.println(marketData);

        }

    }

}


This might produce the following output:


!software.chronicle.sandbox.queuedemo.MarketData {

  securityId: 202,

  time: 1634646488837,

  last: 45.8673,

  high: 50.454,

  low: 41.2806

}


!software.chronicle.sandbox.queuedemo.MarketData {

  securityId: 117,

  time: 1634646488842,

  last: 34.7567,

  high: 38.2323,

  low: 31.281

}


There are provisions to efficiently seek the tailer’s position, for example, to the end of the queue or to a certain index.

What’s Next?

There are many other features that are out of scope for this article. For example, queue files can be set to roll at certain intervals (such as each day, hour or minute) effectively creating a decomposition of information so that older data may be cleaned over time. There are also provisions to be able to isolate CPUs and lock Java threads to these isolated CPUs, substantially reducing application jitter.

Finally, there is an enterprise version with replication of queues across server clusters paving the way towards high availability and improved performance in distributed architectures. The enterprise version also includes a variety of other features such as encryption, time zone rolling and asynchronous appenders.

Resources

Open-source Chronicle Queue

Chronicle Queue Enterprise

[Bloch18] Joshua Bloch, Effective Java, Third Edition, ISBN 0-13-468599-7, 2018


Monday, December 20, 2021

Java: Why a Set Can Contain Duplicate Elements

 

Java: Why a Set Can Contain Duplicate Elements

In low-latency applications, the creation of unnecessary objects is often avoided by reusing mutable objects to reduce memory pressure and thus the load on the garbage collector. This makes the application run much more deterministically and with much less jitter. However, care must be taken as to how these reused objects are used or else unexpected results might manifest themselves, for example in the form of a Set containing duplicate elements such as [B, B].

HashCode and Equals

Java’s built-in ByteBuffer provides direct access to heap and native memory using 32-bit addressing. Chronicle Bytes is a 64-bit addressing open-source drop-in replacement allowing much larger memory segments to be addressed. Both these types provide a hashCode() and an equals() method that depends on the byte contents of the objects’ underlying memory segment. While this can be useful in many situations, mutable objects like these should not be used in most of Java’s built-in Set types and not as a key in most built-in Map types.


Note: In reality, only 31 and 63 bits may be used as an effective address offset (e.g. using int and long offset parameters)

Mutable Keys

Below, a small code example is presented illustrating the problem with reused mutable objects. The code shows the use of Bytes but the very same problem exists for ByteBuffer.


Set<CharSequence> set = new HashSet<>();

Bytes<?> bytes = Bytes.from("A");

set.add(bytes);


// Reuse

bytes.writePosition(0);


// This mutates the existing object already

// in the Set

bytes.write("B");


// Adds the same Bytes object again but now under

// another hashCode()

set.add(bytes);


System.out.println(“set = “ + set);


The code above will first add an object with “A” as content meaning that the set contains [A]. Then the content of that existing object will be modified to “B”, which has the side effect of changing the set to contain [B] but will leave the old hash code value and the corresponding hash bucket unchanged (effectively becoming stale). Lastly, the modified object is added to the set again but now under another hash code leading to the previous entry for that very same object will remain!


As a result, rather than the perhaps anticipated [A, B], this will produce the following output:


set = [B, B]


ByteBuffer and Bytes Objects as Keys in Maps

When using Java’s ByteBuffer objects or Bytes objects as keys in maps or as elements in sets, one solution is using an IdentityHashMap or Collections.newSetFromMap(new IdentityHashMap<>()) to protect against the mutable object peculiarities described above. This makes the hashing of the objects agnostic to the actual byte content and will instead use the System.identityHashCode() which never changes during the object's life.


Another alternative is to use a read-only version of the objects (for example by invoking ByteBuffer.asReadOnlyBuffer()) and refrain from holding any reference to the original mutable object that could provide a back-door to modifying the supposedly read-only object’s content.


Chronicle Map and Chronicle Queue

Chronicle Map is an open-source library that works a bit differently than the built-in Java Map implementations in the way that objects are serialized and put in off-heap memory, opening up for ultra-large maps that can be larger than the RAM memory allocated to the JVM and allows these maps to be persisted to memory-mapped files so that applications can restart much faster.


The serialization process has another less known advantage in the way that it actually allows reusable mutable objects as keys because the content of the object is copied and is effectively frozen each time a new association is put into the map. Subsequent modifications of the mutable object will therefore not affect the frozen serialized content allowing unrestricted object reuse.


Open-source Chronicle Queue works in a similar fashion and can provide queues that can hold terabytes of data persisted to secondary storage and, for the same reason as Chronicle Map, allows object reuse of mutable elements.

Conclusions
It is dangerous to use mutable objects, such as Bytes and ByteBuffer where the hashCode() depends on the content of the object, in some Map and Set implementations. 


An IdentityHashMap protects against corruption of maps and sets due to object mutation but makes these structures agnostic to the actual byte contents.


Read-only versions of previously modified memory segment objects might provide an alternate solution.


Chronicle Map and Chronicle Queue allow unrestricted use of mutable objects, opening up the path to deterministic low-latency operations.

Resources

Chronicle homepage

Chronicle Bytes on GitHub (open-source)

Chronicle Map on GitHub (open-source)

Chronicle Queue on GitHub (open-source)


Wednesday, December 8, 2021

Why General Inheritance is Flawed and How to Finally Fix it

 

Why General Inheritance is Flawed and How to Finally Fix it


By leveraging composition and the final keyword in the right way, you can improve your programming skills and become a better Java programmer.  


General inheritance, whereby a public class is extended over package boundaries, provides a number of challenges and drawbacks and should be avoided in almost all cases. Classes and methods can be made final meaning that subclassing is disallowed which effectively prevents inheritance. While this may sound like a strange thing to do in an object-oriented language like Java, it does carry significant benefits for a large number of class types.


But, when should a class or method be final and just why is general inheritance problematic?

Immutable Classes

Immutable classes are classes whose state can not be observed to change from the outside world. This gives immutable objects the advantage of being inherently thread-safe and they can be reused indefinitely.


Java’s built-in String class is an example of an immutable class. It does have an internal state that is very likely to change the first time hashCode() is called, but this internal state cannot be observed by an outside caller (unless resorting to reflection).


Immutable classes shall always be declared final or else subclasses could compromise the immutability contract, simply by adding and exposing a mutable state. 


For the sake of completeness, it is worth mentioning that an immutable class should declare all its fields as private, final and ensure exclusive access to any mutable sub-component (such as an array), for example using defensive copying.

Non-instantiable Classes (aka Utility Classes)

A non-instantiable class is often informally referred to as a “utility class” and contains only static methods (and perhaps static fields). Static methods are not class methods but rather global functions attached to a “carrier-class”.  Ideally, non-instantiable classes should be immutable concerning their (static) state (if any).


These methods should be called using their carrier-class name followed by the method name (e.g. Collections.emptyList()). Subclassing a non-instantiable utility can result in non-intuitive behavior and is likely a source of confusion as the methods cannot be overridden anyhow, only replaced as illustrated hereunder:


public class FooUtil {


    static void print() {

        lower();

    }


    static void lower() {

        System.out.println("lower foo");

    }


}



public class BarUtil extends FooUtil {


    static void lower() {

        System.out.println("lower bar");

    }


}


Invoking BarUtil::print will produce “lower foo” and not “lower bar” meaning that BarUtil::lower did not override FooUtil::lower. However, if BarUtil::lower was called directly, it would have printed “lower bar”.


Therefore, non-instantiable classes should generally be declared final


As a side note, non-instantiable classes should have a single default constructor declared private to prevent instantiation of the non-instantiable class (as the name implies).

Methods Called by a Constructor

Methods called by a constructor of a class should always be final, either by declaring the entire class final or by declaring these methods final. Failure to do this may open up a leak of an object (e.g. “this”) that is only partially initialized and thus is likely in an illegal state. Such a leak may, for example, occur by the not-yet-initialized instance registering itself with a listener. These errors are likely hard to identify if they make it out in the open.

General Inheritance

The use/non-use of general inheritance has sparked opinionated discussions for quite some time.


Back in the early days, inheritance was often thought to be the general way of code reuse. As it later turned out, inheritance outside a package could lead to unsatisfiable and erroneous behaviour unless special care is put into providing classes that are suitable to extend across package boundaries [Bloch18, Item18]. 


Furthermore, general inheritance breaks encapsulation [Snyder80] because the superclass implementation might change over time which might cause a subclass to fail even though no changes were made. This problem might be avoided if one commits to never change the superclass, effectively making the superclass a large monolithic fossil API commitment for eternal times. In all fairness, this argument can also be raised against classes using composition even though there are fewer ways problems can leak into the code. So, this is not an argument for finalization but rather a more fundamental problem with code reuse.


Inheritance could produce unintended effects due to self-use, whereby an overridable method calls another overridable method in the base class: Imagine a class that extends ArrayList and that is supposed to keep track of the number of elements ever added to the class. If we override add() bumping the counter by one and override addAll(Collection) adding Collection.size() to the counter after which the corresponding super method is called,  then we are in for a surprise: 


Because ArrayList::addAll happens to self-use ArrayList::add to individually add the elements, additions via addAll() will count twice. Furthermore, there is no guarantee that this behavior will stay the same over time unless it is documented. Maybe there will be a more performant way of bulk-adding elements in the future whereby elements are directly inserted in the backing array without calling add()?


Another common problem with self-use is when a subclass overrides a method that is supposed to call one or several other methods but the programmer forgets to call the super method. A related problem is the problem of deciding if an overriding method should call the super method at the beginning or at the end of the overridden method (or indeed somewhere in between).  A solution to some of these problems could be to declare the top method final in the base class and provide overridable protected “hook methods” that can be overridden in a more controlled fashion. 


General inheritance also opens up potential security vulnerabilities: Suppose an ArrayList was extended to ensure only objects fulfilling a certain predicate could be added (e.g. they must be in a valid state). Then, in a later release, a new way of adding elements was introduced via the base class AbstractList. This new way will now become visible in the supposedly safeguarded class, effectively providing a back-door for adding illegal objects to the list. 


Another problem is “propagating exposure” as exemplified by  Arrays.asList(“a”, “b”) which returns a “fixed-size list” (but ought to return an unmodifiable List and here an immutable List as the elements themselves are all immutable). As it turns out, elements in the returned List may now not only be replaced via an Iterator but also via the List::replaceAll,a  method added in JDK 8 after the inception of Arrays::asList.


An additional class of problems might arise if a subclass adds a new method to the ones of the base class. If at a later stage, a method with the same signature is added to the base class, then this method will be coincidentally overridden by the subclass. This is likely not the intended behavior at all. If a method with the same name and parameters are added but with a different return type, then the code will likely fail to compile. So in the general case, it is not possible to ever add methods in a non-final public class as there is no control of how the class is subclassed.


Yet another problem could be incidental inheritance. The JDK itself has several problematic inheritances whereby classes were incidentally inherited because it was apparently “convenient“ and not because class B indeed was class A. For example, Stack extends the old Vector class for no good principal reason. This prevents Stack from evolving to a more efficient and performant implementation.


To summarize, a class that is supposed to be generally inherited is very hard to ever change and must [Bloch18, Item19]:

  • Document its self-use of overridable methods

  • Potentially providing hooks in the form of judiciously chosen protective methods

  • Be accompanied by tests using subclasses

  • Not provide a constructor that invokes overridable methods

  • Not allow serialization to invoke overridable methods



Inheriting also creates constraints and problems if hashCode()/equals() are overridden. If we have a base class called Fruit, then is an Apple with the same color as a Pear equal? Can an instance of SevilleOrange ever be equal to a BergamontOrange instance?  Generally, it is not easy to decide these kinds of questions. It is important to remember that any subclass should either override none of these methods or should override them both.


It should be noted that exposing a public non-final class in a public API by definition means that it opens up for inheritance across package boundaries as user-land code can place extending classes in any packet. Since split packages are strongly discouraged or might even be entirely forbidden depending on the use of JPMS, subclassing such a class implies subclassing over package boundaries.


One way of avoiding all these things is to declare classes final and use composition instead of inheritance, effectively abandoning inheritance across packages. This often provides a much cleaner API whereby only interfaces can be exposed and concrete classes do not leak out in the API. This way, any superclass used is only package-private and can, by convention or definition, never be used externally.


Composition with delegation protects against most of the problems mentioned above including unintended self-use, security holes via extra methods in base classes, signature collisions,  incidental inheritance, need of subclass testing, accidental leak of “this” and many other problems. In the past, it was feared that this would lead to reduced performance but this is simply not the case.


Inheritance in Java is, for good reasons, restricted to one superclass which naturally limits the scalability of the concept. Composition, on the other hand, allows an arbitrary number of delegates to be used.


A small drawback with composition could materialize in combination with the use of certain callbacks. However, this problem can be avoided if proper provisions are put in. In other words, if a component (used in composition) registers itself with a listener, then the listener will invoke the component itself and not the composing class.

Sealed Classes

In more recent Java versions, the concept of sealed classes (JEP 409) was introduced. Before this, the final keyword was a boolean property: either a class was extensible (within its declared access type) or it was not. Sealed classes introduce a more granular mechanism whereby it can be said that a Fruit can either be an Apple, Pear or Orange but nothing more. This is fundamentally a more generalized form of final. The amount of effort put into the Java languages with features like this indicates a class extensibility is an important property. Interestingly, a permitted class in a sealed interface must specify whether itself is final, non-final or permits subsequent subclasses. 


API Commitments Imposed by Inheritance

In this article, the class Stack was mentioned as a failed inheritance implementation. It basically introduces the methods push(), pop(), peek(), empty() and search(). But, as it inherits from Vector, we also get all the methods/classes from List, AbstractList, RandomAccess, Cloneable and SerializableAbstractList, which in turn, inherits from AbstractCollection which implements Collection.


This increases the API weight by orders of magnitudes and I am perfectly certain the Java designers are regretting their incidental inheritance 25 years down the line. If Stack was just an interface and there was a static method available that provided a new empty Stack, things would look much better.


Classes that are Serializable or subject to other serialization mechanisms are often particularly problematic as the binary (or other) format more often than not limits the way implementations can ever evolve over time.


As seen above and in previous clauses, a public non-final class cannot ever change in many cases.

Should Inheritance Across Package Boundaries Ever be Used?

This is a matter of opinion. 


Many times, it is better to use composition. In simpler cases delivering functions to a concrete class’ constructor providing tailored functionality would be preferable over allowing subclassing and overriding methods. To give an example of this, instead of overriding a handler method, a method handler could be provided via the constructor to a non-extensible class.


If, after very careful consideration, one arrives at the conclusion that one should provide an extensible class (across packages), then all the constraints above must be taken into careful consideration. Just allowing subclassing by default is a right-out mistake, particularly for library and API designers. Instead, classes should be marked final by default, and only after careful review and testing, opening up for subclassing could be regarded.

A Final Note

As I moved away from using inheritance across packages and switched to exposing just interfaces, many other advantages became apparent. It becomes much easier to keep internal considerations… well internal.


Composition whereby potentially several components can be used in a single class provides more code reuse capability than inheritance, albeit requiring a bit more code ceremony in the using class. It can also simplify testing of the code and provides better test coverage with much fewer and less brittle tests.


It also fits very well with the module system (JPMS).  Providing components as pure services, for example, using Java’s ServiceLoader, adds flexibility while minimizing the API footprint. This makes it easier to learn and use the API and provides much more flexibility to evolve libraries over time. 


Finally, it all makes sense...

References

[Bloch18]

Bloch, Joshua., Effective Java, Third Edition, ISBN 0-13-468599-7, 2018


[Snyder80]

Snyder, Allan. “Encapsulation and Inheritance in Object-Oriented Programming Languages”. In Object-Oriented Programming Systems, Language and Applications Proceedings, 35-45, New-York, NY ACM Press.