Monday, December 20, 2021

Java: Why a Set Can Contain Duplicate Elements


Java: Why a Set Can Contain Duplicate Elements

In low-latency applications, the creation of unnecessary objects is often avoided by reusing mutable objects to reduce memory pressure and thus the load on the garbage collector. This makes the application run much more deterministically and with much less jitter. However, care must be taken as to how these reused objects are used or else unexpected results might manifest themselves, for example in the form of a Set containing duplicate elements such as [B, B].

HashCode and Equals

Java’s built-in ByteBuffer provides direct access to heap and native memory using 32-bit addressing. Chronicle Bytes is a 64-bit addressing open-source drop-in replacement allowing much larger memory segments to be addressed. Both these types provide a hashCode() and an equals() method that depends on the byte contents of the objects’ underlying memory segment. While this can be useful in many situations, mutable objects like these should not be used in most of Java’s built-in Set types and not as a key in most built-in Map types.

Note: In reality, only 31 and 63 bits may be used as an effective address offset (e.g. using int and long offset parameters)

Mutable Keys

Below, a small code example is presented illustrating the problem with reused mutable objects. The code shows the use of Bytes but the very same problem exists for ByteBuffer.

Set<CharSequence> set = new HashSet<>();

Bytes<?> bytes = Bytes.from("A");


// Reuse


// This mutates the existing object already

// in the Set


// Adds the same Bytes object again but now under

// another hashCode()


System.out.println(“set = “ + set);

The code above will first add an object with “A” as content meaning that the set contains [A]. Then the content of that existing object will be modified to “B”, which has the side effect of changing the set to contain [B] but will leave the old hash code value and the corresponding hash bucket unchanged (effectively becoming stale). Lastly, the modified object is added to the set again but now under another hash code leading to the previous entry for that very same object will remain!

As a result, rather than the perhaps anticipated [A, B], this will produce the following output:

set = [B, B]

ByteBuffer and Bytes Objects as Keys in Maps

When using Java’s ByteBuffer objects or Bytes objects as keys in maps or as elements in sets, one solution is using an IdentityHashMap or Collections.newSetFromMap(new IdentityHashMap<>()) to protect against the mutable object peculiarities described above. This makes the hashing of the objects agnostic to the actual byte content and will instead use the System.identityHashCode() which never changes during the object's life.

Another alternative is to use a read-only version of the objects (for example by invoking ByteBuffer.asReadOnlyBuffer()) and refrain from holding any reference to the original mutable object that could provide a back-door to modifying the supposedly read-only object’s content.

Chronicle Map and Chronicle Queue

Chronicle Map is an open-source library that works a bit differently than the built-in Java Map implementations in the way that objects are serialized and put in off-heap memory, opening up for ultra-large maps that can be larger than the RAM memory allocated to the JVM and allows these maps to be persisted to memory-mapped files so that applications can restart much faster.

The serialization process has another less known advantage in the way that it actually allows reusable mutable objects as keys because the content of the object is copied and is effectively frozen each time a new association is put into the map. Subsequent modifications of the mutable object will therefore not affect the frozen serialized content allowing unrestricted object reuse.

Open-source Chronicle Queue works in a similar fashion and can provide queues that can hold terabytes of data persisted to secondary storage and, for the same reason as Chronicle Map, allows object reuse of mutable elements.

It is dangerous to use mutable objects, such as Bytes and ByteBuffer where the hashCode() depends on the content of the object, in some Map and Set implementations. 

An IdentityHashMap protects against corruption of maps and sets due to object mutation but makes these structures agnostic to the actual byte contents.

Read-only versions of previously modified memory segment objects might provide an alternate solution.

Chronicle Map and Chronicle Queue allow unrestricted use of mutable objects, opening up the path to deterministic low-latency operations.


Chronicle homepage

Chronicle Bytes on GitHub (open-source)

Chronicle Map on GitHub (open-source)

Chronicle Queue on GitHub (open-source)

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.