Data Consistency and Conflict Resolution
With Bidirectional mode, the WAN replication service balances high performance with eventual data consistency through synchronization among Orchestrators. Two additional strategies are employed for data consistency: element versioning and optimistic local modifications.
Note: | When a key is getting updated simultaneously at high frequency from multiple regions, it is possible that even with conflict resolution, the key values will diverge. Therefore we recommend in bi-directional configurations to avoid conflicts when possible, in particular to use distinct caches (only updated in specific region), then replicated to other region. This will allow data to be replicated and available in multiple regions, but avoid any data conflict. |
Element versioning
Element versioning is the core component of data consistency and conflict resolution. All modifications (puts/updates/deletes) to an element cause an internal version to be incremented once successfully applied. Versioning ensures there are no out-of-order updates applied to the Master cache (no duplicates, no backward counting, and no gaps), and versioning is used to resolve simultaneous conflicting updates.
Conflict resolution is invoked for two updates arriving for the same key with the same version, to determine which update is applied and which is rejected. The decision is made based upon the order in which the Master cache processes the updates. The first processed update wins.
The diagram below shows concurrent updates from Regions 2 and 3 on the same key (Key1) being resolved by the Master cache in Region 1. Region 2 (Key1=aaa) wins because its update arrives at the Master cache first. This results in a repair of Region 3, where Key1=bbb is repaired to Key1=aaa.
For the same version of the element, the Master cache repairs the second-to-arrive value to match the first-to-arrive value
Optimistic local modifications with Master repair
To improve performance, local modifications (put/remove) are optimistically applied to the cache. Subsequent reads of the cache by the local application will return the expected modified value.
If there is a race to modify an element in the Master cache, the Master cache can force a repair of the Replica's value. A Master repair means that one of the racing modifications was rejected, and the Master cache has sent a repair message that causes the local element to now have the winning state. This can mean that a put may be overwritten with the winning value, or a losing remove may cause the local remove to be reverted.
In the following diagram, Region 1 performs a put (xy=1 at version 1), while Region 2 simultaneously performs put (xy=2 also at version 1). Because Region 1 contains the Master cache, it replicates xy=1:v1 to Region 2. And when the Master cache rejects the put from the Replica cache, it automatically resolves and sends a "repair" of the value back to Region 2.
Process of a Master cache repairing a Replica cache
After a repair, if there are any rejected values in any local caches in the region, they will be cleared with the existing process for non-replicated caches. For more information about expiration and eviction, refer to "Managing Data Life" in the BigMemory Max Configuration Guide.