Collaboration Between Cluster Nuclei

The Adabas nuclei in a cluster frequently need to collaborate with one another to keep the database consistent while each of them reads and updates data in the database. The nuclei share recently used data in the global cache area, synchronize their operations using locks in the global lock area, and send messages to one another to perform joint actions that multiple or all nuclei must be informed about or participate in.

If a system hosting a cluster nucleus is temporarily unable to let that nucleus execute normally, for example because it is overloaded, a collaboration process in the cluster may stall if the nucleus cannot respond to intracluster messages in a timely fashion. In such cases, one unhealthy nucleus may impact the processing performed by the other nuclei in the cluster, an effect known as “sympathy sickness.” Adabas Parallel Services has configuration options to eliminate this negative effect for two important collaboration processes.

This document covers the following topics: Buffer Flush Independence and Update Command Synchronization.

Buffer Flush Independence

When cluster nuclei process update commands, they store the updated ASSO and DATA blocks in the global cache area, either immediately (if the LRDP parameter equals zero) or by the end of the associated transaction (with deferred publishing, LRDP>0). Later, a buffer flush writes all updated blocks from the global cache to the database. Any nucleus in the cluster may perform a buffer flush when needed, but only one at a time.

Before a buffer flush writes updated ASSO and DATA blocks to the database, all protection data for the updates in those blocks must be written to the WORK data sets. This ensures that if one or more of the nuclei fail (terminate abnormally) at any time, the following session autorestart can use that protection data to back out any incomplete updates. In earlier versions of Adabas Parallel Services, when one nucleus performed a buffer flush, every other nucleus in the cluster needed to collaborate by writing its latest protection data to its WORK data set, to ensure that the protection data would be available if that nucleus failed during or after the buffer flush. If one nucleus was slow or unable to respond to the request from the buffer flush, the flush would stall. Without a functioning buffer flush, updated blocks waiting to be flushed would over time fill up the global cache, and then all update processing would stall, too.

Adabas Parallel Services Version 8.4 introduced the global parameter CLUPUBLPROT, which allows a buffer flush to complete even without the collaboration of the other nuclei in the cluster. When CLUPUBLPROT is set to YES, each cluster nucleus “publishes” its protection data before it writes related updated ASSO or DATA blocks to the global cache. A nucleus “publishes” protection data by writing it either to the WORK data set or to the global cache, which makes the protection data available to the other nuclei in the cluster. Then, if the nucleus is slow or unable to collaborate with a buffer flush performed by a peer nucleus, the buffer flush can read the latest protection data of the unresponsive nucleus from the cache and write it to WORK by itself. This way, the buffer flush no longer depends on the ability of the other nuclei to collaborate.
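
The difference between the two behaviors can be illustrated with a toy model. The following Python sketch is not Adabas code: the names Nucleus and buffer_flush, and the representation of a slow nucleus as a simple "responsive" flag, are assumptions made purely for illustration. It shows only the decision described above, namely whether a buffer flush can secure the protection data of each peer nucleus.

```python
from dataclasses import dataclass


@dataclass
class Nucleus:
    """Hypothetical stand-in for a cluster nucleus (illustration only)."""
    name: str
    responsive: bool = True  # can it answer intracluster messages in time?


def buffer_flush(peers, clupublprot):
    """Return (completed, actions) for one buffer flush.

    peers       -- the other nuclei whose protection data must reach WORK
    clupublprot -- True models CLUPUBLPROT=YES, False models CLUPUBLPROT=NO
    """
    actions = []
    for peer in peers:
        if peer.responsive:
            # Collaboration works: the peer writes its own protection data.
            actions.append(f"{peer.name}: wrote its protection data to WORK")
        elif clupublprot:
            # The peer published its protection data before the related
            # blocks reached the cache, so the flushing nucleus can read
            # that data from the cache and write it to WORK itself.
            actions.append(f"{peer.name}: flusher wrote published protection data to WORK")
        else:
            # The protection data cannot be secured: the flush stalls.
            actions.append(f"{peer.name}: no response, flush stalls")
            return False, actions
    return True, actions


peers = [Nucleus("NUC2"), Nucleus("NUC3", responsive=False)]
print(buffer_flush(peers, clupublprot=False)[0])  # False
print(buffer_flush(peers, clupublprot=True)[0])   # True
```

In this model, a single unresponsive peer is enough to stall the flush unless the protection data has been published beforehand.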

Because cluster nuclei running with CLUPUBLPROT=YES publish their protection data, either in the global cache or on the WORK data set, before they publish related updated ASSO or DATA blocks in the cache, there is a performance aspect to consider. All protection data must eventually be written to WORK, so writing it to the cache first is additional overhead. On the other hand, writing data to the cache area is much faster than an I/O to disk; issuing a WORK I/O every time updated ASSO or DATA blocks are to be written to the cache can slow down the processing of update commands significantly.

The parameter CLUWORK1CACHE specifies how many different WORK blocks a cluster nucleus may keep in the global cache at the same time (when CLUPUBLPROT=YES). CLUWORK1CACHE limits the amount of cache space used for protection data and implicitly regulates the use of cache writes versus WORK I/Os for publishing protection data. If CLUWORK1CACHE is specified as zero, protection data is written only to the WORK data set and never to the global cache. This may lead to a significant number of additional WORK writes and is not recommended. A large number for CLUWORK1CACHE may lead to a significant number of additional cache writes for WORK blocks. A number in between may lead to some additional WORK writes and some additional cache writes. Finding the best balance for overall performance may require trials with different CLUWORK1CACHE settings.

In summary, specifying CLUPUBLPROT=YES (with CLUWORK1CACHE set appropriately) is recommended to make the nuclei in a cluster more independent of one another when they perform buffer flushes.
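
The recommendations in this section can be summarized as a simple settings check. The following Python sketch is an illustration only: the function name check_flush_settings and the idea of passing the parameters as a dictionary are assumptions, not part of any Adabas utility, and the value 500 in the usage lines is an arbitrary example rather than a recommended setting.

```python
def check_flush_settings(params):
    """Return advisory notes for buffer-flush-related parameter values.

    params -- dict of parameter name to value, for example
              {"CLUPUBLPROT": "YES", "CLUWORK1CACHE": 500}
    """
    notes = []
    publish = str(params.get("CLUPUBLPROT", "")).upper() == "YES"
    if not publish:
        # Without published protection data, a flush needs every peer.
        notes.append("CLUPUBLPROT is not YES: buffer flushes depend on peer collaboration")
        return notes
    work_blocks = params.get("CLUWORK1CACHE")
    if work_blocks is None:
        notes.append("CLUWORK1CACHE not given: the product default applies")
    elif int(work_blocks) == 0:
        # Zero means protection data is published via WORK I/Os only.
        notes.append("CLUWORK1CACHE=0: protection data goes to WORK only; not recommended")
    return notes


print(check_flush_settings({"CLUPUBLPROT": "YES", "CLUWORK1CACHE": 0}))
print(check_flush_settings({"CLUPUBLPROT": "YES", "CLUWORK1CACHE": 500}))  # []
```

The check cannot say which nonzero CLUWORK1CACHE value performs best; as noted above, that may require trials with different settings.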

Update Command Synchronization

If an Adabas nucleus fails (terminates abnormally), the following session autorestart recovers from the failure by undoing incomplete updates or transactions that may have been written to the database before the failure and redoing complete transactions that may not have been written to the database. This happens either during an online recovery process triggered by a nucleus failure in a cluster (if one or more nuclei in the cluster are still active) or when the next nucleus starts (if no nucleus stayed active).

Historically, the session autorestart logic relied on, and took advantage of, the presence of a point in time before the nucleus failure where none of the nuclei in the cluster was processing any update commands. During normal processing, the nuclei in a cluster took care to create such points in time by delaying the selection of new update commands for processing until all active update commands had finished. They did this regularly after every buffer flush. The process to get to a point where no update command is in progress in the entire cluster (called “update command synchronization”) requires the collaboration of all nuclei.
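
Why this process is sensitive to a single slow nucleus can be seen in a small model. The following Python sketch is a simplification under stated assumptions, not a description of the actual implementation: the function name time_to_quiet_point and the representation of each active update command by its remaining run time (in arbitrary units) are hypothetical.

```python
def time_to_quiet_point(remaining_by_nucleus):
    """Time until no update command is active anywhere in the cluster.

    remaining_by_nucleus -- dict: nucleus name -> list of remaining run
                            times of its currently active update commands
    """
    longest = 0.0
    for remaining in remaining_by_nucleus.values():
        # New update commands are held back, so only the commands that
        # are already active matter; the slowest one determines the wait.
        longest = max(longest, max(remaining, default=0.0))
    return longest


healthy = {"NUC1": [0.2, 0.5], "NUC2": [0.3]}
one_slow = {"NUC1": [0.2, 0.5], "NUC2": [30.0]}
print(time_to_quiet_point(healthy))   # 0.5
print(time_to_quiet_point(one_slow))  # 30.0
```

In this model, every nucleus holds back new update commands for as long as the slowest active update command in the cluster needs to finish, so one delayed nucleus delays all of them.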

Adabas Version 8.4 introduced the global parameter UPDATECONTROL, which offers the choice to do without these update command synchronization processes. When UPDATECONTROL is set to NODELAY, the nuclei in the cluster do not delay the start of new update commands after buffer flushes. In the case of a nucleus failure, the following session autorestart no longer relies on the presence of points in time where no update processing was in progress. This eliminates the regular update command synchronization processes in which all nuclei in the cluster must collaborate.

In summary, specifying UPDATECONTROL=NODELAY (in conjunction with INDEXUPDATE=ADVANCED, which is a prerequisite) is recommended to make the nuclei in a cluster more independent of one another after each buffer flush.
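
The prerequisite named above can be expressed as a one-rule check. The following Python sketch is illustrative only; the function name updatecontrol_problems and the dictionary representation of the parameters are assumptions and do not correspond to any Adabas utility.

```python
def updatecontrol_problems(params):
    """Return a list of problems with the update control settings.

    params -- dict of parameter name to value, for example
              {"UPDATECONTROL": "NODELAY", "INDEXUPDATE": "ADVANCED"}
    """
    problems = []
    update_control = str(params.get("UPDATECONTROL", "")).upper()
    index_update = str(params.get("INDEXUPDATE", "")).upper()
    # NODELAY is only valid together with INDEXUPDATE=ADVANCED.
    if update_control == "NODELAY" and index_update != "ADVANCED":
        problems.append("UPDATECONTROL=NODELAY requires INDEXUPDATE=ADVANCED")
    return problems


print(updatecontrol_problems({"UPDATECONTROL": "NODELAY"}))
print(updatecontrol_problems({"UPDATECONTROL": "NODELAY", "INDEXUPDATE": "ADVANCED"}))  # []
```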