Client-Side Connection Management
In the TCStore client, three (3) connection health mechanisms are used: receipt of server responses to operations using the connection, a "connection health checker", and the client-side of the leasing mechanism used by the server.
I/O Operation Error Handling
If, during a read or write over the TCP connection to the server, the client experiences an error that does not indicate the TCP connection was intentionally closed, the client attempts to establish a new connection to the server (or to another configured peer in the stripe) within the scope of the current TCStore operation. From the application's point of view, the operation is not interrupted; it simply takes longer than usual. This level of reconnect is separate from the TCStore connection resiliency described below.
During this reconnect phase, connection attempts are repeated until (1) a connection is established or (2) the client's lease expires. The number and frequency of attempts are governed by internally established values. If the client's lease expires during the reconnect phase, reconnect attempts are halted and the TCStore connection resiliency capabilities (described below) come into play.
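The reconnect phase described above can be pictured as a retry loop bounded by the remaining lease time. The following is only a minimal sketch of that idea, not the TCStore client's internal code: the class name, the `connectAttempt` callback, and the retry interval parameter are all hypothetical, since the real interval and attempt count are internal values.

```java
import java.util.function.BooleanSupplier;

/** Illustrative model only; the real reconnect logic is internal to the TCStore client. */
final class InOperationReconnectSketch {

    /**
     * Repeats a (hypothetical) connection attempt until it succeeds or the lease runs out.
     *
     * @return true if a connection was established before the lease expired, false otherwise
     */
    static boolean reconnectWithinLease(BooleanSupplier connectAttempt,
                                        long leaseRemainingMillis,
                                        long retryIntervalMillis) {
        long deadline = System.nanoTime() + leaseRemainingMillis * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            if (connectAttempt.getAsBoolean()) {
                return true; // the current operation continues, just later than usual
            }
            try {
                Thread.sleep(retryIntervalMillis); // wait before the next attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return false; // lease expired: connection resiliency takes over from here
    }
}
```

A `false` result corresponds to the point at which the text says reconnect attempts are halted and connection resiliency comes into play.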
If the active server to which the client was connected fails and a former passive becomes active, the interval designated by the /tc-config/servers/client-reconnect-window property is in force. A client that establishes a connection to the new active server within both (1) the time remaining in the client's lease and (2) the interval designated by client-reconnect-window can resume operations without interruption. If either the client's lease or the client-reconnect-window expires, TCStore connection resiliency capabilities (described below) come into play.
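Because both conditions must hold, the time a client effectively has to reach the new active server is the smaller of the two intervals. A small sketch of that arithmetic follows; the class and method names are hypothetical and exist only for illustration.

```java
import java.time.Duration;

/** Illustrative arithmetic only; names are hypothetical. */
final class FailoverWindowSketch {

    /** Returns the time left to reconnect to the new active server and resume without interruption. */
    static Duration resumeWindow(Duration leaseRemaining, Duration reconnectWindowRemaining) {
        // Whichever limit expires first ends the uninterrupted-resume opportunity.
        return leaseRemaining.compareTo(reconnectWindowRemaining) <= 0
                ? leaseRemaining
                : reconnectWindowRemaining;
    }
}
```

For example, with 20 seconds left on the lease and 120 seconds left in the reconnect window, the client has 20 seconds.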
Connection Health Checker
The "connection health checker" uses a "ping/response" mechanism during periods when the client is idle to ensure the client remains connected to the server. If the server does not respond to the pings, the server is considered (by the client) "unresponsive"; the client closes its side of the connection and the TCStore connection resiliency capabilities come into play.
Connection Leasing
A TCStore client also relies on the leasing mechanism. As described above, a lease is granted by the server and must be renewed by the client before it expires, that is, within the interval specified by the /tc-config/plugins/service/connection-leasing server configuration property. If lease renewal fails, the client considers the server unavailable and closes its side of the connection, causing operations pending on that connection to be interrupted. At this point, the TCStore connection resiliency capabilities come into play.
TCStore Connection Resiliency
TCStore connection resiliency comes into action when an unrequested connection closure is observed on the client. This includes:
lease expiration (described above) as observed in the client,
connection closure forced by lease expiration on the server,
connection rejection caused by a late reconnect attempt following a server fail-over (client-reconnect-window expiration), and
network conditions that manifest as a closed connection.
The connection resiliency code suspends TCStore operations using the now-closed connection and attempts to reconnect with the cluster using alternate servers if necessary. While the time allowed for each connection attempt is controlled by the ClusteredDatasetManagerBuilder.withConnectionTimeout value, connection attempts are repeated until a connection is successfully established or the reconnection time limit (controlled by the ClusteredDatasetManagerBuilder.withReconnectTimeout value) is exceeded. Regardless of the withReconnectTimeout setting, at least one (1) reconnection attempt is made.
By default, operations in TCStore wait for a reconnection FOREVER (withReconnectTimeout = 0) unless:
1. the connection is closed (by closing the associated DatasetManager) OR
2. the reconnection is interrupted (by interrupting the client application thread attempting the reconnection).
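The interplay of the two timeouts can be modeled roughly as follows. This is a sketch of the rules stated above (a per-attempt limit, an overall limit where 0 means wait forever, and at least one attempt), not the TCStore implementation; the class name and the `attempt` callback are hypothetical stand-ins for a real connection attempt.

```java
import java.util.function.LongPredicate;

/** Illustrative model of the timing rules; not the TCStore implementation. */
final class ReconnectPolicySketch {

    /**
     * @param attempt             hypothetical connect call; receives the per-attempt limit in ms
     *                            and is expected to block for at most that long
     * @param connectionTimeoutMs time allowed for each attempt (cf. withConnectionTimeout)
     * @param reconnectTimeoutMs  overall limit (cf. withReconnectTimeout); 0 means wait forever
     * @return true once connected; false if the overall limit passed or the thread was interrupted
     */
    static boolean reconnect(LongPredicate attempt, long connectionTimeoutMs, long reconnectTimeoutMs) {
        long start = System.nanoTime();
        long limitNanos = reconnectTimeoutMs * 1_000_000L;
        do {
            // do/while guarantees at least one attempt regardless of the overall limit
            if (attempt.test(connectionTimeoutMs)) {
                return true;
            }
        } while (!Thread.currentThread().isInterrupted()
                && (reconnectTimeoutMs == 0 || System.nanoTime() - start < limitNanos));
        return false;
    }
}
```

In this model a `false` result stands for the cases where TCStore would raise one of the reconnect exceptions described in the following sections.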
StoreOperationAbandonedException
If the client reconnects, suspended operations resume with the exception of operations for which a server request was made prior to observing the connection closure. For these "in-flight" operations, a StoreOperationAbandonedException is thrown to indicate the status of the operation is unknown. The application must take its own steps to determine if the operation completed, needs to be or can be repeated, or must be abandoned.
Table 1. StoreOperationAbandonedException
Once a reconnection is made, operations awaiting the reconnection observe either a StoreOperationAbandonedException or normal operation completion. Which of these is observed depends on what can be asserted (internally) about the state of the operation:
1. If a message has been presented to the server but has not been responded to, there is no way for the TCStore client code to determine whether the operation message reached the server or, if it did, the state of the operation initiated by that message. In this case, a StoreOperationAbandonedException is thrown.
2. If the operation was attempted while the reconnect was underway, the operation is retried (internally).
When a client receives a StoreOperationAbandonedException, it is up to the client to determine whether the operation can be recovered and, if so, what the recovery action must be. If application resilience is desired, the application must handle a StoreOperationAbandonedException, which may be emitted from any TCStore operation that requires server interaction.
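One possible recovery action, suitable only for operations that are safe to repeat, is simply to run the operation again. The sketch below assumes such an idempotent operation. The nested exception class is a stand-in so that the example compiles without the TCStore library (real code would catch the TCStore class of the same name), and the helper name and attempt limit are hypothetical.

```java
import java.util.function.Supplier;

/** Illustrative retry helper; only appropriate for idempotent operations. */
final class AbandonedRetrySketch {

    /** Stand-in for the TCStore exception so this sketch compiles on its own. */
    static class StoreOperationAbandonedException extends RuntimeException {
        StoreOperationAbandonedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation, repeating it when its outcome is unknown.
     * Safe only if applying the operation twice leaves the same state as applying it once.
     */
    static <T> T runIdempotent(Supplier<T> operation, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        StoreOperationAbandonedException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return operation.get();
            } catch (StoreOperationAbandonedException e) {
                last = e; // outcome unknown; repeating is harmless for idempotent work
            }
        }
        throw last; // still unknown after all attempts; let the caller decide
    }
}
```

Operations that are not idempotent need a different strategy, such as reading the current state first to find out whether the abandoned operation took effect.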
StoreReconnectFailedException
If the withReconnectTimeout time limit expires or the DatasetManager is closed while reconnecting, all operations suspended for that connection and any future operations against the affected DatasetManager are terminated with a StoreReconnectFailedException.
Table 2. StoreReconnectFailedException
If a StoreReconnectFailedException is thrown, the affected server connection, the DatasetManager for which the connection was obtained, and any objects obtained from that DatasetManager are now effectively dead: the connection cannot be recovered and the DatasetManager is unusable. If the client wishes to continue operations, the DatasetManager must be closed and a new DatasetManager instance obtained.
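The close-and-replace step might be organized as in the following sketch. It is written against `AutoCloseable` and a hypothetical factory rather than the real DatasetManager API so that it compiles without the TCStore library; the nested exception class is likewise a stand-in, and the `Holder` type is an invention for illustration.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Illustrative pattern for replacing a dead manager; all names here are hypothetical. */
final class ManagerRebuildSketch {

    /** Stand-in for the TCStore exception so this sketch compiles on its own. */
    static class StoreReconnectFailedException extends RuntimeException {
        StoreReconnectFailedException(String message) {
            super(message);
        }
    }

    /** Owns the current manager and swaps in a fresh one after a reconnect failure. */
    static final class Holder<M extends AutoCloseable> {
        private final Supplier<M> factory; // hypothetical: builds a new DatasetManager
        private M manager;

        Holder(Supplier<M> factory) {
            this.factory = factory;
            this.manager = factory.get();
        }

        /** Runs work against the current manager; work must re-obtain datasets from it each time. */
        synchronized <T> T run(Function<M, T> work) {
            try {
                return work.apply(manager);
            } catch (StoreReconnectFailedException e) {
                try {
                    manager.close(); // the old manager and everything obtained from it is dead
                } catch (Exception closeFailure) {
                    e.addSuppressed(closeFailure);
                }
                manager = factory.get(); // later calls start from a fresh manager
                throw e; // the caller decides whether to repeat the failed work
            }
        }
    }
}
```

Passing the manager into `work` on every call keeps callers from holding on to objects obtained from a manager that has since been replaced.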
StoreReconnectInterruptedException
If the reconnecting thread is interrupted, that thread will observe a StoreReconnectInterruptedException; reconnection attempts will be picked up by another thread with a pending operation, if any.
Table 3. StoreReconnectInterruptedException
A StoreReconnectInterruptedException is thrown if the client application thread under which the reconnect is being performed is interrupted using Thread.interrupt(). Unlike with a StoreReconnectFailedException, the DatasetManager is not yet unusable: the reconnect procedure is picked up by another thread performing a TCStore operation against the affected Dataset. This interruption may be handled similarly to the StoreOperationAbandonedException. The interrupted operation is not canceled; it is simply no longer tracked, so it may have completed without the response from the server having arrived.
The StoreOperationAbandonedException, StoreReconnectFailedException, and StoreReconnectInterruptedException are unchecked exceptions (subclasses of the Java RuntimeException). Applications for which operational resilience is desired and that access a clustered Dataset need to handle at least the StoreOperationAbandonedException for any activity for which resilience is desired.
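Since all three exceptions are unchecked, nothing forces the application to catch them. One way to make the handling explicit is to map each exception to the outcome it implies. The sketch below does this with stand-in exception classes (so that it compiles without the TCStore library) and a hypothetical `Outcome` enum; it is one possible arrangement, not a prescribed pattern.

```java
/** Illustrative mapping from the three exceptions to what the application knows afterwards. */
final class OutcomeSketch {

    // Stand-ins for the TCStore exceptions so this sketch compiles on its own.
    static class StoreOperationAbandonedException extends RuntimeException { }
    static class StoreReconnectFailedException extends RuntimeException { }
    static class StoreReconnectInterruptedException extends RuntimeException { }

    /** Hypothetical classification of an operation's result. */
    enum Outcome { COMPLETED, UNKNOWN, MANAGER_UNUSABLE }

    static Outcome attempt(Runnable operation) {
        try {
            operation.run();
            return Outcome.COMPLETED;
        } catch (StoreOperationAbandonedException | StoreReconnectInterruptedException e) {
            // The operation may or may not have been applied on the server.
            return Outcome.UNKNOWN;
        } catch (StoreReconnectFailedException e) {
            // The DatasetManager must be closed and a new one obtained.
            return Outcome.MANAGER_UNUSABLE;
        }
    }
}
```

An application would typically verify or repeat the operation for `UNKNOWN` and rebuild its DatasetManager for `MANAGER_UNUSABLE`.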