Client-Side Connection Management
In the TCStore client, three (3) connection health mechanisms are used: receipt of server responses to operations using the connection, a "connection health checker", and the client-side of the leasing mechanism used by the server.
I/O Operation Error Handling
If, during a read or write over the TCP connection to the server, the client experiences an error that does not indicate the TCP connection is intentionally closed, the client attempts to establish a new connection to the server (or another configured peer in the stripe) within the scope of the current TCStore operation. From the application's point of view, the operation is not interrupted; it just takes longer than usual. During this reconnect phase, connection attempts are repeated at specified intervals and continue until (1) a connection is established or (2) the client's lease expires. This level of reconnect is separate from the TCStore connection resiliency described below.
During this reconnect phase, multiple connection attempts may be made. How many attempts are made and at what frequency is governed by internally established values. If the client's lease expires during the reconnect phase, attempts to reconnect are halted and TCStore connection resiliency capabilities (described below) come into play.
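The reconnect phase described above can be pictured as a retry loop bounded by the client's lease. The following is a minimal sketch, not TCStore code: the class, method, and parameter names are hypothetical, the retry interval stands in for the internally established values (which are not published here), and time is simulated by a counter rather than by sleeping.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of a lease-bounded reconnect loop; not part of TCStore. */
final class LeaseBoundedReconnect {
    enum Outcome { RECONNECTED, LEASE_EXPIRED }

    /**
     * Repeats connectAttempt at a fixed interval until it succeeds or the
     * remaining lease time is used up. Elapsed time is simulated.
     */
    static Outcome reconnect(BooleanSupplier connectAttempt,
                             long leaseRemainingMillis,
                             long retryIntervalMillis) {
        if (retryIntervalMillis <= 0) {
            throw new IllegalArgumentException("retryIntervalMillis must be positive");
        }
        long elapsedMillis = 0;
        while (elapsedMillis < leaseRemainingMillis) {
            if (connectAttempt.getAsBoolean()) {
                return Outcome.RECONNECTED;       // the pending operation continues
            }
            elapsedMillis += retryIntervalMillis; // simulated wait before the next attempt
        }
        return Outcome.LEASE_EXPIRED;             // connection resiliency takes over
    }
}
```

In this sketch, a LEASE_EXPIRED outcome corresponds to the point at which the TCStore connection resiliency capabilities take over.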
If the active server to which the client was connected fails and a former passive becomes active, the interval designated by the client-reconnect-window property is in force. A client that establishes a connection to the new active server within both (1) the time remaining in the client's lease and (2) the interval designated by client-reconnect-window can resume operations without interruption. If either the client's lease or the client-reconnect-window expires first, TCStore connection resiliency capabilities (described below) come into play.
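Because both limits apply after a fail-over, the time a client effectively has is the shorter of the two. A small sketch of that rule follows. The class and method names are hypothetical, and treating the boundary as exclusive is an assumption made for illustration.

```java
import java.time.Duration;

/** Hypothetical illustration of the fail-over reconnect condition; not part of TCStore. */
final class FailoverWindow {
    /** The time the client effectively has: the shorter of the two limits. */
    static Duration effectiveWindow(Duration leaseRemaining, Duration clientReconnectWindow) {
        return leaseRemaining.compareTo(clientReconnectWindow) <= 0
                ? leaseRemaining
                : clientReconnectWindow;
    }

    /** True if a reconnect that took reconnectElapsed falls inside both limits. */
    static boolean canResume(Duration reconnectElapsed,
                             Duration leaseRemaining,
                             Duration clientReconnectWindow) {
        Duration limit = effectiveWindow(leaseRemaining, clientReconnectWindow);
        return reconnectElapsed.compareTo(limit) < 0; // exclusive boundary is an assumption
    }
}
```

For example, with an illustrative 20 seconds left on the lease and a 120 second client-reconnect-window, a reconnect that takes 25 seconds falls outside the lease, so the client cannot simply resume.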
Connection Health Checker
The "connection health checker" uses a "ping/response" mechanism during periods when the client is idle to ensure the client remains connected to the server. If the server does not respond to the pings, the server is considered (by the client) "unresponsive"; the client closes its side of the connection and the TCStore connection resiliency capabilities come into play.
Connection Leasing
A TCStore client also relies on the leasing mechanism. As described above, a lease is granted by the server and must be renewed by the client before it expires, that is, within the interval specified by the client-lease-duration server configuration property. If lease renewal fails, the client considers the server unavailable and closes its side of the connection, which interrupts the operations pending on that connection. At this point, the TCStore connection resiliency capabilities come into play.
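The client-side view of a lease can be sketched as a small piece of state: an expiry time that each successful renewal pushes forward, and a connection that is closed once the expiry passes. This is only an illustration under assumptions. The class and its methods are hypothetical, timestamps are passed in explicitly instead of being read from a clock, and the real renewal protocol is internal to TCStore.

```java
/** Hypothetical illustration of client-side lease tracking; not part of TCStore. */
final class ClientLeaseSketch {
    private final long leaseDurationMillis; // plays the role of client-lease-duration
    private long expiresAtMillis;
    private boolean connectionOpen = true;

    ClientLeaseSketch(long leaseDurationMillis, long nowMillis) {
        this.leaseDurationMillis = leaseDurationMillis;
        this.expiresAtMillis = nowMillis + leaseDurationMillis;
    }

    /** Records a renewal at nowMillis; fails if the lease has already lapsed. */
    boolean renew(long nowMillis) {
        if (!isConnectionOpen(nowMillis)) {
            return false;
        }
        expiresAtMillis = nowMillis + leaseDurationMillis;
        return true;
    }

    /** Closes the client side of the connection once the lease has expired. */
    boolean isConnectionOpen(long nowMillis) {
        if (nowMillis >= expiresAtMillis) {
            connectionOpen = false; // pending operations would be interrupted here
        }
        return connectionOpen;
    }
}
```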
TCStore Connection Resiliency
TCStore connection resiliency comes into action when an unrequested connection closure is observed on the client. This includes:
lease expiration (described above) as observed in the client,
connection closure forced by lease expiration on the server,
connection rejection caused by a late reconnect attempt following a server fail-over (client-reconnect-window expiration), and
network conditions that manifest as a closed connection.
The connection resiliency code suspends TCStore operations using the now-closed connection and attempts to reconnect with the cluster using alternate servers if necessary. While the time allowed for each connection attempt is controlled by the ClusteredDatasetManagerBuilder.withConnectionTimeout value, connection attempts are repeated until a connection is successfully established or the reconnection time limit (controlled by the ClusteredDatasetManagerBuilder.withReconnectTimeout value) is exceeded. Regardless of the withReconnectTimeout setting, at least one (1) reconnection attempt is made.
By default, operations in TCStore wait for a reconnection FOREVER (withReconnectTimeout = 0) unless:
1. the connection is closed (by closing the associated DatasetManager) OR
2. the reconnection is interrupted (by interrupting the client application thread attempting the reconnection).
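The interplay of the two timeouts described above can be sketched as a loop: each attempt gets its own time budget, attempts repeat until the overall limit is exceeded, a limit of 0 means no limit, and at least one attempt is always made. The code below is a simulation under assumptions, not the TCStore implementation. The names are hypothetical, every failed attempt is assumed to consume its full connection timeout, and handling of a closed DatasetManager or an interrupted thread is left out.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical simulation of the resiliency reconnect loop; not part of TCStore. */
final class ResilientReconnectSketch {
    /**
     * @param attempt                 one connection attempt; true if it connected
     * @param connectionTimeoutMillis time budget per attempt (role of withConnectionTimeout)
     * @param reconnectTimeoutMillis  overall limit (role of withReconnectTimeout); 0 = no limit
     * @return the number of attempts made if connected, or -1 if the limit was exceeded
     */
    static int reconnect(BooleanSupplier attempt,
                         long connectionTimeoutMillis,
                         long reconnectTimeoutMillis) {
        if (connectionTimeoutMillis <= 0) {
            throw new IllegalArgumentException("connectionTimeoutMillis must be positive");
        }
        long elapsedMillis = 0;
        int attempts = 0;
        do { // do-while: at least one attempt is made regardless of the limit
            attempts++;
            if (attempt.getAsBoolean()) {
                return attempts;
            }
            elapsedMillis += connectionTimeoutMillis; // assume the attempt used its full budget
        } while (reconnectTimeoutMillis == 0 || elapsedMillis < reconnectTimeoutMillis);
        return -1;
    }
}
```

With a limit of 0 and an attempt that never succeeds, the loop does not return, which mirrors the default of waiting for a reconnection indefinitely.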
If the client reconnects, suspended operations resume with the exception of operations for which a server request was made prior to observing the connection closure. For these "in-flight" operations, a StoreOperationAbandonedException is thrown to indicate the status of the operation is unknown. The application must take its own steps to determine if the operation completed, needs to be or can be repeated, or must be abandoned.
StoreOperationAbandonedException
Once a reconnection is made, operations awaiting the reconnection will either observe a StoreOperationAbandonedException or normal operation completion. Which of these is observed depends on what can be asserted (internally) about the state of the operation:
1. If a message has been presented to the server but has not been responded to, there is no way for the TCStore client code to determine if the operation message reached the server or, if it reached the server, the state of the operation initiated by that message. In this case, a StoreOperationAbandonedException is thrown.
2. If the operation was attempted while reconnect is underway, the operation will be retried (internally).
When a client receives a StoreOperationAbandonedException, it is up to the client to determine whether or not the operation can be recovered and, if so, what the recovery action must be. If application resilience is desired, the application must handle a StoreOperationAbandonedException, which may be emitted from any TCStore operation that requires server interactions.
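One possible recovery action, suitable only for operations that are safe to repeat (idempotent), is to run the operation again when its outcome is unknown. The sketch below illustrates that pattern. It is an assumption-laden example rather than a recommended implementation: the nested exception class is a stand-in declared so the code compiles without the TCStore library, and the helper name and attempt limit are hypothetical.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper for idempotent operations; not part of TCStore. */
final class AbandonedOperationRetry {
    /** Stand-in for the TCStore exception of the same name (illustration only). */
    static class StoreOperationAbandonedException extends RuntimeException {
        StoreOperationAbandonedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation, repeating it while its status comes back unknown.
     * Only appropriate when repeating the operation cannot cause harm.
     */
    static <T> T runIdempotent(Supplier<T> operation, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        StoreOperationAbandonedException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return operation.get();
            } catch (StoreOperationAbandonedException e) {
                last = e; // outcome unknown; safe to repeat because the operation is idempotent
            }
        }
        throw last; // give up and let the caller decide
    }
}
```

Operations that are not idempotent need a different strategy, such as reading the current state first to find out whether the abandoned operation took effect.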
If the withReconnectTimeout time limit expires and the underlying cause of the reconnect failure is transient in nature (e.g. a server which is temporarily offline) and hence is expected to resolve itself at a future time, a StoreReconnectTimeoutException is thrown.
StoreReconnectTimeoutException
If a StoreReconnectTimeoutException is thrown, the DatasetManager for which the connection was obtained, and any objects derived from it, all remain viable. Clients are free to retry operations using those objects. The expectation is that future attempts to reconnect will ultimately succeed once the underlying transient issue, which had prevented earlier reconnection attempts from succeeding, is no longer active.
If the withReconnectTimeout time limit expires and either the DatasetManager is closed while reconnecting or the underlying cause of the failed reconnection is due to a permanent issue that cannot or likely will not be resolved (e.g. another client having deleted a dataset against which the current client is attempting an operation or the target server's security protocol having been changed rendering it incompatible with the reconnecting client), then all operations for that connection are suspended and any future operations against the affected DatasetManager are terminated with a StoreReconnectFailedException.
StoreReconnectFailedException
If a StoreReconnectFailedException is thrown, the affected server connection, the DatasetManager for which the connection was obtained, and any objects obtained from that DatasetManager are now effectively dead: the connection cannot be recovered and the DatasetManager is unusable. If the client wishes to continue operations, the DatasetManager needs to be closed and a new DatasetManager instance obtained.
If the reconnecting thread is interrupted, that thread will observe a StoreReconnectInterruptedException; reconnection attempts will be picked up by another thread with a pending operation, if any.
StoreReconnectInterruptedException
A StoreReconnectInterruptedException is thrown if the client application thread under which the reconnect is being performed is interrupted using Thread.interrupt(). Unlike the StoreReconnectFailedException, the DatasetManager is not yet unusable; the reconnect procedure is picked up by another thread performing a TCStore operation against the affected Dataset. This interruption may be handled similarly to the StoreOperationAbandonedException: the interrupted operation is not canceled, it is simply no longer tracked. It may have completed without the response from the server having arrived.
The StoreOperationAbandonedException, StoreOperationTimeoutException, StoreReconnectFailedException, and StoreReconnectInterruptedException are unchecked exceptions (subclasses of the Java RuntimeException). Applications that access a clustered Dataset and need operational resilience must handle at least the StoreOperationAbandonedException around every activity that is meant to be resilient.
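The recovery guidance given for each exception above can be summarized as a small triage routine. The following is a sketch under stated assumptions. The nested exception classes are stand-ins declared only so the example compiles without the TCStore library (the real classes may sit in a different hierarchy), and the Recovery enum and triage method are hypothetical names.

```java
/** Hypothetical triage of reconnect-related exceptions; not part of TCStore. */
final class ReconnectExceptionTriage {
    // Stand-ins for the TCStore exceptions of the same names (illustration only).
    static class StoreOperationAbandonedException extends RuntimeException {}
    static class StoreReconnectTimeoutException extends RuntimeException {}
    static class StoreReconnectFailedException extends RuntimeException {}
    static class StoreReconnectInterruptedException extends RuntimeException {}

    enum Recovery {
        CHECK_OUTCOME_THEN_DECIDE, // operation status unknown; the application must verify
        RETRY_WITH_SAME_MANAGER,   // the DatasetManager remains viable
        REPLACE_DATASET_MANAGER,   // close the DatasetManager and obtain a new one
        NOT_A_RECONNECT_ISSUE      // some other failure; handle elsewhere
    }

    /** Maps an exception to the recovery approach described in the text above. */
    static Recovery triage(RuntimeException e) {
        if (e instanceof StoreOperationAbandonedException
                || e instanceof StoreReconnectInterruptedException) {
            return Recovery.CHECK_OUTCOME_THEN_DECIDE;
        }
        if (e instanceof StoreReconnectTimeoutException) {
            return Recovery.RETRY_WITH_SAME_MANAGER;
        }
        if (e instanceof StoreReconnectFailedException) {
            return Recovery.REPLACE_DATASET_MANAGER;
        }
        return Recovery.NOT_A_RECONNECT_ISSUE;
    }
}
```

An application could call such a routine from a catch block around its TCStore operations and branch on the result.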