Session Persistence and Recovery

Enabling persistent sessions in Com-plete improves its availability and the comfort for the end user. In the event of a failure of any of the involved components (hardware, operating system, VTAM, Com-plete, or any other component), and subsequent normal or abnormal termination of Com-plete, user sessions are kept alive and are recovered automatically when Com-plete restarts on the same or even on a different LPAR.

Persistent sessions support in Com-plete is based upon two main pillars:

  • Persistent VTAM sessions, and

  • Natural application recovery.

When Com-plete starts up and connects to VTAM, it is notified if any failed-persistent sessions are present. Com-plete recovers these sessions using recovery data stored in an XCF structure, performs automatic user log-in, restores the Com-Pass menu levels if appropriate, and restarts the application program that was last active before the failure or shutdown. If this application program supports recovery (Natural does), it recovers automatically to the state of the last checkpoint (terminal write) before the failure or shutdown, and waits for the next input from the terminal user. Thus, the only effect of the failure that the user notices is a certain delay; the session itself is not interrupted.


Prerequisites

The following product versions and features are a minimum requirement:

  • VTAM from the Communications Server of z/OS 1.4 or later (for persistent sessions support).

  • Natural 4.2.2 and a Natural Roll Server (for Natural application recovery).

  • The LPAR(s) where Com-plete runs must be members of a Sysplex connected by a Cross-system Coupling Facility (XCF).

Correlation with Generic VTAM Resources

In general, there is no correlation between session persistence and the use of a generic VTAM resource. Each Com-plete instance has its own VTAM ACB, and VTAM session persistence is based on the name of this ACB (Com-plete sysparm VTAMAPPL), which must be different for all Com-plete instances. This means that if one Com-plete instance fails, the remaining members of the generic VTAM resource group cannot take over sessions from the failing Com-plete. Instead, a Com-plete with the same ACB name must be restarted in order to recover the failed-persistent sessions.

Single vs. Multi Node Considerations

This is probably the most important and at the same time the most difficult decision to make in the context of session persistence and recovery, because you need to weigh the desired (or required) degree of reliability against cost and performance impact. On the one hand, one of the worst things that can happen in a z/OS environment is a severe failure that requires an IPL, as this may mean downtime of the LPAR for an hour or more. So you want to be able to restart your critical Com-plete on a different LPAR immediately if the LPAR it was on needs to be IPL-ed. On the other hand, in order to be able to recover sessions on a different LPAR, you must configure the Natural Roll Server to always write to disk. This may heavily impact the response times of a busy system.

Setup for Using Persistent VTAM Sessions with Com-plete

In order to enable persistent VTAM sessions for Com-plete, the following steps are necessary:

  • Set up an XCF list structure for Com-plete for storing session recovery data. Specify the name of this structure in the Com-plete sysparm XCF-STRUCTURE=name.

  • Specify Com-plete sysparm VTAM-PERSIST=YES.

For single node persistent sessions (SNPS), no changes to Com-plete's VTAM application deck (APPL definition) are necessary.

Multinode Persistent Sessions (MNPS)

In addition to the above, PERSIST=MULTI must be specified on Com-plete's VTAM application deck for multinode persistent sessions. This is only supported (by z/OS Communications Server) for modeled application decks. For example, change the last character in the name of the application to a wildcard character; this turns the application into a modeled application.

Example:

COMPLET? APPL  AUTH=(ACQ,PASS),PERSIST=MULTI

Additionally, it seems that the VTAM node must be an NN or ICN with APPN support; an EN or subarea-only VTAM node will not support this feature. The z/OS Communications Server documentation is somewhat obscure on this subject; be sure to refer to the latest versions of the IBM documentation (at least z/OS 1.8), and contact IBM directly if you need help setting up your VTAM definitions for MNPS.

Warning:

This applies to both SNPS and MNPS. If you start Com-plete with VTAM-PERSIST=NO set or defaulted after it was up with VTAM-PERSIST=YES, then it might fail to connect to VTAM, and you will receive the following error message:

COMVTM1002-* OPEN failed, ACB error=118 (X'76')

This indicates that, from VTAM's perspective, the application (Com-plete) is in a recovery pending state. To resolve the problem, issue the following operator commands:

VARY NET,ID=acbname,INACT,F
VARY NET,ID=acbname,ACT

Note that this terminates any sessions pending recovery. Afterwards, the operator command

MODIFY jobname,VTAM,START

(or a restart of Com-plete) should allow Com-plete to open its VTAM ACB successfully.

Setting up the XCF List Structure for Com-plete

Com-plete maintains the recovery data for user sessions and active applications in a Coupling Facility list structure. In order to set this up, you must define the structure in your Coupling Facility policy. The space required for the structure is about 1 to 4 KB per VTAM user session. The actual size required depends on highly variable factors such as the number of Com-Pass levels in use in each session, the lengths of parameter strings used when starting an application program, etc., so an exact calculation is impossible.

The following example defines a structure that should be sufficient for 1000 parallel VTAM user sessions.

Example:

STRUCTURE NAME(COM_LIST1)           
          SIZE(04096) INITSIZE(1024)
          PREFLIST(XCF1,XCF2)       
          REBUILDPERCENT(10)

In order to initialize the structure, activate the Coupling Facility policy containing the definition. For Com-plete to use the structure, specify its name in the Com-plete sysparm XCF-STRUCTURE, e.g., XCF-STRUCTURE=COM_LIST1.
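
The sizing rule of thumb above (about 1 to 4 KB per VTAM user session) can be turned into a rough estimate. The following Python sketch is illustrative only: it assumes that SIZE and INITSIZE are given in units of 1 KB, which is consistent with the example for 1000 sessions, and the rounding up to a multiple of 1024 KB is an assumption of this sketch, not a documented requirement. The function name is hypothetical.

```python
import math

# Rule of thumb from the text: roughly 1 to 4 KB of recovery data
# per VTAM user session.
KB_PER_SESSION_LOW = 1
KB_PER_SESSION_HIGH = 4


def estimate_structure_size(sessions: int, granule_kb: int = 1024) -> tuple[int, int]:
    """Return (initsize_kb, size_kb) for the given number of sessions.

    Assumption: values are in units of 1 KB and are rounded up to a
    multiple of granule_kb. The rounding is a convenience of this
    sketch only.
    """
    def round_up(kb: int) -> int:
        return math.ceil(kb / granule_kb) * granule_kb

    initsize_kb = round_up(sessions * KB_PER_SESSION_LOW)
    size_kb = round_up(sessions * KB_PER_SESSION_HIGH)
    return initsize_kb, size_kb


initsize, size = estimate_structure_size(1000)
print(f"SIZE({size}) INITSIZE({initsize})")
```

For 1000 sessions this yields the same values as the example definition. Because the real requirement depends on Com-Pass levels and parameter string lengths, treat the result as a starting point and monitor actual structure usage.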

Multiple Com-plete instances running as peers in a generic VTAM resource group should share the same XCF list structure. Only in this case is session migration possible among these Com-pletes.

The only data stored in this structure is that required for session recovery. This means that after an orderly shutdown of all Com-pletes that use the structure, it can be safely reallocated.

Application Recovery

Persistent sessions can be used without application recovery. In this case, however, session recovery restarts each user session at its application’s main menu rather than at the last active screen.

In order to enable application recovery, specify sysparm APPLICATION-RECOVERY=YES. Note that this makes sense only in conjunction with persistent VTAM sessions, since the only way to recover an application is from a recovered session.
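
The dependencies between the sysparms named in this document (XCF-STRUCTURE, VTAM-PERSIST and APPLICATION-RECOVERY) can be checked before startup. The following Python sketch is one possible way to do this. The one-parameter-per-line KEY=VALUE layout, the treatment of lines starting with "*" as comments, the NO default assumed for APPLICATION-RECOVERY, and both function names are assumptions made for illustration; they are not taken from the Com-plete documentation.

```python
def parse_sysparms(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines into a dictionary.

    Assumption: one sysparm per line; blank lines and lines starting
    with '*' are ignored.
    """
    parms: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            parms[key.strip().upper()] = value.strip()
    return parms


def check_persistence_parms(parms: dict[str, str]) -> list[str]:
    """Return a list of problems found in the persistence-related sysparms."""
    problems: list[str] = []
    # VTAM-PERSIST defaults to NO according to the text; the NO default
    # for APPLICATION-RECOVERY is an assumption of this sketch.
    persist = parms.get("VTAM-PERSIST", "NO").upper() == "YES"
    recovery = parms.get("APPLICATION-RECOVERY", "NO").upper() == "YES"
    if persist and not parms.get("XCF-STRUCTURE"):
        problems.append("VTAM-PERSIST=YES requires XCF-STRUCTURE=name")
    if recovery and not persist:
        problems.append("APPLICATION-RECOVERY=YES requires VTAM-PERSIST=YES")
    return problems


# Hypothetical sysparm set: VTAM-PERSIST has been forgotten.
sample = """\
XCF-STRUCTURE=COM_LIST1
APPLICATION-RECOVERY=YES
"""
for problem in check_persistence_parms(parse_sysparms(sample)):
    print(problem)
```

In the sample, application recovery is requested without persistent VTAM sessions, so the check reports that combination.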

At the time of writing, the only application environment to support recovery is Natural. Natural recovery under Com-plete requires a Natural Roll Server, which must be defined to Natural using the Natural profile parameter SUBSID. Please refer to the Natural documentation for details on how to set up a Roll Server.

Important:
If you want to be able to recover sessions on another LPAR, you must install the Natural Roll Server on each of the involved LPARs using common roll files, and connect all the Roll Servers to a common XCF structure.

None of Com-plete’s own utilities currently supports application recovery; restarting a Com-plete utility after session recovery merely restarts it from scratch with the same parameter string it was started with originally. UEDIT is restarted with the R option, suggesting recovery of the file being edited from the UEDIT work file.

Session Migration Between Com-plete Instances

Com-plete uses a single address space architecture for terminal communication and application execution. A certain degree of workload balancing among multiple Com-plete instances running on one or more LPARs can be achieved by using a common generic VTAM resource name. Once a session has been established with one of the peer Com-pletes, the session remains there. In theory, the peer Com-pletes could permanently monitor response times and shift user sessions around when differences reach a certain level. However, for the time being, Software AG has decided against this for the following reasons:

  1. Session termination and re-establishment is relatively expensive. The need to shift sessions would typically arise when a system’s resources are at their limit, so shifting sessions at that point would be likely to worsen the situation rather than improve it.

  2. The ultimate way of managing critical workloads is the Workload Manager (WLM). Since version 6.4, Com-plete has supported WLM performance goals for transactions executing in Com-plete, and Software AG recommends using this feature. If Com-plete did its own performance-based workload shifting, this would be likely to counteract WLM’s endeavors.

The only exception that actually allows migrating sessions from one Com-plete to another is the Com-plete operator command VTAM EVACUATE; see section VTAM in the Computer Operator Commands documentation.

Note:
Session migration requires AUTH=PASS to be specified on Com-plete's VTAM application deck (APPL definition).

Session Handling at Com-plete Shutdown

When Com-plete is set up to support persistent sessions, shutting down Com-plete by means of the EOJ or /STOP operator command does not terminate any active VTAM user sessions. These sessions change state to “failed-persistent” and will be recovered the next time Com-plete starts. In order to force termination of all sessions at shutdown, issue a VTAM NOPERSIST command (abbreviated: VT NOP) prior to EOJ or /STOP.

Session termination by the user (either by graceful shutdown or by closing the terminal emulation window) is always noted by VTAM and is executed regardless of the state of the session (active or failed-persistent).

The auto-logoff timeout terminates the session if Com-plete is active, but not when it’s down.

Note:
The maximum time a session can remain in failed-persistent state is 24 hours. If Com-plete is not restarted within this timeframe, then all failed-persistent sessions are terminated by VTAM.

Restrictions

Natural applications meeting any of the following criteria are not currently recoverable:

  • The application has an open connection to DB2.

  • The application has one or more Natural-for-VSAM files open (Com-plete SD-files don’t count).

The following must be the same at recovery time as they were at the time of the shutdown or failure:

  • Com-plete thread size (THSIZEABOVE)

  • Natural nucleus and front-end modules – note that these must not be relinked or have maintenance applied before recovery!

The Com-plete thread size and the Natural nucleus and front-end modules must also be identical among all Com-pletes of a generic resource group if sessions are to migrate among them.
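
The identity requirements above lend themselves to a simple pre-flight comparison. The following Python sketch shows the idea only. The InstanceProfile record, the way its values are obtained, and the use of a fingerprint string (for example a link-edit date or checksum) for the Natural modules are all hypothetical; Com-plete does not provide such a check in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceProfile:
    """Hypothetical snapshot of the settings that must match for recovery."""
    name: str
    thsizeabove: int          # Com-plete thread size (THSIZEABOVE)
    natural_modules: str      # assumed fingerprint of nucleus and front-end modules


def recovery_mismatches(before: InstanceProfile, after: InstanceProfile) -> list[str]:
    """List the differences that would prevent recovery or migration."""
    mismatches: list[str] = []
    if before.thsizeabove != after.thsizeabove:
        mismatches.append(
            f"THSIZEABOVE differs: {before.thsizeabove} vs {after.thsizeabove}"
        )
    if before.natural_modules != after.natural_modules:
        mismatches.append("Natural nucleus or front-end modules differ")
    return mismatches


# Hypothetical values for two peers of one generic resource group.
peer_a = InstanceProfile("COMPLETA", thsizeabove=1024, natural_modules="2009-03-01")
peer_b = InstanceProfile("COMPLETB", thsizeabove=2048, natural_modules="2009-03-01")
for mismatch in recovery_mismatches(peer_a, peer_b):
    print(mismatch)
```

The same comparison applies to one Com-plete before and after a restart, and to any two peers between which sessions are to migrate.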

Hints for Testing

Given the complexity of the setup for session persistence and recovery, thorough testing is a must before you can be sure that it will work in an emergency situation. Com-plete provides some features that help you avoid having to shut down Com-plete many times for testing recovery.

Single Com-plete

Within a single Com-plete, you can use the VTAM STOP (abbreviated: V STO) operator command to put all VTAM sessions into a failed-persistent state, if session persistence is enabled. VTAM START (abbreviated: V STA) recovers them.

Note:
If you issue VTAM STOP from UCTL in a VTAM session, you won’t be able to issue VTAM START from there because your session will also have become failed-persistent. Therefore, it may make sense to issue these commands from a Telnet session at a tn3270 port inside Com-plete, from a 3270-Bridge HTTP session, or using the MVS system command /MODIFY from a console outside Com-plete.

Multiple Com-plete Instances

If you have two or more identical Com-pletes (see section Restrictions above for what must be identical), then you can test application recovery without going through suspension and recovery of VTAM sessions.

In a Com-Pass session, you can suspend your application(s), and type

UPASS appl-ID

in the Com-Pass command line in order to have your session transferred to another Com-plete. appl-ID can be either the ACB name (sysparm VTAMAPPL) of an active Com-plete, or a generic resource name. Once in the “new” Com-plete, you recover your application(s) by recalling them from the Com-Pass stack.

In a non-Com-Pass session, the above cannot be accomplished (because you cannot stack). Instead, from another session (or from the console using /MODIFY), you can issue the operator command

VTAM EVACUATE userID

(abbreviated: V EVA userID) in order to transfer the session to another Com-plete in the same generic resource group. The VTAM EVACUATE command does not support specifying a target appl-ID. You can also use this technique to transfer a Com-Pass session; the session is then transferred with its current application still active, rather than stacked in Com-Pass.