Troubleshooting
Out of Memory
A common reason for a realm to attempt a graceful shutdown is an OutOfMemoryError.
Cause
The realm server is under heavy load, and the currently allocated maximum heap is not large enough to handle that load or message volume.
Reliable channels keep all their events in memory. If the total size of the events on these reliable channels exceeds the memory available to the realm server's JVM, the server will run out of memory.
A memory leak in the realm server.
Solution
Consider using other channel types (for example, a persistent channel, which stores events to disk).
Purge events from channels either programmatically or by using channel attributes such as TTL or capacity.
Servers with a large number of resources and large message loads may legitimately need a large amount of memory to operate. If high memory usage is due to high load, increase the Universal Messaging maximum heap size so that the server has enough heap to handle the load. This value is set by the wrapper.java.maxmemory entry in Server_Common.conf (under UniversalMessaging/server/<InstanceName>/bin); after changing it, restart the Universal Messaging realm.
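For example, to raise the maximum heap to 4 GB, the entry would look like the following (wrapper.java.maxmemory is specified in megabytes; 4096 is only an illustrative value, so size it according to your load and available RAM):
wrapper.java.maxmemory=4096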
If the realm server is running on Red Hat Enterprise Linux 6 (RHEL 6), execute the command below while memory is continuing to grow, to capture the output to a debug.out file:
pmap <realm_server_pid> > ./debug.out
For example:
pmap 17845 > ./debug.out
Next, review the debug.out file from the above command. If it contains many entries marked [anon], as shown in the example below, the high memory usage is due to a known Linux/JVM issue; a work-around is described after the example output.
0000000000400000 4K r-x-- java
0000000000600000 4K rw--- java
0000000002201000 4896K rw--- [ anon ]
00000005c0000000 8390912K rw--- [ anon ]
00000007c0240000 1046272K ----- [ anon ]
00007f6ab8000000 27108K rw--- [ anon ]
00007f6ab9a79000 38428K ----- [ anon ]
00007f6ac8000000 65492K rw--- [ anon ]
00007f6acbff5000 44K ----- [ anon ]
If the high memory usage is caused by the JVM/Linux issue described above, review and adjust the MALLOC_ARENA_MAX environment variable on the operating system. Setting MALLOC_ARENA_MAX=1 is a suggested work-around to stop this memory growth.
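For example, assuming the realm server is started from a shell session or a startup script, the variable can be exported in that environment before the server is launched (where exactly to set it depends on how the realm is started in your environment):
export MALLOC_ARENA_MAX=1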
Mitigation
Monitor the realm server's JVM heap memory usage and take corrective actions.
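For example, assuming a JDK is available under <InstallDir>/jvm/jvm/bin (the same location used for jmap below), heap and garbage-collection utilization can be sampled with jstat; the 5000-millisecond interval is only illustrative:
./jstat -gcutil <RealmServer_pid> 5000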
If further root cause analysis is needed, gather the following information in addition to the regular details:
1. Execute the pmap command as below to capture its output:
pmap <realm_server_pid> > ./debug.out
For example: pmap 17845 > ./debug.out
2. Take a heap dump of the realm server while memory usage is high. From the command line, navigate to the <InstallDir>/jvm/jvm/bin directory and execute the following command to generate the heap dump:
jmap -dump:format=b,file=<full_path>\<file_name>.hprof <RealmServer_pid>
For example, on Windows:
jmap -dump:format=b,file=C:\tmp\realm9300.hprof 13276
Unix:
./jmap -dump:format=b,file=/temp/realm9300.hprof 13276
java.lang.OutOfMemoryError Direct buffer memory
The realm server can shut down when the configured direct memory is not sufficient for the load or transactions it is handling.
1. Review the nirvana.log file (under the <DataDir> directory) and check whether an error such as "java.lang.OutOfMemoryError Direct buffer memory" was logged before the realm server went down, for example:
[Tue Jul 25 12:20:07 PDT 2017],UserManager: User abc@<host>
Logged Out using nsp, Reason : java.lang.OutOfMemoryError Direct buffer memory,
session established for 26192 seconds, Session Id = 3c7fd6a000000000 ID:
<host>:37532
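If the log file is large, a quick way to check for this error (assuming <DataDir> resolves to this instance's data directory) is:
grep "java.lang.OutOfMemoryError" <DataDir>/nirvana.log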
2. Verify the realm configuration to see whether the realm server is configured to use direct buffering or heap buffering; this issue only happens when the realm server is configured to use direct buffering. From Enterprise Manager, under the "Config" tab, navigate to "Connection Config" and check "UseDirectBuffering". If it is set to "true", the realm server allocates DirectByteBuffers for network I/O (otherwise the realm server uses HeapByteBuffers instead).
If "UseDirectBuffering" is set to true, and the realm server stopped with the above error in nirvana.log, this indicates there are two possibilities that can cause this issue:
The available direct memory which is allocated for this server is not sufficient.
Or, the direct memory is sufficient, however the garbage collection (GC) does not run frequently enough to free the memory.
3. Next, to find out how much memory is currently allocated for direct memory, open Server_Common.conf (under <InstallDir>/UniversalMessaging/server/<InstanceName>/bin) and check the current value of "MaxDirectMemorySize". For example, the following entry sets the size to 1 gigabyte:
wrapper.java.additional.17=-XX:MaxDirectMemorySize=1G
4. To monitor the buffer usage, from Enterprise Manager, navigate to the "Monitoring" > "Threads" > "Pools" tab, then monitor the "Buffers Created" and "Buffers Reused" values to get a better idea of the buffer usage.
Cause
The configured direct memory is not sufficient for the realm server to handle the load or transactions.
Solution
There are three suggested options that can be used to resolve this issue depending on the customer's environment, configuration, and architecture.
Increase the value of MaxDirectMemorySize (configured in Server_Common.conf) as long as there is enough memory (RAM) available on the system (see the example entries after the note below).
Or, decrease the maximum heap memory setting (wrapper.java.maxmemory in Server_Common.conf). This causes more frequent GC sweeps, which release the allocated direct memory sooner.
Or, disable the use of direct memory. This can be done by setting "Config" > "Connection Config" > "UseDirectBuffering"=false in Enterprise Manager.
Note:
This will adversely affect the performance of Universal Messaging to some extent. It also requires increasing the maximum heap size: the new heap size should be at least the current heap size plus the direct memory size.
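As an illustration of the first two options, the relevant entries in Server_Common.conf might look like the following; the property index (.17) must match your own file, and the sizes shown are only examples to be increased or decreased according to the option you choose:
wrapper.java.additional.17=-XX:MaxDirectMemorySize=2G
wrapper.java.maxmemory=4096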
Mitigation
Monitor the memory usage of the realm server process and take corrective action.
If further root cause analysis is needed, gather the following information in addition to the regular details:
The buffer usage from Enterprise Manager. Navigate to the tab "Monitoring" > "Threads" > "Pools", then monitor the "Buffers Created" and "Buffers Reused" values to get a better idea of the buffer usage.
Too many open files
The realm server can shut down due to an environment error such as: data/RealmSpecific/channels.nst_new (Too many open files).
Cause
The realm server has exceeded the maximum file descriptors limit on the operating system.
Solution
Increase the ulimit values based on the following requirements (see the example after the list):
~2 socket connections per client
1 per channel or queue
~30 for realm server state synchronization
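For example, the current open-file limit for the user running the realm server can be checked with:
ulimit -n
To raise it persistently, entries such as the following can be added to /etc/security/limits.conf (the user name umuser and the value 65536 are only illustrative):
umuser soft nofile 65536
umuser hard nofile 65536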
Mitigation
Set appropriate ulimit values based on the number of clients and channels/queues.
Cluster Not Forming
Cause
The cluster cannot form unless more than 50% of the realm servers forming the cluster are available and reachable
Inter-realm server communication or network is broken
Solution
Make sure that at least 2 realm servers in a 3 node cluster are online and reachable
Make sure that all the realm servers are reachable from each other
Contact Software AG Technical Support if none of the above works.
Mitigation
Monitor the cluster state
Realm node fails to start
The realm server uses the Tanuki Java Service Wrapper to start. Behind the scenes, the Tanuki wrapper tracks the health of the JVM process by pinging it. If it does not get a response from the started JVM process within a preset time limit, the Tanuki wrapper logs the message "Startup failed: Timed out waiting for signal from JVM". The wrapper will then either try to restart the JVM process if "Automatic JVM Restarts" is enabled, or issue a shutdown if "Automatic JVM Restarts" is disabled.
1. Review UMRealmService.log: Navigate to the <InstallDir>/UniversalMessaging/server/<InstanceName>/bin directory, open UMRealmService.log, and search for "Timed out waiting for signal from JVM". For example, when the realm server fails to come up, the messages may look something like this:
wrapper | Startup failed: Timed out waiting for signal from JVM.
wrapper | JVM did not exit on request, termination requested.
wrapper | JVM received a signal SIGKILL (9).
wrapper | JVM process is gone.
wrapper | JVM exited after being requested to terminate.
wrapper | Automatic JVM Restarts disabled. Shutting down.
wrapper | <-- Wrapper Stopped
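To locate these entries quickly in a large log, a simple search can be used, for example:
grep "Timed out waiting for signal from JVM" <InstallDir>/UniversalMessaging/server/<InstanceName>/bin/UMRealmService.log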
2. Check Server_Common.conf under the <InstallDir>/UniversalMessaging/server/<InstanceName>/bin directory to see if the entry "wrapper.startup.timeout" exists and is set to any value. If this entry does not exist, the default value set by Tanuki is 30 seconds.
Cause
The realm server could not start within the Tanuki wrapper's startup timeout.
Typical causes could be:
Memory or CPU starvation
Heavy swap usage
Solution
1. The quick solution is to add (if it does not exist) or update the wrapper.startup.timeout entry in Server_Common.conf in the <InstallDir>/UniversalMessaging/server/<InstanceName>/bin directory, set the timeout to a higher value (for example, 300 seconds), and then restart the realm server:
wrapper.startup.timeout=300
2. Once the realm server is up, review nirvana.log to find out the reason for the longer startup time.
Mitigation
None
Realm node fails to recover
A server which fails to recover will retry automatically and can often recover given enough time. Realms that are offline for a long time may take a long time to recover, as all information has to be re-synchronized from the master. Monitor the cluster state as noted in one of the previous sections. As long as the recovery is progressing and not looping on any particular message, the realm should be left to complete its recovery.
Cause
A common cause of failure to recover is data corruption. This will appear in the log files as repeated occurrences of 'Cluster recovery failed... retrying' and 'going into recovery for channel' for the same channel.
For example:
[Wed Feb 18 10:20:49 EST 2017],Cluster recovery failed...retrying:
/cfp/core/maps/moduleHeartBeatMap First EID:540811 Last EID:541706
with 17 events First EID:516482 Last EID:541706 with 19 events
[Wed Feb 18 10:20:49 EST 2017],Going into recovery for channel
[/cfp/core/maps/moduleHeartBeatMap] because eid 541706 Status EID 541706
Channel Events: 17 Status Events:19
Storage file(s) can get corrupted for various reasons:
Disk space ran out, so the information could not be stored to the file properly, or only partial information was stored, leading to corrupted information in the file(s).
Disk drive issues that cause the file to be damaged or inaccessible.
Synchronization issues between storage files throughout the cluster.
Below are a few examples of errors indicating that a storage file is corrupted or has some problem that prevents the realm server from coming up. In other cases the error may not be exactly the same, but if the same error with the same filename keeps repeating, this hints that the file is probably corrupted or otherwise damaged.
1. The following exception indicates that there is an issue with the file NirvanaClusterQueueCommitChannelb7cd0d8d1f8b5e.mem, which prevents the realm server from starting:
[Fri Apr 28 21:33:09 BST 2017],Mixed-Store>
/UniversalMessaging/server/my_server/data/
RealmSpecific/NirvanaClusterQueueCommitChannelb7cd0d8d1f8b5e.mem:
Exception Follows:
[Fri Apr 28 21:33:09 BST 2017],Socket Stream reach EOF
java.io.EOFException: Socket Stream reach EOF at
com.pcbsys.foundation.io.fEventInputStream.readComplete(fEventInputStream.java:393)
at com.pcbsys.foundation.io.fEventInputStream.readByteArray(fEventInputStream.java:300)
at com.pcbsys.nirvana.base.events.nBasePublishEvent.performRead(nBasePublishEvent.java:430)
2. The following messages repeat every minute in the log, indicating that there is an issue with the file /namedsubqueues/xyz/top/um/productESB/NamedSubscriber4611686018427387905, which prevents the realm server from coming up.
[Sat Apr 22 00:38:24 BST 2017],Restarting restore for
/namedsubqueues/xyz/top/um/productESB/NamedSubscriber4611686018427387905
[Sat Apr 22 00:38:24 BST 2017],Cluster> Cluster Recovery started for
/namedsubqueues/xyz/top/um/productESB/NamedSubscriber4611686018427387905...
[Sat Apr 22 00:38:25 BST 2017],Restarting restore for
/namedsubqueues/xyz/top/um/productESB/NamedSubscriber4611686018427387905
[Sat Apr 22 00:38:25 BST 2017],Cluster> Cluster Recovery started for
/namedsubqueues/xyz/top/um/productESB/NamedSubscriber4611686018427387905
Solution
The data in the Universal Messaging channels is stored in .mem files on disk, in the <DataDir> directory of the realm. The contents of the channel can be found in a file called <channelName><hashNumber>.mem. The hash number is internally generated and is usually not significant for our purposes.
Follow these steps to recover from data corruption (a command-line sketch of these steps follows below):
Stop (gracefully) the realm that is having problems recovering
Take a backup of the <DataDir> directory
Identify the .mem file for the channel on disk by the channel name
Delete this .mem file
Restart the realm
The data will be re-synchronized from the master, so no data will be lost for the channel.
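A minimal command-line sketch of these steps, assuming <DataDir> resolves to the realm's data directory, the affected channel is called myChannel (a placeholder), and /tmp is an acceptable backup location; the realm must be stopped before any of these commands are run:
tar -czf /tmp/um_data_backup.tar.gz <DataDir>
find <DataDir> -name "myChannel*.mem"
rm <path_reported_by_find>/myChannel<hashNumber>.mem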
Important:
Deleting a store erases the data from that slave node. If multiple nodes fail it may become impossible to recover data from the cluster. Always back up the contents of the data directory before performing any manual operation. Never modify the contents of the data directory while a realm is running.
Note: If there are one or more .mem files that are drastically different from the store on the master realm, recovery can take a long time, as each such .mem file has to be reconciled with the master store. This case can be recognized by cluster log messages showing those specific channels repeatedly in recovery over an extended period. For example, multiple occurrences of messages like the following would be diagnostic:
Cluster> Still in recovery for /various/messenger/generic
In these cases, removing the .mem file from the realm causes the entire store to be synchronized from the master and speeds up recovery. Proceed as outlined above for the data corruption case.
Contact Software AG Technical Support if none of the above works.
Mitigation
If possible, shut down all the servers (gracefully). Killing the realm can cause data corruption and problems reforming the cluster.
Realm drops out of cluster
A realm goes offline and drops out of the cluster, with the following message indicating that the keep alive pings between the realms are not being returned:
nPhysicalKeepAlive : Terminating inactive cluster inter realm link
due to no data received
Sometimes disconnection messages appear, such as:
[Tue Jun 13 23:38:56 EDT 2017],Disconnected from p2um1
Cause
This may happen because of a slow response from the remote realm(s) due to load and/or resource constraints. By default, a realm sends keep-alive messages every 10 seconds and logs this message if there is no response within 35 seconds. Check the log on the remote realm(s) for signs of distress, warnings, and errors.
Frequent GCs can result in messages in UMRealmService.log like the one below, indicating that the JVM is responding very slowly.
STATUS | wrapper | 2017/06/18 23:38:50 |
Pinging the JVM took 4 seconds to respond.
The slow response of the JVM could be caused by CPU, memory, or other constraints in its environment.
Cases of network errors are visible as java.io errors in the log files.
In the case of socket/TCP problems, UM does not always get a clear message from the Java I/O libraries on exactly what went wrong. The Java libraries will sometimes just give a generic socket exception, which shows up as java.io.EOFException or similar. Check the network using network diagnostics.
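For example, basic reachability and latency between the realm hosts can be checked with standard operating system tools; <remote_realm_host> and the port 9000 are placeholders for your own realm host names and interface ports:
ping <remote_realm_host>
traceroute <remote_realm_host>
nc -vz <remote_realm_host> 9000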
Solution
Identify the root cause from the logs as described in the previous section and try to resolve the issue.
Mitigation
Monitor the network and other resources and take appropriate corrective actions before a realm server goes unresponsive and drops out of the cluster.
Consumer Performance leading to "Consumer Warning" entries
In some cases, there can be "Consumer Warning" entries in the server log (logged at log level "Warning").
Cause
These log entries are generally related to poor performance on the consumer side. Typical symptoms are:
Events piling up on a queue or channel with shared or serial durables.
Slow consumption or publishing as a result of a consumer having incorrectly configured event filters.
Solution
The presence of these "Consumer Warning" log entries can give information about consumers that are not properly configured, such as:
Consumers that are slow in acknowledging events, which can lead to events piling up.
Consumers with incorrectly configured filters which can affect performance or lead to events piling up.
To be able to see the "Consumer Warning" log entries, the log level of the server should be set to "Warning" or more verbose.
Mitigation
Based on the "Consumer Warning" entries, check if your event consumption strategy can be optimized.