Wednesday 25 April 2018

Informatica High-availability of Domain

Terminology

  • Gateway Node: A node configured as a gateway can become the master of the domain.
  • Master Gateway Node: A gateway node that is currently acting as master.
  • Worker Node: A node marked as a worker node will not become the master of the domain but reports to the current master.
  • Master Election: A routine run by nodes to find the current master or unanimously elect a master.
  • Domain database: Database schema where domain metadata is stored. This database acts as an arbiter during the master election process.
  • Database heartbeat: A heartbeat (update query) run periodically by the master node on the domain database to broadcast its liveness.
  • Node heartbeat: A heartbeat message sent by each non-master node to the master node to report its liveness.

Design

The availability of the Informatica domain is based on the availability of an elected master node. When the first gateway node is started, it runs the master election routine and becomes the master node. When the next gateway nodes are started, they discover the first node as the master (as a part of master election) and register with it. These nodes periodically send a (node) heartbeat to the master node to express their liveness.

When the master node terminates, the other live gateway nodes detect the unavailability of the master node through the failed node heartbeat. At once, they re-run the master election (master re-election) to elect a new master. After a new master is identified, the remaining nodes register with the new master.
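The start-up and re-election flow described above can be illustrated with a toy in-memory model. This is only a sketch: the class ToyDomain and its methods are hypothetical, and the rule used to pick the winner (the earliest-started live gateway) is an assumption made for illustration, since the actual election uses the domain database as arbiter and its tie-breaking is not described here.

```python
class ToyDomain:
    """In-memory stand-in for a domain; NOT Informatica's election algorithm."""

    def __init__(self):
        self.live = []         # live node names, in start order
        self.gateways = set()  # names of nodes configured as gateways
        self.master = None     # name of the current master gateway node

    def start(self, name, gateway):
        self.live.append(name)
        if gateway:
            self.gateways.add(name)
        if self.master is None:
            self.elect()
        # Otherwise the node discovers the existing master and registers with it.

    def terminate(self, name):
        self.live.remove(name)
        if name == self.master:
            # Remaining gateways notice the failed node heartbeat and re-elect.
            self.master = None
            self.elect()

    def elect(self):
        # Assumption for illustration: the earliest-started live gateway wins.
        # Worker nodes are never candidates.
        self.master = next((n for n in self.live if n in self.gateways), None)


domain = ToyDomain()
domain.start("gw1", gateway=True)
domain.start("gw2", gateway=True)
domain.start("wk1", gateway=False)
domain.terminate("gw1")
print(domain.master)
```

In this model a domain left with only worker nodes has no master, which matches the point that worker nodes do not provide high availability.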

Clients may not be able to connect to Informatica Services temporarily when there is no elected master in the domain.
Note
Worker nodes do not provide domain high availability. During startup, a worker node attempts to connect to every gateway node in the domain to identify the current master, and registers with the master once it is identified.

Database Heartbeat

The master node runs a heartbeat update query on the domain database periodically to persist its liveness. This is critical for other gateway nodes to be aware of a live master node.
All gateway nodes perform this heartbeat during the master election routine. This heartbeat informs any electing node of the other participating nodes.
The master node also uses this heartbeat to test the availability of the domain database.
Database heartbeat periodicity is controlled by the domain-level custom property MasterDBRefreshInterval. The default (and minimum) value is 8 seconds.

Database heartbeat timeout

When the master node terminates unexpectedly, the other gateway nodes wait for a timeout before attempting to become the new master. This timeout is 4 x MasterDBRefreshInterval (32 seconds by default). If a temporary glitch delays the master's heartbeat or causes it to fail, the higher timeout value increases tolerance and avoids triggering a spurious re-election.

When the master node fails to update the database within the timeout, it gives up the master role and terminates itself. This ensures consistent behavior with other nodes that could be electing a master (especially in situations when the master node loses network connectivity).

Note

In a domain with a single gateway node (where domain HA and master election do not apply), the heartbeat timeout is 12 x MasterDBRefreshInterval. Here, this heartbeat is only used as a test for availability of the domain.
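The timeout arithmetic in this section can be written out as a small helper. This is a sketch based only on the multipliers and the 8-second default stated above; the function name and its parameters are hypothetical and are not part of any Informatica tool.

```python
MIN_MASTER_DB_REFRESH_INTERVAL_S = 8  # default and minimum, as stated above


def db_heartbeat_timeout_seconds(refresh_interval_s=8, gateway_count=2):
    """Return the database heartbeat timeout in seconds.

    4 x MasterDBRefreshInterval with multiple gateways,
    12 x MasterDBRefreshInterval in a single-gateway domain.
    """
    if refresh_interval_s < MIN_MASTER_DB_REFRESH_INTERVAL_S:
        raise ValueError("MasterDBRefreshInterval cannot be below 8 seconds")
    if gateway_count < 1:
        raise ValueError("a domain needs at least one gateway node")
    multiplier = 12 if gateway_count == 1 else 4
    return multiplier * refresh_interval_s


print(db_heartbeat_timeout_seconds())                 # multi-gateway default
print(db_heartbeat_timeout_seconds(gateway_count=1))  # single-gateway default
```

With the default interval this gives 32 seconds for a multi-gateway domain and 96 seconds for a single-gateway domain.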

Node heartbeat

The master node has to know which nodes are alive to maintain service status and availability. All non-master nodes update the master node periodically to express their liveness.

Node heartbeat is controlled by the node command-line option “-Dinfa.masterUpdateTimeInterval” (configured via the INFA_JAVA_OPTS environment variable) and defaults to 15000 milliseconds.
Note

This has to be configured to the same value on all nodes (gateways and workers alike).
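Since the value must match on every node, it can help to compare the option across nodes. The sketch below assumes the option is passed in the usual JVM form -Dinfa.masterUpdateTimeInterval=<milliseconds> inside the INFA_JAVA_OPTS string; the helper names and the sample node names are hypothetical.

```python
import shlex

PROPERTY = "infa.masterUpdateTimeInterval"
DEFAULT_INTERVAL_MS = 15000  # default stated above


def node_heartbeat_interval_ms(infa_java_opts):
    """Extract the node heartbeat interval from an INFA_JAVA_OPTS string."""
    prefix = "-D" + PROPERTY + "="
    value = DEFAULT_INTERVAL_MS
    for token in shlex.split(infa_java_opts or ""):
        if token.startswith(prefix):
            value = int(token[len(prefix):])  # last occurrence wins
    return value


def intervals_consistent(opts_by_node):
    """True if every node resolves to the same heartbeat interval."""
    intervals = {node_heartbeat_interval_ms(o) for o in opts_by_node.values()}
    return len(intervals) <= 1


opts = {
    "gateway1": "-Xmx1g -Dinfa.masterUpdateTimeInterval=30000",
    "worker1": "-Xmx1g",  # falls back to the 15000 ms default
}
print(intervals_consistent(opts))
```

Here the check fails because worker1 silently uses the default while gateway1 uses 30000 milliseconds.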

Node heartbeat timeouts

Master node behavior              

If the master does not receive a heartbeat from a node within 6 x NodeHeartBeat (90 seconds by default), it marks the node (as well as the services running on that node) as inactive/dead and attempts to start the services on other live nodes (as configured).

In the event of an unexpected non-master node termination, the master takes 6 x NodeHeartBeat before marking the node as dead.
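The master-side check can be pictured as a comparison of each node's last heartbeat time against the 6 x NodeHeartBeat window. This is a minimal sketch of that rule, not the Service Manager's implementation; the function name, the dictionary layout and the node names are hypothetical.

```python
def nodes_considered_dead(last_heartbeat_s, now_s, interval_ms=15000):
    """Return nodes whose last heartbeat is older than 6 x the node heartbeat interval.

    last_heartbeat_s maps node name -> time of last received heartbeat (seconds).
    """
    timeout_s = 6 * interval_ms / 1000.0
    return sorted(
        node for node, seen in last_heartbeat_s.items()
        if now_s - seen > timeout_s
    )


last_seen = {"gateway2": 100.0, "worker1": 5.0}
print(nodes_considered_dead(last_seen, now_s=100.0))
```

With the default interval, worker1 was last heard from 95 seconds ago, which exceeds the 90-second window, so only worker1 is reported.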

Non-Master node behavior

If a non-master node fails to send its heartbeat to the master (error/timeout), it assumes that the master node is not available and re-runs master election. This no-wait model ensures that the other gateway nodes do not waste time before re-electing a master.

If a temporary glitch caused the heartbeat message to fail or time out, the node discovers during the election that the old master is still alive and continues reporting to it.

Common Database heartbeat failures

As explained in the previous section, the master node terminates itself if it fails to run heartbeat query within the timeout. This is observed in node.log as follows:

This occurs in the following situations:

Error in accessing domain database

When the domain database is unavailable (planned or unplanned) for longer than the timeout period, the heartbeat fails on the node and errors are logged in node.log and exceptions.log.

Errors (SQLException) will be logged in node.log and exceptions.log before the termination message.

Network/communication errors

Network errors such as Connection refused (when the database server is not accessible), No route to host (when the Informatica host loses network connectivity), and Connection timed out (when a TCP timeout occurs) will be logged in node.log and exceptions.log.

Timeouts

When the heartbeat thread in the master node does not update the database within the timeout, the node terminates.
A fatal message is logged in node.log when the query is not run within the timeout.

The following situations can cause the timeout:
  • Network issues
    The network at either end (Informatica host or database host), or any network element in between, can cause timeout errors.
  • Resource crunch (and/or) node process starvation
    Saturated utilization of system resources on the Informatica host (such as CPU, memory, disk, network) can starve the node process, causing heartbeat threads to time out.
  • Java garbage collection pauses
    The Informatica node is a Java process, and Java's garbage collection threads might suspend the application involuntarily for longer than the timeout.

Common Node heartbeat failures

By design, node heartbeat is interpreted differently by master node and non-master nodes. Master node gets to know the status of all other nodes in the domain so that it can ensure service availability. Non-master nodes piggyback on node heartbeat to be aware of the availability of master gateway node.
Similarly, failure of a heartbeat message (by non-master nodes) & heartbeat timeout error (by master node) trigger different routines. Below are common situations:

Node heartbeat timeout failure

When the master node fails to get a heartbeat from a non-master node within the timeout, it marks the node as inactive and starts failing over services that were running on the non-master to other available nodes (if applicable). The state of the non-master node process (alive or dead) does not matter: if it is not able to update the master, it is as good as dead.
  • Logs on the non-master node help identify whether the node was terminated for some reason.
  • Logs on the master node and domain logs tell the story of the heartbeat failure.
  • If the non-master node was alive yet marked as inactive, infa9dump on the non-master node and the master node helps identify whether there is any blocked communication between the two nodes.

Node heartbeat failure

When a single heartbeat message from non-master node is not delivered on time, the non-master questions the availability of the master and starts master re-election (in case of gateway node) or searches for a new master (worker node). However, it continues to run the services running on the node during this period.

In case of a temporary network glitch, the heartbeat failure will not have any impact on the domain or its services.
node.log shows an error as follows:

ERROR [Domain Monitor] [DOM_10025] The node cannot send heartbeat messages to the master gateway node. 
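To see how often a node hit this condition, node.log can be scanned for the DOM_10025 message code shown above. The sketch below only matches on that code; it makes no assumption about the timestamp or the rest of the line layout, and the function name is hypothetical.

```python
HEARTBEAT_FAILURE_CODE = "DOM_10025"  # message code from the log line above


def heartbeat_failure_lines(lines):
    """Return the log lines that report a failed node heartbeat."""
    return [line.rstrip("\n") for line in lines if HEARTBEAT_FAILURE_CODE in line]


sample = [
    "INFO some unrelated message\n",
    "ERROR [Domain Monitor] [DOM_10025] The node cannot send heartbeat "
    "messages to the master gateway node.\n",
]
print(len(heartbeat_failure_lines(sample)))
```

For a real file, the same function can be fed an open file object, for example heartbeat_failure_lines(open(path_to_node_log)), where path_to_node_log is wherever node.log lives on that node.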

Troubleshooting

Heartbeat failures and timeouts can have different causes, such as Informatica configuration or related issues, and system-level causes such as resource utilization, environment, and network issues.
Following are some common steps to isolate the issue to system-level causes.

Troubleshooting Informatica application related issues

InfaLogs

Analyzing logs from different components and nodes helps piece together the full root-cause story. Collecting InfaLogs, including domain and service logs, from all nodes in the domain helps understand the complete situation.
Note

Logs from an unaffected node also help in understanding the pattern of affected vs. unaffected parts of the domain.

JDBC Spy logging can be enabled to debug the queries executed and the relevant errors/delays for database heartbeat failures (however, this does not help with database connectivity issues).

InfaDump

InfaDump on the node process collects diagnostics such as thread and heap dumps that help perform deeper analysis. This is useful for debugging situations where threads are unexpectedly blocked or stuck while performing the heartbeat, causing timeouts/failures even though the processes are still running.

Java GC logging

Enabling Java’s GC log (command-line option) for the node process will help identify whether a GC pause is the cause of the issue. Typically, long GC pauses are due to an unsuitable -Xmx setting for the node process. The default value of 512m is the minimum for a node, but the necessary size could be higher based on the number of services and users in the domain.

In rare situations, this can be due to a memory leak in the node process. Collecting infa9dump on the node process will help isolate memory leaks.
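Once a GC log is available, the pauses can be compared against the heartbeat timeout. The sketch below assumes a HotSpot JVM of Java 8 or earlier started with the -XX:+PrintGCApplicationStoppedTime option, which writes "Total time for which application threads were stopped" lines; that flag and log format are an assumption about the node's JVM and are not taken from this article, and newer JVMs log differently. The function name is hypothetical.

```python
import re

# Line format written by HotSpot (Java 8 and earlier) with
# -XX:+PrintGCApplicationStoppedTime; assumed here, not Informatica-specific.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def pauses_longer_than(lines, threshold_s):
    """Return application pause durations (seconds) at or above threshold_s."""
    pauses = []
    for line in lines:
        match = STOPPED_RE.search(line)
        if match and float(match.group(1)) >= threshold_s:
            pauses.append(float(match.group(1)))
    return pauses


gc_log = [
    "Total time for which application threads were stopped: 0.0002270 seconds, "
    "Stopping threads took: 0.0000300 seconds",
    "Total time for which application threads were stopped: 41.5000000 seconds, "
    "Stopping threads took: 0.0000500 seconds",
]
print(pauses_longer_than(gc_log, threshold_s=32))
```

A threshold of 32 seconds corresponds to the default database heartbeat timeout; any pause at or above it is long enough on its own to explain a master terminating itself.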

Troubleshooting system & network

System resource crunch (and/or) process starvation

Monitoring system resources at a granular level, and identifying resource saturation and process starvation during the window of the timeout, will help confirm or rule out resource starvation as the cause of the heartbeat failure. Monitoring all resource utilization along with load average will show any anomalies. In a virtualized environment, it is critical to monitor the resources actually granted to the guest and not just the provisioned resources. For instance, metrics such as CPU ready % and memory ballooning in VMware help identify starvation caused by virtualization.

Network errors

Network errors such as dropped packets or delayed transfers at either end of the communication (Master host ↔ Non-master host, Master host ↔ Database host), or in any network element in between such as switches, routers, firewalls, or the virtualization layer, can cause timeouts. This can be due to network congestion, broken hardware, configuration issues, etc.

Comparing TCP packet dumps captured at both ends of the communication will help isolate packet delivery problems.
Note
A filter with the IP addresses of the hosts and the port number (the database’s port in case of database heartbeat failure, and the node’s Service Manager port in case of node heartbeat failure) can reduce the size of the capture.
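Such a filter can be expressed in the pcap filter syntax understood by tcpdump and Wireshark capture filters. The helper below is a small sketch that builds the expression; the function name is hypothetical, and the addresses and port in the example are placeholders (documentation-range IPs and a commonly used database listener port), not values from any real domain.

```python
def capture_filter(host_a, host_b, port):
    """Build a pcap filter for TCP traffic between two hosts on one port."""
    if not 0 < int(port) < 65536:
        raise ValueError("port must be between 1 and 65535")
    return "host {} and host {} and tcp port {}".format(host_a, host_b, int(port))


# Placeholder values: master host, database host, database listener port.
print(capture_filter("192.0.2.10", "192.0.2.20", 1521))
```

The resulting string can be passed as the filter expression to a capture tool on each host, for example tcpdump -w heartbeat.pcap followed by the quoted expression, so that the two captures contain only the heartbeat traffic to compare.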

Tuning

The default configuration of heartbeat intervals and timeouts in Informatica is expected to work in robust environments. However, depending on system load and related configuration, there could be many unexpected heartbeat failures (followed by unexpected service unavailability or failover). The heartbeat intervals can be increased to be more tolerant of temporary failures.

Note

Increasing heartbeat intervals has the side effect of delaying failover when a real failure happens. For example, master re-election will take longer after an unexpected master node termination when the database heartbeat interval is increased.
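The trade-off can be quantified with the multipliers given earlier (4 x for the database heartbeat in a multi-gateway domain, 6 x for the node heartbeat). The sketch below compares the defaults with a candidate setting; the function name and the candidate values are hypothetical, and the figures cover only the detection wait, not the time taken by the election or the service failover itself.

```python
def detection_delays(db_interval_s=8, node_interval_ms=15000):
    """Return the waits before a failure is acted upon, in seconds."""
    return {
        "master_reelection_wait_s": 4 * db_interval_s,
        "dead_node_detection_s": 6 * node_interval_ms / 1000.0,
    }


defaults = detection_delays()
tuned = detection_delays(db_interval_s=16, node_interval_ms=30000)
for name in defaults:
    print(name, defaults[name], "->", tuned[name])
```

Doubling both intervals doubles the tolerance to glitches, and also doubles how long a genuinely failed master or node goes unnoticed (32 to 64 seconds and 90 to 180 seconds respectively).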

Following is a list of tunables relevant to Informatica heartbeats:
