Real World Fail Over Services

Background

As of the most current release of Directory Integrator (6.0), Fail Over Services (or FoS) consists of the ability to restart an AssemblyLine when it crashes due to an exception that you couldn't catch. While this is better than nothing, and fills a need for some users, others may need more. For example, what happens if your primary data center goes down (power outage, hard drive crash, something worse...)? You need High Availability (HA). You need Continuity of Business (CoB). Whatever you (or your company) may call it, you need another DI server, somewhere else, that can realize the primary has gone down and that it should become the new primary.

Eddie Hartman of IBM has produced some excellent, and ingenious, custom AssemblyLines that will do "real world FoS" -- basically what we are trying to accomplish here. But this article will show you (what we believe to be) a much simpler, cleaner, and more robust way of doing it. For more information on Eddie Hartman's FoS, please contact your IBM representative.

Desired Functionality

The HA design we cover here will take into account the following considerations:

  1. A Secondary server will exist in a hot-standby mode, waiting for the Primary to fail
  2. The Secondary server will automatically start up once the Primary fails
  3. The communication between the two servers should be robust
  4. The Secondary server should not duplicate any of the work done by the Primary, if at all possible
  5. If the Primary server comes back up, it should not interrupt the Secondary server until the Secondary relinquishes its status
  6. The DI config files should be the same for both the Primary and the Secondary servers
  7. We will use standard hardware to do the network routing

Let's briefly look at each consideration:

  1. While the optimal architecture would have both servers working at the same time, this isn't a trivial design for DI. One way to have both machines working at the same time is to have a separate HA set of AssemblyLines that post all changes to an MQ infrastructure, and then you can have as many servers working off the same Queue as you want. Unfortunately, you would still need a hot-standby for the DI servers that post to the Queues, and you would also need to implement an HA MQ infrastructure... What you could do, however, is break out tasks into different AssemblyLines, and have each DI server work on just a subset of the AssemblyLines, and when one of the servers goes down, its designated Secondary server will take on the additional load. But that is left for another article, at another time.
  2. This is self-explanatory. An automated way to start up the Secondary server is the most efficient use of time during a server outage. While it is certainly helpful to send notification to an Operations team that the Primary server is down, the goal of this article is to have the Secondary server start up on its own, right away, to minimize any down time. If you use more than one backup server (e.g. you want a total of 5 servers), the process of notifying each server won't dramatically change the example in this article. However, the determination of which server is the Secondary (or Tertiary, Quaternary, or Quinary) becomes slightly more complex. Stand by for a future article that will address this concept, by having the servers join in an election process to choose the "next backup server".
  3. Although it adds overhead, this article uses TCP for the notification of a server's status. Other mechanisms are available (Eddie Hartman's FoS uses the more lightweight SNMP over UDP), but we feel that TCP is the most appropriate, because with an unreliable transport a lost packet or two can produce a false outage scenario.
  4. This is trickier than it seems. There is no guaranteed way to do this. A simple way would be for ALs to use a shared filesystem, but reliability of most shared filesystems is not good enough (especially NFS). Probably the best way to do this is to use a high-end replicated database. In the database, you can either use the DI checkpointing (if you need that level of granularity), or the current event number (e.g. the last changelog number processed). Another way to do this without the replicated database is to put the status of the current entry being processed into the heartbeat message the Primary server sends back to the Secondary (and Tertiary, etc., if necessary) server. For example, instead of just sending out a message "I am alive", you can send out a message "I am alive and currently working on LDAPServer1 changelog number 10, ADServer5 USN 2212, OracleServer3 record xxx", etc.
  5. When the Primary fails, the Secondary becomes the new Primary. When the old Primary comes back online, the router configuration determines the order (whether the old Primary becomes the new Primary, or whether it is relegated to being the new Secondary). See #7.
  6. The best way to deal with a Highly Available architecture is to have all servers with the same capabilities. Using the same config files, each server can start up in the same manner, which is as a Secondary- (or backup-) server. Then, either via config, via the AMC, or via an automated election process (see #2), the server would know whether it is the Primary (and should begin processing), or whether it should remain a backup server.
  7. By using a hardware-based solution (such as a BigIP router or 3DNS) in a failover configuration, the active IDI server can always be recognized by one IP address (or DNS name).

Now let's hold on a second, and take a step back. Why do we need to build an HA solution via DI code? Why can't we do the same thing we do with web servers -- put them behind a 3DNS box, or a BigIP box, and let the DNS/router handle the failover for us? Well, it depends on your situation. If you are using DI in a scenario that solely runs when another server (or a user) connects to it, like a SOAP server, then the answer is "you can". Just configure the DNS/router like you do for web servers, and you're done. However, if you use DI for scenarios in which the DI server is the initiator, like polling the changelog on an LDAP server, ActiveDirectory server, database, etc., then you can't rely on incoming connections being re-routed to available DI servers (because there aren't any incoming connections!).

This is a scenario in which incoming connections initiate your AssemblyLines, and you can solve your HA problems with a router:

However, many implementations don't have a web server-style configuration. They poll changelogs or listen to MQ queues. If that's your implementation, then it looks more like this:


(DI Tertiary Server removed for clarity)

Notice that there aren't any connections between the Secondary server and the end systems. This is because you can't have the backup servers running at the same time as the Primary. If you did, then all DI servers would be polling the end systems, and all servers would be performing the same synchronization, causing duplicates or errors. You can only prevent this by having "cold standby" servers, where you manually start them up when you've received notification that the Primary has failed. The ideal scenario would be to have the functionality of the "web server style" configuration (meaning the router does all the HA work, and you have "hot standby" servers) when your AssemblyLines are written the way they are now -- initiating the connections to the end systems. What this article will do for you is show you a simple way to get this ideal scenario -- the best of both worlds.

Since Directory Integrator is so good at listening for information, as well as transmitting information, we will build an AssemblyLine that will automatically start up your existing AssemblyLine(s) when the Primary server has gone down. If we insert a router into our topology, and configure it for failover, then any incoming connections will be sent to the appropriate DI server, right? But we just said earlier that our AssemblyLine doesn't listen for incoming connections. Let's remedy that by building RealWorld FoS that can perform the following tasks:

  • Sends out a heartbeat
  • Listens for that heartbeat to come back
  • Controls whether to start up or shut down your existing AssemblyLine (based upon whether it hears the heartbeat)

Very simple, very clean, no? To follow in our diagrams from above, what we want is:


(DI Tertiary Server removed for clarity)

In this architecture, each DI server sends a heartbeat to the router. The router determines who the recipient of the heartbeat is -- either the Primary server or the Secondary (or the Tertiary, etc.). Therefore, the logic that should be in the Controller FoS AssemblyLine is this:

  • If you hear a heartbeat, then you are the Primary, so you should be active
  • If you don't hear a heartbeat, then you are the Secondary, so you should be shut down
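
To see how these two rules produce a failover, here is a minimal sketch in plain JavaScript (no DI classes) that simulates two servers and a router. The `pickTarget` function is a hypothetical stand-in for a router in failover mode, and the 30-second frequency with a 2-second allowance is an assumption taken from the values used later in this article; none of these names are part of DI.

```javascript
// Assumed values: a 30-second heartbeat plus a 2-second allowance for lag.
var FREQUENCY_SECONDS = 30;
var GRACE_SECONDS = 2;

// True when the last heartbeat heard is recent enough to call
// this server the Primary.
function hearsHeartbeat(lastBeatSeconds, nowSeconds) {
  if (lastBeatSeconds === null) {
    return false; // nothing heard yet, so stay a backup
  }
  return (nowSeconds - lastBeatSeconds) <= FREQUENCY_SECONDS + GRACE_SECONDS;
}

// Hypothetical stand-in for the router in failover mode: the first
// server in the list that is up receives all traffic.
function pickTarget(servers) {
  for (var i = 0; i < servers.length; i++) {
    if (servers[i].up) {
      return servers[i];
    }
  }
  return null;
}

// One heartbeat interval: every running server sends a beat to the
// router address, then every running server decides its role.
function tick(servers, nowSeconds) {
  var target = pickTarget(servers);
  if (target !== null) {
    // A target exists only if some server is up, and that server
    // sends a beat, so the target hears at least one.
    target.lastBeat = nowSeconds;
  }
  for (var i = 0; i < servers.length; i++) {
    var s = servers[i];
    s.active = s.up && hearsHeartbeat(s.lastBeat, nowSeconds);
  }
}

var primary = { name: "Primary", up: true, lastBeat: null, active: false };
var secondary = { name: "Secondary", up: true, lastBeat: null, active: false };
var servers = [primary, secondary];

tick(servers, 0);
var before = [primary.active, secondary.active];

primary.up = false; // the Primary's data center goes down
tick(servers, 30);
var after = [primary.active, secondary.active];

console.log(before, after);
```

In the first interval only the Primary hears a beat, so only it is active. Once the Primary is down, the router hands the Secondary its own beat, and the Secondary activates without either server knowing its rank.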

Again, simple and straightforward. Now, let's build that with DI.

Implementation

  1. Create an AssemblyLine called SendHeartbeatAL
  2. Create an AssemblyLine called ListenForHeartbeatAL
  3. Create an AssemblyLine called ControllerAL

Looking first at the SendHeartbeatAL, what do we want this to do? We want to send out a heartbeat every n seconds. For the purposes of this article, let's assume 30 seconds is adequate. Better yet, modify the following code to put the n value in a Properties file and make it easily configurable. Add a ScriptConnector in Iterator mode, and let's call it SleepAndBeatConnector. We want it to sleep for 30 seconds, and then wake up and send out a heartbeat. The script code can look like this:

function selectEntries()
{
}


function getNextEntry()
{
  var heartBeatFrequency = 30;


  system.sleep(heartBeatFrequency);


  // OK -- 30 seconds has passed. Let's send a beat
  // Let's also assume that the router interface to
  // the DI servers is known as myIDI.mycomp.com
  // and the port we want to use is 1234
  // (you can choose any port number, preferably
  // above 1024)


  var tcp = new java.net.Socket("myIDI.mycomp.com", 1234);

  var out = new java.io.BufferedWriter(
                    new java.io.OutputStreamWriter(tcp.getOutputStream()));

  out.write("Beat\r\n");

  out.flush();  // Always flush buffers!
  out.close();  // Close the writer once the beat has been sent
  tcp.close();  // Close the socket so each beat releases its connection
}
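
The hard-coded frequency, hostname, and port above are the values this article suggests moving into a Properties file. As a rough illustration of that idea, the following plain JavaScript sketch parses a simplified `key=value` format and falls back to the article's defaults. The property names (`heartbeat.frequencySeconds`, `heartbeat.host`, `heartbeat.port`) and both helper functions are hypothetical; inside DI you would more likely read the file through the server's own property handling or `java.util.Properties`.

```javascript
// Simplified parser for "key=value" lines; '#' and '!' start comments.
// This is not a full implementation of the Java properties format.
function parseProperties(text) {
  var props = {};
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].trim();
    if (line === "" || line.charAt(0) === "#" || line.charAt(0) === "!") {
      continue;
    }
    var eq = line.indexOf("=");
    if (eq < 0) {
      continue; // ignore lines without a separator
    }
    props[line.substring(0, eq).trim()] = line.substring(eq + 1).trim();
  }
  return props;
}

// Hypothetical property names; defaults match the values used in this article.
function heartbeatSettings(props) {
  var frequency = parseInt(props["heartbeat.frequencySeconds"], 10);
  var port = parseInt(props["heartbeat.port"], 10);
  return {
    frequencySeconds: isNaN(frequency) ? 30 : frequency,
    host: props["heartbeat.host"] || "myIDI.mycomp.com",
    port: isNaN(port) ? 1234 : port
  };
}

var sample = [
  "# FoS settings shared by all three AssemblyLines",
  "heartbeat.frequencySeconds = 15",
  "heartbeat.host = myIDI.mycomp.com"
].join("\n");

var settings = heartbeatSettings(parseProperties(sample));
console.log(settings.frequencySeconds, settings.host, settings.port);
```

Because the sample omits `heartbeat.port`, the sketch falls back to 1234. Keeping the defaults in one function means the sender, listener, and controller cannot drift apart.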

For the second AL, ListenForHeartbeatAL, let's add a TCPConnector in Iterator mode called ListenForBeatConnector. In this connector, set the TCP Port to the one specified in your previous script (1234, in this case, but better yet, put the port number and hostname in a Properties file and use that for the greatest flexibility). Check the box for Server Mode. In the After GetNext hook, add the following code:

var ins = conn.getProperty("inp");
var str = ins.readLine();


// readLine() returns null if the sender closed the
// connection without sending a line
if (str != null && str.equals("Beat"))
{
  // Received a beat
  // so let's leave a note for the Controller
  thisConnector.connector.setParam("LastHeartBeat",
                 java.lang.System.currentTimeMillis());


  // We've now left the timestamp of the most
  // recent beat we've heard
}

So far, so good, right? We have one AL that sends out a heartbeat every 30 seconds, and another that listens for a heartbeat and keeps track of the last time one was heard. The final step is the Controller, which will make a determination about what to do (e.g. start or stop your existing AL).

In the final ControllerAL, add a ScriptConnector in Iterator mode called ControllerConnector. In this connector, add the following script and we're done!

function selectEntries()
{
}

function getNextEntry()
{
  // Here, we re-use the variable, that's why it
  // should go in a Properties file -- change it
  // once and it's changed everywhere!
  var heartbeatFrequency = 30;

  // Actually, all of these variables should be
  // in a Properties file
  var heartbeatListenerALName = "ListenForHeartbeatAL";
  var heartbeatListenerConnectorName = "ListenForBeatConnector";
  // Below, use the name of your AssemblyLine
  // Make sure you match the upper/lower case
  var mainALName = "MainAL";

  system.sleep(heartbeatFrequency);

  // Now we get pointers to each running AL so
  // that we can query and control them
  var ALs = system.getRunningALs();
  var anAL = null;
  var ListenerAL = null;
  var MainAL = null;

  for (var i = 0; i < ALs.size(); i++)
  {
    anAL = ALs.elementAt(i);
    if (anAL.getShortName().equals(heartbeatListenerALName))
       ListenerAL = anAL;
    else if (anAL.getShortName().equals(mainALName))
       MainAL = anAL;
  }

  if (ListenerAL == null)
  {
    // We have a problem -- the ListenerAL
    // should never be stopped.  Let's
    // start it by name and check again
    // on the next pass:
    system.getServer().startAL(heartbeatListenerALName);
    return;
  }

  var lastBeat = ListenerAL.getConnector(heartbeatListenerConnectorName)
                                    .connector.getParam("LastHeartBeat");

  var now = java.lang.System.currentTimeMillis();

  // Times are still in milliseconds, so
  // divide by 1000 to get seconds:
  var difference = (now - lastBeat) / 1000;

  // Sometimes there is network latency or
  // processor lag, so add 2 secs to cover any
  // external factors
  if (difference <= (heartbeatFrequency + 2))
  {
    // Heartbeat is within normal frequency,
    // and it is coming to us, so we must be the
    // Primary server.  Make sure we are running:
    if (MainAL == null) // then we are not running
      system.getServer().startAL(mainALName);
  }
  else
  {
    // Opposite scenario -- heartbeat is stale,
    // and therefore we are sending them to a
    // different server, hence we are a backup
    // server.  Make sure we are *not* running:
    if (MainAL != null) // then we *are* running
    {
      MainAL.shutdown();
      MainAL.join(); // Wait for shutdown
    }
  }
}

And that's it! When you start up IDI, make sure that you start all 3 of these AssemblyLines, but not yours. Yours will automatically be started after approximately 30 seconds (or whatever heartbeat frequency you chose) on the server that is receiving the heartbeats. This same code can be deployed to all of your servers, as well as the same Properties files, etc. If your organization uses any form of automated source code control, it will be very convenient to have the exact same files deployed to each server. The router configuration will handle the failover for you, so you can start any server (or any number of servers) at any time.

In addition, when you are upgrading code, you can deploy the new versions to the backup machines, and then take your Primary down for the upgrade. The moment you stop the Primary server, the Secondary will take over with your upgraded code set. If you notice immediate problems, you can simply start the Primary server again with the old code, or go ahead and update the Primary and bring it back online. None of the servers needs to know which is the Primary -- the router does all the work for you.

What's Next

Now that you have Real World Failover Services, you can improve upon this by customizing the heartbeat. If you poll a changelog (for example, via an LDAP Changelog connector), you can pass the current changenumber in the heartbeat and parse it out in the Listener, updating that server's lastchangenumber.
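
One possible shape for such a customized heartbeat is sketched below in plain JavaScript. The wire format (the word `Beat` followed by `|`-separated `name=value` pairs) and the helper names `buildBeat` and `parseBeat` are assumptions made for illustration only. The listener shown earlier compares the whole line to "Beat", so it would need to switch to a parser like this one.

```javascript
// Build a heartbeat line such as "Beat|LDAPServer1=10|ADServer5=2212".
// The format is hypothetical; names must not contain '|' or '='.
function buildBeat(positions) {
  var parts = ["Beat"];
  for (var name in positions) {
    if (Object.prototype.hasOwnProperty.call(positions, name)) {
      parts.push(name + "=" + positions[name]);
    }
  }
  return parts.join("|");
}

// Parse a heartbeat line. Returns null when the line is not a beat,
// otherwise an object mapping each source name to its last position
// (kept as a string, since not every position is numeric).
function parseBeat(line) {
  if (line === null || line === undefined) {
    return null;
  }
  var parts = String(line).trim().split("|");
  if (parts[0] !== "Beat") {
    return null;
  }
  var positions = {};
  for (var i = 1; i < parts.length; i++) {
    var eq = parts[i].indexOf("=");
    if (eq > 0) {
      positions[parts[i].substring(0, eq)] = parts[i].substring(eq + 1);
    }
  }
  return positions;
}

var line = buildBeat({ LDAPServer1: 10, ADServer5: 2212 });
var heard = parseBeat(line + "\r\n");
console.log(line, heard.LDAPServer1, heard.ADServer5);
```

A plain "Beat" line still parses, yielding an empty set of positions, so servers that have not been upgraded to send positions remain compatible with the new listener.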

Copyright © 2012 Cubic Consulting, LLC. All rights reserved.