Infrastructure and Application Monitoring on a Global Basis

  • The Problem

    Information-intensive businesses with global reach often have tens of thousands of servers scattered across the globe. These servers are connected to each other and to the outside world via a myriad of routers and switches. Simply keeping track of all the equipment can be a challenging job. Making sure that hundreds or thousands of disparate applications running on the network are executing properly is a huge task, but one that, if undertaken properly, can pay enormous dividends to the organization with the vision to see it through.

  • Economies of Scale

    The Sentinel product is designed with just such organizations in mind. The core Sentinel technology is designed to scale. Unlike some competing products, Sentinel uses small amounts of the vast distributed compute power of the network that it monitors. Specifically, Sentinel deploys lightweight agents on every server throughout the network. These agents can be configured not only to collect and report operational data, but also to carry out complex monitoring rules and report back application-specific state information. All of this is done in a way that minimizes Sentinel's use of resources and allows vast swaths of the network to be monitored by a single pair of redundant Sentinel servers. A single pair of Sentinel servers can easily monitor upwards of 10,000 machines.

  • Depth and Breadth

    Successful application monitoring on any scale requires a great depth of functionality. Sentinel has built up a huge library of monitoring methods over the years, so that out of the box Sentinel is capable of 'seeing' into virtually all of the generic infrastructure variables (CPU, DISK, MEMORY, NETWORK, TCP, UDP, SNMP, etc.) and offers a host of application-specific methods supporting Web Services, Database Services, etc. Sentinel agents can be handed rules that test the machine and network in which they are operating and report the results of these tests back to the Sentinel servers.

  • Partitioning the Problem

    Application monitoring on a global scale requires an organizational approach that might be unnecessary (though we would still highly recommend it!) in a smaller shop where only a few key applications are to be monitored. Traditionally, monitoring has been done on a machine-by-machine basis, because monitoring has typically been driven by infrastructure groups in large organizations. While Sentinel can easily be configured to support such an approach, it can also be configured to support an approach that is based on the concept of the application rather than the machine. In this approach a set of machines can be organized into an application group, and the monitoring rules, Dashboard Views and Notifications associated with the application 'travel' together. Multiple applications can share the same machines, and the rule sets that they use will happily run alongside each other in the Sentinel agent operating on the shared machine. This allows application groups to develop their specific monitoring methods without having to worry whether they are colliding with another group that might have access to the same physical box. Under this approach, infrastructure monitoring is just another application, though its machine group may encompass the entire estate.

  • Code Sharing

    Sentinel supports a few other key concepts for application monitoring in large organizations. The first is rule templating. The rules that are executed by the distributed Sentinel agents are, at heart, simple conditional statements that cause the agent to report an event if the condition is true. For example, the canonical CPU monitoring rule would have a condition clause that looks like: when (CPU > value), where value is some number between 0 and 100 corresponding to the percent usage threshold above which the rule fires and reports the event. Sentinel allows for the creation of rule templates (and comes with a library of them). A rule template allows for text substitution, so that any number of executable rules with different values can be created from the same template. In the CPU example the template would have the condition clause: when (CPU > $value$). A specific instantiation of the rule replaces $value$ with the number that is appropriate for the specific machine and application. This templating concept allows one specification to be re-used by any number of applications and machines.
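
    The substitution step can be illustrated with a minimal Python sketch. This is not Sentinel's own implementation: the instantiate function and the PLACEHOLDER pattern are hypothetical names introduced here, and only the clause when (CPU > $value$) is taken from the text above.

```python
import re

# The template clause quoted in the text; everything else here is illustrative.
CPU_TEMPLATE = "when (CPU > $value$)"

# Assumption: a placeholder is a word wrapped in dollar signs, e.g. $value$.
PLACEHOLDER = re.compile(r"\$(\w+)\$")


def instantiate(template: str, **values: object) -> str:
    """Replace every $name$ placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder ${name}$")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


# Two executable rules produced from the same template.
web_rule = instantiate(CPU_TEMPLATE, value=90)
db_rule = instantiate(CPU_TEMPLATE, value=75)
print(web_rule)
print(db_rule)
```

    Raising an error for a missing value, rather than leaving the placeholder in place, is a design choice made for this sketch so that a half-instantiated rule is never distributed.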

    Sentinel also supports the concepts of rule sets and machine sets. A rule set can be defined as a particular set of rule template instantiations. For example, in an infrastructure monitoring application there might be one CPU rule, one MEM rule, one DISK rule, etc. Each of these rules will have its threshold set to a value appropriate for the monitoring task at hand. These rules are joined together in a set. Likewise, a set of machines can be defined. Sentinel can then automatically perform a join operation between rule set and machine set, whereby each machine in the set is placed on the distribution list for every rule in the set. This allows a single compact definition file to cause any set of rules to be distributed to any set of machines. This is particularly powerful for applications with large groups of similarly configured machines.
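
    The join between a rule set and a machine set amounts to a cross product. The Python sketch below shows the idea under stated assumptions: the machine names, the MEM and DISK clause syntax, and the join function are all hypothetical, and Sentinel's actual definition file format is not shown in this document.

```python
from itertools import product

# Hypothetical rule set: rule name -> instantiated condition clause.
rule_set = {
    "cpu": "when (CPU > 90)",
    "mem": "when (MEM > 80)",
    "disk": "when (DISK > 95)",
}

# Hypothetical machine set.
machine_set = ["web01", "web02", "db01"]


def join(rules, machines):
    """Put every machine on the distribution list for every rule in the set."""
    distribution = {machine: [] for machine in machines}
    for machine, rule_name in product(machines, rules):
        distribution[machine].append(rule_name)
    return distribution


distribution = join(rule_set, machine_set)
for machine, rule_names in distribution.items():
    print(machine, rule_names)
```

    Adding a machine to machine_set or a rule to rule_set changes the result of the join without touching any per-machine configuration, which is the maintenance saving the paragraph describes.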

  • Automatic Visualization

    Finally, Sentinel also offers a tool to automatically generate Dashboard Views from the combination of rule sets and machine sets. These automatically generated Dashboards may be combined with other Dashboards to form arbitrarily complex views into the technology estate. Adding or removing rules from rule sets, or machines from machine sets, is trivial, so the ongoing maintenance of the monitoring suite is minimized.

Business Process Modelling and Monitoring (BPM2)

  • The Problem

    Businesses frequently have mission-critical processes that run periodically and whose final goal state requires a carefully orchestrated cascade of subprocesses, each of which is a potential point of failure. As an example of this kind of process, consider a brokerage firm that must value its clients' positions every night so that margin and buying power limits can be set properly for the subsequent trading day. Every night positions must be tied out and priced, and cash movement in and out of the account must be tracked. This process will minimally depend upon the day's trading data, external pricing feeds and account cash transfers. A failure of any element in this reconciliation process can at best lead to customer dissatisfaction and at worst lead to catastrophic business failure.

  • Solution Overview

    As a generalized solution to this common business problem, Sentinel provides a powerful and elegant way to both organize and monitor the flow of business processes in real time, allowing for complex dependencies in distributed compute environments. Using a simple definitional framework, Sentinel will generate and distribute monitoring rules across an arbitrary number of servers and create dynamic process trees for the Sentinel Dashboard. These trees allow key personnel to monitor the real-time state of any critical business process and to be notified of potential and actual failures at the earliest moments in the process, so that remediation can be effected.

  • A Derivation of Method

    Following is a discussion of Sentinel's BPM2 paradigm as it can be applied to a simplified real-world process. A process is composed of a set of related subprocesses that we will represent as a collection of distinct process elements. These process elements may execute on the same machine or on different machines anywhere on the network. A single element in the collection is the terminal element or goal state. Sentinel's BPM2 starts with the idea that each process element has four distinct states. They are:

    1. Standby
    2. Running
    3. Completed
    4. Failed

    The progression of states is sequential. Once a process begins running it may only complete or fail. On completion or failure the process will remain in that state until an event occurs to reset the process and place it back into the Standby state. When we speak of events in this discussion we mean any state change that can be captured by a Sentinel agent. Examples of events might be a file coming into existence, the clock striking midnight, a process being deleted from the system process table, a particular subject header in an email logfile, etc.
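
    The four states and their allowed progression can be written down as a small state machine. The following Python sketch is an illustration only; the State enum, the ALLOWED table and the transition function are names chosen here and are not part of Sentinel.

```python
from enum import Enum


class State(Enum):
    STANDBY = "standby"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed moves: Standby -> Running -> Completed or Failed -> Standby.
ALLOWED = {
    State.STANDBY: {State.RUNNING},
    State.RUNNING: {State.COMPLETED, State.FAILED},
    State.COMPLETED: {State.STANDBY},
    State.FAILED: {State.STANDBY},
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move breaks the sequence."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


state = State.STANDBY
state = transition(state, State.RUNNING)
state = transition(state, State.COMPLETED)
state = transition(state, State.STANDBY)  # reset event
print(state.name)
```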

  • A Simple Example

    Consider a very simple example, where every day at 2:00pm on a single server a process named twopm kicks off, runs for, say, five minutes and produces one of two output files: twopm.success or twopm.fail. At 3:00pm a clean-up process runs which deletes any of the twopm output files that were created. In Sentinel we would monitor this process by creating four rules that would be executed by the Sentinel agent running on the server in question, and a view running within the Sentinel Dashboard that would serve as a real-time window into the state of the twopm process. The rules would schematically look like this:

    1. condition: time between 3:30pm and 4:00pm action: set process state to STANDBY
    2. condition: (process state = STANDBY) and (time between 2:00pm and 2:30pm) and exists (PROC = twopm) action: set process state to RUNNING
    3. condition: (process state = RUNNING) and (time between 2:00pm and 2:30pm) and FILEEXISTS('twopm.success') action: set process state to COMPLETED
    4. condition: (process state = RUNNING) and (time between 2:00pm and 2:30pm) and FILEEXISTS('twopm.fail') action: set process state to FAILED

    On a Sentinel Dashboard an icon would be created that would respond with colour changes to each of the four process states. The icon would take on the STANDBY colour at around 3:30pm and would remain that way until 2:00pm the following day when the twopm process began to run and showed up in the process table of the server on which it was running. At this point the icon would change to the colour of the RUNNING state. Five minutes later the icon would take on the colour of either the COMPLETED state or the FAILED state depending upon the outcome of the processing.
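
    To make the behaviour of the four schematic rules concrete, the Python sketch below evaluates them against a few hand-written observations. It is a simulation, not agent code: the next_state function, the half-open treatment of each time window, and the way the process table and file system are passed in as plain sets are all assumptions made for illustration.

```python
from datetime import time


def between(now: time, start: time, end: time) -> bool:
    """Assumption: a window includes its start time and excludes its end time."""
    return start <= now < end


def next_state(state: str, now: time, processes: set, files: set) -> str:
    """Apply the four schematic rules once and return the resulting state."""
    # Rule 1: reset window.
    if between(now, time(15, 30), time(16, 0)):
        return "STANDBY"
    in_run_window = between(now, time(14, 0), time(14, 30))
    # Rule 2: the twopm process has appeared in the process table.
    if state == "STANDBY" and in_run_window and "twopm" in processes:
        return "RUNNING"
    # Rule 3: the success file exists.
    if state == "RUNNING" and in_run_window and "twopm.success" in files:
        return "COMPLETED"
    # Rule 4: the failure file exists.
    if state == "RUNNING" and in_run_window and "twopm.fail" in files:
        return "FAILED"
    return state


# Hypothetical observations: (clock time, process table, files present).
observations = [
    (time(14, 0), {"twopm"}, set()),
    (time(14, 5), set(), {"twopm.success"}),
    (time(15, 30), set(), set()),
]

state = "STANDBY"
history = []
for now, processes, files in observations:
    state = next_state(state, now, processes, files)
    history.append(state)
print(history)
```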

  • Adding Messages and Notifications

    In addition to lighting up icons on a Sentinel Dashboard, we would also like to be able to send notifications to personnel interested in the ongoing state of the process and to have the messages attached to the state changes stored in the Sentinel Event History Database. So we extend the above schematic with a message attached to the state change and a list of interested parties to be notified. The new schematic for, say, the STANDBY state would look like:

    • condition: time between 3:30pm and 4:00pm
    • action : set process state to STANDBY
    • message : we are now in standby state
    • notify : process_operations_group

    Or a bit more compactly:

    • state : condition (...) : action (...) : message (...) : notify (...)

    Since actions are always the same with regard to state changes (we always go sequentially from Standby -> Running -> Completed or Failed -> Standby) we don't really have to specify an action, just the current state. So our final compaction looks like:

    • state : condition (...) : message (...) : notify (...)

    In the real world things are often not as simple as this basic example. One big difference is that even an atomic process can fail for any number of reasons. In the example we used above there was only one failure path, and that was conditional on the file twopm.fail being created. Suppose the process could also fail if the server lost network connectivity. We would want to be able to add to the failure condition the possibility of, say, ethernet port 0 going down. We could easily do this by simply 'or-ing' in the network failure condition. Rewriting for organizational purposes, and allowing this process element to have a name, our new schematic model would become:

    
    process_element name
    {
        standby  : condition(...) : message(...) : notify(...)
        running  : condition(...) : message(...) : notify(...)
        completed: condition(...) : message(...) : notify(...)
        failed1  : condition(...) : message(...) : notify(...)
        failedN  : condition(...) : message(...) : notify(...)
    }
                

    In this schematic we can have multiple failure paths and each one has its own separate condition, message and notification clause.
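
    One way to picture this schematic as a data structure is sketched below in Python. The Clause and ProcessElement classes, the facts dictionary, and the notification group name "operations" are hypothetical and chosen only for illustration; Sentinel's real definition syntax is the schematic shown above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A condition is modelled as a function over a dictionary of observed facts.
Condition = Callable[[Dict[str, object]], bool]


@dataclass
class Clause:
    condition: Condition
    message: str
    notify: str


@dataclass
class ProcessElement:
    name: str
    standby: Clause
    running: Clause
    completed: Clause
    failed: List[Clause] = field(default_factory=list)  # failed1 .. failedN

    def failure_messages(self, facts: Dict[str, object]) -> List[str]:
        """Return the message of every failure path whose condition holds."""
        return [c.message for c in self.failed if c.condition(facts)]


twopm = ProcessElement(
    name="twopm",
    standby=Clause(lambda f: f["in_reset_window"], "we are now in standby state", "operations"),
    running=Clause(lambda f: "twopm" in f["processes"], "twopm is running", "operations"),
    completed=Clause(lambda f: "twopm.success" in f["files"], "twopm completed", "operations"),
    failed=[
        Clause(lambda f: "twopm.fail" in f["files"], "twopm reported failure", "operations"),
        Clause(lambda f: not f["eth0_up"], "ethernet port 0 is down", "operations"),
    ],
)

facts = {"in_reset_window": False, "processes": set(), "files": set(), "eth0_up": False}
print(twopm.failure_messages(facts))
```

    Keeping each failure path as its own clause, rather than or-ing everything into one condition, lets each path carry a different message and notification target.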

  • Dependency

    As is often the case with a complex business process, elements of the process depend on the successful completion of prior process elements. In the brokerage example above, part of the account valuation process requires closing securities prices. If these prices are imported from an external vendor, the account cannot be valued until closing prices have been received. The process that does account valuation is sequentially dependent upon a subprocess that acquires closing securities prices. To our process_element we add another field (of which there can be several) that defines a sequential dependency between one element and the element that depends on it. A single element may be dependent upon multiple sub-elements, and several elements may in turn be dependent on a single element.
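
    Dependencies of this kind form a directed graph, and a valid execution order can be derived from it. The Python sketch below uses the standard library's graphlib module (Python 3.9 or later) on a hypothetical version of the brokerage example; the element names and the depends_on mapping are assumptions, not Sentinel syntax.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical elements: each key maps to the elements it depends on.
depends_on = {
    "acquire_prices": [],
    "load_trades": [],
    "track_cash": [],
    "value_accounts": ["acquire_prices", "load_trades", "track_cash"],
    "set_margin_limits": ["value_accounts"],
}


def run_order(dependencies):
    """Return one order in which every element follows its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(depends_on)
print(order)
```

    In this sketch set_margin_limits plays the role of the terminal element: nothing else depends on it, so it always comes last.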

  • Putting It All Together

    Now that we have a definition of a process, its elements and their mutual dependencies, Sentinel can do the rest. Sentinel will automatically generate a View where each Icon in the view represents a single element of the dependency graph for the defined process. Each Icon will be sensitive to the current state of the process element to which it is attached. As the process element changes state (driven by the various condition clauses), so too will the Icon. In this way a user will have complete visualization of the ongoing process as it winds towards conclusion. Finally, Sentinel will also generate a series of rules for a correlation agent that will check that the defined dependencies are enforced, so that nodes higher up the tree may not run before the elements they depend on are complete.
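
    The kind of check a correlation agent performs can be sketched as follows. The document does not describe how Sentinel's correlation rules are actually expressed, so this Python function is purely illustrative: find_violations, the element names and the state strings are assumptions.

```python
def find_violations(depends_on, states):
    """Return (element, prerequisite) pairs where the element has started
    although the prerequisite has not reached the COMPLETED state."""
    violations = []
    for element, prerequisites in depends_on.items():
        if states[element] not in ("RUNNING", "COMPLETED"):
            continue  # element has not started, so nothing to enforce yet
        for prerequisite in prerequisites:
            if states[prerequisite] != "COMPLETED":
                violations.append((element, prerequisite))
    return violations


# Hypothetical snapshot: valuation started while prices are still loading.
depends_on = {
    "acquire_prices": [],
    "value_accounts": ["acquire_prices"],
}
states = {
    "acquire_prices": "RUNNING",
    "value_accounts": "RUNNING",
}
violations = find_violations(depends_on, states)
print(violations)
```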

  • End Note

    There is no practical limit to the number of elements or the complexity of the dependencies that can be defined in this manner. Once the process definition file has been created, Sentinel will automatically generate and distribute all the rule sets to the proper servers and configure the correlation agent to monitor for cross-element dependencies. To change the process, simply change the definition file and tell Sentinel to regenerate and distribute all the rules and recreate the Dashboard view.