Event management design concepts

Tap In Systems’ Cloud Management Service gathers events from its managed devices to determine the health of the environment. Events from each type of managed device are gathered and normalized by the Tap In Management Server, which is implemented as an Amazon EC2 instance. Because reports and viewer applications use these events to display availability, performance, and configuration information, the event structure must be able to carry rich data.

The following are the standard fields used in the Tap In event.

  • Class: Indicates the type of managed device.
  • Element Management System: The unique identifier of the Tap In Management Server.
  • Count: The number of occurrences of this event. A count is kept for each unique combination of class, EMS, rule, and type-name pairs.
  • First Time: The first occurrence of this event.
  • Last Time: The last occurrence of this event.
  • Severity: The severity (1 to 5) of this event.
  • Rule: Indicates the type of error or event.
  • Group: Indicates a common group across events from different classes of managed devices.
  • Type-Name pairs: Sets of type-name pairs that identify the specific component that caused this event. The pairs may differ for each class.
  • Message Text: A free form text description of the event.
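
As a rough illustration of how these fields fit together, the sketch below models one event as a Python dataclass. The attribute names, the `ems-01` identifier, the sample values, and the reading of a timestamp such as "01/01.08 09:00" as 1 January 2008 are all assumptions made for the example; they are not taken from the Tap In schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class TapInEvent:
    """One normalized event. Attribute names are illustrative only."""
    event_class: str                          # Class: type of managed device
    ems: str                                  # Element Management System identifier
    count: int                                # occurrences folded into this record
    first_time: datetime                      # first occurrence
    last_time: datetime                       # last occurrence
    severity: int                             # 1 to 5
    rule: str                                 # type of error or event
    group: str                                # common group across device classes
    type_names: Tuple[Tuple[str, str], ...]   # ordered (type, name) pairs
    message: str                              # free-form text

    def __post_init__(self) -> None:
        # Basic sanity checks implied by the field descriptions.
        if not 1 <= self.severity <= 5:
            raise ValueError("severity must be between 1 and 5")
        if self.count < 1:
            raise ValueError("count must be at least 1")
        if self.last_time < self.first_time:
            raise ValueError("last_time precedes first_time")


disk_ok = TapInEvent(
    event_class="TI_Linux_Agent",
    ems="ems-01",                             # hypothetical server identifier
    count=24,
    first_time=datetime(2008, 1, 1, 9, 0),    # assumed date interpretation
    last_time=datetime(2008, 1, 1, 11, 0),
    severity=5,
    rule="Disk",
    group="Web Server",
    type_names=(("Host name", "WEB01"),),
    message="OK - Disk 30% free",
)
```

Storing the type-name pairs as an ordered tuple lets each class carry as many pairs as it needs while keeping the record hashable.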

These fields can be used to integrate any monitored element, cloud or non-cloud, into a central system. A typical set of events from a managed Linux server using the Tap In Linux Agent might look like this (field text has been abbreviated for clarity).

Class           First           Last            Count  Sev  Group       Rule    Type 1     Name 1  Text
TI_Linux_Agent  01/01.08 09:00  01/01.08 11:00  24     5    Web Server  Disk    Host name  WEB01   OK - Disk 30% free
TI_Linux_Agent  01/01.08 09:00  01/01.08 11:00  24     2    Web Server  Memory  Host name  WEB01   WARNING - Memory 8% free
TI_Linux_Agent  01/01.08 09:00  01/01.08 11:00  24     5    Web Server  CPU     Host name  WEB01   OK - CPU total 40%
TI_Linux_Agent  01/01.08 09:00  01/01.08 11:00  24     5    Web Server  Load    Host name  WEB01   OK - Load_5 1.0
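
The Count, First Time and Last Time fields imply that repeated occurrences of a check are folded into one record rather than stored separately. A minimal sketch of that folding follows, assuming the record key is the combination of class, EMS, rule, and type-name pairs. The function and type names are hypothetical, and letting the newest occurrence supply the severity and text is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple

TypeNames = Tuple[Tuple[str, str], ...]
Key = Tuple[str, str, str, TypeNames]   # (class, ems, rule, type-name pairs)


@dataclass
class EventRecord:
    count: int
    first_time: datetime
    last_time: datetime
    severity: int
    message: str


def record_occurrence(store: Dict[Key, EventRecord], event_class: str, ems: str,
                      rule: str, type_names: TypeNames, when: datetime,
                      severity: int, message: str) -> EventRecord:
    """Fold one occurrence into the record sharing its key, creating it if needed."""
    key = (event_class, ems, rule, type_names)
    record = store.get(key)
    if record is None:
        record = EventRecord(1, when, when, severity, message)
        store[key] = record
        return record
    record.count += 1
    if when >= record.last_time:
        # Assumption: the most recent occurrence supplies severity and text.
        record.last_time = when
        record.severity = severity
        record.message = message
    record.first_time = min(record.first_time, when)
    return record


store: Dict[Key, EventRecord] = {}
host = (("Host name", "WEB01"),)
for minute in (0, 5, 10):
    record_occurrence(store, "TI_Linux_Agent", "ems-01", "Disk", host,
                      datetime(2008, 1, 1, 9, minute), 5, "OK - Disk 30% free")
```

After the loop the store holds a single Disk record for WEB01 with a count of 3, rather than three separate rows.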

The Tap In Management Server is capable of processing thousands of events each minute, so hundreds of servers may be managed by a single Tap In instance. Only one Type-Name pair is used in this example, since the host name identifies the source of these checks. Tap In uses the Class field to identify the event type, in this case events from the Linux agent, and the architecture allows a different number of Type-Name pairs for different types of events. For example, if a network device generates events that identify an error on a node, card, and port, the Type-Name pairs would be:

  • Type 1: Node (the node that is the source of this event)
  • Type 2: Card (the card within the node)
  • Type 3: Port (the port within the card)

Class    Last            Sev  Group       Rule           Type 1  Name 1   Type 2  Name 2  Type 3  Name 3  Text
NETWORK  01/01.08 11:00  1    Web Server  Port HW Fault  Node    NODE101  Card    A       Port    7       Hardware fault on port

Within a class, different Type-Name pairs may be associated with different rules. The Tap In architecture also allows a class to specify any number of Type-Name pairs if the managed technology requires greater configuration control. Let’s use the management of Amazon cloud applications as an example.
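
One way to picture rule-specific Type-Name pairs is a lookup from class and rule to the Type labels an event is expected to carry. The registry below is purely hypothetical: its contents, the class name `NETWORK`, and the function name are assumptions made for illustration, not part of the Tap In product.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical registry: (class, rule) -> the Type labels expected, in order.
TYPE_SCHEMA: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("TI_Linux_Agent", "Disk"): ("Host name",),
    ("NETWORK", "Port HW Fault"): ("Node", "Card", "Port"),
}


def check_type_names(event_class: str, rule: str,
                     type_names: Sequence[Tuple[str, str]]) -> bool:
    """Return True when the pairs carry exactly the Type labels registered
    for this class and rule, in the registered order."""
    expected = TYPE_SCHEMA.get((event_class, rule))
    if expected is None:
        return False  # unknown class/rule combination
    return tuple(type_label for type_label, _ in type_names) == expected


port_fault = [("Node", "NODE101"), ("Card", "A"), ("Port", "7")]
is_valid = check_type_names("NETWORK", "Port HW Fault", port_fault)
```

A table-driven check like this keeps the event record itself generic, so a new class only needs a new registry entry.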

Amazon cloud developers take advantage of the cloud’s ability to dynamically scale compute resources in order to meet demand. However, this dynamic nature presents management problems since it is difficult to monitor these instances. For complex applications, visibility of clusters or groups of instances based on the server’s role is critical.

One report that may be valuable is a configuration report which shows when Elastic Compute Cloud (EC2) instances start and terminate, and how many there are within various compute clusters at any point in time. Ideally, alerts and performance information can also be related to the dynamic configuration. In order to create these views, the events must contain the following information:

  • The unique identifier of an instance
  • The group or cluster identifier
  • Initial time the instance starts
  • Last time the instance was active

For non-cloud systems, the host name is typically used as the unique system identifier, as shown in the example above. However, since IP host names and addresses are dynamically assigned when instances start, they cannot be relied on. For cloud applications, the unique identifier can be the dynamically assigned host name, the instance id, or some other Amazon meta-data component. However, this identifier must also have meaning for the developer, or be relatable to something that does. This can be done in a couple of ways. If the developer has a different Amazon Machine Image (AMI) for each server role, the AMI can serve as the identifier. But the developer may also be using the same AMI for multiple groups, in which case the AMI is ambiguous.

In this case, the developer may supply the identifier as user data that is passed to the instance as it starts. The image then uses the user data to configure the server role. The unique identifier is then a combination of multiple Amazon meta-data fields. To provide this level of detail, the management system must be able to access the Amazon meta-data and relate it to the monitoring information in the event. The Tap In Management Server does this.

For monitored Amazon instances, a Tap In agent generates events that monitor server metrics. The Tap In agent for EC2 can extract any Amazon meta-data from the instance and combine it with any event field. To generate the events required for the configuration report, one possibility is to extract the Amazon instance-id and place it in one of the type-name pairs as “instance-id” : i-xxxxxx.
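
The sketch below shows how a process running on an instance could read the instance id from the EC2 instance metadata service and turn it into a type-name pair. It is not the Tap In agent's code. The function names are hypothetical, and it assumes the instance accepts plain unauthenticated GET requests; instances that require session tokens (IMDSv2) need an extra token request that is omitted here.

```python
import urllib.request
from typing import Callable, Tuple

# Link-local address of the EC2 instance metadata service.
METADATA_URL = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(item: str, opener: Callable = urllib.request.urlopen,
                   timeout: float = 2.0) -> str:
    """Read one metadata item, such as 'instance-id' or 'ami-id', as text.

    `opener` is injectable so the function can be exercised off-instance.
    """
    with opener(METADATA_URL + item, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def instance_id_pair(opener: Callable = urllib.request.urlopen) -> Tuple[str, str]:
    """Build the ("instance-id", "i-...") type-name pair for an event."""
    return ("instance-id", fetch_metadata("instance-id", opener))
```

Calling `instance_id_pair()` with no arguments only works from inside an EC2 instance, since the metadata address is not reachable elsewhere.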

If the server role must be extracted from the user-data, the Tap In EC2 agent can perform this function. The type-name pair can then combine the role and the unique identifier. For example, if the role “web_server” is extracted from the user-data, the unique identifier in the type-name pairs may be “web_server:i-xxxxx”, a combination of role and instance id. In addition, the event group field can contain the extracted role, enabling reports to identify the number of unique instances in each group. Since the first time and last time are in each event, a report can identify when each unique instance was present, and the number of instances active at any time for each group.
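
A minimal sketch of that combination follows. User data is free-form, so the `key=value` layout, the `role` key, and both function names are assumptions made for the example.

```python
from typing import Dict


def parse_user_data(user_data: str) -> Dict[str, str]:
    """Parse 'key=value' lines. This layout is an assumption, since user
    data can be any text the developer chooses."""
    settings: Dict[str, str] = {}
    for line in user_data.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '='
            settings[key.strip()] = value.strip()
    return settings


def role_identifier(user_data: str, instance_id: str) -> Dict[str, str]:
    """Return the Group value and the combined role:instance-id Name value."""
    role = parse_user_data(user_data).get("role", "unknown")
    return {"group": role, "name": f"{role}:{instance_id}"}


fields = role_identifier("role=web_server\nenv=prod", "i-12345")
```

Here `fields["group"]` holds the role for the Group field and `fields["name"]` holds the combined identifier for the type-name pair.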

Events from the Tap In EC2 agent that carry this combined identifier might look like this.

Class         First           Last            Count  Sev  Group       Rule  Type 1  Name 1              Text
TI_EC2_Agent  01/01.08 09:00  01/01.08 11:00  24     5    Web Server  Disk  ID      Webserver: i-12345  OK - Disk 30% free
TI_EC2_Agent  01/01.08 09:00  01/01.08 11:00  24     2    Web Server  Mem   ID      Webserver: i-12345  WARNING - Memory 8% free
TI_EC2_Agent  01/01.08 09:00  01/01.08 11:00  24     5    Web Server  CPU   ID      Webserver: i-12345  OK - CPU total 40%
TI_EC2_Agent  01/01.08 09:00  01/01.08 11:00  24     5    Web Server  Load  ID      Webserver: i-12345  OK - Load_5 1.0

Amazon meta-data items that the agent can extract and place in event fields include:

  • ami_id
  • instance_id
  • instance_type
  • local_hostname
  • public_hostname
  • local_ipv4
  • public_ipv4
  • reservation_id
  • security_groups
  • product_codes
  • user_data

With these events in the database, the configuration reports can be generated.
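
As a sketch of how such a report could be derived, the code below counts the distinct instances active in each group at a given moment, using only the group, unique identifier, first time, and last time of each event. The row layout, the function names, and the extra instance ids are made up for illustration, and the dates assume that "01/01.08" means 1 January 2008.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

# (group, unique identifier, first time, last time), one tuple per event.
EventRow = Tuple[str, str, datetime, datetime]
Span = Tuple[datetime, datetime]


def instance_lifetimes(events: Iterable[EventRow]) -> Dict[Tuple[str, str], Span]:
    """Widen each instance's window to span every event that mentions it."""
    spans: Dict[Tuple[str, str], Span] = {}
    for group, ident, first, last in events:
        key = (group, ident)
        if key in spans:
            start, end = spans[key]
            spans[key] = (min(start, first), max(end, last))
        else:
            spans[key] = (first, last)
    return spans


def active_per_group(events: Iterable[EventRow], at: datetime) -> Dict[str, int]:
    """Count distinct instances per group whose window contains `at`."""
    counts: Dict[str, int] = defaultdict(int)
    for (group, _), (start, end) in instance_lifetimes(events).items():
        if start <= at <= end:
            counts[group] += 1
    return dict(counts)


def day(hour: int, minute: int = 0) -> datetime:
    return datetime(2008, 1, 1, hour, minute)


rows = [
    ("web_server", "web_server:i-12345", day(9), day(11)),   # Disk events
    ("web_server", "web_server:i-12345", day(9), day(11)),   # CPU events, same instance
    ("web_server", "web_server:i-67890", day(10), day(11)),  # hypothetical second instance
    ("db_server", "db_server:i-24680", day(9), day(9, 30)),  # hypothetical, terminated early
]
early = active_per_group(rows, day(9, 15))
late = active_per_group(rows, day(10, 30))
```

Because several rules report on the same instance, the lifetimes are merged per identifier before counting, so an instance is never counted twice within its group.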

 