Event management design concepts
Tap In System’s Cloud Management Service gathers events from its managed devices to determine the health of the environment. Events from each type of managed device are gathered and normalized by the Tap In Management Server, which is implemented as an Amazon EC2 instance. Since reports and viewer applications use these events to display availability, performance, and configuration information, the event structure must be able to carry rich data.
The following are the standard fields used in the Tap In event.
- Class: Indicates the type of managed device.
- Element Management System: The unique identifier of the Tap In Management Server.
- Count: The number of occurrences of this event. An event is unique for each combination of class, EMS, rule, and type-name pairs.
- First Time: The first occurrence of this event.
- Last Time: The last occurrence of this event.
- Severity: The severity (1 to 5) of this event.
- Rule: Indicates the type of error or event.
- Group: Indicates common group across events from different classes of managed devices.
- Type-Name pairs: Sets of type-name pairs that identify the specific component that caused this event. The pairs may differ for each class.
- Message Text: A free form text description of the event.
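As a rough illustration, the standard fields could be modeled as a small record type. This is only a sketch: the class name `TapInEvent`, the attribute names, and the `ems-001` identifier are assumptions made for the example, not the actual Tap In schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class TapInEvent:
    """Hypothetical container for the standard event fields."""
    event_class: str        # Class: type of managed device
    ems: str                # Element Management System identifier
    rule: str               # Rule: type of error or event
    group: str              # Group shared across device classes
    severity: int           # Severity, 1 to 5
    first_time: datetime    # First occurrence of this event
    last_time: datetime     # Last occurrence of this event
    message_text: str = ""  # Free-form description
    count: int = 1          # Number of occurrences
    # Ordered (type, name) pairs identifying the component.
    type_name_pairs: List[Tuple[str, str]] = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.severity <= 5:
            raise ValueError("severity must be between 1 and 5")
        if self.last_time < self.first_time:
            raise ValueError("last_time precedes first_time")


# Example values; the EMS identifier and the date are made up.
event = TapInEvent(
    event_class="TI_Linux_Agent",
    ems="ems-001",
    rule="Disk",
    group="Web Server",
    severity=5,
    first_time=datetime(2008, 1, 1, 9, 0),
    last_time=datetime(2008, 1, 1, 11, 0),
    message_text="OK - Disk 30% free",
    type_name_pairs=[("Host name", "WEB01")],
)
```

Keeping the type-name pairs as an ordered list, rather than fixed attributes, leaves room for classes that need more than one pair.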
These fields are generally used to integrate any monitored element, both cloud and non-cloud, into a central system. A typical set of events from a managed Linux server using the Tap In Linux Agent might look like this. Field text has been abbreviated for clarity.
| Class | First | Last | Count | Sev | Group | Rule | Type 1 | Name 1 | Text |
|---|---|---|---|---|---|---|---|---|---|
| TI_Linux_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 5 | Web Server | Disk | Host name | WEB01 | OK - Disk 30% free |
| TI_Linux_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 2 | Web Server | Memory | Host name | WEB01 | WARNING - Memory 8% free |
| TI_Linux_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 5 | Web Server | CPU | Host name | WEB01 | OK - CPU total 40% |
| TI_Linux_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 5 | Web Server | Load | Host name | WEB01 | OK - Load_5 1.0 |
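The Count, First Time, and Last Time columns suggest that repeated occurrences are folded into a single row per unique event. A minimal sketch of that idea follows, assuming the key is the combination of class, EMS, rule, and type-name pairs described in the field list. The function name `record_occurrence`, the in-memory dictionary store, and the `ems-001` value are hypothetical, and the real server may normalize events differently.

```python
from datetime import datetime


def record_occurrence(store, event_class, ems, rule, type_name_pairs,
                      seen_at, severity, text):
    """Fold one occurrence into the row for its unique event key."""
    key = (event_class, ems, rule, tuple(type_name_pairs))
    row = store.get(key)
    if row is None:
        # First occurrence: first and last time are the same.
        row = {"first": seen_at, "last": seen_at, "count": 1,
               "severity": severity, "text": text}
        store[key] = row
    else:
        # Repeat occurrence: bump the count and refresh the latest state.
        row["last"] = max(row["last"], seen_at)
        row["count"] += 1
        row["severity"] = severity
        row["text"] = text
    return row


store = {}
for minute in (0, 5, 10):
    row = record_occurrence(
        store, "TI_Linux_Agent", "ems-001", "Disk",
        [("Host name", "WEB01")],
        datetime(2008, 1, 1, 9, minute), 5, "OK - Disk 30% free",
    )
```

After the three calls the store holds one row whose count is 3, with the first and last times spanning the three checks.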
The Tap In Management Server is capable of processing thousands of events each minute, so hundreds of servers may be managed by a single Tap In instance. Only one Type-Name pair is used in this example, since the host name identifies the source of these checks. However, this may vary by the type of event. Tap In uses the Class field to identify the event type, in this case events from the Linux agent, and the architecture allows a different number of Type-Name pairs for different types of events. For example, if a network device generates events that identify an error on a node, card, and port, the Type-Name pairs would be:
- Type 1: Node (the node that is the source of this event)
- Type 2: Card (the card within the node)
- Type 3: Port (the port within the card)
| Class | Last | Sev | Group | Rule | Type 1 | Name 1 | Type 2 | Name 2 | Type 3 | Name 3 | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NETWORK | 01/01.08 11:00 | 1 | Web Server | Port HW Fault | Node | NODE101 | Card | A | Port | 7 | Hardware fault on port |
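One way to picture a variable number of Type-Name pairs is as an ordered list that is flattened into numbered columns only for display. The helper `flatten_pairs` below is a hypothetical illustration of that layout, not part of Tap In.

```python
def flatten_pairs(type_name_pairs):
    """Turn ordered (type, name) pairs into 'Type N' / 'Name N' columns."""
    columns = {}
    for index, (type_label, name_value) in enumerate(type_name_pairs, start=1):
        columns[f"Type {index}"] = type_label
        columns[f"Name {index}"] = name_value
    return columns


# Three pairs for the network event, one pair for the Linux agent event.
network_columns = flatten_pairs([("Node", "NODE101"), ("Card", "A"), ("Port", "7")])
linux_columns = flatten_pairs([("Host name", "WEB01")])
```

The network event yields six columns and the Linux event two, so both can share one event structure.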
Within a class, different Type-Name pairs may be associated with different rules. The Tap In architecture also allows a class to specify any number of Type-Name pairs if the managed technology requires greater configuration control. Let’s use the management of Amazon cloud applications as an example.
Amazon cloud developers take advantage of the cloud’s ability to dynamically scale compute resources in order to meet demand. However, this dynamic nature presents management problems since it is difficult to monitor these instances. For complex applications, visibility of clusters or groups of instances based on the server’s role is critical.
One report that may be valuable is a configuration report which shows when Elastic Compute Cloud (EC2) instances start and terminate, and how many there are within various compute clusters at any point in time. Ideally, alerts and performance information can also be related to the dynamic configuration. In order to create these views, the events must contain the following information:
- The unique identifier of an instance
- The group or cluster identifier
- Initial time the instance starts
- Last time the instance was active
For non-cloud systems, the host name is typically used as the unique system identifier, as shown in the example above. However, since IP host names and addresses are dynamically assigned when instances start, they cannot be relied on. For cloud applications, the unique identifier can be the dynamically assigned host name, the instance id, or some other Amazon meta-data component. However, this identifier must also have meaning for the developer, or be something the developer can relate to the application. This can be done in a couple of ways. If the developer has a different Amazon Machine Image (AMI) for each server’s role, the AMI can be the unique identifier. But the developer may also be using the same AMI for multiple groups, in which case the AMI is ambiguous.
In this case, the developer may supply the unique identifier as user data that is passed to the instance as it starts. The image then uses the user data to configure the server role. So the unique identifier is a combination of multiple Amazon meta-data fields. In order for the management system to provide this level of detail, it must be able to access the Amazon meta-data and relate it to the monitoring information in the event. The Tap In Management Server does this.
For Amazon monitored instances, a Tap In agent is used to generate events that monitor server metrics. The Tap In agent for EC2 can extract any Amazon meta-data from the instance and combine it with any event field. To generate the events required for the configuration report, one possibility is to extract the Amazon instance-id and place it in one of the type-name pairs as “instance-id” : i-xxxxxx.
If the server role must be extracted from the user-data, the Tap In EC2 agent can perform this function. The type-name pair can then combine the role and unique identifier. For example, if the role “web_server” is extracted from the user-data, the unique identifier in the type-name pairs may be “web_server:i-xxxxx”, a combination of role and instance id. In addition, the event group field can contain the extracted role, enabling reports to identify the number of unique instances in each group. Since the first time and last time are in each event, a report can identify when each unique instance was present, and the number of instances active at any time for each group.
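The combination of role and instance id can be sketched as follows. How the agent actually obtains and parses the meta-data is not described here, so the sketch takes a plain dictionary and assumes the user data holds `key=value` lines such as `role=web_server`. That format and the function names `parse_role` and `build_identity` are assumptions made for illustration.

```python
def parse_role(user_data, default="unknown"):
    """Extract the server role from user data (assumed key=value lines)."""
    for line in user_data.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "role":
            return value.strip()
    return default


def build_identity(metadata):
    """Derive the event group and identifying type-name pair from meta-data."""
    role = parse_role(metadata.get("user_data", ""))
    instance_id = metadata["instance_id"]
    return {
        "group": role,  # lets reports count instances per role
        "type_name_pairs": [("ID", f"{role}:{instance_id}")],
    }


identity = build_identity({
    "instance_id": "i-12345",
    "user_data": "role=web_server\n",
})
```

Putting the role in both the group field and the identifier keeps each instance distinguishable even when several groups are launched from the same AMI.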
| Class | First | Last | Count | Sev | Group | Rule | Type 1 | Name 1 | Text |
|---|---|---|---|---|---|---|---|---|---|
| TI_EC2_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 5 | Web Server | Disk | ID | Webserver: i-12345 | OK - Disk 30% free |
| TI_EC2_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 2 | Web Server | Mem | ID | Webserver: i-12345 | WARNING - Memory 8% free |
| TI_EC2_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 5 | Web Server | CPU | ID | Webserver: i-12345 | OK - CPU total 40% |
| TI_EC2_Agent | 01/01.08 09:00 | 01/01.08 11:00 | 24 | 5 | Web Server | Load | ID | Webserver: i-12345 | OK - Load_5 1.0 |
The Amazon meta-data fields that the agent can draw on include:
- ami_id
- instance_id
- instance_type
- local_hostname
- public_hostname
- local_ipv4
- public_ipv4
- reservation_id
- security_groups
- product_codes
- user_data
With these events in the database, the configuration reports can be generated.
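As a sketch of how such a report might be derived, the function below counts the distinct instances per group whose first-time to last-time window covers a given moment. The dictionary layout, the function name `active_instances`, and the sample rows (including the second instance and the `db_server` group) are hypothetical. A real report would query the event database instead.

```python
from datetime import datetime


def active_instances(events, at_time):
    """Count distinct instances per group that were active at at_time."""
    active = {}
    for event in events:
        if event["first"] <= at_time <= event["last"]:
            active.setdefault(event["group"], set()).add(event["instance"])
    return {group: len(ids) for group, ids in active.items()}


def t(hour, minute=0):
    """Build a timestamp on an arbitrary sample day."""
    return datetime(2008, 1, 1, hour, minute)


events = [
    # Two rules reported by the same instance count as one instance.
    {"group": "web_server", "instance": "web_server:i-12345", "rule": "Disk",
     "first": t(9), "last": t(11)},
    {"group": "web_server", "instance": "web_server:i-12345", "rule": "CPU",
     "first": t(9), "last": t(11)},
    {"group": "web_server", "instance": "web_server:i-67890", "rule": "Disk",
     "first": t(10), "last": t(11)},
    {"group": "db_server", "instance": "db_server:i-24680", "rule": "Disk",
     "first": t(9), "last": t(9, 15)},
]

early = active_instances(events, t(9, 30))
later = active_instances(events, t(10, 30))
```

Using a set per group means multiple rules from one instance do not inflate the count. Evaluating the function at successive times gives the number of instances in each cluster over time.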

