requirementsforoperationalmonitoring.pdf | 156 KB |
Authors | |
---|---|
Version | 1.0 |
Last Revised | 25-Aug-2017 |
Status | Final |
Document Type | Single Topic Guidance |
Audience Level | |
A good deal of attention has been paid to specific tools for monitoring, and in some cases vendors have been selected. What is missing is a list of the desired management capabilities required in aggregate for HUIT, along with an enumeration of the instrumentation (data) that each of the managed systems in our environment should support. This work addresses these two areas.
Operational management environments generally support five broad functional areas: Fault, Configuration, Accounting, Performance, and Security. To achieve these functions, a management environment will have two main building blocks:
This document defines a basic set of data elements related to operational monitoring and instrumentation that should be provided by systems in the Harvard environment, whether they are on-premises, in our AWS Cloud, or provided by a SaaS vendor. It also includes a description of the basic management functions that are expected for deployed services.
This list is focused on general instrumentation and is not intended to be a replacement for product-specific instrumentation and tooling. For example, Oracle, our main database supplier, provides specialized instrumentation. This instrumentation provides visibility not likely to be available via standard methods and is indispensable to providing required management functions. Where possible, we should prefer standard products and instrumentation, but where we have chosen products that can be managed only through proprietary instrumentation and proprietary software, we must use those systems.
Absent from the description of the data elements below are issues related to aggregation, overall architecture, frequency of collection, and functions performed on the data. Frequency of collection and functions performed will be described in the section on management functions. Architecture, including aggregation points, is beyond the scope of this backlog item. That can be taken up as a separate backlog topic if desired at the conclusion of this work item.
The further up the 'stack' you go, the more difficult it becomes to find common, standard instrumentation. For this reason, when thinking about management/monitoring of technical services and their components, it is generally required that instrumentation from other levels be configured in such a way that it provides insight into application/service behavior. This is generally better than having a custom set of proprietary instrumentation and management tools for each application deployed at the University. Examples of standard instrumentation are provided for each of the categories below. In some cases, multiple technologies are available to provide the data; choosing among them involves defining the larger architecture for management and monitoring, which is not in scope for this backlog topic but could be taken up at the conclusion of this work. This work has value as a standalone element since it will help service owners identify what is required and begin to build environments that support the required instrumentation.
As noted above, our primary database vendor, Oracle, tends to have its users manage its products with proprietary tools, though some standard instrumentation has been defined. Below is a starting list of simple information relevant to databases that would be of general interest:
Many of the requirements for instrumentation for the OS and resources it uses are the same as those collected for layers above the OS. What distinguishes the data at the higher level is that they are aggregated/segmented in different ways, for example, all the processes related to an application, database, or middleware component.
Items to include are:
It is understood that in AWS or other cloud/SaaS environments, collection of some of these data elements may be more challenging.
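As a concrete illustration of what one OS-level sample might look like, the sketch below collects a few basic host metrics using only the Python standard library. The record fields, and the choice of load average as the example metric, are assumptions for illustration, not items mandated by this document; `os.getloadavg()` is POSIX-only, and cloud/SaaS hosts may require an agent or provider API instead.

```python
import os
import socket
import time
from dataclasses import dataclass, asdict

@dataclass
class OsMetricsSample:
    """One instrumentation sample for a single host (field names are illustrative)."""
    hostname: str
    timestamp: float
    load_1m: float   # 1-minute run-queue load average
    cpu_count: int

def collect_sample() -> OsMetricsSample:
    # os.getloadavg() is POSIX-only; non-POSIX or SaaS environments need another source.
    load_1m, _, _ = os.getloadavg()
    return OsMetricsSample(
        hostname=socket.gethostname(),
        timestamp=time.time(),
        load_1m=load_1m,
        cpu_count=os.cpu_count() or 1,
    )

print(asdict(collect_sample()))
```

A management agent would emit such records on a configurable schedule; frequency of collection is addressed in the section on management functions.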
The idea is to capture information that provides an understanding of the state of the application(s) on an instance of a system (physical or virtual). Aggregating across multiple systems is a function of the management software application and will be covered in the section on management functions. For each application, and its elements (processes), the following information should be collected:
It is strongly desired that as much of the system as possible be user-configurable, consistent with the requirements specified in this document and with security requirements.
The page on instrumentation requirements details the information from different resources required for a management system to perform its functions. This page lists the functions the management system should be able to perform. Functions that are related have been grouped into major categories. These categories are:
This section details the basic data collection functions the management system should perform, on a configurable basis, for each managed element it monitors. The type of information to be collected for each managed system type, such as a server, is detailed on the instrumentation requirements page.
We need to be able to aggregate across a variety of dimensions. Here are some examples:
EXAMPLES: This type of aggregation can help us in capacity planning for services: for example, determining when more network capacity will be required, when a database or other type of server should be upgraded to the next larger size, or when an auto scale group size might be increased.
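The aggregation idea can be sketched in a few lines: given per-host samples tagged by dimension, roll a metric up along any one of those dimensions. The sample data, host names, and service names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-host CPU-utilization samples tagged by service (values are made up).
samples = [
    {"host": "db-01", "service": "registrar-db", "cpu_pct": 72.0},
    {"host": "db-02", "service": "registrar-db", "cpu_pct": 65.0},
    {"host": "web-01", "service": "portal-web", "cpu_pct": 31.0},
]

def average_by(dimension, rows):
    """Aggregate a metric across any dimension (service, environment, etc.)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row["cpu_pct"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

print(average_by("service", samples))  # {'registrar-db': 68.5, 'portal-web': 31.0}
```

A sustained per-service average trending toward a threshold is the kind of signal that feeds the capacity-planning decisions described above.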
EXAMPLES: The objective here is to help operational personnel focus on the issues of greatest importance and, if there are issues, get to the root cause as soon as possible. By 'throttling' alarms or cutting down on duplicate events, operations staff remain aware of problems without receiving so much information that it distracts them from the task of finding the error. See the route flapping example above.
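One simple form of the throttling described above can be sketched as follows: suppress repeat notifications for the same event key within a quiet window. The 300-second window and the event-key format are arbitrary choices for the sketch, not values from this document.

```python
import time
from typing import Optional

class EventThrottle:
    """Suppress duplicate event notifications within a quiet window (illustrative sketch)."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._last_sent = {}  # event key -> time of last notification

    def should_notify(self, event_key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_sent.get(event_key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: operators already know
        self._last_sent[event_key] = now
        return True

throttle = EventThrottle(window_seconds=300)
print(throttle.should_notify("router7:link-flap", now=0))    # True  (first occurrence)
print(throttle.should_notify("router7:link-flap", now=60))   # False (duplicate, throttled)
print(throttle.should_notify("router7:link-flap", now=400))  # True  (window expired)
```

A flapping route would generate many identical events; the throttle keeps the first notification visible while the repeats are held back.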
This section identifies the operations performed on data received by the management system, divided into the recognized areas of management.
The essential difference between accounting data and performance/utilization data is that accounting data allocates consumption to an identifiable consumer such as an application, service, network, or other identifiable entity. It is not the intention of this set of functions to replicate a billing system. The idea is to use collected usage data to inform decision making in capacity planning, performance management, or cost management.
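The allocation step can be illustrated with a small roll-up: each usage record carries an identifiable consumer tag, and totals are grouped per consumer and per resource. The consumer names, resources, and amounts below are invented for the sketch; as the text notes, this informs planning rather than replicating a billing system.

```python
from collections import defaultdict

# Hypothetical usage records; each carries an identifiable consumer tag (names are made up).
usage = [
    {"consumer": "sis-app", "resource": "compute-hours", "amount": 120.0},
    {"consumer": "sis-app", "resource": "storage-gb", "amount": 500.0},
    {"consumer": "hr-app", "resource": "compute-hours", "amount": 40.0},
]

def allocate(records):
    """Roll usage up per consumer and per resource type."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        totals[rec["consumer"]][rec["resource"]] += rec["amount"]
    return {consumer: dict(res) for consumer, res in totals.items()}

print(allocate(usage))
```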
This section refers to issues related to the management software itself:
Previous sections have identified the type of data to be collected and the operations to be performed on that data. This section enumerates the various outputs of the system.
This set of facilities relates to how the management system can integrate with other systems in the HUIT environment. The system should provide:
Requirement | Justification |
---|---|
Ability to override schedules | The ability to modify schedules to allow for time off is necessary for the operations workflow. |
Automated rotation | The ability to automatically rotate on-call operations team members is necessary for the operations workflow. |
Automated event escalation | The ability to escalate events to another operations team member is necessary for the operations workflow. |
Event groups and classification | The ability to group or classify events based on predetermined criteria is necessary for the operations workflow. |
Event notification suppression expiration | The ability to suppress an event notification is necessary for the operations workflow. If an event keeps firing (or keeps causing an alarm to fire), the suppression should latch until the underlying condition is cleared. |
Event history auditing and reporting | The ability to view historical events is necessary for the operations workflow. |
Guaranteed alert delivery | End-to-end alert delivery is critical to the operations workflow. |
International notification capabilities | The ability to send notifications internationally |
Multiple region availability | High availability in multiple regions |
On-call schedule history | The ability to view historical on-call participation is necessary for the operations workflow. |
On-call schedules | The ability to schedule operations team members to participate in an on-call schedule is necessary for the operations workflow. |
Service Element Provisioned State | If a server, for example, is simply a backup, or a link is a secondary/backup path, a failure is less critical than for a main production element. The system should be configurable to know about these differences. This concept also applies to our different environments (e.g., DEV, PROD, etc.) |
Single platform | Single configuration and administration interface |
Support for authentication federation | The ability to federate with the currently preferred authentication system |
Support for software integrations | The ability to integrate alerts and notifications to currently implemented reporting platforms |
Support multiple event notification types | The ability to use SMS messaging, mobile app push notification, phone call or email is necessary for the operations workflow |
Support Time-of-Day Alerting | The ability to set up different alerting methods and pathways for events based on the time of day |
Web based and mobile application capable | The ability to use a web based dashboard or a mobile application to respond to events is necessary for the operations workflow. |
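The time-of-day alerting requirement above can be sketched as a simple routing policy: the notification method depends on severity and the hour the event fires. The business-hours cutoffs, severity labels, and channel names are assumptions for illustration, not values specified in this document.

```python
from datetime import time

# Illustrative policy: cutoffs and channel names are assumptions, not requirements.
BUSINESS_START, BUSINESS_END = time(8, 0), time(18, 0)

def notification_channels(severity, at):
    """Pick notification methods based on severity and time of day."""
    in_business_hours = BUSINESS_START <= at < BUSINESS_END
    if severity == "critical":
        # Critical events always page, regardless of hour.
        return ["phone_call", "sms"]
    if in_business_hours:
        return ["email", "mobile_push"]
    # Off-hours, non-critical: hold for a morning digest.
    return ["email_digest"]

print(notification_channels("critical", time(3, 0)))  # ['phone_call', 'sms']
print(notification_channels("warning", time(10, 0)))  # ['email', 'mobile_push']
```

A real system would also consult the on-call schedule and escalation rules listed in the table above before delivering anything.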
Requirement | Justification |
---|---|
Automated on-call reminder notifications | The ability to configure on-call reminder notifications is a nonessential feature for the operations workflow. |
Ability to export schedules to calendar applications | The ability to export on-call schedules to a personal calendar is a nonessential feature for the operations workflow. |
Ability to parse structured data | The ability to ingest data from a currently implemented Application Programming Interface |
Support for an Application Programming Interface | The ability to provide an Application Programming Interface to allow alerting data ingestion |
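The two API-related rows above amount to accepting structured alert data and validating it on ingestion. The sketch below shows one minimal form of this, assuming a JSON payload; the field names (`source`, `severity`, `metric`, `value`) are illustrative, not a schema defined by this document.

```python
import json

# Hypothetical alert payload a monitoring source might POST to the management system's API.
raw = '{"source": "webserver-12", "severity": "warning", "metric": "disk_pct", "value": 91}'

def ingest_alert(payload):
    """Parse and minimally validate one structured alert record (fields are illustrative)."""
    alert = json.loads(payload)
    required = {"source", "severity", "metric", "value"}
    missing = required - alert.keys()
    if missing:
        raise ValueError("alert missing fields: %s" % sorted(missing))
    return alert

alert = ingest_alert(raw)
print(alert["source"], alert["severity"])  # webserver-12 warning
```

Rejecting malformed records at the boundary keeps downstream aggregation and alerting functions working on clean data.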