Automated responses to detected faults

Discussion:

Ron Mann

2008-12-15 16:28:54 UTC

Hello all!

We are constructing a distributed computing solution on top of OpenSolaris virtualization technology. It employs a monitoring facility to understand and assess the condition of any given node under its charge. In reviewing FMA, it seems reasonable and quite simple to use the SNMP trap mechanism to get the initial indication that something is amiss. At this level, however, we are not interested in how to correct the fault; we are concerned only with the impact of the defect on the system, so that we can decide programmatically whether to remove the node from service or limit the amount of work allocated to it. Given that the trap contains only an error code to be looked up on the web site, we need a way to retrieve, in software, sufficient information to make such a decision.

It would seem that there are a number of ways we might go about getting it. Ignoring the distasteful notion of screen scraping the online knowledge base, upon receiving a trap we could conceivably have our local agent exec fmadm faulty and parse its output, assuming it has sufficient structure. A second possibility would be to use the contents of the SUN-FM-MIB to retrieve the resource status, but my read is that some of the necessary information is not contained in the MIB. A further variation on this would be to incorporate the problem codes into our handler as an internal 'knowledge base' for determining the appropriate course of action, though I'm unclear as to where all these codes are defined. Equally, I'm as yet unfamiliar with how fmdump and fmadm go about their business, but we could presumably mimic them to avoid parsing string output. Finally, we could install our own module and deal with the daemon directly. Perhaps there are other approaches as well.
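
A minimal sketch of the exec-and-parse option, assuming only that the output of fmadm faulty includes the UUID of each open case in the usual 8-4-4-4-12 hexadecimal form. The helper names (extract_uuids, faulty_uuids) are hypothetical, nothing here relies on the exact layout of the output, and whether the remaining fields are structured enough to parse is exactly the open question above.

```python
import re
import subprocess

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def extract_uuids(text):
    """Return the distinct UUIDs found in text, in order of first appearance."""
    seen = []
    for match in UUID_RE.findall(text):
        uuid = match.lower()
        if uuid not in seen:
            seen.append(uuid)
    return seen


def faulty_uuids(command=("fmadm", "faulty")):
    """Run the command and return the UUIDs in its output.

    Assumption: the agent runs with enough privilege to execute fmadm.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return extract_uuids(result.stdout)
```

The UUIDs returned by faulty_uuids() could then be matched against the one carried in the trap, so the agent only reacts to cases it has actually been told about.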

Is there any theory or prescribed methodology for implementing such a facility? Is such a notion contrary to the current FMA design center? Which, if any, of the above approaches seem appropriate? Is there an RFE lurking here, with a new facility required? TIA!

=Ron=

--
This message posted from opensolaris.org

Rob Johnston

2008-12-15 19:48:51 UTC

Hi Ron,

Here are some initial thoughts - to get the discussion rolling...

By "impact of the defect on the system", it sounds like what you're looking for
is whether or not one or more hw resources were disabled in response to the
fault (presumably impacting the box's service level).

However, whether or not a piece of hardware was disabled by FMA, in and of
itself, is not an accurate indicator of the severity or potential future impact
of a given fault.

Additionally, you'll need to be careful wrt building rules in your management
software around the problem codes (event-ids). Take, for example, a problem
code that indicates a processor cache fault. How should your management sw
react? There's no simple answer, as it really depends on the configuration of
the box. If it's got a bunch of processors, where a single cpu fault can be
easily isolated, then it's not a big deal. If it's a smaller box, then
isolating the faulty cpu could dramatically affect performance. If it's a
single cpu box, then there is no performance impact, since we can't isolate the
cpu, but the box's availability is severely compromised because we can't prevent
a future cache UE (uncorrectable error) from taking the whole system down.
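
To make the configuration dependence concrete, here is a rough sketch of such a
rule. The function name, the action strings and the small_box_threshold cutoff
are all hypothetical; nothing here comes from FMA itself, and a real policy
would need far more input than a cpu count.

```python
def react_to_cpu_fault(total_cpus, isolated_cpus, small_box_threshold=4):
    """Pick a scheduling action for a node after a processor cache fault.

    total_cpus: cpus the node had before the fault.
    isolated_cpus: cpus taken offline in response to the fault.
    small_box_threshold: hypothetical cutoff at or below which losing
        a single cpu is assumed to hurt performance noticeably.
    """
    if total_cpus < 1 or not 0 <= isolated_cpus <= total_cpus:
        raise ValueError("inconsistent cpu counts")
    if total_cpus == 1:
        # Nothing can be isolated: performance is unchanged, but a later
        # uncorrectable error could take the whole box down.
        return "at_risk"
    if isolated_cpus == 0:
        # Fault diagnosed but nothing disabled yet.
        return "watch"
    if total_cpus <= small_box_threshold:
        # Small box: losing a cpu is a large fraction of its capacity.
        return "reduce_load"
    # Large box: one isolated cpu is not a big deal.
    return "keep_in_service"
```

Even this toy version needs the cpu count from somewhere other than the problem
code, which is the point: the same event-id maps to different actions on
different boxes.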

Needless to say, it's a really tricky thing to get right.

But rather than getting bogged down talking about what rules your management sw
could employ - we'll let you figure that out ;) - let's focus on identifying
what sort of additional information you need from FMA (that you're not currently
able to get through documented interfaces) and see what we can do to rectify
that (so you don't have to resort to things like screen-scraping).

For example, I think we currently generate an SNMP trap for list.suspect,
list.repair and list.resolved events. Perhaps we should also generate an SNMP
trap when a list.isolated event occurs (indicating that one or more ASRUs were
disabled in response to the list.suspect with the same uuid).
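
As a sketch of how a management agent might consume these events if a
list.isolated trap existed (it is only a suggestion at this point), the tracker
below correlates events by uuid. The class name, the state strings and the idea
that the agent has already decoded each trap into an (event class, uuid) pair
are all assumptions.

```python
class FaultTracker:
    """Track open fault cases on one node, keyed by case uuid."""

    def __init__(self):
        self.cases = {}  # uuid -> "suspect" | "isolated" | "repaired"

    def handle(self, event_class, uuid):
        """Record one event and return the resulting node state."""
        if event_class == "list.suspect":
            self.cases[uuid] = "suspect"
        elif event_class == "list.isolated":
            # Record it even if the matching list.suspect was missed.
            self.cases[uuid] = "isolated"
        elif event_class == "list.repair":
            self.cases[uuid] = "repaired"
        elif event_class == "list.resolved":
            self.cases.pop(uuid, None)
        else:
            raise ValueError("unexpected event class: " + event_class)
        return self.node_state()

    def node_state(self):
        """Summarize the node: degraded, suspect or healthy."""
        states = set(self.cases.values())
        if "isolated" in states:
            return "degraded"  # at least one resource was disabled
        if "suspect" in states:
            return "suspect"  # diagnosed, nothing disabled so far
        return "healthy"
```

With this shape, the scheduler only has to ask for node_state() and never has to
interpret individual problem codes.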

rob

Ronald J Mann

2008-12-16 18:43:31 UTC

Yes, what we are really interested in, at least at this juncture, is
reacting to the removal of resources. At the moment, trying to wrap our
heads around fault prediction seems beyond the pale. Equally, it seems
difficult or impossible to filter the incoming events to determine whether
something has been, or is certain to be, pulled. It did occur to us that we
might, for example, on a processor/mem fault parse the results of virsh
nodeinfo (given the virtualized nature of psrinfo et al.), but I assume
there could be a timing window here. It does strike me that the notion
of an event being generated on retirement would be valuable.
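
A rough sketch of the virsh nodeinfo idea, with a bounded poll to narrow
(not close) the timing window. It assumes the output is a series of
"key: value" lines and that the cpu count appears under the key "CPU(s)";
read_nodeinfo is a hypothetical callable that returns that text, for
example by exec'ing virsh nodeinfo.

```python
import time


def parse_nodeinfo(text):
    """Parse 'key: value' lines into a dict of stripped strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def cpu_count(text, key="CPU(s)"):
    """Return the cpu count; the key name is an assumption about the output."""
    return int(parse_nodeinfo(text)[key])


def wait_for_cpu_drop(read_nodeinfo, baseline, attempts=5, delay=1.0,
                      sleep=time.sleep):
    """Poll until the reported cpu count falls below baseline.

    Returns the new count, or None if no drop was seen within the
    allowed attempts (which does not prove nothing was retired).
    """
    for attempt in range(attempts):
        count = cpu_count(read_nodeinfo())
        if count < baseline:
            return count
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

A None result is still ambiguous, which is why an event generated at
retirement time would be the cleaner signal.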

=Ron=
