Ron Mann
2008-12-15 16:28:54 UTC
Hello all!
We are constructing a distributed computing solution atop OpenSolaris virtualization technology which employs a monitoring facility to understand and assess the condition of any given node under its charge. In reviewing FMA, it seems reasonable and quite simple to employ the SNMP trap mechanism to get the initial indication that something is amiss. However, at this level, we are not interested in how to correct the fault; we are concerned only with the impact of the defect on the system, so that we can programmatically decide whether to remove the node from service or limit the amount of work allocated to it. Given that the trap only contains an error code to be referenced via the web site, we require a way in software to retrieve sufficient information to make such a decision.
It would seem that there are a number of ways we might go about getting it. Ignoring the distasteful notion of screen scraping the online knowledge base, upon receiving a trap we could conceivably have our local agent exec fmadm faulty and parse its output, assuming it has sufficient structure. A second possibility would be to use the contents of the SUN-FM-MIB to retrieve the resource status, but my read is that some of the necessary information is not contained in the MIB. A further variation on this would be to incorporate the problem codes into our handler as an internal 'knowledge base' to determine the appropriate course of action, though I'm unclear as to where all these are defined. Equally, I'm as yet unfamiliar with how fmdump and fmadm go about their business, but we could presumably mimic them to avoid parsing string output. Finally, we could install our own module and deal with the daemon directly. Perhaps there are other approaches as well.
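To make the internal 'knowledge base' variation concrete, here is a minimal sketch of what I have in mind on the handler side. Everything in it is an assumption for illustration: the problem codes are placeholders rather than real FMA codes, the three actions are simply the choices described above, and how the code is extracted from the trap is left out entirely.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep in service"
    THROTTLE = "limit allocated work"
    REMOVE = "remove from service"


# Hypothetical table: these codes are placeholders, not real FMA problem codes.
# A real table would have to be populated from wherever the codes are defined.
KNOWLEDGE_BASE = {
    "EXAMPLE-0000-01": Action.THROTTLE,
    "EXAMPLE-0000-02": Action.REMOVE,
}


def decide(problem_code, default=Action.REMOVE):
    """Map a problem code taken from a trap to a scheduling action.

    Unknown codes fall back to `default`; removing the node is the
    conservative choice when the impact of a fault is not known.
    """
    return KNOWLEDGE_BASE.get(problem_code.strip().upper(), default)
```

The weakness is the one already noted: the table is only as good as our knowledge of the codes and their impact, so unknown codes have to fall back to a conservative default.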
Is there any theory or prescribed methodology for implementing such a facility? Is such a notion contrary to the current FMA design center? Which, if any, of the above approaches seem appropriate? Is there an RFE lurking here, with a new facility required? TIA!
=Ron=
--
This message posted from opensolaris.org