Monday, September 24, 2012

Hi Guys,

I ran into one more SCOM issue that took many hours to troubleshoot. I finally identified the cause and resolved it, so I thought I would share it here for SCOM 2007 R2.

Microsoft SCOM 2007 R2 Enterprise - Entire Management Group randomly becomes greyed out

Setup - All nodes are virtual machines
OS - Microsoft Windows Server 2008 R2 Enterprise
App - Microsoft System Center Operations Manager 2007 R2 Enterprise
DB - Microsoft SQL Server 2008 R2

At random times, the whole management group becomes greyed out in the SCOM 2007 R2 environment. However, the health service on each monitored system is still healthy.

In this case, I followed the step-by-step troubleshooting below to identify the cause and the action needed to resolve it.

1. Start a perfmon capture with the necessary counters, or use the commands below.
Logman.exe create counter Perf-1Sec -f bincirc -max 500 -c "\LogicalDisk(*)\*" "\Memory\*" "\Network Interface(*)\*" "\Paging File(*)\*" "\PhysicalDisk(*)\*"  "\Server\*" "\System\*" "\Process(*)\*" "\Processor(*)\*"  "\Cache\*" -si 00:00:01 -o C:\PerfMonLogs\Perf-1Sec.blg
Logman.exe start Perf-1Sec   

2. Wait for the problem to occur again and then stop the perfmon log. You can use either the Perfmon console or “Logman.exe stop Perf-1Sec” (the collector name from step 1).

3. Create a dump file of the health service, on the RMS first.
A. Open Task Manager, right-click HealthService.exe and create a dump file.
B. After the dump file is created, go to the temp folder and copy the dump file to a safe location. Dump files left there may be cleaned up after an OS restart.

4. Get a SCOM trace.
A. Stop the HealthService (try stopping it from services.msc; if the process hangs while stopping, create another dump of it and then terminate it from Task Manager).
B. After the HealthService has stopped, open a command prompt, go to the “c:\program files\System Center Operations Manager 2007\Tools” folder and run the following commands:
Del c:\windows\temp\OpsMgrTrace\*.*  
StartTracing VER     

NOTE: VER is case sensitive.      
C. Start the HealthService and wait 10 minutes to check whether the service recovers.
D. If it does not, stop the trace (StopTracing.cmd in the same Tools folder).
E. Collect all the log files under c:\windows\temp\OpsMgrTrace.

5. Once all the data is captured and you are ready to reboot the system to recover, trigger a blue screen (bug check) on the system instead of restarting it normally. The system will then crash and dump all memory to C:\Windows\memory.dmp. This file records the entire OS state.

Note - If the dump cannot be captured using a bug check, check whether this alternative method works:
Click HERE
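
If the collector from step 1 has to be set up on several servers, the same Logman command line can be assembled from a script. Below is only a minimal sketch in Python under my own assumptions: the function names build_logman_create and start_collector are made up for illustration, and the counter list and switches are simply copied from the command in step 1.

```python
import subprocess

# Counter paths copied from the Logman command in step 1.
COUNTERS = [
    r"\LogicalDisk(*)\*", r"\Memory\*", r"\Network Interface(*)\*",
    r"\Paging File(*)\*", r"\PhysicalDisk(*)\*", r"\Server\*",
    r"\System\*", r"\Process(*)\*", r"\Processor(*)\*", r"\Cache\*",
]


def build_logman_create(name, output_path, counters=COUNTERS,
                        interval="00:00:01", max_mb=500):
    """Return the argument list for 'Logman.exe create counter ...'."""
    return (["Logman.exe", "create", "counter", name,
             "-f", "bincirc", "-max", str(max_mb), "-c"]
            + list(counters)
            + ["-si", interval, "-o", output_path])


def start_collector(name, output_path):
    """Create and start the collector (hypothetical helper; Windows only)."""
    subprocess.run(build_logman_create(name, output_path), check=True)
    subprocess.run(["Logman.exe", "start", name], check=True)
```

Calling start_collector("Perf-1Sec", r"C:\PerfMonLogs\Perf-1Sec.blg") should be equivalent to the two commands in step 1. Passing the arguments as a list avoids quoting problems with counter names that contain spaces.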

After the analysis, we suspected that the issue was not on the SCOM side but with SQL performance, conflicting with a backup job running at the same time.
We therefore captured perfmon logs on all database servers as well, and these logs showed a disk latency issue.
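
To look for disk latency in a captured log without scrolling through the Perfmon graph, the .blg file can be converted to CSV (for example with "relog Perf-1Sec.blg -f csv -o Perf-1Sec.csv") and scanned with a small script. The following is only a sketch under assumptions: the function name slow_disk_samples is made up, and the 25 ms threshold is a rule of thumb I picked for illustration, not a value from this case.

```python
import csv
import io


def slow_disk_samples(csv_text, threshold_sec=0.025):
    """Return (timestamp, counter, seconds) for samples above the threshold.

    threshold_sec is an assumed rule-of-thumb value; tune it for your storage.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Keep only the per-transfer latency counters (Avg. Disk sec/Read, /Write, /Transfer).
    cols = [i for i, name in enumerate(header)
            if "Avg. Disk sec/" in name and "_Total" not in name]
    hits = []
    for row in reader:
        for i in cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # blank or missing sample in the log
            if value > threshold_sec:
                hits.append((row[0], header[i], value))
    return hits
```

Read the CSV file into a string and pass it to slow_disk_samples; each returned tuple is one sample where a disk took longer than the threshold, with the timestamp from the first column.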

Looking at the system, even though the management group is greyed out, it is only the Health Service Watcher objects that are grey; the health service running on each monitored system is still healthy. Since the watcher objects run on the RMS, the problem is more likely related to the RMS status.
Looking at the problem history, although the issue appeared random, it was most often reported around midnight, which is usually when backup tasks run.
Before the RMS reported errors, the SCOM SDK service always logged errors connecting to the remote DB, so we checked the scheduled tasks on the SQL side. In most cases the RMS error time matched the DB backup schedule. As a test, we disabled the midnight backup job, and the RMS problem disappeared. Checking the SQL server further, we found disk latency. We therefore believe the RMS issue was not caused by SCOM configuration; it was a victim of SQL performance.
After the SQL latency problem was fixed, SCOM has been stable.
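
The comparison of RMS error times against the backup schedule can also be done with a few lines of script instead of by eye. This is a rough sketch, not what we actually ran: the function names are hypothetical, and you would supply your own error timestamps (for example, taken from the event log) and your own backup window.

```python
from datetime import datetime, time


def in_window(ts, start, end):
    """True if ts's time of day is in [start, end); handles windows that cross midnight."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def share_in_backup_window(error_times, start, end):
    """Fraction of error timestamps that fall inside the backup window."""
    if not error_times:
        return 0.0
    inside = sum(1 for ts in error_times if in_window(ts, start, end))
    return inside / len(error_times)
```

A result close to 1.0 means almost all errors fall inside the backup window, which is the pattern described above. It only shows a correlation, so disabling the backup job as a test is still the way to confirm it.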

Reference - Here is the process to promote an MS to RMS:  Click HERE