Thursday, July 19, 2012

Microsoft System Center Operations Manager (SCOM) 2007 R2 - Heartbeat failing randomly from any Management Server to any client


Environment:
Operating System - Microsoft Windows Server 2008 R2 Enterprise
Database - Microsoft SQL Server 2008
Application - Microsoft SCOM 2007 R2 with clustered management servers


The issue: heartbeats fail randomly from any management server to any client, with event 20022 logged. This event can have many causes, such as network problems or server performance problems. I ran into this issue and troubleshot it step by step as described below.



To determine whether the issue is on the network side, collect data using the steps below.

Disable TCP Chimney
======================
As a best practice, disable TCP Chimney on the MS, the RMS and the SQL server (on Windows Server 2008 R2: netsh int tcp set global chimney=disabled). More information on TCP Chimney:
http://support.microsoft.com/default.aspx?scid=KB;en-us;q945977



Run SQL Queries
======================
Launch SQL Server Management Studio, run the queries below, and save each result to a .csv file for review.
USE OperationsManager;
-- Table sizes in KB (reserved pages x 8 KB per page), largest data first.
SELECT so.name,
       -- indid 0 = heap, 1 = clustered index: the table data itself
       8 * SUM(CASE WHEN si.indid IN (0, 1) THEN si.reserved END) AS data_kb,
       -- any other indid except 255: nonclustered indexes
       COALESCE(8 * SUM(CASE WHEN si.indid NOT IN (0, 1, 255) THEN si.reserved END), 0) AS index_kb,
       -- indid 255: text/ntext/image (LOB) pages
       COALESCE(8 * SUM(CASE WHEN si.indid IN (255) THEN si.reserved END), 0) AS blob_kb
FROM dbo.sysobjects AS so
JOIN dbo.sysindexes AS si ON (si.id = so.id)
WHERE so.type = 'U'  -- user tables only
GROUP BY so.name
ORDER BY data_kb DESC;

USE OperationsManagerDW;
-- Same query against the data warehouse database.
SELECT so.name,
       8 * SUM(CASE WHEN si.indid IN (0, 1) THEN si.reserved END) AS data_kb,
       COALESCE(8 * SUM(CASE WHEN si.indid NOT IN (0, 1, 255) THEN si.reserved END), 0) AS index_kb,
       COALESCE(8 * SUM(CASE WHEN si.indid IN (255) THEN si.reserved END), 0) AS blob_kb
FROM dbo.sysobjects AS so
JOIN dbo.sysindexes AS si ON (si.id = so.id)
WHERE so.type = 'U'
GROUP BY so.name
ORDER BY data_kb DESC;

Network Logs
=============
Capture a network trace with Network Monitor (Netmon).

1. On the affected machine and on the SCOM MS, download and install Network Monitor 3.4.
2. Run the command below on both servers:
nmcap /network * /capture /file <drive letter>:\nmcap.chn:200M
NOTE: This command can generate a large number of 200 MB files. Choose a drive with sufficient free disk space and monitor disk usage regularly.
3. Once the MS reports event 20022 and the machine name is listed in that event, press Ctrl+C on both servers to stop nmcap, then save the capture files.

MS Tracing Log
======================
1. On the MS, replace C:\Program Files\System Center Operations Manager 2007\Tools\TracingGuidsNative.txt with the attached one.
2. Open a command window on the MS and change directory to C:\Program Files\System Center Operations Manager 2007\Tools
3. Run "StartTracing.cmd VER".
4. Wait until the issue happens.
5. Run "StopTracing.cmd".
6. Run "FormatTracing.cmd".
7. Compress all files under C:\Windows\Temp\OpsMgrTracing
8. Also export the Operations Manager event logs from the MS and the RMS (Root Management Server).

Analysis
As a workaround, I also suggested setting "Number of missed heartbeats allowed" to 10 on the SCOM server.
This at least suppresses some of the false alerts, while still reporting correctly when an MS or an agent is really down.
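
To get a feel for what this setting changes, the sketch below estimates how long an agent can stay silent before a heartbeat failure is raised, as roughly interval times allowed misses. The 60-second heartbeat interval and the earlier value of 3 allowed misses are assumptions for illustration only (neither is stated in this post), and the function name is hypothetical.

```python
def detection_window_seconds(heartbeat_interval_s, missed_allowed):
    """Approximate seconds of silence tolerated before a heartbeat failure."""
    if heartbeat_interval_s <= 0 or missed_allowed < 1:
        raise ValueError("interval must be positive and missed_allowed >= 1")
    return heartbeat_interval_s * missed_allowed


# Assumption: a 60-second heartbeat interval (not stated in the post).
ASSUMED_INTERVAL_S = 60

# 3 is an assumed earlier setting; 10 is the value suggested above.
for allowed in (3, 10):
    window = detection_window_seconds(ASSUMED_INTERVAL_S, allowed)
    print(f"{allowed} missed heartbeats allowed -> about {window} seconds")
```

The trade-off is that a larger value hides short blips but also delays the alert for a machine that is really down.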

The issue persisted even after implementing the changes above, so I collected a Performance Monitor log to trace the performance.
In that log, most system resources, such as CPU and memory, were at healthy levels. Disk performance, however, was not good, especially on the C: and D: drives (the server has two drives).
1. On drive C:, the data transfer rate is not very high (less than 1 MB per second on average), but the disk queue length is quite high. Each I/O on that disk also takes about 0.2 sec to complete, which is poor performance.
2. On drive D:, the I/O load is much higher. About 6 MB is transferred every second, and most of this I/O load comes from HealthService. The time to complete each I/O request is similar to drive C: (about 0.2-0.3 sec, reaching 1 sec at peak times).
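
The kind of check described in the two points above can be sketched as follows: given per-I/O latency samples (for example the "Avg. Disk sec/Transfer" counter exported from Performance Monitor), report the average, the peak, and the share of samples above a threshold. The 25 ms threshold, the function name and the sample values are assumptions for illustration; the samples are only loosely shaped like the D: drive figures and are not the actual log data.

```python
from statistics import mean

# Assumed threshold: treat anything slower than 25 ms per I/O as slow.
# Adjust this to your own baseline.
SLOW_IO_SECONDS = 0.025


def summarize_latency(samples, threshold=SLOW_IO_SECONDS):
    """Summarize per-I/O latency samples given in seconds."""
    if not samples:
        raise ValueError("no samples supplied")
    slow = [s for s in samples if s > threshold]
    return {
        "average": mean(samples),
        "peak": max(samples),
        "slow_fraction": len(slow) / len(samples),
    }


# Illustrative values only, not taken from the real performance log.
d_drive = [0.2, 0.3, 0.25, 1.0, 0.2]
report = summarize_latency(d_drive)
print(report)
```

A slow fraction close to 1 together with a modest transfer rate points at the disk itself rather than at the workload size.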

By comparing the registry settings, I noticed that on RMS cluster node B the Health Service store is located on the D: drive, which is a local drive. On node A it is on the J: drive, which is on the SAN and shared between both nodes. The registry key below is configured differently on the two nodes:

On Node A:



On Node B:


This probably contributed to the disk I/O issue, as the Health Service state should be on the J: drive on both nodes.
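
A comparison like this can be automated once the value has been read from each node. The sketch below only checks whether the nodes keep the same setting on the same drive. The folder paths are hypothetical placeholders (only the drive letters D: and J: come from this post), and the helper names are made up for illustration.

```python
import ntpath


def drive_of(path):
    """Return the upper-case drive part of a Windows path, e.g. 'J:'."""
    drive, _ = ntpath.splitdrive(path)
    return drive.upper()


def compare_drives(settings_by_node, value_name):
    """Map each node to the drive of the given value and flag a mismatch."""
    drives = {
        node: drive_of(values[value_name])
        for node, values in settings_by_node.items()
    }
    return drives, len(set(drives.values())) > 1


# Hypothetical paths; only the drive letters are taken from the post.
nodes = {
    "Node A": {"State Directory": r"J:\Health Service State"},
    "Node B": {"State Directory": r"D:\Health Service State"},
}
drives, mismatch = compare_drives(nodes, "State Directory")
print(drives, "MISMATCH" if mismatch else "consistent")
```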

Solution
After changing the "State Directory" value above to point to the SAN disk, the issue was resolved.



