Microsoft System Center Operations Manager (SCOM) 2007 R2 - Heartbeat failing randomly from any Management Server to any client.
Environment:
Operating System - Microsoft Windows 2008 R2 Enterprise
Database - Microsoft SQL Server 2008
Application - Microsoft SCOM 2007 R2 with Management Servers in a cluster.
The issue: heartbeats fail randomly from any Management Server (MS) to any client, raising event 20022. This event can have many causes, such as network problems or server performance problems. I ran into this issue and troubleshot it step by step as described below.
Network Logs
=============
Capture a network trace, for example with the Netmon tool.
To analyse whether the issue is on the network side, I have provided the data collection steps below.
Disable TCP Chimney
======================
As a best practice, disable TCP Chimney on the MS, the RMS and the SQL server. More information on TCP Chimney is available here:
http://support.microsoft.com/default.aspx?scid=KB;en-us;q945977
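As a rough sketch of this step: on Windows Server 2008 and 2008 R2, TCP Chimney Offload is normally switched with netsh. The small Python helper below only wraps the two netsh commands; the helper names (chimney_commands, disable_chimney) are hypothetical, and the commands need an elevated prompt on each MS, the RMS and the SQL server.

```python
import platform
import subprocess

# netsh commands for TCP Chimney Offload on Windows Server 2008 / 2008 R2.
# The helper names in this sketch are illustrative, not part of SCOM or Windows.
SHOW_CMD = ["netsh", "int", "tcp", "show", "global"]
DISABLE_CMD = ["netsh", "int", "tcp", "set", "global", "chimney=disabled"]


def chimney_commands():
    """Return the commands in the order they would be run: check, disable, re-check."""
    return [SHOW_CMD, DISABLE_CMD, SHOW_CMD]


def disable_chimney():
    """Run the commands; only meaningful on Windows, from an elevated prompt."""
    if platform.system() != "Windows":
        raise RuntimeError("netsh is only available on Windows")
    for cmd in chimney_commands():
        subprocess.run(cmd, check=True)


# Print the command lines so they can also be typed by hand.
for cmd in chimney_commands():
    print(" ".join(cmd))
```

Calling disable_chimney() on each server runs the same three commands; the printed lines can equally be pasted into a command window.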
Run SQL Queries Below
Launch SQL Server Management Studio, run the queries below, save each result to a .csv file and send the files to me.
USE OperationsManager
SELECT so.name,
       8 * SUM(CASE WHEN si.indid IN (0, 1) THEN si.reserved END) AS data_kb,
       COALESCE(8 * SUM(CASE WHEN si.indid NOT IN (0, 1, 255) THEN si.reserved END), 0) AS index_kb,
       COALESCE(8 * SUM(CASE WHEN si.indid IN (255) THEN si.reserved END), 0) AS blob_kb
FROM dbo.sysobjects AS so
JOIN dbo.sysindexes AS si ON (si.id = so.id)
WHERE 'U' = so.type
GROUP BY so.name
ORDER BY data_kb DESC
USE OperationsManagerDW
SELECT so.name,
       8 * SUM(CASE WHEN si.indid IN (0, 1) THEN si.reserved END) AS data_kb,
       COALESCE(8 * SUM(CASE WHEN si.indid NOT IN (0, 1, 255) THEN si.reserved END), 0) AS index_kb,
       COALESCE(8 * SUM(CASE WHEN si.indid IN (255) THEN si.reserved END), 0) AS blob_kb
FROM dbo.sysobjects AS so
JOIN dbo.sysindexes AS si ON (si.id = so.id)
WHERE 'U' = so.type
GROUP BY so.name
ORDER BY data_kb DESC
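To make the arithmetic in these two queries easier to follow, here is a minimal Python sketch of the same grouping. It assumes what the queries themselves imply: reserved is a page count, a page is 8 KB, indid 0 or 1 is counted as data, indid 255 as blob, and anything else as index. The table names and page counts are made up for illustration.

```python
from collections import defaultdict

PAGE_KB = 8  # SQL Server pages are 8 KB, hence the "8 *" in the queries


def table_sizes(rows):
    """rows: iterable of (table_name, indid, reserved_pages) tuples."""
    sizes = defaultdict(lambda: {"data_kb": 0, "index_kb": 0, "blob_kb": 0})
    for name, indid, reserved in rows:
        if indid in (0, 1):      # counted as data_kb by the query
            key = "data_kb"
        elif indid == 255:       # counted as blob_kb by the query
            key = "blob_kb"
        else:                    # everything else is counted as index_kb
            key = "index_kb"
        sizes[name][key] += PAGE_KB * reserved
    # Largest tables first, like ORDER BY data_kb DESC
    return sorted(sizes.items(), key=lambda kv: kv[1]["data_kb"], reverse=True)


# Hypothetical sample rows, not real SCOM data
sample = [
    ("TableA", 1, 1000),
    ("TableA", 2, 250),
    ("TableB", 0, 4000),
    ("TableB", 255, 10),
]
for name, kb in table_sizes(sample):
    print(name, kb)
```

The tables at the top of the output are the ones to look at first when checking whether the databases have grown unusually large.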
Capture the network trace as follows:
1. On the selected machine and on the SCOM MS, download and install Network Monitor 3.4.
2. Run the command below on both servers:
nmcap /network * /capture /file <drive letter>:\nmcap.chn:200M
NOTE: This command can generate a large number of 200 MB files. Select a drive with sufficient free disk space and monitor the disk usage regularly.
3. Once the MS reports event 20022 and the selected machine's name is listed in that event, press Ctrl+C on both servers to stop nmcap and save the capture files.
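Because the chained capture keeps writing 200 MB files until it is stopped, it can help to estimate how many files a drive can hold before starting. The following is a minimal sketch, assuming 200M means roughly 200 * 1024 * 1024 bytes; the function names and the 5 GB reserve are hypothetical choices, not nmcap settings.

```python
import shutil

# Approximate size of one chained capture file (the 200M in the nmcap command)
CHAIN_FILE_BYTES = 200 * 1024 * 1024


def capture_files_that_fit(free_bytes, reserve_bytes=0):
    """How many capture files fit into the free space while keeping a reserve."""
    usable = max(free_bytes - reserve_bytes, 0)
    return usable // CHAIN_FILE_BYTES


def check_drive(path, reserve_bytes=5 * 1024**3):
    """Same estimate for the drive that holds `path`, using its current free space."""
    free = shutil.disk_usage(path).free
    return capture_files_that_fit(free, reserve_bytes)


# Example with a hypothetical 50 GB free and a 5 GB reserve
print(capture_files_that_fit(50 * 1024**3, 5 * 1024**3))
```

Running check_drive on the capture drive from time to time is one way to do the regular disk usage monitoring mentioned in the note.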
MS Tracing Log
======================
1. On the MS server, replace C:\Program Files\System Center Operations Manager 2007\Tools\TracingGuidsNative.txt with the attached one.
2. Open a command window on the MS server and change directory to C:\Program Files\System Center Operations Manager 2007\Tools
3. Run "Starttracing.cmd VER".
4. Wait until the issue happens.
5. Run "stoptracing.cmd".
6. Run "formattracing.cmd".
7. Compress all files under C:\Windows\Temp\OpsMgrTracing.
8. Also export the Operations Manager event logs from the MS and the RMS (Root Management Server).
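Step 7 can be scripted if the traces have to be collected more than once. The sketch below uses only the Python standard library; the function name compress_traces and the archive name are hypothetical, and the folder is the one named in step 7.

```python
import shutil
from pathlib import Path

# Output folder named in step 7
TRACE_DIR = Path(r"C:\Windows\Temp\OpsMgrTracing")


def compress_traces(trace_dir=TRACE_DIR, archive_base="OpsMgrTracing"):
    """Zip everything under trace_dir and return the path of the .zip created."""
    trace_dir = Path(trace_dir)
    if not trace_dir.is_dir():
        raise FileNotFoundError(f"{trace_dir} does not exist")
    # archive_base should point outside trace_dir so the archive does not include itself
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(trace_dir))
```

Calling compress_traces() on the MS after step 6 writes OpsMgrTracing.zip to the current directory.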
Analysis
As a workaround, I also suggested setting "Number of missed heartbeats allowed" to 10 on the SCOM server. This at least suppresses some false alerts, while we still get correct information if an MS or agent is really down.
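The effect of this setting is simple arithmetic: an agent is flagged only after the allowed number of heartbeats has been missed in a row. The sketch below assumes a 60-second heartbeat interval and a previous value of 3 missed heartbeats; both are assumptions (commonly cited defaults), not values taken from this environment.

```python
def seconds_until_heartbeat_alert(interval_seconds, missed_allowed):
    """Approximate time an agent can stay silent before it is flagged."""
    return interval_seconds * missed_allowed


# Assumed values: 60 s interval, 3 missed heartbeats before the change
print(seconds_until_heartbeat_alert(60, 3))   # 180
print(seconds_until_heartbeat_alert(60, 10))  # 600
```

Under these assumptions, raising the value to 10 means short interruptions no longer raise an alert, at the cost of learning about a real outage several minutes later.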
The issue persisted even after the above changes, so I collected a Performance Monitor log to check server performance. According to that log, most system resources, such as CPU and memory, were at healthy levels, but disk performance was poor, especially on C: and D: (the server has two drives).
1. On drive C:, the data transfer rate is not very high (less than 1 MB per second on average), but the disk queue length is quite high. Each I/O on that disk also takes about 0.2 sec to complete, which is poor performance.
2. On drive D:, the I/O load is much higher: about 6 MB of data is transferred every second, and most of this load comes from HealthService. The time to complete each I/O request is similar to drive C: (about 0.2-0.3 sec, reaching 1 sec at peak times).
Comparing the registry settings, I noticed that on RMS cluster node B the Health Service store is located on the D: drive, which is a local drive. On node A it is on the J: drive, which is on the SAN and shared between both nodes. The registry key below is configured differently:
On Node A:
On Node B:
This probably contributed to the disk I/O issue, as the Health Service state should be on the J: drive instead.
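The same comparison can be automated once the "State Directory" value has been read from each node. A minimal sketch follows, assuming the values are already available as strings; the folder names are hypothetical and only the drive letters (J: and D:) come from the findings above.

```python
import ntpath


def state_dir_drive(path):
    """Return the upper-cased drive (for example 'J:') of a Windows path."""
    drive, _ = ntpath.splitdrive(path)
    return drive.upper()


def nodes_off_shared_drive(state_dirs, shared_drive):
    """List the nodes whose State Directory is not on the shared drive."""
    return sorted(
        node for node, path in state_dirs.items()
        if state_dir_drive(path) != shared_drive.upper()
    )


# Hypothetical paths: only the drive letters come from the text above
state_dirs = {
    "NodeA": r"J:\Health Service State",
    "NodeB": r"D:\Health Service State",
}
print(nodes_off_shared_drive(state_dirs, "J:"))
```

Any node returned by nodes_off_shared_drive keeps its Health Service state on a local disk and is a candidate for the same fix.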
Solution
After changing the above "State Directory" to the SAN disk, the issue was resolved.