Solaris server monitoring error

Hi All,

I have two Solaris 10 servers clustered for an Oracle database. I configured WinRM, installed the agents, enabled SSH for the root account, and so on.

I was able to complete the discovery of the Solaris box, but now I periodically get a heartbeat failure. After a restart, it works for 2 or 3 days and then the heartbeat failure returns.

Please note that the Oracle database is monitored with the BridgeWays management pack integrated with SCOM. A screenshot of the error is below for reference.

If I need to reinstall the agent, will that cause any problems for the BridgeWays configuration that is already in place?




Regards

Machu

September 8th, 2013 11:09am

Can you please give some details on the OpsMgr version used and the Agent you use on the Solaris system?

I have seen this before, but it turned out to be an issue with the network.

/Christian

September 17th, 2013 3:55pm

OpsMgr Version: 2007 R2 CU6

Agent on the Solaris: scx-1.0.4-277.solaris.10.sparc.pkg

Regards

November 24th, 2013 10:31am

Hello,

Are there any errors in the log (/var/opt/microsoft/scx/log/scx.log) that correspond to the heartbeat failures?
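To line the log up with a heartbeat failure, one option is to pull only the Error and Warning entries from a window around the alert time. This is a minimal sketch, assuming each log line starts with an ISO-style timestamp followed by a severity word; `scx_log_window` is a hypothetical helper name, not something shipped with the agent. It uses `awk -v`, so on Solaris 10 it may need `nawk` or `/usr/xpg4/bin/awk` in place of `awk`.

```shell
# Print Error/Warning lines whose timestamp falls in [start, end).
# $1 = start prefix, $2 = end prefix, e.g. 2013-12-26T06:40 2013-12-26T06:50
# Reads the log on stdin. Timestamps are compared as plain strings, which
# works because ISO-style timestamps sort in chronological order.
scx_log_window() {
    awk -v start="$1" -v end="$2" '
        ($1 "") >= (start "") && ($1 "") < (end "") &&
        ($2 == "Error" || $2 == "Warning")
    '
}
```

Example use: `scx_log_window 2013-12-26T06:40 2013-12-26T06:50 < /var/opt/microsoft/scx/log/scx.log`. If the timestamps carry a trailing `Z`, they are probably UTC, so the alert time from the console may need converting first.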

-Kris

November 25th, 2013 11:06pm

Dear 

root@sprodn1:21:/var/opt/microsoft/scx/log# tail -20 scx.log      

2013-12-26T06:46:19,011Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskenumeration:23271:13] StatisticalLogicalDiskEnumeration::SampleDisks() - Unexpected exception caught: Calling readlink() returned an error with errno = 22 (Invalid argument) - [/export/home/serviceb/CoreWrkSpcSparc_5.10_2010_V1/source/code/common_lib/pal/system/disk/diskdepend.cpp:737]; for logical disk /dev/vx/dsk/swx_dg/swxpvol01

2013-12-26T06:46:19,011Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskenumeration:23271:13] StatisticalLogicalDiskEnumeration::SampleDisks() - Unexpected exception caught: Calling readlink() returned an error with errno = 22 (Invalid argument) - [/export/home/serviceb/CoreWrkSpcSparc_5.10_2010_V1/source/code/common_lib/pal/system/disk/diskdepend.cpp:737]; for logical disk /dev/vx/dsk/fcat_dg/fcatpvol01

2013-12-26T06:46:19,011Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskenumeration:23271:13] StatisticalLogicalDiskEnumeration::SampleDisks() - Unexpected exception caught: Calling readlink() returned an error with errno = 22 (Invalid argument) - [/export/home/serviceb/CoreWrkSpcSparc_5.10_2010_V1/source/code/common_lib/pal/system/disk/diskdepend.cpp:737]; for logical disk /dev/vx/dsk/fcat_dg/fcatpvol03

2013-12-26T06:46:19,011Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskenumeration:23271:13] StatisticalLogicalDiskEnumeration::SampleDisks() - Unexpected exception caught: Calling readlink() returned an error with errno = 22 (Invalid argument) - [/export/home/serviceb/CoreWrkSpcSparc_5.10_2010_V1/source/code/common_lib/pal/system/disk/diskdepend.cpp:737]; for logical disk /dev/vx/dsk/fcat_dg/fcatpvol04

2013-12-26T06:46:19,011Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskenumeration:23271:13] StatisticalLogicalDiskEnumeration::SampleDisks() - Unexpected exception caught: Calling readlink() returned an error with errno = 22 (Invalid argument) - [/export/home/serviceb/CoreWrkSpcSparc_5.10_2010_V1/source/code/common_lib/pal/system/disk/diskdepend.cpp:737]; for logical disk /dev/vx/dsk/fcat_dg/fcatpvol02

2013-12-26T06:46:19,011Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskenumeration:23271:13] StatisticalLogicalDiskEnumeration::SampleDisks() - Unexpected exception caught: Calling readlink() returned an error with errno = 22 (Invalid argument) - [/export/home/serviceb/CoreWrkSpcSparc_5.10_2010_V1/source/code/common_lib/pal/system/disk/diskdepend.cpp:737]; for logical disk /dev/vx/dsk/fcatarch_dg/fcatparch_vol

2013-12-26T06:47:07,450Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16805

2013-12-26T06:47:07,453Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16811

2013-12-26T06:47:07,455Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16831

2013-12-26T06:47:07,466Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16819

2013-12-26T06:47:07,469Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16816

2013-12-26T06:47:07,471Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16814

2013-12-26T06:47:07,476Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16836

2013-12-26T06:47:07,478Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16827

2013-12-26T06:47:07,493Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16823

2013-12-26T06:47:07,499Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16825

2013-12-26T06:47:07,503Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16829

2013-12-26T06:47:07,504Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16756

2013-12-26T06:47:07,504Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16833

2013-12-26T06:47:07,505Z Warning    [scx.core.common.pal.system.process.processinstance:23271:6] No data can be gathered for 64-bit process : oracle - 16821

root@sprodn1:21:/var/opt/microsoft/scx/log#
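A `tail -20` only shows the last few seconds of activity, so it can help to summarise the whole log instead. The sketch below counts Error and Warning lines and tallies which logical disks appear in the `for logical disk ...` messages. It assumes the line layout shown in the output above (severity in the second field, disk path as the last field); `scx_log_summary` is a hypothetical name.

```shell
# Summarise an scx.log-style stream read from stdin:
#   "<Severity> <count>" for Error and Warning lines,
#   "disk <path> <count>" for each disk named in "for logical disk" messages.
scx_log_summary() {
    awk '
        $2 == "Error" || $2 == "Warning" { sev[$2]++ }
        /for logical disk / { disk[$NF]++ }
        END {
            for (s in sev)  printf "%s %d\n", s, sev[s]
            for (d in disk) printf "disk %s %d\n", d, disk[d]
        }
    ' | LC_ALL=C sort
}
```

Example use: `scx_log_summary < /var/opt/microsoft/scx/log/scx.log`. The counts only show which messages dominate; they do not by themselves say which one is tied to the heartbeat failure.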

January 7th, 2014 10:38am

After getting the Heartbeat Failure alert, I cross-checked the scxadmin status:

#scxadmin -status

scxcimserver: is running

scxcimprovagt: is stopped
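Since the failure comes back every few days, the same check could be run on a schedule to record when a component stops. This is a rough sketch that only parses text in the format printed above; `scx_all_running` is a hypothetical helper, and the assumption is that a stopped component is always reported with the words `is stopped`.

```shell
# Reads `scxadmin -status` output on stdin.
# Exit status 0: at least one scx* status line was seen and none is stopped.
# Exit status 1: a component is stopped, or no status lines were found.
scx_all_running() {
    awk '
        /^scx[a-z]*:/ { seen = 1; if ($0 ~ /is stopped/) bad = 1 }
        END { exit !(seen && !bad) }
    '
}
```

Example use: `scxadmin -status | scx_all_running || date >> /var/tmp/scx_stopped.log` (the log path is only an example).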

I also noticed that the SCOM process scxcimprovagt is generating core dump files on both nodes of our production cluster, sprodn2 and sprodn1 (see below).

root@sprodn2:/:# uname -a;date

SunOS sprodn2 5.10 Generic_148888-05 sun4u sparc SUNW,SPARC-Enterprise

Tue Dec 24 12:18:49 AST 2013

root@sprodn2:/:#

root@sprodn2:/:# ls -lt core*

-rw-------   1 root     root     66860506 Dec 19 07:10 core 

core_21446:

total 148288

-rw-------   1 root     root     75868658 Dec 20 15:36 core

root@sprodn2:/:#

root@sprodn2:/:# file /core

/core:          ELF 32-bit MSB core file SPARC Version 1, from 'scxcimprovagt'

root@sprodn2:/:#

root@sprodn2:/:# file /core_21446/core

/core_21446/core:       ELF 32-bit MSB core file SPARC Version 1, from 'scxcimprovagt'

root@sprodn2:/:#
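With several core files around, the `file` output can be reduced to a list of which program produced which core. This is a small sketch that assumes the `..., from '<program>'` wording shown in the output above; `core_origin` is a hypothetical name.

```shell
# Reads `file` output on stdin and prints "<program> <core path>"
# for every line that describes a core file.
core_origin() {
    sed -n "s/^\([^:]*\):.*core file.*, from '\([^']*\)'.*/\2 \1/p"
}
```

Example use: `file /core /core_21446/core | core_origin`.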

root@sprodn2:/:# ls -lt /opt/microsoft/scx/bin/scxcimprovagt

-rwxr-xr-x   1 root     root       54704 Mar 22  2011 /opt/microsoft/scx/bin/scxcimprovagt

root@sprodn2:/:#

root@sprodn2:/:# dbx /opt/microsoft/scx/bin/scxcimprovagt ./core

For information about new features see `help changes'

To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc

Reading scxcimprovagt

core file header read successfully

Reading ld.so.1

Reading libpegpmservice.so.1

Reading libpegprovidermanager.so.1

Reading libDefaultProviderManager.so.1

Reading libpegprovider.so.1

Reading libpegconfig.so.1

Reading libpegclient.so.1

Reading libpegqueryexpression.so.1

Reading libpegwql.so.1

Reading libpegquerycommon.so.1

Reading libpegcommon.so.1

Reading libpthread.so.1

Reading libdl.so.1

Reading libsocket.so.1

Reading libnsl.so.1

Reading libxnet.so.1

Reading libCstd.so.1

Reading librt.so.1

Reading libpam.so.1

Reading libCrun.so.1

Reading libm.so.2

Reading libthread.so.1

Reading libc.so.1

Reading libpegprm.so.1

Reading libpegrepository.so.1

Reading libssl.so.0.9.7

Reading libcrypto.so.0.9.7

Reading libaio.so.1

Reading libmd.so.1

Reading libcmd.so.1

Reading libCstd_isa.so.1

Reading libssl_extra.so.0.9.7

Reading libcrypto_extra.so.0.9.7

Reading libc_psr.so.1

Reading libCMPIProviderManager.so.1

Reading libscf.so.1

Reading libdoor.so.1

Reading libuutil.so.1

Reading libgen.so.1

Reading libmp.so.2

Reading libBridgeWaysOracleProviderModule.so.21209.0.0

Reading libclntsh.so.11.1

Reading libnnz11.so

Reading libstdc++.so.6.0.3

Reading libgcc_s.so.1

Reading libkstat.so.1

Reading libresolv.so.2

Reading libsched.so.1

Reading libm.so.1

t@1 (l@1) terminated by signal ABRT (Abort)

0xfea4af84: __lwp_park+0x0014:  bcc,a,pt  %icc,__lwp_park+0x24  ! 0xfea4af94

(dbx) where                                                                 

current thread: t@1

=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0xfec70000, 0x1), at 0xfea4af84

  [2] mutex_lock_queue(0xfec52a00, 0x0, 0xfeac5a60, 0x0, 0x1c00, 0x1d3c), at 0xfea432e0

  [3] malloc(0x6b1, 0x1, 0xea654, 0xfef95748, 0xfeac23f0, 0xfeacc5e0), at 0xfe9d7dd8

  [4] Pegasus::AnonymousPipe::readMessage(0xffbffd04, 0xffbff664, 0xff104aa0, 0xff0f638c, 0x32800, 0x6b0), at 0xfef95fc0

  [5] Pegasus::ProviderAgent::_readAndProcessRequest(0xffbffb88, 0x0, 0x10000000, 0x2ae98, 0x1a131, 0x13c00), at 0x15a10

  [6] Pegasus::ProviderAgent::run(0xffbffb88, 0xffbff79c, 0xffbff784, 0xffbff710, 0x2b784, 0xffbffbc8), at 0x153dc

  [7] main(0xff100e14, 0xc, 0x4e8f0, 0x2ae98, 0xfffed4d0, 0x800), at 0x18510

(dbx) quit

dbx: internal warning: td_ta_clear_event() failed -- debugger service failed

dbx: internal warning: td_ta_sync_tracking_enable(0) failed -- debugger service failed

root@sprodn2:/:#

 

root@sprodn2:/:# scxadmin -status

scxcimserver: is running

scxcimprovagt: 3 instances running

root@sprodn2:/:#

root@sprodn2:/:# scxadmin -restart

svc:/application/management/scx-cimd:default enabled.

svcadm: Instance "svc:/application/management/scx-cimd:default" is in maintenance state.

 

RETURN CODE: 3

root@sprodn2:/:# scxadmin -status

scxcimserver: is stopped

scxcimprovagt: is stopped

root@sprodn2:/:#

root@sprodn2:/:# scxadmin -start

svc:/application/management/scx-cimd:default enabled.

svcadm: Instance "svc:/application/management/scx-cimd:default" is in maintenance state.

 

RETURN CODE: 3

root@sprodn2:/:#

root@sprodn2:/:# svcs -a |grep scx                                         

maintenance    12:35:03 svc:/application/management/scx-cimd:default

root@sprodn2:/:# svcadm disable svc:/application/management/scx-cimd:default

root@sprodn2:/:# svcs -a |grep scx                                         

disabled       12:37:54 svc:/application/management/scx-cimd:default

root@sprodn2:/:# svcadm enable svc:/application/management/scx-cimd:default

root@sprodn2:/:#

root@sprodn2:/:# svcs -a |grep scx                                        

online         12:38:06 svc:/application/management/scx-cimd:default

root@sprodn2:/:#

root@sprodn2:/:# scxadmin -status

scxcimserver: is running

scxcimprovagt: 1 instance running

root@sprodn2:/:#
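The disable/enable sequence above brought the service back online. The usual SMF way to take an instance out of the maintenance state is `svcadm clear`, and `svcs -xv` prints why the instance ended up there. The sketch below wraps both in a hypothetical `scx_recover` function; it only restarts the service and does nothing about whatever makes scxcimprovagt abort.

```shell
# Bring the scx-cimd SMF instance back depending on its current state.
scx_recover() {
    fmri=svc:/application/management/scx-cimd:default
    state=$(svcs -H -o state "$fmri") || return 1
    case "$state" in
        maintenance)
            svcs -xv "$fmri"       # show the reason before clearing it
            svcadm clear "$fmri"   # leave maintenance; SMF then retries the start
            ;;
        disabled)
            svcadm enable "$fmri"
            ;;
        online)
            echo "$fmri is already online"
            ;;
        *)
            echo "$fmri is in state '$state'; leaving it alone"
            ;;
    esac
}
```

After running it, `svcs -a | grep scx` and `scxadmin -status` show whether the instance actually came back.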


  • Edited by machu007 Tuesday, January 07, 2014 7:50 AM added more details
January 7th, 2014 10:44am

Any help/advice would be appreciated.

Thanks.

Kris, Steve... does anyone have any ideas?
January 13th, 2014 12:38am

This topic is archived. No further replies will be accepted.
