[collectd] Re: Query: Odd "scsi err" messages to /var/log/messages - RESOLUTION

Wed Mar 22 14:46:23 CET 2006

Hi Florian,

Many thanks for your help. I think your comments have got it sorted out. So,
here is the story:

There is no hardware RAID here - sorry for forgetting to state this
clearly.  I've got 2 SATA drives connected to the normal SATA headers on the
motherboard, with regular Linux software RAID providing the mirroring.
I believe the SATA controller is a VIA chipset, and according to
dmesg the driver in use is:

---paste---
scsi0 : sata_via
scsi1 : sata_via
---endpaste---

hddtemp is installed and running in daemon mode, and is configured to
recognize these drives correctly and report their temperatures. collectd is
configured to use its hddtemp plugin to poll this data.

Testing hddtemp directly begins to tell the story:

(1) verify that /var/log/messages is clear of SCSI error messages - yup
(2) connect to the hddtemp daemon to check its output
(3) repeat step one - bingo, familiar-looking errors have been logged to the
messages file.

Thus:

---paste---
[root@docs log]# nc 127.0.0.1 7634
|/dev/sda|ST3250824AS|32|C|

[root@docs log]# tail messages
...
Mar 22 09:39:07 docs crond(pam_unix)[6270]: session closed for user root
Mar 22 09:39:40 docs kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Mar 22 09:39:40 docs kernel: Invalid sda: sense key No Sense
Mar 22 09:39:40 docs kernel: Additional sense: Filemark detected

---endpaste---
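
The same query can be made without nc. Below is a minimal Python sketch, assuming the daemon listens on 127.0.0.1:7634 as in the paste above and replies with `|`-separated fields (device, model, temperature, unit). The function names are made up for illustration, and the handling of several drives in one reply is an assumption, since only a single-drive reply is shown here:

```python
import socket


def query_hddtemp(host="127.0.0.1", port=7634, timeout=5.0):
    """Read the daemon's whole reply (assumes it sends its data, then closes)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def parse_hddtemp(raw):
    """Split a reply such as '|/dev/sda|ST3250824AS|32|C|' into records."""
    # Drop the empty strings produced by leading/trailing separators.
    fields = [f for f in raw.strip().split("|") if f]
    records = []
    # Assumption: every drive contributes exactly four non-empty fields.
    for i in range(0, len(fields) - 3, 4):
        device, model, temp, unit = fields[i:i + 4]
        records.append({
            "device": device,
            "model": model,
            # A non-numeric temperature field is kept as None.
            "temperature": int(temp) if temp.isdigit() else None,
            "unit": unit,
        })
    return records
```

Calling `parse_hddtemp(query_hddtemp())` should give one record per drive. Note that each query makes hddtemp poll the drive, so on this machine it would be expected to trigger the same kernel messages as the nc test.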

After waiting patiently for 5 minutes, with collectd running but the hddtemp
plugin disabled, no further messages are logged.
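
The "check the log, poll, check again" test can also be scripted. This is a rough Python sketch, assuming the kernel lines look exactly like the ones pasted above and that the log is only appended to between the two scans (no rotation); the helper names are hypothetical:

```python
import re

# Matches lines such as:
# Mar 22 09:39:40 docs kernel: SCSI error : <0 0 0 0> return code = 0x8000002
SCSI_ERROR = re.compile(
    r"kernel: SCSI error : <(\d+) (\d+) (\d+) (\d+)> "
    r"return code = (0x[0-9a-fA-F]+)"
)


def find_scsi_errors(lines):
    """Return (timestamp, return_code) for each 'SCSI error' kernel line."""
    hits = []
    for line in lines:
        match = SCSI_ERROR.search(line)
        if match:
            # Classic syslog lines start with a 15-character timestamp,
            # e.g. 'Mar 22 09:39:40'.
            hits.append((line[:15], match.group(5)))
    return hits


def new_errors(before, after):
    """Errors found by the second scan that the first scan did not see.

    Assumes the log only grew between the scans, so the first
    len(before) hits of the second scan are the old ones.
    """
    return after[len(before):]
```

To use it, run `find_scsi_errors(open("/var/log/messages"))` once before and once after the action under test, then pass both results to `new_errors`. An empty result after five minutes with the hddtemp plugin disabled would match what I saw by hand.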

So it seems the culprit is indeed the hddtemp program, or rather the way it
pulls the temperature data from the drive via SMART.

An interesting aside/test: when I use "smartctl" to evaluate the SMART status
of the HDD, I see all the correct SMART monitoring data, yet NO error is
logged to the messages file (unlike when hddtemp does the SMART poll of the
HDD).  Clearly the two must be doing things in a somewhat different manner.

Anyhow.  My problem is thus solved - it seems I have to abandon the hddtemp
monitoring for now :-)

Many thanks for the help!

---Tim Chipman

---------- Forwarded message ----------
From: Florian Forster <octo at verplant.org>
To: "The system statistics collection daemon 'collectd' list"
<collectd at verplant.org>
Date: Tue, 21 Mar 2006 21:42:03 +0100
Subject: Re: [collectd] Query: Odd "scsi err" messages to /var/log/messages
while collectd is running

Hello Tim,

On Tue, Mar 21, 2006 at 04:23:14PM -0400, Tim Chipman wrote:
> (CentOS 4.2 x86_64, sempron3300+ /64bit with 1gig ram, mirrored 250gb
> SATA HDDs)

What RAID controller/which kernel module are you using for SCSI access
to the drives? Which plugins do you use in collectd? Are the messages
really appearing every two minutes or is syslog summarizing anything?

> If I stop the collectd daemon, the messages stop piling up.  If I
> restart the daemon, the messages resume.  Hence my belief they are
> related.

That sounds weird.. collectd itself doesn't access the SCSI bus at any
time. However, it may be configured to query `hddtempd' which, I
_think_, uses special SCSI commands to get to the SMART data in the
hardware/disk. If you're using the `hddtemp' plugin try to disable it
and check if the problem persists.
If that didn't change anything, or if you never had the `hddtemp' plugin
in the first place, try disabling all but one plugin (I'd suggest
something very easy for that one, like the `load' plugin). Does that
change anything?

Right now I don't have any more ideas, but maybe the diagnostic steps
above reveal something.. Good luck :)

Regards,
-octo
--
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/