Mikko Kortelainen

HP ProLiant Management Component Pack on Ubuntu

HP seems to have set up a package repository for Ubuntu 12.04, which is an improvement over the situation when I last checked a few years ago. To use the repo, add the following line to /etc/apt/sources.list:

deb http://downloads.linux.hp.com/downloads/ManagementComponentPack/ubuntu precise current/non-free

Then run "sudo apt-get update" to refresh the package lists.
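
If you would rather not edit /etc/apt/sources.list by hand, apt also reads *.list files under /etc/apt/sources.list.d/. The following is a minimal sketch of adding the line there without duplicating it on repeated runs. The helper name add_repo_line and the file name hp-mcp.list are illustrative choices, not something HP prescribes.

```shell
#!/bin/sh
# Append a repository line to a sources file unless it is already present.
# add_repo_line and hp-mcp.list are hypothetical names used for illustration.
add_repo_line() {
  file=$1
  line=$2
  # -x matches the whole line, -F treats the pattern as a fixed string
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Usage (as root):
# add_repo_line /etc/apt/sources.list.d/hp-mcp.list \
#   "deb http://downloads.linux.hp.com/downloads/ManagementComponentPack/ubuntu precise current/non-free"
# apt-get update
```

Calling the function a second time with the same arguments leaves the file unchanged.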

You can install a number of software packages from the repository:

  • hpsmh: HP System Management Homepage
  • hp-smh-template: HP System Management Homepage Templates
  • cpqacuxe: HP Array Configuration Utility, web-based
  • hp-snmp-agents: Insight Management SNMP Agents for HP ProLiant Systems
  • hponcfg: RILOE II/iLO online configuration utility
  • hp-health: HP System Health Application and Command line Utility Package
  • hpacucli: HP Command Line Array Configuration Utility
  • ams: Agentless Monitoring Service for HP ProLiant Gen8 Systems

I installed the iLO configuration utility, System Health App and Array Configuration command line utility.

root@host:/etc/apt# apt-get install hponcfg hp-health hpacucli

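I couldn't find a working GPG key for the repository, so apt-get warns that the packages cannot be authenticated. Press y at the prompt to install them anyway, or force the package installation.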

Useful Commands

You can blink the UID light with the hpuid command.

hpasmcli is a tool for showing and setting various system parameters. It has its own prompt:

hpasmcli> show powermeter
Power Meter #1
    Power Reading  : 224

hpasmcli> show powersupply
Power supply #1
    Present  : Yes
    Redundant: No
    Condition: Ok
    Hotplug  : Supported
Power supply #2
    Present  : Yes
    Redundant: No
    Condition: FAILED
    Hotplug  : Supported
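
The "show powersupply" output is easy to post-process. Below is a minimal sketch that prints only the supplies whose Condition is something other than Ok. The function name failed_psus is made up for this example. It assumes hpasmcli's -s option for running commands non-interactively; check "man hpasmcli" on your version before relying on it.

```shell
#!/bin/sh
# Print the power supplies whose Condition is not "Ok".
# Reads "show powersupply" output on stdin; failed_psus is an illustrative name.
failed_psus() {
  awk '
    /^Power supply #/ { name = $0 }       # remember the current supply
    /Condition *:/ {
      sub(/^.*Condition *: */, "")         # keep only the value
      sub(/ *$/, "")
      if ($0 != "Ok") print name ": " $0
    }
  '
}

# Demo on the sample output shown above:
failed_psus <<'EOF'
Power supply #1
    Present  : Yes
    Redundant: No
    Condition: Ok
    Hotplug  : Supported
Power supply #2
    Present  : Yes
    Redundant: No
    Condition: FAILED
    Hotplug  : Supported
EOF
# prints: Power supply #2: FAILED

# On a real host (assuming the -s option is available):
# hpasmcli -s "show powersupply" | failed_psus
```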

A command called "hplog" can be used to view the log:

root@host:~# hplog -v

ID   Severity       Initial Time      Update Time       Count
-------------------------------------------------------------
1026 Repaired       13:44  04/10/2012 13:46  04/10/2012 0001
LOG: System Power Supplies Not Redundant

1027 Repaired       13:46  04/10/2012 13:48  04/10/2012 0001
LOG: System Power Supply: General Failure (Power Supply 2)

1028 Repaired       13:46  04/10/2012 13:48  04/10/2012 0001
LOG: System Power Supplies Not Redundant

1029 Repaired       13:48  04/10/2012 13:49  04/10/2012 0001
LOG: System Power Supply: General Failure (Power Supply 2)

1030 Repaired       13:48  04/10/2012 13:49  04/10/2012 0001
LOG: System Power Supplies Not Redundant

hplog can also show system health information (fans, power supplies, temperatures):

root@host:~# hplog -f
ID     TYPE        LOCATION      STATUS  REDUNDANT FAN SPEED
 1  Var. Speed   I/O Zone        Normal     Yes     Medium ( 45)
 2  Var. Speed   I/O Zone        Normal     Yes     Medium ( 45)
 3  Var. Speed   Processor Zone  Normal     Yes     Medium ( 41)
 4  Var. Speed   Processor Zone  Normal     Yes     Low    ( 36)
 5  Var. Speed   Processor Zone  Normal     Yes     Low    ( 36)
 6  Var. Speed   Processor Zone  Normal     Yes     Low    ( 36)

root@host:~# hplog -p
ID     TYPE        LOCATION      STATUS  REDUNDANT
 1  Standard     Pwr. Supply Bay Normal     No
 2  Standard     Pwr. Supply Bay Failed     No

root@host:~# hplog -t
ID     TYPE        LOCATION      STATUS    CURRENT  THRESHOLD
 1  Basic Sensor I/O Zone        Normal   105F/ 41C 158F/ 70C
 2  Basic Sensor Ambient         Normal    68F/ 20C 102F/ 39C
 3  Basic Sensor CPU (1)         Normal    86F/ 30C 260F/127C
 4  Basic Sensor CPU (1)         Normal    86F/ 30C 260F/127C
 5  Basic Sensor Pwr. Supply Bay Normal   111F/ 44C 170F/ 77C
 6  Basic Sensor CPU (2)         Normal    86F/ 30C 260F/127C
 7  Basic Sensor CPU (2)         Normal    86F/ 30C 260F/127C
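
The CURRENT and THRESHOLD columns make it possible to see how much headroom each sensor has. Here is a rough sketch that prints the sensor ID followed by the margin (threshold minus current) in degrees Celsius. The function name temp_margins is made up for this example, and the pattern assumes the column layout shown above; rows that do not carry two numeric Celsius values are silently skipped.

```shell
#!/bin/sh
# Print "sensor-id margin", where margin = threshold - current in Celsius.
# Reads "hplog -t" style output on stdin; temp_margins is an illustrative name.
temp_margins() {
  # Capture the sensor ID and the two Celsius values. The header line and
  # blank lines do not match the pattern, so -n ... p drops them.
  sed -n 's|^ *\([0-9][0-9]*\) .*/ *\([0-9][0-9]*\)C .*/ *\([0-9][0-9]*\)C *$|\1 \2 \3|p' |
    awk '{ print $1, $3 - $2 }'
}

# Demo on three rows of the sample output shown above:
temp_margins <<'EOF'
ID     TYPE        LOCATION      STATUS    CURRENT  THRESHOLD
 1  Basic Sensor I/O Zone        Normal   105F/ 41C 158F/ 70C
 2  Basic Sensor Ambient         Normal    68F/ 20C 102F/ 39C
 3  Basic Sensor CPU (1)         Normal    86F/ 30C 260F/127C
EOF
# prints:
# 1 29
# 2 19
# 3 97
```

On a real host this would be used as "hplog -t | temp_margins", for example to warn before a sensor actually leaves the Normal state.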

Array Configuration Utility

"hpacucli" is the Smart Array configuration tool. Here are some examples from its interactive console (=> is the prompt):

root@host:~# hpacucli
HP Array Configuration Utility CLI 9.30.15.0
Detecting Controllers...Done.
Type "help" for a list of supported commands.
Type "exit" to close the console.
=> ctrl all show

Smart Array P400 in Slot 3                (sn: P61630D9SV063I)

=> ctrl slot=3 show

Smart Array P400 in Slot 3
   Bus Interface: PCI
   Slot: 3
   Serial Number: P61630D9SV063I
   Cache Serial Number: PA2270H9VV23RN
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: D
   Firmware Version: 7.22
   Rebuild Priority: Medium
   Expand Priority: Medium
   Surface Scan Delay: 3 secs
   Surface Scan Mode: Idle
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 15 secs
   Cache Board Present: True
   Cache Status: OK
   Cache Ratio: 25% Read / 75% Write
   Drive Write Cache: Enabled
   Total Cache Size: 512 MB
   Total Cache Memory Available: 464 MB
   No-Battery Write Cache: Enabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: True

=> ctrl slot=3 array all show

Smart Array P400 in Slot 3

   array A (SAS, Unused Space: 0  MB)

=> ctrl slot=3 array A show

Smart Array P400 in Slot 3

   Array: A
      Interface Type: SAS
      Unused Space: 0  MB
      Status: OK
      Array Type: Data

=> ctrl slot=3 physicaldrive all show

Smart Array P400 in Slot 3

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)

The utility also understands commands directly from the command line:

root@host:~# hpacucli ctrl slot=3 show status

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK
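
Because hpacucli accepts commands as arguments, its output can be piped into ordinary text tools. As a small sketch, the function below reduces the "physicaldrive all show" listing to one "address status" pair per drive. The name drive_status is made up for this example, and the parsing assumes the status is the last comma-separated field inside the parentheses, as in the output shown earlier.

```shell
#!/bin/sh
# Print "drive-address status" for every physical drive.
# Reads "physicaldrive all show" output on stdin; drive_status is an
# illustrative helper name.
drive_status() {
  awk -F', ' '
    $1 ~ /^ *physicaldrive / {
      status = $NF
      sub(/\) *$/, "", status)     # drop the closing parenthesis
      split($1, head, " ")         # head[2] is the drive address, e.g. 1I:1:2
      print head[2], status
    }
  '
}

# Demo on the sample output shown earlier:
drive_status <<'EOF'
Smart Array P400 in Slot 3

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)
EOF
# prints:
# 1I:1:1 OK
# 1I:1:2 OK
# 1I:1:3 OK
```

On a real host, "hpacucli ctrl slot=3 physicaldrive all show | drive_status | grep -v ' OK$'" would then list only the drives that need attention.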

E-mail Alerts

To get e-mail out of the system, I installed Postfix.

root@host:~# apt-get install postfix mailutils

Select "Internet Site". After installation, do a reconfiguration:

root@host:~# dpkg-reconfigure postfix

Select "Internet Site" again. Give your username as the recipient for root and postmaster.

Use the default destination list, and answer no when asked whether to force synchronous updates.

The next question is about where to accept mail from. The default is localhost only, which is good for my purposes, because this is not a proper mail server.

For the rest of the questions I just use the defaults. For additional security, you can edit /etc/postfix/main.cf and change the line

inet_interfaces = all

to:

inet_interfaces = loopback-only

Restart Postfix. To forward all important mail from the system to yourself, edit /etc/aliases:

postmaster: root
root:       your@email.here

Then rebuild the alias database:

root@host:~# newaliases

Now you will get all root mail. I also like to change root's full name, which shows up as the sender of the e-mail. This way I can see which host's root is sending me mail.

chfn -f "Hostname Root" root

Hardware Health Check Script

For Smart Array checking, I wrote this little script and put it in /usr/local/sbin/smart_array_check:

#!/bin/sh

SLOT="3"      # Your controller slot number
ARRAY="A"     # Your array letter
EMAIL="root"  # You can put your e-mail address here

# This function is called if checks don't pass. Send e-mail.
Notify()
{
  SUBJECT=$1
  {
    echo $SUBJECT
    echo
    echo Check date: $(date +"%F %T%:::z")
    echo
    echo Controller Status:
    hpacucli ctrl slot=$SLOT show

    echo Array Status:
    hpacucli ctrl slot=$SLOT array $ARRAY show

    echo Physical Drives:
    hpacucli ctrl slot=$SLOT physicaldrive all show

    echo Physical Drive Details:
    for DRIVE in $(hpacucli ctrl slot=$SLOT physicaldrive all show | grep physicaldrive | awk '{ print $2 }')
    do
      hpacucli ctrl slot=$SLOT physicaldrive $DRIVE show
    done
  } | mail -s "$SUBJECT" $EMAIL
}

# Check that there's a line saying 'Controller Status: OK' etc.

hpacucli ctrl slot=$SLOT show status \
  | grep -q 'Controller Status: OK'  \
  || Notify "Smart Array CONTROLLER FAILURE at $(hostname)"

hpacucli ctrl slot=$SLOT show status \
  | grep -q 'Cache Status: OK'       \
  || Notify "Smart Array CACHE FAILURE at $(hostname)"

hpacucli ctrl slot=$SLOT show status       \
  | grep -q 'Battery/Capacitor Status: OK' \
  || Notify "Smart Array BATTERY FAILURE at $(hostname)"

hpacucli ctrl slot=$SLOT array $ARRAY show \
  | grep -q 'Status: OK' \
  || Notify "Smart Array ARRAY FAILURE at $(hostname)"

# This is for testing:
#Notify "This is a test"

For other hardware health checks I wrote this one and put it in /usr/local/sbin/hw_health_check:

#!/bin/sh

EMAIL="root"  # You can put your e-mail address here

# This function is called if checks don't pass
Notify()
{
  # Something went wrong. Send e-mail
  SUBJECT=$1
  {
    echo $SUBJECT
    echo
    echo Check date: $(date +"%F %T%:::z")
    echo
    echo '== Power Supply Status =='
    echo
    hplog -p

    echo '== System Fan Status =='
    echo
    hplog -f

    echo '== Temperatures =='
    echo
    hplog -t

    echo '== HP System Log =='
    hplog -v

  } | mail -s "$SUBJECT" $EMAIL
}

# Power supply check with hplog -p:
#
# ID     TYPE        LOCATION      STATUS  REDUNDANT
#  1  Standard     Pwr. Supply Bay Normal     No
#  2  Standard     Pwr. Supply Bay Normal     No
#
# A failed power supply looks like this:
#
# ID     TYPE        LOCATION      STATUS  REDUNDANT
#  1  Standard     Pwr. Supply Bay Normal     No
#  2  Standard     Pwr. Supply Bay Failed     No
#
# A removed power supply looks like this:
#
# ID     TYPE        LOCATION      STATUS  REDUNDANT
#  1  Standard     Pwr. Supply Bay Normal     No
#  2  Standard     Pwr. Supply Bay Absent     No
#
# The total number of power supplies should equal the number of
# power supplies with the status "Normal"

TOTAL_PSU_COUNT=$(hplog -p | tail -n +2 | head -n -1 | wc -l)
OK_PSU_COUNT=$(hplog -p | tail -n +2 | head -n -1 | grep -c Normal)

if [ "$TOTAL_PSU_COUNT" != "$OK_PSU_COUNT" ]
then
  Notify "SYSTEM POWER SUPPLY PROBLEM at $(hostname)"
fi

# Fan check with hplog -f:
#
# ID     TYPE        LOCATION      STATUS  REDUNDANT FAN SPEED
#  1  Var. Speed   I/O Zone        Normal     Yes     Medium ( 45)
#  2  Var. Speed   I/O Zone        Normal     Yes     Medium ( 45)
#  3  Var. Speed   Processor Zone  Normal     Yes     Medium ( 41)
#  4  Var. Speed   Processor Zone  Normal     Yes     Low    ( 36)
#  5  Var. Speed   Processor Zone  Normal     Yes     Low    ( 36)
#  6  Var. Speed   Processor Zone  Normal     Yes     Low    ( 36)
#
# The total number of fans should equal the number of
# fans with the status "Normal"

TOTAL_FAN_COUNT=$(hplog -f | tail -n +2 | head -n -1 | wc -l)
OK_FAN_COUNT=$(hplog -f | tail -n +2 | head -n -1 | grep -c Normal)

if [ "$TOTAL_FAN_COUNT" != "$OK_FAN_COUNT" ]
then
  Notify "SYSTEM FAN PROBLEM at $(hostname)"
fi

# Temperature check with hplog -t:
#
# ID     TYPE        LOCATION      STATUS    CURRENT  THRESHOLD
#  1  Basic Sensor I/O Zone        Normal   105F/ 41C 158F/ 70C
#  2  Basic Sensor Ambient         Normal    69F/ 21C 102F/ 39C
#  3  Basic Sensor CPU (1)         Normal    86F/ 30C 260F/127C
#  4  Basic Sensor CPU (1)         Normal    86F/ 30C 260F/127C
#  5  Basic Sensor Pwr. Supply Bay Normal   111F/ 44C 170F/ 77C
#  6  Basic Sensor CPU (2)         Normal    86F/ 30C 260F/127C
#  7  Basic Sensor CPU (2)         Normal    86F/ 30C 260F/127C
#
# The total number of temperature readings should equal the number of
# temperature readings with the status "Normal"

TOTAL_TEMP_COUNT=$(hplog -t | tail -n +2 | head -n -1 | wc -l)
OK_TEMP_COUNT=$(hplog -t | tail -n +2 | head -n -1 | grep -c Normal)

if [ "$TOTAL_TEMP_COUNT" != "$OK_TEMP_COUNT" ]
then
  Notify "SYSTEM TEMPERATURE PROBLEM at $(hostname)"
fi

# This is for testing:
#Notify "This is a test"
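
The three checks above all apply the same rule: the number of rows must equal the number of rows marked "Normal". As a sketch of how that rule could be factored into one helper, the function below reads an hplog table on stdin and succeeds only if every data row is Normal. The name all_normal is made up, and the behaviour differs from the script in two ways: rows are recognised by their leading numeric ID rather than by trimming the first and last line, and an empty table counts as a failure.

```shell
#!/bin/sh
# Succeed only if every data row of an hplog-style table has status "Normal".
# Data rows are lines starting with a numeric ID, so the header and blank
# lines are ignored. all_normal is an illustrative helper name.
all_normal() {
  awk '
    /^ *[0-9]+ / { total++; if (/ Normal /) ok++ }
    END {
      if (total > 0 && total == ok) exit 0
      exit 1                       # a bad row, or no rows at all
    }
  '
}

# Demo on the failed power supply output shown earlier:
if all_normal <<'EOF'
ID     TYPE        LOCATION      STATUS  REDUNDANT
 1  Standard     Pwr. Supply Bay Normal     No
 2  Standard     Pwr. Supply Bay Failed     No
EOF
then
  echo "all normal"
else
  echo "problem found"
fi
# prints: problem found
```

Inside the script, each check would then shrink to a line such as: hplog -p | all_normal || Notify "SYSTEM POWER SUPPLY PROBLEM at $(hostname)"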

Make both scripts executable and add them to root's crontab with "crontab -e":

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 0,6,12,18 * * * smart_array_check
0 0,6,12,18 * * * hw_health_check

This runs the checks four times a day (at 00:00, 06:00, 12:00 and 18:00) and sends an e-mail each time a check finds a failure.
