HP ProLiant Management Component Pack on Ubuntu
HP seems to have set up a package repository for Ubuntu 12.04, which is an improvement since I last checked a few years ago. To use the repo, add the following line to /etc/apt/sources.list:
deb http://downloads.linux.hp.com/downloads/ManagementComponentPack/ubuntu precise current/non-free
Run "sudo apt-get update".
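Instead of editing /etc/apt/sources.list itself, the same line can go into a separate file under /etc/apt/sources.list.d/. Here is a minimal sketch of that. The helper name mcp_repo_line and the file name hp-mcp.list are my own assumptions, and I have only seen the "precise" release in this repository:

```shell
#!/bin/sh
# Print the apt source line for the HP Management Component Pack repository.
# $1: Ubuntu release codename (this article uses "precise").
# mcp_repo_line is a hypothetical helper name, not part of any HP package.
mcp_repo_line() {
    printf 'deb http://downloads.linux.hp.com/downloads/ManagementComponentPack/ubuntu %s current/non-free\n' "$1"
}

# Usage, as root (the file name hp-mcp.list is an assumption):
#   mcp_repo_line precise > /etc/apt/sources.list.d/hp-mcp.list
#   apt-get update
```

Keeping the line in its own file makes it easy to remove the repository again later.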
You can install a number of software packages from the repository:
- hpsmh: HP System Management Homepage
- hp-smh-template: HP System Management Homepage Templates
- cpqacuxe: HP Array Configuration Utility, web-based
- hp-snmp-agents: Insight Management SNMP Agents for HP ProLiant Systems
- hponcfg: RILOE II/iLO online configuration utility
- hp-health: HP System Health Application and Command line Utility Package
- hpacucli: HP Command Line Array Configuration Utility
- ams: Agentless Monitoring Service for HP ProLiant Gen8 Systems
I installed the iLO configuration utility, System Health App and Array Configuration command line utility.
root@host:/etc/apt# apt-get install hponcfg hp-health hpacucli
I couldn't find a working GPG key for the repository, so apt-get warns that the packages cannot be authenticated. Press y to confirm, or force package installation.
Useful Commands
You can blink the UID light with the hpuid command.
hpasmcli is an interactive tool for showing and setting various system parameters.
hpasmcli> show powermeter
Power Meter #1
Power Reading : 224
hpasmcli> show powersupply
Power supply #1
Present : Yes
Redundant: No
Condition: Ok
Hotplug : Supported
Power supply #2
Present : Yes
Redundant: No
Condition: FAILED
Hotplug : Supported
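hpasmcli can also run a command non-interactively with the -s option, which makes the output easy to filter in a script. The following is a rough sketch that lists power supplies whose condition is anything other than "Ok". It assumes the output layout shown above, and failed_psus is a hypothetical helper name:

```shell
#!/bin/sh
# Read "show powersupply" output on stdin and print the number of every
# power supply whose Condition line is anything other than "Ok".
# Assumes the "Power supply #N" / "Condition: ..." layout shown above.
failed_psus() {
    awk '
        /^[ \t]*Power supply #/ {
            # Strip the label so only the supply number remains.
            sub(/^[ \t]*Power supply #/, "")
            psu = $0 + 0
        }
        /Condition[ ]*:/ { if ($NF != "Ok") print psu }
    '
}

# Usage on a ProLiant host:
#   hpasmcli -s "show powersupply" | failed_psus
```

With the output above, this would print 2 for the failed second power supply.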
A command called "hplog" can be used to view the log:
root@host:~# hplog -v
ID   Severity  Initial Time      Update Time       Count
-------------------------------------------------------------
1026 Repaired  13:44 04/10/2012  13:46 04/10/2012  0001
LOG: System Power Supplies Not Redundant
1027 Repaired  13:46 04/10/2012  13:48 04/10/2012  0001
LOG: System Power Supply: General Failure (Power Supply 2)
1028 Repaired  13:46 04/10/2012  13:48 04/10/2012  0001
LOG: System Power Supplies Not Redundant
1029 Repaired  13:48 04/10/2012  13:49 04/10/2012  0001
LOG: System Power Supply: General Failure (Power Supply 2)
1030 Repaired  13:48 04/10/2012  13:49 04/10/2012  0001
LOG: System Power Supplies Not Redundant
hplog can also show system health information (fans, power supplies, temperatures):
root@host:~# hplog -f
ID  TYPE        LOCATION        STATUS  REDUNDANT  FAN SPEED
1   Var. Speed  I/O Zone        Normal  Yes        Medium ( 45)
2   Var. Speed  I/O Zone        Normal  Yes        Medium ( 45)
3   Var. Speed  Processor Zone  Normal  Yes        Medium ( 41)
4   Var. Speed  Processor Zone  Normal  Yes        Low    ( 36)
5   Var. Speed  Processor Zone  Normal  Yes        Low    ( 36)
6   Var. Speed  Processor Zone  Normal  Yes        Low    ( 36)
root@host:~# hplog -p
ID  TYPE      LOCATION         STATUS  REDUNDANT
1   Standard  Pwr. Supply Bay  Normal  No
2   Standard  Pwr. Supply Bay  Failed  No
root@host:~# hplog -t
ID  TYPE          LOCATION         STATUS  CURRENT    THRESHOLD
1   Basic Sensor  I/O Zone         Normal  105F/ 41C  158F/ 70C
2   Basic Sensor  Ambient          Normal   68F/ 20C  102F/ 39C
3   Basic Sensor  CPU (1)          Normal   86F/ 30C  260F/127C
4   Basic Sensor  CPU (1)          Normal   86F/ 30C  260F/127C
5   Basic Sensor  Pwr. Supply Bay  Normal  111F/ 44C  170F/ 77C
6   Basic Sensor  CPU (2)          Normal   86F/ 30C  260F/127C
7   Basic Sensor  CPU (2)          Normal   86F/ 30C  260F/127C
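The temperature table shows both the current reading and the threshold, so it is possible to spot sensors that are getting close to their limit before the status changes. This is only a sketch under the assumption that every sensor row contains two Celsius values in the "41C" form shown above. temps_near_threshold is a hypothetical helper name:

```shell
#!/bin/sh
# Read "hplog -t" output on stdin and print "ID CURRENT THRESHOLD" (in
# Celsius) for every sensor whose margin to the threshold is below $1.
# Assumes each sensor row holds two values of the form "<digits>C".
temps_near_threshold() {
    awk -v min="$1" '
    {
        line = $0
        n = 0
        # Collect every "<digits>C" value on the line, left to right.
        while (match(line, /[0-9]+C/)) {
            c[++n] = substr(line, RSTART, RLENGTH - 1) + 0
            line = substr(line, RSTART + RLENGTH)
        }
        # First value is the current reading, second is the threshold.
        if (n >= 2 && c[2] - c[1] < min) print $1, c[1], c[2]
    }'
}

# Usage on a ProLiant host (margin of 20 degrees Celsius):
#   hplog -t | temps_near_threshold 20
```

With the readings above and a margin of 20, only the ambient sensor (20C against a 39C threshold) would be listed.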
Array Configuration Utility
"hpacucli" is the Smart Array configuration tool. Some examples (the prompt is "=>"):
root@host:~# hpacucli
HP Array Configuration Utility CLI 9.30.15.0
Detecting Controllers...Done.
Type "help" for a list of supported commands.
Type "exit" to close the console.
=> ctrl all show
Smart Array P400 in Slot 3 (sn: P61630D9SV063I)
=> ctrl slot=3 show
Smart Array P400 in Slot 3
Bus Interface: PCI
Slot: 3
Serial Number: P61630D9SV063I
Cache Serial Number: PA2270H9VV23RN
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Hardware Revision: D
Firmware Version: 7.22
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 3 secs
Surface Scan Mode: Idle
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 15 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 25% Read / 75% Write
Drive Write Cache: Enabled
Total Cache Size: 512 MB
Total Cache Memory Available: 464 MB
No-Battery Write Cache: Enabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
=> ctrl slot=3 array all show
Smart Array P400 in Slot 3
array A (SAS, Unused Space: 0 MB)
=> ctrl slot=3 array A show
Smart Array P400 in Slot 3
Array: A
Interface Type: SAS
Unused Space: 0 MB
Status: OK
Array Type: Data
=> ctrl slot=3 physicaldrive all show
Smart Array P400 in Slot 3
array A
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)
The utility also understands commands directly from the command line:
root@host:~# hpacucli ctrl slot=3 show status
Smart Array P400 in Slot 3
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK
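Because the one-shot form prints one "Status:" line per component, it is easy to filter for anything that is not OK. Here is a minimal sketch, assuming the "show status" output keeps the layout shown above. not_ok_status is a hypothetical helper name:

```shell
#!/bin/sh
# Read "hpacucli ctrl slot=N show status" output on stdin and print every
# "... Status: ..." line whose last word is not "OK".
# Exit status: 0 when all status lines are OK, 1 otherwise.
not_ok_status() {
    awk '
        /Status:/ && $NF != "OK" { print; bad = 1 }
        END { exit bad + 0 }
    '
}

# Usage on a ProLiant host:
#   hpacucli ctrl slot=3 show status | not_ok_status || echo "controller needs attention"
```

The exit status makes it usable directly in an if statement or with "||".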
E-mail Alerts
To get e-mail out of the system, I installed Postfix.
root@host:~# apt-get install postfix mailutils
Select "Internet Site". After installation, do a reconfiguration:
root@host:~# dpkg-reconfigure postfix
Select "Internet Site" again. Give your username as the recipient for root and postmaster.
Use the default destination list and answer "No" to forcing synchronous updates.
The next question asks where to accept mail from. The default is localhost only, which is good for my purposes, because this is not a proper mail server.
For the rest of the questions I just used the defaults. For additional security, you can edit /etc/postfix/main.cf and change the line
inet_interfaces = all
to:
inet_interfaces = loopback-only
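If you set up several hosts, the same edit can be scripted. The following is a sketch only: set_loopback_only is a hypothetical helper name, and it relies on GNU sed's -i option, which is what Ubuntu ships:

```shell
#!/bin/sh
# Set inet_interfaces to loopback-only in a Postfix main.cf file.
# $1: path to the file. If no inet_interfaces line exists, one is appended.
# Uses "sed -i" (GNU sed) to edit the file in place.
set_loopback_only() {
    if grep -q '^inet_interfaces[[:space:]]*=' "$1"; then
        sed -i 's/^inet_interfaces[[:space:]]*=.*/inet_interfaces = loopback-only/' "$1"
    else
        echo 'inet_interfaces = loopback-only' >> "$1"
    fi
}

# Usage, as root:
#   set_loopback_only /etc/postfix/main.cf
```

Postfix's own postconf tool can make the same change with postconf -e 'inet_interfaces = loopback-only', which is probably the cleaner choice when Postfix is already installed.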
Restart Postfix. To forward all important mail from the system to yourself, edit /etc/aliases:
postmaster: root
root: your@email.here
Then rebuild the alias database:
root@host:~# newaliases
Now you will get all root mail. I also like to change root's full name, which shows up as the sender of the e-mail. This way I can see which host's root is sending me mail.
chfn -f "Hostname Root" root
Hardware Health Check Script
For Smart Array checking, I wrote this little script and put it in /usr/local/sbin/smart_array_check:
#!/bin/sh
SLOT="3"      # Your controller slot number
ARRAY="A"     # Your array letter
EMAIL="root"  # You can put your e-mail address here

# This function is called if checks don't pass. Send e-mail.
Notify() {
    SUBJECT=$1
    {
        echo $SUBJECT
        echo
        echo Check date: $(date +"%F %T%:::z")
        echo
        echo Controller Status:
        hpacucli ctrl slot=$SLOT show
        echo Array Status:
        hpacucli ctrl slot=$SLOT array $ARRAY show
        echo Physical Drives:
        hpacucli ctrl slot=$SLOT physicaldrive all show
        echo Physical Drive Details:
        for DRIVE in $(hpacucli ctrl slot=$SLOT physicaldrive all show | grep physicaldrive | awk '{ print $2 }')
        do
            hpacucli ctrl slot=$SLOT physicaldrive $DRIVE show
        done
    } | mail -s "$SUBJECT" $EMAIL
}

# Check that there's a line saying 'Controller Status: OK' etc.
hpacucli ctrl slot=$SLOT show status \
    | grep -q 'Controller Status: OK' \
    || Notify "Smart Array CONTROLLER FAILURE at $(hostname)"

hpacucli ctrl slot=$SLOT show status \
    | grep -q 'Cache Status: OK' \
    || Notify "Smart Array CACHE FAILURE at $(hostname)"

hpacucli ctrl slot=$SLOT show status \
    | grep -q 'Battery/Capacitor Status: OK' \
    || Notify "Smart Array BATTERY FAILURE at $(hostname)"

hpacucli ctrl slot=$SLOT array $ARRAY show \
    | grep -q 'Status: OK' \
    || Notify "Smart Array ARRAY FAILURE at $(hostname)"

# This is for testing:
#Notify "This is a test"
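The script above checks the controller, cache, battery and array, but does not look at individual drives. A possible extension is to list the drives whose status is not OK, so the e-mail subject can name them. This is a sketch under the assumption that the status is always the last item inside the parentheses, as in the "physicaldrive all show" output earlier. failed_drives is a hypothetical helper name, and I have not seen what the tool prints for a failed drive:

```shell
#!/bin/sh
# Read "hpacucli ctrl slot=N physicaldrive all show" output on stdin and
# print the ID (e.g. 1I:1:2) of every drive whose status is not OK.
# Assumes the status is the last word before the closing parenthesis.
failed_drives() {
    awk '
        $1 == "physicaldrive" {
            status = $NF
            sub(/\)$/, "", status)   # drop the trailing ")"
            if (status != "OK") print $2
        }
    '
}

# Usage on a ProLiant host:
#   BAD=$(hpacucli ctrl slot=3 physicaldrive all show | failed_drives)
#   [ -n "$BAD" ] && echo "Drives needing attention: $BAD"
```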
For other hardware health checks I wrote this one and put it in /usr/local/sbin/hw_health_check:
#!/bin/sh
EMAIL="root"  # You can put your e-mail address here

# This function is called if checks don't pass
Notify() {
    # Something went wrong. Send e-mail
    SUBJECT=$1
    {
        echo $SUBJECT
        echo
        echo Check date: $(date +"%F %T%:::z")
        echo
        echo '== Power Supply Status =='
        echo
        hplog -p
        echo '== System Fan Status =='
        echo
        hplog -f
        echo '== Temperatures =='
        echo
        hplog -t
        echo '== HP System Log =='
        hplog -v
    } | mail -s "$SUBJECT" $EMAIL
}

# Power supply check with hplog -p:
#
# ID  TYPE      LOCATION         STATUS  REDUNDANT
# 1   Standard  Pwr. Supply Bay  Normal  No
# 2   Standard  Pwr. Supply Bay  Normal  No
#
# A failed power supply looks like this:
#
# ID  TYPE      LOCATION         STATUS  REDUNDANT
# 1   Standard  Pwr. Supply Bay  Normal  No
# 2   Standard  Pwr. Supply Bay  Failed  No
#
# A removed power supply looks like this:
#
# ID  TYPE      LOCATION         STATUS  REDUNDANT
# 1   Standard  Pwr. Supply Bay  Normal  No
# 2   Standard  Pwr. Supply Bay  Absent  No
#
# The total number of power supplies should equal the number of
# power supplies with the status "Normal"
TOTAL_PSU_COUNT=$(hplog -p | tail -n +2 | head -n -1 | wc -l)
OK_PSU_COUNT=$(hplog -p | tail -n +2 | head -n -1 | grep -c Normal)
if [ "$TOTAL_PSU_COUNT" != "$OK_PSU_COUNT" ]
then
    Notify "SYSTEM POWER SUPPLY PROBLEM at $(hostname)"
fi

# Fan check with hplog -f:
#
# ID  TYPE        LOCATION        STATUS  REDUNDANT  FAN SPEED
# 1   Var. Speed  I/O Zone        Normal  Yes        Medium ( 45)
# 2   Var. Speed  I/O Zone        Normal  Yes        Medium ( 45)
# 3   Var. Speed  Processor Zone  Normal  Yes        Medium ( 41)
# 4   Var. Speed  Processor Zone  Normal  Yes        Low    ( 36)
# 5   Var. Speed  Processor Zone  Normal  Yes        Low    ( 36)
# 6   Var. Speed  Processor Zone  Normal  Yes        Low    ( 36)
#
# The total number of fans should equal the number of
# fans with the status "Normal"
TOTAL_FAN_COUNT=$(hplog -f | tail -n +2 | head -n -1 | wc -l)
OK_FAN_COUNT=$(hplog -f | tail -n +2 | head -n -1 | grep -c Normal)
if [ "$TOTAL_FAN_COUNT" != "$OK_FAN_COUNT" ]
then
    Notify "SYSTEM FAN PROBLEM at $(hostname)"
fi

# Temperature check with hplog -t:
#
# ID  TYPE          LOCATION         STATUS  CURRENT    THRESHOLD
# 1   Basic Sensor  I/O Zone         Normal  105F/ 41C  158F/ 70C
# 2   Basic Sensor  Ambient          Normal   69F/ 21C  102F/ 39C
# 3   Basic Sensor  CPU (1)          Normal   86F/ 30C  260F/127C
# 4   Basic Sensor  CPU (1)          Normal   86F/ 30C  260F/127C
# 5   Basic Sensor  Pwr. Supply Bay  Normal  111F/ 44C  170F/ 77C
# 6   Basic Sensor  CPU (2)          Normal   86F/ 30C  260F/127C
# 7   Basic Sensor  CPU (2)          Normal   86F/ 30C  260F/127C
#
# The total number of temperature readings should equal the number of
# temperature readings with the status "Normal"
TOTAL_TEMP_COUNT=$(hplog -t | tail -n +2 | head -n -1 | wc -l)
OK_TEMP_COUNT=$(hplog -t | tail -n +2 | head -n -1 | grep -c Normal)
if [ "$TOTAL_TEMP_COUNT" != "$OK_TEMP_COUNT" ]
then
    Notify "SYSTEM TEMPERATURE PROBLEM at $(hostname)"
fi

# This is for testing:
#Notify "This is a test"
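The counting approach above tells you that something is wrong, but not what. As a possible variation, the rows that are not "Normal" can be printed directly, which also avoids trimming the header and last line with tail and head. This is only a sketch: not_normal_rows is a hypothetical helper name, and it assumes every data row starts with a numeric ID as in the tables above:

```shell
#!/bin/sh
# Read "hplog -p", "hplog -f" or "hplog -t" output on stdin and print
# every data row that has no field equal to "Normal".
# Data rows are recognised by a numeric ID in the first column, so the
# header and blank lines are skipped without tail/head.
not_normal_rows() {
    awk '
        $1 ~ /^[0-9]+$/ {
            ok = 0
            for (i = 2; i <= NF; i++) if ($i == "Normal") ok = 1
            if (!ok) print
        }
    '
}

# Usage on a ProLiant host:
#   BAD=$(hplog -p | not_normal_rows)
#   [ -n "$BAD" ] && echo "Power supply problem: $BAD"
```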
Add both scripts to crontab with "crontab -e":
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 0,6,18,12 * * * smart_array_check
0 0,6,18,12 * * * hw_health_check
That will run the checks four times a day and e-mail every time there is a failure.
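A failed part that is not replaced right away will therefore produce four e-mails a day. If that is too noisy, one option is to remember the last reported problem in a small state file and only send mail when it changes. This is a rough sketch, not something the scripts above do. state_changed and the state file path in the usage comment are hypothetical:

```shell
#!/bin/sh
# Return 0 only when the message differs from the one stored in the state
# file, and store the new message. Return 1 when nothing has changed.
# $1: state file path, $2: message (an empty string means "all OK").
# A missing state file is treated as "all OK".
state_changed() {
    old=""
    if [ -f "$1" ]; then
        old=$(cat "$1")
    fi
    if [ "$old" = "$2" ]; then
        return 1
    fi
    printf '%s\n' "$2" > "$1"
    return 0
}

# Usage inside a check script (the path is an assumption):
#   state_changed /var/tmp/hw_health.state "$SUBJECT" && Notify "$SUBJECT"
```

To get a new e-mail when the same part fails again later, the check would also have to call state_changed with an empty message once everything is back to normal.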