Ollis Blog

I'm Oliver Skibbe, 27 years old, and I work as a system administrator, developer and on anything creative in IT; alongside that I'm studying for a business informatics degree (Informatik-Betriebswirt) in evening classes. My professional experience includes IT training at a municipal administration, 3 years at an IT service provider (http://www.ciphron.de) and, currently, a role as an "internal service provider" at a company in the statutory health insurance (GKV) sector. Along the way I come across interesting things that I will write a bit about here.

Thursday, April 3, 2014

Network debugging with and on Linux

Since the last blog entry was a while ago, this time I would like to write about (simple) network debugging with and on Linux.

I will leave out the usual commands such as ping and netstat and instead cover somewhat more specialized commands and alternatives.

Checking for active hosts/reachability

I use fping for this:
Advantages: IP ranges (range or CIDR notation!), statistics and parallel processing!
Disadvantages: usually has to be installed separately

1. Single host

$ fping -s 172.16.1.91

172.16.1.91 is alive

       1 targets
       1 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       1 ICMP Echos sent
       1 ICMP Echo Replies received
       0 other ICMP received

 0.98 ms (min round trip time)
 0.98 ms (avg round trip time)
 0.98 ms (max round trip time)
        0.011 sec (elapsed real time)

2. With IP range and statistics

$ fping -s -g 172.16.1.90 172.16.1.100 -r 1
172.16.1.91 is alive
172.16.1.92 is alive
172.16.1.93 is alive
172.16.1.94 is alive
172.16.1.95 is alive
172.16.1.97 is alive
172.16.1.90 is unreachable
172.16.1.96 is unreachable
172.16.1.98 is unreachable
172.16.1.99 is unreachable
172.16.1.100 is unreachable
 
      11 targets
       6 alive
       5 unreachable
       0 unknown addresses
 
      10 timeouts (waiting for response)
      16 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received
 
 0.38 ms (min round trip time)
 0.75 ms (avg round trip time)
 1.45 ms (max round trip time)
        2.808 sec (elapsed real time)
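
For scripting, the list of reachable hosts is usually more useful than the full report. The following is a minimal sketch, assuming output lines of the form "<address> is alive" as shown above; the function name alive_hosts is made up for illustration.

```shell
# Hypothetical helper: print only the addresses that fping reported as alive.
# Reads fping output on stdin and matches lines of the form "<address> is alive".
alive_hosts() {
  awk '$2 == "is" && $3 == "alive" { print $1 }'
}

# Example with a few lines copied from the output above:
printf '%s\n' \
  '172.16.1.91 is alive' \
  '172.16.1.90 is unreachable' \
  '172.16.1.92 is alive' | alive_hosts
```

On a live network the function could be fed directly, e.g. with "fping -g 172.16.1.90 172.16.1.100 -r 1 2>/dev/null | alive_hosts". fping also has an -a option that limits the output to hosts that are alive.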

Showing connections/sessions

Most people will probably use netstat for this; I use "ss" instead.
"ss" is faster and offers more options for querying.

Show all connections

$ ss -a | less
 
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
LISTEN     0      0                       *:6080                     *:*
LISTEN     0      0             172.16.0.250:5667                     *:*
LISTEN     0      0                       *:5668                     *:*
LISTEN     0      0                       *:11301                    *:*
LISTEN     0      0               127.0.0.1:smux                     *:*
LISTEN     0      0                       *:mysql                    *:*
LISTEN     0      0                       *:sunrpc                   *:*
LISTEN     0      0                       *:http                     *:*
LISTEN     0      0                       *:ssh                      *:*
LISTEN     0      0                       *:smtp                     *:*
LISTEN     0      0                       *:iscsi                    *:*
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.1:14653
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.14:58166
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.14:58167
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.14:58164
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.1:14654
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.14:58165
<snip>

Listening ports

List all open ports

$ ss -l | less
 
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
LISTEN     0      50              127.0.0.1:netbios-ssn              *:*
LISTEN     0      128                     *:59532                    *:*
LISTEN     0      128                    :::sunrpc                  :::*
LISTEN     0      128                     *:sunrpc                   *:*
LISTEN     0      128                     *:http                     *:*
LISTEN     0      128                    :::38386                   :::*
LISTEN     0      128                    :::ssh                     :::*
LISTEN     0      128                     *:ssh                      *:*
LISTEN     0      128             127.0.0.1:6010                     *:*
LISTEN     0      128                   ::1:6010                    :::*
LISTEN     0      50              127.0.0.1:microsoft-ds             *:*
LISTEN     0      50              127.0.0.1:mysql                    *:*

List TCP only

$ ss -lt | less
 
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
LISTEN     0      50              127.0.0.1:netbios-ssn              *:*
LISTEN     0      128                     *:59532                    *:*
LISTEN     0      128                    :::sunrpc                  :::*
LISTEN     0      128                     *:sunrpc                   *:*
LISTEN     0      128                     *:http                     *:*
LISTEN     0      128                    :::38386                   :::*
LISTEN     0      128                    :::ssh                     :::*
LISTEN     0      128                     *:ssh                      *:*
LISTEN     0      128             127.0.0.1:6010                     *:*
LISTEN     0      128                   ::1:6010                    :::*
LISTEN     0      50              127.0.0.1:microsoft-ds             *:*
LISTEN     0      50              127.0.0.1:mysql                    *:*

List UDP only

$ ss -lu | less
 
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
UNCONN     0      0                       *:bootpc                   *:*
UNCONN     0      0                       *:sunrpc                   *:*
UNCONN     0      0              172.16.0.75:ntp                      *:*
UNCONN     0      0               127.0.0.1:ntp                      *:*
UNCONN     0      0                       *:ntp                      *:*
UNCONN     0      0            172.16.255.255:netbios-ns               *:*
UNCONN     0      0              172.16.0.75:netbios-ns               *:*
UNCONN     0      0                       *:netbios-ns               *:*
UNCONN     0      0            172.16.255.255:netbios-dgm              *:*
UNCONN     0      0              172.16.0.75:netbios-dgm              *:*
UNCONN     0      0                       *:netbios-dgm              *:*
UNCONN     0      0                       *:37222                    *:*
UNCONN     0      0               127.0.0.1:746                      *:*
UNCONN     0      0                       *:858                      *:*
UNCONN     0      0                      :::sunrpc                  :::*
UNCONN     0      0                     ::1:ntp                     :::*
UNCONN     0      0        fe80::250:56ff:feb2:6063:ntp             :::*
UNCONN     0      0                      :::ntp                     :::*
UNCONN     0      0                      :::858                     :::*
UNCONN     0      0                      :::38139                   :::*

Also show the process IDs of the respective services

$ ss -lp

Display with filtering

The filters are described in the iproute2 documentation.

By port:

All SSH connections:

$ ss -t src :22
 
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
ESTAB      0      0              172.16.0.75:ssh           172.16.0.12:62708

All HTTP or HTTPS connections

$ ss -t '( src :80 or src :443 )'
 
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0               172.16.0.250:80             172.16.0.239:59538

By address in CIDR notation, TCP only (-t!)

$ ss -t dst 172.16.0.0/24

or a single host:

$ ss -t dst 172.16.0.123

or with host and port

$ ss -t dst 172.16.0.123:80

Active (state established) TCP sessions

$ ss -t state established | less
 
Recv-Q Send-Q           Local Address:Port               Peer Address:Port
0      0                   172.16.0.75:58739                 172.16.0.2:microsoft-ds
0      0                   172.16.0.75:51709                 172.16.0.3:microsoft-ds
0      0                   172.16.0.75:39253                 172.16.0.3:1025
0      0                   172.16.0.75:ssh                 172.16.0.123:62708

Other possible states:

established
syn-sent
syn-recv
fin-wait-1
fin-wait-2
time-wait
closed
close-wait
last-ack
closing
all - all states
connected - all connected sessions (every state except listen and closed)
synchronized - all connected sessions except syn-sent
bucket - minisockets, e.g. time-wait and syn-recv
big - "normal" sockets (the opposite of bucket)
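
To get a quick overview of how many sockets are in which state, the first column of the ss output can be tallied. This is a minimal sketch, assuming the column layout of the listings above (one header line, state in the first column); count_states is an illustrative name and not part of iproute2.

```shell
# Hypothetical helper: count sockets per state.
# Expects ss output on stdin: one header line, then one socket per line
# with the state (ESTAB, TIME-WAIT, ...) in the first column.
count_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Example with lines in the format shown above:
count_states <<'EOF'
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
ESTAB      0      0              172.16.0.75:ssh           172.16.0.12:62708
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.1:14653
TIME-WAIT  0      0             172.16.0.250:http           172.16.0.14:58166
EOF
```

On a live system this would be something like "ss -tan | count_states". It does not fit output produced with a "state ..." filter, because ss then omits the State column, as the established example above shows.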

Statistics

$ ss -s
 
Total: 364 (kernel 449)
TCP:   1272 (estab 19, closed 1240, orphaned 1, synrecv 0, timewait 1238/0), ports 825
 
Transport Total     IP        IPv6
*         449       -         -
RAW       0         0         0
UDP       11        11        0
TCP       32        32        0
INET      43        43        0
FRAG      0         0         0

Capturing connections

When there are network problems, or when you simply want to understand what is going on, connections can be captured with tcpdump. The resulting file can then be read and analyzed with Wireshark.

Format:

$ tcpdump -n -i $INTERFACE -s $MAX_PACKET_SIZE -w $OUTPUT_FILE $FILTER

-s = maximum packet size (snapshot length), 0 => 65535 bytes (recommended; newer tcpdump releases default to 262144 bytes)
-n = no DNS resolution
-i = interface
-w = output file

Filters

Everything concerning a specific host (IP or DNS name)

$ tcpdump -ni lanbond0 -w ~/http_mitschnitt.pcap -s0 host 172.16.0.5

Everything concerning HTTP, without SSH connections

$ tcpdump -ni lanbond0 -w ~/http_mitschnitt.pcap -s0  port not ssh and port http

Example

Which data is transferred for a request via HTTP involving host 172.16.0.5:

$ tcpdump -ni lanbond0 -w ~/http_mitschnitt.pcap -s0 host 172.16.0.5 and port http
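
On a busy interface a capture file grows quickly, so it can make sense to stop after a fixed number of packets using tcpdump's -c option. The sketch below only prints such a command line instead of running it; capture_cmd and its argument order are assumptions made for this example.

```shell
# Hypothetical helper: print (not run) a tcpdump command line that follows
# the format above and stops after a fixed number of packets (-c).
# Usage: capture_cmd INTERFACE OUTPUT_FILE PACKET_LIMIT FILTER...
capture_cmd() {
  cc_iface=$1
  cc_outfile=$2
  cc_count=$3
  shift 3
  printf 'tcpdump -n -i %s -s 0 -c %s -w %s %s\n' "$cc_iface" "$cc_count" "$cc_outfile" "$*"
}

capture_cmd lanbond0 "$HOME/http_mitschnitt.pcap" 1000 host 172.16.0.5 and port http
```

The printed line can be reviewed and then executed as root. For a quick look without Wireshark, the file can also be read back with "tcpdump -n -r ~/http_mitschnitt.pcap".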

Monday, January 6, 2014

How-to: Backups/snapshots of multiple Oracle databases using NetApp

There are several ways to back up Oracle databases:

RMAN (http://www.oracle.com/webfolder/technetwork/de/community/dbadmin/tipps/rman_i_backup/index.html)
Data Pump (http://www.oracle.com/webfolder/technetwork/de/community/dbadmin/tipps/pump/index.html)
Snapshots using NetApp, e.g. with SnapManager for Oracle or custom scripts

Each technology has its own advantages and disadvantages. I won't go into those here; I will cover that topic in another blog post in the near future.

Today I want to focus on backing up multiple Oracle databases, attached as LUNs via ASM, using snapshot technology.

Done manually, you would put the database into backup mode (e.g. via sqlplus as sysdba) with "ALTER DATABASE BEGIN BACKUP", perform a log switch ("ALTER SYSTEM ARCHIVE LOG CURRENT"), then log in to the NetApp via SSH and create the snapshot ("snap create VOLUME_NAME SNAPSHOT_NAME"). Afterwards the database has to be taken out of backup mode again ("ALTER DATABASE END BACKUP").

This procedure may still work well for a single database, but you quickly run short of time or might forget a step, so automation is the obvious choice.

Another aspect to consider is several different databases that depend on each other, i.e. that must all be at the same state; in that case automation is mandatory.
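
For dependent databases the important part is the ordering: every database enters backup mode before the first snapshot is taken, and none leaves it before the last snapshot is done. The following dry run is only a sketch of that sequence; backup_plan is a made-up name, the volume names follow the DATA_<SID>_vol convention used in this post, and nothing is actually executed.

```shell
# Hypothetical dry run: print the order of operations for a consistent
# snapshot across several databases. Nothing is executed here.
# Usage: backup_plan SID...
backup_plan() {
  # 1. put every database into backup mode and switch logs
  for sid in "$@"; do
    echo "$sid: ALTER DATABASE BEGIN BACKUP"
    echo "$sid: ALTER SYSTEM ARCHIVE LOG CURRENT"
  done
  # 2. snapshot all data volumes while every database is in backup mode
  for sid in "$@"; do
    echo "netapp: snap create DATA_${sid}_vol SNAPSHOT_NAME"
  done
  # 3. only then end backup mode everywhere
  for sid in "$@"; do
    echo "$sid: ALTER DATABASE END BACKUP"
  done
}

backup_plan DB1 DB2
```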

I did this as part of a project and would like to briefly present the Perl script here; maybe it helps someone else.

Up front: the Perl script certainly leaves room for optimization, but project work is primarily about meeting the requirements; making it "pretty" is a nice-to-have if there is time left.

The script, which is meant to run on the Oracle host, has the following prerequisites:

Adjust the configuration variables in lines 9, 22, 72 and 232
DBD::Oracle (e.g. via CPAN, a package manager, Oracle Instant Client or a full Oracle database installation [the script is configured for this case, see the "use lib xxx" line])
DBI
Net::SSH::Expect (via CPAN)
Key-based SSH login on the NetApp with as few privileges as possible (as described here, for example: http://cmdcmplt.wordpress.com/category/ssh-snapshot-filer-netapp-passwordless-roles/ Note: newer ONTAP versions require login-ssh as an additional capability: useradmin role modify snaps -a login-ssh,cli-snap*)
Volume name on the NetApp: diskgroup name + suffix "_vol", e.g. DATA_DB1_vol

#!/usr/bin/perl -w
# Author: Oliver Skibbe oliskibbe (at) gmail.com
# Date: 2014-01-06
# Purpose: Backup multiple Oracle ASM Databases with NetApp and snapshotting
#
#
use strict;

use lib "/u01/app/oracle/11.2.0.3/perl/lib/site_perl/5.10.0/x86_64-linux-thread-multi";

use DBI;
use Net::SSH::Expect;
use Data::Dumper; # needed by the dump() helper below
# debug stuff
my $debug = 0;
my $self = undef;

# config stuff
my $auto_snaps = 5;

# databases to backup 
my %sidhash = (
    "DB1" => 
       {
       "data" => "DATA_DB1",
       "archive" => "ARCHIVE_DB1",
       "password" => 'PASSWORD'
       },
    "DB2" => 
       {
       "data" => "DATA_DB2",
       "archive" => "ARCHIVE_DB2", 
       "password" => 'PASSWORD' 
       },
    "DB3" => {
       "data" => "DATA_DB3",        
       "archive" => "ARCHIVE_DB3", 
       "password" => 'PASSWORD'
       },
); # end sidhash

### oracle connect stuff
# db handler
my $dbh = undef;
my $sth = undef;
# connect options
## ora session mode 2: sysdba
my $connecthash = { RaiseError => 0, RowCacheSize => 16, AutoCommit => 0, PrintError => 0,
              ora_session_mode => 0x0002 };
my $dsn = undef;
# user & pass
my $username = "sys";

# helper arrays
my @data_arr;
my @archive_arr;
my @row;

# backup sqls
my $b_begin_backup_sql = "alter database begin backup";
my $b_end_backup_sql = "alter database end backup";
my $b_archive_log_sql = "alter system archive log current";
my $b_control_file_trace_sql = "alter database backup controlfile to trace as '/backup/controlfiles/controlfile_PLACEHOLDER.sql' reuse";
my $b_control_file_sql = "alter database backup controlfile to '/backup/controlfiles/controlfile_PLACEHOLDER.ctl' reuse";

##################################################
#  SSH Stuff         #
##################################################

# prepare ssh object
my $ssh = Net::SSH::Expect->new (
    host => "NETAPP_HOST",
    user => 'SNAPSHOT_USER',
    raw_pty => 1
);

# just a little dumper
sub dump {
 my $message = shift || "";
 printf "%s \n", Data::Dumper::Dumper($message);
}

# begin backup, switch archive log, controlfiles
sub begin_backup { 
 foreach my $sid ( keys %sidhash ) {

  print "START: ", $sid, "\n" if $debug;
  # connect to each database
  $dsn = sprintf "DBI:Oracle:%s", $sid;
  $dbh = DBI->connect(
   $dsn, 
   $username,
   $sidhash{$sid}{'password'},
   $connecthash) || die( "begin_backup: " . $DBI::errstr . "\n" );
  
  # reuse sql for all SIDs
  (my $sid_b_control_file_trace_sql = $b_control_file_trace_sql) =~ s/PLACEHOLDER/$sid/;
  (my $sid_b_control_file_sql = $b_control_file_sql) =~ s/PLACEHOLDER/$sid/;
  
  # begin backup
  $dbh->do($b_begin_backup_sql);
  # log switch
  $dbh->do($b_archive_log_sql);
  # controlfile trace sql
  $dbh->do($sid_b_control_file_trace_sql);
  # control file
  $dbh->do($sid_b_control_file_sql);
  
  # db disconnect
  $dbh->disconnect if defined($dbh);  
 }
}

sub manage_snapshots {
 my $diskgroup = shift;
 my $snapshot = $ssh->exec(sprintf "snap list %s", $diskgroup);
 my $snapshot_name = undef;
 
 # if return value includes "no snapshot exists", create first snapshot
 if ( $snapshot =~ /No snapshots exist/ ) {
  $snapshot_name = "AUTO_${diskgroup}_1";
  print "No auto snapshot for ", $diskgroup, " found, lets create ", $snapshot_name, "\n" if $debug;
 } else {
  # get all elements of interest
  my @snapshots = $snapshot =~ /AUTO_${diskgroup}_\d+/g;
  my $snapshot_count = @snapshots;
  
  if ( $snapshot_count >= $auto_snaps ) {
   # delete oldest snapshot
   my $snapshot_delete = $ssh->exec(sprintf "snap delete %s %s", $diskgroup, $snapshots[-1]);
   # new snapshot name
   $snapshot_name = $snapshots[-1];
  } elsif ( $snapshot_count == 0 ) {
   $snapshot_name = "AUTO_${diskgroup}_1";
  } else {
   my @list;   
   foreach my $snapshot ( @snapshots ) {
    push(@list,substr($snapshot,-1));
   }
   @list = sort(@list);
   print "List: ", @list, "\n\n" if $debug;
   
   my $lo = 0;
   my $hi = $auto_snaps;
   
   my $idx = 0;
   for (my $cnt=$lo;$cnt<=$hi;$cnt++) {
    if ($cnt == $list[$idx]) {
     $idx++;
    } else {
     # free index will be used for name
     $snapshot_name = sprintf "AUTO_%s_%s", $diskgroup, $cnt;
    }
   }
  }  
 }
 print "Snapshot: ", $snapshot_name, " for diskgroup ", $diskgroup, " will be created\n";
 # create snapshot
 my $create_snapshot = $ssh->exec(sprintf "snap create %s %s", $diskgroup, $snapshot_name);
}

sub end_backup {
 my $dbh = undef;
 foreach my $sid ( keys %sidhash ) {
  print "END: ", $sid, "\n" if $debug;
  # connect to each database
  $dsn = sprintf "DBI:Oracle:%s", $sid;
  $dbh = DBI->connect(
   $dsn, 
   $username,
   $sidhash{$sid}{'password'},
   $connecthash) || die( "end_backup: " . $DBI::errstr . "\n" );
  $dbh->{TraceLevel} = 0;
  # end backup
  $dbh->do($b_end_backup_sql);
  
  # disconnect
  $dbh->disconnect if defined($dbh);
 }
}

# base sql to get groups and generate volume names, assumption: volume name is asm diskgroup name + _vol e.g. DATA_VISITOUR_1_vol
my $diskgroup_sql = 'SELECT
    d.GROUP_NUMBER, 
    g.NAME AS groupname, 
    d.NAME, 
    LOWER (d.NAME)||\'_vol\' AS volume 
   FROM 
    v$asm_diskgroup g, 
    v$asm_disk d 
   WHERE 
    d.GROUP_NUMBER = g.GROUP_NUMBER 
    AND g.NAME IN (?,?,?)';

#####################################
# Begin Backup      #
#####################################

begin_backup();

# prepare arrays for sql statement
foreach my $sid ( keys %sidhash ) {
 push(@data_arr, $sidhash{$sid}{'data'});
 push(@archive_arr, $sidhash{$sid}{'archive'});
}

# now start the ssh process
$ssh->run_ssh() or die "SSH process couldn't start: $!";
#
#$ssh->timeout(10); 
sleep(3);
# you should be logged on now. Test if you received the remote prompt:
my $counter = 0;
while ($counter <= 10) {
 if ($ssh->read_all(10) =~ />\s*\z/) {
  # break out if login prompt
  last;
 } else {
  $ssh->run_ssh();
 }
 $counter++;
}

# disable terminal translations and echo on the SSH server executing on the server the stty command:
$ssh->exec("stty raw -echo");

##################################################
# Get Diskgroups          #
##################################################
$dbh = DBI->connect(
 'DBI:Oracle:DB1', 
 $username,
 $sidhash{'DB1'}{'password'},
 $connecthash) || die( "get_diskgroup: " . $DBI::errstr . "\n" );

# parse and prepare query
$sth = $dbh->prepare($diskgroup_sql);

##############################
# DATA SNAPSHOTS    #
##############################
# execute query for data diskgroups
$sth->execute(@data_arr);

# create snapshots for each data volume on atlas-2
while (my @row = $sth->fetchrow_array()) {
 # create, delete snapshots
 manage_snapshots($row[3]);
}

# executes end backup sql
end_backup();

##############################
# ARCHIVE SNAPSHOTS   #
##############################
# log switch
$dbh->do($b_archive_log_sql);

# execute query for archive diskgroups
$sth->execute(@archive_arr);

# create snapshots for each archive volume on atlas-2
while (my @row = $sth->fetchrow_array()) {
 # create, delete snapshots
 manage_snapshots($row[3]);
}

END {
 # closes the ssh connection
 $ssh->close() if defined($ssh);
 # closes db handler
 $dbh->disconnect if defined($dbh);
}
# EOF

The script performs the following steps:

Using a hash (line 22), 1-n databases are put into backup mode, a log switch is performed, and the control file is exported both as a trace (SQL) file and as a binary copy.
SSH login (credentials in lines 72, 73) to the NetApp
Fetch all diskgroups of the configured databases (the source DB name is hard-coded there, see lines 232 and 234)
For each DATA diskgroup (assuming the NetApp volumes are named DISKGROUP_vol), start the snapshot creation (snapshot names: AUTO_VOLNAME_1 to AUTO_VOLNAME_X, where X is the maximum number of snapshots, see line 19); once the configured maximum is reached, the oldest snapshot is deleted first
End backup mode
Another log switch
Snapshot creation for the archive log volumes in the same format as for the data diskgroups
Done.

A small note: the loop logic that generates the snapshot names still has a small bug when generating the name for the second snapshot, but this does not affect production use.

And of course here is the download link: https://dl.dropboxusercontent.com/u/9482545/smo_backup.pl

If you have questions, get in touch as always.
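
One possible way around this would be to always pick the lowest unused index. The shell sketch below only illustrates that idea under the AUTO_<volume>_<n> naming used by the script; it is not the logic of the Perl script, and next_snapshot_name is a hypothetical helper.

```shell
# Hypothetical helper, not taken from the Perl script:
# print the first unused name AUTO_<volume>_<n> with n in 1..max.
# Usage: next_snapshot_name VOLUME MAX [EXISTING_NAME...]
next_snapshot_name() {
  volume=$1
  max=$2
  shift 2
  i=1
  while [ "$i" -le "$max" ]; do
    candidate="AUTO_${volume}_${i}"
    taken=0
    for name in "$@"; do
      if [ "$name" = "$candidate" ]; then
        taken=1
      fi
    done
    if [ "$taken" -eq 0 ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
    i=$((i + 1))
  done
  # every index is in use; the caller has to delete a snapshot first
  return 1
}

# prints AUTO_data_db1_vol_2
next_snapshot_name data_db1_vol 5 AUTO_data_db1_vol_1 AUTO_data_db1_vol_3
```

A non-zero return value signals that the configured maximum is reached, which corresponds to the case where the script deletes the oldest snapshot before creating a new one.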

Friday, November 22, 2013

Nagios: Monitoring an ESX cluster and visualization

Anyone running an ESX cluster surely wants to be notified when something goes wrong there.

For this I use the check_esx3.pl plugin from OP5, which returns, via the web interface, the status of a datacenter, individual ESXi hosts or virtual machines.

It depends on the following Perl modules: Nagios::Plugin, File::Basename and VMware::VIRuntime (must be installed separately via VMware's Perl SDK: https://my.vmware.com/de/web/vmware/details?productId=285&downloadGroup=VSP510-SDKPERL-510 )

Overview:

Usage: check_esx3.pl -D <data_center> | -H <host_name> [ -N <vm_name> ]

    -u <user> -p <pass> | -f <authfile>

    -l <command> [ -s <subcommand> ]

    [ -x <black_list> ] [ -o <additional_options> ]

    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]

    [ -V ] [ -h ]



 -?, --usage

   Print usage information

 -h, --help

   Print detailed help screen

 -V, --version

   Print version information

 --extra-opts=[section][@file]

   Read options from an ini file. See http://nagiosplugins.org/extra-opts

   for usage and examples.

 -H, --host=<hostname>

   ESX or ESXi hostname.

 -C, --cluster=<clustername>

   ESX or ESXi clustername.

 -D, --datacenter=<DCname>

   Datacenter hostname.

 -N, --name=<vmname>

   Virtual machine name.

 -u, --username=<username>

   Username to connect with.

 -p, --password=<password>

   Password to use with the username.

 -f, --authfile=<path>

   Authentication file with login and password. File syntax :

   username=<login>

   password=<password>

 -w, --warning=THRESHOLD

   Warning threshold. See

   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT

   for the threshold format.

 -c, --critical=THRESHOLD

   Critical threshold. See

   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT

   for the threshold format.

 -l, --command=COMMAND

   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)

 -s, --subcommand=SUBCOMMAND

   Specify subcommand

 -S, --sessionfile=SESSIONFILE

   Specify a filename to store sessions for faster authentication

 -x, --exclude=<black_list>

   Specify black list

 -o, --options=<additional_options>

   Specify additional command options

 -t, --timeout=INTEGER

   Seconds before plugin times out (default: 30)

 -v, --verbose

   Show details for command-line debugging (can repeat up to 3 times)

Supported commands(^ means blank or not specified parameter) :

    Common options for VM, Host and DC :

        * cpu - shows cpu info

            + usage - CPU usage in percentage

            + usagemhz - CPU usage in MHz

            ^ all cpu info

        * mem - shows mem info

            + usage - mem usage in percentage

            + usagemb - mem usage in MB

            + swap - swap mem usage in MB

            + overhead - additional mem used by VM Server in MB

            + overall - overall mem used by VM Server in MB

            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning

            ^ all mem info

        * net - shows net info

            + usage - overall network usage in KBps(Kilobytes per Second)

            + receive - receive in KBps(Kilobytes per Second)

            + send - send in KBps(Kilobytes per Second)

            ^ all net info

        * io - shows disk io info

            + read - read latency in ms (totalReadLatency.average)

            + write - write latency in ms (totalWriteLatency.average)

            ^ all disk io info

        * runtime - shows runtime info

            + status - overall host status (gray/green/red/yellow)

            + issues - all issues for the host

            ^ all runtime info

    VM specific :

        * cpu - shows cpu info

            + wait - CPU wait time in ms

            + ready - CPU ready time in ms

        * mem - shows mem info

            + swapin - swapin mem usage in MB

            + swapout - swapout mem usage in MB

            + active - active mem usage in MB

        * io - shows disk I/O info

            + usage - overall disk usage in MB/s

        * runtime - shows runtime info

            + con - connection state

            + cpu - allocated CPU in MHz

            + mem - allocated mem in MB

            + state - virtual machine state (UP, DOWN, SUSPENDED)

            + consoleconnections - console connections to VM

            + guest - guest OS status, needs VMware Tools

            + tools - VMWare Tools status

    Host specific :

        * net - shows net info

            + nic - makes sure all active NICs are plugged in

        * io - shows disk io info

            + aborted - aborted commands count

            + resets - bus resets count

            + kernel - kernel latency in ms

            + device - device latency in ms

            + queue - queue latency in ms

        * vmfs - shows Datastore info

            + (name) - free space info for datastore with name (name)

            ^ all datastore info

        * runtime - shows runtime info

            + con - connection state

            + health - checks cpu/storage/memory/sensor status

            + maintenance - shows whether host is in maintenance mode

            + list(vm) - list of VMWare machines and their statuses

        * service - shows Host service info

            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>

            ^ show all services

        * storage - shows Host storage info

            + adapter - list bus adapters

            + lun - list SCSI logical units

            + path - list logical unit paths

    DC specific :

        * io - shows disk io info

            + aborted - aborted commands count

            + resets - bus resets count

            + kernel - kernel latency in ms

            + device - device latency in ms

            + queue - queue latency in ms

        * vmfs - shows Datastore info

            + (name) - free space info for datastore with name (name)

            ^ all datastore info

        * runtime - shows runtime info

            + list(vm) - list of VMWare machines and their statuses

            + listhost - list of VMWare esx host servers and their statuses

            + tools - VMWare Tools status

        * recommendations - shows recommendations for cluster

            + (name) - recommendations for cluster with name (name)

            ^ all clusters recommendations

As you can see, a great many things can be monitored. In my opinion, utilization (CPU, RAM, VMFS, IO, network) and the state of important VMs are the interesting factors, since they can point to problems (outages, insufficient capacity, ...).

Here is a typical commands.cfg:

 
define command {
        command_name    check_esxi_vm_list
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l runtime -s list
}

define command {
        command_name    check_esxi_status
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l runtime -s status
}

define command {
        command_name    check_esxi_io
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l io
}

define command {
        command_name    check_esxi_vmfs
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l vmfs -w 10%: -c 5%:
}

define command {
        command_name    check_esxi_cpu
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l cpu -w 80 -c 90
}

define command {
        command_name    check_esxi_mem
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l mem -w 80 -c 90
}

define command {
        command_name    check_esxi_net
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l net
}

define command {
        command_name    check_esxi_vm_running
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -H $_HOSTESX_HOSTNAME$.DOMAIN.TLD -f $ARG2$ -l runtime
}

define command {
        command_name    check_esxi_dc_running_hosts
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l runtime -s listhost -c 8
}
define command {
        command_name    check_esxi_dc_vmfs
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l vmfs -w 10%: -c 5%:
}

define command {
        command_name    check_esxi_dc
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l runtime
}
define command {
        command_name    check_esxi_dc_net
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l net
}

define command {
        command_name    check_esxi_dc_io
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l io
}
define command {
        command_name    check_esxi_dc_cpu
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l cpu
}

define command {
        command_name    check_esxi_dc_mem
        command_line    $USER1$/check_esx3.pl -D $ARG1$ -f $ARG2$ -l mem
}

In commands.cfg I use custom host variables, because hosts in Nagios are normally defined by IP address (no DNS dependency) and the IPs would then of course not match the names in the Virtual Center. Such a host variable is easy to add to the host definition:
_ESX_HOSTNAME esx2

Argument 1 must contain the IP or host name of the Virtual Center, which can bundle the queries for a datacenter.
Argument 2 is an authentication file in the following format; the domain is only needed if you use AD authentication:

username=foobar@domain.tld 

password=barfoo
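
Since the file contains a password in clear text, it should only be readable by the user that runs the checks. A minimal sketch of creating it with restrictive permissions follows; write_authfile is a made-up helper, and the temporary file only stands in for whatever path is later passed as argument 2.

```shell
# Hypothetical helper: write an auth file in the username=/password= format
# and keep it readable for the owner only.
# Usage: write_authfile PATH USERNAME PASSWORD
write_authfile() {
  ( umask 077 && printf 'username=%s\npassword=%s\n' "$2" "$3" > "$1" )
}

# Example with the placeholder credentials from above:
authfile=$(mktemp)
write_authfile "$authfile" 'foobar@domain.tld' 'barfoo'
cat "$authfile"
rm -f "$authfile"
```

Passing the password as a command-line argument leaves it in the shell history, so for real credentials it is probably better to edit the file by hand and only take the umask idea from this sketch.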

Once commands.cfg is included, it makes sense to set up a host group for inheriting the service checks (avoid redundancy!) and then assign the ESXi nodes to it.

Once the service checks are set up, monitoring can begin.

But the icing on the cake is still missing: visualizing the utilization across all hosts!

For this I use the great PNP4Nagios (note: please use version >= 0.6.x!); thanks to the very good documentation, installation is relatively quick and easy.
To get started, there is already a template from OP5 that ships with PNP4Nagios.

However, it does not support multigraphs across several hosts, so I wrote one myself.

The green bar in the graph is a ticker that directly shows green, yellow or red when one of the hosts has a problem, i.e. a threshold has been exceeded.

You can get the template here; it has to be placed in the templates.special folder (e.g. /usr/share/pnp4nagios/templates.special) and adjusted slightly:

Host names support regular expressions (e.g. 'esx' => esx1,esx2,...,esxN) or arrays if the names do not follow a common pattern (array("host1","clusternode2","clusterhost1")).
The service names (in my case: ESXi_XXX) must match exactly for the special template to work.

If in doubt, you can also simply uncomment the lines "throw new Kohana_exception(print_r($data,TRUE));" and then call the template via /pnp4nagios/special; this prints all available data.

$services = $this->tplGetServices("HOST_NAMES", "SERVICE_NAME")
In lines: 12 (CPU), 37 (RAM), 60 (Network)

Once the template is in the right folder, a new "S" button appears on the right-hand side when you open PNP4Nagios; after selecting it, the new template should be listed or already opened.

If not, please check the paths, permissions and file extension (.php!).
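
These checks can be scripted roughly as follows. This is only a sketch: check_template and the file name esx_overview.php are made up, and the readability test applies to the user running the script, so it would have to be run as the web server user to be meaningful.

```shell
# Hypothetical helper: basic sanity checks for a special template file.
# Prints a message for every failed check and returns non-zero if any failed.
check_template() {
  status=0
  case "$1" in
    *.php) ;;
    *) echo "extension is not .php: $1"; status=1 ;;
  esac
  if [ ! -f "$1" ]; then
    echo "file not found: $1"; status=1
  elif [ ! -r "$1" ]; then
    echo "file not readable: $1"; status=1
  fi
  return "$status"
}

# The directory is the example path from above; the file name is made up.
check_template /usr/share/pnp4nagios/templates.special/esx_overview.php || true
```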

Et voilà, monitoring of the ESXi hosts / datacenter is set up and displayed in full color.