Nagios and Cloud Computing

Presentation by William Leibzon ([email protected])

Thanks for being here!
Open Source System Management Conference, May 10, 2012 – Bolzano, Italy

Cloud Computing 

What is Cloud Computing?
Virtualized systems, independent of hardware and leased to customers in what is referred to as Infrastructure as a Service (IaaS)





Virtualization is the Core of Cloud Computing 

- Separates hardware from operating system
- Efficient use of modern multi-core processors
- More servers with less hardware
- Unused system resources can be utilized by other types of servers with different resource usage
- Less energy, less rack space, more efficient use of resources

Benefits of Cloud Architecture 

Virtualized Systems in a Cloud
- Can be managed entirely remotely
- Can move (even live) from one hardware host to another
- Can be shut down, saved to disk and started again when required
- Can be easily cloned so that an identical system is started exactly when it is needed

The cloud makes it possible to automate scaling the infrastructure up to handle peak traffic load and scaling it down afterwards to keep overall cost low.

This requires monitoring of all system resources!

Cloud Solutions and Vendors 

Hypervisors (Virtualization Kernels):

Commercial: VMware ESX, IBM z/VM, Microsoft VirtualPC

Open-Source: Xen, KVM, OpenVZ, QEMU, VirtualBox

Xen originally implemented paravirtualization, requiring a Linux guest (cloud OS) with a modified kernel. KVM and Xen-HVM do full virtualization with QEMU and CPU virtualization extensions (Intel's VT or AMD's SVM). OpenVZ is a hybrid of a paravirtualizer and user-mode Linux; it requires host and cloud OS to be the same version of Linux because they share one kernel.



Virtualization and Cloud Software Suites 

Commercial: VMware vCloud, Microsoft Azure



Open-Source: Eucalyptus, OpenStack, OpenNebula, Baracus





Commercial based on Open-Source: Citrix XenServer, Oracle VM, Ubuntu Enterprise Cloud, Red Hat CloudForms, Parallels Virtuozzo

Cloud Infrastructure providers 

Amazon EC2 (Xen), Rackspace (modified Xen), Linode (Xen), Savvis (VMware), and many more

Open-Source Cloud Software 



Open-Source Hypervisors used in Cloud Systems 

Xen - http://www.xen.org/



KVM - http://www.linux-kvm.org/



OpenVZ - http://www.openvz.org/

Open-Source Cloud Management Software 

Eucalyptus - http://open.eucalyptus.com/



OpenStack – http://www.openstack.org/



OpenNebula - http://www.opennebula.org/



Baracus – http://baracus-project.org/



Proxmox - http://pve.proxmox.com/

Of the above I recommend OpenStack with KVM or Eucalyptus with Xen. OpenVZ provides the best performance but no true isolation.

Monitoring for the Cloud 

Monitoring of hardware (host OS) & hypervisor
- More static, hardware does not change as often
- Monitoring of system resources is often integrated into the virtualizer, and the info is not available to the cloud customer

Monitoring of virtual systems
- Dynamic, should be able to handle addition and removal of server instances
- Focus on application and network performance
- Ideally should monitor utilization and be able to launch new server instances (auto-scaling)
- Monitoring system should itself be robust and handle more servers without impacting performance
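The auto-scaling idea can be illustrated with a minimal sketch. It assumes the monitoring system already knows the average utilization of a group of instances as a whole-number percentage; the function name, the 60% target and the instance limits are hypothetical values chosen for illustration, not something from the slides.

```python
def desired_instances(current, avg_utilization_pct, target_pct=60,
                      minimum=2, maximum=20):
    """Return how many instances would bring average utilization near target.

    All parameters are illustrative assumptions: utilization is an integer
    percentage averaged over `current` running instances.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_pct < 1:
        raise ValueError("target_pct must be at least 1")
    total_load = current * avg_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-total_load // target_pct)
    # Never go below the minimum or above the allowed maximum.
    return max(minimum, min(maximum, needed))


if __name__ == "__main__":
    print(desired_instances(2, 90))    # busy: scale up
    print(desired_instances(10, 20))   # idle: scale down
```

A monitoring hook could compare the returned value with the current instance count and ask the cloud API to start or stop the difference.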

Cloud Monitoring Architecture 

Horizontal Scaling
Clouds can be as small as 10 servers and as large as 10,000+. When developing the architecture, you need to support its future growth.

Scaling on Demand
A proactive system should handle big changes in the number of cloud instances. You may have 2 webserver instances at 6am and grow to 20 at 10pm.

High Availability
Good system design should be fully fault-tolerant, and the application as a whole should continue to function without interruption if any one server instance dies.

This means cluster !!!

Nagios Cluster Options
The base nagios-core package is for stand-alone monitoring where one server does all service checks. It can be extended to a Nagios cluster with:

- Passive Service Checks (Classic Distributed Model): the "Old Way". NSCA is used to forward results of checks from clients to the main nagios server; not robust
- Shared Database (Central Dashboard Model): the NDO-Mod and Merlin projects implement this with a combination of NEB modules, daemon & database
- Worker Nodes (Load Balancing of Checks): DNX and Mod-Gearman do it with a combination of a loaded NEB module, server daemon & client servers

Passive Service Checks 



[Diagram: Nagios client servers forwarding check results through NSCA]

How
- One central server holds all services; it does not run any checks itself and lists them all as passive
- Separate client nagios servers run plugins and do checks for specific sets of hosts; each has its own subset of the full nagios config
- Scripts are set up that capture results from each client host and send them to the central server using NSCA, which puts them into the nagios command queue

Advantages
- This will work with any nagios server; organizations have been doing it since at least 2002

Disadvantages
- Requires a lot of custom scripting to organize nagios configs
- Not reliable if a server dies
- Not robust enough to automate cloud instances being added and deleted
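As a rough illustration of what those forwarding scripts produce, the sketch below formats one check result in the two text forms involved: the tab-separated line that `send_nsca` reads on standard input, and the `PROCESS_SERVICE_CHECK_RESULT` external command that ends up in the nagios command queue. The helper names and the host and service names are hypothetical.

```python
import time

# Standard Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def send_nsca_line(host, service, state, output):
    """Format one result as send_nsca expects: tab-separated, one per line."""
    return f"{host}\t{service}\t{STATES[state]}\t{output}\n"


def external_command(host, service, state, output, now=None):
    """Format the same result as a line for the nagios command file."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{STATES[state]};{output}\n")


if __name__ == "__main__":
    # "web1" and "HTTP" are made-up names for illustration.
    print(send_nsca_line("web1", "HTTP", "OK", "HTTP OK - 0.12s"), end="")
    print(external_command("web1", "HTTP", "OK", "HTTP OK - 0.12s"), end="")
```

The host and service names must match definitions that exist as passive services on the central server, otherwise nagios discards the result.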

Shared Database 

Who: NDO-DB and Merlin

How
- Multiple peer Nagios servers, each with a different config file specifying which services it checks
- All servers use a common database to share results of checks and status of the services they are monitoring

Advantages
- There is no master nagios server. There is a master DB server, but how to build a DB cluster is a better understood topic
- Using NEB avoids slow command-queue processing

Disadvantages
- Partitioning of the monitoring infrastructure among servers is still a manual process
- It is not easy to use this in a dynamic cloud environment, although it works very well for fault-tolerance

DNX and Mod-Gearman Worker Nodes 

How
- Similarly to Passive Service Checks, there is a central Nagios server that does not execute any plugins
- Unlike with Passive Checks, nagios does schedule the checks. Thereafter the NEB module takes over
- The module passes information on which plugin(s) to run to the DNX server (or the Gearman server for Mod-Gearman), which manages the worker nodes
- Worker nodes are separate servers, each running a special worker daemon. The daemon communicates with the management server and gets information (the plugin command) on what to run. It then passes results back to the management server, and the NEB module writes these results directly into nagios memory
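The flow described above can be simulated in-process as a rough sketch. This is not the DNX or Gearman API: threads stand in for worker nodes, a `queue.Queue` stands in for the management server, and each "plugin" is a plain Python callable returning a return code and a text. All names are hypothetical.

```python
import queue
import threading


def worker(name, jobs, results):
    """Worker daemon loop: fetch a job, run the plugin, report the result."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: no more work for this worker
            break
        host, service, plugin = job
        code, text = plugin()
        results.put((host, service, code, text, name))


def run_checks(checks, worker_count=3):
    """Distribute scheduled checks to workers and collect all results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}", jobs, results))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for check in checks:         # the scheduler hands checks to the queue
        jobs.put(check)
    for _ in threads:            # one sentinel per worker
        jobs.put(None)
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    demo = [
        ("web1", "HTTP", lambda: (0, "HTTP OK")),
        ("web2", "HTTP", lambda: (2, "HTTP CRITICAL - timeout")),
        ("db1", "MySQL", lambda: (0, "MySQL OK")),
    ]
    for row in run_checks(demo):
        print(row)
```

Because workers pull jobs from a shared queue, starting more of them needs no change on the scheduling side, which is the property that makes this model attractive in a cloud.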

Advantages and Disadvantages of DNX and Mod-Gearman 

Advantages
- Robust: checks are automatically distributed among all cluster worker nodes (by default round-robin on an equal basis)
- Scalable: fully achieves horizontal scaling of nagios checks
- Easy to use in a cloud environment:
  - An existing worker node can be replicated with no special config to start it, which lets you expand the cluster on demand
  - All worker nodes are essentially the same, and no additional re-configuration is necessary to add a new node
- Efficient integration with Nagios: using NEB loaded modules achieves low-level integration with nagios, much better than NSCA and the command queue

Disadvantages
- Still relies on a single central nagios server. If the central nagios server dies, the entire system is out
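Round-robin distribution over a pool that can grow at run time can be sketched in a few lines. The class below is a hypothetical illustration of the idea only; it does not reflect how DNX or Mod-Gearman are implemented internally.

```python
class RoundRobinDispatcher:
    """Hand out checks to worker nodes in turn (illustrative sketch)."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._next = 0

    def add_worker(self, name):
        # A cloned worker node simply joins the pool; nothing else changes.
        self.workers.append(name)

    def assign(self, check):
        """Return the worker that should run `check`."""
        if not self.workers:
            raise RuntimeError("no worker nodes registered")
        chosen = self.workers[self._next % len(self.workers)]
        self._next += 1
        return chosen


if __name__ == "__main__":
    dispatcher = RoundRobinDispatcher(["w1", "w2"])
    print([dispatcher.assign(c) for c in ("a", "b", "c", "d")])
    dispatcher.add_worker("w3")
    print([dispatcher.assign(c) for c in ("e", "f", "g")])
```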

DNX vs Mod-Gearman

DNX
- Single package, no external dependencies. Includes all job cluster control components
- Hard to maintain and test for non-Linux environments
- Can use localCheckPattern in server configuration to direct jobs, but it is not documented
- Supports nagios-2.x with a patch and nagios-3.x as is
- Client can be extended with nagios-specific features. Planned are: embedded Perl, check_icmp, check_snmp, check_nrpe

Mod-Gearman
- Built around the Gearman project
- Better maintained since Gearman has many uses
- Enjoys benefits of wider testing on new releases
- Easy to configure and direct to separate queues depending on hostgroup & servicegroup
- Only supports nagios 3.x
- Supports eventhandlers and not just checks!
- Nagios-only features are hard to add at node level

Combining Checks Together
Combining data collection for multiple services together is a great way to off-load Nagios. There are several approaches:

- "Old Way": cron jobs run plugins and submit results with NSCA
- check_multi – http://www.my-plugin.de/check_multi
  This allows combining multiple checks together; the output status is multiple lines, a new feature of nagios 3.0
- A separate collection daemon on each server. This is like the "Old Way", but instead of cron a dedicated process on the server does the collection. Two popular open-source packages:
  - Munin: http://munin-monitoring.org/
  - Collectd: http://collectd.org (/wiki/index.php/Collectd-nagios)

I recommend Collectd because it's faster, and because Munin relies on NSCA to push data to nagios from each host, whereas Collectd can send data using its own protocol to the nagios server, where nagios can then check this data locally. For Munin you can do without NSCA using http://code.google.com/p/nagios-munin/ (I haven't tried it myself though).

In general, use plugins that support multiple thresholds and perform several checks together instead of ones that have to be called separately for each check.
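A plugin that evaluates several metrics in one call might look roughly like the sketch below. It follows the usual Nagios plugin conventions (return codes 0/1/2 and performance data after a `|`), but the metric names and thresholds are made up, and the values are passed in directly rather than collected from a real system.

```python
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate(value, warn, crit):
    """Map one value to a Nagios state (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def combined_check(metrics, thresholds):
    """Check all metrics at once; return (return_code, plugin_output)."""
    worst = OK
    problems, perfdata = [], []
    for name, value in metrics.items():
        warn, crit = thresholds[name]
        state = evaluate(value, warn, crit)
        worst = max(worst, state)           # overall state is the worst one
        if state != OK:
            problems.append(f"{name}={value} ({LABELS[state]})")
        perfdata.append(f"{name}={value};{warn};{crit}")
    summary = ", ".join(problems) if problems else "all metrics within thresholds"
    return worst, f"{LABELS[worst]} - {summary} | {' '.join(perfdata)}"


if __name__ == "__main__":
    # Hypothetical sample values and thresholds.
    code, text = combined_check(
        {"load": 1.5, "disk_pct": 92},
        {"load": (4, 8), "disk_pct": (80, 90)},
    )
    print(code, text)
```

A real plugin would print the text and exit with the return code, so nagios gets one result covering several measurements instead of scheduling one check per metric.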

Combining All Options Together
All Nagios cluster options can be combined!

- DNX and Mod-Gearman offer horizontal scaling for all checks and relieve Nagios of the need to run them
- Merlin or NDO-DB can be used for failover and scaling of the Nagios server itself
- Munin or Collectd can run on cloud hosts to off-load gathering of data for standard checks and provide the data to nagios together

[Diagram: Collectd on each server]

Ideal Fully Fault-Tolerant Nagios Cluster Architecture

[Diagram components]
- Nagios Web Interface Server and Backup Nagios Web Interface Server
- Merlin/NDO DB and Merlin/NDO DB Backup
- DB Proxy and Standby DB Proxy
- Performance Data (RRD) Server (like NagiosGrapher) and Backup Performance Data (RRD) Server
- Nagios Server and Backup Nagios Server
- Worker Nodes
- Link labels: udpecho, udp, replication, heartbeat, crossmonitor

Ideally you would have each of the above as a separate cloud server, but even those with 1000s of servers may find this hard to maintain

Nagios Cloud Cluster with 4 hosts

[Diagram components]
- MAIN NAGIOS SERVER: Apache, PNP w/ RRD, NPCD, Nagios, Merlin daemon, Mysql DB, DNX Server
- STANDBY NAGIOS SERVER: Apache, PNP w/ RRD, NPCD, Nagios, Merlin daemon, Mysql DB, DNX Server
- Two DNX Client hosts
- Link labels: replication (Mysql DB), crossmonitor



- Standby server has all checks disabled (except the cross-monitor of the other nagios, which should not use DNX)
- If the main server dies, the backup takes over and registers itself in the dynDNS server, replacing the primary
- DNX clients use the dynDNS address; they are restarted on server switch
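The takeover decision on the standby can be sketched as a small state machine. The threshold of three consecutive failed cross-checks is an assumption made here to avoid switching on a single lost check; the slides do not specify one, and the class and method names are hypothetical.

```python
class StandbyMonitor:
    """Decide when the standby nagios server should take over (sketch)."""

    def __init__(self, failures_before_takeover=3):
        # Assumed threshold; tune it to the cross-monitor check interval.
        self.threshold = failures_before_takeover
        self.consecutive_failures = 0
        self.active = False

    def record_cross_check(self, primary_ok):
        """Feed one cross-monitor result; return the resulting action."""
        if primary_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if not self.active and self.consecutive_failures >= self.threshold:
            self.active = True
            # At this point a real setup would enable checks, register this
            # server in dynDNS and restart the DNX clients.
            return "takeover"
        return "active" if self.active else "standby"


if __name__ == "__main__":
    monitor = StandbyMonitor()
    for ok in (True, False, False, False, False):
        print(monitor.record_cross_check(ok))
```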

Note: I'm working on a new nagios add-on for failover, a NEB module that will take care of cross-monitoring, switching on failure and syncing. This should be ready in late 2012.

Configuration of a cloud host
The best way to configure monitoring of cloud hosts with multiple instances is to have a template and define all services by hostgroups.

define host {
    use
    host_name
    alias
    address
    hostgroups
    parents
    contact_groups
}

wprod-server
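One way to apply the template-plus-hostgroups approach to instances that come and go is to generate the host definitions from a list of running instances. The sketch below is a hypothetical illustration: treating `wprod-server` as the template name is an assumption, and the hostgroup, service template, check command, instance names and addresses are invented for the example.

```python
def render_host(name, address, template="wprod-server",
                hostgroups=("web-servers",)):
    """Render one Nagios host definition that inherits from a template."""
    return "\n".join([
        "define host {",
        f"    use         {template}",
        f"    host_name   {name}",
        f"    alias       {name}",
        f"    address     {address}",
        f"    hostgroups  {','.join(hostgroups)}",
        "}",
    ]) + "\n"


def render_service(description, command, hostgroup,
                   template="generic-service"):
    """Render one service attached to a hostgroup instead of single hosts."""
    return "\n".join([
        "define service {",
        f"    use                  {template}",
        f"    hostgroup_name       {hostgroup}",
        f"    service_description  {description}",
        f"    check_command        {command}",
        "}",
    ]) + "\n"


def render_config(instances):
    """Render all hosts plus the services shared through the hostgroup."""
    parts = [render_host(name, address) for name, address in instances]
    parts.append(render_service("HTTP", "check_http", "web-servers"))
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical instance list, e.g. obtained from the cloud provider API.
    print(render_config([("web1", "10.0.0.11"), ("web2", "10.0.0.12")]))
```

Since services hang off the hostgroup, adding an instance only adds one short host block; the service definitions stay untouched.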