Building a Proxmox 4.3 2-Node Cluster, Part 2
The more I use Proxmox, the more I like it. I upgraded to 4.4 yesterday and I really like the at-a-glance data center interface, but I will go into that later. I am now up to 5 VMs running on the servers, all on NAS shared storage. I have also written a few scripts that help me monitor the cluster and VMs using Nagios.
I am now up to 7 VMs, but I had something weird happen. The night after I upgraded the servers, one of the boxes just shut off. There was an MCE (machine check exception) in the log files, but I didn't have mcelog installed, so I couldn't get the exact details of what happened. Nonetheless, several new packages showed up in the Proxmox repo that day and I upgraded to them, and I have not had a problem since. The upgrade process was pretty straightforward. You can power down your VMs, but I migrated mine to the other node during each upgrade. Below are the upgrade steps I took, in order, from a root terminal:
- apt-get update
- pveupdate
- apt-get upgrade
- pveupgrade
- pveversion -v (just to verify that your packages actually upgraded)
This took about 20 minutes for both systems. I forgot pveupgrade on one of the systems and was wondering why some of the packages didn't upgrade, so don't forget to run it. It will save you time troubleshooting a problem you caused yourself.
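If you go the migration route, Proxmox's qm tool does the work. A minimal example (the VM ID and target node name are placeholders for your own):

```bash
# Live-migrate VM 101 to the node named pve2 before upgrading this node;
# both the VM ID and the node name are placeholders.
qm migrate 101 pve2 --online
```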
The Nagios scripts I put together are pretty basic bash scripts that run a command and use if statements to compare the output, then exit with a code of 0, 1, 2, or 3 (OK, warning, critical, or unknown in Nagios terms). I didn't want to put NRPE on my servers, so I use SSH to run the checks on them. For that I set up a one-way passwordless SSH login, which is easy to do, and as long as you only allow the user to sudo run these specific commands, it is relatively secure. Below are the steps to set up the checks.
- Create a user named nagios on the Proxmox servers
- Create a .ssh directory in the nagios user's home directory on Proxmox using mkdir -p .ssh
- On the Proxmox servers, run ssh-keygen -t rsa as nagios and save the output in the .ssh directory
- Log into the Nagios machine and copy the SSH key from the .ssh/id_rsa.pub file in the nagios user's home directory
- Create an authorized_keys file on the Proxmox server in /home/nagios/.ssh/ and paste in the key from the Nagios server
- You should now be able to SSH from the Nagios server to the Proxmox server as the nagios user, without a password
There are two caveats. First, I used PuTTY from a Windows machine while doing this, so cut and paste was simple. Second, this assumes you already have SSH set up for the nagios user on the Nagios server; if not, you will have to run steps 2 and 3 on the Nagios server as well. If this is clear as mud, you can go to this website for more help: http://www.linuxproblem.org/art_9.html
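In command form, the whole exchange boils down to something like this (hostnames are placeholders, and it assumes the default key paths):

```bash
# On the Nagios server, as the nagios user (skip if a key pair already exists)
ssh-keygen -t rsa

# Append the Nagios server's public key to the Proxmox node's authorized_keys;
# "pve1" is a placeholder hostname, and this prompts for the password one last time
cat ~/.ssh/id_rsa.pub | ssh nagios@pve1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

# Test: this should log you in without a password prompt
ssh nagios@pve1 uptime
```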
Once this was done, I created six scripts to check different things on the servers: the standard CPU load, running processes, number of running VMs, drive space, the NAS mount, etc. They are not perfect and will be revised as I find problems, but for the most part they work, and they are not too bad for the roughly two hours I spent putting them together. I will post these files at the end of the post. They are copied to the nagios user's home directory on each Proxmox server.
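To give a rough idea of the shape of these checks, here is a minimal sketch of one, counting running VMs with Proxmox's qm tool. The thresholds and wording are illustrative, not the exact scripts from this post:

```bash
#!/bin/bash
# Sketch of a Nagios check: count running VMs on this Proxmox node.
# The thresholds below are made up for illustration.

if ! command -v qm >/dev/null 2>&1; then
    echo "UNKNOWN - qm command not found"
    exit 3
fi

RUNNING=$(qm list 2>/dev/null | grep -c ' running ')

if [ "$RUNNING" -eq 0 ]; then
    echo "CRITICAL - no VMs running on this node"
    exit 2
elif [ "$RUNNING" -lt 2 ]; then
    echo "WARNING - only $RUNNING VM(s) running"
    exit 1
else
    echo "OK - $RUNNING VM(s) running"
    exit 0
fi
```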
I then set up the commands in the Nagios commands.cfg and the checks in my Linux.cfg for my Linux servers. This part can be tricky and time consuming if you have problems. Just remember that nagios -v /usr/local/nagios/etc/nagios.cfg is your friend for checking whether Nagios will work after the changes, and it will also tell you where the problem is. When there are errors, Nagios will not start. The location of the config file will vary based on your install.
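For reference, a command/service pair for checks like these might look something like the following sketch, which uses the stock check_by_ssh plugin; the host name, script path, and command name are placeholders, not necessarily what I have in my configs:

```
# commands.cfg - run a remote script over SSH as the nagios user
define command {
    command_name    check_pve_ssh
    command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -l nagios -C "$ARG1$"
}

# Linux.cfg - one service per script on each Proxmox node
define service {
    use                     generic-service
    host_name               pve1
    service_description     Running VMs
    check_command           check_pve_ssh!/home/nagios/check_vms.sh
}
```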
Once I had this all up and running, I decided to build another machine so I could set up a full HA cluster. I had another system that had been used as a game server and really was not used very much: an HP 8200 Elite ultra-slim desktop with an i7 CPU and 16 GB of memory. The only downfall with this one is that it doesn't have any USB 3 ports. I still needed to use a USB NIC for the second connection, so I swapped the roles: the onboard NIC now carries storage and the USB NIC carries normal traffic. The only time I had noticed any difference between the two was during VM builds; installs over the USB NIC took longer when it carried storage, which was my primary reason to swap them. Also, once I got the machine up and working, I couldn't get the VMs to power on or migrate to it. Well, it turned out I hadn't enabled support for virtualization in the BIOS. It took me about half an hour of troubleshooting before I found it, so don't be a bonehead like me, and configure your systems correctly.
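If you hit the same thing, a quick sanity check from a root shell on the node will tell you whether the CPU's virtualization extensions are visible to the OS:

```bash
# Counts vmx (Intel VT-x) or svm (AMD-V) flags in /proc/cpuinfo;
# a result of 0 usually means virtualization is disabled in the BIOS
egrep -c '(vmx|svm)' /proc/cpuinfo
```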
One of the best things about this cluster is the fact that it only takes up a 2 ft x 2 ft square on my shelf. I am now up to 9 VMs and have migrated them all around the cluster without problems. I know this is not a full-scale commercial setup, nor is the system completely redundant, but it is an ultra-low-power cluster that can handle a pretty good workload, and it didn't cost a great deal of money. If I went out and bought everything today, I would pay around $1,500; you wouldn't be able to touch a commercial server for that price. Anyhow, as I finish the build I am going to add the ability to monitor the UPS the machines are hooked to, along with automatic power up and down of the VMs. Once I figure out the HA part of this, I will write one more post explaining how I did it and how well it works.
I got started writing the third part but had to stop before I finished. I am traveling, and it will be a few months before I get back and can finish it. I will post it once it is complete.