Tuesday, August 11, 2009

VMware Server 2.0 / 2.0.1 / 2.0.x: DNS via NAT broken

scenario:
- created a Linux machine (Ubuntu 8.04.3) hosting VMware Server 2.0.1
- two network cards: one to the WAN (DSL), and one with NO IP address, used to bridge a physical network (wireless access points) to the captive portal "fenced" network
- created a virtual guest using a downloaded VM image of pfSense 1.2.2 (having tried 1.2.3 beforehand), which runs FreeBSD 7.0
- created a second virtual guest running Ubuntu Server 9.10 beta as a client for testing the pfSense captive portal
- the setup worked during off-site testing and after the on-site installation at the client's location

- intermittent issues began cropping up: slow or no internet access for the kiosks using the captive portal and for the wireless guests
- a reboot of the Linux host machine seemed to "fix" the issues
- issues came back a few days later, or maybe sooner, but were not as apparent

- I start troubleshooting using the Ubuntu Server guest VM, and find that I can ping Internet hosts directly by IP, but not by hostname... DNS requests fail almost all the time
- I log in to pfSense via SSH and try some ping tests... once again, direct IP addresses work, while domain names fail to resolve
- I try changing to different DNS servers; it doesn't help
- I try nslookup on the Ubuntu Server guest VM, issuing the "server ip.ip.ip.ip" command to switch to a known working DNS server, then look up a domain such as www.google.ca or www.yahoo.ca, which promptly (or not so promptly) fails after a few seconds
- I try the same nslookup command on the Linux Host machine, which works perfectly fine
- I start googling for VMware issues with NAT networks (NAT is what the pfSense WAN network was using in VMware to get to the internet)
- I find some reports in the VMware forums, but the ones I found were mostly about Windows XP Host machines, which had a work-around: editing the VMware NAT config file and setting the DNS lookup policy to ordered instead of burst
- I look into this file on the Linux Host and find the dns section, but a comment in the file tells me this section is ignored on Linux Hosts and only used on Windows... what luck (not) :/
- I continue googling, trying to narrow it down to VMware Server issues, since I believe many of the reports were about VMware Workstation... but really get nowhere
- I start thinking of new ways to avoid using the NAT networking in VMware
- idea1: install a 3rd physical network card into the Linux Host machine, provide NAT services via Linux on this interface, and bridge the pfSense WAN network to this interface
- problem with this idea.. server is a 3hr round-trip drive away from me
- idea2: switch to KVM for virtual hosting instead of VMware
- problem with this idea.. server processor does not support AMD-V or Intel VT.. I don't mind swapping the CPU out, but once again, 3hr round-trip not desirable
- idea3: switch to Xen for virtual hosting instead of VMware
- problem with this idea.. time, and working on a server remotely is slightly more dangerous if I break something mid-work, once again, 3hr trip to fix if I do break something
- idea4: put a virtual or "alias" interface (nothing to do with VMware) on the Linux Host's primary NIC, the one attached to the internet; give it an address on a private subnet different from any other physical or virtual network currently in use (192.168.50.1); enable forwarding for it on the Linux Host; bridge the pfSense WAN NIC in VMware to that physical interface; and set the pfSense WAN to a static IP (192.168.50.129) with the Linux Host's alias IP of 192.168.50.1 as its gateway
- problem with this idea.. don't know if it'll work, somewhat risky playing with networking on remote server, seems complicated
- I went ahead with this idea after weighing the other ideas.
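
The symptom from the troubleshooting steps earlier in this list (hosts reachable by IP, names failing to resolve) can be checked from any guest with a short script. This is only a rough sketch in Python, not something I ran at the time: the function names can_connect, can_resolve and diagnose are made up for illustration, and it swaps ping for a TCP connection (port 53 by default, assuming the IP you give it is a DNS server that accepts TCP) because ICMP needs root from Python.

```python
import socket
import sys


def can_connect(ip, port=53, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds (no DNS lookup involved)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(hostname):
    """Return True if the system resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_ok, dns_ok):
    """Classify the two probe results into a short verdict string."""
    if ip_ok and dns_ok:
        return "ok"
    if ip_ok and not dns_ok:
        return "dns-broken"       # reachable by IP, names fail: the symptom in this post
    if not ip_ok and dns_ok:
        return "ip-unreachable"
    return "no-connectivity"


if __name__ == "__main__" and len(sys.argv) == 3:
    # usage: python dnscheck.py <known-good-ip> <hostname>
    known_ip, name = sys.argv[1], sys.argv[2]
    print(diagnose(can_connect(known_ip), can_resolve(name)))
```

Running it on both the guest and the Linux Host with the same arguments gives a quick side-by-side: in my case the guest would have reported "dns-broken" while the host reported "ok".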

- after choosing idea4, I implemented the changes: set up the virtual alias eth0:50 with an IP of 192.168.50.1; enabled IP forwarding manually for testing, then later in the sysctl config along with 2 other forwarding-related settings; and created a small script to run the 4 iptables rules that implement Linux's masquerading (NAT) functions
- I changed the pfSense WAN IP to static 192.168.50.129 with netmask 255.255.255.0 (/24), on the 192.168.50.0/24 network
- I set the pfSense virtual guest's first virtual network card, which pfSense uses as its WAN interface, to be bridged to the Linux Host's primary network card (the default network name in VMware is "Bridged")
- maybe some reboots involved since things weren't working initially
- also corrected my pfSense netmask, since I had mistakenly chosen /32 initially instead of /24
- can now ping from the pfSense interface to the Linux Host's eth0:50 IP of 192.168.50.1, and vice-versa
- can ping outbound from pfSense to the internet
- nslookup requests from pfSense work
- switch to the testing guest Ubuntu Server VM, and reboot it to get a new lease from the pfSense captive portal, probably not required, but reboots are the order of the day it seems
- nslookup requests from the testing guest using the captive portal work
- call the client to have them test using their kiosks
- wait to hear from client, and continue to test with testing guest to see that the intermittent issue side of things doesn't crop up again
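
For reference, the idea4 setup above can be sketched roughly as follows. This is an illustration under assumptions, not a copy of my actual script: the iptables lines are a typical masquerading set rather than the exact 4 rules I used, the 2 extra sysctl settings are left out, the helper names check_plan and build_commands are made up, and it assumes outbound traffic leaves via the same interface the alias sits on (a DSL link using PPPoE might leave via something like ppp0 instead). It only builds the command strings and does not run anything.

```python
import ipaddress


def check_plan(alias_ip, guest_ip, prefix_len):
    """Return True if a guest configured as guest_ip/prefix_len sees alias_ip as on-link."""
    guest_net = ipaddress.ip_interface(f"{guest_ip}/{prefix_len}").network
    return ipaddress.ip_address(alias_ip) in guest_net


def build_commands(out_if, alias_label, alias_ip, prefix_len):
    """Build (but do not run) typical commands for an alias interface plus masquerading."""
    iface = ipaddress.ip_interface(f"{alias_ip}/{prefix_len}")
    subnet = iface.network      # e.g. 192.168.50.0/24
    netmask = iface.netmask     # e.g. 255.255.255.0
    return [
        # alias address on the existing NIC, e.g. eth0:50
        f"ifconfig {out_if}:{alias_label} {alias_ip} netmask {netmask} up",
        # enable IPv4 forwarding until the next reboot
        "sysctl -w net.ipv4.ip_forward=1",
        # rewrite the source address of traffic leaving for the internet
        f"iptables -t nat -A POSTROUTING -s {subnet} -o {out_if} -j MASQUERADE",
        # let the private subnet out, and let replies back in
        f"iptables -A FORWARD -s {subnet} -j ACCEPT",
        f"iptables -A FORWARD -d {subnet} -m state --state ESTABLISHED,RELATED -j ACCEPT",
    ]


if __name__ == "__main__":
    # /24 puts the gateway on-link; /32 (my initial mistake) does not
    print(check_plan("192.168.50.1", "192.168.50.129", 24))
    print(check_plan("192.168.50.1", "192.168.50.129", 32))
    for cmd in build_commands("eth0", "50", "192.168.50.1", 24):
        print(cmd)
```

The check_plan helper shows why the /32 netmask broke things: with /32 the pfSense WAN considers only its own address to be on the local network, so the 192.168.50.1 gateway is not directly reachable.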

- waiting to know if it all worked out