Testing MTU in vSphere

Well, I have been playing around with VXLAN more than I care to admit.   It's a painful process.  One key requirement of VXLAN is an increased MTU of 1600 bytes to support the encapsulation.  You can verify that you don't have an MTU issue the following way:

Log in to your ESXi host (I like SSH, but it's up to you).

Identify the vmknic with your MTU settings:

esxcfg-vmknic -l

You should see a list of vmknics and their MTU settings.   Then check that your local switch's MTU is greater than or equal to the vmknic's setting:

esxcfg-vswitch -l

Check the MTU of the switch.   If everything looks OK, you can use vmkping to send a packet.  Test basic connectivity first:

vmkping IP_Address_of_local_interface
vmkping IP_address_of_remote_interface

This should return pings unless you are using 5.5 (see below for more 5.5 stuff).   If this fails, you have a basic connectivity issue such as a firewall, subnet mismatch, or some other layer 2 problem.  Now test for a 1600-byte packet (vmkping does not account for the 28 bytes of IP and ICMP header overhead, so specify 1572):

5.0 (-d is do not fragment -s is size)

vmkping -d -s 1572 IP_Address_of_local_interface
vmkping -d -s 1572 IP_address_of_remote_interface

5.1 (-I allows you to identify the vmknic to use)
vmkping -I vmknic# -d -s 1572 IP_Address_of_local_interface
vmkping -I vmknic# -d -s 1572 IP_address_of_remote_interface

5.5 (this one is different: it actually sends out a VXLAN packet, not just a 1572-byte packet, so it's a true test of VXLAN)

vmkping ++netstack=vxlan vmknic_IP -d -s 1572

or the equivalent esxcli command:

esxcli network diag ping --netstack=vxlan --host vmknic_IP --df --size=1572
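The 1572 figure comes from simple arithmetic that's worth keeping in your head. A quick sketch of the math behind the 1572 rule:

```python
# The 28-byte overhead is the 20-byte IPv4 header plus the 8-byte ICMP
# echo header; vmkping's -s flag sets only the ICMP payload size.
IP_HEADER = 20
ICMP_HEADER = 8

def max_ping_payload(mtu: int) -> int:
    """Largest -s value whose packet still fits inside `mtu` bytes."""
    return mtu - IP_HEADER - ICMP_HEADER

print(max_ping_payload(1600))  # 1572, the VXLAN test size
print(max_ping_payload(1500))  # 1472, the same test at a standard MTU
```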

Enjoy your testing and remember the 1572 rule.

2014 Top Virtualization Blogs


So it's that time of year again: time to vote for your favorite VMware blogs.  This year I selfishly added myself… but there was a name snafu, so it ended up as voting for my name :)   You can read the results on the official site http://vsphere-land.com/.  I just wanted to thank the 9 people who voted for me, and the one person who voted for me as number 1 (it was not me; I voted for Yellow-Bricks as always).   Thanks again, and I promise to add more content this year.  I have a bit of a secret project, and when it's done in about a month I will be back with lots and lots of posts around design.

Enjoy my favorite cat.

Radically simple networking design with VMware


VMware has lots of great options and features.  Filtering through all the best practices combined with legacy knowledge can be a real challenge.  I envy people starting with VMware now; they don't carry the knowledge of all the things that were broken in 3.5, 4.0, 4.1, etc.  It's been a great journey, but you have to be careful not to let that legacy knowledge influence the design of today.   In this design I will provide a radically simple solution to networking with VMware.


Design overview:

You have been given a VMware cluster running on HP blades.  Each blade has a total of 20Gb of potential bandwidth that can be divided any way you want.   You should make management of this solution easy and provide as much bandwidth as possible to each traffic type.  You have the following traffic types:

  • Management
  • vMotion
  • Fault Tolerance
  • Backup
  • Virtual machine

Your storage is Fibre Channel and not in scope for the network design.   Your chassis is connected to two upstream switches that are stacked.  You cannot configure the switches beyond assigning VLANs.


This design takes into account the following assumptions:

  • Etherchannel and LAG are not desired or available
  • You have enterprise plus licensing and vcenter

Physical NIC/switch Design:

We want a simple solution with maximum available bandwidth.  This means we should use two 10Gb NICs on our blades.   The connections to the switch for each NIC should be identical (exact same VLANs) and include the VLANs for management, FT, vMotion, backup, and all virtual machines, each with its own VLAN ID for security purposes.   This solution provides the following benefits:

  • Maximum bandwidth available to all traffic types
  • Easy configuration on the switch and nics (identical configuration)

The one major drawback to this solution is that some environments require physical separation of traffic, with each traffic type segregated onto its own NICs.

Virtual Switch Design:

On the virtual switch side we will use a vDS.  In the past there have been major concerns with using a vDS for management and vCenter, since a number of chicken-and-egg scenarios come into play.   If you still have concerns, make the port group for vCenter ephemeral so it does not need vCenter to allocate ports.   Otherwise the vDS brings a lot to the table over standard switches, including:

  • Centralized consistent configuration
  • Traffic Shaping with NIOC
  • Load based teaming
  • Netflow
  • vDS automatic health check


Traffic Shaping:

The first thing to understand about traffic shaping in VMware is that it only affects ingress traffic and is enforced independently on each host.   We use a numeric value known as a share to enforce traffic shaping, and by default share values are only applied during times of contention.  This ensures nothing uses 100% of a link while other neighbors want access to it, which automates traffic policing in VMware solutions.   You can read about the default NIOC pools here.   I suggest you leave the default pools in place with their default values and then add a custom pool for backup.   Shares are assigned a value from 1 to 100.  Another design factor is that traffic types with no active traffic are excluded from the share calculation.   For example, assume the following share values: Management 10, FT 25, vMotion 25, Virtual machine 50.


You would assume that the total shares would be 10+25+25+50 = 110, but if you are not using any FT traffic then it's 10+25+50 = 85.  Either way, total bandwidth divided by this number gives each type's share, so the worst-case scenario with 100% contention and all traffic types active would get the following:

  • Management (20/110 × 10) ≈ 1.8Gb
  • FT (20/110 × 25) ≈ 4.5Gb
  • vMotion (20/110 × 25) ≈ 4.5Gb
  • Virtual machine (20/110 × 50) ≈ 9Gb

And remember this is per host.   You will want to adjust the default settings to fit your requirements and traffic patterns.
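The share arithmetic above can be sketched in a few lines of code. This is only an illustration of the calculation (not a VMware API); the share values are the ones assumed in the example:

```python
def bandwidth_per_type(total_gb: float, shares: dict) -> dict:
    """Split total bandwidth across traffic types by share weight.

    Idle traffic types should be left out of `shares` entirely, since
    traffic that is not flowing does not count toward the share total.
    """
    total_shares = sum(shares.values())
    return {name: round(total_gb * s / total_shares, 2)
            for name, s in shares.items()}

# Worst case: every traffic type active on a 20Gb host (110 total shares).
print(bandwidth_per_type(20, {"Management": 10, "FT": 25,
                              "vMotion": 25, "Virtual machine": 50}))

# With FT idle, the same 20Gb is divided by only 85 shares.
print(bandwidth_per_type(20, {"Management": 10, "vMotion": 25,
                              "Virtual machine": 50}))
```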

This design has some real advantages:

  • The vMotion NIC is seen as 10Gb, which means you can do 8 concurrent vMotions at the same time
  • No more wasted bandwidth
  • Easy to setup and forget about

Load balancing:

Load balancing algorithms in vSphere each have their own personality and physical requirements, but we want simple above everything else.  So we choose Load Based Teaming (LBT), known as "Route based on physical NIC load" in the vDS.  This is a great choice for Enterprise Plus customers.  It caps usage of any one link at 80%; once 80% is reached, some of the traffic is moved over to the next link.  This configuration works with any number of uplinks without any configuration on the physical switch.  We avoid loops because any given traffic source does not span uplinks: for example, virtual machine 1 will use uplink1 exclusively while virtual machine 2 uses uplink2.   With this load balancing method we don't have to assign different uplink priorities to port groups in order to balance traffic; just let LBT handle it.    It is 100% fire and forget.  If you find you need more bandwidth, just add more uplinks to the switch and LBT will start using them.
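LBT's behavior can be modeled as a toy rebalancer: each VM's traffic is pinned to a single uplink, and when an uplink crosses 80% utilization a flow is moved to the other uplink. This is a conceptual sketch under a two-uplink assumption, not VMware's actual algorithm:

```python
def rebalance(flows: dict, capacity: float, threshold: float = 0.8) -> dict:
    """flows maps a VM name to (uplink_index, gbps) for two uplinks.

    If an uplink's load exceeds threshold * capacity, move flows to the
    other uplink as long as the move does not overload it in turn.
    """
    loads = [0.0, 0.0]
    for uplink, gbps in flows.values():
        loads[uplink] += gbps
    for vm, (uplink, gbps) in flows.items():
        if loads[uplink] > threshold * capacity:
            other = 1 - uplink
            if loads[other] + gbps <= threshold * capacity:
                loads[uplink] -= gbps
                loads[other] += gbps
                flows[vm] = (other, gbps)
    return flows

# Two VMs both pinned to uplink 0 push it past 80% of a 10Gb link...
flows = {"vm1": (0, 6.0), "vm2": (0, 3.0)}
print(rebalance(flows, capacity=10.0))
# ...so a flow is moved to uplink 1, with no switch configuration needed.
```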

Radically simple networking

It’s simple and it works.  Here is a simple diagram of the solution:



Once set up, it scales and provides for all your needs.   It's consistent, clean, and designed around possible failures.  It allows all traffic types to use as much network as needed unless contention is present.   Just think of it as DRS for networking.   I just wish I could handle my physical switches this way… maybe some day with NSX.

VMware predefined NIOC settings what do they mean?

Recently I was setting up a new 5.5 cluster with NIOC and I noticed all the new NIOC pre-build categories:



Some are obvious but others are a little more questionable.  After a great discussion with VMware support I found out the following:

  • NFS traffic – Traffic using the NFS bindings in ESXi (not guest NFS traffic); only ESXi NFS traffic
  • Management traffic – ESXi management traffic only; connections between vCenter and ESXi
  • vMotion traffic – vMotion and heartbeats
  • vSphere Storage Area Network traffic – I had a lot of questions on this one, but it turned out to be simple: vSAN traffic only
  • vSphere Replication traffic – Traffic coming from the vSphere Replication appliance only; no other replication traffic
  • iSCSI traffic – As expected, traffic to ESXi that is iSCSI using a hardware or software initiator
  • Virtual machine traffic – Traffic out of guest virtual machines
  • Fault Tolerance traffic – Traffic specific to VMware FT

Those are all the predefined pools… but what if I create a user-defined pool and assign it to my NFS port group?  Which one applies?   Simple: the one with the larger share.

How does storage multipathing work?

Every week I spend some time answering questions on the VMware forums.  It also provides me great ideas for blog posts, just like this one.   It started with a simple question: how does multipathing work?   Along with a lot of well-thought-out specific questions.   I tried to answer the questions but figured it would be best with some diagrams and a blog post.    I will focus this post on Fibre Channel multipathing.  First it's important to understand that Fibre Channel is nothing more than L2 communication using frames to push SCSI commands.   Fibre Channel switches are tuned to pass SCSI frames as fast as possible.

Types of Arrays

There are really three types of connectivity with Fibre Channel (FC) arrays:

  • Active/Active – I/O can be sent to a LUN via any of the array's storage processors (SPs) and ports.  Normally this is implemented in larger arrays with lots of cache: writes are sent to the cache and then destaged to disk, so since everything is delivered to cache, the SP and port do not matter.
  • Active/Passive – I/O is sent down to a single SP and port that owns the LUN.  If I/O is sent down any other path it is denied by the array.
  • Pseudo Active/Active – I/O can be sent down any SP and port, but there is an SP and port combination that owns the LUN.  Traffic sent to the owner of the LUN is much faster than traffic sent to non-owners.

The most common implementation of pseudo active/active is Asymmetric Logical Unit Access (ALUA), defined in the SCSI-3 protocol.  In ALUA the SP identifies the owner of a LUN with SCSI sense codes.

Access States

ALUA has a few possible access states for any SP/port combination:

  • Active/Optimized (AO) – the SP and port that own the LUN; the best possible path to use for performance
  • Active/Non-Optimized (ANO) – an SP and port that can be used to access a LUN but is slower than the AO path
  • Transitioning – the LUN is changing from one state to another and is not available for I/O; not used in most ALUA arrays now
  • Standby – not active but available; not used in most ALUA arrays now
  • Unavailable – SP and port not available

In an active/active array the following states exist:

  • Active – all SPs and ports should be in this state
  • Unavailable – SP and port not available

In an active/passive array the following states exist:

  • Active – the SP and port used to access the LUN (single owner)
  • Standby – an SP and port available if the active one is gone
  • Transitioning – switching to Active or Standby

In ALUA arrays you also have target port groups (TPGs): sets of SP ports that share the same state.  For example, all the ports on a single SP may form a TPG, since the LUN is owned by that SP.

How does your host know what the state is?

Great question.  Using SCSI commands, a host and array communicate state.   There are lots of commands in the standard.  I will show three management commands from ALUA arrays, since they are the most interesting:

  • Inquiry – ask the array a SCSI question
  • Report target port groups – reports which TPG has the optimized path
  • Set target port groups – asks the array to switch target port group ownership


This brings up some fun scenarios: who can initiate these commands, and when?  All of the following assume an ALUA array.


So we have a server with two HBAs connected to SAN switches.  In turn, the SPs are connected to the SAN switches.  SPa owns LUN1 (AO path) and SPb owns LUN2 (AO path).



Consider the following failures:

  • HBA1 fails – Assuming the pathing software in the OS is set correctly (more on this later), the operating system accesses LUN1 via the ANO path to SPb to continue to access storage.  Then it initiates a set target port groups command to SPb, asking it to take over LUN1.  This is fulfilled, and the array sends out a report target port groups response to all known systems indicating they should use SPb as the AO path for LUN1.
  • SPa fails – Assuming the pathing in the OS is good, access to LUN1 via SPa fails, and the OS fails over to SPb and initiates the LUN failover.

This is simplified just to show the interaction; in a real environment you would want SAN switches A and B both connected to SPa and SPb, if possible, for redundancy.
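The failure scenarios above can be sketched as a toy model. The path names and states mirror the example (AO/ANO per the access states earlier); this is an illustration, not a real SCSI implementation:

```python
# Toy model of ALUA path failover for a single LUN.
def pick_path(paths: dict) -> str:
    """paths: {sp_name: state}, state in {'AO', 'ANO', 'unavailable'}.
    Prefer the Active/Optimized path; fall back to Active/Non-Optimized."""
    for state in ("AO", "ANO"):
        for name, s in paths.items():
            if s == state:
                return name
    raise RuntimeError("no usable path to LUN")

def fail_over(paths: dict, failed: str) -> dict:
    """Mark a path unavailable, then model a set target port groups
    request by promoting the surviving ANO path to AO."""
    paths[failed] = "unavailable"
    survivor = pick_path(paths)
    if paths[survivor] == "ANO":
        paths[survivor] = "AO"   # the array fulfills the ownership change
    return paths

lun1 = {"SPa": "AO", "SPb": "ANO"}
print(fail_over(lun1, "SPa"))  # {'SPa': 'unavailable', 'SPb': 'AO'}
```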

How does ESXi deal with paths?

ESXi has three possible path states:

  • Active
  • Standby
  • Dead – cable unplugged, bad connection, or failed switch

It will always try to access the LUN via any path available.

Why does path selection policy matter?

The path selection policy can make a huge difference.  For example, if you have an ALUA array you would not use the round robin path selection policy: doing so would cause at least half your I/Os to go down ANO paths, which would be slow.   ESXi supports three policies out of the box:

  • Fixed – honors the AO path whenever it is available; most commonly used with ALUA arrays
  • Most Recently Used (MRU) – ignores the preferred path and uses the most recently used path until it's dead (used with active/passive arrays)
  • Round Robin (RR) – sends a fixed number of I/Os or bytes down a path, then switches to the next path; ignores AO; normally used with active/active arrays

The number of I/Os or bytes sent before switching in RR is configurable, but defaults to 1000 I/Os and 10485760 bytes.
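The RR switching behavior can be sketched as a small counter. The 1000-I/O default mirrors ESXi's; the path names are just illustrative:

```python
class RoundRobinPSP:
    """Toy round-robin path selection: send a fixed number of I/Os down
    one path, then rotate to the next (ESXi's default is 1000 I/Os)."""

    def __init__(self, paths, ios_per_path=1000):
        self.paths = paths
        self.ios_per_path = ios_per_path
        self.current = 0   # index of the path in use
        self.sent = 0      # I/Os issued on the current path

    def next_path(self):
        if self.sent == self.ios_per_path:
            self.sent = 0
            self.current = (self.current + 1) % len(self.paths)
        self.sent += 1
        return self.paths[self.current]

# A small rotation interval makes the behavior easy to see.
psp = RoundRobinPSP(["vmhba1:C0:T0:L1", "vmhba2:C0:T0:L1"], ios_per_path=2)
print([psp.next_path() for _ in range(5)])
```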

Which policy should you use?  That depends on your storage array, and you should work with your vendor to understand their best practices.  In addition, a number of vendors have their own multipathing software that you should use (for example, EMC's PowerPath).


VMUG Virtual Event

I am a huge fan of the VMUG organization.  I attend two different VMUGs in central Ohio, and from time to time I have been known to speak at the events. In fact, I will be speaking on design on Feb. 25th at the Central Ohio VMUG.  So last week I attended the VMUG virtual day-long event.  VMware has been playing around with this type of event for about six months now.  The concept is simple: avoid the traveling show and create a virtual event.    There are live streams with Q&A sessions and, of course, vendor booths.

This type of medium is getting more popular with all companies looking to reduce expenses.  Of course, the key is that they need to bring value to the table.  Traditionally, these types of shows attract people for these reasons:

  • SWAG (Stuff we all get)
  • Get out of the office
  • Chance to talk to a lot of vendors in one place
  • Chance to learn about companies product (marketing)

The virtual event does offer prizes but not a lot of swag beyond white papers.   Personally, I enjoy going to the events because I can interact with other people.  I find the lunches at these events to be the most useful.   In that respect the virtual event is really missing out.  I can talk to vendors via chat, or to VMware employees, but my ability to talk to other customers is limited.   I have always felt that VMUG is really about the other customers.  I hope they find a way to create community at these events as well.  Here is the good news: even if you could not attend, you can get all the presentations streaming until Feb. 24th right here.

I did enjoy seeing more vSAN best practices.  I am really looking forward to seeing more vSAN deployments.

How do VMware Snapshots work

When I studied computer science it was not a raw science.  My training did not require knowledge of how transistors worked or even logic circuits.  It focused mostly on programming languages and how to configure a web server.   Why?  Because these were the skills most likely to be used by a computer scientist in the field today.   Very few people build computers from scratch.  Intel has a corner on that market.  Personally I wanted to understand all the under the hood components so I took a minor in electrical engineering.  It was worth my time and a great learning experience.   I find that a lot of technology is like this… which includes VMware snapshots.   I have had snapshots explained to me in every VMware course that I have attended and every answer is different.  I have cobbled together lots of KB articles and other sources into this article.  If something is missing or incorrect let me know so I can fix it.

What is a Snapshot?

  • A snapshot file is only a change log of the original virtual disk
  • A virtual machine uses the disk descriptor file to access the most current snapshot, not the original disk
  • It is not a complete copy of the original disk
  • Snapshots are combined dynamically with the original disk to form the current state of the system
  • The snapshot change logs are sometimes called delta files
  • Think of them as a chain: to get the complete picture you need all the links, in order
  • Snapshot files will grow forever unless deleted (re-integrated into the original disk)
  • Using a lot of snapshots can affect the performance of the virtual machine
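The change-log behavior above can be modeled with plain dictionaries: each delta records only the blocks written after its snapshot was taken, and a read walks the chain from newest to oldest. A conceptual sketch only, not the actual VMDK format:

```python
def read_block(block: int, base: dict, deltas: list):
    """Return the current contents of a block: check the newest delta
    first, then older deltas, and finally fall back to the base disk."""
    for delta in reversed(deltas):
        if block in delta:
            return delta[block]
    return base.get(block)

base = {0: "os", 1: "data-v1"}          # the original virtual disk
deltas = [{1: "data-v2"},               # first snapshot's change log
          {2: "new-file"}]              # second snapshot's change log

print(read_block(1, base, deltas))  # data-v2: the newest delta wins
print(read_block(0, base, deltas))  # os: unchanged, read from base disk
print(read_block(2, base, deltas))  # new-file: exists only in a delta
```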

How do snapshots work?

The process on the surface seems simple.  When you initiate a snapshot the following process is followed:

  1. A request to create a snapshot (CreateSnapshot) for a virtual machine is forwarded to the ESXi host running that virtual machine.
  2. If a memory snapshot is included, the ESXi host writes the memory of the virtual machine to disk.
  3. If quiescing is possible, the ESXi host requests that the guest OS quiesce the disks via VMware Tools.
  4. The ESXi host updates the virtual machine's snapshot database (.vmsd file) to record the snapshot.
  5. The ESXi host calls a function to make changes to the child disks (-delta.vmdk files, via the .vmdk descriptor).

What is a .vmdk descriptor?

Due to the nature of file systems, operating systems don't like to have file names change mid-access.  So VMware implemented descriptors: essentially symbolic links that point to the real files.  This allows a snapshot to be created while access via the descriptor continues uninterrupted.

How can I identify snapshots?

  • Select the virtual disk and check the disk file. If it is labeled as VM-000001.vmdk, the virtual machine is running on snapshot disks.
  • Run the following command from ESXi Shell:
find /vmfs/volumes/ -iname "*.vmx" -exec grep -Hie "-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk" {} \;
  • List currently open delta disks via command line:
ls -l /vmfs/devices/deltadisks
  • Locate all delta disks on file system:
find /vmfs/volumes/ -iname "*delta.vmdk"
  • In powercli
get-vm | get-snapshot | format-list