Network Automation Tools

Network automation was THE network topic of 2019, and perhaps of 2020 too, although COVID-19 is a strong candidate for that title. However, network automation is not a new idea or technology. It’s becoming more prevalent in our networks because vendors are (finally) adding more API support, and because of products such as Cisco DNA Center and the various SD-WAN solutions: networks with controllers that provide a programmatic interface. In the age of DevOps, everything needs an API, and networking vendors are finally coming around to supporting one.

The tools of today, such as Ansible, Python and the like, should sound familiar to a lot of engineers. However, many might not be really comfortable or familiar with them, might want to know what’s what, or have perhaps been waiting to see which one emerges as the ‘winner’. In the end, every hour of study time can only be spent once.

As a network engineer, I’ve been blessed to have had a few training courses on Python and DevOps. Now, this is not directly applicable to our line of work, especially when you’re doing projects all the time, but it does provide good insight into what’s happening on the other side of the fence with the dev teams.

In this post, I want to cover the automation tools and explain why automation and scripting are definitely worth looking into.

Reasons

There needs to be a reason to want to do some automation. Let’s first cover why you should be interested in automation.

Repeatability

Whenever you start codifying your infrastructure, you make it easier to deliver the exact same product again. A simple example: I’m not a Linux engineer. By now, I know my way around the system a bit, but every time I needed to roll out a Linux machine in my homelab, I had to start googling again for commands, best practices and so on. My homelab documentation is as non-existent as that of many an enterprise out there.

So, when I finally had enough, I started coding my actions so they would be repeatable. Because I use variables for server names, IP addresses and so on, the first phase of deploying a machine is pretty much identical every time. The added benefit of working this way is that, through the use of comments, you automatically have some form of documentation.
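
As a rough illustration of that idea, here is a minimal Python sketch that turns a handful of variables into the same ordered list of commands for any machine. Everything in it is an assumption made for the example: the `Machine` class, the `first_phase_commands` function, the host name and the Debian/Ubuntu-style commands are hypothetical and not taken from my actual setup.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Machine:
    """The variables that differ per machine (hypothetical example values)."""
    hostname: str
    packages: Tuple[str, ...] = ("docker.io",)


def first_phase_commands(machine: Machine) -> List[str]:
    """Return the same ordered list of shell commands for any machine.

    Assumes a Debian/Ubuntu-style system; the comments double as documentation.
    """
    return [
        # Give the machine its name.
        f"hostnamectl set-hostname {machine.hostname}",
        # Refresh the package index before installing anything.
        "apt-get update",
        # Install the base packages in one go.
        "apt-get install -y " + " ".join(machine.packages),
    ]


if __name__ == "__main__":
    for command in first_phase_commands(Machine("docker01", ("git", "docker.io"))):
        print(command)
```

Only the values change between machines; the steps and their order stay the same, which is the whole point of codifying them.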

Right now, I kick off an Ansible script and it manages my machine by setting it up, installing packages, setting up Docker etc. After that, depending on a few variables, it will run another ‘playbook’ with the actions for that specific type of machine. If something crashes, I either spin up a new virtual machine or get some hardware and run that playbook all over again.

With wider support for APIs and better structure of configurations (via SSH), we can now get the same benefits for our networks.

Consistent results

The great thing about automation is the consistency. From personal experience, I can tell you that if I have to configure something on 30 switches, I might make a mistake somewhere. Or when provisioning 10 routers, the end result is not always the same, whether that is due to different out-of-the-box configs or my own mistakes.

Automation allows you to make those mistakes consistently. At least you only have to apply a fix once as well. Of course, this also means that the downside of automation is a potentially very big blast radius. Applying one mistake all over the network can take the whole network down. There are several ways to limit this though, and testing is something you should always do. However, once you get the hang of it, applying updates to your config via automation will be quicker than manually performing the job.
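
One common way to limit the blast radius is to roll a change out in small batches and stop as soon as a batch fails. The sketch below shows the idea in plain Python. The `staged_rollout` and `fake_apply` functions and the device names are made up for this example; a real `apply_change` would push config and verify the result.

```python
from typing import Callable, Iterable, List


def staged_rollout(devices: Iterable[str],
                   apply_change: Callable[[str], bool],
                   batch_size: int = 1) -> List[str]:
    """Apply a change batch by batch and stop after the first failed batch.

    Returns the devices that were changed successfully.
    """
    pending = list(devices)
    done: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        results = {device: apply_change(device) for device in batch}
        done.extend(device for device, ok in results.items() if ok)
        if not all(results.values()):
            break  # halt here: the remaining devices stay untouched
    return done


def fake_apply(device: str) -> bool:
    """Stand-in for a real push; pretends that sw03 rejects the change."""
    return device != "sw03"


if __name__ == "__main__":
    switches = ["sw01", "sw02", "sw03", "sw04", "sw05", "sw06"]
    print(staged_rollout(switches, fake_apply, batch_size=2))
    # prints ['sw01', 'sw02', 'sw04']; sw05 and sw06 were never touched
```

Starting with a batch size of one (a single ‘canary’ device) keeps the damage of a bad change to a minimum, at the cost of a slower rollout.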

Versioning & storing code

When you start writing down your code in a text file, that’s great and all. However, we can do better. Developers have been using versioning systems for years and these systems are pretty mature. It’s time network folks jumped on the bandwagon as well. A tool like git allows you to store your code elsewhere and keep track of who made what change when. You can split your code into a separate branch (dev) from your production branch (master) in order to develop or add new features. You can work with permissions so that not everyone in your team can automatically commit to the master branch, and thus ‘protect’ your production code.

The benefits of this approach are obvious:

  • your code does not only live on your machine;
  • versioning is sort of a back-up, allowing you to go back in time;
  • a permission system lets only certain people commit to certain branches, files etc.;
  • branching of code allows you to keep your production code intact.

The basic git CLI is not too difficult, but takes a bit of getting used to. There are plenty of resources to read up on this tool. Perhaps a GUI will help to better see what’s going on and visualize the branches on your repository. Sourcetree from Atlassian is a very decent tool to get started.

The following providers offer free and paid plans for git repositories:

Of course, you can roll your own as well with the many software packages out there. GitLab is available for on-prem use as well, but perhaps your company’s devs already have some local git server running.

Automation: Configuration vs. Orchestration

I’m going to briefly touch on this subject, as it is important for the different ‘tools’ discussed below. However, you can find much more information on this subject elsewhere on the net. Simply put, there are tools that take care of configuration and tools that more actively manage your environment. A configuration tool applies configuration through automation and performs the exact same actions time and time again. It’s a …. Ansible is such a tool. You can, for example, schedule Ansible playbooks in a cron job and have them overwrite manual changes to parts of the device config or to all of it. Changes would then have to happen through the playbook or, if it is integrated with a ‘source of truth’ (a database or other tool), in that tool.

On the other hand, we have orchestrators, which are of a declarative nature. These orchestrators (in a way) do not care how they get to a certain state; they just care about having that state. An orchestrator tries to maintain the infrastructure in a defined state. If you kill a VM on purpose, the orchestrator will try to spin up a new one to return to the desired state. This does not happen with a configuration tool. You could do this with Ansible, but you would have to write your playbooks in such a way that they perform these checks, and schedule the runs of these playbooks.
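
The ‘maintain a defined state’ behaviour can be sketched as a small reconcile function that compares what should exist with what does exist and closes the gap. This is a toy model, not how any particular orchestrator is implemented. `FakeCloud`, `reconcile` and the VM names are invented for the example.

```python
from typing import List, Set, Tuple


class FakeCloud:
    """In-memory stand-in for a hypervisor or cloud API (hypothetical)."""

    def __init__(self, vms: Set[str]) -> None:
        self.vms = set(vms)

    def start(self, name: str) -> None:
        self.vms.add(name)

    def stop(self, name: str) -> None:
        self.vms.discard(name)


def reconcile(cloud: FakeCloud, desired: Set[str]) -> List[Tuple[str, str]]:
    """One pass of the control loop; returns the actions it had to take."""
    actions = []
    for name in sorted(desired - cloud.vms):   # missing: bring it back
        cloud.start(name)
        actions.append(("start", name))
    for name in sorted(cloud.vms - desired):   # unexpected: remove it
        cloud.stop(name)
        actions.append(("stop", name))
    return actions


if __name__ == "__main__":
    desired = {"web01", "web02"}
    cloud = FakeCloud(desired)
    cloud.stop("web02")                 # someone kills a VM on purpose
    print(reconcile(cloud, desired))    # prints [('start', 'web02')]
    print(reconcile(cloud, desired))    # prints [] because nothing is left to do
```

An orchestrator effectively runs such a pass over and over, whereas a configuration tool only acts when someone (or a scheduler) starts a run.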

Tools, tools, tools

Finally, we can get to the tools! Below I describe three tools that I have some experience with and all three are of a completely different nature.

Ansible

Ansible is an open source tool developed and maintained by Red Hat (now IBM). The tool uses Python under the hood, but the user does not need to learn any Python. Ansible abstracts a lot of the underlying code and hardware away through the use of modules for infrastructure components. It has a lot of modules for almost all the major networking vendors. How it works under the hood differs a bit per vendor, but generally you would configure your hosts to use the ‘network_cli’ connection, which is essentially SSH for network devices.

If you have an environment with different types of gear, such as Cisco IOS, Cisco ACI, Juniper, Arista etc., then you do need to make some selection somewhere in your hosts/inventory, because each platform needs its own modules. This is not hard, as you can group your inventory based on location, names, software etc., but just be aware of this necessity.

As stated earlier, Python knowledge is not mandatory for Ansible and, due to the plethora of supported modules, it’s great at supporting a lot of gear in your infrastructure. You can run Ansible Core by firing off playbooks from the CLI, but there is also a GUI available. The community version is called Ansible AWX and the paid version is Ansible Tower. This software is more than just a GUI though. Although you still need to write your playbooks in some sort of file, you can do troubleshooting, inventory management, scheduling etc. through this software. It provides a lot more capabilities than just Ansible Core.

In Ansible, it’s possible to use ad-hoc commands to configure a device, but it really all comes together when you start using playbooks. Playbooks are essentially files that contain a list of actions, which Ansible will run in that order. Although the playbooks are parsed from top to bottom, it is possible to run them in parallel against multiple hosts. You can easily create your own playbooks and run them against your inventory, but there is also a lot of pre-built content (mostly reusable roles) available through Ansible Galaxy. This content is often written by more experienced users and is great for inspiration or just to see how to manage a more difficult config.

The playbooks in Ansible are written in YAML. Anyone familiar with the name will know it’s considered an easy language: sort of like XML, but easier to read. However, this is also one of the biggest shortcomings of Ansible, because YAML is not as straightforward as you might expect. There is a difference between yes and 'yes': one is interpreted as a boolean, the other as a string. This can seriously mess up your config if you need to pass a string somewhere instead, and YAML has a lot of these quirks.

The other issue with YAML is that, although it is accessible, you end up having to learn a Domain Specific Language (DSL). Not necessarily because of YAML itself, but because of the way Ansible uses it. Some modules require data in one form, others in another. You’re learning to talk to Ansible (programming in some way) via YAML. This knowledge is not applicable outside of Ansible: you know what YAML looks like, but no other program will understand your code. Below is an example of Ansible code to set up Avahi, which is useless knowledge outside of Ansible.

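# Install the Avahi daemon and its command-line tools in a single task
- name: install avahi
  package:
    name:
      - avahi
      - avahi-tools
    state: present

# Turn on the reflector; the regexp also matches a commented-out default
# so that the existing line is replaced instead of a new one being appended
- name: enable the avahi reflector
  lineinfile:
    path: "/etc/avahi/avahi-daemon.conf"
    state: present
    regexp: '^#?enable-reflector'
    line: 'enable-reflector=yes'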

Another issue is troubleshooting. You can enable verbose logging, but you still don’t see exactly what’s going on. I’ve spent many hours on several playbooks because I couldn’t figure out why something was happening. Sometimes it’s the parsing of your YAML, sometimes you’re just missing something or addressing a variable wrong. Good luck figuring that out on the CLI.
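
To make the yes versus 'yes' pitfall a bit more tangible, the sketch below mimics in plain Python how a YAML 1.1 loader decides whether a single value is a boolean or a string. It is a simplified stand-in that only follows the word list of the YAML 1.1 boolean type; it is not a YAML parser, and real loaders differ in the details (some accept only part of that list). The `resolve_scalar` function is made up for this example.

```python
from typing import Union

# Words that the YAML 1.1 boolean type treats as true or false when unquoted.
_BOOL_WORDS = {
    "y": True, "yes": True, "true": True, "on": True,
    "n": False, "no": False, "false": False, "off": False,
}


def resolve_scalar(token: str) -> Union[bool, str]:
    """Resolve one scalar: quoted values stay strings, bare bool words do not."""
    token = token.strip()
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]                      # quoted: always a string
    for word, value in _BOOL_WORDS.items():
        # Only lower case, Capitalized and UPPER CASE spellings count.
        if token in (word, word.capitalize(), word.upper()):
            return value
    return token                                # anything else: plain string


if __name__ == "__main__":
    print(repr(resolve_scalar("yes")))      # True
    print(repr(resolve_scalar("'yes'")))    # 'yes'
    print(repr(resolve_scalar("NO")))       # False, even if you meant Norway
```

The practical takeaway is to quote any value that must stay a string, such as country codes or ‘on’/‘off’ keywords destined for a device config.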

In summary, the following are the positives and negatives of Ansible:

  • + Easy to get started
  • + A lot of available documentation
  • + A lot of available roles on Ansible Galaxy
  • + Mature environment
  • + Great community support and development
  • + Jinja2 support to create templates and use variables in generating configuration
  • - Incredibly slow compared to things such as Python - even when running in parallel
  • - Domain Specific Language (DSL) is a pain, because you need to invest time to learn, but you can never use it again outside of Ansible
  • - YAML is not strictyaml, data can be interpreted differently than what you expected
  • - Troubleshooting is a pain

When you start with Ansible, be sure to check out the documentation.

Terraform

I’ll be a bit more brief on Terraform, also because I’ve only recently started working with this tool. Terraform is an orchestrator, and all you have to do is describe the desired state that you want to achieve. Terraform also has a DSL, but it’s easier and more compact than the one from Ansible. I would state it’s easier to read as well, so even as a beginner, it’s quite easy to get started. If I compare the time I’ve spent (or lost) on specific Ansible syntax or tips & tricks with the time I’ve spent on HCL (HashiCorp Configuration Language), then Ansible seems to be more complex. Specifically, Terraform automatically understands interdependencies and relations. For example, it knows the requirements for setting up a VM in Azure and will run those tasks first, whereas Ansible reads the file top to bottom and you had better have it set up right from the start.
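
Working out the order from the dependencies is essentially a topological sort. The sketch below shows that idea with Python’s standard `graphlib` module (Python 3.9 or newer). It illustrates the concept only and is not Terraform’s actual implementation; the resource labels and the `creation_order` function are simplified assumptions.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each resource maps to the resources it depends on (simplified labels).
DEPENDENCIES: Dict[str, Set[str]] = {
    "virtual_machine": {"subnet", "public_ip"},
    "subnet": {"virtual_network"},
    "public_ip": {"resource_group"},
    "virtual_network": {"resource_group"},
    "resource_group": set(),
}


def creation_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the resources in an order where dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # The VM is listed first above, yet it ends up last in the result.
    print(creation_order(DEPENDENCIES))
```

The order in which the resources are written down no longer matters, which is exactly the difference with a file that is executed top to bottom.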

So, Terraform has a different approach compared to Ansible, but I would also say a different audience. It feels like the main focus is on the cloud, which it does very well. There are a lot of modules available for network gear as well, but fewer than Ansible has. To manage local servers, you need to use null resources with a local provider, which feels rather hacky. The number of modules overall is more limited compared to Ansible; however, there is another section with community modules that might satisfy some of your needs as well.

I’ve been using Terraform just for VMs in the cloud and will try it out soon with Cisco ACI as well, so more on that later. What is great though, is that Terraform has a built-in state tracker and a planner, which goes further than Ansible’s check mode (--check).

In the below example, I added two items and renamed a virtual network, the latter of which will be recreated.

➜  azure git:(dev) ✗ terraform plan -var-file=secrets.tfvars -out=update-azure.tmp
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

azurerm_virtual_network.eve_vnet: Refreshing state... [id=/subscriptions/<subscription-id>/resourceGroups/eve-lab-rg/providers/Microsoft.Network/virtualNetworks/eve-vnet]
azurerm_resource_group.eve-lab-rg: Refreshing state... [id=/subscriptions/<subscription-id>/resourceGroups/eve-lab-rg]

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  # azurerm_public_ip.eve-public-ip will be created
  + resource "azurerm_public_ip" "eve-public-ip" {
      + allocation_method       = "Static"
      + domain_name_label       = "eve"
      + fqdn                    = (known after apply)
      + id                      = (known after apply)
      + idle_timeout_in_minutes = 4
      + ip_address              = (known after apply)
      + ip_version              = "IPv4"
      + location                = "northeurope"
      + name                    = "eve-public-ip"
      + resource_group_name     = "eve-lab-rg"
      + sku                     = "Basic"
    }

  # azurerm_subnet.eve-subnet will be created
  + resource "azurerm_subnet" "eve-subnet" {
      + address_prefix                                 = "10.0.1.0/24"
      + enforce_private_link_endpoint_network_policies = false
      + enforce_private_link_service_network_policies  = false
      + id                                             = (known after apply)
      + name                                           = "eve-subnet"
      + resource_group_name                            = "eve-lab-rg"
      + virtual_network_name                           = "eve-vnet"
    }

  # azurerm_virtual_network.eve-vnet will be created
  + resource "azurerm_virtual_network" "eve-vnet" {
      + address_space       = [
          + "10.0.0.0/23",
        ]
      + id                  = (known after apply)
      + location            = "northeurope"
      + name                = "eve-vnet"
      + resource_group_name = "eve-lab-rg"

      + subnet {
          + address_prefix = (known after apply)
          + id             = (known after apply)
          + name           = (known after apply)
          + security_group = (known after apply)
        }
    }

  # azurerm_virtual_network.eve_vnet will be destroyed
  - resource "azurerm_virtual_network" "eve_vnet" {
      - address_space       = [
          - "10.0.0.0/23",
        ] -> null
      - dns_servers         = [] -> null
      - id                  = "/subscriptions/<subscription-id>/resourceGroups/eve-lab-rg/providers/Microsoft.Network/virtualNetworks/eve-vnet" -> null
      - location            = "northeurope" -> null
      - name                = "eve-vnet" -> null
      - resource_group_name = "eve-lab-rg" -> null
      - tags                = {} -> null
    }

Plan: 3 to add, 0 to change, 1 to destroy.

------------------------------------------------------------------------

This plan was saved to: update-azure.tmp

To perform exactly these actions, run the following command to apply:
    terraform apply "update-azure.tmp"
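
The reason a rename shows up as one ‘destroy’ plus one ‘create’ becomes clearer with a toy model of the planner: resources are tracked by their address (type plus name), so a new name looks like a new resource and the old address looks orphaned. The sketch below is only an illustration under that assumption and is not Terraform’s real algorithm. The `plan` function is made up, and the attribute values are trimmed down from the output above.

```python
from typing import Dict, List, Tuple


def plan(state: Dict[str, dict], config: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Compare recorded state with the written config, keyed by resource address."""
    actions = []
    for address in sorted(config):
        if address not in state:
            actions.append(("create", address))
        elif state[address] != config[address]:
            actions.append(("update", address))
    for address in sorted(state):
        if address not in config:
            actions.append(("destroy", address))
    return actions


# What was deployed earlier (note the underscore in eve_vnet).
STATE = {
    "azurerm_resource_group.eve-lab-rg": {"location": "northeurope"},
    "azurerm_virtual_network.eve_vnet": {"address_space": ["10.0.0.0/23"]},
}

# What the files describe now: vnet renamed, subnet and public IP added.
CONFIG = {
    "azurerm_resource_group.eve-lab-rg": {"location": "northeurope"},
    "azurerm_virtual_network.eve-vnet": {"address_space": ["10.0.0.0/23"]},
    "azurerm_subnet.eve-subnet": {"address_prefix": "10.0.1.0/24"},
    "azurerm_public_ip.eve-public-ip": {"allocation_method": "Static"},
}

if __name__ == "__main__":
    for action, address in plan(STATE, CONFIG):
        print(f"{action:8} {address}")
```

With this input the toy planner arrives at three creates and one destroy, the same totals as the summary line of the real plan.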

To summarize, these are the plusses and minuses I’ve found thus far with Terraform:

  • + Declarative
  • + Easier to understand and read DSL
  • + Shows the changes ahead of time and creates a plan
  • + More intelligent in understanding prerequisites and the order of a plan. Explicit dependencies are still an option if you need them.
  • - There is certainly documentation, but it lacks the background and examples like Ansible has
  • - Main focus on cloud
  • - Fewer infrastructure components are supported compared to the other tools
  • - Troubleshooting is as limited as Ansible I would say
  • - Changing managed resources outside of Terraform causes issues with the State in Terraform. No decent process yet to update this State file.

When you get started, start out with one of the available guides to deploy a VM in the cloud.

Python

Python is not a tool in the traditional sense, where you buy or download a software package, click around a bit and have it run. Python might require the most time investment to learn, but it’s also the most powerful tool of all. Python skills are not just limited to network devices, but can be used to do almost anything. Python is one of the most popular languages at the moment due to its simplicity and power. It’s being used in universities for deep learning, AI etc. This also means that it’s more likely for people to know and be able to write Python than, for example, an Ansible playbook. Because of its wide audience and vast usage, it’s easy to find solutions to your problems on the internet, and there are a lot of documentation and modules available.

Now, the problem with Python is that there are ten ways to do everything, including even opening a connection to your switches. Getting started is not exactly easy, and if you need to pick a course for one of these three tools, it should be Python. But once you understand the language and are able to write some basic logic, that’s where modules come in. There are dozens of great networking modules, often written by network engineers as well. Examples are the paramiko and netmiko modules to connect to network gear over SSH. There are also more abstracted modules, such as Napalm and Nornir, that allow you to write simple, compact code that executes show commands or even config changes on your gear, largely independent of the underlying platform.

Nornir is perhaps my favorite module, because it can use Napalm as the underlying module to abstract config away from the platform and push it through Napalm/Netmiko. Nornir is the Ansible Core of Python I would say, as it allows you to configure inventory files and run scripts against (parts of) those. I have written Nornir scripts of under 30 lines that generate entire configs for me based on several variables. Nornir can use Jinja2 templates as well to easily generate config snippets and keep that logic inside the Jinja2 templates. This could provide a good “upgrade” path if you’ve used Jinja2 extensively in Ansible and you’re looking for something more powerful.
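
To give a feel for the pattern (an inventory, a filter, and a task that runs against the selected hosts in parallel), here is a sketch that uses only the Python standard library. It deliberately does not use the Nornir or Jinja2 APIs: `string.Template` stands in for a Jinja2 template, and the `select`, `render_config` and `run` functions, as well as the hosts and their attributes, are all made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from string import Template
from typing import Any, Callable, Dict

# Hypothetical inventory: host name -> variables for that host.
INVENTORY: Dict[str, Dict[str, Any]] = {
    "sw-ams-01": {"site": "ams", "role": "access", "mgmt_vlan": 10},
    "sw-ams-02": {"site": "ams", "role": "access", "mgmt_vlan": 10},
    "rt-bru-01": {"site": "bru", "role": "router", "mgmt_vlan": 20},
}

SNIPPET = Template(
    "hostname $name\n"
    "interface Vlan$mgmt_vlan\n"
    " description management\n"
)


def select(inventory: Dict[str, Dict[str, Any]], **criteria: Any) -> Dict[str, Dict[str, Any]]:
    """Keep only the hosts whose variables match every given criterion."""
    return {name: data for name, data in inventory.items()
            if all(data.get(key) == value for key, value in criteria.items())}


def render_config(name: str, data: Dict[str, Any]) -> str:
    """The 'task': generate a config snippet from the host's variables."""
    return SNIPPET.substitute(name=name, **data)


def run(task: Callable[[str, Dict[str, Any]], str],
        hosts: Dict[str, Dict[str, Any]], workers: int = 4) -> Dict[str, str]:
    """Run the task for every selected host using a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(task, name, data) for name, data in hosts.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    for host, config in run(render_config, select(INVENTORY, site="ams")).items():
        print(f"--- {host} ---\n{config}")
```

In a real script the task would connect to the device instead of just rendering text, but the structure of inventory, filter and task stays the same.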

In summary, the following are the positives and negatives of Python:

  • + Mature environment
  • + Great community support and development
  • + Jinja2 support to create templates and use variables in generating configuration
  • + Python can be used for so many other things as well
  • + The most flexible of all
  • + There are libraries and packages for almost anything
  • - You still have to get familiar with each individual package or library and read documentation
  • - More difficult to start
  • - Easier to make mistakes such as forgetting validation or not applying best practices

Before you get started

Before you get started with any of these tools, I think the following aspects are very important for a successful automation journey:

  1. Familiarize yourself with your infra: before you start, know where the biggest wins are. How consistent is the infrastructure? If there are too many inconsistencies, different technologies etc., start in a small corner of the infrastructure and slowly broaden the scope.
  2. Know your tools: get familiar with your existing tools and what they can and cannot do and see which of these “automation systems” best works in conjunction with that. Don’t forget about stuff like git. It’s easy to get started and versioning is essential. It’s much better than Dropbox or Sharepoint, especially once multiple people work on the same code. So start right, start to git!
  3. Know your limitations: if this is your first time automating, mind tip 4 below. Get familiar with the technology by practicing and start with something that you’re comfortable with. A simple project could be to ensure that all network gear uses the same name servers. Step it up from there by writing something to ensure a working NTP sync everywhere. After that, you can start thinking about VLANs, interfaces and the problems that come up when trying to rework an ACL.
  4. Start small: When you’re just getting started, choose a tool that fits your goal and start out with a simple, repetitive task. This is the stuff that can give you some quick wins and you’ll be able to learn the pros and cons of the software you’re trying to use. Don’t start out by setting up an Ansible Tower or AWX server when you haven’t written your first playbook yet. If you need to ensure the same NTP servers on all your equipment in your infrastructure, start by doing this. You will soon discover that things can get complicated real fast and you need to keep things small and simple. Later on, you can combine playbooks in bigger Ansible roles if that is your jam.
  5. Don’t let all this text discourage you: Again, start small and experiment. That way you will learn, the same way you had to learn to manage network devices. The tools are only getting better and more accessible, making it a great time to start learning. I tried programming for a while and didn’t really get the hang of it until I started automating my home and the networks I was working on. It takes time, but it does for everyone else too.
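
The first project suggested in tip 3, checking that all gear uses the same name servers, could start out as small as the sketch below. It assumes you already have the running configs available as text and that they use Cisco IOS-style `ip name-server` lines. The function names, device names and addresses are made up for this example.

```python
from typing import Dict, List

# Hypothetical standard, using documentation-only addresses.
EXPECTED_NAME_SERVERS = ["192.0.2.53", "198.51.100.53"]


def configured_name_servers(config_text: str) -> List[str]:
    """Collect every address that follows 'ip name-server' in a config."""
    servers: List[str] = []
    for line in config_text.splitlines():
        parts = line.split()
        if parts[:2] == ["ip", "name-server"]:
            servers.extend(parts[2:])
    return servers


def find_drift(configs: Dict[str, str], expected: List[str]) -> Dict[str, List[str]]:
    """Map each non-compliant device to the name servers it actually has."""
    drift = {}
    for device, text in configs.items():
        actual = configured_name_servers(text)
        if sorted(actual) != sorted(expected):
            drift[device] = actual
    return drift


if __name__ == "__main__":
    configs = {
        "sw01": "hostname sw01\nip name-server 192.0.2.53 198.51.100.53\n",
        "sw02": "hostname sw02\nip name-server 192.0.2.53\n",
    }
    print(find_drift(configs, EXPECTED_NAME_SERVERS))
    # prints {'sw02': ['192.0.2.53']}
```

A report like this is a safe first step because it only reads; pushing the fix can come later, once you trust what the check tells you.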

Conclusion

So, what’s the conclusion of this entire article then? I guess “it depends”.

I personally believe in “Practice what you preach”, so I actually use all three, with git as version control. I started with Ansible, but have been moving to Python as it is more versatile and I can use it outside of work as well. However, I tend to quickly increase the project scope and lose track of things. Also, I sometimes have problems organizing my code so that I can quickly understand it again when I look at it in the future.

That’s why I started looking into Terraform, so I don’t have to figure out an API for every service myself and try to keep track of its changes. I’m not a developer after all! Right now, I’m using Terraform for my VMs in the cloud, Ansible for my local network and quick tasks, and Python with Nornir for the bigger stuff such as configuration generation and ensuring certain config is on the device. Interacting with the Cisco SD-WAN API is still a Python job for now, as no module exists in the other two tools. This is one of the strengths of Python: you just write it yourself and you don’t need to wait for a product update from anyone.

Each tool has its pros and cons, and I try to balance which tool to use when. The downside of this approach is that I’m a master of none. However, in the future I hope I can drop Ansible and focus more on Terraform and Python, as those two are my favorites for now. That is not to say that Ansible doesn’t still have a place in my heart; it is how I got started on this rollercoaster called network programmability.

When you do start coding, and you’re looking for a good code editor, take a look at my previous article regarding VS Code.