A little more than three years ago, we embarked on a journey. The goal was simple: to reduce our delivery times and improve our build quality by using automation. Thanks to great team members as well as some awesome open source software, we’ve made major strides toward achieving that goal.

The whole story begins with trust and empowerment. A small team was formed to start automating our server-build processes with the eventual hope of bringing automation to our entire distributed compute environment, including servers in our corporate data centers and public clouds, as well as in our distribution centers and stores. To achieve this, we chose to work with Open Source Puppet 3. Management endorsed our decision, and so we got to work.

In the beginning, we didn’t know how many nodes we could run Puppet on. What we did know is that we wanted to unify and centralize the function, to have one process and one point of contact for as many of our build processes as possible. A major turning point occurred when we began building greenfield data center servers with Puppet around the summer of 2014. Management sent down a major challenge that in retrospect was surprisingly non-directive: “How many nodes can you get Puppet on before holiday?”

We came back with 2,000, which we thought was pretty aggressive. A small coalition of engineers from various infrastructure disciplines stood in our VP’s office, feeling confident that 2,000 nodes would make a statement and could help us make a difference. The VP looked at the number 2,000, written on his dry-erase board, calmly rubbed out the “2” and replaced it with a “7.” Our jaws actually dropped.

We were talking about taking a new technology from zero to about 500 stores in less than two months. That alone would have more than doubled our existing Puppet footprint. Would the system even scale that big? We had no idea. And now we were being asked to take that doubling and triple it.

Jackson discusses bringing automation to Walmart using Puppet.

Our VP said, “We have over 10,000 stores. 500 won’t make that big of a difference. It won’t be enough to change how the sysadmin function works.” One of the engineers in the room, who had worked on store systems nearly his whole career, said, “Well, if we want to shoot for 7,000, why not do the whole chain?” That meant 30,000 nodes. The VP said, “Alright! That sounds like you’re dreaming big.”

We filed out of that office, not sure whether we could do it or not, but together we were going to try. If we failed, it wouldn’t be because we hadn’t given it our all.

Over the next two months, we deployed nodes, often 500 or 2,000 at a time. The first time we tried to go to 500, we overwhelmed our Puppet servers, so we learned to add splay and retry logic to our installation and bootstrap process. Several times we broke our Puppet infrastructure and had to build more Puppet servers or classifiers, or make the ones we already had bigger.
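To make the splay idea concrete, here is a minimal puppet.conf sketch; the values shown are illustrative assumptions, not the settings we actually shipped. Puppet’s agent supports the splay and splaylimit settings, which add a random delay (up to the limit) before each run so that thousands of freshly built nodes don’t all check in at the same instant:

    # puppet.conf on each agent (example values only)
    [agent]
        # Wait a random amount of time, up to splaylimit, before each run,
        # so check-ins are spread out instead of arriving in one burst.
        splay       = true
        splaylimit  = 30m
        runinterval = 30m

The retry side can be as simple as having the bootstrap script re-run the initial “puppet agent -t” a few times, with a pause between attempts, instead of failing the build on the first overloaded response.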

At any point, our management could have said, “Thanks for trying, but it looks like this isn’t going to go the distance.” But they gave us the chance to work through the problems, and that made all the difference. The system held up. Then, right around the middle of October — two weeks before our deadline — we finished it. We had more than 30,000 nodes reporting into a single, load-balanced infrastructure.

Once we got Puppet into the store chain, it was fairly easy to get it onto the rest of our brownfield server footprint. We had answered the question of whether Puppet could scale. But much more importantly, we had proven to ourselves and to the organization as a whole that we could use Puppet as a system to manage infrastructure change with both speed and quality.

We upgraded the agent software, and as we learned the ins and outs of how Puppet worked, we used it to manage more and change more. And we are still expanding what we manage with the tooling today. Success with the store chain rollout gave us the confidence to grow our environment even larger.

Once Puppet 4 came out and support for Puppet on Windows improved, our Windows teams decided they wanted to use Puppet as well. Today, we have more than 55,000 nodes reporting into a single administrative Puppet instance. We’ve upgraded the infrastructure three times, and we’re running the latest versions of all the tools.

But it all began with our leadership making the conscious decision to trust us. That led to an amazing level of pride and ownership in the solution we built, and that in turn led to some amazing results, for us and for our company.

Comments

  1. Any reason you chose Puppet over Chef or SaltStack? What were the compelling reasons to fit Puppet into the WMT strategy?

    1. Hi Sushil,

      Thanks for the question!

      We did a “bake-off” with Puppet against the other major configuration management systems available in mid-to-late 2013. One of the biggest reasons we chose Puppet in the beginning was the preference of the people who were going to be writing content for us (i.e., they liked writing Puppet code and were ambivalent about the competing systems); that preference was driven by Puppet’s wide adoption in the community, the ease of getting Forge modules, and its excellent documentation. We reasoned that any configuration management system we implemented would have to be easy to learn to write content for, and those were compelling advantages for Puppet.

      The other major factor that drove us to Puppet was its remarkably good mechanism for reporting on the changes it has made, and the fact that this reporting is a fundamental part of the core product. It is hard to overstate the value of having objective data about what has changed, and why, in the early stages of a major operational incident.

      Thanks!
