As system engineers at Walmart, we’re tasked with providing creative — and sometimes unconventional — solutions to challenges. So when cloud-scale loads caused the company’s online B2B platform for supplier data, RetailLink, to suffer performance degradation and availability issues, it was our job to find a way to maintain uninterrupted access to the data that suppliers need to run their business.
The result was the creation of a dynamic, distributed caching solution that could withstand Walmart’s demands. We’re excited to share the open source release of that solution — zECS (Enterprise Cache Service), which runs on the z Systems Mainframe and its z/OS operating system.
RetailLink supports operations for tens of thousands of suppliers by providing visibility into their supply chain from distribution facilities all the way through customer checkouts. The RetailLink system had undergone modifications that included a distributed caching tier to support performance and robust session management capabilities for its web tier of .Net and Java applications. A distributed caching appliance was initially selected for this purpose, but the burden of Walmart-scale demands caused it to suffer unacceptable performance degradation.
The case for zECS
To create a dynamic, distributed caching solution accessible via web technologies, my colleague Randy Frerking and I innovated on a traditional (but unconventional for the purpose) platform: IBM’s z Systems mainframe and its z/OS operating system.
In enterprise operations, system failure can result in substantial loss. Combine possible system failure with an ever-changing tech environment, and it becomes crucial to create functions that complement the basic requirements of the application. For our purposes, these were:
- Reliability — The solution had to remain available under any load and in the event of facility outages. It also had to perform consistently, sustaining thousands of transactions per second at less than 2ms response times while handling relatively large session object sizes (8KB – 32KB). We developed custom capabilities to handle parts of these requirements, but we also leaned heavily on the RAS characteristics (Reliability, Availability, and Serviceability) of the z/OS stack.
- Platform Autonomy — As previously alluded to, constant change is normal and expected in tech environments, so we needed to ensure that our design did not impede future changes to the application. We elected to use HTTP-based RESTful APIs for the service to maintain a loose coupling with the application components and to abstract away any complexities of the hosting platform.
- Minimized burden on application developers — To avoid placing extra work on the development team, we mimicked the API of the previous solution to provide a transparent, drop-in replacement for the application developers. All that was required to begin using the new solution was a configuration change of the hostname.
- Transparent HTTP method adjustment — The service transparently adjusts HTTP methods (e.g., converting a POST to a PUT, or to a DELETE-then-POST for pre-existing keys) and supports configurable default and per-object Time-to-Live values. Both relate to our patented auto-expiry process, which eliminates I/O operations and avoids common LRU (Least Recently Used) eviction algorithms. In addition, a single URI endpoint means neither the application nor an extra client library has to manage and account for key ranges across shards.
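The patented auto-expiry process itself isn't described in detail here, but the basic time-to-live semantics it replaces LRU eviction with can be sketched in a few lines. This is a simplified, illustrative model (class name, defaults, and lazy-cleanup strategy are our assumptions, not the zECS implementation): each object carries its own expiry timestamp, derived from a per-object TTL or a configurable default, and expired entries are simply ignored on read, with no background scans or LRU bookkeeping.

```python
import time

class TTLCache:
    """Simplified sketch of TTL-based auto-expiry (illustrative only).

    Each entry stores its own expiry timestamp; expired entries are
    discarded lazily on access, so no LRU lists or eviction scans
    are needed.
    """

    def __init__(self, default_ttl=300):
        self.default_ttl = default_ttl  # seconds, configurable default
        self._store = {}                # key -> (value, expires_at)

    def put(self, key, value, ttl=None):
        # Per-object TTL overrides the configured default.
        ttl = self.default_ttl if ttl is None else ttl
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]        # lazy cleanup on access
            return None
        return value

cache = TTLCache(default_ttl=60)
cache.put("session:42", {"user": "acme"}, ttl=0.05)
print(cache.get("session:42"))   # still live
time.sleep(0.06)
print(cache.get("session:42"))   # expired -> None
```

The key design point mirrored here is that expiry costs nothing until a key is actually touched, which is what lets the real service avoid the I/O and bookkeeping overhead of LRU-style eviction.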
Solution Design of zECS
The solution runs in CICS on z/OS and leverages numerous z/OS Parallel Sysplex concepts to satisfy the nonfunctional requirements we needed for the service. For example, WLM (Workload Manager)-managed Sysplex Distribution and Shared Ports are exploited to intelligently distribute work to the CICS region (i.e. virtual app server) within a cluster that is most capable of satisfying requests at that point in time and to provide elasticity that accommodates fluctuations in demand. Additionally, we decoupled the in-memory data structures from the servers and distributed them (via Coupling Facilities) with recovery logs to ensure “persistence” in the unlikely event of a full multisite outage.
The design employs global load balancers to distribute requests across geo-dispersed data centers, and zECS includes replication capabilities that incorporate patented synchronization mechanisms (configurable as synchronous or asynchronous). The following diagram represents the overall architecture of the solution:
Evidence of success
The quick delivery of this caching solution allowed us to fulfill our obligations to our business partners, and ultimately the zECS product provided a familiar, API-based solution that continues to let our application developers meet other important business commitments quickly. The solution also transparently benefits from the mainframe’s qualities of service, such as inherent high availability and dedicated, robust I/O processing. The initial instance of the service has now been in production for about four years and has processed some 21 billion requests with zero downtime. There have been no planned or unplanned service outages, even through numerous software and physical system upgrades.
Innovating for the future
The success of this initial product paved the way for our team to pursue the cloud computing service delivery model for subsequent deployments, and further improve the experience for our development teams. With tenets like broad network accessibility, elasticity and measured service already incorporated in the product, we shifted our focus to on-demand self-service for the product offering. We developed an AngularJS-based web-app portal into our automation framework (entirely hosted in CICS on z/OS) that provides provisioning and deployment services through JSON-APIs to enable full self-service capability to our application developers.
With this addition, our developers can acquire robust, elastic, easy-to-use, API-driven service within seconds that allows them to move directly to solving business problems.
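To make the self-service flow concrete, a provisioning request to a JSON-API of this kind might be shaped roughly as follows. Everything here is hypothetical: the endpoint, hostname, and payload fields are invented for illustration, since the actual zECS provisioning API is not documented in this post. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request

# Hypothetical provisioning call -- the endpoint, hostname, and payload
# fields below are illustrative assumptions, not the real zECS API.
payload = {
    "service": "zECS",
    "environment": "dev",
    "defaultTtlSeconds": 3600,
}

req = urllib.request.Request(
    "https://provisioning.example.com/api/caches",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)  # request is built but never sent
```

The point is that provisioning reduces to a single JSON POST, which is what lets a developer acquire a cache instance in seconds rather than filing a ticket.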
Our engineering team has gone on to develop a suite of products that contribute to increasing developer productivity, and they are all based around similar architecture and design points. We are releasing a couple of these other products along with zECS to the community. The list of items being open sourced includes the following:
- zECS (z/OS-based enterprise cache service) — A REST-API based service for transient object storage/caching.
- zFAM (z/OS-based file access manager) — A REST-API based persistent object store service with a rich set of DBMS-like features.
- zUID (z/OS-based unique identifier generation) — A REST-API based service that generates unique identifiers using a specialized patent-pending algorithm. It is guaranteed to generate 100 percent unique identifiers at extremely high volume without requiring a database system for management.
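Because all three services are plain REST APIs, a client needs nothing more than standard HTTP verbs against a single endpoint. The sketch below shows what building such requests might look like for the cache service; the hostname, URI path scheme, and TTL header are assumptions for illustration, not the documented zECS interface. The requests are constructed but never sent.

```python
import urllib.request

# Illustrative only: the hostname, path scheme, and TTL header below
# are assumptions, not the documented zECS API.
BASE = "https://zecs.example.com/cache"

def build_put(key, body, ttl_seconds=None):
    """Store an object under `key`, optionally with a per-object TTL."""
    headers = {"Content-Type": "application/octet-stream"}
    if ttl_seconds is not None:
        headers["X-Ttl-Seconds"] = str(ttl_seconds)  # hypothetical header
    return urllib.request.Request(
        f"{BASE}/{key}", data=body, headers=headers, method="PUT")

def build_get(key):
    """Fetch the object stored under `key`."""
    return urllib.request.Request(f"{BASE}/{key}", method="GET")

def build_delete(key):
    """Remove the object stored under `key`."""
    return urllib.request.Request(f"{BASE}/{key}", method="DELETE")

put = build_put("session-abc123", b"serialized-session-state",
                ttl_seconds=1800)
print(put.get_method(), put.full_url)
```

Note how the single-endpoint design shows up in the client: every operation targets the same base URI with only the key appended, so no shard map or key-range logic ever leaks into application code.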
Our decision to open source these products is based on the belief that they may be useful to other organizations, and on the hope that others will be compelled to contribute to these or other community projects that deliver value to all of our respective businesses. For continued learning, download the following Redbooks:
I’d like to extend a special thanks to the engineers who assisted on this project, including Randy Frerking, Michael Karagines and Trey Vanderpool. We’re always looking for talented engineers to join our teams. If interested, check out engineering opportunities in California or search all other locations.