Walmart Technology’s history with big data began before the term became a buzzword. We recognized early on that the data we cull from our e-commerce and brick-and-mortar businesses is our most effective tool for optimizing user experience and planning strategically for the future.
In the past few years, our big data capabilities have continued to evolve by leaps and bounds. Here are a few of the tools that have facilitated that growth.
Processing tens of petabytes of data requires a resilient framework capable of analyzing raw data from a multitude of sources, ranging from social media activity to individual search and purchase histories. No system does this quite as well as Hadoop. At Walmart, we run a cluster that’s online 24/7, parsing social and transactional data from hundreds of millions of customers. Owing to its robust parallel processing capabilities and the fact that it runs on commodity servers, Hadoop enables us to efficiently extract meaningful insight from a breadth of unstructured data.
“We use Apache Hadoop to deliver value from the massive amount of data generated by all of Walmart’s retail chains. Hadoop enables developers to process large structured and unstructured data sets using a variety of open source tools in clusters of commodity hardware,” said Toni L. LeTempt, Hadoop engineer for Walmart Technology. “With our Hadoop clusters, we are able to harness extreme computing power to help our customers save money and live better.”
For data analysis, the primary advantage of Apache Hive over other Hadoop tools such as Pig is that it lets developers take advantage of Hadoop with a minimal learning curve. The Hive Query Language (HiveQL) is immediately familiar to any developer accustomed to relational databases, because Hive statements are syntactically very similar to conventional SQL. This familiarity streamlines the process of writing and running jobs across Hadoop clusters.
“What I like most about working with Hive is that with our large data sets, Hive is a tool that enables us to solve many of our business requirements while requiring minimal efforts to support the common profile model (CPM) and common transaction model (flat CTM) applications,” said Kashyap Desai, senior programmer analyst for Walmart Technology.
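To make that familiarity concrete, here is a minimal sketch of the kind of aggregate query a developer with a relational background would write. It is not one of our production queries: the `transactions` table, its columns, and the sample rows are hypothetical, and Python's built-in `sqlite3` module stands in for the query engine only so the snippet runs anywhere. The `SELECT ... GROUP BY` statement itself uses plain SQL constructs that HiveQL also supports.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
QUERY = """
SELECT store_id, SUM(amount) AS total_sales
FROM transactions
GROUP BY store_id
ORDER BY total_sales DESC
"""


def total_sales_by_store(rows):
    """Run QUERY against an in-memory SQLite table built from rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE transactions (store_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Made-up sample data: (store_id, amount).
sample = [(1, 20.0), (2, 5.5), (1, 4.5), (2, 10.0), (3, 1.0)]
result = total_sales_by_store(sample)
print(result)
```

On a Hadoop cluster the difference lies in what happens after the statement is submitted: Hive compiles it into distributed jobs rather than running it on a single machine.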
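An acronym for Yet Another Resource Negotiator, YARN is often described as a distributed operating system for big data apps. Its primary advantages include high cluster utilization, scalability and a flexible resource model. One of YARN’s most distinctive features is that it splits resource management and job scheduling into separate daemons: a global ResourceManager that hands out cluster capacity, and a per-application ApplicationMaster that schedules and monitors that application’s own tasks. Because no single daemon has to track every task of every job, clusters can scale without negatively impacting performance.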
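The split between resource management and job scheduling can be sketched with a toy model. The two classes below are simplified stand-ins that borrow the names of YARN's ResourceManager and ApplicationMaster daemons. They are not YARN's API, and the memory figures and task names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy cluster-wide daemon: it only tracks free capacity."""
    free_memory_mb: int

    def allocate(self, memory_mb: int) -> bool:
        # Grant a container only if enough memory remains.
        if memory_mb <= self.free_memory_mb:
            self.free_memory_mb -= memory_mb
            return True
        return False


@dataclass
class ApplicationMaster:
    """Toy per-application daemon: it schedules its own tasks."""
    name: str
    pending_tasks: list
    running: list = field(default_factory=list)

    def schedule(self, rm: ResourceManager, memory_per_task_mb: int) -> None:
        # Request one container per pending task until the cluster is full.
        while self.pending_tasks and rm.allocate(memory_per_task_mb):
            self.running.append(self.pending_tasks.pop(0))


rm = ResourceManager(free_memory_mb=4096)
etl = ApplicationMaster("etl", ["t1", "t2", "t3"])
report = ApplicationMaster("report", ["r1", "r2"])

etl.schedule(rm, 1024)     # takes three of the four available containers
report.schedule(rm, 1024)  # gets the last one; r2 stays pending

print(etl.running, report.running, report.pending_tasks, rm.free_memory_mb)
```

In this sketch the resource manager never needs to know what a task is. It only answers capacity requests, while each application keeps its own task bookkeeping.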
While Hadoop offers a powerful analytics engine for unstructured data, it is built for batch processing and is not well suited to serving the low-latency reads and writes behind web and mobile applications. That’s where Cassandra comes in. Offering high availability and the ability to distribute nodes across multiple physical locations, Cassandra is designed to process online workloads comprising large quantities of interactions. Used in conjunction with Kafka and Storm, Cassandra also enables the processing of customer-based events in real time.
“We use Apache Cassandra to build always-on, resilient applications with great performance that can take advantage of the vast amount of data generated by our retail operations,” said Andrew Weaver, senior technical expert for Walmart Technology. This tool helps us “give our customers the shopping experience they expect.”
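One way to picture how data stays available when nodes are spread across physical locations is a token ring in which every key is copied to several nodes in each data center. The sketch below is a simplified illustration of that idea, not Cassandra's actual placement code. The node names, the `east` and `west` data centers, the replica counts, and the choice of MD5 as the hash are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a ring position (MD5 chosen only for determinism)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes):
    """nodes maps node name -> data center; returns sorted (token, node, dc)."""
    return sorted((token(name), name, dc) for name, dc in nodes.items())


def replicas_for(key, ring, replicas_per_dc):
    """Walk the ring clockwise from the key's token, taking nodes until
    each data center holds its requested number of replicas."""
    tokens = [t for t, _, _ in ring]
    start = bisect_right(tokens, token(key))
    counts = {dc: 0 for dc in replicas_per_dc}
    chosen = []
    for i in range(len(ring)):
        _, name, dc = ring[(start + i) % len(ring)]
        if counts.get(dc, 0) < replicas_per_dc.get(dc, 0):
            chosen.append(name)
            counts[dc] += 1
    return chosen


# Hypothetical six-node cluster split across two locations.
nodes = {
    "east-1": "east", "east-2": "east", "east-3": "east",
    "west-1": "west", "west-2": "west", "west-3": "west",
}
ring = build_ring(nodes)
owners = replicas_for("customer:42", ring, {"east": 2, "west": 2})
print(owners)
```

Because each key ends up with copies in both locations, losing a node, or in this sketch an entire data center, still leaves a replica that can serve the request.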
Walmart Technology is currently looking for data scientists and engineers to analyze and transform big data, to innovate in new ways, and to help our customers save money and time. If you’re a Cassandra or Hadoop fanatic ready to dive into Walmart’s world of data, search Walmart Technology openings.