Just a few short years ago, most job descriptions for data science and machine learning positions stipulated the following:
“Minimum qualifications: Currently working toward a Ph.D. degree in computer science, electrical engineering, operations research or a related field.”
As someone with a background in hard science, I could not help but notice a few glaring omissions, namely mathematics, physics and chemistry. Data analytics jobs just weren’t very inclusive when it came to candidates with my type of background.
But let’s take a step back to the basics. What is hard science?
The terms hard science and soft science are used to differentiate fields on the basis of the level of perceived methodological rigor, exactitude and objectivity. The distinction therefore is not clearly defined but more a matter of perception. To me, hard science differs from soft science in that it relies on objective observations that can be described in mathematical terms; it is the study of fundamental truth.
My original background is in particle physics. While I loved math, I found physics more attractive because it offered both the formalism of math and a tangible, pragmatic goal—that of understanding and explaining reality.
I eventually pursued a Ph.D. in experimental physics, working with huge amounts of data recorded by the BaBar detector at the Stanford Linear Accelerator Center. The signal I was trying to uncover was extremely small, hidden in a mountain of noisy, high-dimensional data. In fact, I was trying to measure an asymmetry of just a few percent in a signal that could be captured in only about one case out of 1,000,000, not counting the inefficiencies of the data-collection process. Needless to say, extreme rigor and precision were the cornerstones of my work.
Today, new technologies allow data to be collected at an accelerating rate. Many believe that this explosion of data will soon enable the data science community to answer tough questions that could not be answered previously. Some even claim that with enough data, no new algorithms will need to be developed, and that the future of data science is simply machine learning at scale.
There are, however, many reasons I do not adhere to that view. I believe the key to making sense of big data lies in big math, or the application of advanced mathematical techniques at scale. Here’s why:
- Precision and accuracy are two different things
Statisticians differentiate between two types of measurement errors:
A systematic error, or bias, limits the accuracy of a measurement. This type of error is typically caused by a faulty measurement system, or by an effect unaccounted for in a model. A systematic error describes how wrong the result is.
A statistical error, on the other hand, limits the precision of a measurement, that is, how reliable it is. Although it is called an error, it is not a mistake but an uncertainty due to random chance. Repeating an experiment several times, or collecting more data, will eventually reduce this type of error.
To make business decisions more data-driven, companies invest substantially in capturing ever larger data sets. But while collecting more data might help decrease statistical errors, it won't improve accuracy. A frequent example in e-commerce is that the probability of a customer clicking on a suggested item depends not only on the item itself but also on the position in which the item is shown. A model that ignores position will stay biased no matter how many clicks are recorded.
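The difference between the two kinds of error can be seen in a short simulation. This is a minimal sketch with invented numbers: a true value of 10.0 measured by a hypothetical instrument that is biased by +0.5. Collecting more samples shrinks the statistical error, but the systematic error never goes away.

```python
import random
import statistics

random.seed(42)

TRUE_VALUE = 10.0
BIAS = 0.5  # systematic error: e.g. a miscalibrated instrument

def measure(n_samples):
    """Average n noisy, biased readings; return (estimate, statistical error)."""
    readings = [TRUE_VALUE + BIAS + random.gauss(0, 2.0) for _ in range(n_samples)]
    stderr = statistics.stdev(readings) / n_samples ** 0.5
    return statistics.mean(readings), stderr

for n in (100, 10_000, 100_000):
    mean, stderr = measure(n)
    print(f"n={n:>7}: estimate={mean:.3f}  statistical error≈{stderr:.3f}")

# The statistical error shrinks like 1/sqrt(n), but every estimate
# remains roughly 0.5 away from the true value of 10.0.
```

No amount of extra data moves the estimate closer to 10.0; only fixing the model of the instrument (accounting for the bias) would.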
- Not all data is created equal
Mathematicians specializing in information theory frequently use the concept of entropy to measure the amount of information contained in a data set. Two data sets of the same size can carry very different amounts of information, because real-world data also contains noise, and dealing with noisy data usually requires more refined algorithms. In that case, only more advanced algorithms, not more data, can improve the results.
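As a small illustration of the entropy idea, here is a sketch with synthetic values: two data sets of identical size, one nearly constant and one evenly spread, have very different Shannon entropies, i.e. carry different amounts of information per record.

```python
import math
from collections import Counter

def entropy(data):
    """Shannon entropy, in bits per symbol, of a sequence of discrete values."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

skewed = ["a"] * 98 + ["b"] * 2         # nearly constant: little information
uniform = ["a", "b", "c", "d"] * 25     # evenly spread: maximal information

print(f"skewed:  {entropy(skewed):.3f} bits per symbol")
print(f"uniform: {entropy(uniform):.3f} bits per symbol")
```

Both lists contain 100 records, yet the uniform one carries far more information per record, which is exactly why record count alone is a poor proxy for the value of a data set.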
- More data often demands more complex models
In the race to collect more data, organizations like Walmart are changing their strategies. For example, to increase the number of customer reviews, a retailer might choose to incentivize its customers by prompting them for real-time feedback through pop-up windows. But this approach introduces a bias: a customer who leaves feedback only because of the prompt might behave very differently from the typical customer. This effect can only be dealt with through the use of a more mathematically complex model.
- With more data comes more responsibility
According to the laws of statistics, larger amounts of data lead to better precision, but only slowly: the statistical error shrinks with the square root of the sample size, so it can take enormous quantities of data to improve the precision by just a fraction of a percent. But rather than improving the accuracy of existing models, the data Walmart collects from customers' online traffic is best used to offer them a more personalized experience. As a retailer, Walmart's first priority is customer satisfaction. It is often wiser from a business standpoint to use data to offer customers a new experience tailored to their own needs than to correct the position of an item shown on page five of the search results.
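The square-root scaling above has a blunt consequence that a back-of-the-envelope calculation makes plain: to shrink the statistical error by a factor of k, you need roughly k² times as much data. (The baseline of 10,000 samples here is an arbitrary illustrative figure.)

```python
def samples_needed(base_n, improvement_factor):
    """Samples required to shrink the statistical error of a mean by a
    given factor, since the error scales as 1/sqrt(n)."""
    return base_n * improvement_factor ** 2

base = 10_000
for factor in (2, 10, 100):
    print(f"{factor}x better precision needs {samples_needed(base, factor):,} samples")
# 2x   -> 40,000 samples
# 10x  -> 1,000,000 samples
# 100x -> 100,000,000 samples
```

Each extra significant digit of precision costs a hundredfold more data, which is why chasing precision alone is rarely the best use of a growing data set.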
- Data phenomenology vs. data theory
Most current data analysis is geared toward the study of correlations between two variables. But to actually understand such correlations as opposed to merely assessing them requires a very different skillset than that of the traditional data analyst. While big data is necessary to measure these correlations (phenomenology), mathematics and critical thinking provide the tools to understand and interpret what they mean (theory).
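The gap between measuring and understanding a correlation can be shown with a toy example. All quantities below are synthetic and hypothetical: two variables are strongly correlated only because they share a hidden common cause, and the correlation coefficient alone cannot reveal that.

```python
import random

random.seed(0)

# Hidden common cause: daily temperature (invented numbers).
temperature = [random.gauss(20, 5) for _ in range(1000)]
# Two effects that both depend on temperature, plus independent noise.
ice_cream_sales = [2.0 * t + random.gauss(0, 1) for t in temperature]
sunburn_cases = [0.5 * t + random.gauss(0, 1) for t in temperature]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(ice_cream_sales, sunburn_cases)
print(f"corr(ice_cream_sales, sunburn_cases) = {r:.2f}")
# The correlation is strong, yet neither variable causes the other.
```

Measuring r is the phenomenology; recognizing that temperature is the confounder, and modeling it, is the theory. No volume of sales and sunburn data alone will surface that explanation.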
The beginning of the era of big data is, rightly so, a reason for great excitement in the data community. It means better technology, more memory, cheaper systems and more ways to capture data. However, the use of mathematical rigor will be needed more than ever as new challenges arise. Mathematical minds will be in demand, and hiring managers will become increasingly aware that the real value of big data can be unleashed only through the use of big math.