by Nicola Phillips, Copywriter
1860 was a leap year.
In November, Abraham Lincoln was elected president, prompting the quick secession of seven Southern states from the Union and changing America forever. The stage had already been set for dramatic change, however, when on February 29th a son was born to a German immigrant schoolteacher living in upstate New York.
Herman Hollerith has a devoted following, but it is restricted to the halls of science and technology. Ask the average person who Herman Hollerith was, or why we care, and they likely won’t have a clue.
If you have heard of Hollerith, it’s probably because of his machine — one that, half a century after its invention, gave rise to the modern computer.
Hollerith entered his tabulating machine in a contest hosted by the Census Bureau in 1888 to find a faster way to process U.S. census data. His machine used batch processing: collecting records into groups (batches) and working through each group in a single run, rather than handling records one at a time as they arrive. The process is ideal for large data sets that require a lot of computational work but are not time-sensitive.
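To make the idea concrete, here is a minimal sketch of batch processing in Python. It is an illustration only, not how Hollerith's machine worked: the `batched` and `process_batch` helpers, the batch size, and the sample records are all hypothetical.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of up to batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Hypothetical per-batch job: total the values in one batch."""
    return sum(batch)


# Hypothetical input: ten numeric records collected ahead of time.
records = list(range(1, 11))

# Nothing is reported per record; each answer covers a whole batch.
batch_totals = [process_batch(batch) for batch in batched(records, batch_size=4)]
print(batch_totals)
```

The data is gathered first and processed afterward, which is why this style suits work that is heavy but not urgent.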
Hollerith went on to found a company that in 1911 merged with several others to form the Computing-Tabulating-Recording Company. In 1924, someone had the good sense to ditch the clunky name, and IBM (International Business Machines) was officially born.
The engineers at IBM today are working with machines that bear little resemblance to Hollerith’s invention, and for the most part, don’t engage in batch processing. In fact, the average person isn’t using batch processing much in their day-to-day life.
Batch processing is used by whoever handles your company’s payroll, and by your bank to reconcile your monthly credit card transactions. Any job that works on data with consistent characteristics and doesn’t need to happen in real time can use batch processing. But that’s the kicker: we like things happening in real time, even when there is no real urgency.
And because of this, batch processing is likely not a major factor in your life.
The web we’ve known since the mid-2000s, Web 2.0 (alternatively referred to as the social web or the real-time web), primarily uses stream processing: handling each piece of data as it arrives and producing answers in real time, as a steady, continuous flow. Around the turn of the century, data companies started using stream processing to generate insights instantaneously. If you are scrolling through Twitter, watching Netflix, or on a Zoom call, you’re enjoying the wonders of stream processing.
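For contrast with batch processing, a stream processor updates its answer every time a new piece of data arrives. The sketch below is a simplified illustration, assuming a hypothetical `running_average` function and made-up readings; real streaming systems are far more involved.

```python
def running_average(events):
    """Yield the average of all values seen so far, one result per event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        # An updated answer is available immediately after each event.
        yield total / count


# Hypothetical stream of readings arriving one at a time.
readings = iter([10, 20, 30, 40])
averages = list(running_average(readings))
print(averages)
```

Each incoming value produces a fresh result right away, instead of waiting for the whole data set to be collected.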
Stream processing has plenty of uses beyond social media and video, too. Cybersecurity, surveillance, fraud detection, traffic monitoring, geofencing, smart advertising…the list goes on and on.
So the question is, in a world where “speed will decide the winners and losers,” why would anyone return to the clunkiness of batch processing?
Enter Web 3.0, the Semantic Web. We don’t need to get into the weeds with Web 3.0. The key features are decentralization and the more intense use of artificial intelligence.
So, how does AI relate to batch processing?
The intelligence of machines is not like the intelligence of humans. Machines can’t grasp nuance or explain the implicit “why.” What machines are really good at is performing a task over and over very quickly. Machines “learn” through sheer computational grit.
And grit, it turns out, can take you really far.
The census example actually gives a great visual representation of what kinds of data batch processing is ideally suited for. Knowing a single data point in the census as it’s calculated doesn’t matter; what we care about is the end result. We want the census results on a timely basis, but we don’t need them in real-time. And, importantly, the census is processing a lot of data that has similar requirements — it’s looking again and again at a limited set of inputs: age, race, gender, location, etc.
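A census-style tabulation can be sketched in a few lines of Python. This is a toy example under stated assumptions: the records, the field names, and the `tabulate` function are invented for illustration and do not reflect any real census format.

```python
from collections import Counter

# Hypothetical census records; every record has the same limited set of fields.
census_records = [
    {"age_group": "0-17", "location": "New York"},
    {"age_group": "18-64", "location": "New York"},
    {"age_group": "18-64", "location": "Ohio"},
    {"age_group": "65+", "location": "Ohio"},
]
FIELDS = ("age_group", "location")


def tabulate(records, fields):
    """Count how often each value appears for each field, across all records."""
    tallies = {field: Counter() for field in fields}
    for record in records:
        for field in fields:
            tallies[field][record[field]] += 1
    return tallies


# Only the finished tallies matter, not any intermediate count.
results = tabulate(census_records, FIELDS)
print(results["location"])
```

The same small operation runs on every record, and the only output anyone needs is the final set of tallies.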
There are lots of applications for a process like this. The next breakthrough drug will be discovered using batch processing. Pivotal forecasts for climate change, such as soil organic carbon models, and the mitigation approaches that follow from them, will be developed through batch processing. Batch processing could be the way we figure out how to identify cells before they turn cancerous, or how we prepare for the rising onslaught of climate-related natural disasters. Batch processing is ideally suited to these gigantic data sets that require a lot of computational power but are repetitive in nature: tasks that fall under the umbrella term machine learning, an application of AI.
Think of batch processing as a machine performing a single math function over and over again. Instead of executing at the rate of a human, the machine performs this same function at a rate of, say, 100 trillion times a second (some machines in crypto mining do just this).
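To put that rate in perspective, here is a quick back-of-the-envelope calculation. The machine rate comes from the figure above; the human rate of one calculation per second is purely an assumption for illustration.

```python
MACHINE_OPS_PER_SECOND = 100e12  # 100 trillion, the figure quoted in the text
HUMAN_OPS_PER_SECOND = 1         # assumption: one calculation per second
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Time a person would need to match one second of machine work.
human_seconds = MACHINE_OPS_PER_SECOND / HUMAN_OPS_PER_SECOND
human_years = human_seconds / SECONDS_PER_YEAR
print(f"{human_years:,.0f} years")
```

Under that assumption, one second of the machine's work would take a person roughly 3.2 million years of nonstop calculation.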
Let’s think back to that census example again.
The impetus for the Census Bureau’s competition that led to Hollerith’s tabulating machine was the prior census of 1880. Without a way to efficiently tabulate results, it had taken 7 years (yes, years) to process the 1880 census. Hollerith’s machine revolutionized the census. In the 1888 contest, using batch processing, Hollerith completed the entire test, from data capture to tabulation, in 78 hours.
Herman Hollerith didn’t invent batch processing; he invented an application of batch processing — one that showcased the scale at which batch processing could function, and what it could achieve.
We’re betting on similar breakthroughs, which is why we’re building data centers specifically designed for batch processing. If you’re curious, take a look.