The amount of big data that Facebook handles never ceases to amaze. At a subdued press conference this week, attended by only a few reporters, the social media giant revealed a whole host of impressive stats about its data operations.
As thing stand right now, its system processes something close to 500+ terabytes of data and 2.5 billion pieces of content every single day. This includes more than 300 million new photos uploaded each day, plus a staggering 2.7 billion likes every 24 hours!
Such a voluminous amount of data means that Facebook is presented with some unique challenges, one of the biggest of them being how to create server clusters that can operate as a single entity even when they’re located in different parts of the globe.
At this week’s press conference, Facebook gave us some details on their latest infrastructure project, which they’ve codename “Project Prism”.
Jay Parikh, Facebook’s Vice President of Engineering kicked off by speaking about the huge importance that the social media company placed in the project:
“Big data really is about having insights and making an impact on your business. If you aren’t taking advantage of the data you’re collecting, then you just have a pile of data, you don’t have big data.”
Parikh explained that ‘taking advantage’ of data was something that had to be done in a matter of minutes, so that Facebook would be able to instantly understand user reactions and respond to them in something close to real time.
“With 950 million users, every problem is a big data problem, and one of the biggest challenges… is with MapReduce,” added Parikh.
For those who don’t know, MapReduce is one of the most widely-used implementations of Apache Hadoop, a model for processing large data sets using distributed computing and clusters of servers that was created by Facebook in conjunction with Yahoo. To begin with, MapReduce was the perfect system for Facebook to be able to handle the massive quantities of big data it handles, but as its mountain of data grew exponentially each year, it became clear that it was no permanent solution.
“As we got more data and servers over the years, we said, ‘Oh, crap, this isn’t going to fit in our data center. We’re running out of space,’” said Parikh.
“One of the big limitations in Hadoop today is, for the whole thing to work, the servers have to be next to each other. They can’t be geographically dispersed… The whole thing comes crashing to a halt.”
Project Prism has been designed to overcome this challenge. Essentially, the idea is that it will Facebook to take apart its monolithic storage warehouse and scatter it across different locations, whilst still maintaining a single view of all of its data.
“Prism basically institutes namespaces, allowing anyone to access the data regardless of where the data actually resides. … We can move the warehouses around, and we get a lot more flexibility and aren’t bound by the amount of power we can wire up to a single cluster in a data center,” Parikh explained.
Admittedly, Facebook are being kind of vague about how it all works; for now, their engineers are still trying to document the project, although they have promised to publish an engineering blog post about how it all works at a later date. We know one thing though – Facebook will likely make Project Prism open source soon enough.
““Given the other things we’ve done, we want to open source this stuff. These are the next scaling challenges other folks are going to face,” concluded Parikh.
And few can argue with that.