How Big MNCs Like Google, Facebook, and Instagram Store, Manage, and Manipulate Thousands of Terabytes of Data with High Speed and High Efficiency.

Sourabhmiraje
15 min read · Sep 17, 2020

INTRODUCTION:

With the proliferation of online services and mobile technologies, the world has stepped into a multimedia big data era. In the past few years, the fast and widespread use of multimedia data, including image, audio, video, and text, as well as the ease of access and availability of multimedia sources, have resulted in a big data revolution in multimedia management systems.

Currently, multimedia sharing websites such as Yahoo, Flickr, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered inimitable and valuable sources of multimedia big data. For example, to date, Instagram users have uploaded over 20 billion photos, YouTube users upload over 100 hours of video every minute, and Twitter's 255 million active users send approximately 500 million tweets every day. Another statistic shows that Internet traffic from multimedia sharing reached 6,130 petabytes per month in 2016 (Statista.com, 2015). It is predicted that the volume of digital data will exceed 40 ZB by 2020, which means every person in the world will produce almost 5,200 gigabytes of data.

Unlike traditional data consisting only of text and numbers, multimedia data are usually unstructured and noisy. Handling this huge amount of complex data is not feasible with conventional data analysis; therefore, more comprehensive and sophisticated solutions are required to manage such large and unstructured multimedia data.

Multimedia analytics addresses the issue of manipulating, managing, mining, understanding, and visualizing different types of data in effective and efficient ways to solve real-world challenges. The solutions include but are not limited to text analysis, image/video processing, computer vision, audio/speech processing, and database management for a variety of applications such as healthcare, education, entertainment, and mobile devices.

The term big data is essentially used to describe extremely large datasets, although different scientists and technology enterprises define it in various ways. Bryant et al. (2008) coined the term "Big-Data Computing" in 2008. Finally, in 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope."

A Day of Data

How much data is generated in a day — and what could this look like as we enter an even more data-driven future?

Here are some key daily statistics highlighted in the infographic:

  • 500 million tweets are sent
  • 294 billion emails are sent
  • 4 petabytes of data are created on Facebook
  • 4 terabytes of data are created from each connected car
  • 65 billion messages are sent on WhatsApp (which reported 450 million daily active users in Q2 2018)
  • Google gets over 3.5 billion searches daily.
    Google remains the dominant player in the search engine market, with 87.35% of the global search engine market share as of January 2020. Big data stats for 2020 show that this translates into 1.2 trillion searches yearly and more than 40,000 search queries per second.

By 2025, it’s estimated that 463 exabytes of data will be created each day globally — that’s the equivalent of 212,765,957 DVDs per day!

Facebook Statistics:

No matter what your online marketing strategy is, Facebook has to be a big part of it. With almost 1.5 billion active users, Facebook is by far the largest social media platform in the world. But you already know that, don't you? You're probably more interested in some other, more exciting and essential Facebook statistics, which is why you're here.

Facebook Statistics (Editor’s Choice)

  • 74% of American Facebook users are on the site daily.
  • 83% of women and 75% of men use Facebook.
  • 45% of all US adults get news from Facebook.
  • 500 new users sign up for Facebook every single minute.
  • For 67% of marketers, Facebook is the most important social platform.
  • More than 270 million profiles on Facebook are fake.
  • Facebook got its one millionth registered user in December 2004.
  • Facebook ad spend increases between 1 p.m. and 3 p.m.
  • 93% of social media advertisers use Facebook Ads.

Facebook Usage Stats & Demographics

1. A total of 22% of the world uses Facebook.

This is almost a quarter of the entire world’s population, genuinely making Facebook the biggest social media platform ever. Upon seeing this figure, it’s no surprise that social media marketers make it a big part of their overall marketing strategy, and so should you. So if you’ve been wondering how many people are on Facebook, now you have the answer.

2. More than 270 million profiles on Facebook are fake.

Facebook has been “cleaning house” recently, but being such a large platform, it’s challenging to find and remove this massive number of fake accounts. At least they’ve admitted that there are this many (a number close to the population of Indonesia). It’s thus almost impossible to know for sure how many people use Facebook.

(Mashable)

3. 53% of Americans use Facebook a couple of times every day.

This is the highest such figure in the US. All other social media platforms are used less frequently, with Facebook-owned WhatsApp the closest at 44%.

(Statista)

4. 74% of American Facebook users are on the site daily.

Considering the previous Facebook user stats, it's quite impressive how consistently its users return to the platform.

(PewInternet)

5. Facebook is available in 142 different languages.

The list covers all of the world's major languages and most languages spoken by several million people. Admittedly, some of these are just different variations of the same language, which is why there are a few variants of English on the list.

(Quora)

6. 64% of adults aged between 50 and 64 use Facebook.

People usually assume that most Facebook users belong to the younger generations and that they are the most active users. However, these Facebook user statistics prove that older people are also very present on this social network.

(PewInternet)

7. 88% of US adults aged between 18 and 29 use at least one social media site.

In other words, 9 out of every 10 US adults in this age group interact on social media. Since Facebook is the king of social media, we all know where to find the majority of them.

(PewInternet)

8. 83% of women and 75% of men use Facebook.

Looking at some more detailed Facebook user demographics, it seems that women are slightly more interested in the platform. Men aren’t far behind, but still, the numbers point to some interesting facts that should be considered in any marketing strategy that targets Facebook.

(Social Sprout)

9. Facebook Lite has 200 million users worldwide.

Facebook Lite is Facebook’s smaller app, which runs similarly but uses less space or data, making it perfect for people who don’t want to spend too much of their monthly data package on Facebook. These 200 million users prove that this version of Facebook’s app is still quite useful and doesn’t compromise the quality of the overall experience.

(Engadget)

10. Facebook had 30,275 employees by the end of June 2018.

Keeping up with one-fifth of the world's population is no easy feat, which is why Facebook employs such a large number of people. These employees are dispersed across the globe, as Facebook demographics show that the platform is used in almost every country in the world. There has been some criticism that Facebook should employ more people for the various services it provides, but it remains to be seen whether it will.

So now you know how big a crowd is active on Facebook daily, and maintaining all the data of such a large user base is quite arduous, isn't it?

Small data challenges

Facebook's News Feed requires thousands of objects to render, and it is personalized for everyone on the service. The requirements for these object fetches (multiplied by 1.3 billion people) lead to many challenges in designing and distributing the underlying caches and databases. Facebook's social graph store, TAO, for example, provides access to tens of petabytes of data but answers most queries by checking a single page on a single machine.

TAO currently sustains billions of queries per second. Like traditional relational database updates, TAO updates must be durable and immediately visible. However, there are no joins or large scans in TAO. For most data types, we favor read and write availability over atomicity or consistency, choosing asynchronous replication between data centers and prohibiting multi-shard transactions. We have more than 500 reads per write, so we rely heavily on caching.

TAO and Memcache are the primary systems that handle caching our small data workload.
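To make this concrete, here is a minimal cache-aside (look-aside) sketch in Python. It is not Facebook's actual TAO or Memcache code: the in-memory dictionary standing in for the cache tier and the `fetch_from_database` helper are hypothetical placeholders, but the read-mostly pattern they illustrate is the same one that makes a 500:1 read/write ratio manageable.

```python
# Minimal cache-aside sketch of a read-heavy workload.
# NOT Facebook's TAO/Memcache code: the dict below stands in for a
# distributed cache tier, and fetch_from_database() is a hypothetical stub.

cache = {}  # stand-in for the cache tier (Memcache / TAO in the text above)

def fetch_from_database(key):
    """Hypothetical slow path: read the authoritative copy from storage."""
    return {"id": key, "payload": f"object-{key}"}

def read_object(key):
    # With roughly 500 reads per write, most requests should hit the cache.
    if key in cache:
        return cache[key]             # cache hit: no database round trip
    value = fetch_from_database(key)  # cache miss: go to the database
    cache[key] = value                # populate the cache for later reads
    return value

def write_object(key, value):
    # Persist the new value to the database (omitted in this sketch),
    # then invalidate the cached copy so the next read refills it.
    cache.pop(key, None)

print(read_object(42))  # first call misses and fills the cache
print(read_object(42))  # second call is served from the cache
```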

Nathan Bronson presented an overview of our current small data infrastructure and led a discussion of three challenges. First, we talked about how to modify Facebook's data infrastructure to optimize for mobile devices, which have intermittent connectivity and higher network latencies than most web clients. Next, we discussed how to reduce the database footprint in data centers without increasing user-visible latency or reducing system availability. Finally, we explored the efficiency/availability tradeoffs of RAM caches.

So let's look at the big data challenges.

Big data challenges:

Big data stores are the workhorses for data analysis at Facebook. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores used most heavily are:

1. ODS (Operational Data Store) stores 2 billion time series of counters. It is used most commonly in alerts and dashboards and for trouble-shooting system metrics with 1–5 minutes of time lag. There are about 40,000 queries per second.

2. Scuba is Facebook’s fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.

3. Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.

ODS, Scuba, and Hive share an important characteristic: none is a traditional relational database. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage or retrieval. Instead, challenges arise from high data insertion rates and massive data quantities.
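To illustrate the kind of workload these stores handle, here is a hedged sketch of a warehouse-style aggregation of the sort that engines such as Presto or HiveQL run over Hive. PySpark is used purely for illustration, and the `page_views` table, its columns, and the file path are hypothetical, not real Facebook schemas.

```python
# A sketch of a warehouse-style analytical query: full scans and aggregation,
# no ACID serving guarantees. PySpark is used for illustration only; the
# page_views table, its columns, and the path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-aggregation").getOrCreate()

# Register a hypothetical partitioned table stored as Parquet files.
spark.read.parquet("/warehouse/page_views").createOrReplaceTempView("page_views")

daily_views = spark.sql("""
    SELECT ds AS day, country, COUNT(*) AS views
    FROM page_views
    WHERE ds BETWEEN '2020-09-01' AND '2020-09-07'
    GROUP BY ds, country
    ORDER BY day, views DESC
""")

daily_views.show()  # feeds dashboards and reports, not user-facing pages
```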

Janet Wiener led the discussion of current big data challenges at Facebook, including anomaly detection, sampling accuracy, resource scheduling, and quantities of data too big for a single data center.

Now let's jump to the overall BIG DATA problem:

Volume

You're not really in the big data world unless the volume of data is petabytes, exabytes, or more. Big data technology giants like Amazon, Shopify, and other e-commerce platforms receive real-time, structured, and unstructured data, ranging from terabytes to zettabytes, every second from millions of customers, especially smartphone users, across the globe. They process this data in near real time and, after running machine learning algorithms on it, make decisions to provide the best customer experience.

When does Volume become a problem:

A quick web search reveals that a decent 10 TB hard drive costs at least $300. To store a petabyte of data, that's 100 drives x $300 = $30,000. Maybe you'll get a discount, but even at 50% off, you're well over $10,000 in storage costs alone. And if you want to keep a redundant copy of the data for disaster recovery, you'd need even more disk space. The volume of data therefore becomes a problem when it grows beyond normal limits, making local storage devices an inefficient and costly place to keep it.
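The back-of-the-envelope arithmetic above is easy to reproduce. The short sketch below simply restates it, with the drive size, drive price, and replication factor as adjustable assumptions rather than real vendor figures.

```python
# Back-of-the-envelope storage cost, restating the arithmetic above.
# Drive size, drive price, and replication factor are assumptions to adjust.
DRIVE_TB = 10          # capacity of one hard drive, in terabytes
DRIVE_PRICE_USD = 300  # rough price of one such drive
DATA_PB = 1            # amount of data to store, in petabytes
REPLICAS = 2           # one primary copy plus one redundant copy for recovery

drives_needed = (DATA_PB * 1000 // DRIVE_TB) * REPLICAS
cost_usd = drives_needed * DRIVE_PRICE_USD

print(f"{drives_needed} drives, roughly ${cost_usd:,} in raw disk alone")
# -> 200 drives, roughly $60,000 in raw disk alone ($30,000 per copy)
```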

Solution:

Amazon Redshift, a managed cloud data warehouse service from AWS, is one of the popular options for storage. It stores data distributed across multiple nodes, which makes it resilient to disasters and faster for computation than on-premise relational databases like PostgreSQL and MySQL. It is also easy to replicate data from relational databases into Redshift without any downtime.

To know more about Redshift, take a look at Redshift vs relational databases, Redshift vs Hadoop, and Redshift vs traditional data warehouses.
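As a rough illustration of how data lands in Redshift in bulk, here is a hedged sketch that issues a Redshift COPY command from Python via psycopg2. The cluster endpoint, credentials, S3 bucket path, IAM role, and `events` table are all hypothetical placeholders, not a recommended production setup.

```python
# Hedged sketch: bulk-loading Parquet files from S3 into a Redshift table.
# The endpoint, credentials, bucket, IAM role, and events table are all
# hypothetical placeholders; adapt them to your own environment.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="analytics",
    user="load_user",
    password="...",  # use a secrets manager in practice
)

copy_sql = """
    COPY events
    FROM 's3://my-hypothetical-bucket/events/2020/09/17/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift ingests the files in parallel across nodes
```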

Velocity

Imagine a machine learning service that is constantly learning from a stream of data, or a social media platform with billions of users posting and uploading photos 24x7x365. Every second, millions of transactions occur, which means petabytes and zettabytes of data are being transferred from millions of devices to data centers every second. This rate of high-volume data inflow per second defines the velocity of data.

When does Velocity become a problem:

High-velocity data sounds great because velocity x time = volume, volume leads to insights, and insights lead to money. However, this path to growing revenue is not without its costs.

Many questions arise: how do you inspect every packet of data that comes through your firewall for malicious content? How do you process such high-frequency structured and unstructured data on the fly? Moreover, a high velocity of data almost always means large swings in the amount of data processed every second; tweets on Twitter are far more frequent during the Super Bowl than on an average Tuesday. How do you handle that?

Solution:

Fortunately, "streaming data" solutions have cropped up to the rescue. The Apache Software Foundation offers popular solutions like Spark and Kafka: Spark handles both batch and stream processing, while Kafka runs on a publish/subscribe mechanism. Amazon Kinesis is another option, with a set of related APIs designed to process streaming data, and Google Cloud Functions (Google Firebase also has a version of this) is a popular serverless function API. All of these are great black-box solutions for managing complex processing of payloads on the fly, but they all require time and effort to build data pipelines.
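As a small, hedged illustration of the publish/subscribe pattern, here is a sketch using the kafka-python client. The broker address and the "clickstream" topic are placeholders; a real pipeline would feed these events into Spark, Kinesis, or a similar processor downstream.

```python
# A minimal publish/subscribe sketch using the kafka-python client.
# The broker address and the "clickstream" topic are hypothetical placeholders;
# this only illustrates the pattern, not any particular production pipeline.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Publish: every click/tweet/transaction becomes a small event on a topic.
producer.send("clickstream", {"user_id": 123, "action": "like", "ts": 1600300800})
producer.flush()

# Subscribe: downstream consumers (Spark jobs, alerting, etc.) read the stream.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # process each event as it arrives
    break                 # stop after one message in this toy example
```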

Now, if you don’t want to deal with the time and expense of creating your own data pipeline, that’s where something like FlyData could come in handy. FlyData seamlessly and securely replicates your Postgres, MySQL, or RDS data into Redshift in near real-time.

Big Data Analytics:

Big data analytics largely involves collecting data from different sources, munging it in a way that makes it available for consumption by analysts, and finally delivering data products useful to the organization's business.

The process of converting large amounts of unstructured raw data, retrieved from different sources to a data product useful for organizations forms the core of Big Data Analytics.

Importance of Big Data Analytics

Big data analytics is indeed a revolution in the field of information technology. The use of data analytics by companies grows every year. The primary focus of these companies is on customers, so the field is flourishing in business-to-consumer (B2C) applications. We divide the analytics into different types according to the nature of the environment: the three divisions of big data analytics are prescriptive analytics, predictive analytics, and descriptive analytics. This field offers immense potential, and in this blog we will discuss four perspectives that explain why big data analytics is so important today:

  • Data Science Perspective
  • Business Perspective
  • Real-time Usability Perspective
  • Job Market Perspective

Real-time Benefits of Big Data Analytics

There has been enormous growth in the field of big data analytics, thanks to the benefits of the technology. This has led to the use of big data in multiple industries, including:

  • Banking
  • Healthcare
  • Energy
  • Technology
  • Consumer
  • Manufacturing

How big data analytics works

In some cases, Hadoop clusters and NoSQL systems are used primarily as landing pads and staging areas for data before it gets loaded into a data warehouse or analytical database for analysis, usually in a summarized form that is more conducive to relational structures.

More frequently, however, big data analytics users are adopting the concept of a Hadoop data lake that serves as the primary repository for incoming streams of raw data. In such architectures, data can be analyzed directly in a Hadoop cluster or run through a processing engine like Spark. As in data warehousing, sound data management is a crucial first step in the big data analytics process. Data stored in HDFS must be organized, configured, and partitioned properly to get good performance out of both extract, transform, and load (ETL) integration jobs and analytical queries.

Once the data is ready, it can be analyzed with the software commonly used for advanced analytics processes. That includes tools for:

  • data mining, which sifts through data sets in search of patterns and relationships;
  • predictive analytics, which builds models to forecast customer behavior and other future developments;
  • machine learning, which taps algorithms to analyze large data sets; and
  • deep learning, a more advanced offshoot of machine learning.

Text mining and statistical analysis software can also play a role in the big data analytics process, as can mainstream business intelligence software and data visualization tools. For both ETL and analytics applications, queries can be written in MapReduce or in programming languages such as R, Python, Scala, and SQL, the standard language of relational databases, which is supported over Hadoop via SQL-on-Hadoop technologies.
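As a tiny illustration of the MapReduce style mentioned above, here is a word count written as explicit map, shuffle, and reduce phases in plain Python. It is a single-process conceptual sketch only; real jobs run the same phases on Hadoop or Spark across many machines.

```python
# A conceptual MapReduce-style word count in plain Python.
# Real jobs run the same map/shuffle/reduce phases on Hadoop or Spark
# across many machines; this single-process sketch only shows the shape.
from collections import defaultdict

documents = [
    "big data needs big storage",
    "big data needs fast processing",
]

# Map phase: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key (word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)
# {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'fast': 1, 'processing': 1}
```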

Solution: “Distributed Storage Solutions”

Storing data has evolved over the years to accommodate the rising needs of companies and individuals. We are now reaching a tipping point at which the traditional approach to storage, the use of a stand-alone, specialized storage box, no longer works, for both technical and economic reasons. We need not just faster drives and networks; we need a new approach, a new concept of data storage. At present, the best approach to satisfying current demands for storing data seems to be distributed storage.

This concept has appeared in different forms and shapes over the years. And while there is no commonly accepted definition of what a distributed storage system is, we can summarize it as:

“Storing data on a multitude of standard servers, which behave as one storage system although data is distributed between these servers.”

Let's understand it diagrammatically.

We have one main master node at the center and eight slave nodes that share their resources with the master node.

For now, consider "master node" and "slave node" simply as the names of our computers.

Suppose the master node has no storage of its own but wants to store some images. How will it store those images without any storage? It also doesn't have much money to buy new storage, but it does have eight very good friends who can each share some storage at very little cost. So it asks its friend nodes to share their storage, and each of them agrees to share 10 GB.

Initially the master node had no storage, but now it has 8 x 10 GB, i.e., 80 GB of storage, with very little effort.

This is how distributed storage works: a huge amount of data is distributed among the slave nodes and retrieved in the same way, while the user feels as though the whole dataset is stored on a single device.
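Here is a toy Python sketch of the same idea: a hypothetical master splits a blob of data into equal chunks, places one chunk on each of eight simulated slave nodes, and later reassembles it by reading all of them in parallel. Real systems such as HDFS add replication, a metadata service, and fault tolerance on top of this.

```python
# Toy sketch of the master/slave storage idea described above.
# Eight in-memory dicts stand in for slave nodes; real systems (e.g. HDFS)
# add replication, a metadata service, and fault tolerance on top of this.
from concurrent.futures import ThreadPoolExecutor

NUM_SLAVES = 8
slaves = [{} for _ in range(NUM_SLAVES)]  # each dict simulates one node's disk

def store(filename, data: bytes):
    """Master splits the data into equal chunks, one per slave node."""
    chunk_size = -(-len(data) // NUM_SLAVES)  # ceiling division
    for i in range(NUM_SLAVES):
        slaves[i][filename] = data[i * chunk_size:(i + 1) * chunk_size]

def retrieve(filename) -> bytes:
    """Master asks all slaves for their chunk in parallel and reassembles."""
    with ThreadPoolExecutor(max_workers=NUM_SLAVES) as pool:
        chunks = pool.map(lambda node: node[filename], slaves)
    return b"".join(chunks)

store("photo.jpg", b"pretend-this-is-80GB-of-image-bytes")
assert retrieve("photo.jpg") == b"pretend-this-is-80GB-of-image-bytes"
```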

By doing this we solve our issues of volume and velocity as well.

Let me explain.

Volume: since our master node got some storage from each slave node, it had to pay very little. It can also add as much storage as it wants using the same strategy. Thus we solved the cost and volume problem.

Now, velocity:

If the master took 8 minutes to retrieve 80 GB of data alone, that would be time-consuming, because it has to wait the full 8 minutes. But now it can ask all 8 slaves to retrieve the data in parallel, and each one has to work for only one minute, because each holds just 10 GB and retrieving 10 GB takes about a minute.

So now the master can retrieve all of the data within 1 minute, saving 7 minutes. The issue of velocity is solved as well.

Technologies:

In the real world, we have to use certain technologies to build such a distributed system. Here are some…

  1. Hadoop.
  2. Cassandra.
  3. MongoDB.
  4. Apache Hive.

and many more…

I am writing many more articles on new technologies, so follow me on Medium. Here is my LinkedIn profile…
