HADOOP IS THE SOLUTION
Have you ever thought about how much data companies receive per day? I am mainly referring to companies like Facebook, Google, Twitter, banks, etc., which receive tonnes of data every day. Let me show you some statistics:
1. Google gets over 3.5 billion searches daily
2. WhatsApp users exchange up to 65 billion messages daily.
3. Internet users generate about 2.5 quintillion bytes of data each day.
4. As of 2019, there are about 2.3 billion active Facebook users, and they generate a lot of data.
5. Twitter users send over half a million tweets every minute.
(search Google for more amazing statistics)
This is known as the big data problem.
Now comes the main question.
Have you ever tried to open a file of just 10 GB? How much time does it take to open? If you haven't tried, do it today. Don't have a file that big? Then make your own. It's easy to do in Linux!
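If you want to try this yourself, here is a minimal sketch that creates a large test file almost instantly by seeking past the end (a sparse file). The filename and size are just examples; on Linux you could get the same effect with `truncate -s 10G bigfile`.

```python
import os

def make_big_file(path, size_bytes):
    """Create a sparse file of the given size by seeking past the end.

    The file reports the full size, but unwritten blocks take no real
    disk space until data is actually written to them.
    """
    with open(path, "wb") as f:
        f.seek(size_bytes - 1)  # jump to where the last byte should be
        f.write(b"\0")          # writing one byte fixes the file size
    return os.path.getsize(path)

# Demo with 1 GiB; change to 10 * 1024**3 for a 10 GiB file
size = make_big_file("bigfile.bin", 1024**3)
print(size)  # 1073741824
```

Opening and reading such a file end to end is what actually takes time, which is exactly the problem described above.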
My PC took about a minute! If I had a petabyte-sized file, how long would it take to open? How long should Facebook take to show you a post stored in a file of that size? Theoretically, it should take years, and the same should happen when you search for something on Google.
But the reality is that if it took more than 5 seconds, you would never use Google, Facebook, etc.!
So, can you imagine how Facebook or Google open petabytes of files in seconds?
Here comes the concept of a distributed storage cluster, or distributed file system, and to implement it we use a tool known as Hadoop.
In very simple terms → Hadoop breaks a file into pieces and sends them to different servers (or laptops, or hard disks), and since every system stores and reads at its own speed, we get roughly 10 times the speed if we use 10 servers, and so on.
Example: if we have a 10 GB file and our server has a speed of 1 GB/minute, it will take 10 minutes to store or read it. But if we send the same file in parallel to 10 servers, it takes only 1 minute! Hadoop achieves this by connecting lots of servers over a network (known as a cluster). This type of architecture or topology is famously known as the master-slave architecture.
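The splitting idea above can be sketched in a few lines of Python. This is a toy illustration of breaking a file into fixed-size blocks, not Hadoop's actual implementation (HDFS splits files into blocks of 128 MB by default):

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte string into fixed-size blocks, HDFS-style.

    Each block could then be sent to a different server and read
    back in parallel, giving a near-linear speedup.
    """
    return [data[i:i + block_size] for i in range((0), len(data), block_size)]

file_data = b"x" * 100                      # pretend this is a 10 GB file
blocks = split_into_blocks(file_data, 10)   # pretend the block size is 1 GB
print(len(blocks))                          # 10 blocks -> 10 servers
assert b"".join(file_data for file_data in [b"".join(blocks)]) == file_data
```

Ten blocks on ten servers is where the "10 minutes becomes 1 minute" arithmetic comes from.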
For transferring data between nodes, the HDFS (Hadoop Distributed File System) protocol is used.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed to run on low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
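That "redundant fashion" is block replication: each block is stored on several machines (HDFS defaults to a replication factor of 3), so losing one machine loses no data. Here is a toy sketch of the placement idea, with made-up block and node names; real HDFS uses a rack-aware placement policy, not simple round-robin:

```python
def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes (round-robin).

    If any single node fails, every block still has copies elsewhere.
    This is a toy policy; real HDFS placement is rack-aware.
    """
    placement = {}
    n = len(datanodes)
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % n] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_replicas(["blk_0", "blk_1"], nodes)
print(plan["blk_0"])  # ['node1', 'node2', 'node3']
print(plan["blk_1"])  # ['node2', 'node3', 'node4']
```

With three copies of every block, even if "node1" dies, "blk_0" can still be read from "node2" or "node3".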
Features of HDFS
- It is suitable for distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
- It provides streaming access to file system data.
- HDFS provides file permissions and authentication.
Given below is the architecture of a Hadoop File System.
So, I think you now have an idea of how you can store a 10 GB file on two 8 GB pen drives.
Similarly, just as we have distributed storage, we also have distributed computing, where you can share your RAM/CPU as well! You can turn your 8 GB PC into 80 GB of RAM, or whatever you want, depending on how many people/PCs/laptops/servers you have sharing resources. This means you can build your own mini supercomputer at your college or school.
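You can get a small taste of distributed computing on a single machine by spreading work across CPU cores; across many machines, frameworks like Hadoop MapReduce do the same thing over the network. A minimal single-machine sketch using only Python's standard library (the split/sum workload is a made-up example):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    """The 'map' step: each worker sums its own piece of the data."""
    return sum(chunk)

def distributed_sum(numbers, workers=4):
    """Split the data, process the pieces in parallel, combine the results."""
    size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))  # the 'reduce' step

if __name__ == "__main__":
    total = distributed_sum(list(range(1_000_000)))
    print(total)  # 499999500000
```

The pattern is the same as HDFS storage: split the work, do the pieces in parallel, and combine the results at the end.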
Thanks to Vimal Daga sir, who helped me learn and achieve all this on only the 2nd or 3rd day of ARTH training, and who makes technologies easier and more interesting to learn.