You cannot talk about Big Data for long without running into topics related to the Hadoop elephant. Hadoop is an open source software platform managed by the Apache Software Foundation. It is very helpful for storing and managing huge amounts of data cheaply and efficiently. In the rest of this article we will introduce Hadoop and explain exactly what it is, what it consists of and what to use it for.
Say hello to Hadoop
The Apache Hadoop software library is a framework that allows distributed processing of large data sets across computer clusters using simple programming models. It was designed to scale from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware for high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop and Big Data
Formally known as Apache Hadoop, the technology was developed as part of an open source project within the Apache Software Foundation (ASF). Commercial Hadoop distributions are currently offered by four major big data platform vendors: Amazon Web Services (AWS), Cloudera, Hortonworks and MapR Technologies. In addition, Google, Microsoft and other providers offer cloud-based services built on Hadoop and related technologies.
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to support processing in Nutch, an open source search engine and web crawler. After Google published technical papers describing the Google File System (GFS) and MapReduce programming in 2003 and 2004 respectively, the two revised their earlier plans and developed a Java-based MapReduce implementation and a file system modelled on Google's designs. At the beginning of 2006 these elements were split out of Nutch and became a separate Apache sub-project, which Cutting named Hadoop after his son's stuffed elephant. At the same time, Cutting was hired by Yahoo, a web-based Internet services company that became the first production user of Hadoop in 2006. (Cafarella, then a graduate student, later became a university professor.)
Over the next few years the use of the platform grew, along with three independent Hadoop vendors: Cloudera in 2008, MapR a year later and Hortonworks, a Yahoo spin-off, in 2011. In addition, AWS launched its Hadoop cloud service, Elastic MapReduce, in 2009. All of this happened before Apache released Hadoop 1.0.0, which became available in December 2011 after a series of 0.x releases.
Hadoop components (HDFS, MapReduce, YARN)
Apache Hadoop consists of the following modules:
- Hadoop Common – libraries and tools used by other modules;
- Hadoop Distributed File System (HDFS) – distributed file system;
HDFS is a Java-based distributed file system that provides cost-effective, reliable and scalable storage across the cluster. Its architecture is highly fault-tolerant and was designed to run on low-cost hardware. Unlike a relational database, a Hadoop cluster lets you store any file data and decide later how you want to use it, without having to reformat it first. Multiple copies of the data are automatically replicated across the cluster; the replication factor can be configured per file and changed at any time (see the HDFS sketch after this list).
- Hadoop YARN – the platform for managing cluster resources;
The Hadoop YARN framework provides job scheduling and cluster resource management, and applications can be submitted and removed through the Hadoop REST API. There are also web interfaces for monitoring the Hadoop cluster. The bundle of all JAR files and classes needed to run a MapReduce program is called a job. Jobs can be submitted from the command line or over HTTP via the REST API (the role of the old JobTracker is filled by the YARN ResourceManager), and each job consists of tasks that perform the individual map and reduce steps. There are also ways to write these tasks in languages other than Java. If a node of the Hadoop cluster goes down for any reason, its processing tasks are automatically rescheduled on other cluster nodes (the driver in the sketch after this list shows how a job is submitted).
- Hadoop MapReduce – implementation of the MapReduce paradigm to process large amounts of data;
Hadoop MapReduce handles distributed processing of large data sets across the cluster using the MapReduce programming model. With MapReduce, the set of input files is split into smaller parts that are processed independently of each other; the results of these independent processes are then grouped and aggregated until the job is complete. If a single file is so large that it would slow processing down, it is split into several smaller blocks. A classic word-count job illustrating both the programming model and job submission is sketched below.
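To make the HDFS behaviour described above more concrete, here is a minimal sketch that uses Hadoop's standard Java client API (org.apache.hadoop.fs.FileSystem) to copy a local file into the cluster and then change its replication factor. The file paths and the replication factor of 3 are illustrative assumptions, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath, so the
        // client talks to whichever cluster it has been configured for.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/access.log");        // hypothetical local file
        Path remote = new Path("/data/logs/access.log"); // hypothetical HDFS path

        // Copy the file into HDFS; its blocks are replicated automatically.
        fs.copyFromLocalFile(local, remote);

        // The replication factor can be changed per file at any time.
        fs.setReplication(remote, (short) 3);

        fs.close();
    }
}
```

The same operations are also available from the command line as `hdfs dfs -put` and `hdfs dfs -setrep`.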
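To illustrate the MapReduce model and how a job is handed to the cluster, below is the canonical word-count example written against the standard org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, the reducer sums them per word, and the driver bundles everything into a job and submits it (under YARN, the ResourceManager schedules its tasks across the cluster). Input and output paths are taken from the command line; the class name is arbitrary.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: bundle the mapper, reducer and I/O paths into a job and submit it.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, a job like this would typically be launched with something like `hadoop jar wordcount.jar WordCount /input /output`; if a node running one of its tasks fails, that task is simply rescheduled on another node.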
Of course, the Hadoop ecosystem includes many additional tools, among them Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Pig, Spark and ZooKeeper.
How can it help us?
Hadoop is great for MapReduce analysis performed on huge amounts of data. Its typical applications include data searching, data analysis, data reporting, large-scale file indexing (e.g. log files or data from web crawlers) and other processing tasks on what is colloquially known in the programming world as "Big Data".
Please note that the Hadoop infrastructure and Java-based MapReduce job programming require specialized technical knowledge for proper configuration and maintenance. If acquiring or hiring these skills is too expensive, you may want to consider other Big Data options.