
Hadoop Intro

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities for the distributed storage and processing of big data, based on the MapReduce programming model. Hadoop is seen as a solution for managing, maintaining, and distributing very large datasets, on the order of 100 TB and beyond, across a cluster of commodity machines. Datasets are stored, replicated, and distributed in the Hadoop Distributed File System (HDFS), while YARN (Yet Another Resource Negotiator) manages the cluster's resources and schedules the jobs that process that data. By default, each file is split into 128 MB blocks, and each block is replicated three times across the cluster. We will delve into more of the Hadoop ecosystem in the coming sessions.
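To make the block and replication defaults concrete, here is a minimal sketch using Hadoop's Java FileSystem API. It assumes a Hadoop client on the classpath and a reachable HDFS; the class name and the file path passed on the command line are my own, hypothetical choices.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsInspect {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath;
        // dfs.blocksize defaults to 128 MB and dfs.replication to 3.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0] is any file already stored in HDFS (hypothetical path).
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());
    }
}
```

Run against a file you have uploaded to HDFS; on an untuned cluster you should see 134217728 (128 MB) and 3.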

It's often the case that additional software packages such as Apache Spark and Apache Hive are installed alongside Hadoop to provide extra functionality, such as fast in-memory processing, extract-transform-load (ETL) workflows, and higher-level code abstraction (improved readability). For instance, Hive provides a higher level of abstraction through an interface of SQL-like queries, commonly known as HiveQL (HQL), so that programmers with no prior knowledge of MapReduce, whose jobs are typically written in Java, can use familiar SQL-style queries to extract and transform data stored in HDFS. The Hadoop ecosystem has grown large over the years to accommodate new software packages.
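To see what Hive abstracts away, consider the canonical word-count job written directly against the Hadoop MapReduce Java API. This sketch follows the structure of the standard Hadoop tutorial example (the class names are conventional, not specific to any project); in HQL the same logic would be roughly one line, along the lines of SELECT word, COUNT(*) FROM words GROUP BY word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Some sixty lines of Java versus one SQL-like query is exactly the gap Hive was built to close, which is why it became a fixture of the ecosystem.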


