Big Data & Common Technology Stack
Hadoop Project Structure Summary
Hadoop Project Components:
- HDFS: A distributed file system with high fault tolerance, designed to run on low-cost hardware and provide high-throughput data access for applications with large datasets (a minimal client sketch follows this list).
- MapReduce: A core framework for processing vast amounts of data, allowing users to develop distributed programs without knowledge of underlying details, utilizing cluster power for rapid computation and storage.
- HBase: A NoSQL database built on top of HDFS, offering reliable, high-performance random read/write access and supporting large-scale data storage.
- Hive: A data warehouse tool that maps structured data files to database tables and offers a simple SQL-like query language.
- ZooKeeper: A distributed application coordination service used by Hadoop to maintain cluster status and coordinate work.
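To make the HDFS entry above concrete, here is a minimal client sketch using the Java FileSystem API. The NameNode address and paths are placeholders, not values from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Write a small file; HDFS replicates its blocks for fault tolerance.
            try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            // List the directory to confirm the write.
            for (FileStatus status : fs.listStatus(new Path("/demo"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}
```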
YARN Framework: Provides unified resource management and scheduling for applications, addressing resource utilization and data sharing within the cluster. YARN separates resource management from job scheduling/monitoring, introducing a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
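As a hedged illustration of how a client program talks to the global ResourceManager, the sketch below uses the YarnClient API to list the applications the RM knows about; cluster addresses are assumed to come from the usual yarn-site.xml on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        // Picks up ResourceManager settings from yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        // Ask the ResourceManager for all known applications.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + "\t"
                    + app.getName() + "\t" + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```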
Tez: An Apache open-source framework supporting DAG jobs, derived from MapReduce, allowing flexible combination of operations to form large DAG jobs, significantly improving performance.
MapReduce: A programming model for parallel computing on large datasets, popular for its ease of programming, scalability, high fault tolerance, and suitability for offline processing of massive data.
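The canonical way to see the programming model is word count. The sketch below shows just the map and reduce functions (driver setup omitted), following the standard Hadoop MapReduce API.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```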
Hive: A data warehouse built on Hadoop, offering HiveQL for easy SQL-like queries, extensibility, high fault tolerance, and support for PB-level data processing. However, it lacks real-time query capabilities and may be slower than traditional relational databases.
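One common way to run HiveQL from an application is the HiveServer2 JDBC driver; a minimal sketch follows, where the host, database, and table are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host and database are placeholders.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // A SQL-like aggregation that Hive compiles into batch jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```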
Pig: A programming language for large-scale data analysis that runs on Hadoop clusters, offering SQL-like syntax for expressing data processing logic, though it still requires some programming knowledge to use.
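For a flavor of how Pig scripts are driven programmatically, the sketch below embeds two Pig Latin statements via the PigServer API; the file names are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFilterExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE against a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Pig Latin: load raw lines, keep the ones mentioning "error".
        pig.registerQuery("logs = LOAD 'app.log' AS (line:chararray);");
        pig.registerQuery("errors = FILTER logs BY line MATCHES '.*error.*';");
        // Materialize the result; this triggers execution of the pipeline.
        pig.store("errors", "error_lines");
        pig.shutdown();
    }
}
```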
Oozie: A workflow scheduling system for managing and scheduling Hadoop ecosystem jobs, allowing users to schedule a series of tasks like Map/Reduce and Hive as DAG workflows triggered by time or data availability.
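A hedged sketch of submitting a workflow through the Oozie Java client is shown below; the server URL, user, and HDFS application path are placeholders.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmit {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // HDFS path of the workflow definition (workflow.xml); placeholder.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflow.xml");
        conf.setProperty("user.name", "etl");
        // Submit and start the workflow; Oozie tracks it as a DAG of actions.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}
```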
ZooKeeper: An open-source distributed application coordination service widely used in large distributed systems like Hadoop, HBase, Kafka, and Dubbo, offering services such as configuration maintenance, domain name service, and distributed synchronization.
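To illustrate the configuration-maintenance use case, the sketch below creates and reads a znode with the plain ZooKeeper Java client; the connect string and znode path are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address; the lambda is a no-op watcher.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {});
        // Store a piece of shared configuration as a persistent znode.
        zk.create("/feature-flag", "on".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any process in the cluster can now read (and watch) the same value.
        byte[] data = zk.getData("/feature-flag", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```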
Sqoop: An open-source data migration tool for transferring data between Hadoop (Hive) and traditional relational databases, providing an easy-to-use interface for importing and exporting data.
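Sqoop is driven from the command line; a typical import of one relational table into HDFS looks roughly like this (the connection details, user, table, and target directory are placeholders):

```sh
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /warehouse/orders \
  --num-mappers 4
```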
Ambari: An open-source tool for managing and monitoring Hadoop clusters, offering a web-based UI and RESTful APIs for easy installation, configuration, management, and monitoring of various cluster components and services.
Flume: A distributed, reliable log collection system for aggregating and transporting large volumes of data to Hadoop or other storage systems, supporting real-time processing and analysis for business decision-making.
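Flume agents are wired together in a properties file. A minimal, assumed example that tails an application log into HDFS might look like this (the agent name, paths, and hosts are placeholders):

```properties
# One agent (a1) with a single source -> channel -> sink pipeline.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log file (placeholder path).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver events into HDFS (placeholder NameNode address).
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```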
HBase: A distributed database based on Hadoop, suitable for storing unstructured and semi-structured data, characterized by high performance, reliability, and scalability, providing real-time data processing and a key-value database interface.
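The random read/write path can be sketched with the standard HBase client API; the table name, column family, and ZooKeeper quorum below are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder ZooKeeper quorum used for cluster discovery.
        conf.set("hbase.zookeeper.quorum", "zk-host");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {
            // Random write: one cell addressed by (row, family, qualifier).
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);
            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```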
Distributed Locks: Cross-process, cross-node mutual exclusion locks used to ensure exclusive access to shared resources in a distributed environment, with features like exclusivity, reentrancy, and mechanisms to avoid deadlocks.
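In the Hadoop ecosystem such locks are commonly built on ZooKeeper. The sketch below uses Apache Curator's reentrant InterProcessMutex; the connect string and lock path are placeholders.

```java
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DistributedLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // Reentrant mutex backed by ephemeral znodes: if the holder crashes,
        // its session expires and the lock is released, avoiding deadlock.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/shared-resource");
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                System.out.println("Exclusive access to the shared resource");
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}
```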