
Big Data & Common Technology Architecture

129 2024-09-07

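We take original articles seriously. To respect intellectual property and avoid potential copyright issues, only a summary of the article is provided here. For the full text, visit the author's official account (公众号) page.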

Original article: Big Data & Common Technology Architecture
Source: 小南瓜开发平台

Hadoop Project Structure Summary

Hadoop Project Components:

  • HDFS: A highly fault-tolerant distributed file system designed to run on low-cost commodity hardware. It provides high-throughput data access for applications with large datasets.
  • MapReduce: A core framework for processing very large datasets. Users write distributed programs without handling the low-level details of distribution, and the framework uses the whole cluster for parallel computation.
  • HBase: A NoSQL database built on top of HDFS that offers reliable, high-performance random read/write access and supports large-scale data storage.
  • Hive: A data warehouse tool that maps structured data files to database tables and provides SQL-like querying.
  • ZooKeeper: A distributed coordination service that Hadoop uses to maintain cluster state and coordinate work.

YARN Framework: Provides unified resource management and scheduling for applications on the cluster, improving resource utilization and data sharing. YARN separates resource management from job scheduling and monitoring: a global ResourceManager (RM) arbitrates cluster resources, while a per-application ApplicationMaster (AM) negotiates resources and tracks that application's tasks.
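The split between a global resource manager and per-application masters can be illustrated with a toy simulation. This is only a sketch: the classes below are hypothetical stand-ins that share names with the YARN roles, they track memory as the only resource, and they are not the YARN API.

```python
class ResourceManager:
    """Toy global scheduler: owns cluster capacity, knows nothing about job logic."""

    def __init__(self, total_memory_mb):
        self.free_mb = total_memory_mb

    def allocate(self, memory_mb):
        """Grant a container of the requested size if capacity allows."""
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        return True

    def release(self, memory_mb):
        """Return a finished container's memory to the cluster pool."""
        self.free_mb += memory_mb


class ApplicationMaster:
    """Toy per-application master: requests containers and tracks its own tasks."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.running = []  # memory sizes of containers currently held
        self.pending = []  # task sizes still waiting for a container

    def submit(self, task_memory_mb):
        self.pending.append(task_memory_mb)
        self.schedule()

    def schedule(self):
        """Ask the ResourceManager for a container for every pending task."""
        still_pending = []
        for size in self.pending:
            if self.rm.allocate(size):
                self.running.append(size)
            else:
                still_pending.append(size)
        self.pending = still_pending

    def finish_one(self):
        """Complete the oldest running task and hand its container back."""
        size = self.running.pop(0)
        self.rm.release(size)
```

Two applications sharing one `ResourceManager` compete only for capacity; each `ApplicationMaster` decides on its own which of its tasks to run next.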

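Tez: An open-source Apache framework for executing DAG jobs. It runs on YARN and generalizes the MapReduce model: map and reduce operations are split into finer-grained steps that can be flexibly combined into one large DAG job, which can significantly improve performance over a chain of separate MapReduce jobs.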

MapReduce: A programming model for parallel computing on large datasets, popular for its ease of programming, scalability, high fault tolerance, and suitability for offline processing of massive data.
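The programming model can be sketched in plain Python with the classic word-count example. This runs in a single process and only mimics the map, shuffle, and reduce phases; the function names are illustrative and are not part of the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster the map and reduce calls run in parallel on different nodes; the user still writes only the two functions, which is the ease of programming the model is known for.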

Hive: A data warehouse built on Hadoop. It offers HiveQL for SQL-like queries, is extensible and fault tolerant, and supports petabyte-scale data processing. However, its query latency is high, so it is not suited to real-time queries, and it may be slower than traditional relational databases.

Pig: A platform for large-scale data analysis that runs on Hadoop clusters. Its language, Pig Latin, is a high-level data-flow language with SQL-like operations (such as filter, group, and join) for expressing data processing logic. Using it still requires some programming skill.

Oozie: A workflow scheduling system for Hadoop ecosystem jobs. Users chain a series of tasks, such as MapReduce and Hive jobs, into a DAG workflow that is triggered by time or by data availability.
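The idea of a DAG workflow can be sketched with Python's standard `graphlib` module (Python 3.9+). The action names below are hypothetical, and the sketch only computes a valid execution order; it is not how Oozie workflows are actually defined or run.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "mapreduce_clean": {"sqoop_import"},
    "hive_daily_stats": {"mapreduce_clean"},
    "hive_user_stats": {"mapreduce_clean"},
    "export_report": {"hive_daily_stats", "hive_user_stats"},
}


def run_order(dag):
    """Return one execution order in which every action follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in run_order(workflow):
        print("run", step)
```

The two Hive actions do not depend on each other, so a scheduler is free to run them in parallel once `mapreduce_clean` has finished.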

ZooKeeper: An open-source distributed coordination service widely used in large distributed systems such as Hadoop, HBase, Kafka, and Dubbo. It offers services such as configuration maintenance, naming, and distributed synchronization.

Sqoop: An open-source data migration tool for transferring data between Hadoop (HDFS, Hive, HBase) and traditional relational databases, with an easy-to-use interface for importing and exporting data.

Ambari: An open-source tool for managing and monitoring Hadoop clusters, offering a web-based UI and RESTful APIs for easy installation, configuration, management, and monitoring of various cluster components and services.

Flume: A distributed, reliable log collection system that aggregates large volumes of log data and transports it to Hadoop or other storage systems, where it can feed near-real-time processing and analysis.

HBase: A distributed, column-oriented database built on Hadoop (HDFS), suitable for storing unstructured and semi-structured data. It offers high performance, reliability, and scalability, with real-time random reads and writes through a key-value style interface.
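The key-value interface can be pictured as a sorted map from row key to columns to timestamped values. The class below is a minimal in-memory sketch of that data model under stated assumptions: `TinyTable` and its methods are hypothetical names, not the HBase client API, and nothing here is distributed or persistent.

```python
from bisect import bisect_left, insort


class TinyTable:
    """In-memory stand-in for one HBase-style table (illustration only)."""

    def __init__(self):
        self._rows = {}  # row key -> {"family:qualifier": [(timestamp, value), ...]}
        self._keys = []  # row keys kept sorted, since rows are ordered by key

    def put(self, row, column, value, ts):
        """Store one versioned cell, keeping the newest version first."""
        if row not in self._rows:
            self._rows[row] = {}
            insort(self._keys, row)
        cells = self._rows[row].setdefault(column, [])
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column):
        """Return the latest value of a cell, or None if it does not exist."""
        cells = self._rows.get(row, {}).get(column)
        return cells[0][1] if cells else None

    def scan(self, start, stop):
        """Yield (row, latest values) for start <= row < stop, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            latest = {col: cells[0][1] for col, cells in self._rows[key].items()}
            yield key, latest
```

Because rows are kept in key order, a range scan over a key prefix is cheap, which is why row key design matters so much in this kind of store.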

Distributed Locks: Cross-process, cross-node mutual exclusion locks used to ensure exclusive access to shared resources in a distributed environment, with features like exclusivity, reentrancy, and mechanisms to avoid deadlocks.
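The three properties named above (exclusivity, reentrancy, deadlock avoidance) can be sketched with a lease-based lock. This is a single-process simulation under stated assumptions: `LockService` is a hypothetical stand-in for a shared store such as a coordination service, and a real implementation would need atomic operations across nodes.

```python
import time


class LockService:
    """Single-process stand-in for a shared lock store (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # lock name -> [owner, hold_count, expires_at]

    def _live_entry(self, name):
        """Return the lock entry if it exists and its lease has not expired."""
        entry = self._locks.get(name)
        if entry is not None and entry[2] <= self._clock():
            del self._locks[name]  # lease expired: holder presumed dead
            return None
        return entry

    def acquire(self, name, owner, ttl):
        entry = self._live_entry(name)
        if entry is None:
            self._locks[name] = [owner, 1, self._clock() + ttl]
            return True
        if entry[0] == owner:  # reentrancy: the holder may acquire again
            entry[1] += 1
            entry[2] = self._clock() + ttl
            return True
        return False  # exclusivity: another owner holds the lock

    def release(self, name, owner):
        entry = self._live_entry(name)
        if entry is None or entry[0] != owner:
            return False  # only the current holder may release
        entry[1] -= 1
        if entry[1] == 0:
            del self._locks[name]
        return True
```

The lease (`ttl`) is what prevents a deadlock when a holder crashes: the lock frees itself after expiry instead of waiting for a release that never comes.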
