Title: Apache Hudi: The Definitive Guide: Building Robust, Open, and High-Performing Data Lakehouses (Early Release)
Authors: Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Publisher: O’Reilly Media, Inc.
Year: 2024-09-24
Language: English
Format: pdf, epub, mobi
Size: 10.1 MB
Overcome challenges in building transactional guarantees on rapidly changing data by using Apache Hudi. With this practical guide, data engineers, data architects, and software architects will discover how to seamlessly build an interoperable lakehouse from disparate data sources and deliver faster insights using their query engine of choice.
Authors Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, and Rebecca Bilbro provide practical examples and insights to help you unlock the full potential of data lakehouses for different levels of analytics, from batch to interactive to streaming. You'll also learn how to evaluate storage choices and leverage built-in automated table optimizations to build, maintain, and operate production data applications.
When writing to a traditional relational database (Oracle, PostgreSQL, MySQL, SQLite, etc.), we prepare ourselves for more up-front data engineering work. In return, we expect straightforward queries and transactional guarantees. We know document databases (Mongo, Solr, CouchDB, etc.) and key-value stores (e.g., Cassandra, HBase, Redis, RocksDB) will make writes a breeze and scale horizontally, but will eventually lead to headaches when we need tighter transactional guarantees and stronger data consistency across multiple tables.
So we make architectural decisions in support of one business interest at the expense of others. For instance, when the underlying objects are subject to transactions, data scientists may struggle to recover the specific state of the database that produced a given model. Yet not many of us would be willing to build, say, an e-commerce tool or a financial application without a database that provides transactional guarantees, even if we know AI/ML is on the roadmap.
The data platform layer can then become a limiting factor for innovation, straining to provide data fresh enough for analytics and slowing down machine learning and AI use cases. Herein lies a key advantage of using Hudi to empower analytics for this next generation of data-intensive applications. Hudi is designed to provide native support for near real-time analytics as well as time travel, and this is most evident in the different ways in which data can be read from Hudi. In the previous chapter, we saw how to create and write to Hudi tables, focusing on key concepts related to data modeling and configuration. In this chapter, you will see how different table layouts (the results of the configuration options described in the previous chapter) offer different query capabilities that support a variety of analytics and AI/ML use cases, with examples of each. In the following section, you'll learn how query engines integrate with Hudi tables so that you can feel confident about architecting a performant lakehouse tuned to the unique consistency requirements of your downstream applications.
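To make the read-side ideas above concrete, here is a minimal PySpark sketch (not from the book) of the two read patterns mentioned: a default snapshot query against the latest table state, and a time-travel query as of a past commit instant. The table path and instant value are illustrative placeholders; the option key "as.of.instant" is a standard Hudi Spark DataSource option, and the sketch assumes a Spark session launched with the Hudi bundle on the classpath.

    from pyspark.sql import SparkSession

    # Assumes Spark was started with the Hudi bundle, e.g.
    # --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:<version>
    spark = SparkSession.builder.appName("hudi-reads").getOrCreate()

    base_path = "s3://my-bucket/hudi/trips"  # placeholder table path

    # Snapshot query (the default): reads the latest committed state of the table.
    latest = spark.read.format("hudi").load(base_path)

    # Time-travel query: reads the table as it existed at a past commit instant.
    as_of = (
        spark.read.format("hudi")
        .option("as.of.instant", "2024-06-01 00:00:00")  # placeholder instant
        .load(base_path)
    )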
This book helps you:
- Understand the need for transactional data lakehouses and the challenges associated with building them
- Get up to speed with Apache Hudi and learn how it makes building data lakehouses easy
- Explore data ecosystem support provided by Apache Hudi for popular data sources and query engines
- Perform different write and read operations on Apache Hudi tables and effectively use them for various use cases, including batch and stream applications
- Implement data engineering techniques to operate and manage Apache Hudi tables
- Apply different storage techniques and considerations, such as indexing and clustering, to maximize your lakehouse performance
- Build end-to-end incremental data pipelines using Apache Hudi for faster ingestion and fresher analytics (a minimal sketch follows this list)
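As a taste of the write and incremental-read operations listed above, the following PySpark sketch (not from the book) upserts a small batch into a Hudi table and then pulls back only the records committed after a given instant, the building block of an incremental pipeline. The table name, path, field names, and begin instant are placeholder assumptions; the option keys are standard Hudi Spark DataSource configs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()
    base_path = "s3://my-bucket/hudi/trips"  # placeholder table path

    # Upsert a batch of records; "uuid" (record key) and "ts" (precombine
    # field) are placeholder column names for this sketch.
    df = spark.createDataFrame(
        [("id-1", "2024-06-01 00:00:00", 9.5)], ["uuid", "ts", "fare"]
    )
    (
        df.write.format("hudi")
        .option("hoodie.table.name", "trips")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save(base_path)
    )

    # Incremental query: read only records committed after the given instant.
    incremental = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240601000000")  # placeholder
        .load(base_path)
    )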
Download Apache Hudi: The Definitive Guide (Early Release)