Title: Small Summaries for Big Data
Authors: Graham Cormode, Ke Yi
Publisher: Cambridge University Press
Year: 2020
Pages: 279
Language: English
Format: pdf (true)
Size: 10.1 MB
The massive volume of data generated in modern applications can overwhelm our ability to conveniently transmit, store, and index it. For many scenarios, building a compact summary of a dataset that is vastly smaller enables flexibility and efficiency in a range of queries over the data, in exchange for some approximation. This comprehensive introduction to data summarization, aimed at practitioners and students, showcases the algorithms, their behavior, and the mathematical underpinnings of their operation. The coverage starts with simple sums and approximate counts, building to more advanced probabilistic structures such as the Bloom Filter, distinct value summaries, sketches, and quantile summaries. Summaries are described for specific types of data, such as geometric data, graphs, and vectors and matrices. The authors offer detailed descriptions of and pseudocode for key algorithms that have been incorporated in systems from companies such as Google, Apple, Microsoft, Netflix and Twitter.
Data, to paraphrase Douglas Adams, is big. Really big. Moreover, it is getting bigger, due to increased abilities to measure and capture more information. Sources of big data are becoming increasingly common, while the resources to deal with big data (chiefly, processor power, fast memory, and slower disk) are growing at a slower pace. The consequence of this trend is that we need more effort in order to capture and process data in applications. Careful planning and scalable architectures are needed to fulfill the requirements of analysis and information extraction on big data. While the “big” in big data can be interpreted more broadly, to refer to the big potential of such data, or the wide variety of data, the focus of this volume is primarily on the scale of data.
"A very thorough compendium of sketching and streaming algorithms, and an excellent resource for anyone interested in learning about them, understanding how they work and deploying them in applications. Good job!" - Piotr Indyk, Massachusetts Institute of Technology