Big Data Ecosystem

Apache Hadoop: An open-source software framework for storing and processing large volumes of data in a distributed computing environment.

Major components: HDFS (Hadoop Distributed File System) for storing data, and MapReduce for processing data.
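
The MapReduce processing model can be illustrated with a small word count. The following is a minimal single-machine sketch in plain Python, not Hadoop's actual API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real cluster, the map and reduce steps would run in parallel on many nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

The point of the split is that each map call only needs one line and each reduce call only needs one key, so the work can be distributed without the steps sharing state.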

Apache Pig: A high-level platform that runs on top of Hadoop. Its scripting language, Pig Latin, makes data analysis programs easier to write than hand-coded MapReduce jobs.

Apache Hive: A data warehouse layer on top of Hadoop that provides HiveQL, a SQL-like query language.

Apache Storm: A distributed, real-time processing system for big data. It processes data as it arrives, which makes it well-suited for use cases such as real-time stock analytics, fraud detection, and event-driven applications.
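
To make the "process data as it arrives" idea concrete, here is a minimal sketch of event-at-a-time processing using a Python generator. It does not use Storm's API; the event fields, the transaction_stream source, and the 1000.0 limit are all hypothetical values made up for illustration.

```python
def transaction_stream():
    """Hypothetical event source; a real system would read from a queue or socket."""
    yield {"id": 1, "amount": 25.0}
    yield {"id": 2, "amount": 4800.0}
    yield {"id": 3, "amount": 310.0}


def flag_suspicious(events, limit=1000.0):
    """Inspect each event as soon as it arrives and yield the ones over the limit."""
    for event in events:
        if event["amount"] > limit:
            yield event


if __name__ == "__main__":
    for alert in flag_suspicious(transaction_stream()):
        print("possible fraud:", alert["id"])
```

Unlike a batch job, nothing here waits for the full data set: each event is checked independently, one at a time.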

Apache Spark: An in-memory data processing engine, which makes it faster and more flexible than Hadoop’s MapReduce for many use cases.

4 Vs of Big Data:

  • Volume: The amount of data, at a scale that a traditional RDBMS cannot store or is not designed to process.
  • Velocity: The speed at which new data is generated and added.
  • Variety: Structured, semi-structured, and unstructured data. Traditional systems are meant only for structured data. For example, reviews, comments, and images are unstructured data.
  • Veracity: The trustworthiness of the data. Unverified data may or may not be useful, and inconsistent data cannot be used straight away.
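
The three kinds of data named under Variety can be shown side by side. This is a small sketch using only Python's standard library; the sample records and the function names are invented for illustration.

```python
import csv
import io
import json


def parse_structured(text):
    """Structured: fixed columns, every row has the same fields (CSV here)."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_semi_structured(text):
    """Semi-structured: self-describing, but fields can vary per record (JSON here)."""
    return json.loads(text)


def parse_unstructured(text):
    """Unstructured: free text with no schema; the best we can do is tokenize it."""
    return text.lower().split()


if __name__ == "__main__":
    rows = parse_structured("id,product,rating\n1,laptop,4\n")
    record = parse_semi_structured('{"id": 2, "product": "phone", "tags": ["new", "sale"]}')
    words = parse_unstructured("Great laptop but the battery died after a week")
    print(rows[0]["rating"], record["tags"], len(words))
```

The structured and semi-structured inputs can be queried by field name right away, while the free-text review needs further processing before it answers any question, which is why traditional systems handle it poorly.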