Apache Hadoop: an open-source software framework for storing and processing large volumes of data in a distributed computing environment.
Major components- HDFS (Hadoop Distributed File System) – for storing data, and MapReduce – for processing data.
Apache Pig: Works on top of Hadoop to write data analysis programs in an easier way.
Apache Hive: SQL-like query language for Hadoop.
Apache Storm: It is a distributed, real-time processing system for big data. It has the ability to process data in real time, which makes it well-suited for use cases such as real-time analytics for stocks, fraud detection, and event-driven applications.
Apache Spark: It provides an in-memory data processing engine, which makes it faster and more flexible than Hadoop’s MapReduce for many use cases.
4 Vs of Big Data:
- Volume: which normal RDBMS databases cannot store or are not meant to process.
- Velocity: Speed at which data is getting added.
- Variety: Structures, Semi-structured and unstructured. Traditional systems are meant only for Structured data. For example, reviews, comments, images, etc are nonstructured data
- Veracity: Non-verified data. Data that may or may not be useful. Inconsistent data which cannot be used straight away.