In this lecture, we’re going to dive into Big Data and explore how it led to the development of distributed storage and processing systems.
Big data refers to the massive volumes of data that organizations deal with daily. This data has grown in both size and complexity, making it difficult to manage, store, and analyze using traditional business intelligence tools. As the internet and digital technologies have evolved, so has the volume of data—now measured in zettabytes. For context, a zettabyte equals one trillion gigabytes (10^21 bytes).
In the early days, data was stored in data warehouses on-premises. When organizations faced the challenge of increasing data volumes, they would add more hardware to a single machine—known as vertical scaling. However, this solution quickly ran into several limitations: high-end hardware costs rise steeply, a single machine has a hard ceiling on CPU, memory, and disk, upgrades typically require downtime, and the machine remains a single point of failure.
As data continued to grow beyond what traditional systems could handle, a new approach was needed. In 2003, Google published a paper that introduced a novel concept for storing large datasets in a distributed file system, called the Google File System (GFS). This system broke large datasets into smaller chunks and stored replicated copies of each chunk across multiple commodity servers.
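To make the idea concrete, here is a minimal sketch of chunked, replicated storage. The chunk size, server names, and round-robin placement are illustrative assumptions, not GFS's actual policy (real GFS used 64 MB chunks and more sophisticated placement heuristics):

```python
# Sketch: split data into fixed-size chunks and assign each chunk to
# several servers for redundancy, GFS-style.
CHUNK_SIZE = 8          # bytes per chunk (tiny, for demonstration only)
REPLICAS = 3            # copies kept of each chunk
servers = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def chunk_and_place(data: bytes) -> dict:
    """Return a chunk table: chunk index -> (chunk bytes, servers holding it)."""
    table = {}
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        index = offset // CHUNK_SIZE
        # Spread replicas over distinct servers, round-robin from the chunk index.
        holders = [servers[(index + r) % len(servers)] for r in range(REPLICAS)]
        table[index] = (chunk, holders)
    return table

table = chunk_and_place(b"a large dataset split across machines")
for index, (chunk, holders) in table.items():
    print(index, chunk, holders)
```

Because every chunk lives on several nodes, losing one machine loses no data—the key property that let clusters of cheap hardware replace ever-larger single servers.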
In 2004, Google released another paper on MapReduce, a framework for the parallel processing of large datasets across distributed systems. By splitting the workload across the nodes of a cluster and combining the partial results, MapReduce could process data far more efficiently than any single machine.
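The classic illustration of the model is word counting. The sketch below runs the three MapReduce phases—map, shuffle, reduce—in a single process; in a real cluster, each document list and each group of keys would live on a different node (the input strings here are made-up examples):

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: in a real cluster each document sits on a different node.
documents = ["big data big systems", "data systems scale"]

def map_phase(doc: str) -> list:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(mapped) -> dict:
    """Shuffle: group emitted pairs by key so each reducer sees one word's counts."""
    groups = defaultdict(list)
    for word, count in chain.from_iterable(mapped):
        groups[word].append(count)
    return groups

def reduce_phase(groups: dict) -> dict:
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map(map_phase, documents)))
print(counts)  # {'big': 2, 'data': 2, 'systems': 2, 'scale': 1}
```

Because the map calls are independent and the reduce calls touch disjoint keys, both phases parallelize naturally across machines—this independence is what made the model scale.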
These innovations paved the way for the development of Hadoop, an open-source framework built for distributed storage and distributed processing: the Hadoop Distributed File System (HDFS), modeled on GFS, handles storage, while Hadoop MapReduce handles processing.
With these distributed systems, we now had the option of horizontal scaling—adding more computers (or nodes) to the cluster as needed. This approach is much more scalable than vertical scaling, allowing companies to manage increasing data volumes without hitting the physical limits of traditional data centers.
In recent years, frameworks like Apache Spark have further enhanced big data processing. Spark improves upon the MapReduce model by keeping intermediate data in memory rather than writing it to disk between stages, which can make it up to 100 times faster for some workloads. Spark's ability to handle both batch and real-time data makes it a powerful tool for modern big data applications, which we'll cover in detail in the next lecture.
The transition from traditional on-premises storage to distributed storage systems has revolutionized how we manage and process large datasets. Tools like Hadoop and Spark have played a pivotal role in this evolution, making it possible to work with big data efficiently. By leveraging distributed storage and processing, organizations can scale their infrastructure to match growing data demands.