
There have been many updates since most of the answers to this question were written. Let me provide an update on the dimensions mentioned in other answers.

Project Goals

Spark is positioned as a fast and general engine for Big Data. It generalizes the MapReduce model and is poised to replace MapReduce and other runtimes. This means it should:
- work well with data size ranging from GBs to PBs
- work well with varying algorithmic complexity, from ETL to SQL to machine learning
- work well from low-latency streaming jobs to long batch jobs
- work well with data regardless of storage medium, be it disks, SSDs, or memory

Community and Vendor Support

As of mid-2014, Spark is the most active Big Data project. According to OpenHub, there are 355 currently active contributors and 431 total contributors. In any given month, ~100 engineers contribute patches. As a result, the project is progressing rapidly.

On the commercial side, all leading Hadoop vendors now ship Spark.

On-disk, In-memory, and Network Performance

One example of this fast-paced development is the on-disk performance of Spark jobs. Spark can outperform Hadoop MR by orders of magnitude when data fits in memory, but it also outperforms Hadoop MR when data is on disk.

We recently used Spark to run a 100TB sort and a 1PB sort, and reported that our result outperformed the best publicly known Hadoop MR result by 3X, even though we used only 1/10th of the nodes: Spark the fastest open source engine for sorting a petabyte
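Large-scale sorts like these typically rely on range partitioning: sample the data to pick partition boundaries, shuffle each record to the node that owns its key range, then sort each partition locally so that concatenating the partitions yields a globally sorted result. Below is a minimal single-process sketch of that idea in plain Python; the function name and parameters are illustrative, not Spark's actual implementation:

```python
import random
from bisect import bisect_left

def range_partition_sort(records, num_partitions, sample_size=100):
    """Sketch of a range-partitioned sort: sample, partition, sort locally."""
    # 1. Sample keys to choose (num_partitions - 1) boundary values.
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    step = len(sample) // num_partitions
    boundaries = [sample[(i + 1) * step] for i in range(num_partitions - 1)]

    # 2. "Shuffle": route each record to the partition owning its key range.
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        partitions[bisect_left(boundaries, r)].append(r)

    # 3. Sort each partition locally; the concatenation is globally sorted.
    return [x for part in partitions for x in sorted(part)]

data = [random.randint(0, 10**6) for _ in range(10_000)]
assert range_partition_sort(data, num_partitions=8) == sorted(data)
```

In a real cluster, step 2 is the network-bound shuffle, which is why sort benchmarks stress the shuffle and network layers rather than raw CPU.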

Also, many benchmarks have demonstrated that Spark outperforms Hadoop MR on communication-intensive workloads. Just try a communication-intensive algorithm such as ALS (alternating least squares).
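ALS is communication-intensive because each iteration alternates between solving for user factors with item factors fixed and vice versa, and in a distributed setting each half-iteration requires shipping the freshly updated factors across the cluster. Here is a toy rank-1 version in plain Python (the distributed broadcast is elided; this only shows the alternating closed-form updates, and the function is illustrative, not MLlib's API):

```python
def als_rank1(ratings, num_users, num_items, iterations=20):
    """Toy rank-1 ALS: factor ratings r[u, i] into u_vec[u] * i_vec[i].

    ratings: dict mapping (user, item) -> rating (a sparse matrix).
    On a cluster, each half-step below is where factors get broadcast.
    """
    u_vec = [1.0] * num_users
    i_vec = [1.0] * num_items
    for _ in range(iterations):
        # Fix item factors; for each user, the least-squares solution is
        # u = (sum_i r_ui * v_i) / (sum_i v_i^2).
        for u in range(num_users):
            num = sum(r * i_vec[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(i_vec[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            if den:
                u_vec[u] = num / den
        # Fix user factors; solve for each item factor symmetrically.
        for i in range(num_items):
            num = sum(r * u_vec[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(u_vec[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            if den:
                i_vec[i] = num / den
    return u_vec, i_vec

# A perfectly rank-1 ratings matrix: r[u, i] = (u + 1) * (i + 1).
ratings = {(u, i): (u + 1.0) * (i + 1.0) for u in range(3) for i in range(4)}
u_vec, i_vec = als_rank1(ratings, num_users=3, num_items=4)
assert all(abs(u_vec[u] * i_vec[i] - r) < 1e-6 for (u, i), r in ratings.items())
```

Because every iteration touches all factors, an engine that keeps them in memory and broadcasts them efficiently, as Spark does, has a large advantage over re-reading state from disk each MapReduce round.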

Ecosystem

In addition to runtime performance, Spark also enables higher developer productivity.

Spark's RDD programming abstraction is vastly superior to MapReduce's. As a result, many new projects are being developed on Spark at a pace that was not possible with MapReduce before.
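To see the productivity gap, consider word count: in MapReduce it takes a mapper class, a reducer class, and driver boilerplate, while with RDDs it is a short chain of transformations. The plain-Python mock below imitates that chaining style locally; `MiniRDD` is a made-up stand-in, not the real Spark API, though the method names mirror Spark's:

```python
class MiniRDD:
    """A tiny local stand-in for Spark's RDD chaining style (not real Spark)."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the resulting sequences.
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge values for each key with the supplied associative function.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

lines = MiniRDD(["to be or not to be"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
assert dict(counts) == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The real Spark version is nearly identical in shape, which is exactly the point: the same map/reduce computation collapses into a few composable lines.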

On the machine learning side, MLlib in Spark is maturing and becoming the de facto standard for distributed machine learning. Mahout is moving away from MapReduce and switching to Spark as its backend execution engine.

Many other projects in the Hadoop ecosystem, such as Hive, Pig, and Cascading, are being ported to run directly on top of Spark.

I'm sure you can still find niche use cases where Hadoop MapReduce might be better, but the pace of development is progressing so fast that Spark is dominating in almost all dimensions.
