Sandy Ryza's answer to When would someone use Apache Tez instead of Apache Spark, or vice versa? Do their use cases overlap to a large extent?

Apache Hadoop PMC, software engineer at Cloudera · Updated 11y

In a nutshell, Spark is a more mature version of Tez, plus much, much more. If Tez comes with your version of Hive or Pig, use it as the backend execution engine in place of MapReduce. If you're planning to use the APIs directly, whether to write a data-transformation job, implement a distributed machine learning algorithm, or write your own higher-level data processing language, use Spark, hands down.

Disclaimer: I work at Cloudera, which just started offering Spark support, so I would say a lot of the things I'm about to say. I also hope this doesn't come off as too disparaging to the Tez project. Though I see it as somewhat of a misdirected effort, I've worked on YARN with many of the engineers working on Tez, and think highly of them.

Tez is a ~1.5-year-old implementation of a 2007 paper from Microsoft that generalized the MapReduce distributed compute framework. Spark is a ~4-year-old implementation of a 2010 paper from Berkeley that built on the Microsoft paper. Spark adds "Resilient Distributed Datasets" (RDDs), an abstraction that makes it easy to work with distributed in-memory data.

As mentioned in the question, both Tez and Spark provide a distributed execution engine that can handle arbitrary DAGs, targeted towards processing large amounts of data. Both can read and write data to and from Hadoop using any MapReduce input or output format. The main focus of Tez so far has been providing a faster engine than MapReduce under Hadoop's traditional data-processing languages like Hive and Pig. Spark has these capabilities, but its developers have also put a lot of effort into a clean user-facing API with a rich set of operators. It can express wordcount in 3 lines of Scala or 15 lines of Java. It also provides an interactive shell (REPL) and a Python API, which are especially great for data sciency audiences and facilitate development in general. Tez exposes an API for constructing a data flow DAG: you define vertices and edges and the way that data gets transferred between them. Its wordcount example runs over 300 lines. I believe it also supports the Hadoop MapReduce API, but if you're using that you're not taking advantage of the arbitrary-DAG capabilities.
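To make the API contrast concrete, here is a toy sketch in plain Python that runs without a cluster. It is not Spark or Tez code. The first half chains three operators in the same shape as the RDD calls flatMap, map, and reduceByKey; the second half wires up the same job as named vertices and edges. The names reduce_by_key, tokenize, sum_counts, and run_chain are made up for illustration, and the runner only handles a linear chain, which a real DAG engine would not be limited to.

```python
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]


def reduce_by_key(pairs, fn):
    """Local stand-in for a reduce-by-key operator (hypothetical helper)."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


# Style 1: chained operators, one step per line.
words = chain.from_iterable(line.split() for line in lines)  # like flatMap
pairs = ((word, 1) for word in words)                        # like map
counts_ops = reduce_by_key(pairs, lambda a, b: a + b)        # like reduceByKey


# Style 2: explicit vertices and edges (illustrative only, NOT the Tez API).
def tokenize(records):
    """Vertex that turns lines into (word, 1) pairs."""
    return [(word, 1) for line in records for word in line.split()]


def sum_counts(records):
    """Vertex that adds up the counts for each word."""
    return reduce_by_key(records, lambda a, b: a + b)


vertices = {"tokenizer": tokenize, "summation": sum_counts}
edges = [("tokenizer", "summation")]  # data flows tokenizer -> summation


def run_chain(vertices, edges, source, records):
    """Run vertices in order by following edges; assumes one outgoing edge each."""
    downstream = dict(edges)
    name = source
    while name is not None:
        records = vertices[name](records)
        name = downstream.get(name)
    return records


counts_dag = run_chain(vertices, edges, "tokenizer", lines)
print(counts_ops == counts_dag)
```

Both styles compute the same counts. The difference is who does the plumbing: in the first the operators imply the data flow, while in the second you name every stage and connection yourself, which is where the extra lines come from.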

From a community/adoption/staying-power standpoint, Spark boasts over a hundred contributors from a diverse set of companies like Databricks, Intel, Yahoo, and Cloudera. The mailing lists are constantly overflowing my inbox. Nearly all Tez contributions come from a single company (Hortonworks).

Spark and Tez are both distributed data-processing engines that target similar use cases. My opinion is that Spark's maturity, cleaner and richer APIs, thriving community, and first-class support for RDDs and in-memory data make it a superior choice in nearly every situation.
