Profile photo for Gil Yehuda

Yes. We published this as open source, we contributed it to Apache, and, updating this answer yet again, I'll add that we have since moved much of our focus onto Storm and other technologies. Note: we publish updates on our work with Storm and other technologies on our Tumblr here: http://yahoohadoop.tumblr.com/post/98751512631/the-evolution-of-storm-at-yahoo-and-apache

Previously Updated Answer: This project is available on GitHub at http://github.com/s4 . You should see the S4 codebase available to you now. (The code and website content is being staged by the team at http://s4.io/.)

We are glad to learn that people are interested in this and we hope that you consider contributing bug fixes and the like to improve the code over time.

I hope this answer helps.

Another update:
S4 has been contributed to the Apache Software Foundation Incubator:
See
http://incubator.apache.org/projects/s4.html and http://wiki.apache.org/incubator/S4Proposal

So going forward, look to the Apache process for updates to S4.

And I'll note that we don't call it "Real-Time MapReduce" -- we call it "Stream-processing" and we think this is clearer and more accurate.

Profile photo for Pavan Yara

Update: http://s4.io is live. The GitHub page is at https://github.com/s4/core

It looks like S4 is more of a stream processing system than MapReduce.

How does this compare with Google's Percolator?

Profile photo for Adel Galal

Hi

  • With more than 600 petabytes of data spread across 40,000 Hadoop nodes, Yahoo is a big believer in all things Hadoop. In fact, its 32,000-node cluster is still the largest in the world. Now the Web giant is souping up its massive investment in Hadoop to give it a deep learning environment that’s the envy of the valley.

Profile photo for Gil Yehuda

Today (April 2, 2012)

Listen in to JSConf 2012 (or follow the tweetstream) http://2012.jsconf.us/#/schedule where you'll get an update about Mojito and the Open Source plans. We put the code on http://github.com/yahoo/mojito

Read it, fork it, play with it, test it out, enhance it, suggest enhancements, add to it, make great stuff with it, and let's take Node.JS to the next level, as a community.

Profile photo for Chris Gianelloni

There were about 36,000 nodes when I left in April 2010 and we were building out a new data center for Hadoop, in addition to the existing two data centers from 2009.

Profile photo for Nitendra Gautam

If you have been following Yahoo carefully since 2009, you may know that they sold a large part of the business, along with the Yahoo brand, to Oath (a subsidiary of Verizon). So Yahoo might not have as large a cluster as it used to.

There are other organizations like Facebook, Instagram, and Snapchat which have large user bases and might have bigger clusters right now.

Profile photo for Kah Keng Tay

If you just want to write and run MapReduce jobs locally without a Hadoop cluster, you could consider the HackReduce starter kit. At one of the HackReduce events, I was able to use the configuration the organizers provided to quickly get started coding a MapReduce program in Java/Eclipse, testing locally and eventually deploying the program to run on the EC2 clusters that were provided.

Their source code is available on GitHub at https://github.com/hopper/HackReduce.

To run a job locally, you could follow one of the commands beginning with "java -classpath" in the middle of their README file. Unfortunately, I don't think there's a way to run your jobs on an actual Hadoop cluster as they don't seem to provide an AMI for their EC2 setup, and these clusters are probably only made available at HackReduce events.

If you want to explore setting up your own Hadoop cluster, their resources page (http://www.hackreduce.org/hadoop-and-mapreduce/) has several links to MapReduce and Hadoop tutorials which you might find useful.

Profile photo for Mayank Singhal

I can see many reasons:

  1. Simplicity - The simplicity makes it incredibly easy to understand.
  2. Applicability - The paradigm is so simple that a very large number of already defined algorithms can be implemented using one or more Map-Reduce steps. This makes it very easy to implement large scale distributed systems for things that required a lot more information regarding distributed systems than before.
  3. Scalability - The abstraction is implicitly scalable, which makes it very useful for cases where the amount of data to process is huge.
  4. Redundancy - Error correction can be implemented pretty easily using redundancy. Many other paradigms find it hard to implement fault tolerance without significant work, Map-Reduce makes it easier. Most Map-Reduce deployments are on Distributed Filesystems (HDFS/BigTable), making the data retrieval/deposit steps fault tolerant as well.
  5. Infrastructure Requirements - Are not high. The underlying infrastructure is generic; a network of simple processing units with disk access.
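
Point 1 is easiest to see with the canonical word-count example. Here is a minimal, single-process Python sketch of the two phases (illustrative only; it is not any framework's actual API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then sum the counts per key."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["the"], counts["fox"])  # 3 2
```

A real framework distributes the map calls across machines, sorts and routes the pairs over the network, and runs many reducers in parallel, but the programmer only writes the two functions above.
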
Profile photo for Todd Lipcon

No, MapReduce is not currently exposed on AppEngine. Some creative users have implemented "Map" on top of app engine here: http://code.google.com/p/appengine-mapreduce/ but they haven't gotten to the "reduce" side yet, apparently.

Profile photo for Cosmin Negruseri

Distributed sorting is a great benchmark for mapreduce implementations because it exercises all the parts of the framework well.

The most time consuming part probably is sending the data over the network from the mappers to the right reducers.
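
That mapper-to-reducer routing (the "shuffle") is typically just a hash partition on the key. A toy sketch of the idea (the names are mine, not Hadoop's actual API; I use CRC32 only so the example is deterministic):

```python
from zlib import crc32

def partition(key: str, num_reducers: int) -> int:
    """Hadoop-style default partitioning: hash the key modulo the reducer
    count, so every record with the same key is routed to one reducer."""
    return crc32(key.encode()) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
routes = {k: partition(k, 4) for k in keys}
# Both "apple" records land on the same reducer, wherever they were mapped.
```

Because every mapper may hold records for every reducer, the shuffle is an all-to-all network transfer, which is why it tends to dominate the runtime of a distributed sort.
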

Over time a number of results on Distributed Sort benchmarks have been reported by Yahoo, Google and Microsoft:

  • yahoo08 TeraByte Sort on Apache Hadoop http://sortbenchmark.org/YahooHadoop.pdf
  • google08 Sorting 1PB with MapReduce http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
  • yahoo09 Winning a 60 Second Dash with a Yellow Elephant http://sortbenchmark.org/Yahoo2009.pdf
  • google11 Sorting Petabytes with MapReduce - The Next Episode http://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html
  • microsoft12 Data in the Fast Lane http://research.microsoft.com/en-us/news/features/minutesort-052112.aspx


Each benchmark sorts 100-byte records (with 10-byte keys).

Here's a summary of the results (hope I didn't mess this up :) ):

Core counts were only reported in the Yahoo results: 8 cores per node.

I computed throughput as the average of how much data is sorted by a CPU per second.
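
For example, using the yahoo08 figures as I recall them from the TeraByte Sort report linked above (1 TB sorted in 209 seconds on 910 nodes, 8 cores each), the arithmetic looks like:

```python
TB = 10**12  # the sort benchmarks use decimal terabytes

nodes, cores_per_node, seconds = 910, 8, 209  # yahoo08 TeraByte Sort figures
per_node = TB / seconds / nodes               # bytes sorted per node per second
per_core = per_node / cores_per_node          # bytes sorted per core per second
print(f"{per_node/1e6:.1f} MB/s per node, {per_core/1e6:.2f} MB/s per core")
# 5.3 MB/s per node, 0.66 MB/s per core
```
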

The MinuteSort result is interesting because it has the highest throughput per machine. The purpose of the benchmark is sort as much data as possible in 60 seconds, so I wonder if using just 250 machines is a limitation of their system.

As for Hadoop vs. Google MapReduce in 2011, there's no new benchmark from Yahoo.

It's impressive how Google MR improved between 2008 and 2011: it can use double the number of nodes, sort through 10 times the data, and achieve 5 times the per-node throughput.

The google11 article mentions:

We are excited by these results. While internal improvements to the MapReduce framework contributed significantly, a large part of the credit goes to numerous advances in Google's hardware, cluster management system, and storage stack.

Profile photo for Quora User

The list of people who worked on and/or directly influenced Hadoop as part of their official duties at Yahoo! pre-2009 is around 40 or so people. A lot of them have spread out to other companies/projects. Many of them are no longer directly involved with Hadoop. The best way to find out if someone was actually, directly involved is to ask which org they were in. If it wasn't specifically Grid, then they were likely (but not 100% guaranteed) just a user.

(or if they remember how many bugs we hit when we upgraded Mithril to support users and permissions in HDFS. haha.)

BTW, Mike was never an employee of Yahoo!. Like him, there are a handful of people at that time that were working on Hadoop outside of Y!. They are rare, but they exist. ;)

Profile photo for Quora User

Flume is comparable; see How does S4 compare to Flume? for the differences.

Profile photo for Todd Lipcon

At the most recent HBase hackathon, one idea we discussed for 2011 is something that would allow HBase to very closely approximate Haystack in a lot of ways.

The general idea is the same as what databases have been doing for a long time as external blob storage. When any row is written with a larger amount of data, we would append it to a "haystack" and then just write the metadata into the table itself. On retrieval, the metadata would be small enough to fit in cache and we would perform a single seek into the haystack to read the data.

The advantages here are:
- the large data wouldn't ever have to hit the HLog, and hence would increase throughput
- the large data wouldn't have to be compacted (or could be compacted much more infrequently) while the metadata could continue to be compacted in order to provide efficient lookup
- the haystack files would be immutable over a long period, allowing HDFS improvements like HDFS RAID to reduce storage requirements.
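
A toy sketch of that indirection, with hypothetical names (the in-memory dict stands in for the metadata table; this is not the hackathon design's actual code):

```python
class Haystack:
    """Append-only blob store: large values go into one big file, and the
    table/cache keeps only (offset, length) metadata per key."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}                 # key -> (offset, length) metadata
        open(path, "wb").close()        # start with an empty haystack file

    def put(self, key: str, blob: bytes) -> None:
        with open(self.path, "ab") as f:
            offset = f.tell()           # append-only: write at end of file
            f.write(blob)
        self.index[key] = (offset, len(blob))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)              # single seek, then one sequential read
            return f.read(length)

hs = Haystack("/tmp/haystack_demo.dat")
hs.put("photo1", b"...jpeg bytes...")
assert hs.get("photo1") == b"...jpeg bytes..."
```

The metadata is small enough to cache, so a read costs one disk seek into the haystack, and the big blobs never pass through the write-ahead log or the compaction pipeline.
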

Profile photo for Gil Yehuda

Subsequent to this question and two answers being provided, Yahoo! has indeed "spun off" a significant part of the Hadoop engineering team, forming the "Hortonworks" company http://www.hortonworks.com/ -- dedicated to delivering success with Hadoop. Yahoo! continues its significant investment in Hadoop engineering internally, as this technology is very important to the company.

This is viewed as a "win-win" situation which should greatly benefit the Apache Hadoop community. It will help address the huge demand (and short supply) for expertise and services in this very sophisticated area of technology. And it provides a nice influx into the economics of Hadoop for Yahoo! and other ecosystem participants.

Profile photo for Quora User

Hadoop consists of many subprojects: HDFS, MapReduce, Hive, Pig, HBase, and Avro. I believe this question refers to the MapReduce implementation, which can operate over a variety of storage systems. I agree that Hadoop could use decent competition; having used the software daily for many years, I disagree that it's becoming more complex with each release. As an example, see the heroic efforts of Chris Douglas on https://issues.apache.org/jira/browse/MAPREDUCE-64 to remove some of the tuning parameters for a MapReduce job. I suppose that's for another question though.

Both MapReduce implementations mentioned above (CouchDB and MongoDB) require that data live in the data stores specified and present far different semantics than those described in the Google paper.

Here are some MapReduce-ish implementations, all of which are either coupled to a single storage system, a single programming language, or implement only a small subset of the features of a mature MapReduce implementation:

  • Disco: http://discoproject.org
  • Misco: http://www.cs.ucr.edu/~jdou/misco/
  • Phoenix: http://mapreduce.stanford.edu
  • Cloud MapReduce: http://code.google.com/p/cloudmapreduce
  • bashreduce: http://blog.last.fm/2009/04/06/mapreduce-bash-script
  • Qizmt: http://code.google.com/p/qizmt
  • HTTPMR: http://code.google.com/p/httpmr
  • Galago's TupleFlow: http://www.galagosearch.org/guide.html
  • Skynet: http://skynet.rubyforge.org
  • Sphere: http://sector.sourceforge.net
  • Riak: http://riak.basho.com/mapreduce.html
  • Starfish: http://rufy.com/starfish/doc/
  • Octopy: http://code.google.com/p/octopy/
  • MPI-MR: http://www.sandia.gov/~sjplimp/mapreduce.html
  • Filemap: http://mfisk.github.com/filemap/
  • Plasma MapReduce: http://projects.camlcity.org/projects/plasma.html
  • Mapredus: http://rubygems.org/gems/mapredus
  • Mincemeat: http://remembersaurus.com/mincemeatpy/
  • MapReduceTitan: http://www.kitware.com/InfovisWiki/index.php/MapReduce
  • GPMR: http://www.idav.ucdavis.edu/research/projects/mgpu_mapreduce
  • Elastic Phoenix: https://github.com/adamwg/elastic-phoenix
  • Peregrine: http://peregrine_mapreduce.bitbucket.org/
  • R3: http://heynemann.github.com/r3/


Also, Microsoft's DryadLINQ is available under an academic license (not quite open source) at http://research.microsoft.com/en-us/projects/Dryad.

Disclosure: I founded Cloudera, a company that provides commercial support for a distribution of Hadoop, among other things.

The pipe2py open-source project runs Yahoo! Pipes locally using Python (https://github.com/ggaughan/pipe2py). This can then be run on Google App Engine for scalability (http://www.wordloosed.com/running-yahoo-pipes-on-google-app-engine).

Profile photo for Tobin Baker-Jones

I believe the answer is "No". Contrary to some answers, the superiority of MPP systems for analyzing relational data has little to do with "ACID compliance"--indeed, many of these systems are used entirely offline, where consistency concerns are irrelevant. They are faster than Hadoop because they are optimized for relational operations, and because they don't make some of the dubious design decisions (such as materializing all intermediate results for fault tolerance) that appeared in the original MapReduce paper. Even later Hadoop iterations (e.g., Hive on Tez) don't perform as well on relational queries as purpose-built MPP systems like Impala or Presto.

Shameless plug: the University of Washington database group is working on such an MPP system that is compatible with HDFS: http://myria.cs.washington.edu

Profile photo for Ryan Cox

In short, yes you can do MR on AppEngine. Here are some links:

  • AppEngine-MapReduce: http://code.google.com/p/appengine-mapreduce/
  • Success Story: http://www.google.com/buzz/bslatkin/6SXDRWXFWkN
  • Google I/O 2010 Video: http://www.youtube.com/watch?v=_7fJotosrNQ
Profile photo for Quora User

Bing uses a file system called Cosmos [1] that is conceptually similar to HDFS. Dryad [2] is their execution infrastructure; it's more expressive than MapReduce. The majority of the queries run over Cosmos are done with SCOPE [3], which is conceptually similar to Hive.

[1] http://www.goland.org/whatiscosmos
[2] http://dl.acm.org/citation.cfm?id=1273005
[3] http://dl.acm.org/citation.cfm?id=1454166

Profile photo for Jay Davis

No. At this time it is not on the roadmap to be open sourced. The individuals who were working on the project to open source Sherpa have left the company. One component, Apache Traffic Server, did get open sourced, but Sherpa is unlikely at this point to be open sourced.

Profile photo for Anonymous

Databases and MapReduce are two entirely different things. MapReduce systems are intended for offline processing, whereas databases and parallel databases are meant for online transaction processing. Going by the published results, MapReduce clearly beats the other warehouse-analytics systems at large-scale computation ("big data," as it's called now). If you instead worry about scaling online transaction processing systems like databases and parallel databases, they pose an entirely new set of problems related to consistency (see the CAP theorem).

I agree with you that there has been no popular parallel database to date, and most of that is because of the overhead they incur by sticking to the ACID properties. But recent research and use cases have shown that one need not stick to all of the ACID properties; some of them can be relaxed. This led to the new key-value stores, which trade some ACID properties (for example, relaxing consistency rules) for performance, and this has shown real improvement. You can look into systems like HBase, Cassandra, and Google's F1 database, and read some of the literature on the CAP theorem and eventual consistency.

Overall, I feel there has been no big parallel database because they tried to stick to the ACID properties, incurring a large overhead. Modern systems are trying to bridge that gap by relaxing whatever they feel is unnecessary. This is my opinion and comments are welcome.
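To make the "relaxing consistency for performance" trade-off concrete, here is a minimal, illustrative Python sketch of last-write-wins reconciliation, the kind of eventual-consistency mechanism the answer alludes to. The `Replica` class and its method names are hypothetical for illustration, not the API of any real store such as Cassandra or Riak:

```python
import time

class Replica:
    """One node in a toy eventually-consistent key-value store.

    Writes are accepted locally and tagged with a timestamp; a later
    anti-entropy pass merges replicas using last-write-wins.
    """
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, value, ts=None):
        # Accept the write immediately (available, not strongly consistent).
        self.data[key] = (ts if ts is not None else time.time(), value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Last-write-wins reconciliation: keep the newer timestamp per key.
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)

# Two replicas diverge, then converge after exchanging state.
a, b = Replica(), Replica()
a.put("user:1", "alice", ts=1)
b.put("user:1", "alicia", ts=2)   # later write accepted on another node
a.merge(b)
b.merge(a)
assert a.get("user:1") == b.get("user:1") == "alicia"
```

The point of the sketch: each node answers writes without coordinating with the others (high availability), and the system only converges later, which is exactly the consistency relaxation the CAP literature describes.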

Profile photo for Pablo Chacin

If your main criterion is simplicity, I think the two main alternatives are Disco (http://discoproject.org/) and Filemap (http://mfisk.github.com/filemap), both mentioned in a previous answer. I particularly like Filemap because you can't get MapReduce-like processing simpler than this.

One interesting alternative is Cloud MapReduce (http://code.google.com/p/cloudmapreduce/), which is implemented on top of Amazon Web Services and follows a completely different architecture without a job tracker, in which individual nodes self-schedule the tasks of a job.

I don't agree with some (most) answers that mention NoSQL databases that include a form of MapReduce, as those are not generic solutions: you must use that particular database in order to use its implementation of MapReduce.

There's a notable exception: Riak (http://riak.basho.com). Even though its implementation of MapReduce is in principle tied to the database, in reality its architecture has evolved toward a more generic substrate for distributed applications, and its MapReduce implementation is much more generic. I don't believe it's a fully independent MapReduce engine yet, but I would bet it's close (I've considered working in that direction myself).

As for frameworks like Phoenix and Mars, they don't qualify as complete alternatives to Hadoop, as they target multicore and/or graphics-processor architectures and don't consider distribution over large clusters.

Finally, I would consider Storm (https://github.com/nathanmarz/storm). Even though it doesn't implement a MapReduce model, its stream-processing capabilities offer an alternative programming model and make it relatively easy to implement MapReduce-like jobs.

Profile photo for Gil Yehuda

Apache Spark is used. So is Apache Storm. So is Apache Hadoop. So are many related technology projects -- used in different parts of the company for different things. It's not about switching gears, it's about using the best tool for the job. There are many different products and platforms, and they have differing use cases, so you'd expect Yahoo to rely on many elements of the Big Data ecosystem rather than interpret this as a move from one to the other. Moreover, it's important for a large company to develop competence in many (seemingly competing) technologies so that it can keep its options open in case some workloads are better suited to one compute style than another.

Come to the 2015 Hadoop Summit (http://2015.hadoopsummit.org/) to learn more.

Profile photo for Sandeep Goyal

Yes.

Yahoo recently released an updated product report for 2015, announcing the closure of several products and regional sites, and in some cases the end of support for older Apple devices.

Maps is among the products that will be closed by the end of June, but other Yahoo products will continue to support maps.

This is the snippet from the full report:

The Yahoo Maps site will close at the end of June. However, in the context of Yahoo search and on several other Yahoo properties including Flickr, we will continue to support maps. We made this decision to better align resources to Yahoo’s priorities as our business has evolved since we first launched Yahoo Maps eight years ago.

Profile photo for Tim Sell

Map Reduce is a big deal because it allows the separation of algorithms from infrastructure.
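That separation can be shown in a few lines of Python. This is a conceptual sketch, not any real framework's API: the user writes only the two small functions, and the runner (which here executes serially, but which a framework like Hadoop replaces with a distributed engine) is interchangeable without touching the algorithm:

```python
from collections import defaultdict

# The "algorithm": two small pure functions the user writes.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return word, sum(counts)

# The "infrastructure": a runner that could just as well ship the same
# two functions to a thousand machines; the algorithm does not change.
def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in inputs:               # map phase
        for key, value in map_fn(record):
            groups[key].append(value)   # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["to be or not to be"], wc_map, wc_reduce)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Because `wc_map` and `wc_reduce` know nothing about where or how they run, swapping the serial runner for a cluster-scale one is purely an infrastructure change.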

Profile photo for Andrew Johnson

If they have some unique IP, tools, etc., it could make sense. With the acquisitions of Aster Data, Greenplum, Vertica, and Netezza, the broader market obviously sees the value of coming up with better ways to deal with "big data". If Yahoo! engineers have developed a uniquely compelling way to use Hadoop, then it would likely be a solid move.

Profile photo for Nikita Ivanov

I think the difference is in-memory processing. There are many ways to achieve it, but it's pretty clear that batch-oriented MR is fading (albeit slowly). Google, Facebook, Twitter, etc. have realized that real-time processing can't be accomplished without in-memory processing -- and have moved on.

Profile photo for Tim Wilson

Yes, and (almost certainly) yes.

MapReduce has been modified and improved since the paper describing it was published, but it is certainly still in use. You can even use it yourself as part of Google App Engine (MapReduce Overview).

As for having something new, Google always has something new that they're not ready to publish; they publish hundreds of papers a year (see Research at Google), many on algorithms, data structures, and distributed systems. I doubt they are working on a direct replacement for MapReduce, but they're frequently making new products incorporating MapReduce.

Profile photo for Quora User

Google (company)'s relationship with Hadoop has always been sort of interesting.

In the early days, Google (along with IBM (company)) built some educational training materials for universities to use for teaching MapReduce to its students [1]. These materials utilized Hadoop because Google's internal secret sauce was still pretty much hidden away.

Additionally, there was some work done to add users to Hadoop with funding by Google.

After Christophe Bisciglia left Google, their direct contributions have mostly dried up. But they still remain a 'force' in the ecosystem if only for Google Summer of Code (GSoC) and the occasional paper.

It's worth noting that Google has publicly stated that they will not pursue Hadoop for any MapReduce patent violations. [2]

1 - http://googlepress.blogspot.com/2007/10/google-and-ibm-announce-university_08.html

2 - http://gigaom.com/2010/01/19/why-hadoop-users-shouldnt-fear-googles-new-mapreduce-patent/ amongst others.

Profile photo for Francisco Andrades

MapReduce is not a database. It's a programming model that lets you split a task into a set of smaller jobs that can run in parallel, whose results can in turn be fed into another task, and so forth. It's a mechanism for processing lots of data in parallel.

There are databases that support running MapReduce jobs on the data that's stored in a cluster, and as such you can see them as complementary technologies.
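The "results fed into another task" part is the composition step. Here is a minimal, hypothetical Python stand-in for a MapReduce engine (real engines distribute each phase across a cluster) used to chain two stages, where the second job consumes the first job's output:

```python
from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    # Toy single-process stand-in for a MapReduce engine:
    # map each record, group by key, then reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(k, reduce_fn(k, vs)) for k, vs in groups.items()]

# Stage 1: count word occurrences in the input.
stage1 = mapreduce(
    ["big data big deal"],
    lambda line: ((w, 1) for w in line.split()),
    lambda w, ones: sum(ones),
)

# Stage 2: feed stage 1's output into another job that groups
# words by how often they occurred.
stage2 = mapreduce(
    stage1,
    lambda kv: [(kv[1], kv[0])],
    lambda count, words: sorted(words),
)
# dict(stage2) == {2: ["big"], 1: ["data", "deal"]}
```

Each stage is an independent MapReduce job; in a real deployment the intermediate output would live in a distributed store (HDFS, a database table) between runs, which is where the database-as-storage-layer complementarity comes in.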

Profile photo for Denny Lee

Yahoo! has stopped updating the YDN Hadoop blog because they fully committed to Apache Hadoop (i.e. they stopped maintaining their own separate fork of Hadoop). In 2011, the folks working on Hadoop within Yahoo! spun off to found Hortonworks.
