Yahoo Releases Open Source Hadoop Distribution

ruphus13 writes "Yahoo has been a vociferous Apache Hadoop user and supporter for several years now, and uses it extensively within its Search technologies. Hadoop has been gaining popularity in the Cloud Computing space, with companies like the NYTimes converting 4TB and 11 million articles to PDFs in under 24 hours using Hadoop and EC2 in late 2007. Hadoop has been made available in Amazon's cloud and Yahoo has now released its own Hadoop version. From the article: 'At today's Hadoop Summit in Silicon Valley, Yahoo! announced the availability of the Yahoo! Distribution of Hadoop, a source-only version of Apache Hadoop that Yahoo! uses within its own search engine. [Hadoop] is an open source software framework that helps process very large data sets, and is widely used in large-scale data mining applications as well as in search tools at sites like Facebook and many others. For developers and users interested in Hadoop, it's worth noting that the Yahoo! Distribution of Hadoop has been widely tested and developed at Yahoo! for years now.'"
This discussion has been archived. No new comments can be posted.

  • Timely article (Score:3, Informative)

    by C18H27NO3 ( 1282172 ) on Wednesday June 10, 2009 @06:14PM (#28286291)
    Perhaps the Ask Slashdot inquirer in this [slashdot.org] thread will find this news useful.
  • Hadoop? (Score:5, Insightful)

    by ickleberry ( 864871 ) <web@pineapple.vg> on Wednesday June 10, 2009 @06:19PM (#28286345) Homepage
    Can we bring back the ordinary, sensible pre-Web 2.0 names please?
  • Hadoop is awesome (Score:5, Informative)

    by fancellu ( 712538 ) on Wednesday June 10, 2009 @06:22PM (#28286363)
    Not only is it used by Yahoo, but also by Facebook, who get 15TB of new data a day to handle. Check out the very useful free vids from Cloudera. http://www.cloudera.com/hadoop-training-thinking-at-scale [cloudera.com] You can download a canned VM preloaded with Hadoop/Pig/Hive goodness, even a copy of Eclipse preconfigured. http://www.cloudera.com/hadoop-training-virtual-machine [cloudera.com]
    • Re: (Score:3, Interesting)

      by zerocool^ ( 112121 )

      We also use it extensively at Rackspace Email division. We generate about 200GB/day of logs from postfix and dovecot installs, and hadoop with mapreduce allows us to pull all sorts of metrics and diagnostic information in very short timeframes. It helps our customer facing support reps, as well as allows us to give more demanding customers the statistics and metrics that they want, plus it helps us with capacity planning and a bunch of other stuff.

      And it's designed to run on commodity hardware.

      http://high [highscalability.com]

  • I think HBase 0.20 is being released today as well, with a new and much faster file format, better memory management and better availability.

  • by Anonymous Coward

    Java is slow. How could it possibly be used to process so much data?

    • Re: (Score:3, Informative)

      by dintech ( 998802 )

      Java has its faults, but being slow is no longer one of them. You should do some googling.

    • by mini me ( 132455 )

      Hence the need for a 10,000-machine Hadoop cluster to do the work of a single machine running a C++ application. Or something like that.

  • Yahoo! and OSS (Score:5, Insightful)

    by Alethes ( 533985 ) on Wednesday June 10, 2009 @06:40PM (#28286561)

    Yahoo! really does get a lot of flak around here, but I have to say, they have contributed quite a bit of free and open-source software for developers to use. The list of APIs and web services that are available is quite impressive, and many of them are better than Google's similar offerings (BOSS vs Google's AJAX search, for example). For anybody who's interested, I really recommend checking out the Yahoo! Developer Network [yahoo.com] site.

    • Re: (Score:3, Interesting)

      by linguizic ( 806996 )
      THANK YOU!!!! I have found YDN enormously useful.

      It's also worth noting that Yahoo has made major contributions to PHP as Rasmus is a Yahoo himself.
      • Comment removed (Score:5, Interesting)

        by account_deleted ( 4530225 ) on Wednesday June 10, 2009 @08:50PM (#28287737)
        Comment removed based on user account deletion
        • by sznupi ( 719324 )

          From your description and from trying it out I seem to have an impression that it's not really different from Google Search features. Have you tried it semi-recently? It also has autocompletion/suggestion and related searches.

          • Comment removed based on user account deletion
            • by micheas ( 231635 )

              I would assume that yahoo tracks your search history so they can give you semi personalized results, just like google does.

              This would result in the search engine that you frequently use normally giving you better results.

              It also explains why sometimes when you cannot find something with google or yahoo changing to the search engine that you infrequently use gets the result easier, as you are getting a more generic less personalized search.

              • Comment removed based on user account deletion
                • by micheas ( 231635 )

                  Google tracks search terms by IP address, ostensibly so that they can customize search results.

                  It would make sense for yahoo to do the same thing.

                  The opt out is only for ad tracking, and you need a cookie to make it stick, which implies that they do the same ip address tracking of search queries as google.

                  For example, assume that everyone at the San Jose Earthquakes (a professional soccer team in the USA) offices uses google, it would make sense for the search engine to learn that the term football

            • I just did as a matter of fact, and IMHO it just isn't as good. The related searches area is at the bottom

              That depends on how you use Google Search. If you have the Search Options on, and have "Related searches" selected, the related searches area is at the bottom.

              For example when I looked up "The Dark Knight" in Yahoo under the "more/related concepts" tab I found really good interviews with the cast, the director, and a nice overview of Heath Ledger's film career. While I might have found the Gary Oldman i

    • Re: (Score:1, Funny)

      by Anonymous Coward

      Well I! love Yahoo! because I! believe that all proper nouns, as well as first person pronouns, should be followed by an exclamation mark. Imagine if we! all followed suit -- there would be Google! and Microsoft! and Linux! And Slashdot! The world would be a much more exciting place!

  • by Eric Smith ( 4379 ) on Wednesday June 10, 2009 @08:57PM (#28287795) Homepage Journal
    Does the world need another Hadoop distribution? In a case like this, isn't a "distribution" just a fork going by a different name that has a more positive connotation? Is there some good reason they did it this way rather than just pushing their changes upstream to Apache? Did Apache not want them?

    I'll admit to knowing basically nothing about Hadoop, but if I saw the same article with "Hadoop" replaced by "GCC", "Postfix", or "OpenOffice", I wouldn't see it as being a good thing.

    • Re: (Score:3, Informative)

      by shadow42 ( 996367 )
      As far as I can tell, the distribution Yahoo is offering is just the vanilla Hadoop, but with Yahoo's patches on top of it. Yahoo is very involved in Hadoop's development (the project's founder is now employed by them), so a lot of their patches get incorporated back into Hadoop's source tree. Most of the changes Yahoo made are just performance/stability patches that haven't been incorporated into an official release yet. You could probably get the same distribution just by grabbing SVN trunk.
    • Re: (Score:3, Interesting)

      by linguizic ( 806996 )
      Does the world need another Linux distribution? The folks at Ubuntu thought so, and they've made an indelible mark on Linux. Just like Yahoo! is doing with Hadoop [slashdot.org].
  • Usually I only need to google new technology terms that I haven't heard before. Today I had to google vociferous. I was thinking it sounded like a condition that you need to take Levitra for. It didn't really make sense in the sentence but thinking about Yahoo suffering from erectile dysfunction has its own childish humor when you're on your 5th beer.
  • I've evaluated Hadoop (and Cloudbase, HBase and a few other things) for transaction log mining purposes and found it to be VERY inefficient. Basically, if your machine has a decent RAID array (by "decent" I mean 500-700MB/sec linear read throughput, and 300-500MB/sec write throughput), you will need 12-15 8 core Hadoop boxes to even come close to a single machine's performance. This, IMO, is fucked up. I expected it would be much more efficient than it is.

    Therefore, my conclusion was that Hadoop only makes

    • You should try it on the cloud. Amazon's EC2 crunching data from S3.
      There are many case studies available. Yup, it is no silver bullet, but it has its uses.
    • So which do you prefer using performance wise then...of all the ones you tested...?
      Just asking

      • by melted ( 227442 )

        They aren't comparable. Cloudbase is a simple SQL layer on top of Hadoop that operates on flat files. HBase is an open source BigTable (i.e. you can't really do SQL with it). Hive is kind of like Cloudbase. In the end all of these systems have a common strength/weakness - Hadoop. The strength is that they scale, if you're willing to pay, the weakness is that their scalability seems to be piss poor on smaller clusters.

        In the end we went with bare Hadoop operating directly on LZO compressed log chunks. Log ch
