Tag Archives: Cloudera

Mar. 25, 2011 SDF Next Wave Analytics

On March 25, 2011 in Palo Alto at Stanford University, SDForum held “Analytics – The Next Wave.” The premise was that the sheer volume of data now being collected will force us to rethink how we analyze it and make decisions.

The first panel discussion was on “New Sources of Data and Usage.” John Fraser of Accenture moderated panelists Ralph Clark of ShotSpotter, Zia Yusuf of Streetline and Ilen Zazueta-Hall of Enphase Energy. The morning fireside chat was with Tom Peck of Levi Strauss and Sanjay Poonen of SAP. The second panel discussion was on “New Deployment and Data Architecture Models.” Eileen Boerger of Agilis Solutions moderated panelists Scott Burke of Yahoo, Anant Jhingran of IBM and Oliver Ratzesberger of eBay. Exhibitors attending were Agilis, Cloudera, IBM, Karmasphere and SAP.

The afternoon fireside chat was with Simon Khalaf of Flurry and Sharon Wienbar of Scale Venture Partners. The third panel discussion was on “The Investor Perspective.” Harold Yu of Orrick moderated panelists Asheem Chandna of Greylock Partners, Vispi Daver of Sierra Ventures and Lars Leckie of Hummer Winblad Venture Partners. Eric Peterson delivered the afternoon keynote on “Web Analytics Demystified.” The fourth panel discussion was on “Consumer Targeting.” Ted Shelton of Open-First moderated panelists Jim Dai of CalmSea, Phil Davis of RapLeaf, Josh McFarland of TellApart and Dennis Yu of BlitzLocal. Bill Schlough of the San Francisco Giants gave the closing keynote.

Note: Max Darby of BlitzLocal says that Dennis Yu is not affiliated with Webtrends.

Copyright 2011 DJ Cline All rights reserved.

Oct. 26, 2010 SDF Membase NoSQL Cloud Solution

On October 26, 2010 in Menlo Park at Orrick, the SDForum Cloud Services SIG hosted Matt Ingenthron of Membase, who presented “Scale out NoSQL Solution for the Cloud.”

Apps are moving from the desktop to the cloud, with millions of users expecting instantaneous response. Membase offers new techniques to store and retrieve data in the NoSQL space: a simple, fast, elastic key-value database based on the memcached engine interface and compatible with existing memcached clients and applications. It is open source under an Apache 2.0 license and can scale in public or private clouds.

Memcached was originally used as a distributed hashtable to speed up access to authoritative data. Membase can scale linearly by adding nodes without losing access to data, and it still maintains consistency when the data is accessed. The scaling flattens out the cost and performance curves that would make a traditional DBMS grind to a halt. This proved crucial when supporting such social games as FarmVille and Mafia Wars. The Membase server database can handle 500,000 ops per second.

Ingenthron wrapped up by discussing Project Arcus, Moxi (memcached proxy), vBucket mapping, and clustering. He mentioned their partnership with Cloudera using Sqoop and Flume with Hadoop. Membase offers the distributed OLTP solution and Cloudera offers the distributed OLAP solution. As an example, targeting an ad to a particular user, which might normally take 40 milliseconds, could take only one millisecond. This could make a big difference when traveling across a crowded mobile network.
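
The talk mentioned vBucket mapping and adding nodes without losing access to data. The sketch below illustrates the general idea only and is not Membase's code: each key hashes to one of a fixed number of virtual buckets, and a separate table maps buckets to nodes. The bucket count, node names, hash choice and rebalancing rule are all assumptions made for illustration.

```python
import zlib

NUM_VBUCKETS = 64  # hypothetical value; fixed for the life of the cluster


def vbucket_for(key: str) -> int:
    """Hash a key to a virtual bucket. This never changes when nodes change."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_map(nodes):
    """Assign virtual buckets to nodes round-robin."""
    return [nodes[i % len(nodes)] for i in range(NUM_VBUCKETS)]


def node_for(key, vbucket_map):
    """Look up the node that currently owns a key's bucket."""
    return vbucket_map[vbucket_for(key)]


def add_node(vbucket_map, new_node):
    """Return a new map where new_node takes a fair share of buckets.

    Only the buckets handed to new_node change owner; every other
    bucket, and so every key in it, stays where it was.
    """
    nodes = sorted(set(vbucket_map)) + [new_node]
    target = NUM_VBUCKETS // len(nodes)
    new_map = list(vbucket_map)
    counts = {n: new_map.count(n) for n in nodes}
    for i, owner in enumerate(new_map):
        if counts[new_node] >= target:
            break
        if counts[owner] > target:
            new_map[i] = new_node
            counts[owner] -= 1
            counts[new_node] += 1
    return new_map


old_map = build_map(["node-a", "node-b"])
new_map = add_node(old_map, "node-c")
moved = [i for i in range(NUM_VBUCKETS) if old_map[i] != new_map[i]]
print(len(moved), "of", NUM_VBUCKETS, "buckets moved to the new node")
```

Because clients hash to a bucket rather than directly to a server, adding a node only requires moving some buckets and publishing an updated table; keys in the untouched buckets are served from the same node as before.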

Copyright 2010 DJ Cline All rights reserved.

Sept. 22, 2010 SDF Hadoop Scaling

On September 22, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Ken Krugler of Bixo Labs, who presented “Thinking at Scale with Hadoop.” In the past Krugler has worked with Nutch, Lucene and Solr. Now he uses Bixo, Hadoop, and Cascading, an open source project to speed up large scale web mining and analytics.

When working with Hadoop, traditional SQL developers should think in the map-reduce paradigm of fields/operations/pipes. The logical architecture is a computing cluster providing storage and execution. The execution is divide and conquer. The process steps are map, shuffle and reduce. A value is a set of fields. Key-value pairs are exchanged between the user-defined map and reduce algorithms. The map translates the input to keys and values. The system groups each unique key with all its values. The reducer translates the values of each unique key to new keys and values. Because map and reduce functions only care about the current key and value, any number of mappers or reducers can run on any number of nodes.
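
The map, shuffle and reduce steps described above can be sketched in a few lines of plain Python. This is a single-process illustration and not Hadoop code; the word-count task, the function names and the sample input are assumptions chosen for the example.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: translate one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group each unique key with all of its values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: translate one key and its values into new (key, value) pairs."""
    yield key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over (key, value) records."""
    mapped = [pair for key, value in records for pair in map_fn(key, value)]
    grouped = shuffle(mapped)
    return dict(
        pair
        for key, values in sorted(grouped.items())
        for pair in reduce_fn(key, values)
    )


# Hypothetical input: the record key is the line number, the value is the text.
lines = ["big data big cluster", "data moves to the cluster"]
counts = run_job(enumerate(lines))
print(counts)
```

Neither `map_fn` nor `reduce_fn` looks at anything beyond the key and value it is handed, which is what lets a real cluster run many copies of each on different nodes at once.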

Complex workflows require multiple map reduce jobs. Errors can occur when connecting and synchronizing data between them. Because the intent of a workflow is hard to see in the low-level details, MR optimization becomes harder. There are ways to deal with this complexity. He thinks you should model the problem as a workflow with operations on records. Cascading has tuples, pipes and operations; FlumeJava has classes and deferred execution. Pig and Hive treat it like a big database and are good for query-centric problems. Datameer (DAS) and BigSheets treat it like a big spreadsheet and are good for easier analytics.

There are experts in Hadoop and experts in SQL database development. It is nearly impossible to find someone who is an expert at both. Find an expert SQL database developer who wants to add some new tools to their toolbox. To get those tools, Krugler recommended the following Hadoop references:

Mailing list:

common-user@hadoop.apache.org

Book:

Tom White’s “Hadoop: The Definitive Guide”

Training:

Scale Unlimited: http://scaleunlimited.com/courses

Cloudera: http://www.cloudera.com/hadoop-training

Cascading: http://www.cascading.org

DAS: http://www.datameer.com

Copyright 2010 DJ Cline All rights reserved.

Apr. 9, 2010 SDF Analytics Revolution

On Friday, April 9, 2010 in Mountain View at the Microsoft Auditorium, SDForum held “The Analytics Revolution Conference.” The fact that you can now do large-scale analytics changes the way you model and run your company.

Dan’l Lewin of Microsoft gave the welcome and introduced the opening keynote speaker, Ronny Kohavi of Microsoft, formerly of Amazon. Kohavi presented “Online Controlled Experiments: Listening to the Customers, not to the HiPPO.” The Highest Paid Person in an Organization is a HiPPO, and while they may sign the paychecks, it is the customer who sends them the money. If you don’t properly analyze the data you will miss important cues that drive more sales. Ask what you are optimizing for.

David Steier of PricewaterhouseCoopers moderated panelists DJ Patil of LinkedIn, Ken Rudin of Zynga, Neel Sundaresan of eBay and Kevin Weil of Twitter. They discussed “Competing on Analytics at the Highest Level.” The demand for professionals with solid database development skills is increasing. Look for people with experience in Oracle data warehousing, SQL, Cloudera, Vertica, Tableau, Hadoop, Pig and memcached. Start budgeting and being very nice to the database people you hire.

Sanjay Poonen of SAP gave the second keynote presentation “Leading the Analytics Revolution.” You can now do analytics from mobile devices like the iPhone using SAP apps.

Owen Thomas of VentureBeat moderated panelists Amr Awadallah of Cloudera, Joshua Klahr of Yahoo, James Phillips of NorthScale and Joydeep Sen Sarma of Facebook. They discussed “Analyzing Big Data.” Cloud computing frees you from poorly structured datasets tied to old hardware. Learn Hadoop and MapReduce to process big data, awesome data and stupendous amounts of data.

Before and during lunch there were short pitches from exhibitors and startups like Karmasphere, Accept Software, Agilis Solutions, Aster Data, CTPartners, Dyyno, Execustaff, IBM, KXEN, Medallia and MergerTech.

Peter Norvig of Google gave the third keynote presentation “The Unreasonable Effectiveness of Data.” Believe it or not, more data means better results. The closer two points are to each other, the more likely they might share the same characteristics. The original picture of Mona Lisa will be at the center of a cluster.

Brett Sheppard of BigDataNews.com moderated panelists Jie Cheng of Acxiom, Vispi Daver of Sierra Ventures, Peter Farago of Flurry, Tom McLaughlin of Accept Software and Jeff Minich of CalmSea. They discussed “New Frontiers for Analytics.” The breakthroughs in analytics are speeding up business cycles.

Jeff Kreulen of IBM gave the fourth keynote presentation “Analytics: An Applied Researcher’s Perspective.”

Harold Yu of Orrick, Herrington & Sutcliffe LLP moderated panelists Stacey Curry Bishop of Scale Ventures, Asheem Chandna of Greylock, Kevin Efrusy of Accel, Sumeet Jain of CMEA and Lars Leckie of Hummer Winblad. They discussed “The Investor Perspective.” They don’t want to invest in anything that will quickly become a generic commodity. Companies want more than a small incremental lift. They want analytics to give them a dramatic change in the way they do business.

Jaap Suermondt of HP Labs gave the fifth keynote presentation “Research in Analytics for Operational Impact at HP.” A commitment to R&D at HP is producing clear improvements to everyday operations.

Video of the conference can be seen at:

www.dyyno.com/sdforum

Copyright 2010 DJ Cline All rights reserved.