Tag Archives: Hadoop

Aug. 7, 2013 Hive Hadoop MapReduce And SQL


On Wednesday, August 7, 2013 in Sunnyvale at NetApp, The Hive held an event discussing Big Data, Hadoop, Hive, MapReduce, Pig and SQL. Raghu Ramakrishnan of Microsoft moderated panelists Justin Erickson of Cloudera, Alan Gates of Hortonworks, Sausheel Kaushik of Pivotal, Priyank Patel of Teradata Aster, and Tomer Shiran of MapR. Grabbing large batches of data with MapReduce is fine, but businesses still want SQL for interactive and real-time queries. The result will be a hybrid of new and old strategies. The best strategy is to hire an experienced SQL developer with a strong ETL background and let them learn the new tools. You will get the information you need when you need it.


Copyright 2013 DJ Cline All rights reserved.


Jun. 7, 2013 Big Data Cloud


On June 7, 2013 in Mountain View, the Computer History Museum hosted the Big Data Cloud event. Speakers were Tom Siebel of C3 Energy, Mayank Bawa of Teradata Aster, Sourabh Satish of Symantec, Milind Bhandarkar of Pivotal, Dr. Konstantin (Cos) Boudnik of WANdisco, Gail Ennis, CEO of Karmasphere, Nanda Vijaydev of Karmasphere, Bruno Aziza, Vice President of Sisense, Jim Blomo of Yelp, David P. Mariani of Klout, Ken Rudin of Facebook, Vikram Makhija of Ayasdi, Chirag Mehta, Ed Abbo of C3 Energy, John Steinberg of EcoFactor, Amit Narayan of AutoGrid, Aparajeeta Das, TM Ravi of Hive Data, Deepak Kamra of Canaan Partners, Dharmesh Thakker of Intel Capital and Jishnu Bhattacharjee of Nexus Venture Partners. Topics included Hadoop, smart grid, security, Topological Data Analysis (TDA) and, of course, how to make money.


Copyright 2013 DJ Cline All rights reserved.

Feb. 21, 2012 SVForum Pervasive Database Decisions

On February 21, 2012 in Palo Alto at SAP, SVForum’s Business Intelligence SIG Chair Corrinne Kahler introduced John Akred of Accenture. His topic was “Pervasive Data-Based Decisions.” Despite the talk of SQL versus NoSQL, the rise of big data does not mean the end of relational databases. Experts recognize the two worlds must coexist. Structured or unstructured, data is still data and needs to be turned into useful information. Big data is like the ocean: there is a lot of water, but it must be processed before you can drink any of it. Now there will be even more demand for database ETL professionals willing to dive into it.

Big data will be incorporated into existing data structures, adding more value through better context. GPS tracking data on delivery trucks gives insight into employee productivity and customer satisfaction. Tracking customer behavior informs how they make purchasing decisions.

Akred thinks people need to understand the difference between geometric and linear scalability. Most relational databases scale geometrically: the cost of processing and storage grows faster than the data itself. With big data technologies like Hadoop, Cassandra or Amazon’s DynamoDB, the first terabyte of storage and processing power costs the same as the last. Tools created by Aster Data, Greenplum and Microsoft SQL Azure can then bring this linear scalability to the relational world.
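The difference between the two cost curves can be illustrated with made-up numbers (the functions below are hypothetical illustrations, not vendor pricing):

```python
# Illustrative cost model with hypothetical numbers: a geometric curve
# grows with the square of the data size, while a linear curve adds the
# same increment for every terabyte.

def geometric_cost(terabytes, base=1000):
    # Each additional terabyte costs more than the one before it.
    return base * terabytes ** 2

def linear_cost(terabytes, per_tb=1000):
    # The first terabyte costs the same as the last.
    return per_tb * terabytes

for tb in (1, 10, 100):
    print(tb, geometric_cost(tb), linear_cost(tb))
```

At 100 TB the geometric system's marginal terabyte is far more expensive than its first, while the linear system's marginal cost never changes, which is the point Akred was making.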

In short, companies are building better funnels for the fire hose of data heading your way.

Copyright 2012 DJ Cline All rights reserved.

Feb. 28, 2011 Silicon Valley Jobs

Two weeks ago President Obama had dinner with the titans of Silicon Valley. One of the topics was employment. Since then I have heard from many recruiters and companies looking for talent. I am hearing of companies wanting to hire thousands of workers. Of course, this is after three years of letting thousands of people go and creating 12 percent unemployment, so a lot more people will have to be hired to signal a sustained recovery. Below are a few of the job events I have covered so far.

On Wednesday February 23, 2011 in Mountain View at Fenwick and West, Tanya Okmyansky of the Jewish High Tech Community hosted Scott Gilfoil of Apple and Brian Curtin of ICG, who talked about the kind of talent they were looking for: people with current skills. On that same night at the Computer History Museum, I met with Christen Kent of HP. Their Par3 division held a job fair looking for data storage experts.

On Feb. 25, 2011, in Atherton at the Silicon Valley STC meeting, Andrew Davis of Content Rules spoke about how the brakes have come off hiring. You should demonstrate current skills in an online portfolio that quickly explains your role in solving a client’s problems. Writers who can show strong software development documentation can get more money. Volunteering with a professional organization is a great way to show how you have kept your skills and helped your community. Build a standalone website that can show your portfolio. Join social media sites like LinkedIn.

I also talked with ACM chair Greg Weinstein, an experienced Silicon Valley hand who confirmed this pent-up hiring trend. Employers now regularly attend ACM meetings looking for talent.

Mobile technology and social media are driving this, but it is not a free for all like the dotcom boom. Companies are looking for recent experience and advanced degrees. They want people who are already here. They do not want to pay relocation so don’t pack your bags. Candidates should be US citizens or have the right to work in the US.

If this is a sustained recovery, I would hope that smart companies would see the potential in all sorts of people and not stick to a strict shopping list or stereotype. Anyone who has been looking for work over the past three years and is still living in Silicon Valley has got to be pretty clever at surviving, and that might be useful for your company. I’ve already run into one clueless (and nameless) recruiter who wanted someone with ten years of experience with Hadoop. I told them that since Hadoop was not that old, the only qualified candidate would drive a DeLorean to work.

Copyright 2011 DJ Cline All rights reserved.

Feb. 23, 2011 SDF Yahoo Lei Tang Hadoop

On February 23, 2011 in Mountain View at LinkedIn, the SDForum Software Architecture and Platform SIG chairs Waiming Mok and Megha Chawla hosted Lei Tang of Yahoo! Labs. Tang is the author of “Community Detection and Mining in Social Media.” His presentation this evening was “Large-Scale Community Detection for Social Computing.” He described how a social network analyst could use new tools and techniques like Hadoop to discover and target virtual communities.

The slides of his presentation are available here:

Copyright 2011 DJ Cline All rights reserved.


Hadoop: The Definitive Guide

By Tom White

This second edition confirms this book’s place as the official textbook for Hadoop. It is for stone-cold coders. As a matter of fact, I would not venture into Hadoop without this book and an experienced Oracle developer.

Copyright 2011 DJ Cline All rights reserved.

Oct. 27, 2010 SDF LinkedIn Avatara

On October 27, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Chris Riccomini, Senior Data Scientist on LinkedIn’s Product Analytics team to talk about “Scalable Analytical Processing with LinkedIn’s Avatara.”

Formerly of PayPal, Riccomini worked on LinkedIn’s “People You May Know” feature and “Who’s Viewed My Profile” product. He talked about the competing priorities of high throughput and low latency, and how Hadoop and Voldemort address them. LinkedIn needed something that would support offline aggregation, event-time aggregation and query-time aggregation. It had to run through a map/reduce shared interface to power customer-facing data products. He described a layered structure from top to bottom. On the top layer is the engine. Below that is the cube, and then the cube query. At the bottom are three elements: transform, aggregator and comparator.

The result was Avatara, a real-time, scalable Online Analytical Processing (OLAP) system already in production. It has features like select, count, where, group, having, order and limit. Riccomini described the architecture and implementation of its storage and aggregation layers.
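Those OLAP-style operations can be sketched in a few lines. This is a toy in Python, not Avatara’s actual API, and the field names are invented:

```python
from collections import defaultdict

# Toy illustration of where / group / count / having / order / limit
# applied to a list of records. Not Avatara's API; data is invented.
def query(rows, where, group_key, having, limit):
    groups = defaultdict(int)
    for row in rows:
        if where(row):                   # WHERE: filter rows
            groups[row[group_key]] += 1  # GROUP BY + COUNT
    result = [(k, n) for k, n in groups.items() if having(n)]  # HAVING
    result.sort(key=lambda kv: kv[1], reverse=True)  # ORDER BY count DESC
    return result[:limit]                            # LIMIT

views = [{"viewer": "a", "industry": "tech"},
         {"viewer": "b", "industry": "tech"},
         {"viewer": "c", "industry": "finance"}]
print(query(views, lambda r: True, "industry", lambda n: n >= 1, 10))
```

The real system runs these operations against precomputed cubes rather than raw rows, which is what makes it fast enough for customer-facing products.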

One new term I heard was AvatarSQLishBuilder. Apparently, even in a NoSQL environment, the code should still have the look and structure of SQL. My advice for anyone heading into Hadoop territory is to take an experienced SQL database developer with you. Java is not enough in this Wild West show.

Another new term is Yahoo Cloud Serving Benchmark (YCSB). This is a way to compare various cloud products. I thought they were talking about yogurt. More explanation is at:

Richard Taylor was there and has written a splendid article about it at:


Copyright 2010 DJ Cline All rights reserved.

Oct. 26, 2010 SDF Membase NoSQL Cloud Solution

On October 26, 2010 in Menlo Park at Orrick, the SDForum Cloud Services SIG hosted Matt Ingenthron of Membase for his presentation “Scale out NoSQL Solution for the Cloud.”

Apps are moving from the desktop to the cloud, with millions of users expecting instantaneous response. Membase offers new techniques to store and retrieve data in the NoSQL space. They offer a simple, fast, elastic key-value database based on the memcached engine interface, compatible with existing memcached clients and applications. It is open source under an Apache 2.0 license and can scale in public or private clouds.

It was originally used to speed up access to authoritative data as a distributed hashtable. It can scale linearly by adding nodes without losing access to data, and it maintains consistency on access. The scaling flattens out the cost and performance curves that would make a traditional DBMS grind to a halt. This proved crucial when supporting such social games as Farmville and Mafia Wars. The Membase server database can handle 500,000 ops per second.

Ingenthron wrapped up by discussing Project Arcus, Moxi (a memcached proxy), vBucket mapping, and clustering. He mentioned their partnership with Cloudera, using Sqoop and Flume with Hadoop. Membase offers the distributed OLTP solution and Cloudera offers the distributed OLAP solution. As an example, serving a targeted ad to a particular user that might normally take 40 milliseconds could take only one millisecond. This could make a big difference when traveling across a crowded mobile network.
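The vBucket idea can be sketched as follows. This is an assumed, simplified model, not Membase’s implementation: keys hash into a fixed number of virtual buckets, and a table maps each vBucket to a node, so adding a node means reassigning some vBuckets rather than rehashing every key.

```python
import hashlib

# Minimal vBucket-mapping sketch (assumed details, not Membase code).
NUM_VBUCKETS = 64

def vbucket_for(key):
    # Hash the key into one of a fixed number of virtual buckets.
    digest = hashlib.md5(key.encode()).digest()
    return digest[0] % NUM_VBUCKETS

def build_map(nodes):
    # Round-robin assignment of vBuckets to server nodes.
    return {vb: nodes[vb % len(nodes)] for vb in range(NUM_VBUCKETS)}

nodes = ["server1", "server2"]
vbucket_map = build_map(nodes)
key = "user:1001"
print(vbucket_map[vbucket_for(key)])  # the node that owns this key
```

Because clients consult the vBucket table, the cluster can move buckets between nodes behind the scenes while keys keep hashing to the same vBucket.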

Copyright 2010 DJ Cline All rights reserved.

Sept. 22, 2010 SDF Hadoop Scaling

On September 22, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Ken Krugler of Bixo Labs for his presentation “Thinking at Scale with Hadoop.” In the past Krugler has worked with Nutch, Lucene and Solr. Now he uses Bixo, Hadoop, and Cascading, an open source project to speed up large-scale web mining and analytics.

When working with Hadoop, traditional SQL developers should think in the map-reduce paradigm of fields, operations and pipes. The logical architecture is a computing cluster providing storage and execution. The execution strategy is divide and conquer. The process steps are map, shuffle and reduce. A value is a set of fields. Key-value pairs are exchanged between the user-defined map and reduce algorithms. The map translates the input to keys and values. The system groups each unique key with all its values. The reducer translates the values of each unique key to new keys and values. Because the map and reduce functions only care about the current key and value, any number of mappers or reducers can be invoked on any number of nodes.
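The three steps above can be simulated in a few lines of single-process Python, with word count as the classic example. This toy stands in for a real Hadoop job:

```python
from collections import defaultdict

# Single-process simulation of map, shuffle and reduce.
def mapper(line):
    # Map: translate input into (key, value) pairs.
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group every unique key with all of its values.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: translate each key's values into a new (key, value).
    return key, sum(values)

lines = ["big data big deal", "big data"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```

Because mapper and reducer each see only one key and its values, Hadoop can run thousands of copies of them in parallel, which is exactly the point Krugler was making about divide and conquer.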

Complex workflows require multiple map-reduce jobs. Errors can occur when connecting and synchronizing data between them. Losing sight of the intent behind all the low-level data can make MR optimization harder. There are ways to deal with this complexity. He thinks you should model the problem as a workflow with operations on records. Cascading has tuples, pipes and operations; FlumeJava has classes and deferred execution. Pig and Hive treat it like a big database and are good for query-centric problems. Datameer (DAS) and BigSheets treat it like a big spreadsheet and are good for easier analytics.

There are experts in Hadoop and experts in SQL database development. It is nearly impossible to find someone who is an expert at both. Find an expert SQL database developer who wants to add some new tools to their toolbox. To get those tools, Krugler recommended the following Hadoop references:

Mailing list:



Tom White’s “Hadoop: The Definitive Guide”


Scale Unlimited: http://scaleunlimited.com/courses

Cloudera: http://www.cloudera.com/hadoop-training

Cascading: http://www.cascading.org

DAS: http://www.datameer.com

Copyright 2010 DJ Cline All rights reserved.

Apr. 9, 2010 SDF Analytics Revolution


On Friday April 9, 2010 in Mountain View at the Microsoft Auditorium, SDForum held “The Analytics Revolution Conference.” The fact that you can now do large-scale analytics changes the way you model and run your company.

Dan’l Lewin of Microsoft gave the welcome and introduced the opening keynote speaker Ronny Kohavi of Microsoft, formerly of Amazon. Kohavi’s presentation was “Online Controlled Experiments: Listening to the Customers, not to the HiPPO.” The Highest Paid Person in an Organization is a HiPPO, and while they may sign the paychecks, it is the customer who sends the money. If you don’t properly analyze the data you will miss important cues that drive more sales. Ask what you are optimizing for.

David Steier of PricewaterhouseCoopers moderated panelists DJ Patil of LinkedIn, Ken Rudin of Zynga, Neel Sundaresan of eBay and Kevin Weil of Twitter. They discussed “Competing on Analytics at the Highest Level.” The demand for professionals with solid database development skills is increasing. Look for people with experience in Oracle data warehousing, SQL, Cloudera, Vertica, Tableau, Hadoop, Pig and memcached. Start budgeting and being very nice to the database people you hire.

Sanjay Poonen of SAP gave the second keynote presentation “Leading the Analytics Revolution.” You can now do analytics from mobile devices like the iPhone using SAP apps.

Owen Thomas of VentureBeat moderated panelists Amr Awadallah of Cloudera, Joshua Klahr of Yahoo, James Phillips of NorthScale and Joydeep Sen Sarma of Facebook. They discussed “Analyzing Big Data.” Cloud computing frees you from poorly structured datasets tied to old hardware. Learn Hadoop and MapReduce to process big data, awesome data and stupendous amounts of data.

Before and during lunch there were short pitches from exhibitors and startups like Karmasphere, Accept Software, Agilis Solutions, Aster Data, CTPartners, Dyyno, Execustaff, IBM, KXEN, Medallia and MergerTech.

Peter Norvig of Google gave the third keynote presentation “The Unreasonable Effectiveness of Data.” Believe it or not, more data means better results. The closer two points are to each other, the more likely they might share the same characteristics. The original picture of Mona Lisa will be at the center of a cluster.
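Norvig’s point about nearby points sharing characteristics is the nearest-neighbor idea, which a few lines can sketch (the points and labels below are invented for illustration):

```python
import math

# Minimal nearest-neighbor sketch: a new point is assumed to share the
# label of the closest known point. Points and labels are made up.
def nearest_label(point, labeled_points):
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    closest = min(labeled_points, key=lambda lp: dist(point, lp[0]))
    return closest[1]

known = [((0, 0), "cluster A"), ((10, 10), "cluster B")]
print(nearest_label((1, 2), known))  # → cluster A
```

With more labeled data, every new point has a closer neighbor to borrow a label from, which is one concrete reading of “more data means better results.”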

Brett Sheppard of BigDataNews.com moderated panelists Jie Cheng of Acxiom, Vispi Daver of Sierra Ventures, Peter Farago of Flurry, Tom McLaughlin of Accept Software and Jeff Minich of CalmSea. They discussed “New Frontiers for Analytics.” The breakthroughs in analytics are speeding up business cycles.

Jeff Kreulen of IBM gave the fourth keynote presentation “Analytics: An Applied Researcher’s Perspective.”

Harold Yu of Orrick, Herrington & Sutcliffe LLP moderated panelists Stacey Curry Bishop of Scale Ventures, Asheem Chandna of Greylock, Kevin Efrusy of Accel, Sumeet Jain of CMEA and Lars Leckie of Hummer Winblad. They discussed “The Investor Perspective.” They don’t want to invest in anything that will quickly become a generic commodity. Companies want more than a small incremental lift. They want analytics to give them a dramatic change in the way they do business.

Jaap Suermondt of HP Labs gave the fifth keynote presentation “Research in Analytics for Operational Impact at HP.” A commitment to R&D at HP is producing clear improvements to everyday operations.


Video of the conference can be seen at:


Copyright 2010 DJ Cline All rights reserved.

Aug. 26, 2009 SDF LinkedIn Voldemort


On August 26, 2009 in Palo Alto, the SDForum SAM SIG hosted LinkedIn engineers Bhupesh Bansal and Jay Kreps to present “Project Voldemort: Scalable Fault Tolerant Distributed Storage at LinkedIn.”

LinkedIn takes web-scale computing to extremes. They store and manage high read/write loads on massive data sets. Their applications need high scalability and performance, but not necessarily the features seen in relational databases. Project Voldemort is a distributed, highly scalable key-value storage system based on the Amazon Dynamo project, using Hadoop and Pig. They talked about its architecture, its design choices and its future serving many data-intensive applications. They want to build a community to solve the huge challenges they face. (I recommend using LinkedIn to find them.)
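The Dynamo-style fault tolerance behind a store like this can be sketched in miniature. This is a toy with assumed behavior, not Project Voldemort’s code: each key lives on N successive nodes of a hash ring, so losing one node does not lose the data.

```python
import hashlib

# Toy Dynamo-style replication sketch (assumed behavior, not Voldemort).
N_REPLICAS = 2

def preference_list(key, nodes):
    # The node the key hashes to, plus the next N-1 nodes on the ring.
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(N_REPLICAS)]

def put(stores, key, value, nodes):
    # Write the value to every replica.
    for node in preference_list(key, nodes):
        stores[node][key] = value

def get(stores, key, nodes):
    # Read from the first replica that still holds the key.
    for node in preference_list(key, nodes):
        if key in stores[node]:
            return stores[node][key]

nodes = ["n0", "n1", "n2"]
stores = {n: {} for n in nodes}
put(stores, "member:42", {"name": "Ada"}, nodes)
print(get(stores, "member:42", nodes))
```

If the first replica drops out, the read quietly falls through to the second, which is the trade the team described: availability and scale instead of relational features.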

I think the use of the name Voldemort is a sign of a new generation taking over the nomenclature from retiring nerds who used names from Star Trek and Star Wars. I just hope they are not hoping for a magic solution. :-)

At one point they discussed a process that takes several days to run. That needs fixing pronto. I can recommend a good database expert who might be able to tune that down to a few seconds.

For a more detailed analysis I suggest Richard Taylor’s post at: http://bandb.blogspot.com/2009/09/project-voldemort.html


Copyright 2009 DJ Cline All rights reserved.

Aug. 18, 2009 SDF Business Intelligence in the Cloud


On August 18, 2009 in Palo Alto at SAP, SDForum presented “Cutting Edge Business Intelligence in the Cloud” with Lenin Gali of ShareThis. ShareThis has a widget that allows people to share what they find on the web with others on their social network. It doesn’t matter if it is Facebook, Twitter, MySpace, or LinkedIn. Their clients include Fox Media, UsMagazine, Wired, ESPN, and movies.com. They built their IT on Amazon EC2, Cascading, Hadoop, Hive and MicroStrategy. They use Aster Data for their data warehouse.

If you come from a traditional database IT background, I guarantee that you have never seen an operation like this. Cascading is the processing API for Hadoop clusters. There are pipes, flows, branches and groups. You get event notification, can write scripts and control it at the tuple level. Hive is the data warehouse built on top of Hadoop. It supports non-complex SQL using HQL. You can build custom map/reduce jobs for complex analytics. You can still make ad hoc queries against large data sets. The Aster Data DW in the cloud is scalable commodity hardware with a Massively Parallel Processing (MPP) architecture. It uses SQL, map/reduce, JDBC and ODBC, and is compatible with Extract, Transform and Load (ETL) tools. The Aster Data architecture uses PostgreSQL and has a beehive hierarchy. Queens control the cluster and hold metadata while workers process and store it. If a queen fails it is replaced immediately.
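To give a feel for what a simple HQL query becomes on Hadoop, here is a rough single-machine illustration. The query, table and data are hypothetical, and real Hive generates far more machinery than this:

```python
from collections import defaultdict

# Rough illustration of what a hypothetical HQL query,
#   SELECT site, COUNT(*) FROM shares GROUP BY site;
# compiles down to: a map phase emitting (group key, 1) pairs and a
# reduce phase summing them. Data is invented for the example.
shares = [{"site": "facebook", "url": "a"},
          {"site": "twitter", "url": "b"},
          {"site": "facebook", "url": "c"}]

def map_phase(rows):
    for row in rows:
        yield row["site"], 1          # emit (group key, 1)

def reduce_phase(pairs):
    totals = defaultdict(int)
    for site, one in pairs:
        totals[site] += one           # COUNT(*) per group
    return dict(totals)

print(reduce_phase(map_phase(shares)))
```

This is why Hive suits a SQL shop: analysts write the familiar GROUP BY, and the cluster handles the map/reduce plumbing.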

They think that all of this is easier to use and lowers their costs. They keep their headcount down and their revenue up. It works for them. The question is whether it will work elsewhere.


Copyright 2009 DJ Cline All rights reserved.

Apr. 21, 2009 SDF Apache Mahout


On April 21, 2009 at SAP in Palo Alto, SDForum’s Business Intelligence SIG hosted “BI Over Petabytes: Meet Apache Mahout” by Jeff Eastman. Suzanne Hoffman of Star Analytics talked about what she learned at the Gartner conference. Performance management is making a comeback as people try to make better use of the information they may already have. The leaders in BI are IBM Cognos, Microsoft and Oracle. One visionary is TIBCO.

Eastman thinks machine learning is a subfield of artificial intelligence concerned with algorithms that optimize computer performance. It is used in search clustering, knowledge management, mapping social networks, transforming taxonomies, analyzing markets, filtering unwanted e-mail and detecting fraud.

The Apache Mahout project is dedicated to the production of open source Machine Learning tools on the Apache Hadoop supercomputing platform orchestrating thousands of computers to analyze huge volumes of data in reasonable time. Mahout currently offers highly scalable programs for classifying (is this spam?), clustering (are these similar?), recommending (if you like X you might also like Y) and other tasks that can improve their performance by learning from past experiences. Coupled with cost-effective cloud computing infrastructures such as Amazon’s EC2/S3, this means that it is now practical for even small companies to distill Business Intelligence from Internet-sized datasets. The world needs scalable implementations of machine learning under open license and that is what Mahout aims to do.
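The clustering task Mahout handles at scale ("are these similar?") is often k-means. Mahout distributes it over Hadoop; the single-machine toy below, with invented points, shows only the algorithm's shape:

```python
import math
import random

# Miniature k-means clustering sketch. Mahout runs this distributed on
# Hadoop; this toy shows the algorithm's two repeated steps.
def kmeans(points, k, iterations=10, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: attach each point to its nearest center.
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:
                # Update step: recompute each center as the cluster mean.
                centers[i] = tuple(sum(d) / len(members)
                                   for d in zip(*members))
    return centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(kmeans(points, 2)))
```

In Mahout the assignment step becomes a map over the points and the update step a reduce per center, which is how the same loop scales to petabytes.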

Copyright 2009 DJ Cline All rights reserved.