Tag Archives: SDForum SAM SIG

Dec. 9, 2010 SDF Netflix AWS Cloud

On December 9, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG presented “Netflix is moving to AWS Cloud” with Hien Luu of Netflix. Luu gave a quick overview of the AWS services used at Netflix and how they fit into its cloud architecture. He talked about lessons learned and best practices when moving to the cloud, like instrumenting every step of a service request call. He recommends automating the deployment process, learning to deal with failure and network latency, and using exponential backoff and read/connect timeouts. Develop a performance strategy by rethinking storage with SimpleDB, S3 and RDS, and by embracing sharding and eventual consistency.
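Luu did not show code, but the retry pattern he described is straightforward. Here is a minimal Java sketch, with invented limits, of a call guarded by read/connect timeouts and exponential backoff:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class BackoffClient {
    // Limits are invented for illustration; tune per service.
    private static final int MAX_RETRIES = 5;
    private static final long BASE_DELAY_MS = 100;

    public static int fetchWithRetry(String urlString) throws Exception {
        IOException lastFailure = null;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(urlString).openConnection();
                conn.setConnectTimeout(1000); // fail fast if we cannot connect
                conn.setReadTimeout(2000);    // and if the server stalls mid-read
                return conn.getResponseCode();
            } catch (IOException e) {
                lastFailure = e;
                // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
                Thread.sleep(BASE_DELAY_MS << attempt);
            }
        }
        throw lastFailure; // all retries exhausted
    }
}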

Copyright 2010 DJ Cline All rights reserved.

Oct. 27, 2010 SDF LinkedIn Avatara

On October 27, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Chris Riccomini, Senior Data Scientist on LinkedIn’s Product Analytics team, to talk about “Scalable Analytical Processing with LinkedIn’s Avatara.”

Formerly of PayPal, Riccomini worked on LinkedIn’s “People You May Know” feature and “Who’s Viewed My Profile” product. He talked about the competing priorities of high throughput and low latency and the solution of Hadoop and Voldemort. LinkedIn needed something that would support offline aggregation, event-time aggregation and query-time aggregation. It had to run through a shared MapReduce interface to power customer-facing data products. He described a layered structure from top to bottom. On the top layer is the engine. Below that are the cube and the cube query. At the bottom are three elements: transform, aggregator and comparator.
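Riccomini did not show Avatara’s interfaces, so the following is a purely hypothetical Java sketch of how those three bottom-layer elements might be expressed; every name and signature here is invented for illustration:

// Hypothetical names and signatures, invented for illustration only.
interface Transform<IN, OUT> {
    OUT apply(IN record); // reshape a raw record into cube dimensions and measures
}

interface Aggregator<V> {
    V combine(V left, V right); // fold the values in a cell, e.g. a sum or a count
}

interface Comparator<V> extends java.util.Comparator<V> {
    // ordering used by order/limit style queries over aggregated cells
}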

The result was Avatara, a real-time, scalable Online Analytical Processing (OLAP) system already in production. It supports operations like select, count, where, group, having, order and limit. Riccomini described the architecture and implementation of its storage and aggregation layers.

One new term I heard was AvatarSQLishBuilder. Apparently, even in a NoSQL environment, the code should still have the look and structure of SQL. My advice for anyone heading into Hadoop territory is to take an experienced SQL database developer with you. Java is not enough in this Wild West show.
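The builder itself was not shown, so here is a purely hypothetical Java sketch of what a SQL-flavored fluent API over the operations listed above might look like; the class and method names are invented:

// Purely hypothetical; not Avatara's actual API.
public class CubeQueryExample {
    public static void main(String[] args) {
        String query = new SqlishBuilder()
                .select("viewer_industry", "count(*)")
                .where("member_id = 42")
                .groupBy("viewer_industry")
                .having("count(*) > 10")
                .orderBy("count(*) desc")
                .limit(20)
                .build();
        System.out.println(query);
    }
}

class SqlishBuilder {
    private final StringBuilder sb = new StringBuilder();

    SqlishBuilder select(String... cols) { return append("SELECT " + String.join(", ", cols)); }
    SqlishBuilder where(String cond)     { return append("WHERE " + cond); }
    SqlishBuilder groupBy(String col)    { return append("GROUP BY " + col); }
    SqlishBuilder having(String cond)    { return append("HAVING " + cond); }
    SqlishBuilder orderBy(String col)    { return append("ORDER BY " + col); }
    SqlishBuilder limit(int n)           { return append("LIMIT " + n); }

    private SqlishBuilder append(String clause) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(clause);
        return this;
    }

    String build() { return sb.toString(); }
}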

Another new term is Yahoo Cloud Serving Benchmark (YCSB). This is a way to compare various cloud products. I thought they were talking about yogurt. More explanation is at:
http://research.yahoo.com/node/3202

Richard Taylor was there and has written a splendid article about it at:

http://bandb.blogspot.com/2010/10/new-olap.html


Copyright 2010 DJ Cline All rights reserved.

Sept. 22, 2010 SDF Hadoop Scaling

On September 22, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Ken Krugler of Bixo Labs presenting “Thinking at Scale with Hadoop.” In the past Krugler has worked with Nutch, Lucene and Solr. Now he uses Bixo, an open source project that builds on Hadoop and Cascading to speed up large-scale web mining and analytics.

When working with Hadoop, traditional SQL developers should think in the map-reduce paradigm of fields, operations and pipes. The logical architecture is a computing cluster providing storage and execution. The execution model is divide and conquer. The processing steps are map, shuffle and reduce. A value is a set of fields. Key/value pairs are exchanged between the user-defined map and reduce algorithms. The map translates the input to keys and values. The system groups each unique key with all its values. The reducer translates the values of each unique key to new keys and values. Because the map and reduce functions only care about the current key and value, any number of mappers or reducers can run on any number of nodes.
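The canonical illustration is word counting. Here is a minimal sketch against Hadoop’s Java MapReduce API (essentially the standard WordCount example, trimmed): the mapper emits (word, 1) pairs, the shuffle groups them by word, and the reducer sums each group.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: translate each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the shuffle has grouped each unique word with all its
    // values, so the reducer just sums them into one total per word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}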

Complex workflows require multiple map-reduce jobs. Errors can occur when connecting and synchronizing data between them, and because the intent is buried in low-level plumbing, MR optimization is harder. There are ways to deal with this complexity. He thinks you should model the problem as a workflow with operations on records, as in the Cascading sketch below. Cascading has tuples, pipes and operations; FlumeJava has classes and deferred execution. Pig and Hive treat it like a big database and are good for query-centric problems. Datameer (DAS) and BigSheets treat it like a big spreadsheet and are good for easier analytics.
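For comparison with the raw MapReduce version above, here is roughly the same word count as a Cascading workflow, sketched against the Cascading 1.x API that was current at the time; details may vary by version:

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class CascadingWordCount {
    public static void main(String[] args) {
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // The workflow reads as operations on records, not as map/reduce jobs.
        Pipe pipe = new Pipe("wordcount");
        // Split each line into one "word" tuple per token.
        pipe = new Each(pipe, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        // Group tuples by word, then count each group.
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count());

        Flow flow = new FlowConnector().connect(source, sink, pipe);
        flow.complete(); // Cascading plans and runs the underlying MR jobs
    }
}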

There are experts in Hadoop and experts in SQL database development. It is nearly impossible to find someone who is an expert at both. Find an expert SQL database developer who wants to add some new tools to their toolbox. To get those tools, Krugler recommended the following Hadoop references:

Mailing list:

common-user@hadoop.apache.org

Book:

Tom White’s “Hadoop: The Definitive Guide”

Training:

Scale Unlimited: http://scaleunlimited.com/courses

Cloudera: http://www.cloudera.com/hadoop-training

Cascading: http://www.cascading.org

DAS: http://www.datameer.com

Copyright 2010 DJ Cline All rights reserved.

May 26, 2010 SDF kaChing

On Wednesday May 26, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted David Fortunato and Pascal-Louis Perez of kaChing presenting “Applied Lean Startup Ideas: Continuous Deployment at kaChing.”

kaChing is an online platform that connects investors with investment managers. They see continuous deployment as a way of life and adopted lean methodologies from the start. They can now go from commit to production in five minutes. They described the mechanics of an automated release from check-in to production: a clean build with full regression testing in less than three minutes, then packaging and deployment, with traffic automatically redirected using ZooKeeper to coordinate and monitor the release. Their service-oriented platform is dubbed “kawala” and is being open sourced at http://code.google.com/p/kawala
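They did not show how kawala uses ZooKeeper, but a common technique is for each service instance to register itself as an ephemeral znode so a router can redirect traffic as instances are swapped during a release. A minimal sketch with the standard ZooKeeper Java client (paths and addresses are invented):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServiceRegistration {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address is illustrative).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // An ephemeral sequential znode vanishes automatically if this
        // instance dies or is taken down, so the router stops routing to it.
        String path = zk.create("/services/api/instance-",
                "host1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + path);

        // A deploy script can then drain the old instance, swap the build,
        // and register the new one before the release moves on.
    }
}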

Video by Steve Mezak can be seen at:

http://www.slideshare.net/accelerance/continuous-deployment-part-1 and

http://www.slideshare.net/accelerance/continuous-deployment-part-2

Copyright 2010 DJ Cline All rights reserved.