Tag Archives: NoSQL

Oct. 27, 2010 SDF LinkedIn Avatara

On October 27, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Chris Riccomini, Senior Data Scientist on LinkedIn’s Product Analytics team to talk about “Scalable Analytical Processing with LinkedIn’s Avatara.”

Formerly of PayPal, Riccomini worked on LinkedIn's "People You May Know" feature and "Who's Viewed My Profile" product. He talked about the competing priorities of high throughput and low latency, and how Hadoop and Voldemort together address both. LinkedIn needed something that would support offline aggregation, event-time aggregation and query-time aggregation. It had to run through a shared Map/Reduce interface to power customer-facing data products. He described a layered structure from top to bottom: the engine on the top layer, below that the cube and the cube query, and at the bottom three elements: transform, aggregator and comparator.

The result was Avatara, a real-time, scalable Online Analytical Processing (OLAP) system already in production. It supports operations like select, count, where, group, having, order and limit. Riccomini described the architecture and implementation of its storage and aggregation layers.

One new term I heard was AvatarSQLishBuilder. Apparently, even in a NoSQL environment, the code should still have the look and structure of SQL. My advice for anyone heading into Hadoop territory is to take an experienced SQL database developer with you. Java is not enough in this Wild West show.
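Riccomini didn't show AvatarSQLishBuilder's actual API, but the idea of code that "still has the look and structure of SQL" can be sketched as a fluent builder. Everything below is hypothetical: the class, method names and the `profile_views` cube are invented for illustration, not Avatara's real interface.

```python
# Hypothetical sketch of a SQL-ish query builder; none of these names
# are Avatara's actual API, and the cube name is invented.
class CubeQuery:
    def __init__(self, cube):
        self.cube = cube          # name of the pre-aggregated cube
        self.clauses = []         # accumulated clauses, in call order

    def _add(self, keyword, expr):
        self.clauses.append((keyword, expr))
        return self               # returning self enables chaining

    def select(self, *cols):  return self._add("SELECT", ", ".join(cols))
    def where(self, cond):    return self._add("WHERE", cond)
    def group_by(self, col):  return self._add("GROUP BY", col)
    def having(self, cond):   return self._add("HAVING", cond)
    def order_by(self, col):  return self._add("ORDER BY", col)
    def limit(self, n):       return self._add("LIMIT", str(n))

    def build(self):
        # Assumes select() was called first; splice FROM in after it.
        parts = [f"{kw} {expr}" for kw, expr in self.clauses]
        parts.insert(1, f"FROM {self.cube}")
        return " ".join(parts)

q = (CubeQuery("profile_views")
     .select("viewer_industry", "COUNT(*)")
     .where("member_id = 42")
     .group_by("viewer_industry")
     .order_by("COUNT(*) DESC")
     .limit(10)
     .build())
```

The payoff of a builder like this is exactly the point made above: a SQL developer can read the chained calls top to bottom and spot a missing GROUP BY or a nonsensical HAVING clause at a glance.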

Another new term was the Yahoo Cloud Serving Benchmark (YCSB), a way to compare the performance of various cloud data stores. I thought they were talking about yogurt. More explanation is available from the YCSB project.

Richard Taylor was there and has written a splendid article about it.


Copyright 2010 DJ Cline All rights reserved.

Oct. 26, 2010 SDF Membase NoSQL Cloud Solution

On October 26, 2010 in Menlo Park at Orrick, the SDForum Cloud Services SIG hosted Membase's Matt Ingenthron and his presentation "Scale out NoSQL Solution for the Cloud."

Apps are moving from the desktop to the cloud, with millions of users expecting instantaneous response. Membase offers new techniques to store and retrieve data in the NoSQL space. They offer a simple, fast, elastic key-value database built on the memcached engine interface and compatible with existing memcached clients and applications. It is open source under the Apache 2.0 license and can scale in public or private clouds.
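The memcached interface Membase stays compatible with boils down to a handful of verbs. A toy in-memory version illustrating the get/set/delete shape of that interface might look like this; it is a sketch only, not a real client or Membase's implementation, and all names are invented.

```python
import time

# Toy in-memory store mimicking the memcached-style get/set/delete
# verbs that Membase remains compatible with; not a real client.
class ToyCache:
    def __init__(self):
        self._data = {}           # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.time() + ttl if ttl else None
        self._data[key] = (value, expiry)
        return True

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if expiry is not None and time.time() > expiry:
            del self._data[key]   # lazily expire stale entries
            return None
        return value

    def delete(self, key):
        return self._data.pop(key, None) is not None

cache = ToyCache()
cache.set("session:42", {"user": "alice"})
```

Because applications only ever touch these few verbs, Membase can swap in a clustered, persistent backend without existing memcached clients noticing.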

It was originally used to speed up access to authoritative data as a distributed hash table. It can scale linearly by adding nodes while maintaining consistent access to data. That scaling flattens the cost and performance curves that make a traditional DBMS grind to a halt, which proved crucial when supporting such social games as FarmVille and Mafia Wars. The Membase server database can handle 500,000 operations per second.

Ingenthron wrapped up by discussing Project Arcus, Moxi (a memcached proxy), vBucket mapping, and clustering. He mentioned their partnership with Cloudera, using Sqoop and Flume with Hadoop: Membase offers the distributed OLTP solution and Cloudera offers the distributed OLAP solution. As an example, targeting an ad to a particular user, which might normally take 40 milliseconds, could take only one millisecond. This could make a big difference when traveling across a crowded mobile network.
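The vBucket mapping Ingenthron mentioned can be sketched in a few lines: a key hashes to a fixed-size bucket, and a separate table maps buckets to servers, so rebalancing only edits the table rather than rehashing keys. The bucket count, hash function and server names below are illustrative assumptions, not Membase's actual implementation.

```python
import zlib

NUM_VBUCKETS = 64                 # illustrative; real deployments use more
SERVERS = ["node-a", "node-b", "node-c"]

# Table mapping each vBucket to a server. In a real cluster, rebalancing
# updates entries in this table; the key-hashing step never changes.
vbucket_map = {vb: SERVERS[vb % len(SERVERS)] for vb in range(NUM_VBUCKETS)}

def server_for(key: str) -> str:
    # Hash the key to a vBucket, then look the vBucket up in the table.
    vb = zlib.crc32(key.encode()) % NUM_VBUCKETS
    return vbucket_map[vb]
```

The design choice worth noticing is the indirection: clients always agree on which vBucket a key belongs to, so moving a vBucket between nodes is a single table update rather than a cluster-wide rehash.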

Copyright 2010 DJ Cline All rights reserved.

Sept. 21, 2010 SDF Analytics: SQL or NoSQL

On September 21, 2010 in Palo Alto at SAP, the SDForum Business Intelligence SIG hosted SenSage's Richard Taylor and his presentation "Analytics: SQL or NoSQL." From his early days at Cambridge, Taylor's research projects in parallel and distributed computing for DEC, Data-Cache, RedBrick Systems, Informix, and IBM are well known to experts in the business intelligence community. That is why the room was packed when he chose to talk about the new challenge to relational databases called the NoSQL movement.

The language started as SEQUEL in 1974 and evolved into SQL. Adopted by Oracle, it became the standard for relational databases, using schemas, multi-version concurrency control, isolation levels and analytics extensions to deal with the complexity of structured data. The relational model created a world of normalized data in rows and columns, with tables selected, projected or joined using primary and foreign keys. It handled transaction processing very well, but complicated cases became repetitive and scaling was difficult.
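The relational operations described above, normalized tables joined on a primary/foreign key, with selection and projection on top, can be shown in miniature using Python's built-in sqlite3. The two-table schema here is invented purely for illustration.

```python
import sqlite3

# Two normalized tables linked by a primary/foreign key, in the spirit
# of the relational model described above; the schema is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE views (viewer_id INTEGER REFERENCES members(id),
                        viewed_id INTEGER REFERENCES members(id));
    INSERT INTO members VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO views VALUES (1, 2), (2, 1), (1, 2);
""")

# Join on the foreign key, select rows with WHERE, project two columns.
rows = conn.execute("""
    SELECT m.name, COUNT(*)
    FROM views v
    JOIN members m ON m.id = v.viewed_id
    WHERE v.viewed_id = 2
    GROUP BY m.name
""").fetchall()
```

The schema tells the engine how the tables relate, which is exactly the guidance Taylor notes is missing on the NoSQL side.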

By 2000, the rise of unstructured data on the web created new levels of complexity and the need for a new approach. Coined by Eric Evans in June of 2009, the term NoSQL names a movement seen in the development of Google's Bigtable, Amazon's Dynamo and Facebook's Cassandra. All of these use a tuple: one table consisting of a structured key with a column timestamp and an unstructured value. The two functions are map and reduce. Map takes a tuple as input and outputs a list of tuples. Reduce takes a key and a list of values and outputs a list or a tuple. You specify the clusters, inputs and tuple stores, and the framework does the rest. While there is no need to normalize large amounts of semi-structured data and it is cheaper to implement, it still requires some programming ability, and there is no guidance from a schema or model for historical data.
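The two functions just described can be shown with the canonical word-count example: map takes one tuple and emits a list of tuples, reduce takes a key with its list of values and emits a result. The framework's shuffle phase is simulated here with a plain dict; this is a sketch of the programming model, not any particular framework's API.

```python
from collections import defaultdict

# Map: input one (doc_id, text) tuple, output a list of (word, 1) tuples.
def map_fn(record):
    doc_id, text = record
    return [(word, 1) for word in text.split()]

# Reduce: input a key and its list of values, output a (key, total) tuple.
def reduce_fn(key, values):
    return (key, sum(values))

def run(records):
    # Simulate the framework's shuffle: group mapped values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run([(1, "no sql"), (2, "no no sql")])
```

Note what the programmer never writes: partitioning, grouping and cluster distribution all belong to the framework, which is both the appeal and, as Taylor notes, the part that still demands programming ability when things go wrong.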

Taylor gave examples of how SQL and NoSQL would handle the same problems. Each had its advantages and disadvantages. I urge you to read Taylor’s work and listen to him speak on this subject.

Frankly, I would still want an experienced database developer with a strong background in SQL to deal with NoSQL, because only they would be able to sense when something was wrong. Big data is no place for amateurs.

Note: A delegation from Peru was in the audience. Picture below.

Copyright 2010 DJ Cline All rights reserved.