New Features in Hypertable 0.9.5.3
0.9.5.3 is a patch release, but it also includes a few new features that are worth mentioning.
Pagination with OFFSET and CELL_OFFSET
With this release you can specify an OFFSET or CELL_OFFSET option to a SELECT clause to skip a number of rows or cells in the query. This is often used in combination with LIMIT and CELL_LIMIT to implement pagination, i.e. when you only display the first 20 results of a query in a web page and then let users navigate to the next or previous pages.
SELECT * FROM table OFFSET 40 LIMIT 20;
The OFFSET and CELL_OFFSET options are also available for the C++ API (ScanSpecBuilder::set_row_offset and ScanSpecBuilder::set_cell_offset) and the Thrift APIs.
Chronological Timestamps
By default, Hypertable sorts Timestamps in reverse chronological order (newest on top). Many users have expressed wishes to reverse this order, therefore we added a new column family option TIME_ORDER DESC. A SELECT of a column with this option will always return the oldest values of each cell. See below how you can use this behavior to create unique user IDs or to simulate "AUTO_INCREMENT" fields.
CREATE TABLE test (cf1 TIME_ORDER DESC);
For completeness' sake we also added TIME_ORDER ASC which specifies the default behavior.
Unique Values (for a scalable "AUTO_INCREMENT")
A common usage scenario is to create unique IDs, i.e. for users signing up to a web page, for items being stored in your catalogue etc. Traditional SQL databases have options like "AUTO_INCREMENT" which automatically assign a new ID to a row whenever you insert one. Usually they use a counter which is then incremented. Such a counter would be inefficient in a distributed environment since it always has to be synchronized with the other nodes. Therefore we came up with a new design, one that scales and does not require synchronization.
The first step is to create a column for the unique values. A unique value can never be overwritten once it was created, therefore we use TIME_ORDER DESC in combination with MAX_VERSIONS 1 (because we're not interested in other values than the oldest one).
CREATE TABLE user_profiles (user_ids TIME_ORDER DESC MAX_VERSIONS 1);
The actual insert operation consists of two parts: first insert a unique key; second verify that it was really inserted (to make sure that no other node in the cluster has inserted an identical key in the meantime).
The following HQL tries to create a unique User ID for user "alice".
INSERT INTO user_profiles VALUES ("alice", "user_ids", "random_unique_id");
SELECT user_ids FROM user_profiles WHERE ROW = "alice";
# now verify that the SELECT returned cell value "random_unique_id"
For convenience we added a new HQL function GUID() which creates globally unique IDs and which can be used to create row keys or cell values:
INSERT INTO user_profiles VALUES ("alice", "user_ids", GUID());
For even more convenience we created a new helper library (HyperAppHelper) which can create GUIDs and insert unique values. The functions are declared in HyperAppHelper/Unique.h. These functions are also exported to the Thrift interface. Our PHP microblogging sample, which implements much of Twitter's functionality, uses this function when new users sign up.
Posted By: Christoph Rupp, Senior Software Developer, Hypertable Inc.
Hypertable Regression Testing
Now that Hypertable version 0.9.5.0 is out the door, I thought I would take a moment to describe some of the testing that goes into the Hypertable release process to ensure the highest product quality. The number one priority of the Hypertable development team is to make Hypertable 100% stable and correct. As part of that effort, an extensive suite of tests have been built to verify that code changes do not destabilize the product or introduce correctness problems.
Hypertable undergoes many levels of testing during the course of its development. For example, prior to each release, long running system verification tests are run to verify that the system performs correctly under various realistic workloads. In addition to these long running tests, an exhaustive set of regression tests are run prior to every code check-in. These regression tests are a mix of unit, functional, performance, and system tests and provide good test coverage. By running these tests frequently (i.e. prior to every check-in) bugs are caught sooner in the development cycle which minimizes their impact and speeds overall development progress.
Induced Failures
To verify that the various server processes maintain correctness under failure scenarios, we've instrumented the code with failure induction points. These are places in the code that can be forced to fail by supplying a –induce-failure switch to server startup command, for example:
Hypertable.RangeServer --induce-failure=metadata-split-4:exit:0
The format of the value for this switch is <tag>:<errortype>:<iteration>. <tag> identifies the failure point, <errortype> indicates the type of failure (exit or throw), and <iteration> indicates after how many iterations of hitting the failure point the error should occur. We use this mechanism to rigorously test for correctness of the range split process as can be seen in the test output listing below.
Test Output Listing
Running tests... Start processing tests Test project /opt/hypertable/build/release 1/122 Testing Common-Exception Passed 2/122 Testing Common-Logging Passed 3/122 Testing Common-Serialization Passed 4/122 Testing Common-ScopeGuard Passed 5/122 Testing Common-InetAddr Passed 6/122 Testing Common-PageArena Passed 7/122 Testing Common-Properties Passed 8/122 Testing Common-init Passed 9/122 Testing MD5-Base64 Passed 10/122 Testing Common-StatsSystem-serialize Passed 11/122 Testing Common-StringCompressor Passed 12/122 Testing Common-TimeInline Passed 13/122 Testing Common-BloomFilter Passed 14/122 Testing Common-Hash Passed 15/122 Testing HyperComm Passed 16/122 Testing HyperComm-datagram Passed 17/122 Testing HyperComm-timeout Passed 18/122 Testing HyperComm-timer Passed 19/122 Testing HyperComm-reverse-request Passed 20/122 Testing BerkeleyDbFilesystem Passed 21/122 Testing MasterOperation-TestSetup Passed 22/122 Testing MasterOperation-Proccessor Passed 23/122 Testing MasterOperation-Initialize Passed 24/122 Testing MasterOperation-SystemUpgrade Passed 25/122 Testing MasterOperation-CreateNamespac Passed 26/122 Testing MasterOperation-DropNamespace Passed 27/122 Testing MasterOperation-CreateTable Passed 28/122 Testing MasterOperation-RenameTable Passed 29/122 Testing MasterOperation-MoveRange Passed 30/122 Testing FileBlockCache Passed 31/122 Testing QueryCache Passed 32/122 Testing TableIdCache Passed 33/122 Testing CellStoreScanner Passed 34/122 Testing CellStoreScanner-delete Passed 35/122 Testing AG-garbage-tracker Passed 36/122 Testing Hypertable-Lib-TestSetup Passed 37/122 Testing Schema Passed 38/122 Testing LocationCache Passed 39/122 Testing LoadDataSource Passed 40/122 Testing LoadDataEscape Passed 41/122 Testing BlockCompressor-BMZ Passed 42/122 Testing BlockCompressor-LZO Passed 43/122 Testing BlockCompressor-NONE Passed 44/122 Testing BlockCompressor-QUICKLZ Passed 45/122 Testing BlockCompressor-ZLIB Passed 46/122 Testing CommitLog Passed 47/122 Testing MetaLog Passed 48/122 Testing Client-large-block Passed 49/122 Testing Client-async-api Passed 50/122 Testing Client-future Passed 51/122 Testing Client-row-delete Passed 52/122 Testing Client-periodic-flush Passed 53/122 Testing NameIdMapper Passed 54/122 Testing StatsRangeServer-serialize Passed 55/122 Testing HyperDfsBroker Passed 56/122 Testing Hyperspace Passed 57/122 Testing Hypertable-shell Passed 58/122 Testing Hypertable-shell-ldi-select Passed 59/122 Testing prune-tsv Passed 60/122 Testing RangeServer Passed 61/122 Testing ThriftClient-cpp Passed 62/122 Testing ThriftClient-ruby Passed 63/122 Testing ThriftClient-perl Passed 64/122 Testing ThriftClient-python Passed 65/122 Testing ThriftClient-php Passed 66/122 Testing ThriftClient-java Passed 67/122 Testing Balancing-mechanics Passed 68/122 Testing Balancing-load Passed 69/122 Testing Client-scanner-abrupt-end Passed 70/122 Testing Client-future-abrupt-end Passed 71/122 Testing Client-future-mutator-cancel Passed 72/122 Testing Client-random-write-read Passed 73/122 Testing Client-no-log-sync Passed 74/122 Testing CellStore-garbage-collection Passed 75/122 Testing RangeServer-commit-log-gc Passed 76/122 Testing AG-garbage-collection-max-vers Passed 77/122 Testing AG-garbage-collection-deletes Passed 78/122 Testing AG-garbage-collection-ttl Passed 79/122 Testing Dual-instances Passed 80/122 Testing RangeServer-load-exception Passed 81/122 Testing MasterClient-TransparentFailov Passed 82/122 Testing METADATA-split-recovery-0 Passed 83/122 Testing METADATA-split-recovery-1 Passed 84/122 Testing METADATA-split-recovery-2 Passed 85/122 Testing METADATA-split-recovery-3 Passed 86/122 Testing METADATA-split-recovery-4 Passed 87/122 Testing METADATA-split-recovery-5 Passed 88/122 Testing METADATA-split-recovery-6 Passed 89/122 Testing METADATA-split-recovery-7 Passed 90/122 Testing METADATA-split-recovery-8 Passed 91/122 Testing METADATA-split-recovery-9 Passed 92/122 Testing METADATA-split-recovery-10 Passed 93/122 Testing METADATA-split-recovery-11 Passed 94/122 Testing METADATA-split-recovery-12 Passed 95/122 Testing METADATA-split-recovery-13 Passed 96/122 Testing RangeServer-maintenance-thread Passed 97/122 Testing RangeServer-row-overflow Passed 98/122 Testing RangeServer-rowkey-ag-imbalanc Passed 99/122 Testing RangeServer-sequential-load Passed 100/122 Testing RangeServer-split-recovery-1 Passed 101/122 Testing RangeServer-split-recovery-2 Passed 102/122 Testing RangeServer-split-recovery-3 Passed 103/122 Testing RangeServer-split-recovery-4 Passed 104/122 Testing RangeServer-split-recovery-5 Passed 105/122 Testing RangeServer-split-recovery-6 Passed 106/122 Testing RangeServer-split-recovery-7 Passed 107/122 Testing RangeServer-split-recovery-8 Passed 108/122 Testing RangeServer-split-merge-loop10 Passed 109/122 Testing RS-group-commit-split-recovery Passed 110/122 Testing RS-group-commit-split-recovery Passed 111/122 Testing RS-group-commit-split-recovery Passed 112/122 Testing RS-group-commit-split-recovery Passed 113/122 Testing RS-group-commit-split-recovery Passed 114/122 Testing RS-group-commit-split-recovery Passed 115/122 Testing RS-group-commit-split-recovery Passed 116/122 Testing RS-group-commit-split-recovery Passed 117/122 Testing RS-group-commit-split-recovery Passed 118/122 Testing RangeServer-bloomfilter-rows Passed 119/122 Testing RangeServer-bloomfilter-rows-c Passed 120/122 Testing RangeServer-ScanLimit Passed 121/122 Testing ThriftClient-reconnect-hypersp Passed 122/122 Testing Thrift-table-refresh Passed 100% tests passed, 0 tests failed out of 122 Built target alltests real 47m25.639s user 1m29.231s sys 1m56.113s
By requiring that these regression tests pass prior to each code check-in, we catch most bugs very early in the development process when they are the easiest to track down. This leads to a much more stable code base and allows us to make rapid forward progress.
The Hypertable project was started in early 2007 and has had the benefit of many real-world deployments, including Baidu (Nasdaq: BIDU), China's leading search engine and, Rediff.com (Nasdaq: REDF), the largest India-owned and operated web portal and one of the top ten e-mail providers worldwide. Our dedication to quality along with years of real-world production deployment experience has made Hypertable the rock solid, production-quality database that it is today. If you have need for high performance, scalable database, now would be a great time to give Hypertable a try.
Posted By: Doug Judd, CEO, Hypertable Inc.
Opportunities for AI in Hypertable to be presented at AAAI 2011 Workshop
This Sunday, August 7th, the Hypertable development team will be presenting at AAAI 2011 Workshop – AI For Data Center Management and Cloud Computing, in San Francisco, CA. The objective of this workshop is to bring together researchers and technologists from academia and industry to explore the applications of artificial intelligence to the most pertinent technical challenges in data center management and cloud computing. The Hypertable team will be giving the following talks.
Opportunities for AI in a Cloud Database: Hypertable
Doug Judd, Hypertable, Inc (invited speaker)
A significant percentage of cloud data center hardware is dedicated to running cloud storage services. As the Big Data revolution continues and cloud databases become more sophisticated, the percentage of data center equipment dedicated to storage services is likely to grow. In this talk, Doug Judd, the original creator of Hypertable, will present opportunities for AI in cloud databases, using Hypertable as an example.
Load Balancing in Hypertable
Gordon Rios, University College Cork, Constraint Computation Centre
In Hypertable ranges of table data are stored and accessed on different nodes and allows for flexible management of the underlying hardware. Overall performance is sensitive to the balance of range load across the cluster. The project developers aim to create a simple interface to allow researchers to design experimental load balancing strategies that incorporate machine learning and optimization. This talk introduces the load balancing problem and presents it as a challenge problem for AI and machine learning.
If you have an interest in constraint optimization, machine learning, or AI in general, and would like to apply these techniques in Hypertable, we encourage you to attend the workshop and/or get involved in the project by joining the Hypertable Developers mailing list.
Stability Improvements in the Hypertable 0.9.5.0 pre-release
We recently announced the Hypertable 0.9.5.0 pre-release. Even though we've labelled it as a "pre" release, it is one of the biggest and most important Hypertable releases to date. Among other things, it includes a complete re-write of the Master, to fix some known stability problems. It represents a significant amount of work as can be seen by the following code change statistics:
- 512 files changed
- 30,633 line insertions
- 14,354 line deletions
The following describes problems that existed in prior releases and how they were solved, and highlights other stability improvements included in the 0.9.5.0 pre-release.
Duplicate range load. In prior releases, when a Range Server decided to give up a range (e.g. after a split), it would inform the master by calling the Master::move_range() method and then record the move in its meta log (RSML). Unfortunately, this logic contained a race condition. If the range server called Master::move_range(), but died before it got a chance to record the move in the RSML, and then the Master was stopped (e.g. sysadmin restart of the system), all record of the move was lost. When the RangeServer came back up, it would re-attempt to move the range, causing it to get loaded by two different range servers. With the introduction of the Master MetaLog (MML) and a two-phase Master::move_range() operation, this problem has been resolved.
Overlapping ranges. In prior releases, the Master would ask a range server to load a range by calling the RangeServer::load_range() method and would rely on the ALREADY_LOADED response code to handle situations where the acknowledgement was lost (e.g. range server or master died at an inopportune moment) and the RangeServer::load_range() call was re-issued. This logic also contained a race condition. When a range was loaded and the acknowledgement was lost, the loaded range could split before the Master re-attempted to load the range. When RangeServer::load_range() call was re-issued, the RangeServer happily loaded the range because it no longer contained the range in its live set (due to the split). With the introduction of a two-phase load range operation, this problem has been resolved.
Hypertable Leverages Bigtable’s High Performance Pattern Matching Code
We're thrilled to announce that as of the 0.9.4.3 release we've knocked out a popular feature request: regular expression based filtering. Queries can now filter cells by regular expression matches on the row key, column qualifier, and value.
To implement this feature, we decided to use Google's RE2 regular expression engine. Although there are a number of excellent regular expression engines to choose from, RE2 fits perfectly with Hypertable's high performance philosophy and the fact that it powers Bigtable, Sawzall and a host of other Google projects made it a great candidate. Our reasons for picking RE2:
- RE2 supports most Perl/PCRE syntax and allows us to provide feature rich regular expression matching.
- It is blazingly fast (guaranteed linear run time), so we don't have to worry about ad-hoc queries hogging up too much CPU. With a 110MB dataset consisting of about 4.5M unique URLs and our tests showed RE2 was 3X-50X faster as compared to java.util.regex.Pattern.
- It uses a small, fixed amount of memory so no query can bring down the server with unbounded memory usage.
- RE2 allows Hypertable to deliver powerful pattern matching capability at the lowest hardware cost.
- The fact that RE2 powers Bigtable, Sawzall and other critical Google code speaks for itself.
Example
Let's walk through an example using the DMOZ dataset. This data contains a title, description, and category tags for a set of URLs. The domain components of the URL have been reversed so that URLs from the same domain sort together. In the schema, the row key is a URL and Title, Description and Topic are column families. The DMOZ dataset is about 2GB uncompressed. Here's a small sample from the dataset:
com.awn.www Title Animation World Network
com.awn.www Description Provides information resources to the …
com.awn.www Topic:Arts
com.awn.www Topic:Animation
Download the dataset, which is in the .tsv.gz format and can be directly loaded into Hypertable without unzipping
wget https://s3.amazonaws.com/hypertable-data/dmoz.tsv.gz
Create the dmoz table and populate it with the downloaded data file using the hypertable shell:
/opt/hypertable/current/bin/ht shell
USE "/";
CREATE TABLE dmoz(Description, Title, Topic, ACCESS GROUP topic(Topic));
LOAD DATA INFILE "dmoz.tsv.gz" INTO TABLE dmoz;
Row Filtering
Suppose you want a subset of the URLs where the phrase linux appears in the domain name, you could run:
SELECT Title FROM dmoz WHERE ROW REGEXP "^[^/]*linux" KEYS_ONLY;
ar.com.carreralinux.www
ar.com.linux-way.www
…
Column Filtering
To match all topics that start with write (case insensitive):
SELECT Topic:/(?i)^write/ FROM dmoz;
13.141.244.204/writ_den Topic:Writers_Resources
ac.sms.www/clubpage.asp?club=CL001003004000301311 Topic:Writers_Resources
…
Value Filtering
The next example shows how to query for data where the description contains the word game followed by either foosball or halo:
SELECT CELLS Description FROM dmoz WHERE
VALUE REGEXP "(?i:game.*(foosball|halo)\s)";
com.armchairempire.www/Previews/PCGames/planetside.htm Description Preview by Mr. Nash. "So, on the one side the game is sounding pretty snazzy, on the other it sounds sort of like Halo at its core."com.digitaldestroyers.www Description Video game fans in Spartanburg, South Carolina who like to get together and compete for bragging rights. Also compete with other Halo / Xbox fan clubs
…
Conclusion
The mission of the Hypertable project is to couple a rich feature set with optimum performance. The incorporation of Google's RE2 engine is yet another example of our continuing efforts towards that end. We'd like to thank Google and RE2 team for contributing this excellent code to the open source community.
Hypertable Use Case: Tribalytic
Tribalytic is a market research tool that delivers instant insight into brand and marketing campaigns by analyzing Twitter data. Given any set of keywords, Tribalytic will provide analysis results on key market research metrics.
Let’s take the phrase "coffee machine" as an example. With Tribalytic, you can quickly see that people usually tweet about coffee machines during the following times — around 9:00, 14:00, 21:00. This information is crucial for a marketer who can now identify the top keyword “broken” and the time frames this word should be linked to their coffee machine marketing campaign. The powerful Tribalytic research tool allows companies to see the percentage of all twitter users in Australia that have mentioned “coffee machine” within any given time period, and target their marketing campaigns not only based on key words, but on the exact times their audience is engaged.
Information like that obtained in the example above provides crucial insight for the marketing decision making process — i.e. how long a campaign should run, how to measure the campaign effectiveness, and whether the campaign has reached the majority of the target audiences.
Tribalytic is revolutionary because it requires no setup. Compared to our competitors, who require a sales call and manual setup per keyword monitoring. With Tribalytic, all you need to do is to type a keyword, check a few boxes and you can start getting benefits right away. This instantaneous feedback experience has been a 'killer' feature to our customers.
Hypertable is one of the key technologies that make "the slick Tribalytic experience" possible. This post will talk about our system architecture, how we integrate Hypertable into our system, and why we chose Hypertable over HBase and MongoDB.
System Architecture
The latest tweets for the people in the sample pool are downloaded and stored into a MySQL database. An indexer process reads the new tweets, parses them and stores the results into Hypertable. When a user submits a query, data is read from Hypertable, an in-memory analysis is performed, and the results are presented back to the user.
Currently, we have only one table called "hits", which is essentially an inverted index for quick keyword lookups. Let's say the user wants to research “coffee” for the last two weeks. All of the users who mentioned coffee during that period and their related tweets are loaded into memory so that our analytic engine can do its job. The speed of reading data from disk into memory is the key to the overall performance. After we switched the data store to Hypertable, plus rewriting the analytic engine in c++, we successfully reduced query time for coffee from 200 seconds to 0.5 second. 1000 times faster!
Tribalytic tracks 200,000 Australian twitter users. On a typical day, we will process up to 3,500,000 tweets and about 50,000,000 new records will get pushed into the 'Hits' table. If you consider that 9.837% people in Australia talk about coffee at least once a week, there is a lot of data loaded into memory for analysis. Because Hypertable stores data for the same keyword consecutively on disk, reading them in one shot does not incur excessive disk seeks, which is usually the root cause for slow query speed. In our coffee example, we spend less than 250 ms to load all the data into memory, leaving another 350 to 550 ms for analysis and still have plenty time left for the normal 'web' stuff.
How we integrated Hypertable
The indexer is written in python. It saves data to Hypertable via the python thrift interface. Initially we found that Hypertable’s array version of the interface worked best for us, since it avoids unnecessary python object construction/destruction overhead.
The query and analytic part are both written in c++ using the Hypertable c++ client library. The analyzer exposes the query functionality via a thrift socket interface to front-end python/django processes.
Why we chose Hypertable
One of the primary reasons we chose a scalable NoSQL database was so that the system would be architected to scale up as we grow. We can add more countries, increase sample sizes, and sustain more traffic without “hitting the wall”.
Our initial prototype was built using MongoDB. Because MongoDB uses a B-tree as its underlying data store, it's just not the right technology for sequential data reading. In our coffee benchmark, it would take 50 – 75 seconds to read all data into memory. We still use MongoDB to serve more traditional profile queries that require more random reads over small data sets.
Before we adopted Hypertable, we built a prototype using HBase. Given the apache "brand", and the Hadoop, Mahout, and Hive projects out there, how could we not? The main reason we chose Hypertable instead of HBase was because of the increased performance Hypertable was able to deliver, which reduced our hardware footprint and thereby lowered cost.
Doug and his team chose c++ over Java for good reasons — to garner greater performance output. We've also observed similar significant performance boost when switching from HBase to Hypertable. C++ offers much finer control over how the memory can be used more efficiently. We use two Google open source libraries sparsehash and perftools extensively in our analytical engine to achieve our performance goal.
One mis-conception about C++ is the perceived inevitability of the segfault. We had an extensive testing suite built using boost.unit and gcover to make sure all the corners have been covered. Since we deployed the new service 3 months ago, we haven't had any hiccups.
About the code and the team
The Hypertable team has been very supportive. Most questions were answered within 24 hours. The code has been exceptionally clean and well documented. Back in Feb, one of the Hypertable community members ran a c++ static analyzer on the code and only found two issues. For more than half of my questions, I could find the answers by just sorting through the code.
Posted By: Alex Dong, CTO, BinaryPlex Pty. Ltd.
Hypertable T-Shirt Preview
In a week, we'll be printing high quality Hypertable t-shirts and will be passing them out to all of the attendees at the upcoming Hypertable meetup at the Facebook Headquarters in Palo Alto, CA. As computer geeks, we obtain a substantial portion of our wardrobe at conferences and trade shows, so we understand the value of a good FREE t-shirt. So, for this one, we spared no expense. Check out the following mock-up and details below.
The front …


The Design
We hired the design firm Blue Coast Web to put together the design, the same firm that we used to put together the hypertable.org site. They came up with a number of creative t-shirt design ideas which we spent a fair amount of time deliberating over. We ended up going with a design that we felt would be appealing to broadest cross-section of Hypertable supporters.
The Shirt
We chose high quality American Apparel shirts (Combed fine jersey 4.3 oz, 100% cotton) at an additional cost of $2/shirt. The reason we went with the American Apparel is because we like the cut (more stylish) and the fabric felt better than the other options.
The Printing Process
The shirts will be printed by Graphic Sportswear in San Francisco. It will (likely) be a seven color screenprint (to achieve a four color process gradient look). Colors will likely be:
- black for tesseract outline
- light green/grey of tesseract "panes"
- light green
- medium green
- dark green
- light grey in hypertable text, to achieve gradient to the…
- dark grey in hypertable text
If you're interested in Hypertable and want to get your hands on one of these t-shirts, come join us at the upcoming Hypertable meetup at the Facebook office in Palo Alto on September 16th. We'll be discussing the recently developed Hypertable+Hive extension and will present the results of the Hypertable vs. HBase Performance Evaluation. Hope to see you there!
Posted By: Rebecca Ritter, VP, Corporate Communications, Hypertable, Inc.
ArchCamp: Scalable Databases (NoSQL)
On Friday night, Sanjit Jhala (Lead Hypertable Engineer) and I attended the ArchCamp unconference at HackerDojo in Mountain View, CA put on by Adam Hitchcock from Skype and Zach Steindler from Olark. I've been to a number of unconferences and this one was definitely one of the better ones. There was plenty of pizza, beer, and great discussions. I enjoyed the conversations I had with the other attendees, including Paco Nathan from KwizineLoc and Todd Hoff of High Scalability fame. These are my notes from a session we led entitled, "Scalable Databases (NoSQL)"
The session started out free-form, but shaped up pretty quickly into a discussion of the popular open source scalable NoSQL databases and the architectural categories in which they belong. The following is a description of each category that was considered.
Key/Value store – This category applies to Distributed Hash Table (DHT) technology, a scalable database architecture that primarily supports GET, SET, and DELETE operations on key/value pairs. It's a bit of a misnomer because a system like Bigtable supports these operations, but the term is most commonly associated with DHT technology.
Column-oriented – This category refers to database systems that store data physically grouped on disk by column (see Column-oriented DBMS). They can achieve very high throughput column selects because only the data for the selected columns is transferred from the disk subsystem. Most people categorize Bigtable as a column oriented data store. This is somewhat of a misnomer because Bigtable supports what's known as Locality Groups, giving the schema designer control of how column data is physically stored on disk. Bigtable can be configured to be column-oriented by putting each column in its own locality group, or row-oriented by putting all columns in a single locality group, or some combination of the two (e.g. two columns in a locality group and the rest in another).
Document-oriented - This category refers to systems that facilitate the storage and retrieval of semi-structured documents. Systems in this category provide APIs that allow applications to easily store documents in a format such as JSON and can auto-index fields.
In-memory – This category describes databases that store their entire data set in memory. They can deliver excellent performance, but their capacity is limited by the aggregate amount of physical RAM available system-wide. Many of these systems backup data to disk via a journal and/or snapshotting.
Auto-sharding – Sharding is what most people do when they have no scalable database (e.g. Facebook). Table data is partitioned into horizontal "shards" and each shard is stored on a separate machine in a traditional database (e.g. MySQL). To achieve high availability, each shard server can be replaced with a traditional master/slave replica group. Typical homegrown sharding systems require a fair amount of "glue code" to route requests to shard servers. Auto-sharding systems automate this process.
Dynamo – Dynamo is a distributed hash table (DHT) architecture developed by Amazon.com specifically for their shopping cart system. One of the key requirements behind the design was high availability writes, which was born out of the desire to avoid system failure when a customer adds an item to their cart. To that end, it employs a technique called Evenutal Consistency whereby writes succeed when a subset of replicas acknowledge. This can lead to inconsistencies in the face of machine failures which are reconciled during reads.
Bigtable – Bigtable is a scalable database system developed by Google. It is designed to operate within the context of their broader scalable computing stack which includes the Google File System (GFS) and Map/Reduce. All state in Bigtable is persisted in the GFS. The system provides "immediate" consistency at the expense of latency spikes during machine failures.
| Database | Architecture | Implementation Language |
| Hypertable | Bigtable, Column-oriented | C++ |
| HBase | Bigtable, Column-oriented | Java |
| Cassandra | Dynamo (with Bigtable data model) | Java |
| MongoDB | Auto-sharding | C++ |
| Riak | Dynamo, Key/Value store | Erlang |
| CouchDB | Document-oriented (non-scalable) | Erlang |
| Redis | In-memory, Key/Value store | C |
| Project Voldemort | Dynamo | Java |
| Tokyo Cabinet/Tyrant | Key/Value store | C |
| Dynomite | Dynamo | Erlang |
| Amazon S3 | Hosted Key/Value store | Java |
| Amazon RDS | Hosted MySQL | C |
We scribbled notes up on a whiteboard that was like a conveyer belt turned on its end. There was a button on a control panel below the board which, when pressed, caused the conveyer belt to rotate, scan the board, and dispense a printout of the board contents from a print roller attached to the control panel. Here's the printout of the board for our session (sorry for the chicken scratch, I should've been a Doctor!):

Many thanks to Adam and Zach for organizing the conference. We look forward to the next one!
Hypertable will be holding its next meetup at Facebook in Palo Alto, CA. Please come join us for pizza, beer, and a free Hypertable t-shirt (be sure to RSVP with your size).
Posted By: Doug Judd, CEO, Hypertable, Inc.
Hypertable vs. HBase Performance Evaluation
In our initial blog post, Why We Started Hypertable, Inc., we mentioned that achieving optimum performance has been one of the project’s most dominant guiding principles. As a result, we made the decision to do the implementation of Hypertable in C++. Today, we are pleased to publish our first set of benchmark results, comparing the performance of Hypertable with that of HBase. These results demonstrate that implementation language matters, and that C++ delivers a significant performance advantage. A detailed test report can be found through the following link.
Hypertable vs. HBase Performance Evaluation Test Report
A sample of the results are presented in the table below.
| Test | Hypertable Performance Improvement Relative to HBase (%) |
| Random Read Uniform 80 GB | 396 |
| Random Read Uniform 20 GB | 424 |
| Random Read Uniform 2.5 GB | 97 |
| Random Read Zipfian 80 GB | 925 |
| Random Read Zipfian 20 GB | 777 |
| Random Read Zipfian 2.5 GB | 100 |
| Random Write 10000 byte values | 51 |
| Random Write 1000 byte values | 102 |
| Random Write 100 byte values | 427 |
| Random Write 10 byte values | 931 |
| Sequential Read 10000 byte values | 1060 |
| Sequential Read 1000 byte values | 68 |
| Sequential Read 100 byte values | 129 |
| Scan 10000 byte values | 2 |
| Scan 1000 byte values | 58 |
| Scan 100 byte values | 75 |
| Scan 10 byte values | 220 |
The first observation that I'd like to make has to do with the performance difference of the two systems as the size of the value decreases. One key target application for a Bigtable-like scalable database is Web-scale analytics. This type of application typically operates on hundreds of millions or billions of very small values. Think session click counts. We feel that over time, this will be a principal driver of demand for the technology. As can be seen by the results, Hypertable's performance relative to HBase grows considerably as the size of the value decreases. This puts Hypertable at a distinct advantage over HBase when it comes to supporting this important class of application.
The second observation I would like to make has to do with the suitability of these systems in cloud environments. Traditional datacenters see utilization rates that are often below 20%, whereas cloud computing providers routinely report utilization rates at nearly 80%. A scalable, multi-tenant database service offering in a cloud environment can expect to experience similarly high utilization rates. At these levels, performance wins translate directly into cost savings. With these kinds of performance multiples, Hypertable can deliver multi-tenant database capacity at a fraction of the cost of a system like HBase.
And finally I would like to comment on the sustainability of these results. Some of the performance difference between the two systems may have to do with better design choices on the part of Hypertable, but the bulk of the performance difference can be attributed to the implementation language. Bigtable-like database systems are very memory intensive and CPU intensive. Java, in comparison to C++, is notoriously poor when it comes to leveraging these two resources. As both of these systems evolve, they will provide more CPU intensive functionality (data types, aggregates, etc.). This means that over time, the disparity in performance will continue to grow.
We're excited by these results and look forward to working with you to help build your next-generation, big data applications.
Posted By: Doug Judd, CEO, Hypertable, Inc.
Why We Started Hypertable, Inc.
Welcome to the Hypertable blog. We will be using this blog to communicate what’s happening with Hypertable – both the company and the project – and to express our opinions on issues being discussed in the world of “big data” applications.
It has been about six months since we formed Hypertable, Inc. so we thought, for our first blog post, it would be appropriate to discuss why we decided to start the company. Back in 2006, I was brought on as an Architect at Zvents. Our goal was to do for “local” search what Google had done for global search. This meant collecting growing amounts of query log, click log, and crawl data, doing analytics over that data, and using the results to fuel our ranking, recommendation, and ad targeting systems. One of the key requirements was to build a system that could keep pace with growth in traffic, and to that end, we decided to forgo traditional RDBMS technology and build the back-end system entirely on top of scalable computing technology.
At the time, Hadoop existed, which provided the scalable file system and a MapReduce framework, but there was no scalable indexing implementation for serving large data sets to live applications. Google’s BigTable was the obvious choice, and the design jibed with my experience building large-scale crawling and indexing systems at Inktomi. Not long after we decided to go down the BigTable path, the Dynamo paper was published. We seriously considered how they handled inter-data center replication through the use of eventual consistency, but felt that it made more sense to employ these techniques at a higher level. We felt that the BigTable design was much more versatile. Not only could it handle simple key/value look-ups, but being a sorted data structure, it could handle range queries very efficiently, critical for analytics applications. And being fully consistent, it was much simpler to reason about. Fortunately for the project, the BigTable design has proven to work well in practice. BigTable powers nearly every major service at Google (over 100 of them) and is one of their most important pieces of scalable computing infrastructure.
Achieving optimum performance has been one of the project’s most dominant guiding principles, and the reason is simple. Efficiency gains scale linearly with the system, which dramatically reduces hardware requirements and lowers overall operational costs. It was within this context that we decided to do the implementation in C++. (We will be announcing some benchmark results soon which makes us think we made the right decision.) In addition, since we were building general purpose technology that applied to much more than just local search, we decided to open source the project and reap all of the benefits that the open source development model has to offer. You can see more details on the Hypertable project here.
As we saw Hypertable being downloaded and used by more and more companies – and big data applications becoming an integral and rapidly growing part of our economy – we realized it would be important to have an organization to provide the capabilities and support enterprises need for production implementations of Hypertable. We also felt that, as the primary committers to the Hypertable project, we would be best equipped to address that need. As a result, after much contemplation, we decided to leave Zvents to form Hypertable, Inc.
The mission of Hypertable, Inc. is to provide support for organizations that run Hypertable as part of their business critical applications. The company provides a host of professional services including architecture consulting, training, and 24/7 support so that enterprises can have a service level guarantee that their scalable database infrastructure will be up and operational.
We look forward to using this blog to keep you updated.
Regards,
Doug Judd
CEO and Co-Founder
Hypertable, Inc.

