June 24, 2010

Hypertable vs. HBase Performance Evaluation

In our initial blog post, Why We Started Hypertable, Inc., we mentioned that achieving optimum performance has been one of the project’s most dominant guiding principles.  As a result, we decided to implement Hypertable in C++.  Today, we are pleased to publish our first set of benchmark results, comparing the performance of Hypertable with that of HBase.  These results demonstrate that implementation language matters, and that C++ delivers a significant performance advantage.  A detailed test report can be found through the following link.

Hypertable vs. HBase Performance Evaluation Test Report

A sample of the results is presented in the table below.

Test                                 Hypertable Performance Improvement
                                     Relative to HBase (%)
-----------------------------------  ----------------------------------
Random Read Uniform 80 GB            396
Random Read Uniform 20 GB            424
Random Read Uniform 2.5 GB           97
Random Read Zipfian 80 GB            925
Random Read Zipfian 20 GB            777
Random Read Zipfian 2.5 GB           100
Random Write 10000 byte values       51
Random Write 1000 byte values        102
Random Write 100 byte values         427
Random Write 10 byte values          931
Sequential Read 10000 byte values    1060
Sequential Read 1000 byte values     68
Sequential Read 100 byte values      129
Scan 10000 byte values               2
Scan 1000 byte values                58
Scan 100 byte values                 75
Scan 10 byte values                  220

The first observation that I'd like to make concerns the performance difference between the two systems as the size of the value decreases.  One key target application for a Bigtable-like scalable database is Web-scale analytics.  This type of application typically operates on hundreds of millions or billions of very small values.  Think session click counts.  We feel that over time, this will be a principal driver of demand for the technology.  As the results show, Hypertable's performance relative to HBase grows considerably as the size of the value decreases.  This puts Hypertable at a distinct advantage over HBase when it comes to supporting this important class of application.
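To make the value-size trend concrete, here is a small Python sketch that converts the Random Write and Scan figures from the table into throughput ratios. It assumes that "improvement" means (Hypertable - HBase) / HBase x 100, so that a 931% improvement corresponds to roughly a 10.3x ratio; that reading of the metric, along with the function and variable names, is an assumption made for illustration and is not taken from the test report.

```python
# Percent-improvement figures copied from the table in this post,
# keyed by value size in bytes.
RANDOM_WRITE = {10000: 51, 1000: 102, 100: 427, 10: 931}
SCAN = {10000: 2, 1000: 58, 100: 75, 10: 220}


def speedup(improvement_pct):
    """Convert a percent improvement into a throughput ratio.

    Assumption: improvement = (hypertable - hbase) / hbase * 100.
    """
    return 1.0 + improvement_pct / 100.0


def grows_as_values_shrink(results):
    """Return True if the improvement rises strictly as value size falls."""
    sizes = sorted(results, reverse=True)  # largest values first
    pcts = [results[size] for size in sizes]
    return all(a < b for a, b in zip(pcts, pcts[1:]))


for name, results in (("Random Write", RANDOM_WRITE), ("Scan", SCAN)):
    for size in sorted(results, reverse=True):
        ratio = speedup(results[size])
        print(f"{name} {size:>5}-byte values: {ratio:.2f}x")
```

The Sequential Read rows in the table do not follow the same monotonic pattern, so the check is applied only to the Random Write and Scan rows.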

The second observation I would like to make has to do with the suitability of these systems in cloud environments.  Traditional datacenters see utilization rates that are often below 20%, whereas cloud computing providers routinely report utilization rates at nearly 80%.  A scalable, multi-tenant database service offering in a cloud environment can expect to experience similarly high utilization rates.  At these levels, performance wins translate directly into cost savings.  With these kinds of performance multiples, Hypertable can deliver multi-tenant database capacity at a fraction of the cost of a system like HBase.
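As a rough illustration of that cost argument, the Python sketch below estimates how many nodes would serve the same load at a given per-node improvement. It assumes throughput scales linearly with node count and that throughput is the only constraint, ignoring storage capacity, replication, and network limits. The 100-node baseline is hypothetical; the 427% figure is the Random Write 100 byte result from the table.

```python
import math


def nodes_needed(baseline_nodes, improvement_pct):
    """Estimate nodes required to serve the same load.

    Assumptions: throughput scales linearly with node count, and the
    per-node throughput ratio is 1 + improvement_pct / 100.
    """
    ratio = 1.0 + improvement_pct / 100.0
    return math.ceil(baseline_nodes / ratio)


# Hypothetical 100-node baseline cluster, using the 427% improvement
# reported for Random Write with 100 byte values.
print(nodes_needed(100, 427))
```

Under these assumptions the 100-node cluster shrinks to 19 nodes. Real deployments would also have to account for data volume per node and peak load, so this is a lower bound on hardware, not a sizing guide.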

And finally I would like to comment on the sustainability of these results. Some of the performance difference between the two systems may have to do with better design choices on the part of Hypertable, but the bulk of the performance difference can be attributed to the implementation language. Bigtable-like database systems are very memory intensive and CPU intensive. Java, in comparison to C++, is notoriously poor when it comes to leveraging these two resources.  As both of these systems evolve, they will provide more CPU intensive functionality (data types, aggregates, etc.).  This means that over time, the disparity in performance will continue to grow.

We're excited by these results and look forward to working with you to help build your next-generation, big data applications.

Posted By:  Doug Judd, CEO, Hypertable, Inc.

16 Comments
  1. Jun 24 2010

    Hey Doug,
    Congrats on founding the new company, and glad to see that Hypertable is still alive and well. Having friendly competition between alternative Bigtable implementations keeps us all on our toes and is hugely beneficial for the user community.
    Do you plan on releasing the performance evaluation software? A comparison benchmark is only fair if you provide the means with which others can reproduce the results. It also seems as if there were some configuration errors made on the HBase side (eg if you bump the heap up to 5GB, it's standard to tune the block cache percentage, GC, etc). Reproducing the test on more modern hardware is also important in my mind – no one runs datacenters with dual core Opterons anymore in my experience.
    If you cannot release the benchmark code, I would encourage you to write a plugin for Yahoo's YCSB benchmark to test Hypertable.
    Lastly, I would encourage you to do a post comparing HBase and Hypertable on a basis of features and project integration capability. Recently the HBase team has been focusing less on performance and more on reliability and features (eg we have replication nearly finished, new integration with MapReduce for bulk loads, major load balancing improvements on the way, etc). Obviously these things are harder to compare objectively, but it would still be interesting to hear about the state of the world and whether the two projects have similar or diverging road maps. In my experience, feature set and roadmap are more important than raw performance in the minds of most educated customers, especially those intending to make a long term platform investment (though I am very impressed by your raw numbers above!)
    Thanks, and hope to catch up next week if you're around at the Hadoop Summit.
    -Todd

  2. admin
    Jun 24 2010

    Hi Todd,

    The report contains a link to a document that has detailed instructions on exactly how to reproduce the test.  It includes pointers to all of the configuration files that we used in the test.  As you can see, we did use a 5GB heap and tuned the JVM with -XX:+UseConcMarkSweepGC and -XX:+CMSIncrementalMode.  We did not tune the block cache percentage on a per-test basis because workloads on the same database can vary and we feel that a database system should adapt to those variations.

    The reason that we've been focused on performance is because that is something that cannot be added to a system after the fact.  Choices made early on in a project, that can't easily be undone, can have a huge impact on the performance of the system.  Our feature set continues to grow and over time the feature sets of both systems will converge.  

    As far as reliability goes, the test results show that HBase has reliability problems in the sense that it loses data.  Read the last section in the report entitled, "Individual Test Reports".  You'll see that in five of the tests, HBase returned less data than it had written.  On the Hypertable project, we take consistency very seriously and have an exhaustive set of regression tests and system tests to verify that the system never loses data.

    I do enjoy the friendly rivalry we've got going on between the projects and Hypertable is certainly better because of it.  I will be at the Hadoop Summit and look forward to catching up.

    - Doug

  3. Jun 24 2010

    Hey Doug,
    Thanks for the response. I missed the link that had a pointer to the source code for the test. When we shift gears towards optimization I'll give it a try and see what the results look like on recent hardware, properly tuned HBase, etc.
    Regarding the reliability question, the majority of reliability work I was speaking of is going on in HBase trunk, not 0.20 series. Lost edits will be considered a serious bug in the next version of HBase – it's a shame that they weren't up until this point, but I agree that it's the #1 priority.
    As for performance not being able to be added later to a system, I entirely disagree. Isn't that exactly what people mean by "premature optimization?" There is plenty of performance room left in HBase – I don't think Java vs C++ is any kind of golden rule that C++ will always win by a large margin.
    Anyway, blog comments are not the best place for a technical discussion, but look forward to talking next week. I'm sure there are some design ideas that we can toss back and forth and benefit both projects from the sharing.
    -Todd

  4. Jun 24 2010

    I tend to disagree too :-)
     
    I think C++ will exhibit better performance, at least when sorting.
     
    http://verify.stanford.edu/uli/java_cpp.html
     
    It's a little bit out of date but still worth reading.

  5. Sanjit Jhala
    Jun 24 2010

     
    Hi Todd,
    I read that HBase had a very successful performance optimized release (0.20.0) which  dramatically improved performance by "unJavafying" the code base:
    http://developer.yahoo.net/blogs/theater/archives/2009/07/hadoop_summit_hbase_goes_realtime.html
    http://www.scribd.com/doc/16735075/HBase-Goes-Realtime

    We look forward to competing with HBase in terms of feature set, performance and "unJavafication" in upcoming releases :)
    -Sanjit

  6. anonymous
    Jun 24 2010

    @Mateusz, that article is 9 years old!  Ignoring a decade of improvements in JIT compiler technology to make a not very well supported language statement is kind of odd.

  7. Jun 24 2010

    @anonymous: surprisingly a decade of improvements in JIT technology did not make a lot of difference: 
     
    http://shootout.alioth.debian.org/u64/benchmark.php?test=all&lang=gpp&lang2=java
     
    Surprisingly, software written in Java consistently uses more memory and more time, though less code.
     

  8. Isaac Gouy
    Jun 25 2010

    @Mateusz – now look at the source code for the C++ programs and the source code for the Java programs, and see what they are actually doing.
    Interesting how similarly the similar n-body programs perform ;-)
    As evidence of "uses more memory" only a handful of the tasks are interesting – the others are just showing the default JVM usage – so k-nucleotide, reverse-complement, regex-dna, and binary-trees.
    As evidence of "uses less code", add the caveat: uses less code when performance is the priority. "Remember – those are just the fastest C++ GNU g++ and Java 6 -server programs"

  9. Jun 25 2010

    At this time are there any plans to port Hypertable to Plan 9?

  10. admin
    Jun 25 2010

    @Bob We currently have no plans for Plan 9, but would welcome a port if someone in the community is up for it.
    - Doug

  11. Michael Stack
    Jul 1 2010

    Hey Doug:
    Please undo your purchase of the hbase adword.  There are plenty of open forums for disseminating your message and on which hbasers can respond should they wish, but your purchase of the hbase adword leaves us only one response and that's to outbid you.  The only winner in that game will be Google.
    Please write me privately if you intend to persist (I'm away from computers for next week and more).
    Thanks,
    St.Ack

  12. Jul 20 2010

    This is not a Java vs. C++ competition but a fight between the performance-first and feature-set-first approaches. Surely, HBase has a lot of room for performance improvement, but I am afraid that the longer the HBase team postpones this optimization, the harder it will be to implement in the future.
    @Todd Lipcon. Focusing on reliability of HBase is a must because the current version (0.20.5) is unstable (we have been observing a lot of region server shutdowns under heavy load).

  13. Jul 22 2010

    @St.Ack: On the adword campaign: hypertable.com is a support company advertising on keyword concepts on the internet (such as "hbase" or "hypertable"), just as Google or Cloudera might advertise on concepts such as "hadoop". It's simply a way to promote the commercial support product.
     

  14. Scott Carey
    Oct 11 2010

    It's not uncommon for me to dig into a Java app and get 3x to 10x performance with a few days to weeks of tuning and refactoring.   In C++, performance first is more critical because such refactoring is far more dangerous and time consuming.   Most of the things that make a typical Java program slow or memory-bloated can be fixed without a lot of work.  On the other hand, an application as large as HBase can take a while to improve.

    I can't speak for HBase itself, but most Hadoop related Java projects were not written for best performance first and still have many significant gains left to make.  It would not surprise me if there is a LOT left to gain performance wise in HBase that has nothing at all to do with Java versus C++ — but rather to do with the most common Java design patterns (easiest to code or most familiar) to solve various problems.
    So while this may be a fair test comparing the current state of two similar technologies, it has almost nothing to do with C++ versus Java unless one project was a direct port of the other, with the same features and algorithmic choices.   I'm sure that if both projects focused on small value access, the results would differ.   That data alone screams that much or most of this has nothing to do with the languages used.

    I have been involved with ports of code from C++ to Java that yielded 2x performance gains, and ones that yielded slowdowns.   The JIT is not good at all things, but is extremely good at certain things.  The GC, when tuned and well understood, is often far faster than C++ memory allocation for long lived programs that do a lot of allocation and destruction.  For 'in place' algorithms with lots of array access and byte/bit twiddling, C/C++ can come out 4x faster in some cases (an LZ78 style compression decompressor) or equivalently fast (CRC32).

    Also, sorting was mentioned.  Java 7's sorts are ~1.2 to 2x faster than Java 6.  Why?  Algorithm changes.
    Algorithm choice trumps language as a performance factor for C++ versus JIT'd Java.  When the exact same algorithm is used, performance is very frequently the same.  C++ wins biggest when it can do something Java can't — like traverse a byte array and cast byte pointers to int pointers and enforce 4-byte aligned access and therefore trigger the more efficient processor instructions for loading 4 consecutive bytes into a register as a 32 bit int.
