Sunday, October 7, 2012

BigMemory Go - Fast Restartable Store: Kickass!!


BigMemory Go is the latest product offering from Terracotta. It lets you keep all of your application's data instantly available in your server's memory, and your application responds faster when its data sits closest to it.

A typical application keeps gigabytes of data in a database, which slows the application down. BigMemory Go provides a brand new reliable, durable, crash-resilient and, above all, fast store called the Fast Restartable Store. The Fast Restart feature keeps a fully consistent copy of the in-memory data on local disk at all times.

We generally use a database for bookkeeping, state-of-record, persistent data. BigMemory Go provides a durable, persistent data store that can handle any kind of shutdown, planned or unplanned. The next time your application starts up, all of the data that was in memory is still available and very quickly accessible.
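
For illustration, here is a minimal sketch of how a restartable cache might be wired up programmatically. It assumes the Ehcache 2.6-era API shipped with BigMemory Go; the cache name, sizes and disk path are made up for this example, and the product jar has to be on the classpath.

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.Configuration;
import net.sf.ehcache.config.DiskStoreConfiguration;
import net.sf.ehcache.config.MemoryUnit;
import net.sf.ehcache.config.PersistenceConfiguration;
import net.sf.ehcache.config.PersistenceConfiguration.Strategy;

public class RestartableStoreSketch {
    public static void main(String[] args) {
        // The Fast Restartable Store writes its data files under this directory.
        Configuration config = new Configuration()
            .diskStore(new DiskStoreConfiguration().path("/data/frs"))
            .cache(new CacheConfiguration()
                .name("petclinic")
                .eternal(true)
                .maxBytesLocalHeap(512, MemoryUnit.MEGABYTES)
                // LOCALRESTARTABLE keeps a consistent copy of the in-memory data on disk.
                .persistence(new PersistenceConfiguration().strategy(Strategy.LOCALRESTARTABLE)));

        CacheManager manager = new CacheManager(config);
        Cache cache = manager.getCache("petclinic");

        cache.put(new Element("owner-1", "George Franklin"));
        // After a planned or unplanned shutdown, the same get() finds the data again on restart.
        System.out.println(cache.get("owner-1"));

        manager.shutdown();
    }
}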

Here we compare the performance of a MySQL database (tuned to the best of my knowledge), Ehcache used as a Hibernate 2nd-level cache (with the DB backend), and the BigMemory store. The sample application is the classic Spring PetClinic app (using Hibernate), simplified, stripped of Spring Web Flow, and converted into a standalone Java performance test. The app has a number of Owners, each with Pets and corresponding Visits to the Clinic. The test first creates a number of Owners with pets and visits and puts them into storage, then runs a mixed read/write workload (10% writes) for 30 minutes.

The three cases considered here are:
  1. DB only, with no Hibernate caching
  2. Ehcache with BigMemory used as the Hibernate 2nd-level cache, with a DB backend
  3. BigMemory Go Fast Restartable Store (FRS)
The performance comparison chart shows that BigMemory Go FRS outperforms the competition. With no Hibernate or database transaction management overhead, it simply runs faster.

DB vs Ehcache h2lc vs BigMemory FRS

BigMemory Go FRS provides a consistent 50 µs latency 99.99% of the time, compared to a 3 ms average latency for the Ehcache Hibernate 2nd-level cache and 60 ms for the MySQL database. The Hibernate 2nd-level cache certainly makes reads faster, but it still carries a lot of Hibernate and JPA transaction overhead.

BigMemory Go FRS is a new-age data store that is durable, reliable and super-fast. ;)


Test Sources (Maven Project Download)

Friday, September 21, 2012

NUMA & Java



It's time to deploy your application, and you are looking to procure hardware that best suits the load requirements. Boxes with 40 or 80 cores are pretty common these days. The general assumption is: more cores, more processing power, more throughput. But I have seen results to the contrary, where a small CPU-intensive test ran slower on an 80-core box than on a smaller 40-core box.

These many-core boxes come with a Non-Uniform Memory Access (NUMA) architecture, which boosts the performance of memory access on local nodes. The hardware is divided into zones called nodes, and each node is allotted a certain number of cores and a portion of the memory. So for a box with 1 TB of RAM and 80 cores, we get 4 nodes, each with 20 cores and 256 GB of memory.


You can check this with the numactl --hardware command:
>numactl --hardware
available: 4 nodes (0-3)
node 0 size: 258508 MB
node 0 free: 186566 MB
node 1 size: 258560 MB
node 1 free: 237408 MB
node 2 size: 258560 MB
node 2 free: 234198 MB
node 3 size: 256540 MB
node 3 free: 237182 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

When the JVM starts, it spawns threads that get scheduled on cores in whatever nodes the OS picks. Each thread uses its local memory to be as fast as possible. At some point a thread may go into a WAITING state and later be rescheduled on a CPU, and this time there is no guarantee it lands on the same node. It may now have to access a remote memory location, which adds latency: remote memory access is slower because the request has to traverse an interconnect link, introducing additional hops.

The Linux command numactl provides a way to bind a process to certain nodes only, locking it to those nodes both for execution and for memory allocation. If a JVM instance is locked to a single node, inter-node traffic disappears and all memory access happens on fast local memory.
numactl
   --cpunodebind=nodes, -c nodes
    Only execute process on the CPUs of nodes.
I created a small test which repeatedly serializes a big object and measures transactions per second and latency.
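
The actual test is available for download below; a bare-bones sketch of that kind of workload looks roughly like this (plain Java, the payload size is arbitrary, and the thread count comes from the -Dthreads property used in the command below):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class SerializationTest {

    // A reasonably big serializable object; 1 MB of random bytes is an arbitrary choice.
    static class BigObject implements Serializable {
        final byte[] payload = new byte[1024 * 1024];
        BigObject() { new Random().nextBytes(payload); }
    }

    public static void main(String[] args) throws Exception {
        final int threads = Integer.getInteger("threads", 10);
        final AtomicLong txns = new AtomicLong();
        final BigObject big = new BigObject();

        for (int i = 0; i < threads; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            // The "transaction": serialize the big object once.
                            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
                            out.writeObject(big);
                            out.close();
                            txns.incrementAndGet();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }

        // Print transactions per second every 5 seconds.
        long previous = 0;
        while (true) {
            Thread.sleep(5000);
            long current = txns.get();
            System.out.println("TPS: " + (current - previous) / 5);
            previous = current;
        }
    }
}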

To execute the Java process bound to one node, run:
numactl --cpunodebind=0 java -Dthreads=10 -jar serializationTest.jar

I ran this test on two different boxes.

Box A
4 CPU x 10 cores x 2 (hyperthreading) = Total of 80 cores
Nodes: 0,1,2,3

Box B
2 CPU x 10 cores x 2 (hyperthreading) = Total of 40 cores
Nodes: 0,1

CPU speed: 2.4 GHz for both.
The default setting is to use all the nodes available on the box.


Box | NUMA policy       | TPS   | Latency Avg (ms) | Latency Min (ms)
A   | Default           |   261 |               37 |               18
B   | Default           |   387 |               25 |                5
A   | --cpunodebind=0,1 |   405 |               23 |                3
B   | --cpunodebind=0   | 1,613 |                5 |                3
A   | --cpunodebind=0   | 1,619 |                5 |                3

So we can infer that with default settings, Box A, which has more nodes, performs worse on a CPU-intensive test than the 2-node Box B with its defaults. But once we bind the process to only 2 nodes, it performs just as well, probably because there are fewer nodes to hop across and the probability of a thread being rescheduled on the same node rises to 50%.

With --cpunodebind=0, it just outperforms all the cases.

NOTE: The above test was run with 10 threads on 10 cores.

Test Jar: download
Test Sources: download

Thursday, May 31, 2012

Fast, Predictable & Highly-Available @ 1 TB/Node




The world is pushing huge amounts of data to applications every second, from mobiles, the web, and various gadgets. More applications these days have to deal with this data. To preserve performance, these applications need fast access to the data tier.

RAM prices have crumbled over the past few years and we can now get hardware with a terabyte of RAM much more cheaply. OK, got the hardware, now what? We generally use virtualization to create smaller virtual machines to meet an application's scale-out requirements, as running a Java application with a terabyte of heap is impractical: JVM garbage collection will slaughter your application right away. Ever imagined how much time it would take to do a single full garbage collection of a terabyte of heap? It can pause an application for hours, making it unusable.

BigMemory is the key to accessing terabytes of data with millisecond latency, with no disk/RAID configurations or databases to maintain.

BigMemory = Big Data + In-memory

BigMemory can utilize your hardware to the last byte of RAM, storing up to a terabyte of data in a single Java process.
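
As a rough sketch, this is what a BigMemory-backed cache could look like programmatically (assuming the Ehcache 2.x off-heap API; the sizes mirror the test below, the cache name is made up, and the JVM would need the BigMemory jar plus a -XX:MaxDirectMemorySize large enough to cover the off-heap area):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.Configuration;
import net.sf.ehcache.config.MemoryUnit;

public class BigMemorySketch {
    public static void main(String[] args) {
        // Keep the Java heap small; the bulk of the data lives off-heap in BigMemory,
        // out of reach of the garbage collector.
        Configuration config = new Configuration()
            .cache(new CacheConfiguration()
                .name("bigDataCache")
                .eternal(true)
                .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)
                .maxBytesLocalOffHeap(960, MemoryUnit.GIGABYTES)
                .overflowToOffHeap(true));

        CacheManager manager = new CacheManager(config);
        Cache cache = manager.getCache("bigDataCache");
        System.out.println("Configured off-heap bytes: "
            + cache.getCacheConfiguration().getMaxBytesLocalOffHeap());
        manager.shutdown();
    }
}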

BigMemory provides "fast", "predictable" and "highly-available" data at 1 terabyte per node.

The following test uses two boxes, each with a terabyte of RAM. Leaving enough room for the OS, we were able to allocate 2 x 960 GB of BigMemory, for a total of 1.8+ TB of data, without the problems of high latencies or huge scale-out architectures ... just using the hardware as it is.

Test results: 23K readonly transactions per second with 20 ms latency.
Graphs for test throughput and periodic latency over time.



Readonly Periodic Throughput Graph


Readonly Periodic Latency Graph

Friday, May 11, 2012

Billions of Entries and Terabytes of Data - BigMemory




Combine BigMemory and Terracotta Server Array for Performance and Scalability

The age of Big Data is upon us. With ever expanding data sets, and the requirement to minimize latency as much as possible, you need a solution that offers the best in reliability and performance. Terracotta’s BigMemory caters to the world of big data, giving your application access to literally terabytes of data, in-memory, with the highest performance and controlled latency.
At Terracotta, while working with customers and testing our releases, we continuously experiment with huge amounts of data. This blog illustrates how we were blown away by the results of a test using BigMemory and the Terracotta Server Array (TSA) to cluster and distribute data over multiple computers.

Test Configuration

Our goal was to take four large servers, each with 1TB of RAM, and push them to the limit with BigMemory in terms of data set size, as well as performance while reading and writing to this data set. As shown in Figure 1, all four servers were configured using the Terracotta Server Array to act as one distributed but uniform data cache.
Figure 1 - Terracotta Server Array with a ~4TB BigMemory Cache

We then configured 16 instances of BigMemory across the four servers in the Terracotta Server Array, with 952 GB of BigMemory in-memory cache allocated per server. This left enough free RAM available to the OS on each server. With the Terracotta Server Array, you can configure large data caches with high availability, with your choice of data striping and/or mirroring across the servers in the array (for more information, read http://terracotta.org/documentation/terracotta-server-array/configuration-guide or watch this video http://blip.tv/terracotta/terracotta-server-array-striping-2865283). The end result was 3.8 terabytes of BigMemory available to our sample application for its in-memory data needs.
Next, we ran 16 instances of our test application, each on its own server, to load data and then perform read and write operations. Additionally, the Terracotta Developer Console (see Figure 2) makes it quick and simple to view and test your in-memory data performance while your application is running. Note that we could have configured BigMemory on these servers as well, thereby forming a Level 1 (L1) cache layer for even lower latency access to hot-set data. However, in this blog we decided to focus on the near-linear scalability of BigMemory as configured on the four stand-alone servers. We’ll cover L1 and L2 cache hierarchies and hot-set data in a future blog.
Figure 2 - The developer console helps configure your application's in-memory data topology.

Data Set Size (2 billion entries; 4TB total)

Now that we had our 3.8 terabyte BigMemory up and running, we loaded close to 4 terabytes of data into the Terracotta Server Array at a rate of about 450 gigabytes per hour. The sample objects loaded were data arrays, each 1700 bytes in size. To be precise, we loaded 2 billion of these entries with 200GB left over for key space.
The chart below outlines the data set configuration, as well as the test used:
Test Configuration

Test:                         Offheap-test
svn url:
Element #:                    2 Billion
Value Size:                   1700 bytes (simple byte arrays)
Terracotta Server #:          16
Stripes #:                    16 (1 Terracotta Server per Mirror Group)
Application Node #:           16
Terracotta Server Heap:       3 GB
Application Node Heap:        2 GB
Cache Warmup Threads:         100
Test Threads:                 100
Read Write %age:              10
Terracotta Server BigMemory:  238 GB/Stripe. Total: 3.8 TB

The Test Results

As mentioned above, we ran 16 instances of our test application, each loading data at a rate of 4,000 transactions per second (tps) per server, reaching a total of 64,000 tps. At this rate we were limited mostly by our network, as 64K tps at our sample size translates to around 110 MB per second, which is almost 1 Gigabit per second (our network's theoretical maximum). Figure 3 graphs the average latency measured while loading the data.
Figure 3 - Average latency, in milliseconds, during the load phase.

The test phase consisted of operations distributed as 10% writes and 90% reads on randomly accessed keys over our data set of 2 billion entries. The chart below summarizes the incredible performance, measured in tps, of our sample application running with BigMemory.


Test Results

Warmup Avg TPS:      4k / Application Node = 64k total
Warmup Avg Latency:  24 ms
Test Avg TPS:        4122 / Application Node = 66k total
Test Avg Latency:    23 ms

We reached this level of performance with only four physical servers, each with 1 terabyte of RAM. With more servers and more RAM, we can easily scale up to 100 terabytes of data, almost linearly. This performance and scalability simply wouldn’t be possible without BigMemory and the Terracotta Server Array. 

Check out the following links for more information:
- Terracotta BigMemory: http://terracotta.org/documentation/bigmemory/overview


Thursday, January 19, 2012

Performance Benefits of Pinned Cache



Terracotta is a tiered data-management platform that lets applications keep as much of their important data as possible as close to where it is needed. It does so automatically and can keep data retrieval times at the microsecond level for up to a terabyte of data.

Terracotta Server Array tiered architecture


With Automatic Resource Control (ARC), we try to use the local cache efficiently, keeping the most-accessed data nearest to the application: the more hits a cache gets, the more of it is cached locally. If an application accesses all of its data (read-only and read-write) equally, ARC will try to distribute the local cache space equally among all the caches. Read more in my last blog.

In some applications, we care most about the throughput (or minimum latency) of access to one particular data set, which can certainly be achieved if that data lives in the local cache. Sometimes the developer or admin knows specific things about an application that allow them to make specific performance decisions about subsets of data; for example, the admin may know that keeping the "US States Cache" in local heap keeps the system running at maximum speed.

With cache pinning, we can pin small caches (those that fit in local heap) locally while the other caches are managed by ARC. Latencies for the pinned cache stay minimal, at the microsecond level.

Here is a test which tries to simulate this kind of scenario and brings out the performance benefits of the Pinned Caches.

The test has 4 caches, one of which is a read-only cache. Each cache loads around 250 MB of data, for a total of 1 GB. With 512 MB of local cache, ARC tries to distribute the resources equally. Since the local heap is not enough to hold all the data, the rest is stored in the lower tier.

Period Throughput Chart - All caches are Unpinned
The above chart shows that all the caches are being accessed equally and achieve almost the same throughput. ARC is working properly, but the read-only cache is mostly reading from the lower tier (the Terracotta Server Array), since some of its data gets pushed down there by the other caches.

After pinning the read-only cache locally, the application can access its data extremely quickly. The read-write caches store their updated values in the Terracotta Server Array anyway.

Period Throughput Chart - Readonly Pinned Cache/ReadWrite Unpinned Caches

The above chart shows that with cache pinning, the read-only cache gets a huge boost: from somewhere around 1,000 tps it is now touching 500k tps. The closer we keep the data to the application, the better it performs. The read-only cache now holds its full 250 MB of data within the total 512 MB of local heap.

Application Throughput
The total throughput of the app also gets a huge boost with cache pinning, as the read-only cache now runs with low latencies and high throughput.

To enable cache pinning, add the following to the cache config. Read more at ehcache.org.
<pinning store="localHeap|localMemory|inCache"/>
 Example:

<ehcache                           
    maxBytesLocalHeap="300m">      
    <defaultCache/>                
    <cache                         
        name="readonly-cache"      
        eternal="true">            
        <pinning                   
            store="localmemory"/>  
    </cache>                       
    <cache                         
        name="readwrite-cache-2"   
        eternal="true">            
    </cache>                       
    <cache                         
        name="readwrite-cache-3"   
        eternal="true">            
    </cache>                       
    <cache                         
        name="readwrite-cache-1"   
        eternal="true">            
    </cache>                       
    <terracottaConfig              
        url="localhost:9510"/>     
</ehcache>                         

To download the test click here.

Thursday, November 24, 2011

From "Terrabytes :O" to "Just Terrabytes!!"


My first computer had 64 MB of RAM. Since then, technology has improved and become a lot cheaper; we can easily get terabytes of RAM now.

But when I talk about storing a TB of data in a single Java application/process, I get reactions like: are you insane?! A TB of data in a Java application? It won't even start, and if it gets into GC (garbage collection), you can go and have a coffee at Starbucks and it still won't be finished.

Then I say BigMemory is the saviour: you don't have to worry about GC any more.
But still, can BigMemory store TBs of data without any performance degradation?

Here is my experiment: I tried loading 1 TB of data into a single JVM with BigMemory.
I loaded over a billion elements of around 850 bytes of payload each, for a total of ~900 GB of data. We hit the hardware limitations there, but it could certainly go past a TB if the hardware were available.

I came across a huge box with 1 TB of RAM, which made this possible. To reduce GC issues, the JVM heap was kept at 2 GB.

The test creates an Ehcache and loads the data into it. Here is the ehcache configuration used for the test:

<ehcache    
    name="cacheManagerName_0"     
    maxBytesLocalOffHeap="990g">    
    <cache         
        name="mgr_0_cache_0"         
        maxEntriesLocalHeap="3000"     
        overflowToOffHeap="true"/>
</ehcache>
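
The warmup phase is essentially a put loop against that configuration. A simplified sketch follows (the key scheme and default element count are placeholders; the real test drives this from many threads):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class WarmupLoader {
    public static void main(String[] args) {
        // Uses the ehcache.xml shown above (cacheManagerName_0 / mgr_0_cache_0).
        CacheManager manager = CacheManager.newInstance("ehcache.xml");
        Cache cache = manager.getCache("mgr_0_cache_0");

        byte[] payload = new byte[850];                 // ~850 bytes per element, as in the test
        long elements = Long.getLong("elements", 1000000L);

        for (long key = 0; key < elements; key++) {
            // Anything beyond the 3000-entry heap limit overflows to BigMemory (off-heap).
            cache.put(new Element(key, payload.clone()));
        }
        System.out.println("Loaded " + elements + " elements");
        manager.shutdown();
    }
}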

Here is a graph of the periodic (every 4 seconds) warmup throughput over time. The secondary axis of the chart shows the total data stored.



There are a few slight dips in the chart; these occur when BigMemory is expanding to store more data. The throughput stays above 200,000 txns/sec the whole time, with an average of 350,000 txns/sec.

Latencies are also a big concern for applications.


Latency stays below 1 ms, with an average of 300 µs.

Okay, we have loaded a TB of data. Now what? Does it even work?

Yes, it does. The test phase performs read/write operations over the data set: it randomly selects an element, updates it, and puts it back into the cache.
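
A stripped-down version of that read-update-write loop might look like the following (a sketch only, assuming the Long keys used during warmup; the iteration count and the "update" are placeholders):

import java.util.Random;
import net.sf.ehcache.Cache;
import net.sf.ehcache.Element;

public class ReadWritePhase {

    // Randomly pick an element, mutate its value and put it back.
    static void readUpdateWrite(Cache cache, long keyRange) {
        Random random = new Random();
        for (int i = 0; i < 100000; i++) {
            long key = (long) (random.nextDouble() * keyRange);
            Element element = cache.get(key);
            if (element != null) {
                byte[] value = (byte[]) element.getObjectValue();
                value[0]++;                                   // a trivial in-place "update"
                cache.put(new Element(key, value));
            }
        }
    }
}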




I would say the throughput and latencies are not bad at all :)
The spikes are due to JVM GC; even with a 2 GB heap we still get a few GCs, but the pause time is under 2 seconds. So the maximum latency for the test is around 2 seconds, while the 99th percentile is around 500 µs.

So if your application is slowed down by the database, or you are spending thousands of dollars maintaining one, get BigMemory and offload your database!

There may be concerns about searching this huge data set; Ehcache Search makes that possible.



Friday, November 4, 2011

New efficient Ehcache Automatic Resource Control (ARC)

The most common technique for speeding up an application is caching, and Ehcache is the most widely used cache in the Java world. Ehcache with BigMemory and Terracotta can cache terabytes of data without any GC issues, bringing data closer to the clustered application in an efficient manner. The Terracotta Server stores the whole data set in BigMemory and serves the required data to the clustered application. With the new feature, BigMemory at the application level (the Terracotta client, or Layer 1/L1), we bring cached data even closer to the application.

In an application, we might need to cache different types of data in different caches. At some point one of the caches may be used heavily while the others get few hits. Before the Terracotta 3.6.0 release, we could specify the number of elements to cache per cache, but with limited heap/BigMemory it is tough to allocate space to each cache efficiently. The memory allocated to each cache is fixed, so even if a cache is not used much, its data still occupies space in the application.

With the new feature, Automatic Resource Control (ARC), Ehcache manages the allocated heap/BigMemory based on cache usage. As usage of cache_0 increases, Ehcache tries to allocate more memory to cache_0, and as usage of cache_1 increases, it rebalances the space between the two caches.

Test Case

Here is a small test which creates two caches, cache_0 and cache_1, and loads data into both. During the test phase, threads access both caches, but for one of them at a time a small delay is introduced after each transaction, reducing the hits to that cache.

Each cache can store 1.5 GB of data, 3 GB in total, while the BigMemory allocated at the application level is only 2 GB. That is enough to hold all the cached data for one of the caches, but not both.

Access Pattern
  1. 2 mins, both the caches are being used
  2. Next 10 mins cache_0 is being used more often
  3. Again for 2 mins both the caches are being used
  4. Next 10 mins cache_1 is used heavily
  5. Repeat

First, let's discuss the case without ARC.




The tps remains almost constant even when the other cache is not being used. We can see that when both caches are in use, the application throughput is the same.



The L1 BigMemory usage remains constant throughout the test.

Now let's look at the benefits of ARC.




With ARC, we can see that when the full cached data set for a cache fits in application-level BigMemory, the throughput gets a boost and touches 140k txn/sec. When both caches are in use, the tps is almost the same with and without ARC.



The throughput variation can be understood from the graph of L1 BigMemory usage. We can see the L1 BigMemory usage for cache_0 increasing as it is heavily used; over time, most of the memory goes to cache_0. As cache_1 usage increases, its memory usage increases as well, giving the throughput a boost.

To enable ARC, we just need to provide maxBytesLocalOffHeap at the CacheManager level.

Here is a sample ehcache.xml:

 <ehcache           
   name="cacheManagerName_0"            
   maxBytesLocalHeap="512m"             
   maxBytesLocalOffHeap="2g">             
   <defaultCache/>                 
   <cache                      
     name="cache_1"             
     overflowToOffHeap="true">          
     <terracotta/>                 
   </cache>                     
   <cache                      
     name="cache_0"             
     overflowToOffHeap="true">          
     <terracotta/>                 
   </cache>                     
   <terracottaConfig                
     url="localhost:9510"/>             
 </ehcache>                      

This is a single client attached to the Terracotta Server to keep the test case simple. With multiple nodes, the benefits would be even more pronounced. :)

The above test case uses only two caches; just picture having tens of caches, where tuning each one individually would be a problem. With Ehcache ARC, it is Ehcache's responsibility to manage the data efficiently and get maximum throughput out of the application.