Sunday, October 7, 2012

BigMemory Go - Fast Restartable Store: Kickass!!

BigMemory Go is the latest product offering from Terracotta. It lets you keep all of your application's data instantly available in your server's memory. Your application responds fastest when its data is closest to it.

A typical application has a database storing GBs of data, which makes the application slow. BigMemory Go provides a brand new reliable, durable, crash-resilient and, above all, fast storage called the Fast Restartable Store. The Fast Restart feature keeps a fully consistent copy of the in-memory data on the local disk at all times.

We generally use a database for bookkeeping: the state-of-record, persistent data. BigMemory Go provides a durable, persistent data store that can handle any kind of shutdown, planned or unplanned. The next time your application starts up, all of the data that was in memory is still available and quickly accessible.
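For reference, enabling the Fast Restartable Store is a matter of cache configuration. Here is a minimal sketch in Ehcache-style XML; the cache name, sizes and disk path are made up for illustration:

```xml
<ehcache>
  <!-- directory where the restartable store keeps its on-disk copy -->
  <diskStore path="/var/data/frs"/>
  <cache name="ownersCache"
         maxBytesLocalHeap="64M"
         maxBytesLocalOffHeap="2G">
    <!-- keeps a fully consistent copy of the in-memory data on disk -->
    <persistence strategy="localRestartable"/>
  </cache>
</ehcache>
```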

Here we compare the performance of a MySQL database (tuned to the best of my knowledge), Ehcache as a Hibernate 2nd-level cache (with the DB backend), and the BigMemory store. The sample application is the classic Spring PetClinic app (using Hibernate), modified for simplicity: Spring Web Flow was shed and the app was converted to a standalone Java performance test. The app has a number of Owners, who have Pets and corresponding Visits to the Clinic. The test first creates a number of Owners with pets and visits and puts them into storage. Then it executes a 10% write / 90% read test for 30 minutes.
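The heart of such a harness is a simple mixed read/write loop. Here is a self-contained sketch: the class and method names are made up, a plain map stands in for the real MySQL/Ehcache/BigMemory backends, and a fixed operation count replaces the real 30-minute timed run.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the 10% write / 90% read loop; the real test
// hits MySQL, the Hibernate 2nd-level cache, or BigMemory instead of a map.
public class ReadWriteMix {
    static final Map<Integer, String> store = new ConcurrentHashMap<>();

    // returns {reads, writes} performed
    public static long[] run(int owners, int ops, long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < owners; i++) store.put(i, "owner-" + i);
        long reads = 0, writes = 0;
        for (int op = 0; op < ops; op++) {
            int id = rnd.nextInt(owners);
            if (rnd.nextInt(100) < 10) {      // ~10% of operations are writes
                store.put(id, "owner-" + id + "-v" + op);
                writes++;
            } else {                          // ~90% are reads
                store.get(id);
                reads++;
            }
        }
        return new long[]{reads, writes};
    }

    public static void main(String[] args) {
        long[] counts = run(1000, 100_000, 42L);
        System.out.println("reads=" + counts[0] + " writes=" + counts[1]);
    }
}
```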

The three cases considered here are:
  1. No Hibernate caching, DB only
  2. Ehcache with BigMemory as the Hibernate 2nd-level cache, with the DB backend
  3. BigMemory Go Fast Restartable Store (FRS)
The performance comparison chart shows that BigMemory Go FRS outperforms the competition. With no Hibernate or DB transaction management, it just runs faster than ever.

DB vs Ehcache h2lc vs BigMemory FRS

BigMemory Go FRS provides a consistent 50 µs latency 99.99% of the time, compared to a 3 ms average latency for the Ehcache Hibernate 2nd-level cache and 60 ms for the MySQL DB. With a Hibernate 2nd-level cache we surely make reads faster, but there is still too much Hibernate and JPA transaction overhead on top.

BigMemory Go FRS is new-age data storage which is durable, reliable and super-fast. ;)

Test Sources (Maven Project Download)

Friday, September 21, 2012

NUMA & Java

It's time to deploy your application, and you are looking to procure hardware that best suits the load requirements. Boxes with 40 or 80 cores are pretty common these days. The general conception is: more cores, more processing power, more throughput. But I have seen somewhat contrary results, where a small CPU-intensive test runs slower on an 80-core box than on a smaller 40-core box.

These boxes with huge core counts come with a Non-Uniform Memory Access (NUMA) architecture, which boosts the performance of memory access for local nodes. The hardware is divided into zones called nodes, and each node is allotted a certain number of cores and a portion of memory. So for a box with 1 TB of RAM and 80 cores, we have 4 nodes, each allotted 20 cores and 256 GB of memory.

You can check this using the command numactl --hardware:
>numactl --hardware
available: 4 nodes (0-3)
node 0 size: 258508 MB
node 0 free: 186566 MB
node 1 size: 258560 MB
node 1 free: 237408 MB
node 2 size: 258560 MB
node 2 free: 234198 MB
node 3 size: 256540 MB
node 3 free: 237182 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

When the JVM starts, it spawns threads which are scheduled on cores in some random nodes. Each thread uses its local memory to be as fast as possible. A thread might enter the WAITING state at some point and later get rescheduled on a CPU, and this time it is not guaranteed to land on the same node. Now it has to access remote memory, which adds latency. Remote memory access is slower because the instructions have to traverse an interconnect link, which introduces additional hops.

Linux command numactl provides a way to bind the process to certain nodes only. It locks a process to a specific node both for execution and memory allocation. If a JVM instance is locked to a single node then the inter node traffic is removed and all memory access will happen on the fast local memory.
   --cpunodebind=nodes, -c nodes
    Only execute process on the CPUs of nodes.
I created a small test which serializes a big object repeatedly and calculates the transactions per second and latency.
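The shape of that test is roughly the following (a self-contained sketch, not the actual sources linked below; the payload size and iteration counts are made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: repeatedly serialize a large object graph and
// derive transactions/sec and average latency from the elapsed time.
public class SerializationTest implements Serializable {
    final List<long[]> payload = new ArrayList<>();

    SerializationTest(int blocks) {
        for (int i = 0; i < blocks; i++) payload.add(new long[1024]); // ~8 KB each
    }

    // returns {transactionsPerSec, avgLatencyMillis}
    static double[] benchmark(int iterations, int blocks) {
        SerializationTest big = new SerializationTest(blocks);
        long start = System.nanoTime();
        try {
            for (int i = 0; i < iterations; i++) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                    oos.writeObject(big);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return new double[]{iterations / seconds, seconds * 1000.0 / iterations};
    }

    public static void main(String[] args) {
        double[] r = benchmark(100, 256);  // 256 blocks ~= 2 MB object
        System.out.printf("%.0f tps, %.2f ms avg latency%n", r[0], r[1]);
    }
}
```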

To execute a Java process bound to one node:
numactl --cpunodebind=0 java -Dthreads=10 -jar serializationTest.jar

Ran this test on two different boxes.

Box A
4 CPU x 10 cores x 2 (hyperthreading) = Total of 80 cores
Nodes: 0,1,2,3

Box B
2 CPU x 10 cores x 2 (hyperthreading) = Total of 40 cores
Nodes: 0,1

CPU Speed : 2.4 Ghz for both.
The default setting is to use all the nodes available on the box.

NUMA policy

So we can infer that with default settings, Box A, with more nodes, performs worse on a CPU-intensive test than the 2-node Box B with its defaults. But once we bind the process to only 2 nodes, it performs equally well, probably because it has fewer nodes to hop across and the probability of a thread getting rescheduled on the same node increases to 50%.

With --cpunodebind=0, it just outperforms all the cases.

NOTE: The above test was run with 10 threads on 10 cores.

Test Jar: download
Test Sources: download

Thursday, May 31, 2012

Fast, Predictable & Highly-Available @ 1 TB/Node

The world is pushing huge amounts of data to applications every second, from mobiles, the web, and various gadgets. More applications these days have to deal with this data. To preserve performance, these applications need fast access to the data tier.

RAM prices have crumbled over the past few years, and we can now get hardware with a terabyte of RAM much more cheaply. OK, got the hardware, now what? We generally use virtualization to create smaller virtual machines to meet applications' scale-out requirements, since a Java application with a terabyte of heap is impractical: JVM garbage collection would slaughter the application right away. Ever imagined how much time a single full garbage collection of a terabyte heap would take? It can pause an application for hours, making it unusable.

BigMemory is the key to access terabytes of data with milliseconds of latency, with no maintenance of disk/raid configurations/databases.

BigMemory = Big Data + In-memory

BigMemory can utilize your hardware to the last byte of RAM. BigMemory can store up to a terabyte of data in a single Java process.

BigMemory provides "fast", "predictable" and "highly-available" data at 1 terabyte per node.
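To give an idea of the configuration involved, a BigMemory off-heap store is declared as an Ehcache off-heap tier. This is an illustrative sketch, not the actual test configuration; note that the JVM must also allow at least that much direct memory via -XX:MaxDirectMemorySize:

```xml
<ehcache>
  <!-- small heap tier in front of a large off-heap BigMemory tier -->
  <cache name="bigDataCache"
         maxBytesLocalHeap="1g"
         maxBytesLocalOffHeap="960g"/>
</ehcache>
```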

The following test uses two boxes, each with a terabyte of RAM. Leaving enough room for the OS, we were able to allocate 2 x 960 GB of BigMemory, for a total of 1.8+ TB of data, without facing high latencies or huge scale-out architectures: just using the hardware as it is.

Test results: 23K readonly transactions per second with 20 ms latency.
Graphs for test throughput and periodic latency over time.

Readonly Periodic Throughput Graph

Readonly Periodic Latency Graph

Friday, May 11, 2012

Billions of Entries and Terabytes of Data - BigMemory

Combine BigMemory and Terracotta Server Array for Performance and Scalability

The age of Big Data is upon us. With ever expanding data sets, and the requirement to minimize latency as much as possible, you need a solution that offers the best in reliability and performance. Terracotta’s BigMemory caters to the world of big data, giving your application access to literally terabytes of data, in-memory, with the highest performance and controlled latency.
At Terracotta, while working with customers and testing our releases, we continuously experiment with huge amounts of data. This blog illustrates how we were blown away by the results of a test using BigMemory and the Terracotta Server Array (TSA) to cluster and distribute data over multiple computers.

Test Configuration

Our goal was to take four large servers, each with 1TB of RAM, and push them to the limit with BigMemory in terms of data set size, as well as performance while reading and writing to this data set. As shown in Figure 1, all four servers were configured using the Terracotta Server Array to act as one distributed but uniform data cache.
Figure 1 - Terracotta Server Array with a ~4TB BigMemory Cache

We then configured 16 instances of BigMemory across the four servers in the Terracotta Server Array, with each server allotted 952 GB of BigMemory in-memory cache. This left enough free RAM available to the OS on each server. With the Terracotta Server Array, you can configure large data caches with high availability, with your choice of data striping and/or mirroring across the servers in the array (for more information on this, read here or watch this video). The end result was 3.8 terabytes of BigMemory available to our sample application for its in-memory data needs.
Next, we ran 16 instances of our test application, each on its own server, to load data and then perform read and write operations. Additionally, the Terracotta Developer Console (see Figure 2) makes it quick and simple to view and test your in-memory data performance while your application is running. Note that we could have configured BigMemory on these servers as well, thereby forming a Level 1 (L1) cache layer for even lower latency access to hot-set data. However, in this blog we decided to focus on the near-linear scalability of BigMemory as configured on the four stand-alone servers. We’ll cover L1 and L2 cache hierarchies and hot-set data in a future blog.
Figure 2 - The developer console helps configure your application's in-memory data topology.

Data Set Size (2 billion entries; 4TB total)

Now that we had our 3.8 terabytes of BigMemory up and running, we loaded close to 4 terabytes of data into the Terracotta Server Array at a rate of about 450 gigabytes per hour. The sample objects loaded were data arrays, each 1700 bytes in size. To be precise, we loaded 2 billion of these entries, with 200 GB left over for key space.
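As a quick sanity check on those numbers (a throwaway calculation, not part of the test):

```java
// 2 billion entries x 1700 bytes of value data, in decimal terabytes.
public class SizingCheck {
    static double valueTerabytes(long entries, long bytesPerEntry) {
        return entries * (double) bytesPerEntry / 1e12;
    }

    public static void main(String[] args) {
        double tb = valueTerabytes(2_000_000_000L, 1700L);
        System.out.printf("values alone: %.1f TB%n", tb);  // 3.4 TB
        // plus the key space, filling most of the 3.8 TB of BigMemory
    }
}
```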
The chart below outlines the data set configuration, as well as the test used:
Test Configuration

svn url:
Element #: 2 Billion
Value Size: 1700 bytes (simple byte arrays)
Terracotta Server #: 16
Stripes #: 16 (1 Terracotta Server per Mirror Group)
Application Node #: 16
Terracotta Server Heap: 3 GB
Application Node Heap: 2 GB
Cache Warmup Threads:
Test Threads:
Read Write %age: 10% write / 90% read
Terracotta Server BigMemory: 238 GB/Stripe. Total: 3.8 TB

The Test Results

As mentioned above, we ran 16 instances of our test application, each loading data at a rate of 4,000 transactions per second (tps) per server, for a total of 64,000 tps. At this rate we were limited mostly by our network, as 64K tps at our data sample size translates to around 110 MB per second, which is almost 1 gigabit per second (our network's theoretical maximum). Figure 3 graphs the average latency measured while loading the data.
Figure 3 - Average latency, in milliseconds, during the load phase.

The test phase consisted of operations distributed as 10% writes and 90% reads on randomly accessed keys over our data set of 2 billion entries. The chart below summarizes the incredible performance, measured in tps, of our sample application running with BigMemory.

Test Results

Warmup Avg TPS: 4k / Application Node = 64k total
Warmup Avg Latency: 24 ms
Test Avg TPS: 4122 / Application Node = 66k total
Test Avg Latency: 23 ms

We reached this level of performance with only four physical servers, each with 1 terabyte of RAM. With more servers and more RAM, we can easily scale up to 100 terabytes of data, almost linearly. This performance and scalability simply wouldn’t be possible without BigMemory and the Terracotta Server Array. 

Check out the following links for more information:
- Terracotta BigMemory:

Thursday, January 19, 2012

Performance Benefits of Pinned Cache

Terracotta is a tiered data management platform that enables applications to keep as much of their important data as close to where it needs to be as possible. It does so automatically, and can keep data retrieval times at the microsecond level for up to a terabyte of data.

Terracotta Server Array tiered architecture

With Automatic Resource Control (ARC), we try to utilize the local cache efficiently, keeping the most-accessed data nearest to the application: the more hits an entry gets, the more likely it is cached locally. If an application accesses all of its data (readonly & readwrite) equally, ARC will try to distribute the local cache space equally among all the caches. Read more in my last blog.

In some applications, we care about the throughput (or minimum latency) of access to one particular data set, which can certainly be achieved if we keep it in the local cache. Sometimes the developer or admin knows specific things about an application that allow them to make specific performance decisions about subsets of data. The admin may know that keeping the "US States Cache" in local heap keeps the system running at maximum speed.

With cache pinning, we can pin small caches (which fit in local heap) locally, while other caches are controlled by ARC. The latencies for the pinned caches stay minimal, at the microsecond level.
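As a sketch, pinning such a cache to local heap looks like this (the cache name and size here are hypothetical):

```xml
<cache name="usStatesCache" maxBytesLocalHeap="64M">
  <!-- keep every entry of this cache in local heap, exempt from ARC eviction -->
  <pinning store="localHeap"/>
</cache>
```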

Here is a test which tries to simulate this kind of scenario and brings out the performance benefits of the Pinned Caches.

The test has 4 caches, one of which is a readonly cache. Each cache loads around 250 MB of data, for a total of 1 GB. With 512 MB of local cache, ARC tries to distribute the resources equally. Since the local heap is not enough to hold all the data, the rest is stored in the lower layer.

Period Throughput Chart - All caches are Unpinned
The above chart shows all the caches being accessed equally and achieving almost the same throughput. ARC is working properly, but the readonly cache is mostly reading from the lower layers, as some of its data is pushed down to the lower layer (i.e., Terracotta) by the other caches.

After pinning the readonly cache locally, the application can access its data super fast. The readwrite caches will in any case store their updated values in the Terracotta Server Array.

Period Throughput Chart - Readonly Pinned Cache/ReadWrite Unpinned Caches

The above chart shows that with cache pinning, the readonly cache gets a huge boost: from somewhere around 1,000 tps it is now touching 500K tps. The closer we keep the data to the application, the better it performs. The readonly cache now holds 250 MB (its entire data set) out of the 512 MB of local heap.

Application Throughput
The total throughput of the app also gets a huge boost with cache pinning, as the readonly cache now delivers low latencies and high throughput.

To enable cache pinning, add the following to the cache config. Read more at
<pinning store="localHeap|localMemory|inCache"/>


To download the test click here.