Data Grid. Why?

November 7, 2012

Data Grid, Enterprise Topics, Infrastructure Topics, NOSQL

Why use a data grid? This post aims to answer that very question.

First things first. Evolution.

Local Cache > Clustered Cache > Distributed Cache (Data Grid)

The reasons to use a distributed cache include the reasons to use a clustered cache, and the reasons to use a clustered cache include the reasons to use a local cache.

Performance

It is faster to access an object from local memory than it is to access data from a remote data store (e.g. database).

It is faster to access an existing object than it is to create a new object from data.

The data may be stored in multiple data stores.
The data may have to be retrieved using multiple queries.
The data may be complex.
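A minimal sketch of the local cache idea, using only the Java standard library. The LocalCache class and its loader are hypothetical names for illustration. The loader stands in for whatever queries and object assembly the application would otherwise repeat on every request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical read-through local cache (illustrative, not a product API).
class LocalCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LocalCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached object and invokes the loader only on a miss.
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }
}

class LocalCacheDemo {
    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        LocalCache<Integer, String> customers = new LocalCache<>(id -> {
            loads.incrementAndGet();      // counts trips to the "data store"
            return "customer-" + id;      // stands in for building a complex object
        });
        customers.get(42);                // miss: the loader runs
        customers.get(42);                // hit: served from local memory
        System.out.println("loads = " + loads.get());   // prints "loads = 1"
    }
}
```

The second read never reaches the loader, which is the whole point: the cost of the queries and the object construction is paid once.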

In addition, a data grid supports performance optimizations that are not available in a clustered cache. For example, an application can rely on data affinity to ensure that related objects are stored in a cache in the same node.
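Data affinity can be sketched as follows. This is an illustration under assumptions, not how JBoss Data Grid places data: GroupedKey and ownerOf are hypothetical names, and the modulo placement is a simplification chosen to keep the example short. The idea is that the owning node is derived from a shared group rather than from the full key.

```java
import java.util.Objects;

// Hypothetical key type: 'group' is the affinity key shared by related objects.
final class GroupedKey {
    final String group;   // e.g. the customer the object belongs to
    final String id;      // e.g. "order-1001"

    GroupedKey(String group, String id) {
        this.group = Objects.requireNonNull(group);
        this.id = Objects.requireNonNull(id);
    }
}

class AffinityDemo {
    // Picks the owning node from the group alone, so every key in a group
    // maps to the same node regardless of its own id.
    static int ownerOf(GroupedKey key, int nodeCount) {
        return Math.floorMod(key.group.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        int nodes = 4;
        GroupedKey profile = new GroupedKey("customer-7", "profile");
        GroupedKey order = new GroupedKey("customer-7", "order-1001");
        // Both objects land on the same node, so reading them together stays local to it.
        System.out.println(ownerOf(profile, nodes) == ownerOf(order, nodes));   // prints "true"
    }
}
```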

Further, JBoss Data Grid supports performance optimizations that may not be available in competing data grid products. For example, JBoss Data Grid can be configured to use asynchronous communication and it includes an asynchronous API.
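To show why an asynchronous API helps, here is a small sketch. AsyncCache is a hypothetical stand-in built on the standard CompletableFuture class. It is not the JBoss Data Grid API, and the product's own method names and return types may differ.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache facade used only to illustrate asynchronous writes.
class AsyncCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    // Starts the write on another thread and returns immediately.
    CompletableFuture<Void> putAsync(K key, V value) {
        return CompletableFuture.runAsync(() -> store.put(key, value));
    }

    V get(K key) {
        return store.get(key);
    }
}

class AsyncDemo {
    public static void main(String[] args) {
        AsyncCache<String, String> cache = new AsyncCache<>();
        // Issue both writes without waiting for either one...
        CompletableFuture<Void> first = cache.putAsync("k1", "v1");
        CompletableFuture<Void> second = cache.putAsync("k2", "v2");
        // ...then wait once for both to complete.
        CompletableFuture.allOf(first, second).join();
        System.out.println(cache.get("k1") + " " + cache.get("k2"));   // prints "v1 v2"
    }
}
```

With a synchronous API the caller would pay the latency of each write in turn. Here the writes overlap and the caller waits once.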

Consistency

A local cache is practical if an application is deployed to a single application server. If an application is deployed to multiple application servers, a local cache is not practical. The problem is stale data. A clustered cache relies on replication and invalidation to solve the problem of stale data.
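The difference between replication and invalidation can be sketched with an in-process model. CacheNode and its Mode enum are hypothetical. A real clustered cache exchanges these messages over the network, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-process model of a cluster member.
class CacheNode {
    enum Mode { REPLICATION, INVALIDATION }

    final Map<String, String> local = new HashMap<>();
    final List<CacheNode> peers = new ArrayList<>();
    final Mode mode;

    CacheNode(Mode mode) {
        this.mode = mode;
    }

    // Writes locally, then tells every peer to either copy or drop the entry.
    void put(String key, String value) {
        local.put(key, value);
        for (CacheNode peer : peers) {
            if (mode == Mode.REPLICATION) {
                peer.local.put(key, value);   // the peer now holds the new value
            } else {
                peer.local.remove(key);       // the peer must reload on its next read
            }
        }
    }
}

class ClusteredCacheDemo {
    // Builds two nodes that know about each other.
    static CacheNode[] pair(CacheNode.Mode mode) {
        CacheNode a = new CacheNode(mode);
        CacheNode b = new CacheNode(mode);
        a.peers.add(b);
        b.peers.add(a);
        return new CacheNode[] { a, b };
    }

    public static void main(String[] args) {
        CacheNode[] replicated = pair(CacheNode.Mode.REPLICATION);
        replicated[1].local.put("customer-7", "old");
        replicated[0].put("customer-7", "new");
        System.out.println(replicated[1].local.get("customer-7"));    // prints "new"

        CacheNode[] invalidated = pair(CacheNode.Mode.INVALIDATION);
        invalidated[1].local.put("customer-7", "old");
        invalidated[0].put("customer-7", "new");
        System.out.println(invalidated[1].local.get("customer-7"));   // prints "null"
    }
}
```

Either way, the second node can no longer serve the old value.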

In addition to support for JTA transactions, a data grid supports XA (distributed) and two-phase commit (2PC) transactions.

Further, JBoss Data Grid supports consistency options that may not be available in competing data grid products. For example, JBoss Data Grid supports transaction recovery and includes a version API (remove / replace with version).
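A version API enables optimistic updates: a write succeeds only if the entry has not changed since it was read. The sketch below shows the idea with the standard library. VersionedCache and its method names are hypothetical and only loosely modelled on the "remove / replace with version" operations mentioned above. The product's actual signatures may differ.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical versioned cache (illustrative, not the JBoss Data Grid API).
class VersionedCache<K, V> {
    static final class Versioned<V> {
        final V value;
        final long version;

        Versioned(V value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentMap<K, Versioned<V>> store = new ConcurrentHashMap<>();

    // Unconditional write: the version starts at 1 and increases on every overwrite.
    void put(K key, V value) {
        store.merge(key, new Versioned<V>(value, 1L),
                (old, fresh) -> new Versioned<V>(value, old.version + 1));
    }

    Versioned<V> getVersioned(K key) {
        return store.get(key);
    }

    // Replaces the value only if the entry still has the version the caller read.
    boolean replaceWithVersion(K key, V newValue, long expectedVersion) {
        Versioned<V> current = store.get(key);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        return store.replace(key, current, new Versioned<V>(newValue, expectedVersion + 1));
    }

    // Removes the entry only if it still has the version the caller read.
    boolean removeWithVersion(K key, long expectedVersion) {
        Versioned<V> current = store.get(key);
        return current != null && current.version == expectedVersion && store.remove(key, current);
    }
}

class VersionDemo {
    public static void main(String[] args) {
        VersionedCache<String, Integer> cache = new VersionedCache<>();
        cache.put("balance", 100);
        long seen = cache.getVersioned("balance").version;
        System.out.println(cache.replaceWithVersion("balance", 90, seen));   // prints "true"
        System.out.println(cache.replaceWithVersion("balance", 80, seen));   // prints "false" (stale version)
        System.out.println(cache.getVersioned("balance").value);             // prints "90"
    }
}
```

The second update is rejected because it was based on a version that is no longer current. The caller has to read the entry again and retry.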

Scalability

The difference between a clustered cache and a data grid is scalability. A data grid is scalable. The data is distributed via dynamic partitions. As a result, adding a node increases both throughput and capacity.
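A back-of-the-envelope calculation makes the capacity claim concrete. The figures (16 GB per node) and the 'copies' parameter (how many nodes hold each entry in the distributed case) are assumptions chosen for illustration, and the model ignores overhead.

```java
// Simple capacity model under the assumptions stated above.
class CapacityDemo {
    // Replicated: every node holds every entry, so extra nodes add no capacity.
    static long replicatedCapacityGb(int nodes, long perNodeGb) {
        return perNodeGb;
    }

    // Distributed: the entries are partitioned, so capacity grows with the node count.
    static long distributedCapacityGb(int nodes, long perNodeGb, int copies) {
        return nodes * perNodeGb / copies;
    }

    public static void main(String[] args) {
        for (int nodes = 4; nodes <= 5; nodes++) {
            System.out.println(nodes + " nodes: replicated="
                    + replicatedCapacityGb(nodes, 16) + " GB, distributed="
                    + distributedCapacityGb(nodes, 16, 2) + " GB");
        }
        // prints "4 nodes: replicated=16 GB, distributed=32 GB"
        // prints "5 nodes: replicated=16 GB, distributed=40 GB"
    }
}
```

Adding the fifth node does nothing for the replicated cache. It adds 8 GB to the distributed one.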

Further, JBoss Data Grid uses a consistent hashing algorithm to minimize the impact of adding or removing a node. When a node is added or removed, only a subset of the data is rebalanced. As such, adding or removing a node has an impact on a subset of the nodes in the data grid whereas alternative algorithms may have an impact on every node in the data grid.
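The general technique can be sketched as a hash ring. This is a textbook-style illustration, not JBoss Data Grid's implementation: HashRing, the number of ring points per node, and the hash mixing function are all assumptions made for the example.

```java
import java.util.TreeMap;

// Minimal hash ring: each node owns several points on a ring of int hashes.
class HashRing {
    private static final int POINTS_PER_NODE = 64;   // assumed value
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Spreads the bits of String.hashCode() so that similar strings scatter.
    static int hash(String s) {
        int h = s.hashCode();
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    void addNode(String node) {
        for (int i = 0; i < POINTS_PER_NODE; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    // The owner is the first ring point at or after the key's hash, wrapping around.
    // Assumes at least one node has been added.
    String ownerOf(String key) {
        Integer point = ring.ceilingKey(hash(key));
        return ring.get(point != null ? point : ring.firstKey());
    }
}

class HashRingDemo {
    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");

        String[] before = new String[1000];
        for (int i = 0; i < before.length; i++) {
            before[i] = ring.ownerOf("key-" + i);
        }

        ring.addNode("node-D");
        int moved = 0;
        for (int i = 0; i < before.length; i++) {
            if (!ring.ownerOf("key-" + i).equals(before[i])) {
                moved++;   // every key counted here now belongs to node-D
            }
        }
        System.out.println("moved " + moved + " of " + before.length + " keys");
    }
}
```

A key either keeps its owner or moves to the new node. Nothing is shuffled between the existing nodes, which is why only a subset of the data is rebalanced.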

Remote

Another difference between a clustered cache and a data grid is remote access.

When a data grid is embedded in an application, it is coupled to the application. Therefore, scaling an embedded data grid requires scaling the application. As a result, scaling the data grid increases administration costs associated with the application and application server infrastructure.

Example. A web application is deployed to multiple application servers. The web application uses an embedded data grid. The embedded data grid reaches capacity. It has to be scaled out.

What does that mean?

A new application server is installed and configured. The application is deployed to the new application server.

What’s the problem?

Administrators are now responsible for an additional application server, and the application is now deployed to it. The problem is the increased administration costs (e.g. data source configuration / application deployment) associated with the application and application server.

If the application is redeployed, the data grid node is redeployed. *
If the data grid is upgraded, the application is upgraded (and redeployed).

* When a data grid node is redeployed, the data grid topology is changed twice: once when the node is removed and once when the node is added. When the data grid topology is changed, the data is rebalanced (albeit a subset of the data). As a result, redeploying a single application instance with an embedded data grid node has an impact on the data grid itself. Twice.

What if the data grid needs to scale faster than the application itself?

Does it make sense that the application has to be deployed to an additional application server when it is not necessary for the application itself? What about resource usage? Does it make sense that the application server infrastructure has to suffer resource underutilization in order to scale the data grid?

The solution is a remote data grid.

Example. A web application is deployed to multiple application servers. The web application uses a remote data grid. The data grid reaches capacity. It has to be scaled out.

What does that mean?

A new data grid server is installed and configured. That’s it.

This allows the data grid to scale out independently of the application server infrastructure. It allows for data grid servers to be assigned different resources than application servers. For example, a data grid server may be assigned more memory but fewer processor cores than an application server.

It allows the data grid infrastructure to be administered and managed independently of the application server infrastructure.

The data grid can be upgraded independently of the application. The application can be redeployed without having an impact on the data grid itself.

Infrastructure

A remote data grid functions as a top level component of the infrastructure whereas an embedded data grid functions as a second level component of the infrastructure.

So?

Example. An enterprise has an application deployed to an application server cluster with an embedded data grid. Then the enterprise…

Adds a service deployed to an enterprise service bus (ESB) with an embedded data grid.
Adds a portal deployed to an enterprise portal platform with an embedded data grid.
Adds rules deployed to a business rules management system (BRMS) with an embedded data grid.

What’s the problem?

The enterprise now has multiple independent data grids to administer and manage, which increases administrative costs. If the data grids store the same data (e.g. customer data), stale data is a problem, just as it is with an application deployed to multiple application servers with a local cache: if data is updated in one data grid, that data becomes stale in the other data grids. Efficiency is a problem too. If multiple data grids store the same data, then only a subset of their combined capacity is used, just as it is with a clustered, replicated cache.

The solution is a remote data grid as a top level infrastructure component.

While there are benefits to using an embedded data grid, they may or may not outweigh the benefits of a remote data grid.

To recap, the benefits of a data grid are scalability, remote access, and its role as a top level infrastructure component, along with the benefits of both local and clustered caches: performance and consistency.

Coming Soon – RadarGun, Performance Testing, & Benchmarks