NoSQL Options for Java Developers

Matt Raible

The Java community is one I know and love, so even though a NoSQL database is rarely tied to a language, I’m writing this article for you, Java developers around the world. In this article, I’ll show you several options for NoSQL databases. After exploring all the options, I’ll narrow the choices down to the top five based on Indeed Jobs, GitHub stars, and Stack Overflow tags. Then I’ll let you know if they’re supported by Spring Data and Spring Boot.

Why NoSQL?

NoSQL databases have helped many web-scale companies achieve high scalability through eventual consistency: because a NoSQL database is often distributed across several machines, with some latency, it guarantees only that all instances will eventually be consistent. Eventually consistent services are often called BASE (basically available, soft state, eventual consistency) services in contrast to traditional ACID properties.
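
To make the idea concrete, here is a minimal sketch of eventual consistency in plain Java. It is a toy, single-process model and not how any real NoSQL database is implemented: the class name `EventualConsistencyDemo`, the primary/replica split, and the `replicate()` method are all assumptions made for illustration. A write lands on the primary immediately, while the replica only sees it once replication catches up.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of eventual consistency (hypothetical, for illustration only).
final class EventualConsistencyDemo {

    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    // Writes accepted by the primary but not yet applied to the replica.
    private final Queue<String[]> pending = new ArrayDeque<>();

    // The primary accepts the write right away and queues it for the replica.
    void write(String key, String value) {
        primary.put(key, value);
        pending.add(new String[] {key, value});
    }

    String readFromPrimary(String key) {
        return primary.get(key);
    }

    // May return a stale value (or null) until replicate() has run.
    String readFromReplica(String key) {
        return replica.get(key);
    }

    // Simulates replication catching up after some latency.
    void replicate() {
        String[] write;
        while ((write = pending.poll()) != null) {
            replica.put(write[0], write[1]);
        }
    }

    public static void main(String[] args) {
        EventualConsistencyDemo db = new EventualConsistencyDemo();
        db.write("user:1", "Matt");
        // Read before replication: the replica has not seen the write yet.
        System.out.println("replica before sync: " + db.readFromReplica("user:1"));
        db.replicate();
        // Read after replication: the replica now agrees with the primary.
        System.out.println("replica after sync:  " + db.readFromReplica("user:1"));
    }
}
```

The window between `write` and `replicate` is the “soft state” in BASE: reads are always answered, but they may briefly lag behind the latest write.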

Selecting NoSQL Candidates

Defining the top five can be difficult. Many folks have attempted to do this recently. See the Research & Notes section at the end of this article for reference.

In mid-August, I told my followers on Twitter that I was writing this article. I asked for good/bad stories about NoSQL databases and received a number of options people wanted me to include.

I received many suggestions, listed in alphabetical order below:

  1. ArangoDB
  2. Cassandra
  3. Couchbase
  4. DynamoDB
  5. FaunaDB
  6. Hazelcast
  7. MongoDB
  8. Neo4j
  9. PostgreSQL JSON
  10. Redis
  11. (JetBrains) Xodus

People also mentioned Hibernate OGM (JPA for NoSQL) and NoSQLUnit as tools to help access and test NoSQL databases.

Note that I didn’t receive any requests for CouchDB, HBase, Elasticsearch, or Solr. CouchDB and Couchbase are often confused because of their similar names, but they’re quite different. Since CouchDB is a document store, I included it in my rankings. I also added HBase since it is mentioned by ITBusinessEdge, KDnuggets, and DB-Engines (see the Research & Notes section). I didn’t include Elasticsearch or Solr because I believe those aren’t often used as the primary data store.

Raible’s Ranking Technique

I used Indeed Jobs, GitHub Stars, Stack Overflow tags, and Docker pulls to develop my system of ranking the top five NoSQL databases.

Indeed Jobs

I searched on Indeed Jobs without a location and found very few surprises, save for Amazon’s DynamoDB showing up as a top contender.

Indeed Jobs, September 2017

NOTE: It’s difficult to search for “PostgreSQL JSON” because most listings specify “PostgreSQL” as a requirement, not its NoSQL support. I searched for “postgres + json”. Xodus is the name of a company, so I had to tack on “JetBrains” to ensure accurate results.

GitHub Stars

I searched and found the top five NoSQL options by GitHub stars are Redis, MongoDB, ArangoDB, Neo4j, and Cassandra.

GitHub Stars, September 2017

NOTE: Cassandra, HBase, and PostgreSQL are mirrored repositories. DynamoDB, Couchbase, and FaunaDB don’t have their servers on GitHub, so I counted stars for their Java-based drivers. Counting stars for each option’s Java driver would be a good idea, but there are 11 drivers just for Redis.

You can use Tim Qian’s star-history project to see the star growth of these five.

GitHub Star History

Stack Overflow Tags

I searched on Stack Overflow for tags for each and found that MongoDB and PostgreSQL are the most popular, followed by Neo4j, Cassandra, and Redis.

Stack Overflow Tags, September 2017

Docker Pulls

I searched on Docker Hub for images and found the stats to be 10M+ for a few, 5M+ for Neo4j, and 1M+ for many others. FaunaDB and JetBrains Xodus don’t seem to have images available.

Docker Pulls, September 2017

After gathering this information, it didn’t seem very relevant to include these stats in my ranking. My reason is two-fold: because the numbers aren’t exact and because there weren’t “official” images for each option.

NoSQL Options Matrix

I created a matrix to combine jobs, stars, and tags. I awarded 1-5 points based on the ranking they scored in each category. If an option didn’t make the top five, it received a zero. The results – MongoDB, Redis, Cassandra, Neo4j, and PostgreSQL – are in the table below.

NoSQL Option      Jobs  Stars  Tags  Total
MongoDB              5      4     5     14
Redis                3      5     1      9
Cassandra            4      1     2      7
Neo4j                0      2     3      5
PostgreSQL           0      0     4      4
ArangoDB             0      3     0      3
HBase                2      0     0      2
DynamoDB             1      0     0      1
Couchbase            1      0     0      1
CouchDB              0      0     0      0
Hazelcast            0      0     0      0
JetBrains Xodus      0      0     0      0
FaunaDB              0      0     0      0
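
If you want to check the arithmetic, the scoring can be sketched in a few lines of Java. This is only an illustration: the class `NoSqlScores` and its methods are hypothetical names, and the per-category points are copied from the table above rather than recomputed from the raw job, star, and tag counts.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that reproduces the points arithmetic from the matrix.
final class NoSqlScores {

    // 1st place earns 5 points, 2nd earns 4, down to 1 point for 5th; anything else earns 0.
    static int pointsForRank(int rank) {
        return (rank >= 1 && rank <= 5) ? 6 - rank : 0;
    }

    // Points per option in the order {jobs, stars, tags}, copied from the table.
    static Map<String, int[]> tablePoints() {
        Map<String, int[]> points = new LinkedHashMap<>();
        points.put("MongoDB", new int[] {5, 4, 5});
        points.put("Redis", new int[] {3, 5, 1});
        points.put("Cassandra", new int[] {4, 1, 2});
        points.put("Neo4j", new int[] {0, 2, 3});
        points.put("PostgreSQL", new int[] {0, 0, 4});
        points.put("ArangoDB", new int[] {0, 3, 0});
        points.put("HBase", new int[] {2, 0, 0});
        points.put("DynamoDB", new int[] {1, 0, 0});
        points.put("Couchbase", new int[] {1, 0, 0});
        return points;
    }

    // Total score is the sum of the three category scores.
    static int total(int[] categoryPoints) {
        return categoryPoints[0] + categoryPoints[1] + categoryPoints[2];
    }

    // Sorts options by total score, highest first, and keeps the first five.
    static List<String> topFive(Map<String, int[]> points) {
        return points.entrySet().stream()
                .sorted(Comparator.comparingInt(
                        (Map.Entry<String, int[]> e) -> total(e.getValue())).reversed())
                .limit(5)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topFive(tablePoints()));
    }
}
```

Running `main` prints the five highest-scoring options in order, which should match the results listed before the table.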

If you look at DB-Engines Ranking for their top five options, you’ll find PostgreSQL, MongoDB, Cassandra, Redis, and HBase.

DB-Engines Ranking, September 2017

Will you look at that - our top five results are pretty close!

Overview of NoSQL Options

Since my top five results are pretty close to what DB-Engines has, I’ll use mine as the top five. Below is an overview of each one, along with information about their Spring Boot support.

You might ask “Why Spring Boot?” My answer is simple: because Spring Boot adoption is high. According to Redmonk’s recent look at Java frameworks, Spring Boot adoption grew 76% between September 2016 and June 2017.

The Spring Boot Explosion

And things haven’t slowed down since June: Maven downloads in August 2017 were 22.2 million.

MongoDB

MongoDB was founded in 2007 by the folks behind DoubleClick, ShopWiki, and Gilt Groupe. It uses the Apache and GNU AGPL licenses on GitHub. Its many large customers include Adobe, eBay, and eHarmony.

  • Available on start.spring.io? Yes, including embedded MongoDB for testing.
  • Supported by Spring Data? Yes, via Spring Data MongoDB.
  • Bonus: Supported by Hibernate OGM, NoSQLUnit, and JHipster.

Redis

Redis stands for REmote DIctionary Server and was started by Salvatore Sanfilippo. It was initially released on April 10, 2009. According to redis.io, Redis is a BSD-licensed in-memory data structure store and can be used as a database, cache, and message broker. Well-known companies using Redis include Twitter, GitHub, Snapchat, and Craigslist.

  • Available on start.spring.io? Yes.
  • Supported by Spring Data? Yes, via Spring Data Redis.
  • Bonus: Supported by NoSQLUnit. Hibernate OGM support is in progress.

Cassandra

Cassandra is “a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure” (from “Cassandra – A structured storage system on a P2P Network” on the Facebook Engineering blog). It was initially developed at Facebook to power its Inbox Search feature. Its creators, Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant Malik, released it as an open-source project in July 2008. In March 2009, it became an Apache Incubator project, and it graduated to a top-level project in February 2010.

In addition to Facebook, Cassandra helps a number of other companies achieve web scale. It has some impressive numbers about scalability on its homepage.

One of the largest production deployments is Apple’s, with over 75,000 nodes storing over 10 PB of data. Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).

  • Available on start.spring.io? Yes.
  • Supported by Spring Data? Yes, via Spring Data Cassandra.
  • Bonus: Supported by NoSQLUnit and JHipster. Hibernate OGM support is in progress.

Neo4j

Neo4j is available as a GPL3-licensed “community edition” with some extensions licensed under the Affero GPL. The community edition is limited to running on one node and does not contain clustering support or hot backups. Neo4j’s “enterprise edition” has scale-out capabilities, in-memory page cache, and hot backups. A 30-day trial is available; no pricing is provided.

Neo4j is best known as a graph database, where everything is stored as an edge, a node, or an attribute. Version 1.0 was released in February 2010, and Neo4j, Inc. has developed it since the beginning. Its large customers include Walmart, Airbnb, Monsanto, and eBay.

  • Available on start.spring.io? Yes.
  • Supported by Spring Data? Yes, via Spring Data Neo4j.
  • Bonus: Supported by Hibernate OGM and NoSQLUnit.

PostgreSQL JSON

PostgreSQL is a traditional relational database management system (RDBMS) that has NoSQL support via its native JSON support (added in version 9.2). In 9.4, they added support for Binary JSON (aka JSONB) and indexes.

Leigh Halliday explains how you can unleash the power of storing JSON in Postgres in a blog post dated June 2017. Halliday goes on to show how this can be used with Ruby on Rails. A blog post from Umair Shahid shows how to process PostgreSQL JSON & JSONB data in Java.

I’m not sure that PostgreSQL and its JSON support should be included as a recommended NoSQL option. However, it likely makes sense if you’re already using PostgreSQL and want to make your data schema more free-flowing. As Dj Walker-Morgan says, “PostgreSQL 9.5 isn’t your next JSON database, but it is a great relational database with a fully fledged JSON story.”

  • Available on start.spring.io? Yes.
  • Supported by Spring Data? Yes, via Spring Data JPA.

Recommendation

I feel good about how this analysis played out. As a committer on the JHipster project, I’m well aware of the strength of that team, and I think its support for MongoDB and Cassandra is a pretty strong endorsement. It’s interesting to see that there’s work in progress to add Couchbase too.

But I’m not stopping there. I shared this analysis with a few experts I know in the Java and NoSQL communities and asked them the following questions:

  1. Do you agree with my choices of the top 5 NoSQL options (MongoDB, Redis, Cassandra, Neo4j, and PostgreSQL with its JSON support)?
  2. Do you have any good or bad stories about using any of these databases in production?
  3. Have you found any of these databases particularly difficult to get started with or maintain over time?
  4. What is your favorite NoSQL database and why?
  5. Anything else you’d like to share?

Please check back in a few weeks! I’ll post the answers to these questions from the experts I interviewed. I’ll also update this post to point to it when I do. If you’re an expert on NoSQL databases, let me know! I’d be happy to include your answers in the interview. Just send me a message to @mraible on Twitter or matt.raible@okta.com.

Update: Part II has been published. Many thanks to Justin McCarthy, Rafal Glowinski, Vlad Mihalcea, and Laurent Doguin for answering these questions!

Research & Notes

ITBusinessEdge has a slideshow about the top five NoSQL databases. However, there’s no date on the article, and it says Redis Labs made the selection. The slideshow lists MongoDB, Cassandra, Redis, CouchDB, and HBase.

Matthew Mayo, editor of KDnuggets, wrote a similar article about Top NoSQL Database Engines in June 2016. Mayo used db-engines.com ranking and Google Trends to select the top five: MongoDB, Cassandra, Redis, HBase, and Neo4j.

Hackernoon has an Infographic of the most popular NoSQL databases that are “worth your notice.” This article is from June 2017, and the comments say the rankings are based on stats from https://db-engines.com/en/ranking_trend.

Hackernoon Infographic

NOTE: If you look at this ranking today (September 6, 2017), you’ll see that Redis has replaced Couchbase. Or maybe Hackernoon skipped over Redis? It also raises the question: is Elasticsearch a NoSQL database, or a search engine? Should Solr be considered a NoSQL database as well? Both show up in DB-Engines Ranking Trends.

DB-Engines Ranking Graph

JAXenter published the results of their annual survey of top database trends on March 30, 2017. They list Elasticsearch and Solr as databases. They also include Apache Spark and Hadoop. MongoDB, Cassandra, Redis, and Neo4j are the most interesting “NoSQL” databases. Hazelcast is listed as the top in-memory data grid, over CouchDB and Oracle.

JAXenter Top Database Trends
