Introduction to Neo4j
Although relational databases have dominated the data management space for decades, some alternative solutions are better than relational databases at solving specific categories of problems. Neo4j is a graph database designed for interacting with highly related data, such as you might find in a social graph. This article is Part 1 of a series on Neo4j. The following sections introduce you to Neo4j, share examples of when you might want to use this database, and show you how to set up a Neo4j development environment.
Understanding Neo4j Basics
The computer age began with the need to manage data. The 1970s saw the emergence of relational databases as a solution for managing data by defining entities and the relationships between them. For nearly three decades, relational databases dominated the data management scene, until the amount of data grew to Internet scale. As we moved from managing megabytes of data to managing petabytes of data, we needed new paradigms to handle data at this volume. This shift inspired the emergence of what some people call NoSQL, an unfortunate nickname because it actually groups different data store technologies by a feature they lack: SQL as a query language. Regardless of the naming convention, these new technologies enabled us to accomplish far more than we could with relational databases.
NoSQL databases fit into four general categories:
- Key/value stores such as Memcached and Redis
- Column-oriented databases such as Cassandra and HBase
- Document-oriented databases such as MongoDB and CouchDB
- Graph databases such as Neo4j and OrientDB
One big problem with accessing relational databases from object-oriented programming languages is the disconnect between how relational databases represent entities and relationships versus how object-oriented programming languages represent objects, including issues like inheritance and polymorphism. This complication is one of the primary reasons for the emergence of Object Relational Mapping (ORM) tools like Hibernate and iBATIS: bridging the gap between a relational database model and an object-oriented programming model. As we explore Neo4j, you'll see that its programming model is far more natural for object-oriented programmers.
Some categories of problems are not easily solved using relational databases. For example, let's create a social graph of users in which one user can be friends with other users. Figure 1 shows how such a data model might look.
Figure 1 Social graph relational data model.
Figure 1 shows a User table with a User_Friend table that defines friend relationships between users: A user can have zero or more friends. Let's look at how we might use this simple data model.
I am indebted to Aleksa Vukotic and Nicki Watt, co-authors of Neo4j in Action, for testing the performance of such a data model. You could query for the friends of a user's friends like this:
select count(distinct uf2.user_2)
from User_Friend uf1
inner join User_Friend uf2 on uf1.user_2 = uf2.user_1
where uf1.user_1 = ?
Likewise, if you want to find friends of friends of friends, you could execute a query like the following:
select count(distinct uf3.user_2)
from User_Friend uf1
inner join User_Friend uf2 on uf1.user_2 = uf2.user_1
inner join User_Friend uf3 on uf2.user_2 = uf3.user_1
where uf1.user_1 = ?
These are not complex queries, but to find your result set a relational database must, at least conceptually, create the Cartesian product of the two tables joined together (or, in the last case, the Cartesian product of User_Friend with itself three times) and then discard the rows that don't match. Vukotic and Watt executed these types of queries against 1,000 users with approximately 50 friends each. The following table shows their response times.
Depth | Execution Time (seconds) | Result Count
2 | 0.028 | ~900
3 | 0.213 | ~999
4 | 10.273 | ~999
5 | 92.613 | ~999
As you can see, the relational database handles joins well at shallow depths, but response times degrade sharply as the depth grows: from 0.028 seconds at depth 2 to more than 10 seconds at depth 4 and more than 90 seconds at depth 5.
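To make the cost of the self-join concrete, here is a minimal plain-Java sketch of what a depth-2 friends-of-friends query computes. It models User_Friend rows as in-memory pairs and uses a naive nested-loop join. The class, field, and method names are hypothetical, and a real database engine will usually choose a smarter join strategy, so treat this only as an illustration of why the work grows with the size of the table.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for the User_Friend table.
final class SelfJoinSketch {

    // One row of the table: user1 is friends with user2.
    static final class Row {
        final int user1;
        final int user2;

        Row(int user1, int user2) {
            this.user1 = user1;
            this.user2 = user2;
        }
    }

    // Distinct friends of friends of userId, found with a nested-loop self-join.
    // Every pair of rows is examined, so the loop body runs table.size() squared times.
    static Set<Integer> friendsOfFriends(List<Row> table, int userId) {
        Set<Integer> result = new HashSet<>();
        for (Row uf1 : table) {
            for (Row uf2 : table) {
                // uf1 links userId to a friend; uf2 links that friend onward.
                if (uf1.user1 == userId && uf1.user2 == uf2.user1) {
                    result.add(uf2.user2);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        table.add(new Row(1, 2));
        table.add(new Row(1, 3));
        table.add(new Row(2, 4));
        table.add(new Row(3, 4));
        table.add(new Row(3, 5));
        System.out.println("Friends of friends of user 1: " + friendsOfFriends(table, 1));
        System.out.println("Row pairs examined: " + (long) table.size() * table.size());
    }
}
```

Each additional level of depth adds another loop over the whole table in this naive form, which is the kind of growth the response times above reflect.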
Rather than trying to find a way to improve these queries, let's see if we can change the problem altogether. Consider the users and friendships shown in Figure 2.
Figure 2 Social relationships as a graph.
In Figure 2 we have defined users as nodes and their friendship relationships as lines (or edges) between the nodes. This is a more natural way of representing the data in an object-oriented model: starting from a user node, we can traverse all of the friendship relationships to find the user's friends. We can traverse all of the user's friends' friends in the same way.
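The idea of walking from a user to friends, and then to their friends, can be sketched in plain Java without any database. The example below is an assumption-laden illustration rather than Neo4j code: the class and method names are hypothetical, the graph is a simple adjacency list, and the walk is breadth-first with each user visited at most once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory friendship graph; not part of the Neo4j API.
final class FriendGraphSketch {

    // Adjacency list: each user maps to the users they are friends with.
    private final Map<String, List<String>> friends = new HashMap<>();

    void addFriendship(String from, String to) {
        friends.computeIfAbsent(from, key -> new ArrayList<>()).add(to);
    }

    // Users first reached exactly `depth` hops from `start`.
    // The work done depends on the users actually reached, not on the total graph size.
    Set<String> usersAtDepth(String start, int depth) {
        Set<String> visited = new HashSet<>();
        visited.add(start);
        Set<String> frontier = new LinkedHashSet<>();
        frontier.add(start);
        for (int level = 0; level < depth; level++) {
            Set<String> next = new LinkedHashSet<>();
            for (String user : frontier) {
                for (String friend : friends.getOrDefault(user, Collections.<String>emptyList())) {
                    // add() returns false for users already seen, so each user is kept once.
                    if (visited.add(friend)) {
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        return frontier;
    }

    public static void main(String[] args) {
        FriendGraphSketch graph = new FriendGraphSketch();
        graph.addFriendship("alice", "bob");
        graph.addFriendship("alice", "carol");
        graph.addFriendship("bob", "dave");
        graph.addFriendship("carol", "dave");
        graph.addFriendship("carol", "erin");
        graph.addFriendship("dave", "alice");
        System.out.println("Friends: " + graph.usersAtDepth("alice", 1));
        System.out.println("Friends of friends: " + graph.usersAtDepth("alice", 2));
    }
}
```

Here dave is reachable through both bob and carol but is reported only once, and alice is never reported as her own friend of a friend.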
The creators of graph databases saw this way of representing a social graph as being more natural, so in Neo4j you can perform a similar query to the SQL query just discussed, using code like the following:
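// graphDB is a GraphDatabaseService; nodeById is the starting user node.
TraversalDescription traversalDescription = graphDB.traversalDescription()
    .relationships( DynamicRelationshipType.withName( "IS_FRIEND_OF" ), Direction.OUTGOING )
    .evaluator( Evaluators.atDepth( 2 ) )     // keep only nodes exactly two hops away
    .uniqueness( Uniqueness.NODE_GLOBAL );    // visit each node at most once
Iterable<Node> nodes = traversalDescription.traverse( nodeById ).nodes();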
We choose a starting node and follow all "IS_FRIEND_OF" relationships from that node, searching at a depth of 2, and returning only unique nodes. The following table shows the results of this query at multiple depths.
Depth | Execution Time (seconds) | Result Count
2 | 0.04 | ~900
3 | 0.06 | ~999
4 | 0.07 | ~999
5 | 0.07 | ~999
The important thing to notice about this experiment is that the execution time tracks the size of the result set rather than the size of the data. Let's run the experiment one more time with a much larger sample set: 1 million users with approximately 50 friends each, giving us 50 million User_Friend rows. The following table compares the SQL and Neo4j response times.
Depth | SQL Execution Time (seconds) | Neo4j Execution Time (seconds) | Result Count
2 | 0.016 | 0.01 | 2,500
3 | 30.267 | 0.168 | 125,000
4 | 1,543.505 | 1.359 | 600,000
5 | Not finished in an hour | 2.132 | 800,000
As these results show, Neo4j's response time grows with the number of nodes in the result set, whereas the SQL response time grows with the size of the data and with all of the Cartesian products that must be computed for the multiple joins. Neo4j took only about two seconds to traverse 800,000 nodes, while the SQL query had not finished after an hour.
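The result counts in the table follow from simple arithmetic. Assuming each user has about 50 friends, as the 50 million rows for 1 million users imply, the number of paths of a given depth is 50 raised to that depth, and the number of distinct users reached can never exceed the total population. The sketch below computes that upper bound; the class and method names are hypothetical.

```java
// Hypothetical helper that bounds how many distinct users a traversal can reach.
final class FanOutSketch {

    // Upper bound on distinct users at `depth` hops when every user has
    // `friendsPerUser` friends: friendsPerUser^depth paths, capped by totalUsers.
    static long maxDistinctAtDepth(long friendsPerUser, int depth, long totalUsers) {
        long paths = 1;
        for (int i = 0; i < depth; i++) {
            paths *= friendsPerUser;
            if (paths >= totalUsers) {
                return totalUsers; // cannot reach more users than exist
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        long users = 1_000_000L;
        long friendsPerUser = 50L;
        System.out.println("User_Friend rows: " + users * friendsPerUser);
        for (int depth = 2; depth <= 5; depth++) {
            System.out.println("Depth " + depth + ": at most "
                    + maxDistinctAtDepth(friendsPerUser, depth, users) + " distinct users");
        }
    }
}
```

The bounds for depths 2 and 3 are 2,500 and 125,000, which match the measured result counts. From depth 4 onward the path count (6,250,000) exceeds the population, so the measured counts of 600,000 and 800,000 are limited by the number of users rather than by the fan-out.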
In summary, if your data is designed in such a way that it can be represented in a graph of related nodes, querying that data can be extremely efficient.
Getting Started with Neo4j
I hope I've convinced you that graph databases in general, and specifically Neo4j, are good at solving problems for highly related data. Assuming you're ready to give it a try, let's get Neo4j set up on your local machine. You can download Neo4j from the Neo Technology website, or you can use it in an embedded mode by adding the following dependency to your Maven POM file:
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.2.0</version>
</dependency>
The first step in using Neo4j is to create a GraphDatabaseService object that communicates with the database. If you are running Neo4j in an embedded mode, use the following code:
GraphDatabaseService graphDB = new GraphDatabaseFactory().newEmbeddedDatabase( "mydata" );
In this case, we create a reference to an embedded database instance that resides in the mydata folder. You can specify a relative or absolute path to denote where Neo4j should store its information.
If you want to run a separate Neo4j instance, download Neo4j and decompress it to your hard drive. You can start it by executing the following command from the installation directory:
bin/neo4j start
With Neo4j installed and running, you can access the web console by opening the following URL in your favorite browser:
http://localhost:7474/
You should see a screen similar to the one in Figure 3.
Figure 3 Neo4j web console.
The top panel in the web console (the part with the dollar sign) allows you to enter queries in Cypher, which is Neo4j's custom query language. (We'll cover Cypher in Part 3 of this series.) Once you have data in Neo4j, the "three circles" icon in the upper-left corner will allow you to view the data graphically.
We can connect to Neo4j programmatically with the following code snippet:
GraphDatabaseService graphDB = new RestGraphDatabase( "http://localhost:7474/db/data" );
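Neo4j exposes a Representational State Transfer (REST) interface that you can access with any REST client. If you create a RestGraphDatabase instead (a class that ships in the separate neo4j-rest-graphdb Java binding rather than in the core neo4j artifact), you get the same GraphDatabaseService interface that the embedded database uses.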
With our GraphDatabaseService in hand, let's create a node:
try ( Transaction tx = graphDB.beginTx() ) {
    Node userNode = graphDB.createNode();
    tx.success();   // mark the transaction as successful so that it commits on close
}   // the transaction is closed automatically here; without success() the work is rolled back
In addition to providing graph functionality, Neo4j is a fully ACID-compliant database (atomic, consistent, isolated, and durable), just like its SQL counterparts. From a practical standpoint, this means that you need to wrap all node creations, updates, deletes, and queries inside a transaction.
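The transaction pattern relies on Java's try-with-resources statement: the resource is closed automatically at the end of the block, whether or not an exception was thrown. The following plain-Java sketch mimics the "mark success, then close" behavior with a stand-in class. SketchTransaction and runInTransaction are hypothetical names for illustration only and are not part of Neo4j.

```java
// Illustrates the try-with-resources transaction pattern with a stand-in class.
final class TransactionPatternSketch {

    // Hypothetical stand-in for a transaction; NOT the Neo4j Transaction class.
    static final class SketchTransaction implements AutoCloseable {
        private final StringBuilder log;
        private boolean successful;

        SketchTransaction(StringBuilder log) {
            this.log = log;
        }

        // Marks the work as complete so that close() commits it.
        void success() {
            successful = true;
        }

        // Called automatically when the try block exits, even on an exception.
        @Override
        public void close() {
            log.append(successful ? "commit;" : "rollback;");
        }
    }

    // Runs `work` inside a stand-in transaction and records the outcome in `log`.
    static void runInTransaction(StringBuilder log, Runnable work) {
        try (SketchTransaction tx = new SketchTransaction(log)) {
            work.run();
            tx.success(); // skipped if work.run() throws
        } catch (RuntimeException e) {
            // The resource has already been closed by the time this block runs.
            log.append("failed: ").append(e.getMessage()).append(';');
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        runInTransaction(log, () -> { /* pretend to create a node */ });
        runInTransaction(log, () -> { throw new IllegalStateException("boom"); });
        System.out.println(log); // prints "commit;rollback;failed: boom;"
    }
}
```

Because the resource is closed before any catch block runs, there is no need for an explicit finally clause, and the transaction variable does not have to be visible outside the try block.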
We have effectively created a node, but it isn't very interesting at this point. Let's add a name property to our user node. We can accomplish that by invoking the setProperty() method:
userNode.setProperty( "name", "Steve" );
Neo4j property names must be Strings, but property values can be any of the following types: String, char, boolean, byte, short, int, long, double, float, or an array of one of these types. For example, I can store my age as an int with the following command:
userNode.setProperty( "age", 43 );
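As a rough illustration of the type rule, the following plain-Java sketch checks whether a value falls within the listed types. The class and method names are hypothetical, this is not Neo4j's own validation logic, and treating null and nested arrays as unsupported is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical checker that models the list of property value types given in the text.
final class PropertyValueSketch {

    // Scalar values arrive as objects, so primitives appear in their boxed form.
    private static final Set<Class<?>> SCALAR_TYPES = new HashSet<>(Arrays.<Class<?>>asList(
            String.class, Character.class, Boolean.class, Byte.class, Short.class,
            Integer.class, Long.class, Double.class, Float.class));

    // Element types of primitive arrays such as int[] or double[].
    private static final Set<Class<?>> PRIMITIVE_TYPES = new HashSet<>(Arrays.<Class<?>>asList(
            char.class, boolean.class, byte.class, short.class,
            int.class, long.class, double.class, float.class));

    // True if `value` is one of the listed scalar types or a one-dimensional array of them.
    static boolean isSupported(Object value) {
        if (value == null) {
            return false; // assumption: null is not treated as a valid value
        }
        Class<?> type = value.getClass();
        if (type.isArray()) {
            Class<?> element = type.getComponentType();
            return PRIMITIVE_TYPES.contains(element) || SCALAR_TYPES.contains(element);
        }
        return SCALAR_TYPES.contains(type);
    }

    public static void main(String[] args) {
        System.out.println("\"Steve\": " + isSupported("Steve"));
        System.out.println("43: " + isSupported(43));
        System.out.println("int[]: " + isSupported(new int[] {1, 2, 3}));
        System.out.println("Object: " + isSupported(new Object()));
    }
}
```

Because values of several types share one property API, a value read back from a node comes out as a general Object and has to be cast to the type you stored.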
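Neo4j gives each node a unique identifier that you can use to retrieve the node. The ID is assigned by Neo4j, so rather than assuming a particular value, record the value returned by userNode.getId() when you create the node. With that ID in hand (stored here in a long named nodeId), we can retrieve the node we inserted with the following command:
Node userNode = graphDB.getNodeById( nodeId );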
And then we can retrieve the name property as follows:
// getProperty() returns an Object, so cast it to the type that was stored
String name = (String) userNode.getProperty( "name" );
In Part 2 of this series, we'll explore how to search for nodes, but this brief introduction should give the more curious developer enough information to start being dangerous.
Summary
This article, Part 1 of a three-part series on Neo4j, provided a brief overview of why graph databases are better than relational databases at solving certain categories of problems. Now that your Neo4j environment is set up, Part 2 will review the Java API for creating nodes and relationships and explore the Traversal API for searching a graph database. In the final article in this series, we'll explore Cypher and see how to leverage it to execute complex queries.