In “First Steps with RedisGraph“, after getting up and running, we used a couple of simple graphs to understand what we can do with Cypher and RedisGraph.
This time, we will look at a third and more complex example: building and querying a family tree.
For me, this not just an interesting example, but a matter of personal interest and the reason why I am learning graph databases in the first place. In 2001, I came upon a Family Tree application from the Windows 95 era, and gradually built out my family tree. By the time I realised that it was getting harder to run with each new version of Windows, it was too big to easily and reliably migrate all the data to a new system. Fortunately, Linux is more capable of running this software than Windows.
This software, and others like it, allow you to do a number of things. The first and most obvious is data entry (manually or via an import function) in order to build the family tree. Other than that, they also allow you to query the structure of the family tree, bringing out visualisations (such as descendant trees, ancestor trees, chronological trees etc), statistics (e.g. average age at marriage, life expectancy, average number of children, etc), and answers to simple questions (e.g. who died in 1952?).
An Example Family Tree
In order to have something we can play with, we’ll use this family tree:
This data is entirely fictitious, and while it is a non-trivial structure, I would like to point out a priori several assumptions and design decisions that I have taken in order to keep the structure simple and avoid getting lost in the details of this already lengthy article:
- All children are the result of a marriage. Obviously, this is not necessarily the case in real life.
- All marriages are between a husband and a wife. This is also not necessarily the case in real life. Note that this does not exclude that a single person may be married multiple times.
- When representing dates, we are focusing only on the year in order to avoid complicating things with date arithmetic. In reality, family tree software should not just cater for full dates, but also for dates where some part is unknown (e.g. 1896-01-??).
- Parent-child relationships are represented as childOf arrows, from the child to each parent. This approach is quite different from others you might come across (such as those documented by Rik Van Bruggen). It allows us to maintain a simple structure while not duplicating any information (because the year of birth is stored with the child).
- A man marries a woman. In reality, it should be a bidirectional relationship, but we cannot have that in RedisGraph without having two relationships in opposite directions. Having the relationship go in a single direction turns out to be enough for the queries we need, so there is no need to duplicate that information. The direction was chosen arbitrarily and if anyone feels offended, you are more than welcome to reverse it.
Loading Data in RedisGraph
As we’re now dealing with larger examples, it is not very practical to interactively type or paste the RedisGraph commands into redis-cli
to insert the data we need. Instead, we can prepare a file containing the commands we want to execute, and then pipe it into redis-cli
as follows:
cat familytree.txt | redis-cli --pipe
In our case, you can get the commands to create the example family tree either from the Gigi Labs BitBucket Repository (look for RedisGraph-FamilyTree/familytree.txt) or in the code snippet below:
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'John', gender: 'm', born: 1932, died: 1982})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Victoria', gender: 'f', born: 1934, died: 2006})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Joseph', gender: 'm', born: 1958})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Christina', gender: 'f', born: 1957, died: 2018})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Donald', gender: 'm', born: 1984})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Eleonora', gender: 'f', born: 1986, died: 2010})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Nancy', gender: 'f', born: 1982})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Anthony', gender: 'm', born: 2010})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'George', gender: 'm', born: 2012})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Antoinette', gender: 'f', born: 1967})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Alfred', gender: 'm', born: 1965})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Bernard', gender: 'm', born: 1997})"
GRAPH.QUERY FamilyTree "CREATE (:Person {name: 'Fiona', gender: 'f', born: 2000})"
GRAPH.QUERY FamilyTree "MATCH (man:Person { name : 'John' }), (woman:Person { name : 'Victoria' }) CREATE (man)-[:married { year: 1956 }]->(woman)"
GRAPH.QUERY FamilyTree "MATCH (man:Person { name : 'Joseph' }), (woman:Person { name : 'Christina' }) CREATE (man)-[:married { year: 1981 }]->(woman)"
GRAPH.QUERY FamilyTree "MATCH (man:Person { name : 'Donald' }), (woman:Person { name : 'Eleonora' }) CREATE (man)-[:married { year: 2008 }]->(woman)"
GRAPH.QUERY FamilyTree "MATCH (man:Person { name : 'Donald' }), (woman:Person { name : 'Nancy' }) CREATE (man)-[:married { year: 2011 }]->(woman)"
GRAPH.QUERY FamilyTree "MATCH (man:Person { name : 'Alfred' }), (woman:Person { name : 'Antoinette' }) CREATE (man)-[:married { year: 1992 }]->(woman)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Joseph' }), (parent:Person { name : 'John' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Joseph' }), (parent:Person { name : 'Victoria' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Donald' }), (parent:Person { name : 'Joseph' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Donald' }), (parent:Person { name : 'Christina' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Anthony' }), (parent:Person { name : 'Donald' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Anthony' }), (parent:Person { name : 'Eleonora' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'George' }), (parent:Person { name : 'Donald' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'George' }), (parent:Person { name : 'Nancy' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Antoinette' }), (parent:Person { name : 'John' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Antoinette' }), (parent:Person { name : 'Victoria' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Bernard' }), (parent:Person { name : 'Alfred' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Bernard' }), (parent:Person { name : 'Antoinette' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Fiona' }), (parent:Person { name : 'Alfred' }) CREATE (child)-[:childOf]->(parent)"
GRAPH.QUERY FamilyTree "MATCH (child:Person { name : 'Fiona' }), (parent:Person { name : 'Antoinette' }) CREATE (child)-[:childOf]->(parent)"
There are certainly other ways in which the above commands could be rewritten to be more compact, but I wanted to focus more on keeping things readable in this case.
Sidenote: When creating the nodes (not the relationships), another option could be to keep only the JSON-like property structure in a file (see RedisGraph-FamilyTree/familytree-persons.txt), and then use awk
to generate the beginning and end of each command:
awk '{print "GRAPH.QUERY FamilyTree \"CREATE (:Person " $0 ")\""}' familytree-persons.txt | redis-cli --pipe
Querying the Family Tree
Once the family tree data has been loaded, we can finally query it and get some meaningful information. You might want to keep the earlier family tree picture open in a separate window while you read on, to help you follow along.
First, let’s list all individuals:
GRAPH.QUERY FamilyTree "MATCH (x) RETURN x.name"
1) 1) "x.name"
2) 1) 1) "John"
2) 1) "Victoria"
3) 1) "Joseph"
4) 1) "Christina"
5) 1) "Donald"
6) 1) "Eleonora"
7) 1) "Nancy"
8) 1) "Anthony"
9) 1) "George"
10) 1) "Antoinette"
11) 1) "Alfred"
12) 1) "Bernard"
13) 1) "Fiona"
3) 1) "Query internal execution time: 0.631002 milliseconds"
Next, we’ll use the ORDER BY
clause to get a chronological report based on the year people were born:
GRAPH.QUERY FamilyTree "MATCH (x) RETURN x.name, x.born ORDER BY x.born"
1) 1) "x.name"
2) "x.born"
2) 1) 1) "John"
2) (integer) 1932
2) 1) "Victoria"
2) (integer) 1934
3) 1) "Christina"
2) (integer) 1957
4) 1) "Joseph"
2) (integer) 1958
5) 1) "Alfred"
2) (integer) 1965
6) 1) "Antoinette"
2) (integer) 1967
7) 1) "Nancy"
2) (integer) 1982
8) 1) "Donald"
2) (integer) 1984
9) 1) "Eleonora"
2) (integer) 1986
10) 1) "Bernard"
2) (integer) 1997
11) 1) "Fiona"
2) (integer) 2000
12) 1) "Anthony"
2) (integer) 2010
13) 1) "George"
2) (integer) 2012
3) 1) "Query internal execution time: 0.895734 milliseconds"
By adding in a WHERE clause, we can retrieve all those born before 1969, and return them in order of year of birth as in the previous query:
GRAPH.QUERY FamilyTree "MATCH (x) WHERE x.born < 1969 RETURN x.name, x.born ORDER BY x.born"
1) 1) "x.name"
2) "x.born"
2) 1) 1) "John"
2) (integer) 1932
2) 1) "Victoria"
2) (integer) 1934
3) 1) "Christina"
2) (integer) 1957
4) 1) "Joseph"
2) (integer) 1958
5) 1) "Alfred"
2) (integer) 1965
6) 1) "Antoinette"
2) (integer) 1967
3) 1) "Query internal execution time: 1.097382 milliseconds"
EXISTS
allows us to check whether a property is set. Using it with the died property, we can list all the people who died:
GRAPH.QUERY FamilyTree "MATCH (x) WHERE EXISTS(x.died) RETURN x.name"
1) 1) "x.name"
2) 1) 1) "John"
2) 1) "Victoria"
3) 1) "Christina"
4) 1) "Eleonora"
3) 1) "Query internal execution time: 0.936778 milliseconds"
By changing that to NOT EXISTS
, we can get the opposite, i.e. all the people who are still alive:
GRAPH.QUERY FamilyTree "MATCH (x) WHERE NOT EXISTS(x.died) RETURN x.name"
1) 1) "x.name"
2) 1) 1) "Joseph"
2) 1) "Donald"
3) 1) "Nancy"
4) 1) "Anthony"
5) 1) "George"
6) 1) "Antoinette"
7) 1) "Alfred"
8) 1) "Bernard"
9) 1) "Fiona"
3) 1) "Query internal execution time: 1.150569 milliseconds"
Next, let’s answer some questions about specific individuals.
When did Christina die?
GRAPH.QUERY FamilyTree "MATCH (x) WHERE x.name = 'Christina' RETURN x.died ORDER BY x.born"
1) 1) "x.died"
2) 1) 1) (integer) 2018
3) 1) "Query internal execution time: 0.948734 milliseconds"
Who is George’s mother?
GRAPH.QUERY FamilyTree "MATCH (c)-[:childOf]->(p) WHERE c.name = 'George' AND p.gender = 'f' RETURN p.name"
1) 1) "p.name"
2) 1) 1) "Nancy"
3) 1) "Query internal execution time: 1.859084 milliseconds"
At what age did Eleonora get married? Note here that we’re using the AS
keyword to change the title of the returned field (just like in SQL):
GRAPH.QUERY FamilyTree "MATCH (m)-[r:married]->(f) WHERE f.name = 'Christina' RETURN r.year - f.born AS AgeAtMarriage"
1) 1) "AgeAtMarriage"
2) 1) 1) (integer) 24
3) 1) "Query internal execution time: 1.442386 milliseconds"
How many children did Alfred have? In this case, we use the COUNT()
aggregate function. Again, it works just like in SQL:
GRAPH.QUERY FamilyTree "MATCH (c)-[:childOf]->(p) WHERE p.name = 'Alfred' RETURN COUNT(c)"
1) 1) "COUNT(c)"
2) 1) 1) (integer) 2
3) 1) "Query internal execution time: 1.305086 milliseconds"
Let’s get all of Anthony’s ancestors! Here we use the *1..
syntax to indicate that this is not a single relationship, but indeed a path that is made up of one or more hops.
GRAPH.QUERY FamilyTree "MATCH (c)-[:childOf*1..]->(p) WHERE c.name = 'Anthony' RETURN p.name"
1) 1) "p.name"
2) 1) 1) "Eleonora"
2) 1) "Donald"
3) 1) "Christina"
4) 1) "Joseph"
5) 1) "Victoria"
6) 1) "John"
3) 1) "Query internal execution time: 1.456897 milliseconds"
How about Victoria’s descendants? This is the same as the ancestors query in terms of the MATCH
clause, but it’s got the WHERE
and RETURN
parts swapped.
GRAPH.QUERY FamilyTree "MATCH (c)-[:childOf*1..]->(p) WHERE p.name = 'Victoria' RETURN c.name"
1) 1) "c.name"
2) 1) 1) "Antoinette"
2) 1) "Fiona"
3) 1) "Bernard"
4) 1) "Joseph"
5) 1) "Donald"
6) 1) "George"
7) 1) "Anthony"
3) 1) "Query internal execution time: 1.158366 milliseconds"
Can we get Donald’s ancestors and descentants using a single query? Yes! We can use the UNION
operator to combine the ancestors and descentants queries. Note that in this case the AS
keyword is required, because subqueries of a UNION
must have the same column names.
GRAPH.QUERY FamilyTree "MATCH (c)-[:childOf*1..]->(p) WHERE c.name = 'Donald' RETURN p.name AS name UNION MATCH (c)-[:childOf*1..]->(p) WHERE p.name = 'Donald' RETURN c.name AS name"
1) 1) "name"
2) 1) 1) "Christina"
2) 1) "Joseph"
3) 1) "Victoria"
4) 1) "John"
5) 1) "George"
6) 1) "Anthony"
3) 1) "Query internal execution time: 78.088850 milliseconds"
Who are Donald’s cousins? This is a little more complicated because we need two paths that feed into the same parent, exactly two hops away (because one hop away would be siblings). We also need to exclude Donald and his siblings (if he had any) because they could otherwise match the specified pattern.
GRAPH.QUERY FamilyTree "MATCH (c1:Person)-[:childOf]->(p1:Person)-[:childOf]->(:Person)<-[:childOf]-(p2:Person)<-[:childOf]-(c2:Person) WHERE p1 <> p2 AND c1.name = 'Donald' RETURN c2.name"
1) 1) "c2.name"
2) 1) 1) "Bernard"
2) 1) "Fiona"
3) 1) "Query internal execution time: 2.133173 milliseconds"
Update 4th December 2019: The ancestors and descendants query has been added, and the cousins query improved, thanks to the contributions of people in this GitHub issue. Thank you!
Statistical Queries
The last two queries I’d like to show are statistical in nature, and since they’re not easy to visualise directly, I’d like to get to them in steps.
First, let’s calculate life expectancy. In order to understand this, let’s first run a query retrieving the year of birth and death of those people who are already dead:
GRAPH.QUERY FamilyTree "MATCH (x) WHERE EXISTS(x.died) RETURN x.born, x.died"
1) 1) "x.born"
2) "x.died"
2) 1) 1) (integer) 1932
2) (integer) 1982
2) 1) (integer) 1934
2) (integer) 2006
3) 1) (integer) 1957
2) (integer) 2018
4) 1) (integer) 1986
2) (integer) 2010
3) 1) "Query internal execution time: 1.066981 milliseconds"
Since life expectancy is the average age at which people die, then for each of those born/died pairs, we need to subtract born from died to get the age at death for each person, and then average them out. We can do this using the AVG()
aggregate function, which like COUNT()
may be reminiscent of SQL.
GRAPH.QUERY FamilyTree "MATCH (x) WHERE EXISTS(x.died) RETURN AVG( x.died - x.born )"
1) 1) "AVG( x.died - x.born )"
2) 1) 1) "51.75"
3) 1) "Query internal execution time: 1.208347 milliseconds"
The second statistic we’ll calculate is the average age at marriage. This is similar to life expectancy, except that in this case there are two people in each marriage, which complicates things slightly.
Once again, let’s visualise the situation first, by retrieving separately the ages of the female and the male when they got married:
GRAPH.QUERY FamilyTree "MATCH (m)-[r:married]->(f) RETURN r.year - f.born, r.year - m.born"
1) 1) "r.year - f.born"
2) "r.year - m.born"
2) 1) 1) (integer) 22
2) (integer) 24
2) 1) (integer) 24
2) (integer) 23
3) 1) (integer) 22
2) (integer) 24
4) 1) (integer) 29
2) (integer) 27
5) 1) (integer) 25
2) (integer) 27
Therefore, we have five marriages but ten ages at marriage, which is a little confusing to work out an average. However, we can still get to the number we want by adding up the ages for each couple, working out the average, and then dividing by 2 at the end to make up for the difference in the number of values:
GRAPH.QUERY FamilyTree "MATCH (m)-[r:married]->(f) RETURN AVG( (r.year - f.born) + (r.year - m.born) ) / 2"
1) 1) "AVG( (r.year - f.born) + (r.year - m.born) ) / 2"
2) 1) 1) "24.7"
3) 1) "Query internal execution time: 48.874147 milliseconds"
Wrapping Up
We’ve seen another example graph — a family tree — in this article. We discussed the reasons behind the chosen representation, delved into efficient ways to quickly create it from a text file, and then ran a whole bunch of queries to answer different questions and analyse the data in the family tree.
One thing I’m still not sure how to do is whether it’s possible, given two people, to identify their relationship (e.g. cousin, sibling, parent, etc) based on the path between them.
As all this is something I’m still learning, I’m more than happy to receive feedback on how to do things better and perhaps other things you can do which I’m not even aware of.