Project Management is a Graph Problem

Project management, at its heart, involves planning the various tasks involved in a project and monitoring their gradual execution. The tasks are often organised in simple ways using task lists, Kanban boards, calendars, or Gantt charts, and prioritised based on importance. However, these methods often leave out something very fundamental: dependencies between tasks. What use is prioritisation, when task F cannot even commence before tasks D and E are ready?

This difficulty arises because relationships between tasks aren’t linear, and yet we use linear visualisations to make sense of them. By representing a project’s tasks as a graph instead, we can not only easily see the various dependencies, but also use critical path analysis techniques to gather more information about scheduling and risk.

This topic has been previously covered in the Project Management Neo4j graph gist by Nicole White. While it provides splendid coverage of critical path analysis with Neo4j, the article is unfortunately in poor shape, with its images, videos and formatting broken (although I’ve been able to locate an archived version of its graph image). It also represents tasks/activities as nodes, whereas I will be taking a different approach (representing tasks/activities as edges) which I originally learned back in my University days and feel is more intuitive.

Understanding Critical Path Analysis

To understand what we’re talking about, we first need an example.

The original graph.

The above diagram shows a relatively simple graph. Node A represents the starting point, whereas all the others represent different milestones that we need to deliver, including F which represents the final delivery and the end of the project. The arrows between nodes represent the tasks that need to be carried out in order to achieve the milestone where the arrow ends. Each arrow has a number which we can assume is the number of days we think the task will take (duration). In some cases the arrows diverge (e.g. B must be completed in order for either D or E to start) or converge (D and E must both be completed before F can start).

The graph with earliest start times.

At this point, we can calculate the earliest start time of each node. To understand this, let’s consider node E. In order for node E to start, the arrows leading up to it must both be completed. These include the paths ABE (duration = 2 + 4 = 6) and ACE (duration = 3 + 2 = 5). Since both must be completed, there’s no starting E before the longest of these (duration 6) has completed, so E’s earliest start time is 6. This is useful because, considering the tasks that occur sequentially or in parallel, it allows us to schedule a task at the appropriate time when its dependencies have been completed.

In order to calculate the earliest start time of all nodes, we do a forward pass from left to right, assuming that the earliest start time for node A is zero, adding up the durations leading to each node, and taking the highest number where multiple arrows converge to the same node. So:

  • A: assume earliest start time is zero.
  • B: 0 + 2 = 2.
  • C: 0 + 3 = 2.
  • D: 2 + 1 = 3.
  • E: max(3 + 2, 2 + 4) = 6.
  • F: max(3 + 6, 6 + 5) = 11.

The results are shown in the above diagram, where earliest start times are shown under the bottom-left portion of each node.

The graph with latest start times.

Next, we can calculate the latest start times, which tell us the latest day on which we can start each task without delaying the whole project. To do this, we start from the last node, setting its latest start time to the same value as its earliest start time (11 in the case of F). Then, we work backwards, subtracting the duration from the latest start time, and this time taking the minimum where a node diverges. So:

  • F: assume latest start time is same as earliest start time, i.e. 11.
  • E: 11 – 5 = 6.
  • D: 11 – 6 = 5.
  • C: 6 – 2 = 4.
  • B: min(5 – 1, 6 – 4) = 2
  • A: min(4 – 3, 2 – 2) = 0

The critical path consists of the nodes whose start and end times are equal – in this case this would be the path ABEF. If any of the tasks along this path are delayed, this would delay the whole project. On the other hand, nodes with different earliest and latest start times have some leeway. If the task along BD takes 3 days instead of 1 day, the path ABDF takes 2 + 3 + 6 = 11 days, which is the same as we need to get to F from the longer path, and so this doesn’t affect the overall project. The amount of leeway for each node is the difference between its latest and earliest start times. Nodes on the critical path have a zero difference and therefore get no leeway.

Running Neo4j with Docker

Now that we’ve seen how critical path analysis works with manual calculations, we’ll see how to create and analyse the same graph using a graph database, specifically Neo4j.

The easiest way to run Neo4j quickly is using Docker. Assuming we’re using Linux, Docker is already installed, and we want to destroy the container once it’s stopped, the following command achieves this purpose:

sudo docker run --rm -it -p 7687:7687 -p 7474:7474 neo4j

Once Neo4j is running, we can access the Neo4j Browser in a web browser via the URL http://localhost:7474/browser/. The default credentials to login are neo4j for both username and password, and these will have to be changed the first time. After that, the Neo4j Browser can be used to run Cypher queries and view their results.

Creating the Graph

To create the graph, we’ll run the following Cypher in the Neo4j Browser. The first set of statements creates the nodes. The second set locates the nodes we just created, and establishes the relationships between them. Since the statements end with a semicolon, they may be run all together in one go.

create (A:Milestone {name: 'A'});
create (B:Milestone {name: 'B'});
create (C:Milestone {name: 'C'});
create (D:Milestone {name: 'D'});
create (E:Milestone {name: 'E'});
create (F:Milestone {name: 'F'});

match (A:Milestone{name: 'A'}), (B:Milestone {name: 'B'}) create (A)-[:precedes {duration : 2}]->(B);
match (A:Milestone{name: 'A'}), (C:Milestone {name: 'C'}) create (A)-[:precedes {duration : 3}]->(C);
match (B:Milestone{name: 'B'}), (D:Milestone {name: 'D'}) create (B)-[:precedes {duration : 1}]->(D);
match (B:Milestone{name: 'B'}), (E:Milestone {name: 'E'}) create (B)-[:precedes {duration : 4}]->(E);
match (C:Milestone{name: 'C'}), (E:Milestone {name: 'E'}) create (C)-[:precedes {duration : 2}]->(E);
match (D:Milestone{name: 'D'}), (F:Milestone {name: 'F'}) create (D)-[:precedes {duration : 6}]->(F);
match (E:Milestone{name: 'E'}), (F:Milestone {name: 'F'}) create (E)-[:precedes {duration : 5}]->(F);

Once this is done, the resulting graph can be visualised by running the following simple Cypher query, which returns all nodes:

return n

After adjusting the position of the nodes, as well as their colour and caption, the graph matches what we saw earlier:

The graph in Neo4j, as seen in the Neo4j Browser.

Setting Earliest Start Times

As we saw earlier, the earliest start times of each node are calculated by adding up the durations of each arrow leading to that node, taking the highest number in case there is more than one. In Neo4j, we can achieve this with a path query. We’ll build this step by step to clarify what the final query does.

We’ll start with this very simple Cypher query:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
return a, relationships(path), b

This gets every path between every two nodes, and returns the pair of nodes along with all the relationships along the way. The Text view of the result in the Neo4j browser is the following:

│"a"         │"relationships(path)"                         │"b"         │
│{"name":"A"}│[{"duration":2}]                              │{"name":"B"}│
│{"name":"A"}│[{"duration":2},{"duration":1}]               │{"name":"D"}│
│{"name":"A"}│[{"duration":2},{"duration":4}]               │{"name":"E"}│
│{"name":"A"}│[{"duration":3}]                              │{"name":"C"}│
│{"name":"A"}│[{"duration":3},{"duration":2}]               │{"name":"E"}│
│{"name":"B"}│[{"duration":1}]                              │{"name":"D"}│
│{"name":"B"}│[{"duration":1},{"duration":6}]               │{"name":"F"}│
│{"name":"B"}│[{"duration":4}]                              │{"name":"E"}│
│{"name":"B"}│[{"duration":4},{"duration":5}]               │{"name":"F"}│
│{"name":"C"}│[{"duration":2}]                              │{"name":"E"}│
│{"name":"C"}│[{"duration":2},{"duration":5}]               │{"name":"F"}│
│{"name":"D"}│[{"duration":6}]                              │{"name":"F"}│
│{"name":"E"}│[{"duration":5}]                              │{"name":"F"}│

In our case, we just want the value of the durations along each path, so we extract the duration as follows:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
return a, [r in relationships(path) | r.duration], b

The part in square brackets on the second line simply means “for each relationship in the path’s relationships, take the duration”. The following is the simplified result:

│"a"         │"[r in relationships(path) | r.duration]"│"b"         │
│{"name":"A"}│[2]                                      │{"name":"B"}│
│{"name":"A"}│[2,1]                                    │{"name":"D"}│
│{"name":"A"}│[2,1,6]                                  │{"name":"F"}│
│{"name":"A"}│[2,4]                                    │{"name":"E"}│
│{"name":"A"}│[2,4,5]                                  │{"name":"F"}│
│{"name":"A"}│[3]                                      │{"name":"C"}│
│{"name":"A"}│[3,2]                                    │{"name":"E"}│
│{"name":"A"}│[3,2,5]                                  │{"name":"F"}│
│{"name":"B"}│[1]                                      │{"name":"D"}│
│{"name":"B"}│[1,6]                                    │{"name":"F"}│
│{"name":"B"}│[4]                                      │{"name":"E"}│
│{"name":"B"}│[4,5]                                    │{"name":"F"}│
│{"name":"C"}│[2]                                      │{"name":"E"}│
│{"name":"C"}│[2,5]                                    │{"name":"F"}│
│{"name":"D"}│[6]                                      │{"name":"F"}│
│{"name":"E"}│[5]                                      │{"name":"F"}│

This gives us a list of durations along each path. We can use the reduce() function to add them up, transforming the query as follows:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
return a, reduce(x = 0, r in relationships(path) | x + r.duration), b

reduce() uses x as an accumulator variable, adding the duration of each relationship to it and returning the final result. The result is now the following:

│"a"         │"reduce(x = 0, r in relationships(path) | x + r.duration)"│"b"         │
│{"name":"A"}│2                                                         │{"name":"B"}│
│{"name":"A"}│3                                                         │{"name":"D"}│
│{"name":"A"}│9                                                         │{"name":"F"}│
│{"name":"A"}│6                                                         │{"name":"E"}│
│{"name":"A"}│11                                                        │{"name":"F"}│
│{"name":"A"}│3                                                         │{"name":"C"}│
│{"name":"A"}│5                                                         │{"name":"E"}│
│{"name":"A"}│10                                                        │{"name":"F"}│
│{"name":"B"}│1                                                         │{"name":"D"}│
│{"name":"B"}│7                                                         │{"name":"F"}│
│{"name":"B"}│4                                                         │{"name":"E"}│
│{"name":"B"}│9                                                         │{"name":"F"}│
│{"name":"C"}│2                                                         │{"name":"E"}│
│{"name":"C"}│7                                                         │{"name":"F"}│
│{"name":"D"}│6                                                         │{"name":"F"}│
│{"name":"E"}│5                                                         │{"name":"F"}│

Finally, by using the max() function, dropping a from the result, and using a little ordering for clarity, we get exactly the earliest start times we wanted, using the following query:

match path = (:Milestone)-[:precedes*]->(b:Milestone)
return max(reduce(x = 0, r in relationships(path) | x + r.duration)), b
order by

The resulting values, shown below, match what we calculated manually earlier:

│"max(reduce(x = 0, r in relationships(path) | x + r.duration))"│"b"         │
│2                                                              │{"name":"B"}│
│3                                                              │{"name":"C"}│
│3                                                              │{"name":"D"}│
│6                                                              │{"name":"E"}│
│11                                                             │{"name":"F"}│

All we have left to do now is modify the query to set these values on each node:

match path = (:Milestone)-[:precedes*]->(b:Milestone)
with b, max(reduce(x = 0, r in relationships(path) | x + r.duration)) as earliest_start
set b.earliest_start = earliest_start

It is then trivial to verify that the nodes have been updated with the correct earliest start times:

A simple query shows that the nodes have been updated with earliest start times.

Setting Latest Start Times

Setting the latest start times is easier and does not require complex path queries. As we did manually, we work our way backwards, subtracting the duration from the earliest start time, and taking the minimum where there are multiple arrows emerging from a node. The following query does the trick:

match (a:Milestone)-[r:precedes]->(b:Milestone)
return a, min(b.earliest_start - r.duration) as latest_start
order by

The following output shows values that match what we originally calculated manually:

│"a"                            │"latest_start"│
│{"name":"A"}                   │0             │
│{"name":"B","earliest_start":2}│2             │
│{"name":"C","earliest_start":3}│4             │
│{"name":"D","earliest_start":3}│5             │
│{"name":"E","earliest_start":6}│6             │

We can set the latest start time on each node by adjusting the query slightly as follows:

match (a:Milestone)-[r:precedes]->(b:Milestone)
with a, min(b.earliest_start - r.duration) as latest_start
set a.latest_start = latest_start

Once again, we verify that everything has updated correctly:

A simple query shows that the nodes have been updated with latest start times.

Calculating the Critical Path: Maximum Duration

One way to calculate the critical path is shown in the aforelinked Project Management Neo4j graph gist by Nicole White. Adapted to our graph representation, the query for this is as follows:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
where = 'A' and = 'F'
with path, reduce(total_duration = 0, r in relationships(path) | total_duration + r.duration) AS total_duration
order by total_duration desc
limit 1
return nodes(path)

This method does not need earliest and latest start times at all. It works as follows:

  • It obtains all paths between the start and finish node (as per the where clause).
  • The total duration of each path is calculated with reduce().
  • The longest path is taken thanks to the order bydesc and limit 1.

As you can see from the screenshot below, this method works pretty well.

The critical path shown in Neo4j Browser.

Calculating the Critical Path: Equal Start Times

You might remember from earlier that the earliest and latest start times are equal in each node along the critical path, so this gives us another way to calculate the critical path. To do this, though, we first need to update the start and end nodes to fill in their missing earliest and latest start times, as follows:

where = 'A'
set a.earliest_start = 0;

where = 'F'
set f.latest_start = f.earliest_start;

We can then obtain the critical path as follows, using the all() predicate function to ensure that we pick only the nodes having equal earliest and latest start times:

match path = (a:Milestone)-[r:precedes*]->(b:Milestone)
where = 'A' and = 'F'
and all(node in nodes(path) where node.earliest_start = node.latest_start)
return nodes(path)

As you can see, this method works just as well:

The critical path shown in Neo4j Browser.


Although we’re feeling so Agile nowadays with all these fancy Kanban boards, the nature of projects, tasks and their dependencies makes them best represented by graphs. Additionally, using critical path analysis, it’s possible to obtain useful analytics, such as the optimal time to schedule tasks, which tasks risk delaying the whole project, and which tasks may be delayed without impacting the project delivery.

This scenario served as an example to explore relatively advanced Cypher features, including path queries and various functions.

3 thoughts on “Project Management is a Graph Problem”

  1. Two core reasons most companies adopted or are trying to adopt agile principles are:

    1. Project management is not a graph problem, kanban problem, not even any kind of technical problem. It is a people problem. People must talk, understand each other, show empathy.
    2. Graphs, kanbans, and such tools are still useful to figure technicalities out. They do not help however with the amount of uncertainty that comes with any project: project data is sparse, shallow, and moving, so the main challenge is making predictions while maintaining room for uncertainty and innovation.

    I still enjoyed your post as a nice Neo4j introduction with a better use case than most “movie library” examples 🙂

  2. This approach can be very useful for projects where there are many dependencies. Agile tools that I have seen are very good for planning tasks and for showing the current state, but have very few ways to represent dependencies and therefore Critical Path. (If we combine Critical Path with the people resources, we could represent Critical Chain, but that’s another topic.)

    I recently represented a complex software compilation sequence as a graph. The representation of prerequisites and dependent modules is very similar to yours. The individual dependencies are available from the build process, but the required build sequence is not obvious. There are so many modules and dependencies that viewing the graph in the browser is difficult. Your algorithm for Setting Earliest Start Time is basically the same as mine for setting the earliest dependencies and deriving the required build sequence.

    In my case the durations (compilation times) are nearly the same and I have reduced them to 1. But if I also set durations based on actual compile times I could derive a Critical Path that could optimize the build sequence and reduce the end-to-end build time. The end-to-end time is not as important as just ensuring that modules are built in the proper sequence so all dependencies are satisfied and the build process succeeds.

    Thank you for sharing this article.

Leave a Reply

Your email address will not be published. Required fields are marked *