
Migrating Cartography to Memgraph

I’ve recently written about how to use Cartography to collect infrastructural data from both AWS and Okta into a Neo4j graph database for security analysis.

Neo4j is a long-standing player in the graph database market, with a robust product, great documentation, and a massive following. However, its long legacy is in a way also a disadvantage, as it can be costly, slow, and resource-hungry (due in no small part to its reliance on the JVM). Any one of these can be reason enough to look for an alternative.

Memgraph, on the other hand, is a relatively young graph database, and certainly not as fully-featured as Neo4j. A key difference is that it is written in C++, meaning it’s designed to be faster and more lightweight than Neo4j (whether it lives up to this is something you’ll need to evaluate for your own use cases). Memgraph also made a very wise decision to support the Bolt protocol and the Cypher language – both of which Neo4j uses – meaning that it’s compatible with existing Neo4j clients and queries. Although there are variations in Cypher dialect, the incompatibilities are few, and moving from Neo4j to Memgraph is significantly less painful than, say, transitioning to a graph database that uses Gremlin as its query language.

At the time of writing this article, Cartography requires Neo4j 4.x, and does not work with Memgraph. However, I’m going to show you how to make at least part of it (the Okta intel module) work with minor alterations to the Cartography codebase. This serves as a demonstration of how to get started migrating an existing application from Neo4j to Memgraph.

Running Memgraph

Before we start looking at Cartography, let’s run an instance of Memgraph. To do this, we’ll take a tip from my earlier article, “Using the Neo4j Bolt Driver for Python with Memgraph”, and run it under Docker as follows (drop the sudo if you’re on Mac or Windows):

sudo docker run --rm -it -p 7687:7687 -p 3000:3000 -e MEMGRAPH="--bolt-server-name-for-init=Neo4j/" memgraph/memgraph-platform

That --bolt-server-name-for-init=Neo4j/ is a first critical step in Neo4j compatibility. As explained in that same article, the Neo4j Bolt Driver (i.e. client) for Python (which Cartography uses) checks whether the server sends an “agent” value that starts with “Neo4j/”. By setting this, Memgraph is effectively posing as a Neo4j server, and the Neo4j Bolt Driver for Python can’t tell the difference.
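To make the check concrete, here is a minimal Python sketch of the kind of prefix test described above. It is only an illustration and not the driver's actual code: the function name accepts_server_agent and the second sample value are made up for this example.

```python
def accepts_server_agent(agent: str) -> bool:
    """Illustrative only: mimic a client that insists on a 'Neo4j/' agent prefix."""
    return agent.startswith("Neo4j/")


# The value we pass to Memgraph via --bolt-server-name-for-init
print(accepts_server_agent("Neo4j/"))
# A hypothetical agent string that such a check would reject
print(accepts_server_agent("SomethingElse/1.0"))
```

With the flag set, the server's agent value passes this kind of test, which is why the client carries on as if it were talking to Neo4j.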

Update 19th September 2023: as of Memgraph v2.11, --bolt-server-name-for-init has a default value compatible with the Neo4j Bolt Driver, and therefore no longer needs to be provided.

If it’s successful, you should see output such as the following:

Memgraph is running. You can also execute queries directly from here.

Cloning the Cartography Repo

The next thing to do is grab a copy of the Cartography source code from the Cartography GitHub repo:

git clone https://github.com/lyft/cartography.git

Next, run the following command to install the necessary dependencies:

pip3 install -e .

Note: in the past, I’ve usually had to upgrade the Neo4j Bolt Driver for Python to 5.2.1 to get anything working, but as I try this again, it seems to work even with the default 4.4.x that Cartography uses. If you have problems, try changing setup.py to require neo4j>=5.2.1 and run the above command again.

Creating a Launch Configuration in Visual Studio Code

In order to run Cartography from its source code, you could run it directly from the terminal, for instance:

cd cartography/cartography
python3 __main__.py

However, as I’ve recently been using Visual Studio Code for all my polyglot software development needs, I find it much more convenient to set up a launch configuration that allows me to easily debug Cartography and pass whatever command-line arguments and environment variables I want.

The following launch.json is handy to run Cartography with an Okta configuration as described in “Getting Started with Cartography for Okta”:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Run Cartography",
            "type": "python",
            "request": "launch",
            "program": "cartography/__main__.py",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--neo4j-user",
                "ignore",
                "--neo4j-password-env-var",
                "NEO4J_PASS",
                "--okta-org-id",
                "dev-xxxxxxxx",
                "--okta-api-key-env-var",
                "OKTA_API_TOKEN"
            ],
            "env": {
                "NEO4J_PASS": "ignore",
                "OKTA_API_TOKEN": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            }
        }
    ]
}

You might notice we’re telling Cartography to connect to Neo4j (Memgraph actually) with username and password both set to “ignore”. The reason for this is that while the Community Edition of Memgraph does not require (or support) authentication, the Neo4j Bolt Driver for Python (i.e. Neo4j client) does require a username and password to be provided. So, as a second critical compatibility step, we pass any arbitrary value for the Neo4j username and password so long as they are not left empty.

As for the Okta configuration, remember to replace the Organisation ID and API Token with real ones.

Incompatible Index Creation Cypher

Pressing F5, we can now run Cartography from inside Visual Studio Code, and we immediately run into the first problem:

The error says: “no viable alternative at input ‘CREATEINDEXIF’”.

Memgraph is choking on the index creation step in indexes.cypher (in VS Code, use Ctrl+P / Command+P to quickly locate the file) because the index creation syntax is one aspect of Memgraph’s Cypher implementation that is not compatible with that of Neo4j. If we take the first line in the file, the Neo4j-compatible syntax is:

CREATE INDEX IF NOT EXISTS FOR (n:AWSConfigurationRecorder) ON (n.id);

…whereas the equivalent on Memgraph would be:

CREATE INDEX ON :AWSConfigurationRecorder(id);

Note: in Memgraph, the “IF NOT EXISTS” bit is implicit: an index is created if it doesn’t exist; if it does, the operation is a no-op that does not cause any error.

Fortunately, this syntactic difference is easily resolved by replacing (using VS Code search & replace syntax, with regex enabled) this:

CREATE INDEX IF NOT EXISTS FOR \(n:(.*?)\) ON \(n\.(.*?)\);

…with this:

CREATE INDEX ON :$1($2);
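If you would rather script the change than use the editor, the same substitution can be sketched in Python with the standard re module. The function name to_memgraph_index is just an illustrative choice, and the sketch assumes every index statement follows the single-label, single-property form shown above.

```python
import re

# Neo4j-style index statement, capturing the label and the property name
NEO4J_INDEX = re.compile(
    r"CREATE INDEX IF NOT EXISTS FOR \(n:(.*?)\) ON \(n\.(.*?)\);"
)


def to_memgraph_index(cypher: str) -> str:
    """Rewrite Neo4j index creation statements into Memgraph's syntax."""
    return NEO4J_INDEX.sub(r"CREATE INDEX ON :\1(\2);", cypher)


print(to_memgraph_index(
    "CREATE INDEX IF NOT EXISTS FOR (n:AWSConfigurationRecorder) ON (n.id);"
))
# CREATE INDEX ON :AWSConfigurationRecorder(id);
```

Statements that don't match the pattern are left untouched, so it is worth checking the output for any that slipped through.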

Tip: although not in scope here, you’ll need to make a similar change also in querybuilder.py and tx.py if you also want to get other intel modules (e.g. AWS) working.

Neo4j Result Consumption

After fixing the index creation syntax and rerunning Cartography, we run into another problem:

The error says: “The result is out of scope. The associated transaction has been closed. Results can only be used while the transaction is open.”

I’m told that consume() is used to fix a problem in which Neo4j connections hang in situations where internal buffers fill up, although the Cartography team is re-evaluating whether this is necessary. In practice, I have seen that removing this doesn’t seem to cause problems with datasets I’ve tested with, although your mileage may vary. Let’s fix this problem by removing usage of consume() in statement.py.

First, we drop the .consume() at the end of line 76 inside the run() function:

    def run(self, session: neo4j.Session) -> None:
        """
        Run the statement. This will execute the query against the graph.
        """
        if self.iterative:
            self._run_iterative(session)
        else:
            session.write_transaction(self._run_noniterative)
        logger.info(f"Completed {self.parent_job_name} statement #{self.parent_job_sequence_num}")

Then, in the _run_iterative() function, we remove the entire while loop (lines 120-128) except for line 121, which we de-indent:

        # while True:
        result: neo4j.Result = session.write_transaction(self._run_noniterative)

            # Exit if we have finished processing all items
            # if not result.consume().counters.contains_updates:
            #     # Ensure network buffers are cleared
            #     result.consume()
            #     break
            # result.consume()

When we run it again, it should finish the run without problems and return control of the terminal with the prompt showing:

...
INFO:cartography.sync:Finishing sync stage 'duo'
INFO:cartography.sync:Starting sync stage 'analysis'
INFO:cartography.intel.analysis:Skipping analysis because no job path was provided.
INFO:cartography.sync:Finishing sync stage 'analysis'
INFO:cartography.sync:Finishing sync with update tag '1689401212'
daniel@andromeda:~/git/cartography$

Querying the Graph

The terminal we’re using to run Memgraph has the mgconsole client running (that’s the memgraph> prompt you see in the earlier screenshot), meaning we can try running queries directly there. For starters, we can try the ubiquitous “get everything” Cypher query:

memgraph> match (n) return n;

Note: if you get a “mg_raw_transport_send: Broken pipe”, just run the query again and it should reconnect.

This gives us some data back:

Querying Memgraph using mgconsole.

As you can see, this is not a great way to visualise results. Fortunately, Memgraph has its own web client (similar to Neo4j Browser) called Memgraph Lab, which you can access at http://localhost:3000/:

Memgraph Lab: Quick Connect page.

On the Quick Connect page, click the “Connect now” button. Then, switch to the “Query Execution” page using the left navigation sidebar, and you can run queries and view results more comfortably:

Seeing some nodes in Memgraph Lab.

Unlike Neo4j Browser, Memgraph Lab does not return relationships by default when you run this query. If you want to see them as well, you can run this instead:

match (a)-[r]->(b)
return a, r, b

Nodes and relationships in Memgraph Lab.

If the graph looks too cluttered, just drag the nodes around to rearrange them in a way that is more pleasant.

More Cartography with Memgraph

Cartography is a huge project that gathers data from a variety of data sources including AWS, Azure, GitHub, Okta, and others.

I’ve intentionally only covered the Okta intel module in this article because it’s small in scope and easy to digest. To use Cartography with other data sources, additional effort is required to address other problems with incompatible Cypher queries. For instance, at the time of writing this article, there are at least 9 outstanding issues that need to be fixed before Cartography can be used with Memgraph for AWS (that’s quite impressive considering that the AWS intel module is the biggest). Other intel modules may have other problems that need solving; nobody has explored them with Memgraph yet.

Summary

In this article, I’ve shown how one could go about taking an existing application that depends on Neo4j and migrating it to Memgraph. I’ve used Cartography with its Okta intel module to keep things relatively straightforward. The steps involved include:

  1. Running Memgraph with --bolt-server-name-for-init=Neo4j/
  2. Using the same Bolt-compatible Neo4j client, providing arbitrary Neo4j username and password values
  3. Fixing any incompatible Neo4j client code (in this case, consume()), if applicable
  4. Adjusting any incompatible Cypher queries

Getting Started with Cartography for Okta

Cartography is a great security tool that gathers infrastructure and security data from various sources for subsequent analysis. Last year, I wrote an article about Getting Started with Cartography for AWS. Although Cartography focuses mostly on AWS, it also gathers data from several other sources including major cloud and SaaS providers.

In this article, we’ll use Cartography to ingest Okta data. For the unfamiliar, Okta is an enterprise identity management tool that is great for its Single Sign On (SSO) capability. From a single dashboard, it provides seamless access to many different services (e.g. AWS, Gmail, and many others), without having to log in every time. See also: What is Okta and What Does Okta Do?

It’s worth noting before we start this journey that Cartography’s support for Okta isn’t great. It only supports a handful of types, and it uses a retired version of the Okta SDK for Python. Nonetheless, it retrieves the most important types, and they enable analysis of some more interesting attack paths (e.g. an Okta user gaining unauthorised access to resources in AWS).

Creating an Okta Developer Account

We’ll first need an Okta account. There are a few different options including a trial, but for development, the best is to sign up for an Okta Developer account as follows.

Click on the Sign up button in the top-right.
In this confusing selection screen, go for the Developer Edition on the right.
Fill the sign-up form and proceed.

Once you get to the sign-up form, fill in the four required fields, and then either sign up via email or use your GitHub or Google account. Note that Okta demands a “business email”, so you can’t use a Gmail account for this.

After signing up, you’ll get an email to activate your account. Follow its instructions to choose a password, and then you will be logged in and redirected to your Okta dashboard.

The Okta dashboard.

Creating an Okta API Token

Cartography’s Okta Configuration documentation says it’s necessary to set up an Okta API token, so let’s do that. From the Okta Dashboard:

  1. Go to Security -> API via the left navigation menu.
  2. Switch to the “Tokens” tab.
  3. Click the “Create token” button.

Security -> API, Tokens tab, Create token button.

You will then be prompted to enter a name for the API token, and subsequently given the token itself. Copy the token and keep it handy. Take note also of your organisation ID, which you can find either in the URL, or in the top-right under your name (but remove the “okta-” prefix). The organisation ID for a developer account looks like “dev-12345678”.

Running Neo4j

Before we run Cartography, we need a running instance of the Neo4j graph database, because that’s where the data gets stored after being retrieved from the configured data sources (in this case Okta). When I wrote “Getting Started with Cartography for AWS”, Cartography only supported up to Neo4j 3.5. Thankfully, that has changed. The Cartography Installation documentation specifically asks for Neo4j 4.x, further remarking that “Neo4j 5.x will probably work but Cartography does not explicitly support it yet.” The latest Neo4j Docker image at the time of writing this article seems to be 5.9, and I’m feeling adventurous, so let’s give it a try.

I did explain in “Getting Started with Cartography for AWS” how to run Neo4j under Docker, but we’ll do it a little better this time. Use the following command:

sudo docker run --rm -p 7474:7474 -p 7473:7473 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5.9

Here’s a brief explanation of what all this means:

  • sudo: I’m on Linux, so I need to run Docker with elevated privileges. If you’re on Windows or Mac, omit this.
  • docker run: runs a new Docker container with the image specified at the end.
  • --rm: destroys the container after you shut it down. This is because we’re just doing a quick test and don’t want to keep containers around. If you want to keep the container, remove this.
  • -p 7474:7474 -p 7473:7473 -p 7687:7687: maps ports 7473, 7474 and 7687 from the Docker container to the host, so that we can access Neo4j from the host machine. 7474 in particular lets us access the Neo4j Browser, which we’ll see in a moment.
  • -e NEO4J_AUTH=neo4j/password: sets up the initial username and password to “neo4j” and “password” respectively. This bypasses the need to reset the password from the Neo4j Browser as I did in the earlier article. Remember it’s just a quick test, so excuse the silly “password” and choose a better one in production.
  • neo4j:5.9: This is the image we’re going to run – neo4j with tag 5.9.
  • Note that any data will be lost when you stop the container, regardless of the --rm argument. You’ll need to use Docker volumes if you want to retain the data.

Once the container has started, you can access the Neo4j Browser at http://localhost:7474/, and log in using the username “neo4j” and password “password”. We’ll use this later to run Cypher queries, but for now it is a sign that Neo4j is running properly.

The Neo4j Browser’s login screen.

Running Cartography

Following the Cartography Installation documentation, run the following to install Cartography:

pip3 install cartography

As per Cartography’s Okta Configuration documentation, assign the Okta API token you created earlier to an environment variable (the following will set it only for your current terminal session):

export OKTA_API_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Then, run Cartography with the following command:

cartography --neo4j-uri bolt://localhost:7687 --neo4j-password-prompt --neo4j-user neo4j --okta-org-id dev-xxxxxxxx --okta-api-key-env-var OKTA_API_TOKEN

Here’s a brief summary of the parameters:

  • --neo4j-uri bolt://localhost:7687: specifies the Neo4j URI to connect to
  • --neo4j-user neo4j: will log in with the username “neo4j”
  • --neo4j-password-prompt: means that you will be prompted for the Neo4j password and will have to type it in
  • --okta-org-id dev-xxxxxxxx: will connect to Okta using the organisation ID “dev-xxxxxxxx” (replace this with yours)
  • --okta-api-key-env-var OKTA_API_TOKEN: will use the value of the OKTA_API_TOKEN environment variable as the API token when connecting to Okta

If you see “cartography: command not found” when you run this (especially on Linux), there’s a very good Stack Overflow answer that explains why this happens and offers a simple solution:

export PATH="$HOME/.local/bin:$PATH"

When you manage to run Cartography with the earlier command, enter the Neo4j password (it’s “password” in this example). It will take some time to collect the data from Okta and will write to the terminal periodically as it makes progress. You’ll know it’s done because you’ll see your terminal’s prompt again, and hopefully won’t see any errors.

Querying the Graph

You should now have data in Neo4j, so open your Neo4j Browser at http://localhost:7474/ and run some queries to look at the data. The easiest to start with is the typical “get everything” query:

match (n) return n

On a fresh new account, this gives you back a handful of nodes and the relationships between them:

Okta data in the Neo4j Browser.

Although this is not great for analysis, it’s all you need to get started using Cartography for Okta. You can get more data to play with by either building out your directory (users, groups, etc) via the Okta Dashboard, or else connecting to a real production account with real data.

If you want to analyse attack paths from Okta to AWS, then do the necessary AWS setup (see my earlier article, “Getting Started with Cartography for AWS”), and follow Cartography’s Okta Configuration documentation to set up the bridge between Okta and AWS.

Summary

To get Cartography to collect your Okta data:

  1. Sign up for an Okta account if you don’t have one already.
  2. Create an Okta API Token, and take note of your Okta Organisation ID
  3. Run Neo4j
  4. Run Cartography, providing settings to access Neo4j and Okta

Once the data is in Neo4j, you can analyse it and visualise how the nodes are connected. This can help you understand the paths that an attacker could take to breach the critical parts of your infrastructure. In the case of Okta, this is particularly useful when considering how an attacker could exploit the privileges of an Okta user to access resources in other cloud or SaaS providers.

Project Management is a Graph Problem

Project management, at its heart, involves planning the various tasks involved in a project and monitoring their gradual execution. The tasks are often organised in simple ways using task lists, Kanban boards, calendars, or Gantt charts, and prioritised based on importance. However, these methods often leave out something very fundamental: dependencies between tasks. What use is prioritisation, when task F cannot even commence before tasks D and E are ready?

This difficulty arises because relationships between tasks aren’t linear, and yet we use linear visualisations to make sense of them. By representing a project’s tasks as a graph instead, we can not only easily see the various dependencies, but also use critical path analysis techniques to gather more information about scheduling and risk.

This topic has been previously covered in the Project Management Neo4j graph gist by Nicole White. While it provides splendid coverage of critical path analysis with Neo4j, the article is unfortunately in poor shape, with its images, videos and formatting broken (although I’ve been able to locate an archived version of its graph image). It also represents tasks/activities as nodes, whereas I will be taking a different approach (representing tasks/activities as edges) which I originally learned back in my University days and feel is more intuitive.

Understanding Critical Path Analysis

To understand what we’re talking about, we first need an example.

The original graph.

The above diagram shows a relatively simple graph. Node A represents the starting point, whereas all the others represent different milestones that we need to deliver, including F which represents the final delivery and the end of the project. The arrows between nodes represent the tasks that need to be carried out in order to achieve the milestone where the arrow ends. Each arrow has a number which we can assume is the number of days we think the task will take (duration). In some cases the arrows diverge (e.g. B must be completed in order for either D or E to start) or converge (D and E must both be completed before F can start).

The graph with earliest start times.

At this point, we can calculate the earliest start time of each node. To understand this, let’s consider node E. In order for node E to start, the arrows leading up to it must both be completed. These include the paths ABE (duration = 2 + 4 = 6) and ACE (duration = 3 + 2 = 5). Since both must be completed, there’s no starting E before the longest of these (duration 6) has completed, so E’s earliest start time is 6. This is useful because, considering the tasks that occur sequentially or in parallel, it allows us to schedule a task at the appropriate time when its dependencies have been completed.

In order to calculate the earliest start time of all nodes, we do a forward pass from left to right, assuming that the earliest start time for node A is zero, adding up the durations leading to each node, and taking the highest number where multiple arrows converge to the same node. So:

  • A: assume earliest start time is zero.
  • B: 0 + 2 = 2.
  • C: 0 + 3 = 3.
  • D: 2 + 1 = 3.
  • E: max(3 + 2, 2 + 4) = 6.
  • F: max(3 + 6, 6 + 5) = 11.

The results are shown in the above diagram, where earliest start times are shown under the bottom-left portion of each node.
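The forward pass described above can also be sketched in a few lines of Python. This is only an illustration of the manual calculation, not part of the Neo4j approach that follows; the edge list mirrors the example graph, and the names EDGES, topological_order and earliest_start_times are my own choices for this sketch.

```python
from collections import defaultdict, deque

# (from, to, duration in days), taken from the example graph
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]


def topological_order(edges):
    """Order nodes so that every node comes after all of its predecessors."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, _ in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order


def earliest_start_times(edges):
    """Forward pass: take the longest total duration leading into each node."""
    incoming = defaultdict(list)
    for src, dst, duration in edges:
        incoming[dst].append((src, duration))
    earliest = {}
    for node in topological_order(edges):
        earliest[node] = max(
            (earliest[src] + d for src, d in incoming[node]), default=0
        )
    return earliest


print(earliest_start_times(EDGES))
# {'A': 0, 'B': 2, 'C': 3, 'D': 3, 'E': 6, 'F': 11}
```

Processing the nodes in topological order guarantees that every predecessor's earliest start time is known before a node is evaluated.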

The graph with latest start times.

Next, we can calculate the latest start times, which tell us the latest day on which we can start each task without delaying the whole project. To do this, we start from the last node, setting its latest start time to the same value as its earliest start time (11 in the case of F). Then, we work backwards, subtracting the duration from the latest start time, and this time taking the minimum where a node diverges. So:

  • F: assume latest start time is same as earliest start time, i.e. 11.
  • E: 11 – 5 = 6.
  • D: 11 – 6 = 5.
  • C: 6 – 2 = 4.
  • B: min(5 – 1, 6 – 4) = 2
  • A: min(4 – 3, 2 – 2) = 0
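The backward pass lends itself to a similar Python sketch. It assumes the project's end time (11, the earliest start time of F) is already known from the forward pass and that the graph has no cycles; the names EDGES and latest_start_times are illustrative.

```python
from collections import defaultdict

# (from, to, duration in days), taken from the example graph
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]


def latest_start_times(edges, project_end):
    """Backward pass: subtract durations from successors, keeping the minimum."""
    outgoing = defaultdict(list)
    nodes = set()
    for src, dst, duration in edges:
        outgoing[src].append((dst, duration))
        nodes.update((src, dst))

    latest = {}

    def visit(node):
        # A node with no outgoing arrows is the final milestone
        if node not in latest:
            latest[node] = min(
                (visit(dst) - d for dst, d in outgoing[node]),
                default=project_end,
            )
        return latest[node]

    for node in sorted(nodes):
        visit(node)
    return latest


latest = latest_start_times(EDGES, project_end=11)
print(dict(sorted(latest.items())))
# {'A': 0, 'B': 2, 'C': 4, 'D': 5, 'E': 6, 'F': 11}
```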

The critical path consists of the nodes whose earliest and latest start times are equal – in this case this would be the path ABEF. If any of the tasks along this path are delayed, this would delay the whole project. On the other hand, nodes with different earliest and latest start times have some leeway. If the task along BD takes 3 days instead of 1 day, the path ABDF takes 2 + 3 + 6 = 11 days, which is the same as we need to get to F from the longer path, and so this doesn’t affect the overall project. The amount of leeway for each node is the difference between its latest and earliest start times. Nodes on the critical path have a zero difference and therefore get no leeway.
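As a quick check of the leeway figures, the subtraction can be done in Python using the values worked out by hand above. The dictionary and function names here are illustrative only.

```python
# Earliest and latest start times from the manual calculation
EARLIEST = {"A": 0, "B": 2, "C": 3, "D": 3, "E": 6, "F": 11}
LATEST = {"A": 0, "B": 2, "C": 4, "D": 5, "E": 6, "F": 11}


def leeway(earliest, latest):
    """Leeway per node: latest start time minus earliest start time."""
    return {node: latest[node] - earliest[node] for node in earliest}


def critical_nodes(earliest, latest):
    """Nodes with zero leeway lie on the critical path."""
    return [n for n, days in leeway(earliest, latest).items() if days == 0]


print(leeway(EARLIEST, LATEST))
# {'A': 0, 'B': 0, 'C': 1, 'D': 2, 'E': 0, 'F': 0}
print(critical_nodes(EARLIEST, LATEST))
# ['A', 'B', 'E', 'F']
```

Node D's leeway of 2 days is what allows the task along BD to stretch from 1 day to 3 without delaying F.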

Running Neo4j with Docker

Now that we’ve seen how critical path analysis works with manual calculations, we’ll see how to create and analyse the same graph using a graph database, specifically Neo4j.

The easiest way to run Neo4j quickly is using Docker. Assuming we’re using Linux, Docker is already installed, and we want to destroy the container once it’s stopped, the following command achieves this purpose:

sudo docker run --rm -it -p 7687:7687 -p 7474:7474 neo4j

Once Neo4j is running, we can access the Neo4j Browser in a web browser via the URL http://localhost:7474/browser/. The default credentials to log in are neo4j for both username and password, and these will have to be changed the first time. After that, the Neo4j Browser can be used to run Cypher queries and view their results.

Creating the Graph

To create the graph, we’ll run the following Cypher in the Neo4j Browser. The first set of statements creates the nodes. The second set locates the nodes we just created, and establishes the relationships between them. Since the statements end with a semicolon, they may be run all together in one go.

create (A:Milestone {name: 'A'});
create (B:Milestone {name: 'B'});
create (C:Milestone {name: 'C'});
create (D:Milestone {name: 'D'});
create (E:Milestone {name: 'E'});
create (F:Milestone {name: 'F'});

match (A:Milestone{name: 'A'}), (B:Milestone {name: 'B'}) create (A)-[:precedes {duration : 2}]->(B);
match (A:Milestone{name: 'A'}), (C:Milestone {name: 'C'}) create (A)-[:precedes {duration : 3}]->(C);
match (B:Milestone{name: 'B'}), (D:Milestone {name: 'D'}) create (B)-[:precedes {duration : 1}]->(D);
match (B:Milestone{name: 'B'}), (E:Milestone {name: 'E'}) create (B)-[:precedes {duration : 4}]->(E);
match (C:Milestone{name: 'C'}), (E:Milestone {name: 'E'}) create (C)-[:precedes {duration : 2}]->(E);
match (D:Milestone{name: 'D'}), (F:Milestone {name: 'F'}) create (D)-[:precedes {duration : 6}]->(F);
match (E:Milestone{name: 'E'}), (F:Milestone {name: 'F'}) create (E)-[:precedes {duration : 5}]->(F);

Once this is done, the resulting graph can be visualised by running the following simple Cypher query, which returns all nodes:

match(n)
return n

After adjusting the position of the nodes, as well as their colour and caption, the graph matches what we saw earlier:

The graph in Neo4j, as seen in the Neo4j Browser.

Setting Earliest Start Times

As we saw earlier, the earliest start times of each node are calculated by adding up the durations of each arrow leading to that node, taking the highest number in case there is more than one. In Neo4j, we can achieve this with a path query. We’ll build this step by step to clarify what the final query does.

We’ll start with this very simple Cypher query:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
return a, relationships(path), b

This gets every path between every two nodes, and returns the pair of nodes along with all the relationships along the way. The Text view of the result in the Neo4j browser is the following:

╒════════════╤══════════════════════════════════════════════╤════════════╕
│"a"         │"relationships(path)"                         │"b"         │
╞════════════╪══════════════════════════════════════════════╪════════════╡
│{"name":"A"}│[{"duration":2}]                              │{"name":"B"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":2},{"duration":1}]               │{"name":"D"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":2},{"duration":1},{"duration":6}]│{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":2},{"duration":4}]               │{"name":"E"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":2},{"duration":4},{"duration":5}]│{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":3}]                              │{"name":"C"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":3},{"duration":2}]               │{"name":"E"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"A"}│[{"duration":3},{"duration":2},{"duration":5}]│{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"B"}│[{"duration":1}]                              │{"name":"D"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"B"}│[{"duration":1},{"duration":6}]               │{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"B"}│[{"duration":4}]                              │{"name":"E"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"B"}│[{"duration":4},{"duration":5}]               │{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"C"}│[{"duration":2}]                              │{"name":"E"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"C"}│[{"duration":2},{"duration":5}]               │{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"D"}│[{"duration":6}]                              │{"name":"F"}│
├────────────┼──────────────────────────────────────────────┼────────────┤
│{"name":"E"}│[{"duration":5}]                              │{"name":"F"}│
└────────────┴──────────────────────────────────────────────┴────────────┘
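To get a feel for what the variable-length match is doing, here is a rough Python analogue that walks every path of one or more arrows in the example graph and collects the durations along the way. It is a sketch that assumes the graph has no cycles, and it says nothing about how Neo4j evaluates the query internally; EDGES and all_paths are names chosen for this illustration.

```python
from collections import defaultdict

# (from, to, duration in days), taken from the example graph
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]


def all_paths(edges):
    """Yield (start, durations along the path, end) for every path in a DAG."""
    outgoing = defaultdict(list)
    for src, dst, duration in edges:
        outgoing[src].append((dst, duration))

    def walk(start, node, durations):
        for dst, duration in outgoing.get(node, []):
            extended = durations + [duration]
            yield start, extended, dst
            # Keep following arrows from the node we just reached
            yield from walk(start, dst, extended)

    for start in sorted(outgoing):
        yield from walk(start, start, [])


rows = list(all_paths(EDGES))
for start, durations, end in rows:
    print(start, durations, end)
print(len(rows))  # 16, one per row of the table above
```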

In our case, we just want the value of the durations along each path, so we extract the duration as follows:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
return a, [r in relationships(path) | r.duration], b

The part in square brackets on the second line simply means “for each relationship in the path’s relationships, take the duration”. The following is the simplified result:

╒════════════╤═════════════════════════════════════════╤════════════╕
│"a"         │"[r in relationships(path) | r.duration]"│"b"         │
╞════════════╪═════════════════════════════════════════╪════════════╡
│{"name":"A"}│[2]                                      │{"name":"B"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[2,1]                                    │{"name":"D"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[2,1,6]                                  │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[2,4]                                    │{"name":"E"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[2,4,5]                                  │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[3]                                      │{"name":"C"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[3,2]                                    │{"name":"E"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"A"}│[3,2,5]                                  │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"B"}│[1]                                      │{"name":"D"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"B"}│[1,6]                                    │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"B"}│[4]                                      │{"name":"E"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"B"}│[4,5]                                    │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"C"}│[2]                                      │{"name":"E"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"C"}│[2,5]                                    │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"D"}│[6]                                      │{"name":"F"}│
├────────────┼─────────────────────────────────────────┼────────────┤
│{"name":"E"}│[5]                                      │{"name":"F"}│
└────────────┴─────────────────────────────────────────┴────────────┘
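If it helps to see the same idea outside Cypher, here is a minimal Python sketch of what the variable-length match plus the list comprehension are doing. The edge list is read off the result tables in this article; the names EDGES and all_paths are my own and purely illustrative.

```python
# Edges of the example graph as (start, end, duration), taken from the tables above.
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]


def all_paths(edges):
    """Yield (start, durations, end) for every directed path with at least one edge."""
    outgoing = {}
    for start, end, duration in edges:
        outgoing.setdefault(start, []).append((end, duration))

    def walk(node, durations):
        # Extend the current path by one relationship at a time.
        for nxt, duration in outgoing.get(node, []):
            extended = durations + [duration]
            yield extended, nxt
            yield from walk(nxt, extended)

    for start in sorted(outgoing):
        for durations, end in walk(start, []):
            yield start, durations, end


paths = list(all_paths(EDGES))
for start, durations, end in paths:
    print(start, durations, end)
```

This only works because the example graph has no cycles; a graph with cycles would need a visited check.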

This gives us a list of durations along each path. We can use the reduce() function to add them up, transforming the query as follows:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
return a, reduce(x = 0, r in relationships(path) | x + r.duration), b

reduce() uses x as an accumulator variable, adding the duration of each relationship to it and returning the final result. The result is now the following:

╒════════════╤══════════════════════════════════════════════════════════╤════════════╕
│"a"         │"reduce(x = 0, r in relationships(path) | x + r.duration)"│"b"         │
╞════════════╪══════════════════════════════════════════════════════════╪════════════╡
│{"name":"A"}│2                                                         │{"name":"B"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│3                                                         │{"name":"D"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│9                                                         │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│6                                                         │{"name":"E"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│11                                                        │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│3                                                         │{"name":"C"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│5                                                         │{"name":"E"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"A"}│10                                                        │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"B"}│1                                                         │{"name":"D"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"B"}│7                                                         │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"B"}│4                                                         │{"name":"E"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"B"}│9                                                         │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"C"}│2                                                         │{"name":"E"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"C"}│7                                                         │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"D"}│6                                                         │{"name":"F"}│
├────────────┼──────────────────────────────────────────────────────────┼────────────┤
│{"name":"E"}│5                                                         │{"name":"F"}│
└────────────┴──────────────────────────────────────────────────────────┴────────────┘
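For readers more at home in Python, functools.reduce behaves much like Cypher's reduce(). The following sketch sums a few of the duration lists from the earlier table; the variable names are illustrative only.

```python
from functools import reduce

# A few of the per-path duration lists from the earlier result table.
durations_per_path = [[2], [2, 1], [2, 1, 6], [2, 4, 5]]


def total_duration(durations):
    # x is the accumulator (starting at 0) and d is the current duration,
    # mirroring reduce(x = 0, r in relationships(path) | x + r.duration).
    return reduce(lambda x, d: x + d, durations, 0)


totals = [total_duration(durations) for durations in durations_per_path]
print(totals)
```

In everyday Python you would simply call sum(), but reduce() makes the accumulator explicit in the same way the Cypher version does.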

Finally, by using the max() function, dropping a from the result, and using a little ordering for clarity, we get exactly the earliest start times we wanted, using the following query:

match path = (:Milestone)-[:precedes*]->(b:Milestone)
return max(reduce(x = 0, r in relationships(path) | x + r.duration)), b
order by b.name

The resulting values, shown below, match what we calculated manually earlier:

╒═══════════════════════════════════════════════════════════════╤════════════╕
│"max(reduce(x = 0, r in relationships(path) | x + r.duration))"│"b"         │
╞═══════════════════════════════════════════════════════════════╪════════════╡
│2                                                              │{"name":"B"}│
├───────────────────────────────────────────────────────────────┼────────────┤
│3                                                              │{"name":"C"}│
├───────────────────────────────────────────────────────────────┼────────────┤
│3                                                              │{"name":"D"}│
├───────────────────────────────────────────────────────────────┼────────────┤
│6                                                              │{"name":"E"}│
├───────────────────────────────────────────────────────────────┼────────────┤
│11                                                             │{"name":"F"}│
└───────────────────────────────────────────────────────────────┴────────────┘
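The grouping that max() performs here (one maximum per distinct b) can be sketched in Python as follows. This is an illustration under the assumption that the graph is acyclic, using an edge list read off the tables above; path_totals is a hypothetical helper, not part of any library.

```python
# Edges of the example graph as (start, end, duration).
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]


def path_totals(edges):
    """Yield (end_node, total_duration) for every directed path in the graph."""
    outgoing = {}
    for start, end, duration in edges:
        outgoing.setdefault(start, []).append((end, duration))

    def walk(node, total):
        for nxt, duration in outgoing.get(node, []):
            yield nxt, total + duration
            yield from walk(nxt, total + duration)

    for start in outgoing:
        yield from walk(start, 0)


# Keep the largest total seen for each end node, like max() grouped by b.
earliest_start = {}
for end, total in path_totals(EDGES):
    earliest_start[end] = max(total, earliest_start.get(end, 0))

for name in sorted(earliest_start):
    print(earliest_start[name], name)
```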

All we have left to do now is modify the query to set these values on each node:

match path = (:Milestone)-[:precedes*]->(b:Milestone)
with b, max(reduce(x = 0, r in relationships(path) | x + r.duration)) as earliest_start
set b.earliest_start = earliest_start

It is then trivial to verify that the nodes have been updated with the correct earliest start times:

A simple query shows that the nodes have been updated with earliest start times.

Setting Latest Start Times

Setting the latest start times is easier and does not require complex path queries. As we did manually, we work our way backwards, subtracting the duration from the earliest start time, and taking the minimum where there are multiple arrows emerging from a node. The following query does the trick:

match (a:Milestone)-[r:precedes]->(b:Milestone)
return a, min(b.earliest_start - r.duration) as latest_start
order by a.name

The following output shows values that match what we originally calculated manually:

╒═══════════════════════════════╤══════════════╕
│"a"                            │"latest_start"│
╞═══════════════════════════════╪══════════════╡
│{"name":"A"}                   │0             │
├───────────────────────────────┼──────────────┤
│{"name":"B","earliest_start":2}│2             │
├───────────────────────────────┼──────────────┤
│{"name":"C","earliest_start":3}│4             │
├───────────────────────────────┼──────────────┤
│{"name":"D","earliest_start":3}│5             │
├───────────────────────────────┼──────────────┤
│{"name":"E","earliest_start":6}│6             │
└───────────────────────────────┴──────────────┘
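The same calculation is easy to check by hand or with a few lines of Python. This sketch takes the earliest start times computed earlier as given and mirrors min(b.earliest_start - r.duration) grouped by a; the dictionary names are my own.

```python
# Edges of the example graph as (start, end, duration).
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]

# Earliest start times as set on the nodes in the previous section.
EARLIEST_START = {"B": 2, "C": 3, "D": 3, "E": 6, "F": 11}

latest_start = {}
for start, end, duration in EDGES:
    candidate = EARLIEST_START[end] - duration
    # Keep the smallest candidate per start node, like min() grouped by a.
    latest_start[start] = min(candidate, latest_start.get(start, candidate))

for name in sorted(latest_start):
    print(name, latest_start[name])
```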

We can set the latest start time on each node by adjusting the query slightly as follows:

match (a:Milestone)-[r:precedes]->(b:Milestone)
with a, min(b.earliest_start - r.duration) as latest_start
set a.latest_start = latest_start

Once again, we verify that everything has updated correctly:

A simple query shows that the nodes have been updated with latest start times.

Calculating the Critical Path: Maximum Duration

One way to calculate the critical path is shown in the aforelinked Project Management Neo4j graph gist by Nicole White. Adapted to our graph representation, the query for this is as follows:

match path = (a:Milestone)-[:precedes*]->(b:Milestone)
where a.name = 'A' and b.name = 'F'
with path, reduce(total_duration = 0, r in relationships(path) | total_duration + r.duration) AS total_duration
order by total_duration desc
limit 1
return nodes(path)

This method does not need earliest and latest start times at all. It works as follows:

  • It obtains all paths between the start and finish node (as per the where clause).
  • The total duration of each path is calculated with reduce().
  • The longest path is taken thanks to the order by ... desc and limit 1.
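The three steps above can be sketched in plain Python as follows. This is only an illustration of the logic, assuming an acyclic graph and using the edge list from the earlier tables; paths_between is a hypothetical helper name.

```python
# Edges of the example graph as (start, end, duration).
EDGES = [
    ("A", "B", 2), ("A", "C", 3), ("B", "D", 1), ("B", "E", 4),
    ("C", "E", 2), ("D", "F", 6), ("E", "F", 5),
]


def paths_between(edges, source, target):
    """Yield (nodes, total_duration) for every directed path from source to target."""
    outgoing = {}
    for start, end, duration in edges:
        outgoing.setdefault(start, []).append((end, duration))

    def walk(node, nodes, total):
        if node == target:
            yield nodes, total
        for nxt, duration in outgoing.get(node, []):
            yield from walk(nxt, nodes + [nxt], total + duration)

    yield from walk(source, [source], 0)


# Step 1 and 2: all paths from A to F, each with its total duration.
candidates = list(paths_between(EDGES, "A", "F"))

# Step 3: the equivalent of "order by total_duration desc limit 1".
critical_nodes, critical_duration = max(candidates, key=lambda c: c[1])
print(critical_nodes, critical_duration)
```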

As you can see from the screenshot below, this method works pretty well.

The critical path shown in Neo4j Browser.

Calculating the Critical Path: Equal Start Times

You might remember from earlier that the earliest and latest start times are equal in each node along the critical path, so this gives us another way to calculate the critical path. To do this, though, we first need to update the start and end nodes to fill in their missing earliest and latest start times, as follows:

match (a:Milestone)
where a.name = 'A'
set a.earliest_start = 0;

match (f:Milestone)
where f.name = 'F'
set f.latest_start = f.earliest_start;

We can then obtain the critical path as follows, using the all() predicate function to ensure that we pick only the nodes having equal earliest and latest start times:

match path = (a:Milestone)-[r:precedes*]->(b:Milestone)
where a.name = 'A' and b.name = 'F'
and all(node in nodes(path) where node.earliest_start = node.latest_start)
return nodes(path)

As you can see, this method works just as well:

The critical path shown in Neo4j Browser.
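Python's built-in all() works much like Cypher's all() predicate, so the filter can be sketched as below. The start times are the ones calculated in this article (including the values filled in for A and F), and the three candidate paths are the A-to-F paths of the example graph.

```python
# Earliest and latest start times per node, as set by the earlier queries.
EARLIEST = {"A": 0, "B": 2, "C": 3, "D": 3, "E": 6, "F": 11}
LATEST = {"A": 0, "B": 2, "C": 4, "D": 5, "E": 6, "F": 11}

# All paths from A to F in the example graph.
paths = [
    ["A", "B", "D", "F"],
    ["A", "B", "E", "F"],
    ["A", "C", "E", "F"],
]

# Keep only paths where every node has equal earliest and latest start times.
critical = [
    path for path in paths
    if all(EARLIEST[node] == LATEST[node] for node in path)
]
print(critical)
```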

Conclusion

Although we’re feeling so Agile nowadays with all these fancy Kanban boards, the nature of projects, tasks and their dependencies makes them best represented by graphs. Additionally, using critical path analysis, it’s possible to obtain useful analytics, such as the optimal time to schedule tasks, which tasks risk delaying the whole project, and which tasks may be delayed without impacting the project delivery.

This scenario served as an example to explore relatively advanced Cypher features, including path queries and various functions.

Using the Neo4j Bolt Driver for Python with Memgraph

Memgraph is a relatively young graph database. It supports the Cypher query language and the Bolt protocol – just like Neo4j – therefore it is usually possible to use Neo4j client libraries (called “drivers”) with Memgraph. In fact, according to the Memgraph Drivers documentation, using the Neo4j drivers is the recommended way to communicate with Memgraph from several languages like Go, Java and C#.

Memgraph Python Client Libraries

For Python, there are actually a number of options to choose from:

  • pymgclient: I discovered this recently and haven’t used it, but it seems to be a lower-level client that works well.
  • gqlalchemy: Memgraph’s recommended client for Python, which uses pymgclient underneath. Perhaps named after Python ORM sqlalchemy, it provides three different ways to query Memgraph, none of which I found to be very practical:
    • Basic execute() and execute_and_fetch(): this seems simple enough, but I haven’t found any way to pass parameters to queries, making it useless for my use case.
    • OGM: This is a graph equivalent of an ORM. It’s no secret that ORMs are one of the things I avoid like the plague – I’ve already written some of my thoughts on the subject in “ADO .NET Part 1: Introduction“, and time-permitting it will also be the subject of a future article. In a nutshell: I just want to write Cypher queries and execute them, not have to translate them to some library’s arbitrary API.
    • Query builder: A fluent query builder, similar in approach to what Elasticsearch provides for .NET. I’m not a fan for the same reasons that apply to ORMs (see previous point above).
  • Neo4j Bolt Driver for Python: This doesn’t work with Memgraph out of the box, but we’ll talk more about this.

It’s unfortunate that the Neo4j Bolt Driver for Python doesn’t work with Memgraph by default, because if you already have Python code that works with Neo4j, you could otherwise use Memgraph as a drop-in replacement for Neo4j with minimal changes (e.g. fixing incompatible Cypher).

For the rest of this article, I will be focusing on the Neo4j Bolt Driver for Python, to understand why we can’t use it with Memgraph and explain how to get around the problem.

Update 21st November 2022: TL;DR: if you need a quick solution, go to the end of this article.

Why the Neo4j Driver Fails with Memgraph

Let’s make a first attempt to use the Neo4j Bolt Driver for Python with Memgraph.

First, we need to have an instance of Memgraph running. The easiest way is to run it with Docker, e.g. as follows (assuming you’re on Linux):

sudo docker run --rm -it -p 7687:7687 -p 3000:3000 memgraph/memgraph-platform

If this works, it will start a Memgraph shell, and you can also access Memgraph Lab (Memgraph’s web user interface) by visiting http://localhost:3000/.

Memgraph shell after running it with Docker, and Memgraph Lab (UI) open in Firefox in the background.

Next, create a folder for your Python code. Run the following to install the Neo4j driver:

pip3 install neo4j

At the time of writing this article, the version of the Neo4j Python driver is 5.2.1. With earlier versions, it’s possible you might run into errors such as:

neobolt.exceptions.SecurityError: Failed to establish secure connection to ‘[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)’

In this case, update the driver as follows:

pip3 install neo4j --upgrade

At this point, we can steal some example code from the Neo4j Bolt Driver for Python, as follows, and put it in a file called main.py:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687",
                              auth=("neo4j", "password"))

def add_friend(tx, name, friend_name):
    tx.run("MERGE (a:Person {name: $name}) "
           "MERGE (a)-[:KNOWS]->(friend:Person {name: $friend_name})",
           name=name, friend_name=friend_name)

def print_friends(tx, name):
    query = ("MATCH (a:Person)-[:KNOWS]->(friend) WHERE a.name = $name "
             "RETURN friend.name ORDER BY friend.name")
    for record in tx.run(query, name=name):
        print(record["friend.name"])

with driver.session(database="neo4j") as session:
    session.execute_write(add_friend, "Arthur", "Guinevere")
    session.execute_write(add_friend, "Arthur", "Lancelot")
    session.execute_write(add_friend, "Arthur", "Merlin")
    session.execute_read(print_friends, "Arthur")

driver.close()

Once we run this with python3 main.py, we get a nice big error:

$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    session.execute_write(add_friend, "Arthur", "Guinevere")
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 712, in execute_write
    return self._run_transaction(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 484, in _run_transaction
    self._open_transaction(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 396, in _open_transaction
    self._connect(access_mode=access_mode)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 123, in _connect
    super()._connect(access_mode, **access_kwargs)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/workspace.py", line 198, in _connect
    self._connection = self._pool.acquire(**acquire_kwargs_)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 778, in acquire
    self.ensure_routing_table_is_fresh(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 721, in ensure_routing_table_is_fresh
    self.update_routing_table(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 648, in update_routing_table
    if self._update_routing_table_from(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 596, in _update_routing_table_from
    new_routing_table = self.fetch_routing_table(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 534, in fetch_routing_table
    new_routing_info = self.fetch_routing_info(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 504, in fetch_routing_info
    cx = self._acquire(address, deadline, None)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 221, in _acquire
    return connection_creator()
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 138, in connection_creator
    connection = self.opener(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 441, in opener
    return Bolt.open(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_bolt.py", line 377, in open
    connection.hello()
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_bolt4.py", line 450, in hello
    check_supported_server_product(self.server_info.agent)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_common.py", line 283, in check_supported_server_product
    raise UnsupportedServerProduct(agent)
neo4j.exceptions.UnsupportedServerProduct: None

The last three lines indicate that the problem seems to be a simple validation, which we can confirm by looking up the offending function in the Neo4j driver’s source code:

def check_supported_server_product(agent):
    """ Checks that a server product is supported by the driver by
    looking at the server agent string.
    :param agent: server agent string to check for validity
    :raises UnsupportedServerProduct: if the product is not supported
    """
    if not agent.startswith("Neo4j/"):
        raise UnsupportedServerProduct(agent)
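
To make the effect of this check concrete, here is a standalone sketch that copies the logic into a script with a local stand-in exception class, so it runs without the driver installed. The agent strings passed in are made-up examples, not values I captured from real servers.

```python
class UnsupportedServerProduct(Exception):
    """Local stand-in for neo4j.exceptions.UnsupportedServerProduct."""


def check_supported_server_product(agent):
    # Same logic as the driver function shown above.
    if not agent.startswith("Neo4j/"):
        raise UnsupportedServerProduct(agent)


def is_accepted(agent):
    """Return True if the agent string would pass the driver's check."""
    try:
        check_supported_server_product(agent)
    except UnsupportedServerProduct:
        return False
    return True


# Hypothetical agent strings, for illustration only.
print(is_accepted("Neo4j/4.4.0"))
print(is_accepted("Neo4j/"))
print(is_accepted("SomeOtherGraphDB/1.0"))
```

In other words, anything that starts with "Neo4j/" passes, regardless of what follows.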

What would happen if we simply disable this check? Let’s find out.

Tweaking the Neo4j Driver to Work with Memgraph

First, let’s clone the Neo4j driver’s repo:

git clone https://github.com/neo4j/neo4j-python-driver.git

A quick search shows that there are two places where the server product check is done:

There are two equivalent check_supported_server_product() functions in neo4j/_async/io/_common.py and neo4j/_sync/io/_common.py.

We can disable the validation by replacing the implementation of each function with just pass:

def check_supported_server_product(agent):
    pass

Next, we build this modified version of the Neo4j driver as follows:

python3 setup.py sdist

This creates a file called neo4j-5.2.dev0.tar.gz in a dist subfolder. Take note of the path of this file.

Back in the folder with our Python test code (where we were attempting to communicate with Memgraph), install the package we just built:

$ pip3 install /home/daniel/Desktop/neo4j-python-driver/dist/neo4j-5.2.dev0.tar.gz
Processing /home/daniel/Desktop/neo4j-python-driver/dist/neo4j-5.2.dev0.tar.gz
Requirement already satisfied: pytz in /home/daniel/.local/lib/python3.8/site-packages (from neo4j==5.2.dev0) (2022.1)
Building wheels for collected packages: neo4j
  Building wheel for neo4j (setup.py) ... done
  Created wheel for neo4j: filename=neo4j-5.2.dev0-py3-none-any.whl size=244857 sha256=ec2951ea1fecf2ae1aacced4d93c66b1b5d90bc3710746ff3814b9b62a96a9af
  Stored in directory: /home/daniel/.cache/pip/wheels/0d/4c/55/2486d65ebf98105bc54a490ebd91cea4ba538268a32ffc91f0
Successfully built neo4j
Installing collected packages: neo4j
  Attempting uninstall: neo4j
    Found existing installation: neo4j 5.2.1
    Uninstalling neo4j-5.2.1:
      Successfully uninstalled neo4j-5.2.1
Successfully installed neo4j-5.2.dev0

Run the Python code again…

$ python3 main.py
Unable to retrieve routing information
Transaction failed and will be retried in 0.9256931081701124s (Unable to retrieve routing information)
Unable to retrieve routing information
Transaction failed and will be retried in 2.0779915720272504s (Unable to retrieve routing information)

We still have a failure, but this is a simple connectivity issue that is easily fixed by changing the scheme in the URI from neo4j to bolt:

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

Running it again, we see that it now works!

$ python3 main.py
Guinevere
Lancelot
Merlin
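
The neo4j:// scheme asks the driver to fetch routing information, which is what failed above, whereas bolt:// connects directly to a single server. If you have existing code or configuration full of neo4j:// URIs, a small helper like the following sketch could rewrite them. The function name to_direct_uri is my own invention and not part of the driver.

```python
from urllib.parse import urlsplit, urlunsplit


def to_direct_uri(uri):
    """Rewrite a routing URI (neo4j://...) to its direct equivalent (bolt://...)."""
    parts = urlsplit(uri)
    if parts.scheme.startswith("neo4j"):
        # Preserve any suffix such as "+s" or "+ssc" after the base scheme.
        suffix = parts.scheme[len("neo4j"):]
        parts = parts._replace(scheme="bolt" + suffix)
    return urlunsplit(parts)


print(to_direct_uri("neo4j://localhost:7687"))
```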

We can also view the created data in Memgraph Lab to double-check that it really worked:

Querying the data in Memgraph Lab, we see that the example nodes were created.

Conclusion

In this article, we’ve confirmed that, at a basic level, the only thing preventing the Neo4j Bolt Driver for Python from being used with Memgraph is a simple check against a response from the server. We saw that queries could be executed once this check was disabled.

As a result, it’s not clear why Memgraph built their own Python clients instead of simply addressing this check (e.g. by sending the same response as Neo4j, or forking the driver and eliminating the check as I did). I will refrain from speculating on possible reasons, but I found this interesting to investigate and hope it saves time for other people in the same situation.

P.S.: There’s an Easier Way

This section was added on 21st November 2022.

I learned from the Memgraph team that they do provide a way to deal with the server check – it’s just not documented at the time of writing this article. Basically, all you have to do is run Memgraph using a --bolt-server-name-for-init switch that sets the missing server response. So if you run Memgraph in Docker, you’d need to run it as follows:

sudo docker run --rm -it -p 7687:7687 -p 3000:3000 -e MEMGRAPH="--bolt-server-name-for-init=Neo4j/" memgraph/memgraph-platform

If you run the example code with the bolt:// scheme using the unmodified Neo4j Bolt Driver for Python, it should work just as well.

Update 19th September 2023: as of Memgraph v2.11, --bolt-server-name-for-init has a default value compatible with the Neo4j Bolt Driver, and therefore no longer needs to be provided.

Getting Started with Cartography for AWS

I have recently been working with Cartography. This tool is great for taking stock of your infrastructural and security assets, visualising them, and running security audits. However, getting it to work the first time is more painful than it needs to be. Through this article, I hope to make it less painful for other people checking out Cartography for the first time.

What is Cartography?

Cartography is a tool that can explore cloud and Software as a Service (SaaS) providers (such as AWS, Azure, GCP, GitHub, Okta and others), gather metadata about them, and store it in a Neo4j graph database. Once in Neo4j, the data can be queried using the Cypher language and the results can be visualised. This is extremely useful to understand the relationship between different infrastructural and security assets, which can sometimes reveal security flaws that need to be addressed.

Cartography is written in Python and maintained by Lyft. Sacha Faust’s “Automating Security Visibility and Democratization” 30-minute talk at BSidesSF 2019 serves as a great intro to Cartography, and also illustrates several of the early data relationships it collected.

Good to Know

Before we dive into setting up Cartography and its dependencies, I want to point out some issues I ran into, in order to minimise frustration.

[Update 8th July 2023: all issues in this section have by now been fixed, so you can skip this section. You can use a newer version of Neo4j now, although the rest of the article still uses Neo4j 3.5 for historical reasons.]

The biggest of these is that Cartography still requires the outdated Neo4j 3.5, which was planned to reach its end-of-life on 28th November 2021. Although a pull request for migration to Neo4j 4.4 was contributed on 30th January 2021, the Lyft team completely missed this deadline. Fortunately, support for Neo4j 3.5 was extended to 27th May 2022. Although the maintainers are planning to migrate to a newer Neo4j version by then, I’m not holding my breath.

This worries me for a number of reasons:

  1. If Neo4j 3.5 reaches end of life before Cartography have migrated to a more recent version, it means people using Cartography would need to run an unsupported version of Neo4j. This could be a security risk, which is ironic given that Cartography is a tool used for security.
  2. It gives the feeling that Cartography is not very well-maintained, if issues as important as this take well over a year to resolve.
  3. It makes it virtually impossible to run Cartography on a Mac with one of the newer Apple M1 CPUs. That’s because Neo4j 3.5 won’t run on an arm64 processor (e.g. Neo4j Docker images for this architecture started to appear only since 4.4), but also because a Python cryptography dependency needs to be upgraded.

So if you feel you need to depend on Cartography, it might make sense to fork it and maintain it yourself. Upgrading it to support Neo4j 4.4 is tedious but not extremely complicated, and mostly is a matter of updating Cypher queries to use the new parameter syntax as explained in the aforementioned pull request.

Another problem I ran into (and reported) is that Cartography gets much more EBS snapshot data than necessary. This bloats the Neo4j database with orders of magnitude more data than is useful, and makes the already slow process of data collection take several minutes longer than it needs to.

Setting Up Neo4j

For now, we’ll have to stick with Neo4j 3.5. You can follow the Cartography Installation documentation to set up a local Neo4j instance, but it’s actually much easier to just run a Docker container. In fact, all you need is to run the following command:

sudo docker run -p 7474:7474 -p 7473:7473 -p 7687:7687 neo4j:3.5

This way, you can avoid bloating your system with dependencies like Java, and just manage the container instead. Depending on the operating system you use, you may need to keep or drop the sudo command. You’ll also need to mount a volume (not shown here) if you want the data to survive container restarts.

Running a Neo4j 3.5 Docker container.

Once Neo4j 3.5 is running, you can access the Neo4j Browser at localhost:7474:

The Neo4j Browser’s login screen.

Log in with the default credentials, i.e. with “neo4j” as both username and password. You will then be prompted to change your password:

Changing password in the Neo4j Browser.

Go ahead and change the password. This is necessary because Cartography would not otherwise be able to connect to Neo4j using the default credentials.

The Neo4j Browser’s dashboard after changing password.

Setting Up a SecurityAudit User in AWS

Cartography can be used to map out several different services, but here we’ll use AWS. To retrieve AWS data, we’ll need to set up a user with a SecurityAudit policy.

Log into the AWS Console, then go into the IAM service, and finally select “Users” on the left. Click the “Add users” button on the right.

Once in IAM, select “Users” on the left, and then click “Add users” on the right.

In the next screen, enter a name for the user, and choose “Access key – Programmatic access” as the AWS credential type, then click the “Next: Permissions” button at the bottom-right.

Enter a username, then choose Programmatic access before proceeding.

In the Permissions screen, select “Attach existing policies directly” (an arguable practice, but for now it will suffice). Use the search input to quickly filter the list of policies until you can see “SecurityAudit”, then click the checkbox next to it, and finally click the “Next: Tags” button at the bottom-right to proceed.

Attach the “SecurityAudit” policy directly to the new user.

There is nothing more to do, so just click on the remaining “Next” buttons and create the user. At this point you are given the new user’s Access key ID and Secret access key. Grab hold of them and keep them in a safe place. We’ll use them shortly.

Now that we have a user with the right permissions, all we need to do is set up the necessary AWS configuration locally, so that Cartography can use that user to inspect the AWS account. This is quite simple and is covered in the AWS Configuration and credential file settings documentation.

First, create a file at ~/.aws/credentials, and then add the Access key ID and Secret access key you just obtained, as follows (replacing the placeholder values):

[default]
aws_access_key_id=ACCESSKEYIDVALUE
aws_secret_access_key=SECRETACCESSKEYIDVALUE

Then, create another file at ~/.aws/config, and add the basic configuration as follows. I’m not sure whether the region actually makes a difference, since Cartography will in fact inspect all regions for many services that can be deployed in multiple regions.

[default]
region=us-west-2
output=json
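
If you'd rather script this step, the two files are plain INI, so Python's configparser can write them. The following is a minimal sketch: write_aws_files is a hypothetical helper of mine, and the demo deliberately writes to a temporary folder rather than your real ~/.aws, so that it can't overwrite existing credentials.

```python
import configparser
import tempfile
from pathlib import Path


def write_aws_files(aws_dir, access_key_id, secret_access_key, region="us-west-2"):
    """Write minimal 'credentials' and 'config' files into aws_dir."""
    aws_dir = Path(aws_dir)
    aws_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None so that special characters in keys are stored as-is.
    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    with open(aws_dir / "credentials", "w") as f:
        credentials.write(f, space_around_delimiters=False)

    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region, "output": "json"}
    with open(aws_dir / "config", "w") as f:
        config.write(f, space_around_delimiters=False)


# Demo with placeholder values in a throwaway folder (not the real ~/.aws).
demo_dir = Path(tempfile.mkdtemp()) / ".aws"
write_aws_files(demo_dir, "ACCESSKEYIDVALUE", "SECRETACCESSKEYIDVALUE")
print((demo_dir / "credentials").read_text())
print((demo_dir / "config").read_text())
```

To use it for real, you would pass Path.home() / ".aws" as the folder, after checking that you aren't clobbering files you already have there.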

That’s it! Let’s run Cartography.

Running Cartography

Run the following command to install Cartography:

pip3 install cartography

Then, run Cartography itself:

cartography --neo4j-uri bolt://localhost:7687 --neo4j-password-prompt --neo4j-user neo4j

Enter the Neo4j password you set earlier (i.e. not the default one) when prompted.

Cartography should now run, collecting data from AWS, adding it to Neo4j, and writing output as it works. It takes a while, even for a brand new AWS account.

Querying the Graph

Once Cartography finishes running, go back to the Neo4j Browser at http://localhost:7474/browser/. You can now write Cypher queries to analyse the data collected by Cartography.

If you haven’t used Cypher before, check out my articles “First Steps with RedisGraph” and “Family Tree with RedisGraph“, as well as my RedisConf 2020 talk “A Practical Introduction to RedisGraph“. RedisGraph is another graph database that uses the same Cypher query language, and these resources should allow you to ramp up quickly.

You might not know what Cartography data to look for initially, but you can always start with a simple MATCH query, and as you type “AWS” as a node type in a partial query (e.g. “MATCH (x:AWS“), Neo4j will suggest types from the ones it knows about. You can also consult the AWS Schema documentation, as well as the aforementioned “Automating Security Visibility and Democratization” talk which illustrates some of these types and their relationships in handy diagrams.

Let’s take a look at a few simple examples around IAM to ease you in.

Example 1: Get All Principals

MATCH (u:AWSPrincipal)
RETURN u

In AWS, a “principal” is an umbrella term for anything that can make a request, including users, groups, roles, and the special root user. Although this is a very basic query, you’ll be surprised by what it returns, including some special internal AWS roles.

Example 2: Get Users with Policies

MATCH (u:AWSUser)-[:POLICY]->(p:AWSPolicy)
RETURN u, p

This query gets users and their policies via the POLICY relationship. Due to the nature of the query, it won’t return users that don’t have any directly attached policies. In this case all I’ve got is the cartography user I created earlier, but you can see the connection to the SecurityAudit policy.

The cartography user is linked to the SecurityAudit policy.

Example 3: Get Policy Statements for Principals

MATCH (a:AWSPrincipal)-->(p:AWSPolicy)-[:STATEMENT]->(s)
RETURN a, p, s

Cartography parses the statements in AWS policies, so if you inspect a node of type AWSPolicy, you can actually see what resources it provides access to. This query shows the relationship between principals (again, this means users, groups, etc) and the details of the policies attached directly to them.

It is possible to refine this query further to include indirectly assigned policies (e.g. to see what permissions a user has via a group it belongs to), or to look for specific permissions (e.g. whether a principal has access to iam:*).

Results of a Cypher query linking AWS principals to the policy statements that apply to them, via AWS policies.
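
As a rough idea of what looking for specific permissions could involve once you've pulled statement data out of the graph, here is a Python sketch that checks whether a statement's action patterns cover a given action. I'm assuming each statement is available as a dictionary with an effect and a list of action patterns; those key names and the sample data are assumptions for illustration, so check them against the AWS Schema documentation. It also ignores resources, conditions and explicit denies, so treat it as a starting point rather than a full policy evaluator.

```python
from fnmatch import fnmatchcase


def grants_action(statement, wanted_action):
    """Return True if an Allow statement has an action pattern matching wanted_action."""
    if statement.get("effect") != "Allow":
        return False
    wanted = wanted_action.lower()
    # IAM action names are case-insensitive, so compare in lowercase.
    # fnmatch handles the * and ? wildcards (it also accepts [..] sets, which IAM does not).
    return any(
        fnmatchcase(wanted, pattern.lower())
        for pattern in statement.get("action", [])
    )


# Hypothetical statements, shaped the way I assume they'd come back from a query.
statements = [
    {"effect": "Allow", "action": ["iam:*"]},
    {"effect": "Allow", "action": ["s3:Get*", "s3:List*"]},
    {"effect": "Deny", "action": ["*"]},
]

matches = [grants_action(s, "iam:CreateUser") for s in statements]
print(matches)
```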

Wrapping Up

As you can see, Cartography takes a bit of effort to set up and has some caveats, but it’s otherwise a fantastic tool to gather data about your resources into Neo4j for further analysis.