Tag Archives: Memgraph

Migrating Cartography to Memgraph

I’ve recently written about how to use Cartography to collect infrastructural data from both AWS and Okta into a Neo4j graph database for security analysis.

Neo4j is a long-standing player in the graph database market, with a robust product, great documentation, and a massive following. However, its long legacy is in a way also a disadvantage, as it can be costly, slow, and resource-hungry (due in no small part to its reliance on the JVM). Sometimes people would like to use an alternative for any of these reasons.

Memgraph, on the other hand, is a relatively young graph database, and certainly not as fully-featured as Neo4j. A key difference is that it is written in C++, meaning it’s designed to be faster and more lightweight than Neo4j (whether it lives up to this is something you’ll need to evaluate for your own use cases). Memgraph also made a very wise decision to support the Bolt protocol and the Cypher language – both of which Neo4j uses – meaning that it’s compatible with existing Neo4j clients and queries. Although there are variations in Cypher dialect, the incompatibilities are few, and moving from Neo4j to Memgraph is significantly less painful than, say, transitioning to a graph database that uses Gremlin as its query language.

At the time of writing this article, Cartography requires Neo4j 4.x, and does not work with Memgraph. However, I’m going to show you how to make at least part of it (the Okta intel module) work with minor alterations to the Cartography codebase. This serves as a demonstration of how to get started migrating an existing application from Neo4j to Memgraph.

Running Memgraph

Before we start looking at Cartography, let’s run an instance of Memgraph. To do this, we’ll take a tip from my earlier article, “Using the Neo4j Bolt Driver for Python with Memgraph“, and run it under Docker as follows (drop the sudo if you’re on Mac or Windows):

sudo docker run --rm -it -p 7687:7687 -p 3000:3000 -e MEMGRAPH="--bolt-server-name-for-init=Neo4j/" memgraph/memgraph-platform

That --bolt-server-name-for-init=Neo4j/ is a first critical step in Neo4j compatibility. As explained in that same article, the Neo4j Bolt Driver (i.e. client) for Python (which Cartography uses) checks whether the server sends an “agent” value that starts with “Neo4j/”. By setting this, Memgraph is effectively posing as a Neo4j server, and the Neo4j Bolt Driver for Python can’t tell the difference.

Update 19th September 2023: as of Memgraph v2.11, --bolt-server-name-for-init has a default value compatible with the Neo4j Bolt Driver, and therefore no longer needs to be provided.

If it’s successful, you should see output such as the following:

Memgraph is running. You can also execute queries directly from here.

Cloning the Cartography Repo

The next thing to do is grab a copy of the Cartography source code from the Cartography GitHub repo:

git clone https://github.com/lyft/cartography.git

Next, run the following command to install the necessary dependencies:

pip3 install -e .

Note: in the past, I’ve usually had to upgrade the Neo4j Bolt Driver for Python to 5.2.1 to get anything working, but as I try this again, it seems to work even with the default 4.4.x that Cartography uses. If you have problems, try changing setup.py to require neo4j>=5.2.1 and run the above command again.

Creating a Launch Configuration in Visual Studio Code

In order to run Cartography from its source code, you could run it directly from the terminal, for instance:

cd cartography/cartography
python3 __main__.py

However, as I’ve recently been using Visual Studio Code for all my polyglot software development needs, I find it much more convenient to set up a launch configuration that allows me to easily debug Cartography and pass whatever command-line arguments and environment variables I want.

The following launch.json is handy to run Cartography with an Okta configuration as described in “Getting Started with Cartography for Okta“:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Run Cartography",
            "type": "python",
            "request": "launch",
            "program": "cartography/__main__.py",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--neo4j-user",
                "ignore",
                "--neo4j-password-env-var",
                "NEO4J_PASS",
                "--okta-org-id",
                "dev-xxxxxxxx",
                "--okta-api-key-env-var",
                "OKTA_API_TOKEN"
            ],
            "env": {
                "NEO4J_PASS": "ignore",
                "OKTA_API_TOKEN": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            }
        }
    ]
}

You might notice we’re telling Cartography to connect to Neo4j (Memgraph actually) with username and password both set to “ignore”. The reason for this is that while the Community Edition of Memgraph does not require (or support) authentication, the Neo4j Bolt Driver for Python (i.e. Neo4j client) does require a username and password to be provided. So, as a second critical compatibility step, we pass any arbitrary value for the Neo4j username and password so long as they are not left empty.

As for the Okta configuration, remember to replace the Organisation ID and API Token with real ones.

Incompatible Index Creation Cypher

Pressing F5, we can now run Cartography from inside Visual Studio Code, and we immediately run into the first problem:

The error says: “no viable alternative at input ‘CREATEINDEXIF'”.

Memgraph is choking on the index creation step in indexes.cypher (in VS Code, use Ctrl+P / Command+P to quickly locate the file) because the index creation syntax is one aspect of Memgraph’s Cypher implementation that is not compatible with that of Neo4j. If we take the first line in the file, the Neo4j-compatible syntax is:

CREATE INDEX IF NOT EXISTS FOR (n:AWSConfigurationRecorder) ON (n.id);

…whereas the equivalent on Memgraph would be:

CREATE INDEX ON :AWSConfigurationRecorder(id);

Note: in Memgraph, the “IF NOT EXISTS” bit is implicit: an index is created if it doesn’t exist; if it does, the operation is a no-op that does not cause any error.

Fortunately, this syntactic difference is easily resolved by replacing (using VS Code search & replace syntax, with regex enabled) this:

CREATE INDEX IF NOT EXISTS FOR \(n:(.*?)\) ON \(n.(.*?)\);

…with this:

CREATE INDEX ON :$1($2);

Tip: although not in scope here, you’ll need to make a similar change also in querybuilder.py and tx.py if you also want to get other intel modules (e.g. AWS) working.

Neo4j Result Consumption

After fixing the index creation syntax and rerunning Cartography, we run into another problem:

The error says: “The result is out of scope. The associated transaction has been closed. Results can only be used while the transaction is open.”

I’m told that consume() is used to fix a problem in which Neo4j connections hang in situations where internal buffers fill up, although the Cartography team is re-evaluating whether this is necessary. In practice, I have seen that removing this doesn’t seem to cause problems with datasets I’ve tested with, although your mileage may vary. Let’s fix this problem by removing usage of consume() in statement.py.

First, we drop the .consume() at the end of line 76 inside the run() function:

    def run(self, session: neo4j.Session) -> None:
        """
        Run the statement. This will execute the query against the graph.
        """
        if self.iterative:
            self._run_iterative(session)
        else:
            session.write_transaction(self._run_noniterative)
        logger.info(f"Completed {self.parent_job_name} statement #{self.parent_job_sequence_num}")

Then, in the _run_iterative() function, we remove the entire while loop (lines 120-128) except for line 121, which we de-indent:

        # while True:
        result: neo4j.Result = session.write_transaction(self._run_noniterative)

            # Exit if we have finished processing all items
            # if not result.consume().counters.contains_updates:
            #     # Ensure network buffers are cleared
            #     result.consume()
            #     break
            # result.consume()

When we run it again, it should finish the run without problems and return control of the terminal with the prompt showing:

...
INFO:cartography.sync:Finishing sync stage 'duo'
INFO:cartography.sync:Starting sync stage 'analysis'
INFO:cartography.intel.analysis:Skipping analysis because no job path was provided.
INFO:cartography.sync:Finishing sync stage 'analysis'
INFO:cartography.sync:Finishing sync with update tag '1689401212'
daniel@andromeda:~/git/cartography$

Querying the Graph

The terminal we’re using to run Memgraph has the mgconsole client running (that’s the memgraph> prompt you see in the earlier screenshot), meaning we can try running queries directly there. For starters, we can try the ubiquitous “get everything” Cypher query:

memgraph> match (n) return n;

Note: if you get a “mg_raw_transport_send: Broken pipe”, just run the query again and it should reconnect.

This gives us some data back:

Querying Memgraph using mgconsole.

As you can see, this is not great to visualise results. Fortunately, Memgraph has its own web client (similar to Neo4j Browser) called Memgraph Lab, that you can access on http://localhost:3000/:

Memgraph Lab: Quick Connect page.

On the Quick Connect page, click the “Connect now” button. Then, switch to the “Query Execution” page using the left navigation sidebar, and you can run queries and view results more comfortably:

Seeing some nodes in Memgraph Lab.

Unlike Neo4j Browser, Memgraph Lab does not return relationships by default when you run this query. If you want to see them as well, you can run this instead:

match (a)-[r]->(b)
return a, r, b
Nodes and relationships in Memgraph Lab.

If the graph looks too cluttered, just drag the nodes around to rearrange them in a way that is more pleasant.

More Cartography with Memgraph

Cartography is a huge project that gathers data from a variety of data sources including AWS, Azure, GitHub, Okta, and others.

I’ve intentionally only covered the Okta intel module in this article because it’s small in scope and easy to digest. To use Cartography with other data sources, additional effort is required to address other problems with incompatible Cypher queries. For instance, at the time of writing this article, there are at least 9 outstanding issues that need to be fixed before Cartography can be used with Memgraph for AWS (that’s quite impressive considering that the AWS intel module is the biggest). Other intel modules may have other problems that need solving; nobody has explored them with Memgraph yet.

Summary

In this article, I’ve shown how one could go about taking an existing application that depends on Neo4j and migrating it to Memgraph. I’ve used Cartography with its Okta intel module to keep things relatively straightforward. The steps involved include:

  1. Running Memgraph with --bolt-server-name-for-init=Neo4j/
  2. Using the same Bolt-compatible Neo4j client, providing arbitrary Neo4j username and password values
  3. Fixing any incompatible Neo4j client code (in this case, consume()), if applicable
  4. Adjusting any incompatible Cypher queries

Using the Neo4j Bolt Driver for Python with Memgraph

Memgraph is a relatively young graph database. It supports the Cypher query language and the Bolt protocol – just like Neo4j – therefore it is usually possible to use Neo4j client libraries (called “drivers”) with Memgraph. In fact, according to the Memgraph Drivers documentation, using the Neo4j drivers is the recommended way to communicate with Memgraph from several languages like Go, Java and C#.

Memgraph Python Client Libraries

For Python, there are actually a number of options to choose from:

  • pymgclient: I discovered this recently and haven’t used it, but it seems to be a lower-level client that works well.
  • gqlalchemy: Memgraph’s recommended client for Python, which uses pymgclient underneath. Perhaps named after Python ORM sqlalchemy, it provides three different ways to query Memgraph, none of which I found to be very practical:
    • Basic execute() and execute_and_fetch(): this seems simple enough, but I haven’t found any way to pass parameters to queries, making it useless for my use case.
    • OGM: This is a graph equivalent of an ORM. It’s no secret that ORMs are one of the things I avoid like the plague – I’ve already written some of my thoughts on the subject in “ADO .NET Part 1: Introduction“, and time-permitting it will also be the subject of a future article. In a nutshell: I just want to write Cypher queries and execute them, not have to translate them to some library’s arbitrary API.
    • Query builder: A fluent query builder, similar in approach to what Elasticsearch provides for .NET. I’m not a fan for the same reasons that apply to ORMs (see previous point above).
  • Neo4j Bolt Driver for Python: This doesn’t work with Memgraph out of the box, but we’ll talk more about this.

It’s unfortunate that the Neo4j Bolt Driver for Python doesn’t work with Memgraph by default, because if you already have Python code that works with Neo4j, you could otherwise use Memgraph as a drop-in replacement for Neo4j with minimal changes (e.g. fixing incompatible Cypher).

For the rest of this article, I will be focusing on the Neo4j Bolt Driver for Python, to understand why we can’t use it with Memgraph and explain how to get around the problem.

Update 21st November 2022: TL;DR: if you need a quick solution, go to the end of this article.

Why the Neo4j Driver Fails with Memgraph

Let’s make a first attempt to use the Neo4j Bolt Driver for Python with Memgraph.

First, we need to have an instance of Memgraph running. The easiest way is to run it with Docker, e.g. as follows (assuming you’re on Linux):

sudo docker run --rm -it -p 7687:7687 -p 3000:3000 memgraph/memgraph-platform

If this works, it will start a Memgraph shell, and you can also access Memgraph Lab (Memgraph’s web user interface) by visiting http://localhost:3000/.

Memgraph shell after running it with Docker, and Memgraph Lab (UI) open in Firefox in the background.

Next, create a folder for your Python code. Run the following to install the Neo4j driver:

pip3 install neo4j

At the time of writing this article, the version of the Neo4j Python driver is 5.2.1. With earlier versions, it’s possible you might run into errors such as:

neobolt.exceptions.SecurityError: Failed to establish secure connection to ‘[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)’

In this case, update the driver as follows:

pip3 install neo4j --upgrade

At this point, we can steal some example code from the Neo4j Bolt Driver for Python, as follows, and put it in a file called main.py:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687",
                              auth=("neo4j", "password"))

def add_friend(tx, name, friend_name):
    tx.run("MERGE (a:Person {name: $name}) "
           "MERGE (a)-[:KNOWS]->(friend:Person {name: $friend_name})",
           name=name, friend_name=friend_name)

def print_friends(tx, name):
    query = ("MATCH (a:Person)-[:KNOWS]->(friend) WHERE a.name = $name "
             "RETURN friend.name ORDER BY friend.name")
    for record in tx.run(query, name=name):
        print(record["friend.name"])

with driver.session(database="neo4j") as session:
    session.execute_write(add_friend, "Arthur", "Guinevere")
    session.execute_write(add_friend, "Arthur", "Lancelot")
    session.execute_write(add_friend, "Arthur", "Merlin")
    session.execute_read(print_friends, "Arthur")

driver.close()

Once we run this with python3 main.py, we get a nice big error:

$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    session.execute_write(add_friend, "Arthur", "Guinevere")
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 712, in execute_write
    return self._run_transaction(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 484, in _run_transaction
    self._open_transaction(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 396, in _open_transaction
    self._connect(access_mode=access_mode)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/session.py", line 123, in _connect
    super()._connect(access_mode, **access_kwargs)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/work/workspace.py", line 198, in _connect
    self._connection = self._pool.acquire(**acquire_kwargs_)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 778, in acquire
    self.ensure_routing_table_is_fresh(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 721, in ensure_routing_table_is_fresh
    self.update_routing_table(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 648, in update_routing_table
    if self._update_routing_table_from(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 596, in _update_routing_table_from
    new_routing_table = self.fetch_routing_table(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 534, in fetch_routing_table
    new_routing_info = self.fetch_routing_info(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 504, in fetch_routing_info
    cx = self._acquire(address, deadline, None)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 221, in _acquire
    return connection_creator()
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 138, in connection_creator
    connection = self.opener(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_pool.py", line 441, in opener
    return Bolt.open(
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_bolt.py", line 377, in open
    connection.hello()
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_bolt4.py", line 450, in hello
    check_supported_server_product(self.server_info.agent)
  File "/home/daniel/.local/lib/python3.8/site-packages/neo4j/_sync/io/_common.py", line 283, in check_supported_server_product
    raise UnsupportedServerProduct(agent)
neo4j.exceptions.UnsupportedServerProduct: None

The last three lines indicate that the problem seems to be a simple validation, which we can confirm by looking up the offending function in the Neo4j driver’s source code:

def check_supported_server_product(agent):
    """ Checks that a server product is supported by the driver by
    looking at the server agent string.
    :param agent: server agent string to check for validity
    :raises UnsupportedServerProduct: if the product is not supported
    """
    if not agent.startswith("Neo4j/"):
        raise UnsupportedServerProduct(agent)

What would happen if we simply disable this check? Let’s find out.

Tweaking the Neo4j Driver to Work with Memgraph

First, let’s clone the Neo4j driver’s repo:

git clone https://github.com/neo4j/neo4j-python-driver.git

A quick search shows that there are two places where the server product check is done:

There are two equivalent check_supported_server_product() functions in _neo4j/_async/io/_common.py and _neo4j/_sync/io/_common.py.

We can disable the validation by replacing the implementation of each function with just pass:

def check_supported_server_product(agent):
    pass

Next, we build this modified version of the Neo4j driver as follows:

python3 setup.py sdist

This creates a file called neo4j-5.2.dev0.tar.gz in a dist subfolder. Take note of the path of this file.

Back in the folder with our Python test code (where we were attempting to communicate with Memgraph), install the package we just built:

$ pip3 install /home/daniel/Desktop/neo4j-python-driver/dist/neo4j-5.2.dev0.tar.gz
Processing /home/daniel/Desktop/neo4j-python-driver/dist/neo4j-5.2.dev0.tar.gz
Requirement already satisfied: pytz in /home/daniel/.local/lib/python3.8/site-packages (from neo4j==5.2.dev0) (2022.1)
Building wheels for collected packages: neo4j
  Building wheel for neo4j (setup.py) ... done
  Created wheel for neo4j: filename=neo4j-5.2.dev0-py3-none-any.whl size=244857 sha256=ec2951ea1fecf2ae1aacced4d93c66b1b5d90bc3710746ff3814b9b62a96a9af
  Stored in directory: /home/daniel/.cache/pip/wheels/0d/4c/55/2486d65ebf98105bc54a490ebd91cea4ba538268a32ffc91f0
Successfully built neo4j
Installing collected packages: neo4j
  Attempting uninstall: neo4j
    Found existing installation: neo4j 5.2.1
    Uninstalling neo4j-5.2.1:
      Successfully uninstalled neo4j-5.2.1
Successfully installed neo4j-5.2.dev0

Run the Python code again…

$ python3 main.py
Unable to retrieve routing information
Transaction failed and will be retried in 0.9256931081701124s (Unable to retrieve routing information)
Unable to retrieve routing information
Transaction failed and will be retried in 2.0779915720272504s (Unable to retrieve routing information)

We still have a failure, but this is a simple connectivity issue that is easily fixed by changing the scheme in the URI from neo4j to bolt:

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"),)

Running it again, we see that it now works!

$ python3 main.py
Guinevere
Lancelot
Merlin

We can also view the created data in Memgraph Lab to double-check that it really worked:

Querying the data in Memgraph Lab, we see that the example nodes were created.

Conclusion

In this article, we’ve confirmed that, at a basic level, the only thing preventing the Neo4j Bolt Driver for Python from being used with Memgraph is a simple check against a response from the server. We saw that queries could be executed once this check was disabled.

As a result, it’s not clear why Memgraph built their own Python clients instead of simply addressing this check (e.g. by sending the same response as Neo4j, or forking the driver and eliminating the check as I did). I will refrain from speculating on possible reasons, but I found this interesting to investigate and hope it saves time for other people in the same situation.

P.S.: There’s an Easier Way

This section was added on 21st November 2022.

I learned from the Memgraph team that they do provide a way to deal with the server check – it’s just not documented at the time of writing this article. Basically, all you have to do is run Memgraph using a --bolt-server-name-for-init switch that sets the missing server response. So if you run Memgraph in Docker, you’d need to run it as follows:

sudo docker run --rm -it -p 7687:7687 -p 3000:3000 -e MEMGRAPH="--bolt-server-name-for-init=Neo4j/" memgraph/memgraph-platform

If you run the example code with the bolt:// scheme using the unmodified Neo4j Bolt Driver for Python, it should work just as well.

Update 19th September 2023: as of Memgraph v2.11, --bolt-server-name-for-init has a default value compatible with the Neo4j Bolt Driver, and therefore no longer needs to be provided.