Troubleshooting
This page covers common GraphAI errors, from configuration and server issues to lesser-known bugs (especially in Celery), along with their solutions. All solutions assume the server runs on Ubuntu; on macOS you can usually swap `systemctl` for `brew services`.
GraphAI uses Celery with Redis as its results backend and RabbitMQ as its message broker. If either of these is not running or is misconfigured, the server will refuse to start.
- Error: `kombu.exceptions.OperationalError: [Errno 61] Connection refused`
- Cause: Most likely, the RabbitMQ service is not running.
- Solution: `systemctl --user start rabbitmq-server`
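If you want to confirm where the problem lies before touching services, a minimal stdlib-only sketch can probe whether anything is listening on the broker port (5672 is RabbitMQ's default; the function name and defaults here are illustrative, not part of GraphAI):

```python
import socket

def broker_reachable(host="localhost", port=5672, timeout=2.0):
    """Return True if something accepts TCP connections on the broker port."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening, which mirrors kombu's [Errno 61]
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while RabbitMQ is supposedly running, check that the broker URL in your configuration points at the right host and port.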
- Error: `/data/venvs/test_venv/lib/python3.11/site-packages/celery/app/trace.py:686: RuntimeWarning: Exception raised outside body: ResponseError('Command # 1 (SETEX celery-task-meta-…) of pipeline caused error: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.')`
- Cause: You probably updated redis-server without restarting its service.
- Solution: `systemctl --user restart redis-server`
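The MISCONF error is governed by the `stop-writes-on-bgsave-error` directive. As a reference sketch (paths assume a default Ubuntu install of redis-server), these are the relevant lines of `redis.conf` to inspect if a restart alone does not clear the error:

```conf
# /etc/redis/redis.conf (default Ubuntu path; adjust for your install)
# With "yes", Redis rejects writes whenever the last RDB snapshot failed:
stop-writes-on-bgsave-error yes
# Snapshot location; check that this directory exists, is writable by the
# redis user, and has free disk space:
dir /var/lib/redis
dbfilename dump.rdb
```

Disabling `stop-writes-on-bgsave-error` only hides the symptom; the Redis logs will say why the snapshot itself is failing (usually permissions or disk space).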
You probably have two tasks with the same name by mistake. Celery registers tasks by name, so a duplicate name overwrites the earlier task's signature; Celery will then misdetect the signature and tell you that you are not providing enough arguments (or that you are providing too many) even though the count is correct. Make sure no two tasks ever share the same name.
It is also possible, if the error is of the form "expected 2 arguments, found 3", that you ARE actually providing too many arguments. Except for the first task in a chain, every task receives the result of the previous task as its first argument. Ignoring this fact will result in the aforementioned error.
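The argument-threading rule can be illustrated with a plain-Python sketch (the task names here are hypothetical; in real code these would be `@app.task`-decorated functions composed with `.s()` signatures):

```python
# Plain-Python sketch of how Celery threads results through a chain.
def preprocess(text):
    # First task in the chain: gets ALL its arguments from the caller.
    return text.lower()

def count_chars(text, label):
    # Later task: its FIRST argument is the previous task's return value;
    # only the remaining arguments come from its signature in the chain.
    return f"{label}: {len(text)}"

# chain(preprocess.s("Hello"), count_chars.s("chars")) behaves like:
result = count_chars(preprocess("Hello"), "chars")
```

Writing `count_chars.s("Hello", "chars")` inside a chain would hand the task three arguments (previous result plus two) and trigger exactly the "expected 2 arguments, found 3" error described above.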
If you have multiple map-reduce steps in your chain, you may run into a very obscure Celery bug. Say you have tasks A (preprocessing), B and D (parallel tasks, i.e. groups), and C and E (their callbacks, i.e. chords), and you want to orchestrate the following chain:

```python
tasks = chain(A.s(), group(B.s(i) for i in range(8)), C.s(), group(D.s(i) for i in range(8)), E.s())
```

This chain will most likely never give you a final result, and when you check for it in Flower (the Celery monitoring interface, found at http://localhost:5555 by default), you will see that the chain vanished after C. This is because each group needs to be preceded by a task that is NOT a chord. The solution? Create a dummy task F that simply takes one input and returns it unchanged, and put it between C and D:

```python
tasks = chain(A.s(), group(B.s(i) for i in range(8)), C.s(), F.s(), group(D.s(i) for i in range(8)), E.s())
```

Your problem should be solved.
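The dummy task F needs no logic at all. A plain-Python sketch (the name `passthrough` is illustrative; in real code it would be decorated with `@app.task` and given a unique name, per the naming rule above):

```python
# Identity "dummy" task to place between a chord callback and the next group.
def passthrough(result):
    # Returns its input unchanged, so the following group receives the
    # chord's result exactly as if it came straight from the callback.
    return result
```

Because it merely forwards its input, inserting it between C and D changes nothing about the data flow; it only breaks the chord-followed-by-group pattern that triggers the bug.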
Sometimes a Celery worker will simply be killed, especially if you're running the API on a server with low resources. In this case, the Workers tab in Flower will show the worker as offline, and if you run `htop` on the server, you will not find the `workerLow` process. You then need to restart the API, although with the default daemons (which we've provided in this Wiki) it will restart automatically.
However, it is possible for a worker to be severed from the rest while still being alive. Since v3.1, Celery has had a "gossip" feature, in which workers send each other heartbeats so that they stay synchronized; sometimes a worker misses a heartbeat and all the others conclude that it is dead. To fix this problem, we have added the `--without-gossip` option to every Celery worker in the `deploy_celery.sh` script. This option disables gossip and ensures that a live worker won't be marked as disconnected, which would otherwise require a complete restart of the API. Be sure to add this option to every one of your Celery daemon files if you're running the API using systemd.
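As a sketch of what that looks like in a systemd daemon file (the unit layout and the `-A graphai` app name are assumptions; only the venv path and the `--without-gossip` flag come from this Wiki), the flag goes on the worker's `ExecStart` line:

```ini
# celery-worker.service (hypothetical unit name) - relevant section only
[Service]
ExecStart=/data/venvs/test_venv/bin/celery -A graphai worker \
    --without-gossip \
    --loglevel=INFO
Restart=always
```

`Restart=always` is what makes a killed worker come back automatically, as described in the previous section.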
Note: If you're running GraphAI on macOS, you may need to remove the `--without-gossip` flag: the issue that occurs on Ubuntu without the flag occurs on macOS with it.