This change kept coming up in #6713, #7773, and #8135.
So this PR moves the existing code without actually changing anything,
to help get rid of some of the circular dependencies.
## Review / Merge checklist
- [x] PR title and summary are descriptive.
## Summary
Add `ShutdownService` and `OnShutdown` decorator for more unified way to
shutdown different components. Use this new way in the following
components:
- HTTP(S) server
- Pruning service
- Push connection
- License
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
## Summary
We accidentally made some functions `async` in
https://github.com/n8n-io/n8n/pull/7846
This PR reverts that change.
## Review / Merge checklist
- [x] PR title and summary are descriptive.
Add generic N8N_GRACEFUL_SHUTDOWN_TIMEOUT which controls how long n8n
process will wait for graceful exit before exitting forcefully. This
variables replaces the QUEUE_WORKER_TIMEOUT variable that was used for
worker process.
DEPRECATED: QUEUE_WORKER_TIMEOUT deprected
QUEUE_WORKER_TIMEOUT environment variable has been replaced with
N8N_GRACEFUL_SHUTDOWN_TIMEOUT.
If the process doesn't shutdown within a time limit, exit with error
code.
1. conceptually something timing out is an error.
2. on successful exit we close down the DB connection gracefully. On an
exit timeout we rather not do that, since it will wait for any active
connections to close and would possible block the exit.
We added ID-less workflow reporting at #8031, which has already produced
multiple reports coming from internal, enough info to tackle [this
story](https://linear.app/n8n/issue/PAY-1147). To prevent an
overwhelming number of reports from cloud, this PR removes the reporting
for now.
When setting up queue mode, it is easy to overlook that not exporting
Postgres env vars will default the worker to use sqlite, which will fail
during execution with a non-obvious error. Hence add warnings when
starting a worker with an incompatible DB type.
We're initializing the queue twice because of a [bad
merge](2c63474538).
No associated known bugs but no need to init the queue twice. We should
follow up by investigating if any pending bugs can be associated to
this.
Ensure all errors in `cli` are `ApplicationError` or children of it and
contain no variables in the message, to continue normalizing all the
errors we report to Sentry
Follow-up to: https://github.com/n8n-io/n8n/pull/7839
Ensure all errors in `cli` inherit from `ApplicationError` to continue
normalizing all the errors we report to Sentry
Follow-up to: https://github.com/n8n-io/n8n/pull/7820
Followup to #7566 | Story: https://linear.app/n8n/issue/PAY-926
### Manual workflow activation and deactivation
In a multi-main scenario, if the user manually activates or deactivates
a workflow, the process (whether leader or follower) that handles the
PATCH request and updates its internal state should send a message into
the command channel, so that all other main processes update their
internal state accordingly:
- Add to `ActiveWorkflows` if activating
- Remove from `ActiveWorkflows` if deactivating
- Remove and re-add to `ActiveWorkflows` if the update did not change
activation status.
After updating their internal state, if activating or deactivating, the
recipient main processes should push a message to all connected
frontends so that these can update their stores and so reflect the value
in the UI.
### Workflow activation errors
On failure to activate a workflow, the main instance should record the
error in Redis - main instances should always pull activation errors
from Redis in a multi-main scenario.
### Leadership change
On leadership change...
- The old leader should stop pruning and the new leader should start
pruning.
- The old leader should remove trigger- and poller-based workflows and
the new leader should add them.
Story: https://linear.app/n8n/issue/PAY-926
This PR coordinates workflow activation on instance startup and on
leadership change in multiple main scenario in the internal API. Part 3
on manual workflow activation and deactivation will be a separate PR.
### Part 1: Instance startup
In multi-main scenario, on starting an instance...
- [x] If the instance is the leader, it should add webhooks, triggers
and pollers.
- [x] If the instance is the follower, it should not add webhooks,
triggers or pollers.
- [x] Unit tests.
### Part 2: Leadership change
In multi-main scenario, if the main instance leader dies…
- [x] The new main instance leader must activate all trigger- and
poller-based workflows, excluding webhook-based workflows.
- [x] The old main instance leader must deactivate all trigger- and
poller-based workflows, excluding webhook-based workflows.
- [x] Unit tests.
To test, start two instances and check behavior on startup and
leadership change:
```
EXECUTIONS_MODE=queue N8N_LEADER_SELECTION_ENABLED=true N8N_LICENSE_TENANT_ID=... N8N_LICENSE_ACTIVATION_KEY=... N8N_LOG_LEVEL=debug npm run start
EXECUTIONS_MODE=queue N8N_LEADER_SELECTION_ENABLED=true N8N_LICENSE_TENANT_ID=... N8N_LICENSE_ACTIVATION_KEY=... N8N_LOG_LEVEL=debug N8N_PORT=5679 npm run start
```
This PR ensures `MultiMainInstancePublisher` is initialized before
checking if the instance is leader or follower. Followers skip license
init, license check, and pruning start and stop.
Due to a change, during the credentials import command, the core's
Credential object is being called through its prototype. This caused the
Credential's cipher variable to not be set, thus no cipher service being
available during import. This fix catches this edge case and provides a
fix.
https://linear.app/n8n/issue/PAY-933/set-up-leader-selection-for-multiple-main-instances
- [x] Set up new envs
- [x] Add config and license checks
- [x] Implement `MultiMainInstancePublisher`
- [x] Expand `RedisServicePubSubPublisher` to support
`MultiMainInstancePublisher`
- [x] Init `MultiMainInstancePublisher` on startup and destroy on
shutdown
- [x] Add to sandbox plans
- [x] Test manually
Note: This is only for setup - coordinating in reaction to leadership
changes will come in later PRs.
This change ensures that things like `encryptionKey` and `instanceId`
are always available directly where they are needed, instead of passing
them around throughout the code.
In a rare edge case an undefined queue could be returned - this should
not happen and now an error is thrown.
Also using the opportunity to remove a cyclic dependency from the Queue.