Story: https://linear.app/n8n/issue/PAY-926
This PR coordinates workflow activation on instance startup and on
leadership change in the multi-main scenario in the internal API. Part 3,
on manual workflow activation and deactivation, will be a separate PR.
### Part 1: Instance startup
In the multi-main scenario, on starting an instance (see the sketch after this checklist)...
- [x] If the instance is the leader, it should add webhooks, triggers
and pollers.
- [x] If the instance is the follower, it should not add webhooks,
triggers or pollers.
- [x] Unit tests.
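A minimal sketch of the startup branch, assuming a hypothetical `orchestration.isLeader` flag and an `ActiveWorkflowRunner`-style service; the names are illustrative, not the actual internal API:

```ts
// Illustrative only: names and shapes are assumptions, not n8n's internals.
interface OrchestrationService {
  isLeader: boolean;
}

interface ActiveWorkflowRunner {
  // Registers webhooks, triggers and pollers for all active workflows.
  addActiveWorkflows(): Promise<void>;
}

async function initActiveWorkflows(
  orchestration: OrchestrationService,
  runner: ActiveWorkflowRunner,
) {
  if (orchestration.isLeader) {
    // Leader: register webhooks, triggers and pollers on startup.
    await runner.addActiveWorkflows();
  }
  // Follower: skip registration entirely; the leader owns activation.
}
```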
### Part 2: Leadership change
In the multi-main scenario, if the main instance leader dies (see the sketch after this checklist)…
- [x] The new main instance leader must activate all trigger- and
poller-based workflows, excluding webhook-based workflows.
- [x] The old main instance leader must deactivate all trigger- and
poller-based workflows, excluding webhook-based workflows.
- [x] Unit tests.
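A sketch of the leadership-change handlers, again with hypothetical event and method names; the key point is that only trigger- and poller-based workflows are toggled, while webhook registrations are left untouched:

```ts
// Illustrative only: event names and service methods are assumptions.
interface ActiveWorkflowService {
  // Activates/deactivates triggers and pollers only; webhooks stay as-is.
  addAllTriggerAndPollerBasedWorkflows(): Promise<void>;
  removeAllTriggerAndPollerBasedWorkflows(): Promise<void>;
}

function registerLeadershipHooks(
  emitter: { on(event: string, handler: () => Promise<void>): void },
  workflows: ActiveWorkflowService,
) {
  // New leader: take over triggers and pollers.
  emitter.on('leader-takeover', async () => {
    await workflows.addAllTriggerAndPollerBasedWorkflows();
  });

  // Old leader stepping down: release triggers and pollers.
  emitter.on('leader-stepdown', async () => {
    await workflows.removeAllTriggerAndPollerBasedWorkflows();
  });
}
```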
To test, start two instances and check behavior on startup and
leadership change:
```
EXECUTIONS_MODE=queue N8N_LEADER_SELECTION_ENABLED=true N8N_LICENSE_TENANT_ID=... N8N_LICENSE_ACTIVATION_KEY=... N8N_LOG_LEVEL=debug npm run start
EXECUTIONS_MODE=queue N8N_LEADER_SELECTION_ENABLED=true N8N_LICENSE_TENANT_ID=... N8N_LICENSE_ACTIVATION_KEY=... N8N_LOG_LEVEL=debug N8N_PORT=5679 npm run start
```
This PR ensures `MultiMainInstancePublisher` is initialized before
checking whether the instance is the leader or a follower. Followers skip
license init, the license check, and starting and stopping pruning.
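A rough sketch of the intended ordering on startup; the names are assumptions, not the actual code:

```ts
// Illustrative startup ordering only.
interface MultiMainInstancePublisher {
  init(): Promise<void>;
  isLeader: boolean;
}

async function initInstance(
  publisher: MultiMainInstancePublisher,
  license: { init(): Promise<void>; check(): Promise<void> },
  pruning: { start(): void },
) {
  // Initialize the publisher first, so leader/follower status is known
  // before any role-dependent work runs.
  await publisher.init();

  // Followers skip license init/check and never start pruning.
  if (!publisher.isLeader) return;

  await license.init();
  await license.check();
  pruning.start();
}
```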
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <netroy@users.noreply.github.com>
To help debug possible issues in startup and migrations, log the
executed migrations at log level 'info' instead of 'debug'.
Due to a change, during the credentials import command the core's
Credential object is called through its prototype. This caused the
Credential's cipher variable to not be set, so no cipher service was
available during the import. This PR handles that edge case.
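A sketch of the kind of guard this implies, assuming a hypothetical lazily resolved `Cipher` dependency resolved from a typedi-style container (not the actual implementation):

```ts
import { Container, Service } from 'typedi';

// Illustrative placeholder for the cipher service.
@Service()
class Cipher {
  encrypt(data: string): string {
    return data; // placeholder
  }
}

class Credentials {
  private cipher?: Cipher;

  setData(data: string) {
    // Edge case: `this.cipher` is undefined when the method is invoked via
    // the prototype and the constructor never ran; resolve it lazily.
    this.cipher ??= Container.get(Cipher);
    return this.cipher.encrypt(data);
  }
}
```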
https://linear.app/n8n/issue/PAY-933/set-up-leader-selection-for-multiple-main-instances
- [x] Set up new envs
- [x] Add config and license checks
- [x] Implement `MultiMainInstancePublisher`
- [x] Expand `RedisServicePubSubPublisher` to support
`MultiMainInstancePublisher`
- [x] Init `MultiMainInstancePublisher` on startup and destroy on
shutdown
- [x] Add to sandbox plans
- [x] Test manually
Note: This is only for setup (a rough wiring sketch follows below);
coordinating in reaction to leadership changes will come in later PRs.
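A sketch of the init/destroy wiring, assuming a `MultiMainInstancePublisher` built on top of `RedisServicePubSubPublisher`; the method names are illustrative:

```ts
// Illustrative wiring only; actual class shapes may differ.
interface RedisServicePubSubPublisher {
  init(): Promise<void>;
  destroy(): Promise<void>;
}

class MultiMainInstancePublisher {
  constructor(private readonly redisPublisher: RedisServicePubSubPublisher) {}

  async init() {
    // Connect to Redis and start publishing this instance's presence.
    await this.redisPublisher.init();
  }

  async destroy() {
    // Release leadership (if held) and close the Redis connection.
    await this.redisPublisher.destroy();
  }
}

// On startup:  await multiMainInstancePublisher.init();
// On shutdown: await multiMainInstancePublisher.destroy();
```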
---------
Signed-off-by: Oleg Ivaniv <me@olegivaniv.com>
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
This PR allows users to configure Bull's settings, possibly reducing
errors related to `maxStalledCount` and other issues that usually happen
either when a worker crashes or when the event loop is very busy.
Increasing the lease time and the `maxStalledCount` setting might
improve UX.
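For context, these map to Bull's advanced queue settings. A minimal example of the kind of values involved; the numbers are placeholders, not recommended defaults:

```ts
import Bull from 'bull';

// Placeholder values; tune to your workload. A longer lockDuration and a
// higher maxStalledCount make Bull more tolerant of a blocked event loop.
const jobQueue = new Bull('jobs', {
  redis: { host: 'localhost', port: 6379 },
  settings: {
    lockDuration: 60_000, // lease time a worker holds on a job (ms)
    lockRenewTime: 30_000, // how often the lock is renewed (ms)
    stalledInterval: 30_000, // how often stalled jobs are checked (ms)
    maxStalledCount: 3, // times a job may stall before it is failed
  },
});
```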
This PR converts the hard-deletion interval to a timeout:
- to prevent the interval from not being restored when hard deletion
throws, and
- to prevent a long-running hard deletion from leading to duplicate
deletions.
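A sketch of the pattern, assuming a hypothetical `hardDelete` function: the next run is only scheduled after the current one fully completes, and it is rescheduled even if the deletion throws.

```ts
// Illustrative re-scheduling pattern (not the actual pruning service code).
const HARD_DELETION_INTERVAL_MS = 15 * 60 * 1000;

function scheduleHardDeletion(hardDelete: () => Promise<void>) {
  setTimeout(async () => {
    try {
      await hardDelete();
    } catch (error) {
      console.error('Hard deletion failed', error);
    } finally {
      // Reschedule only after this run has finished, so runs never overlap
      // and a thrown error cannot kill the loop.
      scheduleHardDeletion(hardDelete);
    }
  }, HARD_DELETION_INTERVAL_MS);
}
```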
Since we do not store which executions produced binary data, pruning on
S3 requires querying the binary data items for each execution in order
to delete them. To minimize requests to S3, allow the user to skip
pruning requests when they have set a TTL at the bucket level.
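A sketch of what the skip could look like; `skipPruningRequests` is a hypothetical flag, not the real option name:

```ts
// Illustrative only: `skipPruningRequests` stands in for "binary data
// lifecycle is handled by a bucket-level TTL on S3".
async function pruneBinaryDataFor(
  executionIds: string[],
  config: { binaryDataMode: 's3' | 'filesystem'; skipPruningRequests: boolean },
  binaryDataService: { deleteMany(executionIds: string[]): Promise<void> },
) {
  if (config.binaryDataMode === 's3' && config.skipPruningRequests) {
    // The bucket TTL will expire the objects; avoid per-execution S3 requests.
    return;
  }
  await binaryDataService.deleteMany(executionIds);
}
```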
This change ensures that things like `encryptionKey` and `instanceId`
are always available directly where they are needed, instead of passing
them around throughout the code.
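A sketch of the direction, assuming a typedi-style DI container; the class name and fields are illustrative:

```ts
import { Container, Service } from 'typedi';

// Illustrative settings service; the real class and fields may differ.
@Service()
class InstanceSettings {
  readonly instanceId = 'some-generated-id';
  readonly encryptionKey = 'loaded-from-settings-file';
}

// Any consumer resolves the service where it needs it, instead of the values
// being passed as parameters through many layers of the code.
const { encryptionKey, instanceId } = Container.get(InstanceSettings);
```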
This is related to an issue with how Bull handles stalled jobs; see
https://github.com/OptimalBits/bull/issues/1415 for reference.
CPU-intensive workflows can in certain cases take a long while to
finish, thereby blocking the thread and causing the Bull queue to think
the job has stalled, even though it finished successfully. In these
cases the error handling could then overwrite the successful execution
data with the error message.
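A sketch of the kind of guard that prevents the overwrite, with hypothetical repository methods:

```ts
// Illustrative only: do not overwrite an execution that already finished
// successfully when Bull later reports the job as stalled/failed.
interface ExecutionRepository {
  findStatus(executionId: string): Promise<'success' | 'error' | 'running'>;
  markAsFailed(executionId: string, message: string): Promise<void>;
}

async function handleStalledJobError(
  executions: ExecutionRepository,
  executionId: string,
  message: string,
) {
  const status = await executions.findStatus(executionId);
  if (status === 'success') return; // finished before the stall was reported
  await executions.markAsFailed(executionId, message);
}
```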
In a rare edge case an undefined queue could be returned. This should
not happen, so an error is now thrown instead.
This PR also takes the opportunity to remove a cyclic dependency from
the Queue.
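A sketch of the guard, with an assumed accessor name:

```ts
// Illustrative accessor: fail loudly instead of handing back `undefined`.
import type Bull from 'bull';

class ScalingService {
  private queue?: Bull.Queue;

  getQueue(): Bull.Queue {
    if (!this.queue) {
      throw new Error('Queue was requested before it was initialized');
    }
    return this.queue;
  }
}
```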