At https://github.com/n8n-io/n8n/pull/8213 we introduced Redis hashes
for workflow ownership and manual webhooks...
- to remove clutter from multiple related keys at the top level,
- to improve performance by preventing serializing-deserializing, and
- to guarantee atomicity during concurrent updates in multi-main setup.
Workflow activation errors can also benefit from this. Added test
coverage as well.
To test manually, create a workflow with a trigger with an invalid
credential, edit the workflow's `active` column to `true`, and restart.
The activation error should show as a red triangle on canvas and in the
workflow list.
Story: https://linear.app/n8n/issue/PAY-1188
- Implement Redis hashes on the caching service, based on Micha's work
in #7747, adapted from `node-cache-manager-ioredis-yet`. Optimize
workflow ownership lookups and manual webhook lookups with Redis hashes.
- Simplify the caching service by removing all currently unused methods
and options: `enable`, `disable`, `getCache`, `keys`, `keyValues`,
`refreshFunctionEach`, `refreshFunctionMany`, `refreshTtl`, etc.
- Remove the flag `N8N_CACHE_ENABLED`. Currently some features on
`master` are broken with caching disabled, and test webhooks now rely
entirely on caching, for multi-main setup support. We originally
introduced this flag to protect against excessive memory usage, but
total cache usage is low enough that we decided to drop this setting.
Apparently this flag was also never documented.
- Overall caching service refactor: use generics, reduce branching, add
discriminants for cache kinds for better type safety, type caching
events, improve readability, remove outdated docs, etc. Also refactor
and expand caching service tests.
Follow-up to: https://github.com/n8n-io/n8n/pull/8176
---------
Co-authored-by: Michael Auerswald <michael.auerswald@gmail.com>
In a multi-main setup, we have the following issue. The user's client
connects to main A and runs a test webhook, so main A starts listening
for a webhook call. A third-party service sends a request to the test
webhook URL. The request is forwarded by the load balancer to main B,
who is not listening for this webhook call. Therefore, the webhook call
is unhandled.
To start addressing this, cache test webhook registrations, using Redis
for queue mode and in-memory for regular mode. When the third-party
service sends a request to the test webhook URL, the request is
forwarded by the load balancer to main B, who fetches test webhooks from
the cache and, if it finds a match, executes the test webhook. This
should be transparent - test webhook behavior should remain the same as
so far.
Notes:
- Test webhook timeouts are not cached. A timeout is only relevant to
the process it was created in, so another process retrieving from Redis
a "foreign" timeout will be unable to act on it. A timeout also has
circular references, so `cache-manager-ioredis-yet` is unable to
serialize it.
- In a single-main scenario, the timeout remains in the single process
and is cleared on test webhook expiration, successful execution, and
manual cancellation - all as usual.
- In a multi-main scenario, we will need to have the process who
received the webhook call send a message to the process who created the
webhook directing this originating process to clear the timeout. This
will likely be implemented via execution lifecycle hooks and Redis
channel messages checking session ID. This implementation is out of
scope for this PR and will come next.
- Additional data in test webhooks is not cached. From what I can tell,
additional data is not needed for test webhooks to be executed.
Additional data also has circular references, so
`cache-manager-ioredis-yet` is unable to serialize it.
Follow-up to: #8155
also refactored the code to
1. stop passing around `scope === 'global'`, since this code can be used
only for changing globalRole.
2. leak less details when input validation fails.
## Review / Merge checklist
- [x] PR title and summary are descriptive
- [x] Tests included
## Summary
This PR continues refactoring webhooks code for better modularity.
Continued from #8069 to bring back `ActiveWebhooks`, but this time
actually handling active webhook calls in this class.
## Review / Merge checklist
- [x] PR title and summary are descriptive
This PR simplifies state in test webhooks so that it can be cached
easily. Caching this state will allow us to start using Redis for manual
webhooks, to support manual webhooks to work in multi-main setup.
- [x] Convert `workflowWebhooks` to a getter - no need to optimize for
deactivation
- [x] Remove array from value in `TestWebhooks.webhookUrls`
- [x] Consolidate `webhookUrls` and `registeredWebhooks`
- Move webhook, poller and trigger activation logs closer to activation
event
- Enrich response of `/debug/multi-main-setup`
- Ensure workflow updates broadcast activation state changes only if
state changed
- Fix bug on workflow activation after leadership change
- Ensure debug controller is not available in production
---------
Co-authored-by: Omar Ajoue <krynble@gmail.com>
## Summary
Add `ShutdownService` and `OnShutdown` decorator for more unified way to
shutdown different components. Use this new way in the following
components:
- HTTP(S) server
- Pruning service
- Push connection
- License
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
In the case of a filesystem failure to rename the binary files as part
of the execution's cleanup process, the execution would fail to be saved
and would never finish. This catch prevents it.
## Summary
Whenever an execution is wrapping u to save the data, if it uses binary
data n8n will try to find possibly misallocated files and place them in
the right folder. If this process fails, the execution fails to finish.
Given the execution has already finished at this point, and we cannot
handle the binary data errors more gracefully, all we can do at this
point is log the message as it's a filesystem issue. The rest of the
execution saving process should remain as normal.
## Related tickets and issues
https://linear.app/n8n/issue/HELP-430
## Review / Merge checklist
- [ ] PR title and summary are descriptive. **Remember, the title
automatically goes into the changelog. Use `(no-changelog)` otherwise.**
([conventions](https://github.com/n8n-io/n8n/blob/master/.github/pull_request_title_conventions.md))
- [ ] [Docs updated](https://github.com/n8n-io/n8n-docs) or follow-up
ticket created.
- [ ] Tests included.
> A bug is not considered fixed, unless a test is added to prevent it
from happening again.
> A feature is not complete without tests.
---------
Co-authored-by: Iván Ovejero <ivov.src@gmail.com>
Remove duplication, improve readability, and expand tests for
`TestWebhooks.ts` - in anticipation for storing test webhooks in Redis.
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
## Summary
We accidentally made some functions `async` in
https://github.com/n8n-io/n8n/pull/7846
This PR reverts that change.
## Review / Merge checklist
- [x] PR title and summary are descriptive.
Refactor static workflow service classes into DI-compatible classes
Context: https://n8nio.slack.com/archives/C069HS026UF/p1702466571648889
Up next:
- Inject dependencies into workflow services
- Consolidate workflow controllers into one
- Make workflow controller injectable
- Inject dependencies into workflow controller
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
We're initializing the queue twice because of a [bad
merge](2c63474538).
No associated known bugs but no need to init the queue twice. We should
follow up by investigating if any pending bugs can be associated to
this.
Github issue / Community forum post (link here to close automatically):
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
Co-authored-by: Giulio Andreini <andreini@netseven.it>
When performing actions such as renaming a workflow or updating its
settings, n8n errors with "Failed to save workflow version" in the
console although the saving process was successful. We are now correctly
checking whether `nodes` and `connections` exist and only then save a
snapshot.
Github issue / Community forum post (link here to close automatically):
Saving execution data is one of the slowest DB operations in the
application, and is likely behind some of the sqlite transaction
concurrency issues we've been seeing.
This not only remove the 2 separate transactions for saving
`ExecutionEntity` and `ExecutionData`, but also remove fields from
`ExecutionData.workflowData` that don't need to be saved (like `tags`,
`shared`, `statistics`, `triggerCount`, etc).
## Summary
Deduplicate, separate, organize and speed up tests for subworkflow
caller policy checks.
Follow-up to: https://github.com/n8n-io/n8n/pull/7913
```
PASS test/unit/PermissionChecker.test.ts
check()
✓ should allow if workflow has no creds (3 ms)
✓ should allow if requesting user is instance owner (83 ms)
✓ should allow if workflow creds are valid subset (151 ms)
✓ should deny if workflow creds are not valid subset (85 ms)
checkSubworkflowExecutePolicy()
no caller policy
✓ should fall back to N8N_WORKFLOW_CALLER_POLICY_DEFAULT_OPTION (1 ms)
overridden caller policy
✓ if no sharing, policy becomes workflows-from-same-owner (1 ms)
workflows-from-list caller policy
✓ should allow if caller list contains parent workflow ID
✓ should deny if caller list does not contain parent workflow ID (1 ms)
any caller policy
✓ should not throw
workflows-from-same-owner caller policy
✓ should deny if the two workflows are owned by different users (1 ms)
✓ should allow if both workflows are owned by the same user
```
...
#### How to test the change:
1. ...
## Issues fixed
Include links to Github issue or Community forum post or **Linear
ticket**:
> Important in order to close automatically and provide context to
reviewers
...
## Review / Merge checklist
- [ ] PR title and summary are descriptive. **Remember, the title
automatically goes into the changelog. Use `(no-changelog)` otherwise.**
([conventions](https://github.com/n8n-io/n8n/blob/master/.github/pull_request_title_conventions.md))
- [ ] [Docs updated](https://github.com/n8n-io/n8n-docs) or follow-up
ticket created.
- [ ] Tests included.
> A bug is not considered fixed, unless a test is added to prevent it
from happening again. A feature is not complete without tests.
>
> *(internal)* You can use Slack commands to trigger [e2e
tests](https://www.notion.so/n8n/How-to-use-Test-Instances-d65f49dfc51f441ea44367fb6f67eb0a?pvs=4#a39f9e5ba64a48b58a71d81c837e8227)
or [deploy test
instance](https://www.notion.so/n8n/How-to-use-Test-Instances-d65f49dfc51f441ea44367fb6f67eb0a?pvs=4#f6a177d32bde4b57ae2da0b8e454bfce)
or [deploy early access version on
Cloud](https://www.notion.so/n8n/Cloudbot-3dbe779836004972b7057bc989526998?pvs=4#fef2d36ab02247e1a0f65a74f6fb534e).
## Summary
Ensure `ownedBy` and `sharedWith` are present and uniform for
credentials and workflows.
Details in story: https://linear.app/n8n/issue/PAY-987
Ensure all errors in `cli` are `ApplicationError` or children of it and
contain no variables in the message, to continue normalizing all the
errors we report to Sentry
Follow-up to: https://github.com/n8n-io/n8n/pull/7839
Ensure all errors in `cli` inherit from `ApplicationError` to continue
normalizing all the errors we report to Sentry
Follow-up to: https://github.com/n8n-io/n8n/pull/7820
This PR continues the effort of moving logic inside execution lifecycle
hooks into standalone testable functions, as a stepping stone to
refactoring the hooks themselves.
https://linear.app/n8n/issue/PAY-985
```
PATCH /users/:id/role
unauthenticated user
✓ should receive 401 (349 ms)
member
✓ should fail to demote owner to member (349 ms)
✓ should fail to demote owner to admin (359 ms)
✓ should fail to demote admin to member (381 ms)
✓ should fail to promote other member to owner (353 ms)
✓ should fail to promote other member to admin (377 ms)
✓ should fail to promote self to admin (354 ms)
✓ should fail to promote self to owner (371 ms)
admin
✓ should receive 400 on invalid payload (351 ms)
✓ should receive 404 on unknown target user (351 ms)
✓ should fail to demote owner to admin (349 ms)
✓ should fail to demote owner to member (347 ms)
✓ should fail to promote member to owner (384 ms)
✓ should fail to promote admin to owner (350 ms)
✓ should be able to demote admin to member (354 ms)
✓ should be able to demote self to member (350 ms)
✓ should be able to promote member to admin (349 ms)
owner
✓ should be able to promote member to admin (349 ms)
✓ should be able to demote admin to member (349 ms)
✓ should fail to demote self to admin (348 ms)
✓ should fail to demote self to member (354 ms)
```
This PR introduces the following changes:
- New Vue stores: `collaborationStore` and `pushConnectionStore`
- Front-end push connection handling overhaul: Keep only a singe
connection open and handle it from the new store
- Add user avatars in the editor header when there are multiple users
working on the same workflow
- Sending a heartbeat event to back-end service periodically to confirm
user is still active
- Back-end overhauls (authored by @tomi):
- Implementing a cleanup procedure that removes inactive users
- Refactoring collaboration service current implementation
---------
Co-authored-by: Tomi Turtiainen <10324676+tomi@users.noreply.github.com>
Validate first and last names before saving them to database. This
should prevent security issue with un-sanitized data that ends up in
emails.
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
Followup to #7566 | Story: https://linear.app/n8n/issue/PAY-926
### Manual workflow activation and deactivation
In a multi-main scenario, if the user manually activates or deactivates
a workflow, the process (whether leader or follower) that handles the
PATCH request and updates its internal state should send a message into
the command channel, so that all other main processes update their
internal state accordingly:
- Add to `ActiveWorkflows` if activating
- Remove from `ActiveWorkflows` if deactivating
- Remove and re-add to `ActiveWorkflows` if the update did not change
activation status.
After updating their internal state, if activating or deactivating, the
recipient main processes should push a message to all connected
frontends so that these can update their stores and so reflect the value
in the UI.
### Workflow activation errors
On failure to activate a workflow, the main instance should record the
error in Redis - main instances should always pull activation errors
from Redis in a multi-main scenario.
### Leadership change
On leadership change...
- The old leader should stop pruning and the new leader should start
pruning.
- The old leader should remove trigger- and poller-based workflows and
the new leader should add them.
This PR:
- Creates `InvitationController`
- Moves `POST /users` to `POST /invitations` and move related test to
`invitations.api.tests`
- Moves `POST /users/:id` to `POST /invitations/:id/accept` and move
related test to `invitations.api.tests`
- Adjusts FE to use new endpoints
- Moves all the invitation logic to the `UserService`
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
Groundwork to be able to safely refactor and move the invitation logic
to the UserService.
Fixes ADO-1358
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
- Enable two-way communication with web sockets
- Enable sending push messages to specific users
- Add collaboration service for managing active users for workflow
Missing things:
- State is currently kept only in memory, making this not work in
multi-master setups
- Removing a user from active users in situations where they go inactive
or we miss the "workflow closed" message
- I think a timer based solution for this would cover most edge cases.
I.e. have FE ping every X minutes, BE removes the user unless they have
received a ping in Y minutes, where Y > X
- FE changes to be added later by @MiloradFilipovic
Github issue / Community forum post (link here to close automatically):
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
Story: https://linear.app/n8n/issue/PAY-926
This PR coordinates workflow activation on instance startup and on
leadership change in multiple main scenario in the internal API. Part 3
on manual workflow activation and deactivation will be a separate PR.
### Part 1: Instance startup
In multi-main scenario, on starting an instance...
- [x] If the instance is the leader, it should add webhooks, triggers
and pollers.
- [x] If the instance is the follower, it should not add webhooks,
triggers or pollers.
- [x] Unit tests.
### Part 2: Leadership change
In multi-main scenario, if the main instance leader dies…
- [x] The new main instance leader must activate all trigger- and
poller-based workflows, excluding webhook-based workflows.
- [x] The old main instance leader must deactivate all trigger- and
poller-based workflows, excluding webhook-based workflows.
- [x] Unit tests.
To test, start two instances and check behavior on startup and
leadership change:
```
EXECUTIONS_MODE=queue N8N_LEADER_SELECTION_ENABLED=true N8N_LICENSE_TENANT_ID=... N8N_LICENSE_ACTIVATION_KEY=... N8N_LOG_LEVEL=debug npm run start
EXECUTIONS_MODE=queue N8N_LEADER_SELECTION_ENABLED=true N8N_LICENSE_TENANT_ID=... N8N_LICENSE_ACTIVATION_KEY=... N8N_LOG_LEVEL=debug N8N_PORT=5679 npm run start
```
Due to a change, during the credentials import command, the core's
Credential object is being called through its prototype. This caused the
Credential's cipher variable to not be set, thus no cipher service being
available during import. This fix catches this edge case and provides a
fix.
https://linear.app/n8n/issue/PAY-933/set-up-leader-selection-for-multiple-main-instances
- [x] Set up new envs
- [x] Add config and license checks
- [x] Implement `MultiMainInstancePublisher`
- [x] Expand `RedisServicePubSubPublisher` to support
`MultiMainInstancePublisher`
- [x] Init `MultiMainInstancePublisher` on startup and destroy on
shutdown
- [x] Add to sandbox plans
- [x] Test manually
Note: This is only for setup - coordinating in reaction to leadership
changes will come in later PRs.
This PR converts the hard-deletion interval to a timeout:
- to prevent the interval from not being restored when hard deletion
throws, and
- to prevent a long-running hard deletion from leading to duplicate
deletions.
This change ensures that things like `encryptionKey` and `instanceId`
are always available directly where they are needed, instead of passing
them around throughout the code.
This is related to an issue with how Bull handles stalled jobs, see
https://github.com/OptimalBits/bull/issues/1415 for reference.
CPU intensive workflows can in certain cases take a long while to finish
up, thereby blocking the thread and causing Bull queue to think the job
has stalled, even though it finished successfully. In these cases the
error handling could then overwrite the successful execution data with
the error message.
This fixes a bug in the pruning (soft-delete). The pruning was a bit too
aggressive, as it also pruned executions that weren't in an end state
yet. This only becomes an issue if there are long-running executions
(e.g. workflow with Wait node) or the prune parameters are set to keep
only a tiny number of executions.
This PR adds a message for queue mode which triggers an external secrets
provider reload inside the workers if the configuration has changed on
the main instance.
It also refactors some of the message handler code to remove cyclic
dependencies, as well as remove unnecessary duplicate redis clients
inside services (thanks to no more cyclic deps)