feat(grpc): add exponential backoff for reconnection attempts#789
feat(grpc): add exponential backoff for reconnection attempts#789Molter73 wants to merge 9 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughCentralized duration parsing added; BackoffConfig introduced and wired into GrpcConfig and CLI; scan_interval parsing moved to Duration; gRPC reconnect loop uses exponential backoff with optional jitter; tests, changelog, and rand dependency updated. ChangesgRPC Exponential Backoff Configuration and Duration Parsing
Sequence Diagram(s)sequenceDiagram
participant GrpcRunLoop as gRPC Run Loop
participant CreateChannel as create_channel()
participant Backoff
GrpcRunLoop->>Backoff: Initialize from BackoffConfig
loop Reconnect attempts
GrpcRunLoop->>CreateChannel: create_channel()
alt Channel creation succeeds
CreateChannel-->>GrpcRunLoop: Ok(Channel)
GrpcRunLoop->>Backoff: reset()
else Channel creation fails
CreateChannel-->>GrpcRunLoop: Err
GrpcRunLoop->>Backoff: next()
Backoff-->>GrpcRunLoop: next_delay (Duration)
GrpcRunLoop->>GrpcRunLoop: sleep(next_delay)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
…d env vars Add BackoffConfig to GrpcConfig with YAML (grpc.backoff.initial, grpc.backoff.max), CLI (--backoff-initial, --backoff-max), and env var (FACT_GRPC_BACKOFF_INITIAL, FACT_GRPC_BACKOFF_MAX) support. Extract yaml_to_duration_secs and parse_duration_secs helpers, reusing them for scan_interval parsing as well. Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
ddb3e74 to
fc69ffe
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #789 +/- ##
==========================================
+ Coverage 27.96% 31.95% +3.99%
==========================================
Files 21 21
Lines 2596 2766 +170
Branches 2596 2766 +170
==========================================
+ Hits 726 884 +158
- Misses 1867 1879 +12
Partials 3 3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
vladbologa
left a comment
There was a problem hiding this comment.
Just some high level feedback, as I don't think I should be reviewing rust code. 😅
Why not use this existing crate?
And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.
You are not the only one in the team thinking this, but I need to start getting people into it because I can't be the only one 🤣
With all the supply chain attacks going on these days I've become quite paranoid and would prefer we keep dependencies to a minimum. I've never heard of that crate, it looks interesting, but for a straightforward backoff implementation that is only used once (which is what I was going for), I didn't even consider looking for another dependency.
Sounds good, I'll modify the defaults and try to add jitter as well (probably make the multiplier and jitter configurable as well). |
Actually, as I went to change the deafults I realized for a component that does runtime security, we probably want to keep the reconnection as fast as possible, otherwise we will start to drop events. I will add jitter and change the multiplier though. |
Add configurable jitter to the exponential backoff using rand crate for uniform distribution over [0, delay]. Jitter is enabled by default and can be toggled via YAML (grpc.backoff.jitter), CLI (--backoff-jitter / --no-backoff-jitter), or env var (FACT_GRPC_NO_BACKOFF_JITTER). Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
Add multiplier field to BackoffConfig (default 1.5x). Configurable via YAML (grpc.backoff.multiplier), CLI (--backoff-multiplier), and env var (FACT_GRPC_BACKOFF_MULTIPLIER). Must be > 1.0. Drops Eq derive from config types in favor of PartialEq to support storing multiplier as f64 directly. Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
fact/src/config/mod.rs (1)
425-433: 💤 Low valueConsider extracting multiplier validation into a shared helper.
The multiplier validation logic (finite, > 1.0) is duplicated between
parse_multiplier(line 42) and this YAML parsing. While functionally correct, extracting a shared validation function would reduce duplication and keep the validation rules in one place.♻️ Optional refactor to reduce duplication
+fn validate_multiplier(mult: f64) -> anyhow::Result<f64> { + if !mult.is_finite() || mult <= 1.0 { + bail!("multiplier must be > 1.0, got {mult}"); + } + Ok(mult) +} + fn parse_multiplier(s: &str) -> anyhow::Result<f64> { let mult = s.parse::<f64>()?; - if !mult.is_finite() || mult <= 1.0 { - bail!("multiplier must be > 1.0, got {mult}"); - } - Ok(mult) + validate_multiplier(mult) }Then simplify the YAML parsing:
"multiplier" => { - let multiplier = match v.as_f64().or_else(|| v.as_i64().map(|v| v as f64)) { - Some(m) if !m.is_finite() || m <= 1.0 => { - bail!("invalid grpc.backoff.multiplier: {v:?}") - } - None => { - bail!("invalid grpc.backoff.multiplier: {v:?}") - } - Some(m) => m, - }; + let Some(m) = v.as_f64().or_else(|| v.as_i64().map(|v| v as f64)) else { + bail!("invalid grpc.backoff.multiplier: {v:?}") + }; + let multiplier = validate_multiplier(m) + .with_context(|| format!("invalid grpc.backoff.multiplier: {v:?}"))?; backoff.multiplier = Some(multiplier); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fact/src/config/mod.rs` around lines 425 - 433, The YAML parsing block duplicates multiplier validation logic; extract the shared validation into a helper (e.g., parse_multiplier or validate_multiplier) and call it from both places (the existing parse_multiplier caller and the YAML parsing code that currently binds `multiplier`). Implement a function that accepts a serde_json/serde_yaml Value or f64 and checks is_finite() and > 1.0, returns Result<f64, Error>, then replace the match in the YAML branch with a call to that helper to eliminate duplication and keep rules centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 607-617: The env var and visibility are inverted for the backoff
jitter flags: move the env attribute from the negative flag to the positive flag
and swap visibility so the positive flag is the public env-controlled option.
Concretely, on the backoff_jitter field (symbol backoff_jitter) add env =
"FACT_GRPC_BACKOFF_JITTER" and remove hide = true (make it exposed), and on
no_backoff_jitter remove the env attribute and add hide = true (hide the
negative flag); keep the overrides_with relationships (backoff_jitter
overrides_with = "no_backoff_jitter" and vice versa) so behavior stays
consistent but follows the established positive-flag env pattern.
---
Nitpick comments:
In `@fact/src/config/mod.rs`:
- Around line 425-433: The YAML parsing block duplicates multiplier validation
logic; extract the shared validation into a helper (e.g., parse_multiplier or
validate_multiplier) and call it from both places (the existing parse_multiplier
caller and the YAML parsing code that currently binds `multiplier`). Implement a
function that accepts a serde_json/serde_yaml Value or f64 and checks
is_finite() and > 1.0, returns Result<f64, Error>, then replace the match in the
YAML branch with a call to that helper to eliminate duplication and keep rules
centralized.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 5206ea5e-dd20-4dac-8c98-2e989a4b7c5b
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Cargo.tomlfact/Cargo.tomlfact/src/config/mod.rsfact/src/config/tests.rsfact/src/output/grpc.rs
✅ Files skipped from review due to trivial changes (1)
- fact/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (2)
- fact/src/output/grpc.rs
- fact/src/config/tests.rs
vladbologa
left a comment
There was a problem hiding this comment.
LGTM! You could make the PR a bit shorter by not exposing the jitter and multiplier configuration (e.g. stackrox doesn't expose them).
| } | ||
|
|
||
| fn next(&mut self) -> Duration { | ||
| let delay = self.current; |
There was a problem hiding this comment.
If self.current == initial and initial > max, the first delay will be > max.
There was a problem hiding this comment.
I'm not 100% sure how I would handle this. To me it is a configuration on error on the user side, the alternatives for handling it would be:
- Reject the configuration and rollback to defaults (or use the previous configuration values if available).
- Arbitrarily pick a different initial or max value that would work correctly.
- Completely crash the program (there is always a nuclear option, even if it gets immediately ruled out).
- Log a warning message and proceed with the provided configuration.
Of those options, I think the last one makes the most sense, it doesn't force a different behavior, it doesn't interfere with regular operations and it notifies the user of a problem on the configuration. WDYT?
There was a problem hiding this comment.
I'm fine with the last option.
There was a problem hiding this comment.
sorry to resurrect a resolved thread, but can't we just set initial to max when we detect this misconfiguration? (in addition to the existing log message)
There was a problem hiding this comment.
We can, but two lines down will simply make it so fact uses max always when initial >= max, so there's not much point in changing the configuration value I think.
There was a problem hiding this comment.
Only for the second delay. The first will still be >max.
Fair, but I like having as much knobs as I can in case something goes wrong. |
| /// Initial backoff delay in seconds for gRPC reconnection | ||
| /// | ||
| /// Default value is 1 second | ||
| #[arg(long, env = "FACT_GRPC_BACKOFF_INITIAL", value_parser = parse_positive_duration_secs)] |
There was a problem hiding this comment.
[nit] I appreciate that this is a bit of a mouthful but FACT_GRPC_BACKOFF_INITIAL_DURATION might make the value clearer. Same for FACT_GRPC_BACKOFF_MAX_DURATION
or s/DURATION/DELAY/ ?
There was a problem hiding this comment.
I think duration is clearer, I've renamed the env vars.
| } | ||
|
|
||
| fn next(&mut self) -> Duration { | ||
| let delay = self.current; |
There was a problem hiding this comment.
sorry to resurrect a resolved thread, but can't we just set initial to max when we detect this misconfiguration? (in addition to the existing log message)
Description
If connection to a gRPC server fails, fact was into a loop trying to reconnect every second which is pretty noisy and not really needed. With this change we add an exponential backoff algorithm that will try to reconnect with some leeway.
In the interest of making it obvious
factis failing to connect on k8s, we probably want to crash the application if we reach the maximum backoff, but there is currently no easy way to do that, so I'll leave it to a follow up.Checklist
Automated testing
If any of these don't apply, please comment below.
Testing Performed
Without this change, reconnection attempts every second
With this change, exponential attempts
Summary by CodeRabbit
New Features
Improvements
scan_intervaland backoff durations now parse directly intoDurationwith stricter validation and clearer error handling.Tests
scan_intervaland gRPC backoff.Documentation