FIM, Auditing, and Configuration Compliance in Active-Active and Active-Passive Designs

2025-04-22

In a lot of environments I walk into, clustering is treated as a resilience problem - but not always as a security one. We're busy designing for uptime, failover, scale, etc... but we don’t always design for consistency of trust across the cluster (especially for on premise deployments where there's a lot more variables at play in my experience). In doing so we, we expose ourselves to risk

The Misconception

There’s a common assumption:

Active-active clusters: “everything is live, so everything must be secured”
Active-passive clusters: “only the active node matters right now”

That second assumption is where risk creeps in because, from a security standpoint, a passive node is not inactive - it’s just "waiting". Waiting to take over your production traffic, to execute the same critical workloads you were running before seamlessly, to become the new “trusted” state (a term we use a lot with Tripwire Enterprise and change monitoring.

That why if your passive infrastructure is not continuously monitored and validated (consistently and alongside your active nodes), you’ve effectively built a pre-positioned attack vector into your architecture (and one that will often expose you to the most risk when you're already potentially fighting another fire that started a failover too).

The Real Risk: Drift + Failover = Exposure

In real terms, you want to consider what happens during a failover to justify your monitoring stance:

The Scenario: Active-Passive Cluster Failover

Node A (active) is hardened, monitored, compliant
Node B (passive) is:
- Missing patches
- Has config drift
- Has unauthorised changes (or worse, persistence mechanisms)

Everything looks fine… until failover.

Now Node B becomes active.

And suddenly:

Your compliance posture changes instantly
Your monitoring baselines are invalid
Potentially compromised state becomes production reality

This is not theoretical - this is a common failure mode I see a lot of the time in the real world.

Why FIM and Auditing Must Be Cluster-Wide

1. Trust Must Be Symmetrical

Clusters are built on the assumption of equivalence between nodes.

If Node B cannot be trusted to behave identically to Node A:

Your failover is not safe
Your resilience is compromised
Your security model is inconsistent

FIM (File Integrity Monitoring) ensures:

Critical binaries, configs, and system files remain consistent
Drift is detected before failover happens

2. Drift Doesn’t Care About Node State

Configuration drift occurs regardless of:

Node role (active/passive)
Workload assignment
Traffic patterns

Common causes:

Manual changes during maintenance
Automation gaps
Patch inconsistencies
“temporary” fixes that become permanent

Without continuous monitoring:

Passive nodes silently diverge
Baselines become meaningless

3. Attackers Target the Quiet Parts of Your Estate

If I were attacking your environment, I wouldn’t go for the most monitored system.

I’d go for:

The standby node
The DR environment
The “not currently in use” infrastructure

Why?

Because:

It’s often less monitored
Changes go unnoticed
It provides a clean pivot point into production

A compromised passive node is effectively a time-delayed breach. And it's not just me - compromised backup infrastructure has become an increasingly common attack vector and we need to remember that the backup of your crown jewels is, in effect, just a set of crown jewels.

4. Compliance Doesn’t Pause for Failover

From an audit and regulatory standpoint:

There is no distinction between:

“active system”
“standby system”

If it can process production workloads, it must:

Meet configuration standards
Be continuously audited
Provide evidential integrity

Otherwise, you end up in a position where:

You pass audits during normal operation
You fail them the moment failover occurs

Active-Active Isn’t Immune Either

It’s tempting to assume active-active setups are safer because everything is “live”.

But the same issues apply:

Nodes can still drift independently
Load distribution can mask inconsistent states
Partial compromise can spread laterally

Without FIM and configuration compliance:

You lose the guarantee that nodes are equivalent
You risk inconsistent behaviour under load
You introduce non-deterministic failure modes

What “Good” Looks Like

A properly secured cluster treats every node as production-ready at all times.

Minimum baseline:

FIM applied to all nodes
- Critical OS paths
- Application configs
- Cluster configuration files
Continuous auditing
- Not scheduled “once a day” checks
- Near real-time or frequent validation
Configuration compliance enforcement
- CIS / internal hardening baselines
- Drift detection + remediation workflows
Consistent policy application
- No “reduced monitoring” for passive nodes
- No exceptions for DR environments

The Design Principle to Take Away

If a node can ever become active, it must be treated as active all the time.

That means:

Monitored
Audited
Validated
Trusted

Anything less introduces a gap between:

Resilience architecture
Security architecture

And that gap is exactly where failures - and breaches - tend to occur.

Final Thought

Clustering is about removing single points of failure.

But if your security controls only apply to part of the cluster, you’ve just moved the single point of failure somewhere less visible.

And that’s significantly worse.

Because it won’t fail loudly.

It will fail quietly - right at the moment you need it most.