SabFlow · Experimentation

Stop guessing which template wins

A/B Testing turns any branch in a flow into a statistically rigorous experiment. Split traffic across two, three, or up to five variants — different templates, different timing, different AI prompts, different paths entirely — and let the engine declare the winner. No spreadsheet, no p-value debates, no surprise regressions.

  • Up to 5 variants per experiment
  • Built-in significance and confidence
  • Auto-promote winner on threshold
  • Multi-armed bandit mode for traffic shifting

The problem

The "we tried two versions" experiment that proved nothing

Most teams who claim they A/B test do not actually A/B test. They send template A to one segment on Monday and template B to a different segment on Friday, declare A "won" because more people replied, and ship A everywhere. They did not control for segment composition, day-of-week effects, sample size, or random variance. They probably also did not pre-register their success metric, so they retro-fitted whichever number made A look good.

The other failure mode is the team that does run a proper split, but on a sample so small that the result is statistically meaningless. They see template A at 7.2% reply rate and template B at 6.8%, declare a 6% lift, and roll out — when the 95% confidence interval on that lift is something like [-15%, +28%]. The experiment told them nothing; they just made it feel scientific.

A/B Testing in SabFlow does the math for you. You set a primary metric (reply rate, conversion, click-through, time-to-resolution), the engine splits traffic uniformly, tracks the metric per variant, computes confidence intervals, and tells you when you have a significant result. If you ship a variant before significance, the system warns you. If you let it run, it auto-promotes the winner when the threshold is hit.

What it is

A/B Testing, in depth.

An experiment in SabFlow is declared at a Branch node. You assign weights to each outbound arm — 50/50 for a two-way split, 33/33/34 for three-way, or a skewed split (90/10) when you want to test a risky variant on a small slice. The engine deterministically hashes the contact ID to an arm, so the same contact always sees the same variant across re-runs and replays. This consistency matters: a customer who gets different messages on different days because of random reassignment contaminates the experiment.
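
To make the bucketing concrete, here is a minimal sketch of deterministic weighted assignment — purely illustrative, with hypothetical names, not the engine's exact implementation:

```python
import hashlib

def assign_arm(contact_id: str, experiment_id: str, weights: dict[str, float]) -> str:
    """Map a contact to an arm using a hash of (contact_id, experiment_id).
    The same inputs always produce the same arm, so re-runs and replays
    keep the original assignment."""
    digest = hashlib.sha256(f"{contact_id}:{experiment_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 2**32   # uniform value in [0, 1)
    cumulative = 0.0
    for arm, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return arm
    return arm  # guard against floating-point rounding on the last arm

# A 90/10 split: the same contact lands in the same arm on every run.
print(assign_arm("contact_42", "exp_cart_recovery", {"control": 0.9, "challenger": 0.1}))
```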

Every experiment has a primary metric and optional guardrail metrics. Primary might be "WhatsApp message reply within 24 hours" or "Shopify order placed within 7 days" — anything observable in the platform. Guardrails are metrics you do not want to regress: "unsubscribes per 1,000 sent", "agent CSAT", "complaint rate". The engine reports both. A variant that wins the primary but spikes a guardrail is flagged, not auto-promoted.

Statistical analysis is built in. Continuous metrics get a two-sided Welch's t-test; conversion rates get a Bayesian beta-binomial model — both at configurable confidence (90%, 95%, 99%). The dashboard shows the observed lift, the confidence interval, and a power estimate — how many more contacts you need to detect a target lift if the current sample is too small. No more p-value cargo culting.
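
As a rough illustration of the beta-binomial approach for conversion metrics — the priors, draw count and function names here are assumptions, not the engine's exact model — a posterior lift interval can be computed like this:

```python
import numpy as np

def lift_interval(conv_a, n_a, conv_b, n_b, confidence=0.95, draws=100_000, seed=0):
    """Posterior on the relative lift of B over A for a conversion metric,
    using independent Beta(1, 1) priors (a beta-binomial model)."""
    rng = np.random.default_rng(seed)
    rate_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    rate_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    lift = rate_b / rate_a - 1.0
    lo, hi = np.percentile(lift, [(1 - confidence) / 2 * 100, (1 + confidence) / 2 * 100])
    prob_b_better = float((rate_b > rate_a).mean())
    return lift.mean(), (lo, hi), prob_b_better

# The 7.2% vs 6.8% example from earlier, at 1,000 contacts per arm: the
# interval straddles zero by a wide margin -- the sample is too small.
mean_lift, (lo, hi), p_better = lift_interval(68, 1000, 72, 1000)
print(f"lift {mean_lift:+.1%} [95% CI: {lo:+.1%}, {hi:+.1%}], P(B>A) = {p_better:.2f}")
```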

For traffic that needs to adapt continuously rather than wait for significance, switch to multi-armed bandit mode. Thompson sampling allocates more traffic to the better-performing arm in real time while still exploring the laggards. This is the right mode for high-volume, low-stakes experiments — broadcast subject lines, button copy, AI prompt variants — where you want to start capturing the lift the moment it emerges.
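
A minimal sketch of the idea behind Thompson sampling on a conversion-style metric — illustrative only, not the production allocator:

```python
import random

def thompson_pick(arms: dict[str, tuple[int, int]]) -> str:
    """Pick the next arm to serve. `arms` maps arm name -> (successes, failures).
    Each arm's rate is sampled from its Beta posterior; the arm with the highest
    draw gets the next contact, so traffic drifts toward the leader while weaker
    arms are still explored occasionally."""
    draws = {
        arm: random.betavariate(1 + wins, 1 + losses)
        for arm, (wins, losses) in arms.items()
    }
    return max(draws, key=draws.get)

# After 400 sends, "6pm" is ahead, so it wins most draws -- but not all of them.
arms = {"9am": (20, 80), "12pm": (25, 75), "6pm": (40, 60), "9pm": (35, 65)}
print(thompson_pick(arms))
```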

Capabilities

Everything you get with A/B Testing.

7 capabilities
01

Up to 5 variants per branch

Configure 2-5 arms with custom traffic weights. Each arm is its own subgraph of nodes — so variants can differ in template, timing, AI prompt, or entire downstream paths. Not just "different text" but "different journey".

02

Deterministic contact bucketing

Contacts are bucketed via a hash of (contact_id, experiment_id) so the same person always sees the same variant. Re-runs, replays and resumes all respect the original assignment. No bucket leakage when you debug or retry.

03

Primary + guardrail metrics

Pick one primary metric the experiment optimises for, plus any number of guardrails it must not regress. Reply rate as primary, unsubscribe rate as guardrail. Auto-promotion blocks if guardrails degrade beyond the threshold you set.

04

Statistical significance built in

Welch's t-test for means, Bayesian beta-binomial for conversion rates, 90/95/99% confidence configurable. Confidence intervals reported alongside the point estimate. "Variant B lifted reply rate by 14% [95% CI: 6%, 22%]" — not just "B looks better".

05

Multi-armed bandit mode

Switch from fixed-split to Thompson sampling. Traffic shifts toward the winning arm in real time while still exploring others. Right for broadcast subjects, button copy, AI prompts — anything where lost traffic is the cost of waiting.

06

Auto-promote winner

Set a confidence threshold and a minimum sample size. When the engine hits both, it promotes the winning variant to 100% traffic and freezes the others. You get a Slack alert with the final report. No more "we forgot the experiment was running".

07

Cross-experiment guardrails

Tenant-wide guardrails catch regressions across all running experiments. If overall opt-out rate spikes 30% in a 24-hour window, all auto-promotions pause until ops investigates. Protects the brand from slow, collective drift.

Use cases

Built for the way teams actually work.

D2C
Case 01

Template copy split for D2C launch

Three product-launch templates: emoji-heavy, plain text, story-led. 33/33/34 split across 50,000 contacts. Primary metric: click-to-product-page. Guardrail: unsubscribe rate. Story-led wins by 22% with significance at day 3. Auto-promoted, rolled to remaining list.

E-commerce
Case 02

Send-time optimisation

Four arms: 9am, 12pm, 6pm, 9pm in contact local time. Primary metric: reply within 1 hour. Bandit mode shifts traffic to 6pm and 9pm within a week. Reveals that the standard "send at 10am" advice was costing the brand 18% in engagement.

SaaS
Case 03

AI prompt experimentation

Two AI Generate nodes with different system prompts — one concise, one consultative. Primary metric: contact resolves issue without escalation. 50/50 split, 2,000 conversations. Consultative wins on resolution but loses on response latency; ops promotes consultative with a latency optimisation.

E-commerce
Case 04

Discount level for cart recovery

Three arms: no discount, 5% off, 10% off. Primary: cart recovered. Guardrail: revenue per recovery. 5% recovers 16% of carts at full margin; 10% recovers 19% but cuts net revenue per recovery by 8%. Ops picks 5% based on the joint view, not the headline.

EdTech
Case 05

Education program nurture cadence

Two arms: 3-message nurture vs 7-message nurture over two weeks. Primary: enrolment. Guardrail: opt-out rate. 7-message wins enrolment by 11% with opt-out unchanged. The team had assumed more messages would hurt — the data overruled the assumption.

How it works

From signup to first send in minutes.

A/B Testing is included on every SabNode workspace. No separate billing, no extra setup — flip it on from your workspace settings.

  1. Add a Branch with variants

    Insert a Branch node, switch it to A/B mode, declare 2-5 variants with weights. Each variant is its own subgraph you build out independently.

  2. Pick a primary metric

    Select from built-in metrics (reply, click, order, custom event) or define a custom one via a downstream Goal node. Add guardrails the experiment must not regress.

  3. Set confidence and stopping rules

    Configure confidence (95% default), minimum sample size, max runtime, and auto-promotion behaviour. Or run in bandit mode for continuous traffic shifting. A rough sketch of these settings follows the step list.

  4. Ship and monitor

    Publish the flow. The experiment dashboard updates in real time — variant share, observed metric, confidence interval, power. Drill into any contact to see their assigned arm.

  5. Promote or iterate

    When the engine hits significance, auto-promote or review and promote manually. Archive the experiment with the full report stored for audit. Spin up the next test.
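
Taken together, the five steps amount to a declaration roughly like the one below. Field names and values are illustrative assumptions, not SabFlow's actual schema:

```python
# Purely illustrative sketch -- not SabFlow's real configuration format.
experiment = {
    "branch_node": "launch_template_split",           # step 1: the Branch in A/B mode
    "variants": {"emoji": 0.33, "plain": 0.33, "story": 0.34},
    "primary_metric": "reply_within_24h",              # step 2: what the test optimises
    "guardrails": {"unsubscribe_rate": {"max_regression": 0.10}},
    "confidence": 0.95,                                 # step 3: stopping rules
    "min_sample_per_arm": 2000,
    "min_runtime_days": 7,
    "max_runtime_days": 30,
    "on_win": "auto_promote",                           # or "notify_only"
}
```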

Plays well with

Works with the tools you already ship on.

Shopify · Stripe · Razorpay · HubSpot · Slack · Google Sheets · Meta WhatsApp Cloud API · Mixpanel

Frequently asked

Questions about A/B Testing.

Can't find what you're looking for? Talk to our team.

How do I know my sample size is large enough?
The experiment dashboard shows current statistical power — the probability of detecting a target lift (default 10%) given your current sample. If power is below 80%, the dashboard tells you how many more contacts you need. For typical D2C broadcast lifts (5-20%), you need between 2,000 and 20,000 contacts per arm. The engine pre-flights this when you launch and warns if your audience is too small.
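
For a back-of-the-envelope check outside the dashboard, the standard two-proportion approximation gives figures in the same range — an illustrative formula, not the engine's exact power calculation:

```python
from math import ceil
from statistics import NormalDist

def contacts_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect a relative lift in a
    conversion rate with a two-sided, two-proportion z-test."""
    z = NormalDist()
    p1, p2 = base_rate, base_rate * (1 + relative_lift)
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / (p2 - p1) ** 2
    return ceil(n)

print(contacts_per_arm(0.07, 0.20))   # a 20% lift on a 7% base reply rate: ~5,700 per arm
print(contacts_per_arm(0.07, 0.10))   # a 10% lift needs far more: ~21,800 per arm
```
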
Can I test more than two variants at once?
Yes, up to five variants per Branch. Beyond five, the multiple-comparison correction starts eating statistical power — you would need much larger samples to reach significance. We recommend starting with two for headline tests and three when you have a clear hypothesis about the spectrum (e.g. low/medium/high discount). For high-dimensional exploration, bandit mode is better than fixed splits.
What happens if a contact triggers the flow twice?
Deterministic bucketing means the contact is hashed to the same arm both times. This prevents accidental crossover where a customer sees variant A on Monday and variant B on Friday. If you intentionally want fresh randomisation per execution (rare, but possible for true repeat-purchase tests), there is a per-run randomisation toggle on the Branch node.
Can I A/B test AI prompts in AI Studio?
Yes. Each AI Generate node can declare two prompt variants, and the engine routes traffic with the same A/B mechanics. Primary metric is usually downstream — did the conversation resolve, did the customer convert, did CSAT stay above threshold. Combined with the eval harness in AI Studio, this lets you ship prompt changes confidently without anecdotal "I think the new one is better".
Does the experiment account for novelty effects?
Yes, via the minimum runtime parameter. Even if statistical significance is reached on day 1, the engine will not auto-promote until the configured minimum runtime (default 7 days, configurable per experiment) has elapsed. This guards against novelty effects, where a new variant gets a temporary attention bump that does not sustain. Bandit mode has a built-in exploration floor to keep all arms alive long enough to detect this.
How is multiple-testing corrected when I have guardrails?
Each guardrail metric gets its own significance test with a Bonferroni-adjusted alpha based on the number of guardrails declared. With three guardrails at 95% overall confidence, each individual test runs at 98.3%. This makes it harder to trigger a false-alarm guardrail regression. You can override the correction if you have domain-specific reasons, but the default is conservative on purpose.
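
The arithmetic behind that 98.3% figure, assuming a plain Bonferroni split of the overall alpha:

```python
guardrails = 3
overall_confidence = 0.95
per_test_alpha = (1 - overall_confidence) / guardrails   # 0.05 / 3 = 0.0167
per_test_confidence = 1 - per_test_alpha                  # 0.9833
print(f"each guardrail tested at {per_test_confidence:.1%} confidence")
```
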
Can I export results for stakeholders?
Yes. Every experiment generates a one-page PDF report with the variants, the metric definitions, observed lifts, confidence intervals, sample sizes, and a recommendation. CSV export gives you the raw assignment and outcome data for re-analysis in R, Python or your preferred tool. Both are accessible via the API for embedding in internal dashboards.
SabFlow · Experimentation

Ship A/B Testing into production this week.

No credit card. No sales call required. Spin up a workspace, plug in a number, and your team is live in under an hour.