A/B Testing turns any branch in a flow into a statistically rigorous experiment. Split traffic across two to five variants — different templates, different timing, different AI prompts, entirely different paths — and let the engine declare the winner. No spreadsheets, no p-value debates, no surprise regressions.
Most teams that claim to A/B test do not actually A/B test. They send template A to one segment on Monday and template B to a different segment on Friday, declare A "won" because more people replied, and ship A everywhere. They did not control for segment composition, day-of-week effects, sample size, or random variance. They probably also did not pre-register their success metric, so they retrofitted whichever number made A look good.
The other failure mode is the team that does run a proper split, but on a sample so small that the result is statistically meaningless. They see template A at 7.2% reply rate and template B at 6.8%, declare a 6% lift, and roll out — when the 95% confidence interval on that lift is something like [-15%, +28%]. The experiment told them nothing; they just made it feel scientific.
A/B Testing in SabFlow does the math for you. You set a primary metric (reply rate, conversion, click-through, time-to-resolution); the engine splits traffic according to your arm weights, tracks the metric per variant, computes confidence intervals via bootstrap resampling, and tells you when you have a significant result. If you ship a variant before significance, the system warns you. If you let it run, it auto-promotes the winner when the threshold is hit.
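To make the bootstrap step concrete, here is a minimal sketch: resample each arm's per-contact outcomes with replacement, recompute the relative lift each time, and read the interval off the percentiles. The function name and the 10,000-resample default are illustrative, not SabFlow's actual internals.

```ts
// Percentile-bootstrap CI on relative lift — illustrative sketch only.
function bootstrapLiftCI(
  control: number[], // per-contact outcomes, e.g. 1 = replied, 0 = didn't
  variant: number[],
  resamples = 10_000,
  alpha = 0.05, // 0.05 -> 95% interval
): { lift: number; lo: number; hi: number } {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const resample = (xs: number[]) =>
    Array.from({ length: xs.length }, () => xs[Math.floor(Math.random() * xs.length)]);

  const lifts: number[] = [];
  for (let i = 0; i < resamples; i++) {
    const c = mean(resample(control));
    const v = mean(resample(variant));
    if (c > 0) lifts.push((v - c) / c); // relative lift of variant over control
  }
  lifts.sort((a, b) => a - b);

  return {
    lift: (mean(variant) - mean(control)) / mean(control),
    lo: lifts[Math.floor((alpha / 2) * lifts.length)],     // e.g. 2.5th percentile
    hi: lifts[Math.floor((1 - alpha / 2) * lifts.length)], // e.g. 97.5th percentile
  };
}
```

This is what turns "B looks better" into something you can act on: if the whole interval sits above zero, the lift is real at your chosen confidence.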
An experiment in SabFlow is declared at a Branch node. You assign weights to each outbound arm — 50/50 for a two-way split, 33/33/34 for three-way, or lopsided (90/10) when you want to test a risky variant on a small slice. The engine deterministically hashes the contact ID to an arm, so the same contact always sees the same variant across re-runs and replays. This consistency matters: a customer who gets different messages on different days because of randomness contaminates the experiment.
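For intuition, here is a sketch of deterministic bucketing under those rules: hash the contact/experiment pair to a number in [0, 1) and walk the cumulative weights. FNV-1a is a stand-in hash chosen for brevity — the names and the actual hash SabFlow uses are assumptions.

```ts
// Deterministic bucketing sketch: the same (contactId, experimentId)
// pair always lands in the same arm, across re-runs and replays.
function assignArm(contactId: string, experimentId: string, weights: number[]): number {
  const key = `${contactId}:${experimentId}`;

  // FNV-1a 32-bit hash — a stand-in, not SabFlow's actual hash.
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  const u = (h >>> 0) / 0x100000000; // value in [0, 1)

  // Walk cumulative weights: [90, 10] -> arm 0 for u < 0.9, else arm 1.
  const total = weights.reduce((a, b) => a + b, 0);
  let cum = 0;
  for (let arm = 0; arm < weights.length; arm++) {
    cum += weights[arm] / total;
    if (u < cum) return arm;
  }
  return weights.length - 1; // floating-point guard
}

// assignArm("contact_42", "exp_welcome", [50, 50]) — stable on every call
```

Because assignment is a pure function of the pair, replays and retries recompute the same answer instead of re-rolling the dice.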
Every experiment has a primary metric and optional guardrail metrics. Primary might be "WhatsApp message reply within 24 hours" or "Shopify order placed within 7 days" — anything observable in the platform. Guardrails are metrics you do not want to regress: "unsubscribes per 1,000 sent", "agent CSAT", "complaint rate". The engine reports both. A variant that wins the primary but spikes a guardrail is flagged, not auto-promoted.
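The flag-versus-promote decision can be pictured as a small gate: the variant must win the primary metric and leave every guardrail inside its allowed band. The types and thresholds below are hypothetical, not SabFlow's schema, and assume "lower is better" guardrails like unsubscribe or complaint rate.

```ts
// Hypothetical promotion gate — a winner must clear every guardrail.
interface GuardrailResult {
  name: string;             // e.g. "unsubscribes per 1,000 sent"
  controlValue: number;     // guardrail on the control arm (assumed > 0)
  variantValue: number;     // guardrail on the winning variant
  maxRegressionPct: number; // e.g. 10 = tolerate up to 10% worse
}

function promotionDecision(
  primaryWinIsSignificant: boolean,
  guardrails: GuardrailResult[],
): "promote" | "flag-for-review" | "keep-running" {
  if (!primaryWinIsSignificant) return "keep-running";

  // "Lower is better" assumed: a rise beyond the threshold is a regression.
  const breached = guardrails.some(
    (g) => ((g.variantValue - g.controlValue) / g.controlValue) * 100 > g.maxRegressionPct,
  );
  return breached ? "flag-for-review" : "promote";
}
```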
Statistical analysis is built in. We compute a two-sided Welch's t-test for continuous metrics and a Bayesian beta-binomial comparison for conversion rates, with configurable confidence (90%, 95%, 99%). The dashboard shows the observed lift, the confidence interval, and a power estimate — how many more contacts you need to detect a target lift if the current sample is too small. No more p-value cargo culting.
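For the conversion-rate case, the beta-binomial comparison reduces to: give each arm a Beta(conversions + 1, non-conversions + 1) posterior, sample both repeatedly, and count how often the variant draws higher. A self-contained sketch — the Marsaglia–Tsang gamma sampler is a textbook choice standing in for whatever the engine actually uses:

```ts
// Marsaglia–Tsang gamma sampler (valid for shape >= 1, which holds below).
function sampleGamma(shape: number): number {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do {
      // standard normal via Box–Muller; 1 - random() avoids log(0)
      x = Math.sqrt(-2 * Math.log(1 - Math.random())) * Math.cos(2 * Math.PI * Math.random());
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = 1 - Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a);
  return x / (x + sampleGamma(b));
}

// P(variant's true conversion rate beats control's): 0.97 reads as "97% sure".
function probVariantBeats(
  controlConversions: number, controlN: number,
  variantConversions: number, variantN: number,
  draws = 20_000,
): number {
  let wins = 0;
  for (let i = 0; i < draws; i++) {
    const pc = sampleBeta(controlConversions + 1, controlN - controlConversions + 1);
    const pv = sampleBeta(variantConversions + 1, variantN - variantConversions + 1);
    if (pv > pc) wins++;
  }
  return wins / draws;
}
```

The appeal of the Bayesian framing is the readout: "there is a 97% chance B beats A" is a sentence an ops team can act on without a statistics refresher.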
For traffic that needs to adapt continuously rather than wait for significance, switch to multi-armed bandit mode. Thompson sampling allocates more traffic to the better-performing arm in real time while still exploring the laggards. This is the right mode for high-volume, low-stakes experiments — broadcast subject lines, button copy, AI prompt variants — where you want to start capturing the lift the moment it emerges.
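A minimal sketch of how a Thompson-sampling allocation works for a binary metric: each send, draw one plausible conversion rate from every arm's posterior and route the contact to the highest draw. Helper names are illustrative.

```ts
// Thompson sampling over conversion-rate arms — illustrative sketch.
interface ArmStats { successes: number; failures: number }

// Erlang trick: Gamma(k, 1) for integer k is -sum of k log-uniforms,
// so Beta(a, b) with integer counts needs no general gamma sampler.
function sampleBetaInt(a: number, b: number): number {
  const erlang = (k: number) => {
    let s = 0;
    for (let i = 0; i < k; i++) s -= Math.log(1 - Math.random());
    return s;
  };
  const x = erlang(a);
  return x / (x + erlang(b));
}

// Pick the arm whose posterior draw is highest this time around.
function pickArm(arms: ArmStats[]): number {
  let best = 0, bestDraw = -Infinity;
  arms.forEach((arm, i) => {
    const draw = sampleBetaInt(arm.successes + 1, arm.failures + 1);
    if (draw > bestDraw) { bestDraw = draw; best = i; }
  });
  return best;
}

// After the outcome lands, update the chosen arm's counts:
//   converted ? arms[chosen].successes++ : arms[chosen].failures++;
// Strong arms win most draws (exploit); weak arms still win some (explore).
```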
Configure 2-5 arms with custom traffic weights. Each arm is its own subgraph of nodes — so variants can differ in template, timing, AI prompt, or entire downstream paths. Not just "different text" but "different journey".
Contacts are bucketed via a hash of (contact_id, experiment_id) so the same person always sees the same variant. Re-runs, replays and resumes all respect the original assignment. No bucket leakage when you debug or retry.
Pick one primary metric the experiment optimises for, plus any number of guardrails it must not regress. Reply rate as primary, unsubscribe rate as guardrail. Auto-promotion blocks if guardrails degrade beyond the threshold you set.
Welch's t-test for means, Bayesian beta-binomial for conversion rates, 90/95/99% confidence configurable. Confidence intervals reported alongside the point estimate. "Variant B lifted reply rate by 14% [95% CI: 6%, 22%]" — not just "B looks better".
Switch from fixed-split to Thompson sampling. Traffic shifts toward the winning arm in real time while still exploring the others. Right for broadcast subjects, button copy, AI prompts — anywhere the cost of waiting for significance is traffic spent on the losing arm.
Set a confidence threshold and a minimum sample size. When the engine hits both, it promotes the winning variant to 100% traffic and freezes the others. You get a Slack alert with the final report. No more "we forgot the experiment was running".
Tenant-wide guardrails catch regressions across all running experiments. If the overall opt-out rate spikes 30% in a 24-hour window, all auto-promotions pause until ops investigate. Protects the brand from slow, collective drift.
Three product-launch templates: emoji-heavy, plain text, story-led. 33/33/34 split across 50,000 contacts. Primary metric: click-to-product-page. Guardrail: unsubscribe rate. Story-led wins by 22% with significance at day 3. Auto-promoted, rolled to remaining list.
Four arms: 9am, 12pm, 6pm, 9pm in contact local time. Primary metric: reply within 1 hour. Bandit mode shifts traffic to 6pm and 9pm within a week. Reveals that the standard "send at 10am" advice was costing the brand 18% in engagement.
Two AI Generate nodes with different system prompts — one concise, one consultative. Primary metric: contact resolves issue without escalation. 50/50 split, 2,000 conversations. Consultative wins on resolution but loses on response latency; ops promotes consultative with a latency optimisation.
Three arms: no discount, 5% off, 10% off. Primary: cart recovered. Guardrail: revenue per recovery. 5% recovers 16% of carts at full margin; 10% recovers 19% but cuts net by 8%. Ops picks 5% based on the joint view, not the headline.
Two arms: 3-message nurture vs 7-message nurture over two weeks. Primary: enrolment. Guardrail: opt-out rate. 7-message wins enrolment by 11% with opt-out unchanged. The team had assumed more messages would hurt — the data overruled the assumption.
A/B Testing is included on every SabNode workspace. No separate billing, no extra setup — flip it on from your workspace settings.
Insert a Branch node, switch it to A/B mode, declare 2-5 variants with weights. Each variant is its own subgraph you build out independently.
Select from built-in metrics (reply, click, order, custom event) or define a custom one via a downstream Goal node. Add guardrails the experiment must not regress.
Configure confidence (95% default), minimum sample size, max runtime, and auto-promotion behaviour — see the settings sketch after these steps. Or run in bandit mode for continuous traffic shifting.
Publish the flow. The experiment dashboard updates in real time — variant share, observed metric, confidence interval, power. Drill into any contact to see their assigned arm.
When the engine hits significance, auto-promote or review and promote manually. Archive the experiment with the full report stored for audit. Spin up the next test.
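If it helps to see the settings from those steps in one place, here is a hypothetical shape for an experiment's configuration. Every field name is invented for illustration — this is not SabFlow's actual schema.

```ts
// Hypothetical experiment configuration — field names are illustrative.
const experiment = {
  branchNode: "branch_welcome_test",
  mode: "fixed-split" as "fixed-split" | "bandit",
  arms: [
    { name: "concise", weight: 50 },      // each arm is its own subgraph
    { name: "consultative", weight: 50 },
  ],
  primaryMetric: { event: "reply", window: "24h" },
  guardrails: [
    { event: "unsubscribe", maxRegressionPct: 10 },
  ],
  stopping: {
    confidence: 0.95,    // the 95% default from the steps above
    minSampleSize: 2000, // both thresholds must be hit before promotion
    maxRuntimeDays: 14,
    autoPromote: true,   // winner goes to 100% traffic, losers frozen
  },
};
```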
No credit card. No sales call required. Spin up a workspace, plug in a number, and your team is live in under an hour.