SabNode
    SabFlow · Experimentation

    Stop guessing which template wins

    A/B Testing turns any branch in a flow into a statistically rigorous experiment. Split traffic across two to five variants (different templates, different timing, different AI prompts, or different paths entirely) and let the engine declare the winner. No spreadsheets, no p-value debates, no surprise regressions.

    • Up to 5 variants per experiment
    • Built-in significance and confidence
    • Auto-promote winner on threshold
    • Multi-armed bandit mode for traffic shifting
    Feature signature
    SabNode · SabFlow
    A/B Testing

    Split traffic across flow variants. Pick the winner automatically.

    Live
    • 5 max variants per experiment
    • 95% default confidence threshold
    • 24× faster to significance vs manual splits
    The problem

    The "we tried two versions" experiment that proved nothing

    Most teams who claim they A/B test do not actually A/B test. They send template A to one segment on Monday and template B to a different segment on Friday, declare A "won" because more people replied, and ship A everywhere. They did not control for segment composition, day-of-week effects, sample size, or random variance. They probably also did not pre-register their success metric, so they retro-fitted whichever number made A look good.

    The other failure mode is the team that does run a proper split, but on a sample so small that the result is statistically meaningless. They see template A at 7.2% reply rate and template B at 6.8%, declare a 6% lift, and roll out — when the 95% confidence interval on that lift is something like [-15%, +28%]. The experiment told them nothing; they just made it feel scientific.
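
    To make the small-sample problem concrete, here is a minimal sketch of a confidence interval on that lift. It uses a normal approximation for the difference of two proportions and treats B's rate as a fixed baseline. The 3,000 contacts per arm is an assumed figure (the paragraph above gives no sample size), so the exact interval will differ from the one quoted; the point is only that it spans zero.

```python
import math
from statistics import NormalDist


def relative_lift_ci(p_a, p_b, n_a, n_b, confidence=0.95):
    """Approximate CI for the lift of A over B, relative to B's observed rate.

    Normal approximation for a difference of two proportions; B's rate is
    treated as a fixed baseline when converting to a relative lift.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return (diff - z * se) / p_b, diff / p_b, (diff + z * se) / p_b


# Reply rates from the text; 3,000 contacts per arm is an assumption.
low, lift, high = relative_lift_ci(0.072, 0.068, 3000, 3000)
print(f"lift {lift:+.1%}, 95% CI [{low:+.1%}, {high:+.1%}]")
```

    An interval that includes zero means the data cannot distinguish "A is better" from "A is worse".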

    A/B Testing in SabFlow does the math for you. You set a primary metric (reply rate, conversion, click-through, time-to-resolution); the engine splits traffic according to the weights you assign, tracks the metric per variant, computes confidence intervals using the statistical tests described under "What it is", and tells you when a result is significant. If you ship a variant before significance, the system warns you. If you let it run, it auto-promotes the winner once the threshold is hit.

    What it is

    A/B Testing, in depth.

    An experiment in SabFlow is declared at a Branch node. You assign weights to each outbound arm: 50/50 for a two-way split, 33/33/34 for three-way, or weighted (90/10) when you want to test a risky variant on a small slice. The engine deterministically hashes the contact ID together with the experiment ID to pick an arm, so the same contact always sees the same variant across re-runs and replays. This consistency matters: a contact who receives different variants on different days contaminates both arms, and the result can no longer be attributed to either.
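
    Deterministic weighted bucketing can be sketched as follows. This is an illustration, not SabFlow's actual implementation: the function name, the SHA-256 hash and the key format are all assumptions.

```python
import hashlib


def assign_arm(contact_id: str, experiment_id: str, weights: dict) -> str:
    """Map a contact to an arm deterministically, respecting traffic weights.

    Hypothetical helper: the hash function and key layout are assumptions.
    """
    key = f"{experiment_id}:{contact_id}".encode("utf-8")
    # First 8 bytes of SHA-256 as an integer, scaled to [0, 1).
    point = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    total = sum(weights.values())
    cumulative = 0.0
    for arm, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return arm
    return arm  # guards against floating-point rounding at the upper edge


weights = {"control": 90, "risky": 10}
first = assign_arm("contact-123", "exp-launch", weights)
replay = assign_arm("contact-123", "exp-launch", weights)
print(first, first == replay)
```

    Because the arm depends only on the two IDs and the weights, a replay or retry lands in the same bucket without any stored assignment table.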

    Every experiment has a primary metric and optional guardrail metrics. Primary might be "WhatsApp message reply within 24 hours" or "Shopify order placed within 7 days" — anything observable in the platform. Guardrails are metrics you do not want to regress: "unsubscribes per 1,000 sent", "agent CSAT", "complaint rate". The engine reports both. A variant that wins the primary but spikes a guardrail is flagged, not auto-promoted.

    Statistical analysis is built in. The engine runs a two-sided comparison (Welch's t-test for continuous metrics, a Bayesian beta-binomial model for conversion rates) at a configurable confidence level (90%, 95% or 99%). The dashboard shows the observed lift, the confidence interval, and a power estimate: how many more contacts you need to detect a target lift if the current sample is too small. No more p-value cargo culting.
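
    For the conversion-rate case, a beta-binomial comparison can be sketched with the standard library alone. The uniform Beta(1, 1) prior, the Monte Carlo estimate and the function name are assumptions made for illustration; the text does not say which prior or computation the engine uses.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 68 of 1,000 replies for A, 95 of 1,000 for B.
print(prob_b_beats_a(68, 1000, 95, 1000))
```

    A result above the configured confidence level (0.95 by default) would count as a win for B under this sketch.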

    For traffic that needs to adapt continuously rather than wait for significance, switch to multi-armed bandit mode. Thompson sampling allocates more traffic to the better-performing arm in real time while still exploring the laggards. This is the right mode for high-volume, low-stakes experiments — broadcast subject lines, button copy, AI prompt variants — where you want to start capturing the lift the moment it emerges.
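
    A minimal sketch of Thompson sampling for conversion-style metrics is shown here. The Beta(1, 1) priors, the simulated "true" rates and all names are assumptions; the exploration floor mentioned elsewhere on this page is not modelled.

```python
import random


def thompson_pick(stats, rng):
    """Pick the arm whose sampled conversion rate is highest.

    stats maps arm name -> [successes, failures].
    """
    samples = {
        arm: rng.betavariate(1 + s, 1 + f) for arm, (s, f) in stats.items()
    }
    return max(samples, key=samples.get)


def simulate(true_rates, rounds=5000, seed=0):
    """Route simulated contacts one at a time and record outcomes per arm."""
    rng = random.Random(seed)
    stats = {arm: [0, 0] for arm in true_rates}
    for _ in range(rounds):
        arm = thompson_pick(stats, rng)
        converted = rng.random() < true_rates[arm]
        stats[arm][0 if converted else 1] += 1
    return stats


# Hypothetical true rates, used only to drive the simulation.
result = simulate({"A": 0.05, "B": 0.15})
print({arm: sum(counts) for arm, counts in result.items()})
```

    Each arm's share of traffic tracks the probability that it is the best one, so the weaker arm keeps receiving a shrinking trickle of contacts rather than being cut off outright.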

    Capabilities

    Everything you get with A/B Testing.

    7 capabilities
    01

    Up to 5 variants per branch

    Configure 2-5 arms with custom traffic weights. Each arm is its own subgraph of nodes — so variants can differ in template, timing, AI prompt, or entire downstream paths. Not just "different text" but "different journey".

    02

    Deterministic contact bucketing

    Contacts are bucketed via a hash of (contact_id, experiment_id) so the same person always sees the same variant. Re-runs, replays and resumes all respect the original assignment. No bucket leakage when you debug or retry.

    03

    Primary + guardrail metrics

    Pick one primary metric the experiment optimises for, plus any number of guardrails it must not regress. Reply rate as primary, unsubscribe rate as guardrail. Auto-promotion blocks if guardrails degrade beyond the threshold you set.

    04

    Statistical significance built in

    Welch's t-test for means, Bayesian beta-binomial for conversion rates, 90/95/99% confidence configurable. Confidence intervals reported alongside the point estimate. "Variant B lifted reply rate by 14% [95% CI: 6%, 22%]" — not just "B looks better".

    05

    Multi-armed bandit mode

    Switch from fixed-split to Thompson sampling. Traffic shifts toward the winning arm in real time while still exploring others. Right for broadcast subjects, button copy, AI prompts — anything where lost traffic is the cost of waiting.

    06

    Auto-promote winner

    Set a confidence threshold and a minimum sample size. When the engine hits both, it promotes the winning variant to 100% traffic and freezes the others. You get a Slack alert with the final report. No more "we forgot the experiment was running".

    07

    Cross-experiment guardrails

    Tenant-wide guardrails catch regressions across all running experiments. If the overall opt-out rate rises 30% within a 24-hour window, all auto-promotions pause until ops investigates. This protects the brand from a collective regression that no single experiment would flag.
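
    The promotion rules described in the capabilities above (confidence threshold, minimum sample size, per-experiment guardrails, tenant-wide pause) can be read as a simple gate. The sketch below is one possible reading; the class, field names and the 1,000-contact minimum are hypothetical, not SabFlow's API.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    name: str
    sample_size: int
    confidence: float      # confidence that this arm is the winner, 0..1
    guardrails_ok: bool    # False if any guardrail degraded past its threshold


def promotion_decision(arm, min_confidence=0.95, min_sample=1000,
                       tenant_paused=False):
    """Return (promote, reason) for the leading arm. Defaults are assumed."""
    if tenant_paused:
        return False, "tenant-wide guardrail tripped; promotions paused"
    if not arm.guardrails_ok:
        return False, "guardrail regression; flagged for review"
    if arm.sample_size < min_sample:
        return False, "minimum sample size not reached"
    if arm.confidence < min_confidence:
        return False, "confidence threshold not reached"
    return True, "promote to 100% traffic"


leader = ArmResult("story-led", sample_size=16000, confidence=0.97,
                   guardrails_ok=True)
print(promotion_decision(leader))
```

    Checking the blocking conditions first means a variant that wins the primary metric but trips a guardrail is never promoted by accident.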

    Use cases

    Built for the way teams actually work.

    D2C · Case 01

    Template copy split for D2C launch

    Three product-launch templates: emoji-heavy, plain text, story-led. 33/33/34 split across 50,000 contacts. Primary metric: click-to-product-page. Guardrail: unsubscribe rate. Story-led wins by 22% with significance at day 3. Auto-promoted, rolled to remaining list.

    E-commerce · Case 02

    Send-time optimisation

    Four arms: 9am, 12pm, 6pm, 9pm in contact local time. Primary metric: reply within 1 hour. Bandit mode shifts traffic to 6pm and 9pm within a week. Reveals that the standard "send at 10am" advice was costing the brand 18% in engagement.

    SaaS · Case 03

    AI prompt experimentation

    Two AI Generate nodes with different system prompts — one concise, one consultative. Primary metric: contact resolves issue without escalation. 50/50 split, 2,000 conversations. Consultative wins on resolution but loses on response latency; ops promotes consultative with a latency optimisation.

    E-commerce · Case 04

    Discount level for cart recovery

    Three arms: no discount, 5% off, 10% off. Primary: cart recovered. Guardrail: revenue per recovery. 5% recovers 16% of carts at full margin; 10% recovers 19% but cuts net by 8%. Ops picks 5% based on the joint view, not the headline.

    EdTech · Case 05

    Education program nurture cadence

    Two arms: 3-message nurture vs 7-message nurture over two weeks. Primary: enrolment. Guardrail: opt-out rate. 7-message wins enrolment by 11% with opt-out unchanged. The team had assumed more messages would hurt — the data overruled the assumption.

    How it works

    From signup to first send in minutes.

    A/B Testing is included on every SabNode workspace. No separate billing and no extra setup; switch it on from your workspace settings.

    1. Add a Branch with variants

      Insert a Branch node, switch it to A/B mode, declare 2-5 variants with weights. Each variant is its own subgraph you build out independently.

    2. Pick a primary metric

      Select from built-in metrics (reply, click, order, custom event) or define a custom one via a downstream Goal node. Add guardrails the experiment must not regress.

    3. Set confidence and stopping rules

      Configure confidence (95% default), minimum sample size, max runtime, and auto-promotion behaviour. Or run in bandit mode for continuous traffic shifting.

    4. Ship and monitor

      Publish the flow. The experiment dashboard updates in real time — variant share, observed metric, confidence interval, power. Drill into any contact to see their assigned arm.

    5. Promote or iterate

      When the engine hits significance, auto-promote or review and promote manually. Archive the experiment with the full report stored for audit. Spin up the next test.

    Plays well with

    Works with the tools you already ship on.

    Connect directly with your existing stack or leverage the Platform Core tools to extend capabilities natively.

    Shopify, Stripe, Razorpay, HubSpot, Slack, Google Sheets, Meta WhatsApp Cloud API, Mixpanel

    Platform Core Tools

    Enhance this feature with deep integrations into our core infrastructure. Connect via API, utilize webhooks, or embed directly using our SDKs.

    • Unified Dashboard Apps

      Manage all settings seamlessly within the core UI.

    • Developer APIs and Webhooks

      Extend functionality with custom automated workflows.

    Frequently asked

    Questions about A/B Testing.

    Can't find what you're looking for? Talk to our team.

    How do I know my sample size is large enough?
    The experiment dashboard shows current statistical power — the probability of detecting a target lift (default 10%) given your current sample. If power is below 80%, the dashboard tells you how many more contacts you need. For typical D2C broadcast lifts (5-20%), you need between 2,000 and 20,000 contacts per arm. The engine pre-flights this when you launch and warns if your audience is too small.
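
    As a rough guide to where such numbers come from, here is a sketch of the textbook sample-size formula for comparing two proportions (normal approximation, two-sided test). The 7% baseline rate is an assumed example, and the engine's own pre-flight calculation may differ.

```python
import math
from statistics import NormalDist


def contacts_per_arm(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Contacts needed per arm to detect a relative lift over the baseline."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Assumed 7% baseline; compare small and large target lifts.
for lift in (0.05, 0.10, 0.20):
    print(f"{lift:.0%} lift: {contacts_per_arm(0.07, lift)} contacts per arm")
```

    The required sample grows roughly with the inverse square of the lift, which is why halving the target lift roughly quadruples the audience you need.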
    Can I test more than two variants at once?
    Yes, up to five variants per Branch. Beyond five, the multiple-comparison correction starts eating statistical power — you would need much larger samples to reach significance. We recommend starting with two for headline tests and three when you have a clear hypothesis about the spectrum (e.g. low/medium/high discount). For high-dimensional exploration, bandit mode is better than fixed splits.
    What happens if a contact triggers the flow twice?
    Deterministic bucketing means the contact is hashed to the same arm both times. This prevents accidental crossover where a customer sees variant A on Monday and variant B on Friday. If you intentionally want fresh randomisation per execution (rare, but possible for true repeat-purchase tests), there is a per-run randomisation toggle on the Branch node.
    Can I A/B test AI prompts in AI Studio?
    Yes. Each AI Generate node can declare two prompt variants, and the engine routes traffic with the same A/B mechanics. Primary metric is usually downstream — did the conversation resolve, did the customer convert, did CSAT stay above threshold. Combined with the eval harness in AI Studio, this lets you ship prompt changes confidently without anecdotal "I think the new one is better".
    Does the experiment account for novelty effects?
    Yes, via the minimum runtime parameter. Even if statistical significance is reached on day 1, the engine will not auto-promote until the configured minimum runtime (default 7 days, configurable per experiment) has elapsed. This guards against novelty effects where a new variant gets a temporary attention bump that does not sustain. Bandit mode has a built-in exploration floor to keep all arms alive long enough to detect this.
    How is multiple-testing corrected when I have guardrails?
    Each guardrail metric gets its own significance test with a Bonferroni-adjusted alpha based on the number of guardrails declared. With three guardrails at 95% overall confidence, each individual test runs at 98.3%. This makes it harder to trigger a false-alarm guardrail regression. You can override the correction if you have domain-specific reasons, but the default is conservative on purpose.
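
    The arithmetic behind that 98.3% figure is the standard Bonferroni split of the overall error budget across the guardrails. A minimal sketch, with a hypothetical function name:

```python
def per_guardrail_confidence(overall_confidence: float, guardrails: int) -> float:
    """Bonferroni: divide the overall alpha evenly across the guardrail tests."""
    alpha = 1 - overall_confidence
    return 1 - alpha / guardrails


print(f"{per_guardrail_confidence(0.95, 3):.1%}")  # 98.3%
```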
    Can I export results for stakeholders?
    Yes. Every experiment generates a one-page PDF report with the variants, the metric definitions, observed lifts, confidence intervals, sample sizes, and a recommendation. CSV export gives you the raw assignment and outcome data for re-analysis in R, Python or your preferred tool. Both are accessible via the API for embedding in internal dashboards.
    Related features

    Stronger when stacked.

    Flow Builder
    Drag-and-drop canvas with 42 node types. Triggers → conditions → actions, no code.
    Broadcasts
    Ship Meta-approved templates to 100k+ contacts. Live delivery reporting.
    AI Studio
    Tenant-scoped LLM with tools, retrieval and guardrails. Deploy anywhere.
    Flow Analytics
    Per-node success, drop-off, revenue and SLA metrics.
    Ship A/B Testing into production this week.

    No credit card. No sales call required. Spin up a workspace, plug in a number, and your team is live in under an hour.

    SabNode

    SabNode is the operating layer for customer conversations. Chat, automation, CRM, broadcasts, commerce and AI in one workspace.

    © 2026 SabNode. All rights reserved.