Case Study - Wander

Testing Wander with traditional tools was impossible

A conversation between
Sneha Sivakumar
Sneha Sivakumar
CEO of Spur
Nathan Potter
CTO, Wander

COMPANY

A modern travel platform that pairs the comfort of private vacation homes with luxury-hotel-grade amenities and 24/7 concierge service

INDUSTRY

B2C Travel ($40M GMV)

COMPANY SIZE

200

FOUNDED

2021

20×

Increase

in release velocity for any new updates.

$750K

Annual QA spend

avoided after adapting Spur.

80%+

Test coverage achieved

within the first month with 500+ tests daily.

The Problem

50 filters and 50+ locations. Millions of possible scenarios, and no tool that could handle any of it.

Wander is a luxury travel product, the kind where a broken search result or a misfiring concierge message isn't just a bug, it's a ruined booking experience for someone planning a vacation. The quality bar is high, and the testing surface is enormous.

Search alone has more than 50 filters across 50+ locations. That's millions of possible filter combinations before you even get to the AI layer. Wander's search doesn't just match keywords, it understands natural language queries. "Beachfront homes in NYC," "properties in California with a hot tub," "somewhere quiet on the East Coast." Every query is different. Every result needs to make sense.

Then there's Ava, Wander's AI concierge, who handles bookings, answers questions, and guides users through the platform in natural conversation. Testing Ava means testing something that, by design, never gives exactly the same response twice.

Traditional tools weren't built for any of this. String-based assertions, checking whether a specific word or phrase appears in the output, completely break down when the output is generated by a language model. You can't write "verify the response contains 'beachfront'" when the response is different every time and the point is whether it makes sense, not whether it matches a pattern.

"Spur really allowed us to streamline the testing process and keep up with the pace of the product changes that we were shipping."

Manual testing was the only fallback, and at Wander's scale and pace, it was already failing. QA cycles were taking days. The team was shipping fast and hoping coverage would hold.

The Solution

Writing tests 10× faster, validating AI with AI and getting feedback in minutes.

Maxwell Iheagwara, Wander's Lead QA Engineer, had used Cypress before. He describes the difference plainly: writing Spur tests is more than 10× faster. The reason is specific, with Cypress, you still have to type exact string assertions for everything you want to validate. If a search should return results for "hotels," you have to type: check that the text reads "look for hotel." You can't automatically generate what Spur puts in there.

With Spur, you give a high-level instruction. "Search for a property somewhere in the US." Spur's agent navigates like a real user, generates a natural search query, "properties in California," "beachfront homes in New York," something different each run, and then validates whether the results actually make sense, not just whether they contain a specific string.

"Writing Spur tests is more than 10× faster than traditional automation with Cypress."

That non-determinism, the fact that every test run uses a different query, was what Maxwell called a killer feature. Because that's how real users search. Not with the same string every time.

For validating AI outputs, Spur uses sentiment analysis rather than string matching, understanding whether a response is correct, helpful, and on-brand, rather than just checking whether it contains a specific word. That made testing Ava possible in a way that no traditional tool could achieve.

Within the first week, Maxwell wrote over 150 tests. Within the first month, Wander had 80%+ coverage and 500+ tests running daily. QA cycles dropped from days to minutes.

The Bug

Ava was flagging legitimate travel queries as spam. Every user asking about booking was being blocked.

This is the critical bug Spur caught and it's the kind that only an intelligent testing tool could find. Wander's AI concierge Ava was incorrectly classifying genuine booking inquiries as spam. Users asking natural questions about accommodations were being silently blocked from completing their bookings.

There was no string to check. No specific error message to look for. The only way to catch this was to simulate real conversations, the kind a real traveller might have, and evaluate whether Ava's responses made sense. That's exactly what Spur did.

Without Spur, this bug would have reached production. Guests would have hit a wall mid-booking, seen no explanation, and likely gone elsewhere. For a luxury travel product, that's not just a UX problem, it's a revenue problem.

The Results

20× release velocity, $750K in QA spend avoided and one engineer who can now do a teams job.

Nathan describes it simply: Spur allowed them to iterate and ship with confidence. 20× faster by release count, because the bottleneck was QA, and QA is now largely automated.

Maxwell, as the lone QA engineer on the team, effectively became a team of 50+ by deploying Spur's agents across search, booking, and AI interaction flows. 500+ tests running daily. Issues caught before they reach production and feedback in minutes.

"Spur has significantly improved Wander's testing capabilities. It has allowed us to iterate and ship so much faster with confidence."

The financial framing is also striking: $750K in avoided QA spend, and $10M+ in revenue-at-risk bugs caught before production. For a luxury travel platform where a single broken booking flow during peak season is expensive, that second number matters.

20x

Increase in release velocity

$750K

Annual QA spend avoided

80%+

Test coverage achieved within the first month

Critical e-commerce flows across 30+ regions
Every regional price, discount rule, and product variant automatically tested before your sale goes live, no manual spot-checking required.
Hundreds of partner landing pages
Ensuring that every audience coming from podcasts, newsletters, and other partnerships lands on a page that is on brand and error free.
Staging and production environments
Running tests in staging for high confidence before launch, then validating again on production as a final safety net.

Key Insights

Wander's product generates different outputs every time, AI search results, AI concierge conversations, natural language interactions that no two users experience the same way. Traditional testing tools weren't built for that. Spur was. One engineer, 500+ tests daily, and the confidence to ship 20× faster, because the QA finally matched the product.

CUSTOMER STORIES

More teams, same results.

UncommonGoods cut QA time in half with AI-driven testing
How Uncommon Goods stopped spending 50% of their QA time on Selenium
From Manual QA Bottlenecks to Fast, Reliable Releases with Spur
How Wondr Health enabled an entire team to work on more interesting problems
Scaling shoppable UGC QA across dozens of brands by adding a single URL to a shared Spur scenario table
How Hue QAs shoppable widgets across 20+ merchants without rebuilding anything.
From manual spot checks to reliable, release-ready coverage at peak traffic
How Eight Sleep Turned Black Friday QA From All-Nighters to Automated Confidence
Regression Done by Noon, Every Release
The regression marathon that ran 9am to midnight, every two weeks
90 % Coverage in 2 Weeks
How YC Hit 90% Coverage on Its Mission‑Critical Applications Portal
2× Faster Deployments, Zero Manual Testing
August deploys every six hours. 25% of those releases had to be rolled back
20x Increase in Release Velocity
Testing Wander with traditional tools was impossible