During major promotional events and startup launches, Shopify’s Hatchful logo generator hit a critical issue: logo downloads began failing during traffic peaks. As the service grew in popularity, download requests surged beyond what the system could handle gracefully. This article explores how the Hatchful team addressed the problem with a retry queue strategy, ultimately stabilizing the service and restoring user trust.
TLDR (Too long, didn’t read)
Hatchful’s logo download service encountered repeated failures during high-traffic periods due to backend bottlenecks. To fix this, the engineering team implemented a retry queue strategy that allowed download requests to be queued and retried until successful. This approach helped balance load across time and improve user experience. The solution dramatically reduced failure rates and became a reusable architecture for other Shopify services.
Understanding the Problem: Download Failures at Scale
Hatchful lets entrepreneurs and small businesses create logos quickly from a handful of style preferences. After designing, users can instantly download their logo in various formats. However, during peak-usage periods such as Black Friday, online webinars, or feature promotions, users reported download errors and delays.
Digging into logs, engineers discovered that:
- Logo generation tasks overwhelmed the processing backend.
- Download endpoints timed out or dropped requests under high concurrency.
- No retry logic was in place if a file wasn’t ready or if a request failed.
This was especially problematic because logo generation is resource-intensive, involving rendering multiple formats and packaging them into downloadable ZIP files.
Root Cause Analysis
The technical investigation revealed that the bottleneck occurred when multiple logo download requests flooded the system simultaneously. Each logo generation involved the following steps (a simplified code sketch follows the list):
- Creating SVG, PNG, and icon tiles in multiple sizes
- Uploading those assets to cloud storage
- Generating a ready-to-download dynamic ZIP file
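Shopify has not published Hatchful’s internals, but to make the per-request cost concrete, here is a minimal Python sketch of that pipeline. The renderer and storage functions, file sizes, and URL are illustrative stubs, not the real implementation:

```python
import io
import zipfile

# Illustrative stand-ins for Hatchful's real (undisclosed) renderer and storage.
def render_svg(design: dict) -> bytes:
    return b"<svg/>"  # stub

def rasterize_png(design: dict, size: int) -> bytes:
    return b"\x89PNG"  # stub

def upload_to_storage(key: str, data: bytes) -> str:
    return f"https://storage.example.com/{key}"  # stub: returns the asset URL

PNG_SIZES = [256, 512, 1024]  # example sizes only
ICON_SIZES = [16, 32, 64]

def build_logo_package(design: dict) -> bytes:
    """Render every format, then bundle all assets into a single ZIP payload."""
    assets = {"logo.svg": render_svg(design)}
    for size in PNG_SIZES:
        assets[f"logo_{size}.png"] = rasterize_png(design, size)
    for size in ICON_SIZES:
        assets[f"icon_{size}.png"] = rasterize_png(design, size)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in assets.items():
            archive.writestr(name, data)
    return buffer.getvalue()

def process_download_job(design: dict, job_id: str) -> str:
    """The expensive end-to-end unit of work a single worker had to complete."""
    return upload_to_storage(f"packages/{job_id}.zip", build_logo_package(design))
```

Every step runs synchronously for a single request, which is why a handful of saturated workers could stall the whole service.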
The infrastructure wasn’t built with horizontal scaling in mind. Instead, a few centralized workers processed all requests. When those workers were saturated, requests timed out or failed without retry, especially during bursts that exceeded predicted limits.
The Game-Changer: Introducing a Retry Queue Architecture
Rather than scaling processing power linearly, the Hatchful team implemented a queue-based retry strategy that decoupled user download requests from the logo generation backend. Here’s how it worked (a minimal enqueue sketch follows the list):
- When a user clicks “Download”, the system checks if the logo files are ready.
- If not, it places the request in a retry queue backed by a managed queuing service (Amazon SQS initially).
- A pool of background workers processes these requests in batches, with exponential backoff between retries.
- Once the logo package is ready, the user is notified through the UI, or they can return and retry the download themselves.
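Shopify hasn’t shared the exact code, but the enqueue step might look like the following sketch, using Amazon SQS via boto3 since the article names SQS as the initial backing service. The queue URL, message shape, and `package_url_if_ready` helper are assumptions for illustration:

```python
import json
import boto3  # AWS SDK for Python

sqs = boto3.client("sqs")
# Placeholder queue URL; the real queue name and account are not public.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/logo-download-retries"

def package_url_if_ready(job_id: str):
    """Hypothetical readiness check against storage; returns a URL or None."""
    return None  # stub

def handle_download_click(job_id: str) -> dict:
    url = package_url_if_ready(job_id)
    if url:
        return {"status": "ready", "url": url}

    # Not ready yet: enqueue the request instead of failing the download.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id}),
        DelaySeconds=10,  # give generation a head start before the first check
    )
    return {"status": "pending"}  # the UI shows a progress indicator
```

The key design choice is that a click never fails outright: it either returns a URL immediately or becomes a queued job that will be retried in the background.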
This design allowed controlled, queue-driven retry attempts without overloading the core system. The delay frustrated some users, but the system offered progress indicators and an automatic pop-up when downloads became ready.
Benefits of the Retry Queue Model
This strategic shift offered several tangible benefits:
- Scalability: Queueing smoothed out traffic bursts, allowing background workers to operate efficiently.
- Reliability: Retry logic ensured high-priority operations weren’t dropped due to temporary failures.
- User Success Rate: The rate of failed downloads dropped from 62% at peak to less than 5% with the queue in place.
- Service Isolation: The retry queue introduced buffers that prevented individual failures from cascading into larger outages.
Perhaps most importantly, it reduced the manual intervention required by customer support teams who previously handled failed logo requests individually.
How the Retry Queue Was Implemented
The retry queue incorporated several engineering best practices (a worker sketch follows the list):
- Exponential Backoff: Retry attempts increased delay intervals to avoid hitting busy systems aggressively.
- DLQ (Dead Letter Queue): Requests that failed repeatedly were moved to a separate queue for investigation, preventing endless retries.
- Heartbeat Workers: Workers continuously polled the queue and assessed file readiness rather than triggering all processing at once.
- Observability: Metrics were added to monitor queue depth, retry rates, and processing times, giving teams full system visibility.
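A minimal worker sketch combining these practices might look as follows, again assuming Amazon SQS via boto3. The readiness check, notification hook, and retry parameters are illustrative, and the dead-letter hand-off comes from the queue’s redrive policy (its `maxReceiveCount`) rather than from worker code:

```python
import json
import random
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/logo-download-retries"

def package_is_ready(job_id: str) -> bool:
    return False  # stub for the hypothetical readiness check

def notify_user(job_id: str) -> None:
    pass  # stub for the hypothetical UI notification

def backoff_seconds(attempt: int, base: int = 10, cap: int = 900) -> int:
    """Exponential backoff with full jitter, capped at an arbitrary 15 minutes."""
    return random.randint(0, min(cap, base * 2 ** attempt))

def poll_once() -> None:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,  # process requests in batches
        WaitTimeSeconds=20,      # long polling keeps idle workers cheap
        AttributeNames=["ApproximateReceiveCount"],
    )
    for msg in resp.get("Messages", []):
        job_id = json.loads(msg["Body"])["job_id"]
        attempt = int(msg["Attributes"]["ApproximateReceiveCount"])

        if package_is_ready(job_id):
            notify_user(job_id)
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
        else:
            # Hide the message for an exponentially growing interval. Once a
            # message exceeds the queue's redrive maxReceiveCount, SQS moves
            # it to the configured dead letter queue automatically.
            sqs.change_message_visibility(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=backoff_seconds(attempt),
            )

while True:
    poll_once()  # heartbeat loop; emit queue-depth and retry-rate metrics here
```

Because the DLQ hand-off is a property of the queue itself, the worker stays simple: delete on success, back off otherwise.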
It also helped that Hatchful used feature flags to roll the solution out gradually across regions, which minimized user disruption and surfaced edge-case bugs early.
The Aftermath: A More Resilient Hatchful
Since the rollout of the retry queue, Hatchful has not experienced a single peak-time outage of its logo download service. This initiative improved user sentiment and clarified the role of asynchronous architecture in building scalable products.
Additionally, the success of this model led to its adoption in other services within Shopify that experience bursty behavior, such as campaign launches and product import tools.
Lessons Learned
The Hatchful team derived vital lessons from the incident:
- Design systems for burst traffic even if average load is low.
- Retry queues are a powerful pattern when latency or immediate delivery cannot be guaranteed.
- User communication matters—update messages like “Your logo will be ready soon” helped set the right expectation.
- Always have monitoring in place. The sooner a bottleneck is spotted, the faster it can be resolved.
Frequently Asked Questions (FAQ)
1. Why did Hatchful logo downloads start failing?
Failures occurred due to a spike in demand during traffic peaks. The infrastructure couldn’t generate and serve download files fast enough, leading to timeouts and incomplete responses.
2. What is a retry queue and how does it work?
A retry queue temporarily stores failed or pending requests and retries them based on a defined schedule. In Hatchful’s case, it allowed background workers to check the readiness of logo files and retry download prep until successful.
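In the abstract, that “defined schedule” is just an exponential backoff loop. A minimal, illustrative version in plain Python (all parameters arbitrary):

```python
import random
import time

def retry_until_ready(check, max_attempts: int = 8,
                      base: float = 1.0, cap: float = 120.0):
    """Call check() until it returns a result, sleeping on an exponential
    backoff schedule (with jitter) between attempts."""
    for attempt in range(max_attempts):
        result = check()
        if result is not None:
            return result
        # Full jitter: sleep a random amount up to the capped exponential delay.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise TimeoutError("still not ready; route to a dead letter queue for review")
```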
3. Did users have to wait a long time to get their logo?
Initially, some users experienced delays of 1–2 minutes. However, with progress messages and notifications in place, most users stayed informed and eventually received their logos without re-initiating requests.
4. Is this solution used in other Shopify apps?
Yes, the architecture was replicated in other services with similar bursty usage patterns, such as product uploads and bulk editing tools.
5. How is this different from just scaling up servers?
Simply scaling up adds cost and doesn’t always absorb sudden load spikes. A queue distributes the load more evenly over time and keeps systems stable without over-provisioning hardware.
In conclusion, this architectural overhaul not only salvaged a failing service operation but also laid the groundwork for more resilient Shopify experiences. The retry queue proved to be a simple yet elegant solution to a complex scalability challenge.
