Elevated 503 Errors Causing Rebuy Services Unavailable

Incident Report for Rebuy

Postmortem

Issue:

Customers experienced 503 Service Unavailable errors that impacted on-site functionality. The issue was identified through routine alert monitoring and reports of degraded service.

Root Cause:

The Ingress configuration was not properly applied after an update to the NGINX controller. The controller had been upgraded to version 1.12.1 to address a critical security vulnerability. After the upgrade, the security evaluation of configuration snippets delayed loading of the Ingress configuration. While that configuration was not loaded, traffic was directed to the default configuration, which caused the service disruption.

Actions Taken:

  • Stage Environment Rollback: The rollback plan was initiated first, downgrading the NGINX controller from version 1.12.1 to 1.12.0 in the staging environment. This did not immediately restore full functionality, but it was a necessary first step.
  • Configuration Investigation and Fix: Further investigation revealed that the Ingress configurations were failing to load because of the security evaluation process and an additional contributing factor. Configuration changes were then made to resolve the snippet evaluation issue, and the controller configurations were purged and redeployed. This corrected the “default” routing behavior and restored service.
  • Production Environment Fix: The same resolution was then applied to the production environment (Rebuyengine.com), fully restoring normal operation.
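
The "default" routing symptom described above is the kind of condition a post-change smoke test can catch before traffic is affected. The following is a minimal sketch, not the check Rebuy actually runs: the function names, the example URLs, and the set of statuses treated as unhealthy are all assumptions made for illustration. In particular, 404 is included on the assumption that unmatched routes fall through to a default backend that answers 404; adjust the set to match the real environment.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, NamedTuple


class ProbeResult(NamedTuple):
    url: str
    status: int      # HTTP status code, or 0 if the request could not complete
    healthy: bool


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET on `url`, or 0 on connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def probe_routes(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
    bad_statuses: frozenset = frozenset({0, 404, 503}),  # assumed failure set
) -> List[ProbeResult]:
    """Probe each URL once and flag responses that suggest default routing."""
    results = []
    for url in urls:
        status = fetch(url)
        results.append(ProbeResult(url, status, status not in bad_statuses))
    return results


def all_healthy(results: Iterable[ProbeResult]) -> bool:
    """True only if every probed route returned an acceptable status."""
    return all(r.healthy for r in results)


if __name__ == "__main__":
    # Hypothetical staging endpoints, one per Ingress rule to be verified.
    staging_urls = [
        "https://staging.example.com/healthz",
        "https://staging.example.com/api/healthz",
    ]
    report = probe_routes(staging_urls)
    for r in report:
        print(f"{r.url}: {r.status} {'ok' if r.healthy else 'FAILED'}")
    raise SystemExit(0 if all_healthy(report) else 1)
```

Run against staging after a controller upgrade, a non-zero exit code would block promotion of the same change to production. The `fetch` parameter is injectable so the classification logic can be exercised without network access.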

Next Steps:

All necessary actions to address the vulnerability have been completed, and no further steps are required at this time. We have also enhanced our testing process so that, if this error is encountered again, we can act before it affects Rebuy services.

Posted Mar 25, 2025 - 14:19 EDT

Resolved

The issue has been resolved. Our team will prepare a formal Root Cause Analysis and post it to this incident on the status page. Thank you for your patience throughout this process.
Posted Mar 25, 2025 - 00:49 EDT

Update

We are continuing to monitor for any further issues.
Posted Mar 25, 2025 - 00:43 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 25, 2025 - 00:35 EDT

Update

A fix has been implemented, and we are beginning to see a decrease in errors. More updates to follow.
Posted Mar 25, 2025 - 00:34 EDT

Update

We are continuing to work on resolving the issue. Thank you for your patience as we address this.
Posted Mar 25, 2025 - 00:04 EDT

Identified

We have identified a potential root cause of the issue and are taking the necessary actions to resolve it.
Posted Mar 24, 2025 - 23:43 EDT

Investigating

We are currently experiencing a high volume of 503 errors, which is impacting service availability. Our team is actively investigating and working to resolve the issue as quickly as possible. Thank you for your patience.
Posted Mar 24, 2025 - 23:32 EDT
This incident affected: A/B Testing, Admin Portal, Checkout / Post Purchase, Discounting Functions, Smart Cart, and Smart Search.