June 10th, 2025: Heroku Global Outage Analysis

July 8, 2025
Outage report for incident on June 10th, 2025

On June 10th, 2025, Explo experienced an outage lasting 15 hours and 40 minutes (from 2:47am ET to 6:27pm ET).  This is our longest outage to date. This document explains what happened and the steps we have already taken, and will continue to take, to prevent it from happening again.


Summary

Explo’s main backend service (api.explo.co) manages the in-app experience (app.explo.co), the dashboard / report builder configurations used in embeds, and more.  For workspaces on the new data connector, FIDO (<region>.data.explo.co), data-related requests go through that microservice (v2 architecture). For workspaces not on FIDO, data-related requests go through the main backend service's query coordination (v1 architecture).
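To make the two architectures concrete, here is a minimal sketch in Python of which service answers which kind of request. The `uses_fido` flag and helper names are hypothetical; the real routing lives inside Explo's services.

```python
# Minimal sketch of the v1 / v2 request paths described above. The `uses_fido`
# flag and these helpers are hypothetical, not Explo's actual code.
def data_request_base_url(workspace: dict, region: str = "us") -> str:
    """Return the service that handles data queries for a workspace."""
    if workspace.get("uses_fido"):
        # v2: data queries are served by the FIDO microservice on AWS.
        return f"https://{region}.data.explo.co"
    # v1: data queries go through query coordination on the main backend (Heroku).
    return "https://api.explo.co"


def layout_request_base_url() -> str:
    """Dashboard / report builder configuration always comes from the main backend."""
    return "https://api.explo.co"
```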

Although workspaces on FIDO could still query data, dashboard / report builder layouts and other configuration could not be retrieved from the main backend service, so embeds remained broken. Note: if your workspace is not on FIDO yet, let’s work together to move it over immediately.

v1 diagram with the impacted service in red
v2 diagram with the impacted service in red

FIDO is deployed on AWS, while the main backend service (api.explo.co) is deployed on Heroku, which is owned by Salesforce (NYSE: CRM).  Heroku experienced an outage and was unusable from around 3:00am ET, which they initially attributed to authentication issues.  As of this writing (6:51pm ET on June 10th), neither Heroku nor Salesforce has indicated what the root cause was.

Our team was unable to access any Heroku resources to help restart the service.  We decided to pour our engineering resources into migrating from Heroku to AWS, while hoping that Heroku would recover on its own.  Because we could not access various configurations set in the Heroku application (including certain secrets), the migration to AWS was extremely difficult.

At 6:15pm ET our team was finally ready to point our frontend to an AWS service we had spun up, which was not a full replacement but a backstop solution.  However, because Salesforce had released an update at 5:48pm ET indicating we had regained access to Heroku dashboard resources, we decided to wait for further information: restoring the Heroku deployment would restore all features, versus the reduced experience of the AWS backstop.

At 6:16pm ET, the Heroku team updated their 5:48pm ET message with instructions, which we immediately ran and which started to bring the main backend service back up.

Detailed Timeline

→ 2:47am ET — Last successful trace to our backend service.  At this point we didn’t know there was an outage.  Because the service was fully down, we weren’t getting any failed traces either, just a complete absence of traces, and our alerting was not set up to catch that scenario. Follow-up actions are at the bottom.

→ 3:03am ET — Our automated heartbeat test starts failing.  The Explo support team gets pinged, but not paged. Follow-up actions are at the bottom.

→ 3:26am ET — One member of the Explo support team notices issues and pages other members of the support team and engineering team.

→ 3:44am ET — Explo’s engineering team gets online, confirms the outage, and updates the status page accordingly.

→ 3:48am ET — We see no updates on Heroku’s status page, but find that Salesforce’s status page has information about a potential outage, posted at 3:17am ET.

→ 3:49am ET — We email Heroku and Salesforce support to get help.  We are unable to create tickets in their support system due to the outage.

→ 4:13am ET — The team assesses the impact to the best of our ability. Given our inability to access any Heroku dashboards or resources, we decide that fully migrating to AWS while flying blind with Heroku would potentially take longer than Heroku fixing the outage.  We try to use our disaster recovery plan, but the DR exercise assumes that our service is down and the database is corrupted while we still have access to general Heroku resources.  The team brainstorms what else we can do here.

→ 5:51am ET — We decide to investigate migrating to AWS. However, at this point the outage is so severe that we don’t even have access to the Postgres database that Heroku manages and backs up for us.  We cannot migrate; we are fully stuck.  We actively explore other options.

→ 8:34am ET — Our engineering team regains access to the Heroku-managed Postgres database.  We discovered this simply by repeatedly trying to connect to it (see the sketch below).
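For reference, the check amounted to a simple reconnect loop along these lines. This is a sketch assuming psycopg2 and the standard DATABASE_URL config var, not the exact script we ran:

```python
# Repeatedly probe the Heroku-managed Postgres instance until it answers.
import os
import time

import psycopg2


def wait_for_postgres(dsn: str, interval_seconds: int = 30) -> None:
    """Loop until a connection to the database succeeds."""
    while True:
        try:
            conn = psycopg2.connect(dsn, connect_timeout=10)
            conn.close()
            print("Postgres is reachable again")
            return
        except psycopg2.OperationalError as exc:
            print(f"Still unreachable: {exc}")
            time.sleep(interval_seconds)


if __name__ == "__main__":
    wait_for_postgres(os.environ["DATABASE_URL"])
```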

→ 8:35am ET — We decide the best course of action is to try to move over to AWS.  We knew this would be a heroic effort and take at least a day, but it was the best option we had, and we kept our fingers crossed that Heroku would resolve its own issues before we finished.  Our thinking was that getting something partially up was better than a full-blown outage.

→ 8:42am ET — We realize we are missing some environment variables (particularly secrets) that are stored only in Heroku.  We are still unable to access Heroku resources, so we cannot retrieve those secrets.  This increases the complexity of migrating to AWS by an order of magnitude, but the team decides to keep forging ahead with the plan.

→ 9:46am ET — The Heroku status page is failing.

→ 9:50am ET — In addition to status page updates, we sent broadcasts to customers via Slack and email communicating the outage.

→ 9:55am ET — Our staging instance, which lives in the same private VPC as our production instance, miraculously comes up and is able to handle requests.  Production is still down.

→ 10:10am ET — Salesforce claims Heroku services are slowly and intermittently coming back up; this ended up being incorrect.

→ 10:14am ET — A portion of the engineering team considers repurposing staging as production.

→ 11:00am ET — A few team members join the Salesforce “customer facing incident webinar”. Nothing actionable was provided.

→ 12:28pm ET — Given the positive progress we were making on migrating to AWS, we informed our customers via the status page.

→ 2:28pm ET — We let our customers know that Heroku / Salesforce has reported on its status page that “errors are resurfacing”.

→ 3:42pm ET — We let our customers know that Heroku / Salesforce says they are “developing a fix”.  We are continuing to make progress on the AWS migration.

→ 5:00pm ET — We inform customers that we are getting close to having the AWS service ready as we hydrate its database from a backup.  The backstop will have restrictions, since we are optimizing for getting embeds working for end users rather than building a full replacement.

→ 5:48pm ET — Salesforce releases an update indicating we have access to Heroku dashboard resources again; we decide to wait for further information.

→ 6:05pm ET — We post on our status page that Heroku is claiming its admin “dashboard is now back online”.  We don't yet know exactly what this implies and are testing on our side.  We try a rolling restart of our dynos, with no success.

→ 6:15pm ET — Our team is ready to do the AWS cutover. However, given the recent messages from Heroku, we want to give it a bit more time to see if they’ll fix things, since our AWS solution is a backstop with known restrictions rather than a replacement.  Here is the communication we were prepared to send to customers:

Prepared communication

→ 6:16pm ET — The Heroku team updates their 5:48pm ET message with instructions, which we immediately run; they start to bring the main backend service back up.  We are not sure if it will fully recover at this point.

→ 6:27pm ET — After monitoring dyno metrics and Datadog traffic, we think the backend service is stable enough to report to customers; some customers are already noticing on their own.  We update our status page to indicate that we are not going to cut over to the AWS service and that we are seeing positive signs of Heroku bringing our service back up.  Heroku has not marked the incident as resolved, so we don't think we are fully in the clear yet, but embedded dashboards and report builders are starting to load properly again.

→ 7:52pm ET — After monitoring traffic, the team sees the backend service as fully healthy again.  The outage is resolved.

We ended one of our many Slack threads with the message: “PR: 485!”

An engineer said it felt something like this

Action Items

Below are some items our team is immediately acting on:

• We will add an alert to the main backend service that checks for “requests haven’t been seen for N seconds” and other anomalous behavior.  This exists on FIDO, but was never added to the main backend service because of its historical stability.  A sketch of this check is included after this list.

• We have automated heartbeat tests that @ Slack groups but do not page.  These tests have only ever failed during business hours, and the alert has rarely fired at all, which is why paging was never wired up. We will immediately connect them to our paging infrastructure (see the sketch after this list).

• We will send mass notifications to customers over Slack and email earlier and more aggressively, and will evangelize our status page even more than we did today.

• We will complete the AWS migration, which includes adding the proper logging, monitoring, and alerting.  Managing our own AWS infrastructure lets us design for greater redundancy and resiliency through multi-AZ and multi-region architectures. We will keep our customers posted on the progress of the migration.
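As a sketch of the first item, the check boils down to alerting on silence rather than on errors. The `get_last_trace_time` and `page_on_call` hooks below are hypothetical stand-ins for our metrics and paging systems:

```python
import time

# Alert if the main backend has served no traced requests for this long.
TRACE_SILENCE_THRESHOLD_SECONDS = 300


def check_trace_silence(get_last_trace_time, page_on_call) -> None:
    """Page when traces stop arriving entirely (the 2:47am failure mode)."""
    last_seen = get_last_trace_time()  # epoch seconds of the newest trace
    silence = time.time() - last_seen
    if silence > TRACE_SILENCE_THRESHOLD_SECONDS:
        # No errors were emitted during the outage, only an absence of traffic,
        # so the condition has to be "nothing seen", not "errors seen".
        page_on_call(
            f"api.explo.co has served no traced requests for {int(silence)}s"
        )
```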
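And for the second item, a sketch of wiring the heartbeat test to paging instead of Slack only. It assumes PagerDuty's Events API v2 as the paging backend and a hypothetical `run_heartbeat_test` helper; the actual pager and test harness may differ:

```python
import os

import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def page_if_heartbeat_fails(run_heartbeat_test) -> None:
    """Escalate a failed heartbeat to the on-call rotation, not just Slack."""
    ok, detail = run_heartbeat_test()  # e.g. load an embedded dashboard end to end
    if ok:
        return
    requests.post(
        PAGERDUTY_EVENTS_URL,
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],
            "event_action": "trigger",
            "payload": {
                "summary": f"Heartbeat test failed: {detail}",
                "source": "heartbeat-monitor",
                "severity": "critical",
            },
        },
        timeout=10,
    )
```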

Gary Lin

Founder of Explo


ABOUT EXPLO

Explo, the publisher of Graphs & Trends, is an embedded analytics company. With Explo’s Dashboard and Report Builder product, you can deliver a premium analytics experience for your users with minimal engineering bandwidth.
Learn more about Explo →