Background on Explo
Typically, Explo’s platform connects directly to a customer’s database or data warehouse, giving data analysts and product managers the access they need to build rich data experiences for white-labeling into their own applications. Explo doesn’t store any of our customers’ data; instead, we query the necessary data on demand. Explo provides an optional caching layer, but the reliability and performance of our data querying infrastructure are existential to our product.
We’ve outgrown a few implementations of our data querying architecture. The initial changes improved our resource allocation and changed our query patterns, but it soon became obvious that we needed to embark on a larger change: a dedicated data querying microservice.
History: What were the problems?
For many early companies, eliminating the need for dev ops is a principal architectural and engineering goal, Explo included. Heroku’s PaaS solved Explo’s compute needs and provided a smooth ramp to SOC2 compliance without much effort on the part of the engineering team.
Cracks quickly showed with our approach, however. First, connection latency. Connecting to customer databases, often through ephemeral SSH tunnels, added to the latency of every query. TCP establishment round trips, SSH handshakes, SSL negotiation, and TCP slow start all added up.
Connections couldn’t be reused in our architecture either: each one was started for a single query and then disposed of. Typically, some form of connection pooling is instituted to alleviate this, but that was challenging for Explo because we would have had thousands of (mostly idle) database connections and SSH tunnels to maintain. Additionally, the shared-nothing compute model of a typical 12-factor app hosted on Heroku kept us from reasonably sharding our architecture without significantly more investment. We needed sharding to distribute the overhead of thousands of pooled database connections, a number growing daily, across the many compute nodes we maintained.
Second, customer query latency. Customers can craft arbitrary queries in Explo to power their analytics and data sharing needs. Often queries, especially during development, would take many seconds or longer to complete. Heroku has a 30-second connection timeout at the router level—our customer queries would commonly hit this limit and our requests would time out. Even worse, modern browsers typically have a limit of six open connections per hostname. Many page loads in Explo required more than six queries to run in parallel. Typically this could be solved by adopting HTTP/2, but Heroku still has no plans to support it.
Solving… err… working around customer query latency
We chose to quickly solve some of our customer query latency issues by backgrounding customer queries: we would capture each request and write it to a Redis queue. We maintained a pool of background workers to handle customer requests and write results back for the client through a Redis key-value store. Clients would poll with exponential backoff for their query results.
This approach solved our timeout issues (and the browser’s open connection limit) but added even more latency to the customer experience. Long queries would almost always complete, but the user experience of faster queries suffered. We knew there was a better way that didn’t compromise the end-user experience.
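The client side of that polling loop can be sketched roughly as follows. This is a minimal illustration, not Explo’s actual client: the function names, the default delays, and the `fetch` callable (a stand-in for reading a result key from the Redis-backed store) are all assumptions made for the example.

```python
import time
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 0.1, factor: float = 2.0,
                   cap: float = 5.0, attempts: int = 8) -> Iterator[float]:
    """Yield the wait after each unsuccessful poll: base, base*factor, ... capped at `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def poll_for_result(fetch: Callable[[], Optional[dict]],
                    sleep: Callable[[float], None] = time.sleep,
                    **backoff: float) -> Optional[dict]:
    """Call `fetch` until it returns a result or the attempts run out.

    `fetch` is hypothetical: it should return None while the background
    worker is still running and the result payload once it has been written.
    """
    for delay in backoff_delays(**backoff):
        result = fetch()
        if result is not None:
            return result
        sleep(delay)  # wait longer after every miss
    return None


if __name__ == "__main__":
    # In-memory stand-in for the key-value store: the result appears on the third poll.
    answers = iter([None, None, {"rows": 3}])
    print(poll_for_result(lambda: next(answers), base=0.01))
```

The sketch also shows where the extra latency comes from: a query that finishes just after a poll still waits out the full delay before the client sees the result.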
We wanted to solve our query execution problems in one fell swoop after our initial stop-gap solution. We were at a point where we were secure in our understanding of Explo’s future and had a good sense of our future customer database connectivity needs. It was time to invest in the future of the platform, and we decided to make a big bet.
With that bet we also took on a few other vexing issues we had seen with Explo over the previous year. First, services colocation. Explo is hosted on Amazon in US East (through Heroku). Not all of Explo’s customers were hosted in US East of course. We wanted to be able to colocate our compute closer to our customers as needed.
Second, we increasingly had requests for on-premise or partial on-premise hosting as we onboarded larger SaaS customers. Maintaining many instances of an application across infrastructure we didn’t maintain (without dev ops, mind you) wasn’t the business we wanted to be in. Finding a solution to support customers that required on-premise hosting was a nice-to-have.
Finally, some prospects, often for regulatory reasons, required data to stay resident within their home territories but were open to cloud hosting. In tandem with our first problem (services colocation), we wanted to support Amazon (or other provider) regions located in specific parts of the world, meeting the regulatory requirements of our prospects while still providing the ease of a fully managed solution.
We decided to implement a separate, isolated microservice to handle customer database interactions. The Fast Interoperable Data Orchestrator, or FIDO for short, will handle all customer database interactions, maintaining persistent SSH tunnels and persistent database sessions (for those databases that support them).
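To make the idea of persistent sessions concrete, here is a minimal sketch of a per-datasource connection registry that opens a connection once and hands the same one back on later queries. It is not FIDO’s implementation: `ConnectionRegistry`, `get`, `drop`, and the `connect` callable (which would open the SSH tunnel and database session in a real service) are hypothetical names chosen for the example.

```python
import threading
from typing import Callable, Dict, Generic, Optional, TypeVar

C = TypeVar("C")


class ConnectionRegistry(Generic[C]):
    """Keep one long-lived connection per datasource instead of one per query."""

    def __init__(self, connect: Callable[[str], C]) -> None:
        self._connect = connect          # hypothetical: opens tunnel + session
        self._conns: Dict[str, C] = {}
        self._lock = threading.Lock()

    def get(self, datasource_id: str) -> C:
        """Return the existing connection, opening it only on first use."""
        # Simplification: the lock is held while connecting, so a slow
        # connect briefly blocks lookups for other datasources.
        with self._lock:
            conn = self._conns.get(datasource_id)
            if conn is None:
                conn = self._connect(datasource_id)
                self._conns[datasource_id] = conn
            return conn

    def drop(self, datasource_id: str) -> Optional[C]:
        """Forget a connection (for example after an error) so the next get reconnects."""
        with self._lock:
            return self._conns.pop(datasource_id, None)


if __name__ == "__main__":
    opened = []
    registry = ConnectionRegistry(lambda ds: opened.append(ds) or f"conn:{ds}")
    registry.get("warehouse-a")
    registry.get("warehouse-a")  # reused, no second connect
    print(opened)
```

With this shape, the connection setup cost is paid once per datasource rather than once per query.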
What is FIDO? It is a sharded, multi-tenant microservice with lightweight clustering. It manages database connections to registered datasources, coordinates the scaling of connection pools, and responds to requests for analytic and data sharing queries. The application routes internally to the appropriate shard: each instance provides an HTTP interface for querying and forwards requests to the cluster member that maintains shard state for the requested datasource connection.
A FIDO installation is registered on a per-account basis. Customers who want or need to, for architectural or regulatory reasons, can maintain their own FIDO installation, registered with Explo’s cloud application. This application topology allows us to provide the complete cloud Explo experience without our customers’ data ever passing through infrastructure that they do not host themselves. Explo will also maintain a set of FIDO installations in common geographic regions, further decreasing customer latency by placing compute close to end users (and the backing data sources).
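One way such routing could work is sketched below. How FIDO actually assigns datasources to cluster members is not described here, so this example assumes rendezvous (highest-random-weight) hashing as a stand-in technique; the member names and the `owner_for` and `route` helpers are hypothetical.

```python
import hashlib
from typing import Sequence


def _score(member: str, datasource_id: str) -> int:
    """Stable pseudo-random weight for a (member, datasource) pair."""
    digest = hashlib.sha256(f"{member}|{datasource_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner_for(datasource_id: str, members: Sequence[str]) -> str:
    """Pick the member that owns a datasource: the one with the highest score."""
    return max(members, key=lambda m: _score(m, datasource_id))


def route(datasource_id: str, self_name: str, members: Sequence[str]) -> str:
    """Decide whether this instance serves the query or forwards it to the owner."""
    owner = owner_for(datasource_id, members)
    return "local" if owner == self_name else f"forward:{owner}"


if __name__ == "__main__":
    cluster = ["fido-1", "fido-2", "fido-3"]
    for ds in ("ds-42", "ds-43", "ds-44"):
        print(ds, "->", route(ds, "fido-1", cluster))
```

Because every instance computes the same owner from the same member list, any instance can accept an HTTP request and either answer it or forward it. Under this scheme, removing a member only moves the datasources that member owned.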
How is it going? For nearly all data sources, we can shorten query times by up to 5 seconds, depending on geographic location and SSH bastion constraints. That is often the difference between a reasonable user experience on a BI dashboard or analytics report and severe user frustration.
We’re thrilled to soon onboard new and existing customers onto FIDO.