Image credits: https://www.freecodecamp.org/news/content/images/size/w2000/2023/04/pexels-barry-tan-7994953.jpg

Securing north/south and east/west traffic @ DevRev

Published in FACILELOGIN, Aug 15, 2023

At DevRev, we are building an API-first, dev-centric CRM that leverages data, design, and machine intelligence to empower developers (devs) to build, support, and grow their customers (revs) in the era of product-led growth. This blog post shares some insights on how we secure DevRev APIs (north/south traffic) at the edge, and how we secure service-to-service interactions (east/west traffic).

The DevRev platform is designed to scale up to 1 million dev organizations and 1 billion rev users. At the time of this writing, even at this very early stage of the product, the DevRev APIs serve close to 1 million API requests a day. In terms of API performance, we require all APIs to operate with very low latency. With this in mind, we wanted our security design to admit only valid, legitimate traffic into the DevRev platform. Anything that does not look right, we reject at the edge.

At the edge, we use Fastly Next-Gen WAF (powered by Signal Sciences) to monitor for suspicious and anomalous API traffic and to protect, in real time, against attacks directed at our public APIs and origin servers. Once the requests pass through the WAF, we use Fastly Compute@Edge to validate each request.

Fastly provides us with the capability to execute our code at the edge through a WebAssembly module. We’ve developed an edge gateway in Rust, which compiles into a WebAssembly module. This Rust code is responsible for rejecting any API requests lacking a valid JWT. JWT verification is just one of the tasks we perform at the edge. The edge gateway responsibilities also encompass cache management, reporting API statistics to Google BigQuery, sending logs to Datadog, enforcing captcha, URL rewriting, CORS management, API allow-listing based on various parameters, and proxying traffic to secure S3 endpoints, among others. Furthermore, we are in the process of introducing coarse-grained authorization at the edge. This additional measure will assist in filtering only legitimate traffic to DevRev services. The entire Rust code executing at the edge takes no longer than 5 ms to complete its tasks.

Fastly Compute@Edge serves as the entry point at the edge for DevRev services. At the origin, an API gateway intercepts all incoming traffic. The responsibilities of this API gateway go far beyond the functionalities typically found in open source or commercial API gateways. In fact, it functions as both an API gateway and an integrator, developed in-house at DevRev. Throughout the remainder of this blog, we will refer to it as the DevRev gateway.

As a second level of defense, we verify the JWT again at the origin using the DevRev gateway, even though this check is redundant. Ideally, we should not receive any 401 errors from the origin, and we actively monitor this using Datadog alerts. Verifying a JWT takes less than 2 ms at the origin. Additionally, we have implemented a token-based authentication mechanism between the Fastly edge and the DevRev gateway. This, coupled with IP allowlisting, ensures that no request can bypass the Fastly edge to reach the DevRev gateway.
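To make the origin-side check concrete, here is a minimal sketch in Go of what verifying a JWT involves: splitting the compact token, checking the algorithm, verifying the signature, and rejecting expired tokens. It is an illustration only, not the DevRev gateway code. The post does not say which signing algorithm is used, so HS256 with a shared demo key is an assumption made to keep the sketch dependency-free; a production gateway would more likely verify an asymmetric signature against the issuer's public key. The names `verifyHS256`, `signHS256`, and `Claims` are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Claims holds only the two claims this sketch inspects.
type Claims struct {
	Subject   string `json:"sub"`
	ExpiresAt int64  `json:"exp"` // seconds since the Unix epoch
}

var b64 = base64.RawURLEncoding

// mac computes HMAC-SHA256 over the JWT signing input.
func mac(input string, key []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write([]byte(input))
	return m.Sum(nil)
}

// verifyHS256 rejects tokens that are malformed, use another algorithm,
// carry a bad signature, or have expired at nowUnix.
func verifyHS256(token string, key []byte, nowUnix int64) (*Claims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("malformed token")
	}
	headerJSON, err := b64.DecodeString(parts[0])
	if err != nil {
		return nil, errors.New("bad header encoding")
	}
	var header struct {
		Alg string `json:"alg"`
	}
	if json.Unmarshal(headerJSON, &header) != nil || header.Alg != "HS256" {
		return nil, errors.New("unsupported algorithm")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil || !hmac.Equal(sig, mac(parts[0]+"."+parts[1], key)) {
		return nil, errors.New("invalid signature")
	}
	payload, err := b64.DecodeString(parts[1])
	if err != nil {
		return nil, errors.New("bad payload encoding")
	}
	var c Claims
	if json.Unmarshal(payload, &c) != nil {
		return nil, errors.New("bad claims")
	}
	if nowUnix >= c.ExpiresAt {
		return nil, errors.New("token expired")
	}
	return &c, nil
}

// signHS256 exists only so the sketch has a token to verify.
func signHS256(c Claims, key []byte) string {
	header := b64.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	body, _ := json.Marshal(c)
	input := header + "." + b64.EncodeToString(body)
	return input + "." + b64.EncodeToString(mac(input, key))
}

func main() {
	key := []byte("demo-secret") // hypothetical key, for illustration only
	token := signHS256(Claims{Subject: "user-1", ExpiresAt: 2000}, key)
	if _, err := verifyHS256(token, key, 1000); err != nil {
		fmt.Println("rejected:", err) // a gateway would answer 401 here
		return
	}
	fmt.Println("accepted")
}
```

The current time is passed in as a parameter so the expiry check is easy to exercise. A real verifier would also check the issuer and audience claims.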

The JWT carries the identity of the API user. An API user can be one of the following types:

An Auth0 user: We utilize Auth0 as the trusted Identity Provider for the DevRev platform. Auth0 authenticates users through methods such as OTP over email, social connections, and enterprise connections. To access the DevRev web app or mobile app, users must first authenticate via Auth0. Auth0 assigns a distinctive identification to each user known as the Auth0 user ID. This ID is formed by combining the connection name with the immutable identifier specific to the user within the associated connection.
A Dev user: A Dev user is a member of a Dev organization within the DevRev platform. All Dev users are Auth0 users; however, the reverse is not necessarily true. The DevRev web app and mobile app invoke APIs on behalf of Dev users, or the Dev users themselves can directly invoke DevRev APIs.
A Rev user: A Rev user is a customer of a Dev organization and has the authorization to access specific DevRev APIs. In most cases, the DevRev main app doesn’t actively authenticate Rev users; instead, it relies on the corresponding Dev organization for authentication. Based on a trust relationship with the Dev organization, Rev users are granted access to DevRev APIs. However, the DevRev support portal permits Rev users to log in directly. In an upcoming blog post, we will delve into the details of building this trust relationship and explain how we authenticate Rev users at both the edge and the origin.
A service account: A service account represents an application that communicates with the DevRev APIs. For instance, when you integrate the DevRev PLuG widget into your web app or use the PLuG mobile SDK in your mobile app, the PLuG functions as a service account. A service account can access DevRev APIs independently or on behalf of a Dev user or a Rev user.

The DevRev gateway at the origin serves as the entry point to the DevRev microservices backend. Once it verifies the JWT accompanying the API request, the gateway dispatches the request to the appropriate service. All services are developed in Golang and communicate with each other using gRPC.

The gateway and all other services are deployed within a Kubernetes cluster. Each service operates within its own namespace and is deployed behind an Envoy proxy. When a service spins up, it is provisioned with a key by Istio, which also manages key rotation. Each service then uses these keys for mTLS authentication with other services. The same applies to the gateway.

mTLS is good enough to identify a service, but it has its own challenges as well. We’ve built a service-to-service authentication mechanism that combines mTLS with JWT for the following reasons:

Flexibility and decoupling: JWT can be used in scenarios where you need more flexibility and decoupling between services. It allows you to issue tokens that can carry various claims and information about the user or entity. This can be useful in scenarios where you want to provide fine-grained access control or share specific user attributes between services.
Statelessness: JWT is a stateless authentication mechanism, meaning the server doesn’t need to store token-related information. This can be advantageous when scalability and performance are crucial, as the server doesn’t need to maintain session-related data.
Cross-Domain Communication: JWT can be used for cross-domain communication between different services. Since JWTs are self-contained and can carry service-related information, they can facilitate communication between services without requiring direct interaction or shared session state.

When a service spins up within the DevRev platform, it talks to the STS (Security Token Service) deployed in the same Kubernetes cluster. Through mTLS authentication, the service requests a JWT. This particular JWT is referred to as the Application Access Token (AAT). The AAT’s subject is a system-generated identifier linked to the Kubernetes service name of the corresponding service making the AAT request. In simpler terms, an AAT is accompanied by a corresponding service account, and the AAT’s subject is the identifier of that service account, which we call a service account DON.

The URI field within the X509 certificate that Istio issues to each service (or workload) contains the SPIFFE ID linked to that specific service. When the STS issues a JWT for a service that authenticates with the STS through mTLS, it takes the SPIFFE ID found in the incoming X509 certificate and adds it as a claim to the JWT it creates and returns to the service. This process effectively binds the JWT to the service identity behind the mTLS connection.
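The binding step can be sketched in Go as follows: read the SPIFFE ID from the URI entries of the peer certificate and copy it into the claims of the token being issued. This is a simplified illustration under stated assumptions, not the STS implementation. The claim name `spiffe_id`, the type `ServiceClaims`, and the helpers `issueClaims` and `certWithURI` are hypothetical, and signing the resulting claims is left out. The sample ID follows the `spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>` form that Istio uses.

```go
package main

import (
	"crypto/x509"
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
)

// ServiceClaims is a hypothetical claim set for a service token.
type ServiceClaims struct {
	Subject   string `json:"sub"`       // service account identifier
	SpiffeID  string `json:"spiffe_id"` // hypothetical claim name
	ExpiresAt int64  `json:"exp"`       // seconds since the Unix epoch
}

// spiffeIDFromCert returns the first SPIFFE URI found in the certificate.
func spiffeIDFromCert(cert *x509.Certificate) (string, error) {
	for _, u := range cert.URIs {
		if u.Scheme == "spiffe" && u.Host != "" {
			return u.String(), nil
		}
	}
	return "", errors.New("peer certificate carries no SPIFFE ID")
}

// issueClaims builds the claims for a token bound to the mTLS peer.
func issueClaims(peer *x509.Certificate, serviceAccount string, nowUnix, ttlSeconds int64) (ServiceClaims, error) {
	id, err := spiffeIDFromCert(peer)
	if err != nil {
		return ServiceClaims{}, err
	}
	return ServiceClaims{
		Subject:   serviceAccount,
		SpiffeID:  id,
		ExpiresAt: nowUnix + ttlSeconds,
	}, nil
}

// certWithURI builds a stand-in certificate for the demo. In a real server
// the peer certificate would come from the TLS connection state.
func certWithURI(raw string) *x509.Certificate {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return &x509.Certificate{URIs: []*url.URL{u}}
}

func main() {
	peer := certWithURI("spiffe://cluster.local/ns/janus/sa/janus")
	claims, err := issueClaims(peer, "janus-service-account", 1000, 300)
	if err != nil {
		fmt.Println("refused:", err)
		return
	}
	out, _ := json.Marshal(claims)
	fmt.Println(string(out))
}
```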

Each microservice is linked to a predefined service account, and a particular service has the ability to establish its own access control policies for these service accounts. For instance, the Janus service might permit read operations from the gateway service account, while the codex service could enable the gateway service account to impersonate a specific group of Rev users.
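A per-service policy of this kind can be pictured as a small lookup from the calling service account to the actions it may perform. The Go sketch below is only an illustration: the `Policy` type, the action strings, and the group name `rev-group-a` are all hypothetical, and the post does not describe how DevRev actually represents its policies.

```go
package main

import "fmt"

// Policy maps a calling service account to the actions it may perform
// on the service that owns the policy.
type Policy map[string]map[string]bool

// Allows reports whether the service account may perform the action.
// Unknown accounts and unknown actions are denied by default.
func (p Policy) Allows(serviceAccount, action string) bool {
	return p[serviceAccount][action]
}

func main() {
	// Hypothetical policies owned by two different services.
	janus := Policy{"gateway": {"read": true}}
	codex := Policy{"gateway": {"impersonate:rev-group-a": true}}

	fmt.Println(janus.Allows("gateway", "read"))
	fmt.Println(janus.Allows("gateway", "impersonate:rev-group-a"))
	fmt.Println(codex.Allows("gateway", "impersonate:rev-group-a"))
}
```

Because each service owns its own policy, the same service account can hold different rights at different services.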

At the end of the day, every service is provisioned with a JWT, which it utilizes to access upstream microservices. These JWTs are of short duration, and as they approach expiration, the corresponding service is required to communicate with the STS once more to obtain a fresh JWT.
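The refresh behavior can be sketched as a small token holder that returns the cached JWT while it is fresh and asks for a new one once the token is close to expiring. This is a minimal sketch under assumptions: `TokenSource`, its `Fetch` callback (standing in for the call to the STS over mTLS), and the 60-second refresh margin are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// Token is a raw JWT together with its expiry (seconds since the Unix epoch).
type Token struct {
	Raw       string
	ExpiresAt int64
}

// TokenSource caches one token and refreshes it shortly before it expires.
type TokenSource struct {
	Fetch         func(nowUnix int64) (Token, error) // stand-in for the STS call
	RefreshBefore int64                              // refresh margin in seconds

	mu      sync.Mutex
	current Token
}

// Get returns the cached token, fetching a fresh one when none is cached
// or when the cached one is within RefreshBefore seconds of expiry.
func (s *TokenSource) Get(nowUnix int64) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.current.Raw == "" || nowUnix >= s.current.ExpiresAt-s.RefreshBefore {
		fresh, err := s.Fetch(nowUnix)
		if err != nil {
			return "", err
		}
		s.current = fresh
	}
	return s.current.Raw, nil
}

func main() {
	calls := 0
	src := &TokenSource{
		RefreshBefore: 60,
		Fetch: func(nowUnix int64) (Token, error) {
			calls++
			// Pretend the STS issues tokens valid for five minutes.
			return Token{Raw: fmt.Sprintf("aat-%d", calls), ExpiresAt: nowUnix + 300}, nil
		},
	}
	for _, now := range []int64{0, 100, 239, 240} {
		tok, _ := src.Get(now)
		fmt.Println(now, tok)
	}
}
```

In this run the first three calls share one token, and the call at second 240 triggers a refresh because the token expires at second 300.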

One fundamental best practice when generating a JWT is to define a restrictive audience. For instance, referring back to the service-to-service authentication using JWT discussed in the preceding section, the token generated by the STS for the gateway’s communication with the Janus service should specifically carry ‘janus’ as the audience value. Consequently, the Janus service cannot utilize the same JWT received from the gateway to communicate with the STS. This is because the token’s audience is ‘janus’, while the STS anticipates a token with an audience value of ‘sts’.
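The audience check described here amounts to a simple rule: a service accepts a token only if its own name appears in the token's `aud` claim. The Go sketch below illustrates that rule on an already-verified token payload. The function name `checkAudience` and the sample payload are hypothetical. The `Audience` type handles both forms the JWT specification allows for `aud`, a single string or an array of strings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Audience accepts both JSON forms of the "aud" claim: a string or an array.
type Audience []string

func (a *Audience) UnmarshalJSON(b []byte) error {
	var one string
	if err := json.Unmarshal(b, &one); err == nil {
		*a = Audience{one}
		return nil
	}
	var many []string
	if err := json.Unmarshal(b, &many); err != nil {
		return err
	}
	*a = many
	return nil
}

// checkAudience returns nil only if self is listed in the payload's "aud".
// The payload is assumed to come from a token whose signature was verified.
func checkAudience(payload string, self string) error {
	var claims struct {
		Aud Audience `json:"aud"`
	}
	if err := json.Unmarshal([]byte(payload), &claims); err != nil {
		return err
	}
	for _, aud := range claims.Aud {
		if aud == self {
			return nil
		}
	}
	return fmt.Errorf("audience %v does not include %q", []string(claims.Aud), self)
}

func main() {
	gatewayToJanus := `{"sub":"gateway","aud":"janus"}`
	fmt.Println(checkAudience(gatewayToJanus, "janus")) // check as run by Janus
	fmt.Println(checkAudience(gatewayToJanus, "sts"))   // check as run by the STS
}
```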

One drawback of this model is that it would lead to more frequent interactions between the STS and other services, resulting in increased communication overhead. Furthermore, each service would be required to manage distinct tokens for every upstream service it engages with. While we opted not to adopt this model with different audience values, we were still unwilling to take the risk of one service employing a token from another service to access an upstream service, essentially impersonating the original service.

Binding the JWT to the SPIFFE ID associated with the X509 certificate of a particular service proves beneficial in this context. Every upstream service not only verifies the JWT received from the downstream service but also confirms whether it is tied to the SPIFFE ID related to the underlying mTLS connection. This mechanism ensures that the Janus service cannot utilize the JWT acquired from the gateway to gain access to the STS as if it were the gateway.
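The receiving side of this check can be sketched in Go as a comparison between the SPIFFE ID carried in the token and the SPIFFE ID in the caller's certificate. This is an illustration under assumptions, not DevRev's code: `verifyBinding` and `certWithURI` are hypothetical names, and the post does not say how the peer identity reaches the service (for example, directly from the TLS connection or through the sidecar proxy), so the sketch simply takes the peer certificate as an argument.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"net/url"
)

// verifyBinding accepts the token only if the SPIFFE ID it is bound to
// matches a SPIFFE URI in the certificate of the calling workload.
func verifyBinding(tokenSpiffeID string, peer *x509.Certificate) error {
	if tokenSpiffeID == "" {
		return errors.New("token carries no SPIFFE ID claim")
	}
	for _, u := range peer.URIs {
		if u.Scheme == "spiffe" && u.String() == tokenSpiffeID {
			return nil
		}
	}
	return fmt.Errorf("token is bound to %s, not to the calling workload", tokenSpiffeID)
}

// certWithURI builds a stand-in peer certificate for the demo.
func certWithURI(raw string) *x509.Certificate {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return &x509.Certificate{URIs: []*url.URL{u}}
}

func main() {
	const gatewayID = "spiffe://cluster.local/ns/gateway/sa/gateway"
	const janusID = "spiffe://cluster.local/ns/janus/sa/janus"

	// The gateway presents its own token over its own mTLS connection.
	fmt.Println(verifyBinding(gatewayID, certWithURI(gatewayID)))

	// Janus replays the gateway's token over Janus's mTLS connection.
	fmt.Println(verifyBinding(gatewayID, certWithURI(janusID)))
}
```

The second call fails because the token names the gateway's identity while the connection belongs to Janus, which is the replay scenario the paragraph above rules out.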

Alongside the service context, the interactions between services also include the user context. The gateway forwards the JWT it receives from the client to the upstream services when necessary. This JWT carries the user context. In the current model, these client JWTs might originate from two different issuers: Auth0 and the STS. However, as we move forward, our goal is for all services to exclusively trust STS-issued tokens. This implies that clients will need to exchange the token they receive from Auth0 for an STS-issued token before gaining access to DevRev APIs.

Why would the client need to exchange the Auth0 token for an STS token? Why shouldn’t the gateway handle this conversion in the background and pass the STS-issued token to the upstream services? The gateway could do so, but it would then have to interact with the STS far more often, exchanging a token for each request. Such an approach introduces unnecessary overhead.

A given service possessing the JWT containing the user context will not have unrestricted access to any arbitrary service using that JWT. We enforce stringent access control policies at each service, ensuring that incoming requests are processed only after evaluating not only the user context but also the corresponding service context.

A token that has been previously issued to a client or user, whether by Auth0 or the STS, can be revoked for two reasons. The associated user or the organization to which the user belongs may no longer be part of the DevRev platform, or the user themselves or an admin of the organization could explicitly revoke a specific token.

To address the former, the DevRev gateway verifies the active status of a user or organization within the platform after each token validation. In order to reduce unnecessary service calls and database queries, the gateway maintains a cache of recognized users.
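A cache like this can be sketched as a map of recently checked users with a time-to-live, so that the backing lookup runs only when an entry is missing or stale. The Go code below is a minimal sketch under assumptions: `StatusCache`, the `lookup` callback (standing in for the service call or database query), and the 30-second TTL are hypothetical, and the post does not describe the gateway's actual cache or its eviction policy.

```go
package main

import (
	"fmt"
	"sync"
)

type cachedStatus struct {
	active    bool
	expiresAt int64 // seconds since the Unix epoch
}

// StatusCache remembers whether a user is active for ttl seconds.
type StatusCache struct {
	mu      sync.Mutex
	ttl     int64
	entries map[string]cachedStatus
	lookup  func(userID string) (bool, error) // stand-in for the real check
}

func NewStatusCache(ttlSeconds int64, lookup func(string) (bool, error)) *StatusCache {
	return &StatusCache{ttl: ttlSeconds, entries: map[string]cachedStatus{}, lookup: lookup}
}

// IsActive answers from the cache when the entry is fresh and falls back
// to the lookup otherwise. The lock is held during the lookup for
// simplicity, which serializes concurrent misses.
func (c *StatusCache) IsActive(userID string, nowUnix int64) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[userID]; ok && nowUnix < e.expiresAt {
		return e.active, nil
	}
	active, err := c.lookup(userID)
	if err != nil {
		return false, err
	}
	c.entries[userID] = cachedStatus{active: active, expiresAt: nowUnix + c.ttl}
	return active, nil
}

func main() {
	lookups := 0
	cache := NewStatusCache(30, func(userID string) (bool, error) {
		lookups++
		return userID != "removed-user", nil
	})
	for _, now := range []int64{0, 10, 20} {
		active, _ := cache.IsActive("user-1", now)
		fmt.Println(now, active)
	}
	fmt.Println("lookups:", lookups)
}
```

The TTL bounds how long a removed user can keep passing the check, so it trades freshness against the number of backend calls.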

To explicitly revoke a token, the client can make use of the revoke API provided by the STS. After a token is revoked, the STS includes the metadata related to the revoked token in a cache accessible to the gateway for reading. The gateway then rejects any tokens corresponding to the token metadata found within the revoked token cache. We are currently working on making this list of revoked tokens available to the Fastly edge gateway, which will then reject any requests carrying a revoked token at the edge itself.
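The revocation check can be sketched as a set of revoked token identifiers that the gateway consults after validating a token. This Go sketch rests on assumptions the post does not confirm: it treats the revoked-token metadata as a token ID (such as the JWT `jti` claim) plus the token's expiry, and the names `RevocationList`, `Revoke`, `IsRevoked`, and `Prune` are hypothetical. Keeping the expiry allows entries to be dropped once the token would have expired anyway.

```go
package main

import (
	"fmt"
	"sync"
)

// RevocationList records revoked token IDs with each token's expiry.
type RevocationList struct {
	mu      sync.RWMutex
	revoked map[string]int64 // token ID -> expiry, seconds since the Unix epoch
}

func NewRevocationList() *RevocationList {
	return &RevocationList{revoked: map[string]int64{}}
}

// Revoke marks a token as revoked until its natural expiry.
func (r *RevocationList) Revoke(tokenID string, expiresAt int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.revoked[tokenID] = expiresAt
}

// IsRevoked reports whether the token ID is on the list.
func (r *RevocationList) IsRevoked(tokenID string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	_, found := r.revoked[tokenID]
	return found
}

// Prune drops entries for tokens that have expired and returns how many
// were removed. Expired tokens fail validation on their own.
func (r *RevocationList) Prune(nowUnix int64) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	removed := 0
	for id, exp := range r.revoked {
		if exp <= nowUnix {
			delete(r.revoked, id)
			removed++
		}
	}
	return removed
}

func main() {
	list := NewRevocationList()
	list.Revoke("token-123", 2000)

	for _, id := range []string{"token-123", "token-456"} {
		if list.IsRevoked(id) {
			fmt.Println(id, "rejected: revoked")
		} else {
			fmt.Println(id, "accepted")
		}
	}
}
```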

In this blog post, we provided a high-level overview of how we secure both north/south and east/west traffic at DevRev. In future blog posts, we will delve deeper into the key aspects of the DevRev microservices security design.

Written by Prabath Siriwardena