This discussion provides an overview of the common architectures used in risk management systems in capital markets and how the advent of new technologies—namely the cloud and streaming—have impacted their designs.
By Vassil Avramov, Chief Technology Officer, Ness
There are multiple flavors of risk management systems between Middle and Front Office, depending on whether the focus is on hedging market risk, credit risk, or counterparty risk. While each analysis can vary widely, their IT architecture tends to be reasonably similar. There is, however, a clear difference in the way that near-time and event-based systems are architected compared to batch and on-demand ones. We’ll look at the common architectures for batch and event-based risk systems and how they map to common cloud offerings. We’ll also discuss new technologies and approaches allowing us to unify both architectures under a common architecture with valuable efficiencies.
Batch and On-Demand Risk Architectures
The general architecture for batch and on-demand risk systems is well established. While multiple technology choices are in play for each risk system, they almost always boil down to the following diagram.
One of the important techniques to note is that we try to partition the risk job into two distinct phases:
- A compute phase that breaks the job down into calculations (e.g., Monte Carlo or HVaR) that can be performed independently. This phase produces raw results, such as PnL vectors, that still need to be aggregated.
- An aggregation phase that takes the raw results and pivots them into a Hyper Cube or Data Warehouse—an example of this is getting the expected loss with 99% confidence on a specific book, account, portfolio, or security.
Of course, these are essentially the Map and Reduce phases. One of the benefits of the two distinct phases is that we can calculate the Risk or PnL vectors on a security once and then aggregate them into positions and different confidence levels in multiple ways–all without recalculating any unitary vector more than once. Another benefit is that the first phase is highly parallelizable, while the second is only partially so. For example, we can aggregate non-overlapping portfolios or positions in parallel, but each individual aggregation unit must be processed serially.
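To make the two phases concrete, here is a minimal Python sketch of the Map/Reduce split: per-security Monte Carlo PnL vectors are simulated independently (the parallelizable phase) and then summed scenario-by-scenario so the 99% expected loss can be read off the aggregate. The two-security book, the one-factor PnL model, and all names are hypothetical illustrations, not a production pricing model.

```python
import random

def simulate_pnl_vector(spot, vol, n_scenarios, seed):
    """Map phase: simulate a 1-day Monte Carlo PnL vector for one security."""
    rng = random.Random(seed)
    return [spot * vol * rng.gauss(0.0, 1.0) for _ in range(n_scenarios)]

def aggregate_var(pnl_vectors, confidence=0.99):
    """Reduce phase: sum scenario-aligned vectors, then read off the loss quantile."""
    n = len(pnl_vectors[0])
    portfolio = sorted(sum(v[i] for v in pnl_vectors) for i in range(n))
    idx = int((1.0 - confidence) * n)   # index of the (1 - confidence) PnL quantile
    return -portfolio[idx]              # expected loss is the negated PnL quantile

# Hypothetical two-security book: (spot, 1-day volatility) per name.
book = {"SEC_A": (190.0, 0.02), "SEC_B": (420.0, 0.015)}
vectors = [simulate_pnl_vector(spot, vol, 10_000, seed=i)  # each independently parallelizable
           for i, (spot, vol) in enumerate(book.values())]
var_99 = aggregate_var(vectors, confidence=0.99)
print(f"99% 1-day VaR: {var_99:.2f}")
```

Because the unitary vectors are kept, the same `vectors` can be re-aggregated at other confidence levels or portfolio groupings without re-running the simulation.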
A quick recap of the flow is as follows:
- Arrival of the request at the API gateway
- Enrichment/de-referencing
- Splitting and grouping
- Sending results to a Data Warehouse or Hyper Cube
- Analysis
Leveraging the Cloud
Traditionally, the challenge of designing batch and on-demand risk systems involves compute resources, the proximity of the data to those compute resources, and the large amounts of data that need to be aggregated. With the effectively unlimited compute resources of the cloud and some smart caching and distribution strategies, Ness teams have converted overnight risk jobs at some of the largest exchanges to run in tens of minutes or even less.
There are multiple ways to map the logical architecture described above onto various cloud offerings, each with unique performance, manageability, and portability tradeoffs. Below are two examples, one focused on being serverless and the other on portability. Of course, there are several other options, as well as various hybrids; the goal of these simplified diagrams is to illustrate the concepts. We’re focusing our technology choices on AWS because we believe it to be the most advanced cloud, with the most extensive set of offerings for most enterprises.
Limitations of Batch and On-Demand Architectures
Of course, there’s a physical limit to how quickly this job can be run, even in the cloud. Let’s call this the job duration. One of the complex problems that the above architecture doesn’t address is that static descriptions of Trade Data, Reference Data, and Market Data aren’t actually static at all. They’re fast-moving data streams that change at completely different velocities. Recalculating the whole portfolio at every market data tick isn’t possible—and probably isn’t meaningful anyway—so we may want to pick a different approach to address this limitation.
Common strategies include:
- Recalculating at fixed intervals (e.g., every 10 minutes if 10 minutes is greater than the job duration)
- Recalculating only the additive impact of new trades, using market data as of the start of the day or some other fixed point
- Recalculating when some piece of market data changes by some specific amount
- Recalculating at fixed intervals and then applying Taylor approximation to the results (this is essentially the equivalent of the lambda architecture for Map Reduce)
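As a sketch of the last strategy, a delta-gamma (second-order Taylor) approximation lets us restate portfolio value on every tick between full revaluations. The sensitivities and numbers below are hypothetical placeholders:

```python
def taylor_pnl(delta, gamma, d_spot):
    """Delta-gamma (second-order Taylor) approximation of PnL for a spot move."""
    return delta * d_spot + 0.5 * gamma * d_spot ** 2

# A full revaluation runs every 10 minutes and publishes delta/gamma as of that
# slice; between runs, each market tick is applied via the approximation.
base_value = 1_000_000.0          # portfolio value at the last full recalculation
delta, gamma = 5_000.0, -120.0    # sensitivities published by the last full run
for d_spot in (0.5, 1.2, -0.8):   # ticks observed since the last full run
    approx_value = base_value + taylor_pnl(delta, gamma, d_spot)
    print(f"move {d_spot:+.1f} -> approx value {approx_value:,.2f}")
```

The approximation drifts as the spot moves away from the expansion point, which is why the full recalculation still runs at fixed intervals to re-anchor it.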
The challenge, then, is to ensure that any risk calculation uses consistent time slices of data. For example, we want to make sure that when pricing a portfolio of trades at 11:00:00, all the trades use a consistent slice of market data as of 11:00:00 and that all reference data is sourced as of 11:00:00. This problem is compounded by the fact that most pricing models aren’t driven by observable market data, but by derived market data (e.g., yield curves, volatility surfaces, etc.).
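One common way to guarantee consistent time slices is to version every input and always read it "as of" the pricing timestamp. A minimal Python sketch, with a hypothetical store class and toy curve data:

```python
import bisect
from datetime import datetime

class AsOfStore:
    """Keeps timestamped versions of a value; serves the latest version at-or-before t."""
    def __init__(self):
        self._times, self._values = [], []

    def put(self, ts, value):
        # Versions are assumed to arrive in timestamp order in this sketch.
        self._times.append(ts)
        self._values.append(value)

    def get_as_of(self, ts):
        i = bisect.bisect_right(self._times, ts) - 1
        if i < 0:
            raise LookupError("no data as of requested time")
        return self._values[i]

curve = AsOfStore()
curve.put(datetime(2024, 5, 1, 10, 55), {"1Y": 0.051})
curve.put(datetime(2024, 5, 1, 11, 2), {"1Y": 0.052})

# Pricing "as of 11:00:00" must use the 10:55 curve, not the later one.
slice_1100 = curve.get_as_of(datetime(2024, 5, 1, 11, 0, 0))
```

The same as-of lookup applies to trade and reference data, so every input to a given pricing run is pinned to the one timestamp.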
Real-time and Streaming Risk Architectures
The challenges we outlined above fall squarely into the realm of problems that streaming architectures have evolved to address.
Before we dive into the details, it’s worth pointing out that the challenges of streaming—dealing with events being streamed into processing systems out of the temporal order in which they occurred—aren’t exclusive to capital markets. Think of the Internet of Things (IoT) and all the devices that can have an interrupted connection before syncing back up. Fortunately, because this problem is so common, the solutions developed to address it are highly scalable and easily adaptable to the world of event-based risk management.
Although a thorough overview of streaming concepts is outside this discussion’s scope, it’s important to touch on some of the critical semantics present in standard streaming products like Kafka Streams, Flink, and Spark Streaming.
Some key streaming concepts include:
- Bi-temporal events. Events typically have two essential time attributes: the time the event occurred (the event time) and the time it was received by the system that processes it (the processing time). For most logical purposes, we want to key off the event time.
- Windowing. Windowing is a high-level semantic concept that allows us to define intervals for operating on an event stream. One of the most common windowing strategies is non-overlapping, tumbling windows, which give us a consistent view of all events that have arrived in the system in, say, 5-minute intervals.
- Triggers. Triggers are useful, among other things, for dealing with late arrivals (e.g., when an event arrives well after its processing window has closed, but we still want it considered for that window).
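The three concepts combine naturally: assign each event to a tumbling window by its event time, track a watermark from the event times seen so far, and accept late arrivals within an allowed-lateness budget. A minimal Python sketch, with arbitrary window size and lateness budget:

```python
from collections import defaultdict

WINDOW = 300           # 5-minute tumbling windows, keyed by event time (seconds)
ALLOWED_LATENESS = 60  # still accept events up to 60s past the window close

def window_start(event_time):
    return event_time - (event_time % WINDOW)

windows = defaultdict(list)
watermark = 0  # highest event time observed so far

def on_event(event_time, payload):
    """Assign by event time, not arrival order; drop only beyond allowed lateness."""
    global watermark
    watermark = max(watermark, event_time)
    w = window_start(event_time)
    if watermark - (w + WINDOW) > ALLOWED_LATENESS:
        return False               # too late: the window has been finalized
    windows[w].append(payload)
    return True

on_event(1000, "a")                # window [900, 1200)
on_event(1210, "b")                # window [1200, 1500); advances the watermark
on_event(1190, "late-but-ok")      # late for [900, 1200), within the budget
```

Engines like Kafka Streams, Flink, and Spark Streaming implement these semantics natively; the point here is only the event-time-versus-watermark bookkeeping.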
Logical Architecture of Streaming Systems
The goal of a logical streaming architecture is to invert the flow of the diagram: rather than pulling data in micro-batches, the event stream pushes updates to subscribers’ analytics views.
Streaming applications are typically based on stateful stream processing. By this, we mean that processing is achieved by a series of tasks applied to streaming data, where each task is horizontally scalable and operates on its own partitioned, self-contained slice of the state.
Data and computation are co-located with local data access (in-memory or on disk) to achieve the desired processing speed. The system computes an optimal execution graph to avoid, as much as possible, shuffling data between horizontally distributed computation nodes.
To improve recovery time in case of failure, it’s best practice to periodically write snapshots of this state to remote persistent storage, so that processing can be restored from the last snapshot taken before the failure and replayed forward.
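A toy illustration of this checkpointing pattern in Python: a stateful operator snapshots its state, including the last processed source offset, every few events; recovery restores the snapshot so the source can replay from the next offset. The class and local-file snapshot are hypothetical simplifications of what engines such as Flink do with distributed, asynchronous snapshots to remote storage:

```python
import json
import os
import tempfile

class CheckpointedCounter:
    """Stateful operator that snapshots its state every N events for fast recovery."""
    def __init__(self, path, snapshot_every=100):
        self.path, self.snapshot_every = path, snapshot_every
        self.state = {"count": 0, "offset": -1}  # offset = last processed event

    def process(self, offset, _event):
        self.state["count"] += 1
        self.state["offset"] = offset
        if self.state["count"] % self.snapshot_every == 0:
            self._snapshot()

    def _snapshot(self):
        # In production this would go to remote durable storage (e.g., object store).
        with open(self.path, "w") as f:
            json.dump(self.state, f)

    def recover(self):
        """Restore the last snapshot; the source then replays from offset + 1."""
        with open(self.path) as f:
            self.state = json.load(f)
        return self.state["offset"] + 1

path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
op = CheckpointedCounter(path, snapshot_every=3)
for off in range(7):
    op.process(off, object())

# Simulate a crash: a fresh operator restores the last snapshot (taken at offset 5)
# and resumes replay from offset 6, instead of reprocessing the whole stream.
fresh = CheckpointedCounter(path)
resume_from = fresh.recover()
```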
Let’s re-examine the steps of batch and on-demand systems described above and see how they apply to a streaming architecture.
- Arrival of the request at the API gateway – With streaming, there is no request: pipelines are already running and consuming published data. There is, of course, an orchestration layer to start and stop these pipelines. However, since we’re likely running cloud infrastructure with Infrastructure as Code, the orchestrator is itself just another processing component that reacts to submissions requesting specific risk flows.
- Enrichment/de-referencing – A transform/load pipeline turns incoming data into risk-model objects, abstracting data for Compute and Aggregate processes and isolating changes in the incoming data format.
- Splitting and grouping – Keyed partitioning of the stream distributes independent calculations across workers.
- Sending results to a Data Warehouse or Hyper Cube – Publish and persist results.
- Analysis – Depending on requirements, the streaming pipeline executes real-time business-intelligence logic.
The Path to Unified Batch and Streaming
One of the most elegant aspects of modern streaming architectures is that they treat a batch as a bounded stream. As a result, we can now support batch and streaming use cases within the same overarching risk architecture. The diagram below demonstrates a streaming risk architecture on AWS.
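The “batch is a bounded stream” idea can be shown in a few lines of Python: one pipeline definition consumes either a finite list (batch) or an unbounded generator (streaming), emitting running aggregates either way. The event schema and pipeline stages here are hypothetical:

```python
import itertools

def risk_pipeline(events):
    """One pipeline definition: enrich -> compute -> aggregate, input-agnostic."""
    enriched = ({"id": e["id"], "pnl": e["qty"] * e["px_move"]} for e in events)
    total = 0.0
    for row in enriched:
        total += row["pnl"]
        yield total          # running aggregate, emitted incrementally

batch = [{"id": 1, "qty": 10, "px_move": 0.5},
         {"id": 2, "qty": -4, "px_move": 1.25}]

# Batch: a bounded stream -- drain it fully and keep the final aggregate.
eod_total = list(risk_pipeline(batch))[-1]

def ticker():                # stand-in for an unbounded event source
    for i in itertools.count(1):
        yield {"id": i, "qty": 1, "px_move": 0.1}

# Streaming: the very same pipeline, consumed incrementally as events arrive.
live = risk_pipeline(ticker())
first_three = [next(live) for _ in range(3)]
```

Because the pipeline never assumes its input ends, the same code serves the overnight batch run and the intraday event feed, which is precisely the unification the modern engines offer at scale.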
Risk architectures have always been an ideal cloud use case: they require large amounts of intermittent compute resources and are thus ideally suited to the elasticity of the cloud. At Ness, we’ve seen many financial institutions start to leverage the public cloud for these workloads. Undoubtedly, cloud usage will increase as enterprises become increasingly comfortable leveraging the cloud’s advanced capabilities securely and efficiently.
Throughout this discussion, we’ve pointed out that traditional batch risk architectures can now run primarily on serverless and fully managed cloud offerings, increasing their efficiency and scalability. Better still, we can build near-time, high-volume event-based risk systems and traditional batch risk systems for our clients on the same managed architecture.