This discussion provides an overview of the common architectures used in risk management systems in capital markets and how the advent of new technologies—namely the cloud and streaming—have impacted their designs. There are multiple flavors of risk management systems between Middle and Front Office, depending on whether the focus is on hedging market risk, credit risk, or counterparty risk. While the analysis provided by each can vary widely, their IT architecture tends to be reasonably similar. There is, however, a clear difference in the way that near-time and event-based systems are architected compared to batch and on-demand ones.
We’ll look at the common architectures for both batch and event-based risk systems and how they map to common cloud offerings. We’ll also discuss some new technologies and approaches allowing us to unify both architectures under a common architecture with valuable efficiencies.
Batch and On-Demand Risk Architectures
The general architecture for batch and on-demand risk systems is well established. While there are multiple technology choices in play for each risk system, they almost always boil down to the following diagram.
One of the important techniques to note is that we try to partition the risk job into two distinct phases:
- A phase that breaks down the job into calculations (e.g., Monte Carlo or HVAR) that can be performed independently. These produce raw results like PnL vectors, which then need to be aggregated.
- An aggregation phase that takes the results and pivots them in some way in a Hyper Cube or Data Warehouse. An example of this is getting the expected loss with 99% confidence on a specific book, account, portfolio, or security.
Of course, these are essentially Map and Reduce phases. One of the benefits of these two distinct phases is that we can calculate the Risk or PnL vectors on a security and then aggregate them into positions and different confidence levels in multiple ways, all without having to recalculate unitary vectors more than once. Another benefit is that, typically, the first phase is very parallelizable and the second can be somewhat parallelizable. For example, we can aggregate non-overlapping portfolios or positions in parallel but single units of aggregation must be done serially.
A quick recap of the flow is as follows:
- Arrival of request at API gateway
- Enrichment / de-referencing
- Splitting and grouping
- Sending results to a Data Warehouse or Hyper Cube
Leveraging the Cloud
Traditionally, the challenge of designing batch and on-demand risk systems involves compute resources, the proximity of the data to compute resources, and large amounts of data that needs to be aggregated. With the infinite compute resources in the cloud, and some smart-caching and distribution strategies, Ness teams have been able to convert overnight risk jobs at some of the largest exchanges to run in tens of minutes, or even less.
There are multiple ways to map the logical architecture described above onto various cloud offerings, each with their unique tradeoffs on performance, manageability, and portability. Below are a couple of examples, with one focused on being serverless and the other on portability. Of course, there are a number of other options as well as various hybrids; the goal of these simplified diagrams is to illustrate our concepts. We’re focusing our technology choices on AWS, because we believe it to be the most advanced cloud with the largest number of offerings for most enterprises.
Limitations of Batch and On-Demand Architectures
Of course, there’s a physical limit to how low quickly this job can be run, even in the cloud. Let’s call this the job duration. One of the difficult problems that the above architecture doesn’t address is that static descriptions of Trade Data, Reference Data, and Market Data aren’t actually static at all. In reality, they’re fast-moving streams of data that change at completely different velocities. Recalculating the whole portfolio at every market data tick isn’t possible—and probably isn’t meaningful anyway—so we may want to pick a different approach to address this limitation.
Common strategies include:
- Recalculating at fixed intervals (e.g., every 10 minutes if 10 minutes is greater than the job duration)
- Recalculating the additive impacts on all new trades using market data as of a start of day or some other fixed point
- Recalculating when some piece of market data changes by some specific amount
- Recalculating at fixed intervals and then apply some sort of Taylor approximation to the results. (This is essentially the equivalent of the lambda architecture for Map Reduce.)
The challenge, then, is to ensure that any risk calculation uses consistent time slices of data. For example, we want to make sure that when pricing a portfolio of trades at 11:00.00 all the trades are using a consistent slice of market data as of 11:00.00 and that all reference data is sourced from 11:00.00. This problem is compounded by the fact that most pricing models aren’t driven by observable market data but, rather, derived market data (e.g., yield curves, volatility surfaces, etc.).
Real-time and Streaming Risk Architectures
The challenges we outlined above fall squarely into the realm of problems that streaming architectures have evolved to address.
Before we dive into the details, it’s worth pointing out that the challenges of streaming—dealing with events that are being streamed into processing systems out of the temporal order in which they occurred—isn’t exclusive to capital markets. Think of the Internet of Things (IoT) and of all the devices that can have an interrupted connection before syncing back up. Fortunately, because this problem is so common, the solutions developed to address it are extremely scalable and easily adaptable to the world of event-based risk management.
Although a thorough overview of streaming concepts is outside of the scope of this discussion, it’s important to touch on some of the key semantics present in common streaming products like Kafka Streams, Flink, and Spark Streaming.
Some key streaming concepts include:
- Bi-temporal events. Events typically have two important time attributes: the time the event occurred (the event time) and the time it was received by the system that processes it (the processing time). For most logical purposes, we want to key-off the event time.
- Windowing. Windowing is a high-level semantic concept that allows us to define intervals for operating on an events stream. One of the most common windowing strategies is non-overlapping, rolling windows. This enables us to define a fully consistent view of all events that have arrived in the system in, let’s say, 5-minute intervals.
- Triggers. Triggers are useful, among other things, to deal with late arrivals (e.g., what happens when an event arrives much after the processing window but we still want to take it into account for a specific window).
Logical Architecture of Streaming Systems
The goal of logical streaming architectures is to invert the flow of the diagram so that the event stream pushes analytics views to subscribers into a micro-batch type of approach.
Streaming applications are typically based on stateful stream processing. By this we mean that processing is achieved by a series of tasks applied on streaming data, where each task is horizontally scalable and data is partitioned into self-contained datasets.
Data and computation are co-located with local data access (in-memory or disk) to achieve the desired processing speed. The system computes the most optimal execution graph to avoid as much as possible shuffling data between horizontally distributed computation nodes.
To improve recovery time in case of failure, it’s best practice to periodically writing snapshots of these states to remote persistent storage to facilitate processing recovery back to the point in time when the failure occurred.
Let’s re-examine the steps of batch and on-demand systems that we described above and see how they apply to a streaming architecture.
- Arrival of request at API gateway – With streaming, there is no request; pipelines are running and consuming published data. Of course, there is an orchestration layer to start/stop these pipelines. However, since we’re likely running a cloud infrastructure and using Infrastructure as Code, the orchestrator is also a processing component that reacts to submissions asking for specific risk flows.
- Enrichment/de-referencing – A transform/load Pipeline turns incoming data into risk-model objects, abstracting data for Compute and Aggregate processes and isolating changes in the incoming data format.
- Splitting and grouping
- Sending results to a Data Warehouse or Hyper Cube – publish and persist results
- Analysis – Depending on requirements, the streaming pipeline can be used for executing real-time business-intelligence logic.
The Path to Unified Batch and Streaming
One of the most elegant aspects of modern streaming architectures is that they treat a batch as a bounded stream. As a result, we can now support both batch and streaming use cases within the same overarching risk architecture. The diagram below demonstrates a streaming risk architecture on AWS.
Risk architectures have always been the ideal cloud use case since they require large amounts of intermittent compute resources and are thus ideally suited to the elasticity of the cloud. At Ness, we’ve seen many financial institutions start to leverage the public cloud for their workloads. Without a doubt, cloud use will increase as enterprises continue to become comfortable securely and efficiently leveraging the cloud’s advanced capabilities.
Throughout this discussion, we’ve pointed out that traditional batch risk architectures can now be largely run on serverless and fully managed cloud offerings, both of which increase their efficiency and scalability. Now, we can even build near-time and high-volume event-based risk systems, as well as traditional batch risk systems, for our clients on the same managed architecture.
To learn more about risk architectures, contact us today.