Risk Architectures: from batch to streaming data architecture


Executive Summary

Get an overview of the standard architectures used in risk management systems for capital markets and how cloud and streaming technologies have impacted their designs.

There are multiple flavors of risk management systems for the Middle and Front Office, depending on whether the focus is on hedging market risk, credit risk, or counterparty risk.

While each analysis can vary widely, their IT architecture tends to be similar.

Streaming vs. Batch

However, there is a clear difference in how near-time and event-based systems are architected compared to batch and on-demand systems.

Let’s now discuss the following points:

  • Common architectures for batch and event-based risk systems and how they map to common cloud offerings.
  • Trending technologies and approaches that unify both under a common architecture to gain efficiencies.

Batch and On-Demand Risk Architectures

The general architecture for batch and on-demand risk systems is well established.

While multiple technology choices are in play for each risk system, the common logical architecture is depicted in the following diagram.

Risk management is partitioned into two distinct phases:

  1. Break down the job into calculations (e.g., Monte Carlo or HVAR) that can be performed independently to produce raw results, such as PnL vectors, which then need to be aggregated.

  2. Take the results and pivot them in a Hyper Cube or Data Warehouse (e.g., getting the expected loss with 99% confidence on a specific book, account, portfolio, or security).

Of course, these are essentially the Map and Reduce phases.

One of the benefits of these two phases is that we can calculate the Risk or PnL vectors once per security and then aggregate them into positions and different confidence levels in multiple ways, without recalculating any per-security vector.

Another benefit is that the first phase is parallelizable, and the second phase is parallelizable to some extent.

For example, we can aggregate non-overlapping portfolios or positions in parallel, but single aggregation units must be done serially.
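The two phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the function names, the toy Gaussian PnL model, and the sample securities and books are all hypothetical. A real Monte Carlo run would also draw shared market scenarios across securities rather than independent ones.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_pnl_vector(security_id, vol, n_scenarios=1000, seed=42):
    """Map phase: per-unit PnL vector for one security (toy Gaussian model)."""
    # Seeding per security keeps each unit of work independent and repeatable.
    rng = random.Random(f"{seed}:{security_id}")
    return [rng.gauss(0.0, vol) for _ in range(n_scenarios)]


def aggregate(positions, pnl_vectors):
    """Reduce phase: quantity-weighted sum of PnL vectors, scenario by scenario."""
    n = len(next(iter(pnl_vectors.values())))
    total = [0.0] * n
    for security_id, quantity in positions.items():
        for i, pnl in enumerate(pnl_vectors[security_id]):
            total[i] += quantity * pnl
    return total


def value_at_risk(pnl_vector, confidence=0.99):
    """Empirical loss quantile at the given confidence level."""
    losses = sorted(-pnl for pnl in pnl_vector)
    index = min(len(losses) - 1, int(confidence * len(losses)))
    return losses[index]


# Hypothetical universe: security id -> volatility per unit held.
universe = {"AAA": 1.0, "BBB": 2.0, "CCC": 0.5}

# Phase 1 runs in parallel because each security is simulated independently.
with ThreadPoolExecutor() as pool:
    vectors = dict(zip(universe, pool.map(
        lambda item: simulate_pnl_vector(*item), universe.items())))

# Phase 2 reuses the same vectors for every book without re-simulating.
books = {"book_1": {"AAA": 100, "BBB": -50}, "book_2": {"BBB": 20, "CCC": 300}}
var_by_book = {name: value_at_risk(aggregate(positions, vectors))
               for name, positions in books.items()}
```

Note that "BBB" appears in both books but is simulated only once, which is the reuse benefit described above. The per-book aggregations are independent of each other and could themselves be distributed.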

A quick recap of the flow is as follows:

  1. Arrival of request at API gateway
  2. Enrichment/de-referencing
  3. Splitting and grouping
  4. Calculation
  5. Sending results to a Data Warehouse or Hyper Cube
  6. Analysis

The Cloud Leverage

Traditionally, the challenge of designing batch and on-demand risk systems involves computing resources, the proximity of the data to compute resources, and large amounts of data that need to be aggregated.

With the elastic, effectively unlimited compute resources in the cloud and some smart-caching and distribution strategies, Ness teams have converted overnight risk jobs at some of the largest exchanges to run in minutes or even less.

There are multiple ways to map the logical architecture described above onto various cloud offerings, each with unique performance, manageability, and portability tradeoffs.

Below are a couple of examples, one focused on serverless and the other on portability.

Of course, there are several other options including various hybrids.

The goal of these simplified diagrams is to illustrate our concepts. We’re focusing our technology choices on AWS because we believe it to be the most advanced cloud with the most extensive number of offerings for most enterprises.

Limitations of Batch and On-Demand Architectures

There is a physical limit to how quickly this job can be run, even in the cloud. Let’s call this the job duration.

One of the complex problems the above architecture doesn’t address is that Trade Data, Reference Data, and Market Data aren’t static.

They’re fast-moving data streams that change at different velocities. As a result, recalculating the whole portfolio at every market data tick isn’t possible—and probably isn’t meaningful anyway—so we may want to pick a different approach to address this limitation.

Common strategies include:

  • Recalculating at fixed intervals (e.g., every 10 minutes if 10 minutes is greater than the job duration) 
  • Recalculating the additive impacts on all new trades using market data at the start of the day or some other fixed point 
  • Recalculating when some piece of market data changes by some specific amount
  • Recalculating at fixed intervals and then applying Taylor approximation to the results (this is essentially the equivalent of the lambda architecture for Map Reduce) 
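As a rough illustration of the last two strategies, the sketch below estimates PnL between full revaluations with a second-order Taylor (delta-gamma) expansion and flags when a market move is large enough to warrant a full recalculation. The function names, the single risk factor, and the 1% threshold are assumptions made for the example.

```python
def taylor_pnl(delta, gamma, spot_at_reval, spot_now):
    """Second-order Taylor estimate of PnL since the last full revaluation."""
    move = spot_now - spot_at_reval
    return delta * move + 0.5 * gamma * move * move


def needs_full_reval(spot_at_reval, spot_now, threshold=0.01):
    """True when the relative move since the last revaluation hits the threshold."""
    return abs(spot_now / spot_at_reval - 1.0) >= threshold


# Sensitivities captured by the last full (batch) run; illustrative values.
delta, gamma, spot_at_reval = 0.5, 0.02, 100.0

for spot_now in (100.2, 100.5, 102.0):
    estimate = taylor_pnl(delta, gamma, spot_at_reval, spot_now)
    print(spot_now, round(estimate, 4), needs_full_reval(spot_at_reval, spot_now))
```

The approximation is cheap enough to run on every tick, while the expensive full revaluation runs only at intervals or when the threshold is breached.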

The challenge, then, is to ensure that any risk calculation uses consistent time slices of data.

For example, we want to ensure that when pricing a portfolio of trades at 11:00:00, all the trades use a consistent slice of market data as of 11:00:00, and all reference data is sourced as of 11:00:00.

This problem is compounded by the fact that most pricing models aren’t driven by observable market data but by derived market data (e.g., yield curves, volatility surfaces, etc.).
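One simple way to obtain a consistent time slice is an "as-of" lookup: for every input, take the latest value observed at or before the cut-off time. The sketch below is an in-memory illustration only; the class name, function name, and sample data are hypothetical, and a real system would back this with a time-series or bi-temporal store.

```python
from bisect import bisect_right
from datetime import datetime


class AsOfSeries:
    """Time-ordered observations supporting 'latest value at or before t'."""

    def __init__(self):
        self._times = []
        self._values = []

    def record(self, timestamp, value):
        # Insert in time order so out-of-order arrivals are still handled.
        index = bisect_right(self._times, timestamp)
        self._times.insert(index, timestamp)
        self._values.insert(index, value)

    def as_of(self, timestamp):
        index = bisect_right(self._times, timestamp)
        if index == 0:
            raise LookupError("no observation at or before the requested time")
        return self._values[index - 1]


def consistent_slice(series_by_name, timestamp):
    """Resolve every input as of the same cut-off time."""
    return {name: series.as_of(timestamp)
            for name, series in series_by_name.items()}


spot = AsOfSeries()
spot.record(datetime(2024, 1, 2, 11, 0, 1), 101.0)    # after the cut-off
spot.record(datetime(2024, 1, 2, 10, 59, 58), 100.0)  # arrives out of order

curve = AsOfSeries()  # derived data, e.g. a yield curve build identifier
curve.record(datetime(2024, 1, 2, 10, 45, 0), "curve_v1")

cut_off = datetime(2024, 1, 2, 11, 0, 0)
inputs = consistent_slice({"spot": spot, "curve": curve}, cut_off)
```

Here `inputs` pairs the 10:59:58 spot with the curve built at 10:45:00. For derived data, a stricter design would also record which observable inputs each curve was built from, so the slice is consistent end to end.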

Real-Time and Streaming Risk Architectures

The challenges we outlined fall squarely within the class of problems that streaming architectures have evolved to address.

Before we dive into the details, it’s worth pointing out that the challenges of data streaming architecture—dealing with events being streamed into processing systems out of the temporal order in which they occurred—aren’t exclusive to capital markets.

Think of the Internet of Things (IoT) and the devices that can have an interrupted connection before syncing back up. Fortunately, because this problem is so common, the solutions developed to address it are highly scalable and easily adaptable to event-based risk management.

Although a thorough overview of streaming concepts is outside this discussion’s scope, it’s important to touch on some critical semantics in standard data streaming technology products like Kafka Streams, Flink, and Spark Streaming.

Key streaming concepts include:

  • Bi-temporal events: Events have two essential time attributes, the time the event occurred (the event time), and the time it was received by the system that processes it (the processing time). For logical reasons, we want to key off the event time.

  • Windowing: Windowing is a high-level semantic concept that allows us to define intervals for operating on an event stream. One of the most common windowing strategies is fixed, non-overlapping windows (often called tumbling windows), which give us a consistent view of all events that occurred in, let’s say, each 5-minute interval.

  • Triggers: Triggers are useful in dealing with late arrivals (e.g., when an event arrives well after its window has been processed, but we still want to consider it for that specific window).
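The three concepts above can be illustrated without any streaming framework. The sketch below assigns events to 5-minute windows by event time, re-emits a window's result when a late event arrives, and drops events that are later than an allowed lateness. The class name is hypothetical, and tracking the highest event time seen is a deliberately simplified stand-in for the watermarks that products such as Flink maintain.

```python
from collections import defaultdict


class TumblingWindowAggregator:
    """Sums values per fixed, non-overlapping event-time window."""

    def __init__(self, size=300, allowed_lateness=60):
        self.size = size                          # window length in seconds
        self.allowed_lateness = allowed_lateness  # grace period in seconds
        self.sums = defaultdict(float)
        self.watermark = 0  # simplified: highest event time seen so far

    def on_event(self, event_time, value):
        """Return (window_start, updated_sum), or None if the event is too late."""
        self.watermark = max(self.watermark, event_time)
        # Key off the event time, not the time the event was processed.
        start = event_time - event_time % self.size
        window_end = start + self.size
        if self.watermark >= window_end + self.allowed_lateness:
            return None  # window is closed for good; drop the event
        self.sums[start] += value
        return start, self.sums[start]


agg = TumblingWindowAggregator(size=300, allowed_lateness=60)
results = [
    agg.on_event(10, 1.0),   # window [0, 300)
    agg.on_event(310, 2.0),  # window [300, 600)
    agg.on_event(290, 3.0),  # late but within lateness: window [0, 300) re-fires
    agg.on_event(400, 1.0),  # advances the watermark past 360
    agg.on_event(100, 5.0),  # too late for window [0, 300): dropped
]
```

Each non-None return value plays the role of a trigger firing: downstream consumers receive an updated result for that window.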

Logical Architecture of Streaming Systems

The goal of logical streaming architectures is to invert the flow in the diagram: instead of subscribers requesting analytics, the event stream pushes updates to subscribers’ analytics views using a micro-batch approach.

A data streaming platform or streaming application is based on stateful stream processing. By this, we mean that processing is achieved by a series of tasks applied to streaming data where each task is horizontally scalable and partitioned into self-contained datasets.

Data and computation are co-located with local data access (in-memory or disk) to achieve the desired processing speed. The system computes the most optimal execution graph to avoid shuffling data between horizontally distributed computation nodes as much as possible.

To improve recovery time in case of failure, it is a best practice to periodically write snapshots of these states to remote persistent storage to facilitate processing recovery back to when the failure occurred.
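The ideas in this section (partitioned keyed state, local access, and periodic snapshots) can be sketched as follows. This is an assumption-laden illustration: the names are hypothetical, the snapshot is a JSON string standing in for remote persistent storage, and replay uses an in-memory list where a real platform would re-read a durable log from the saved offset.

```python
import json
import zlib


def partition_for(key, n_partitions):
    """Stable hash partitioning so one key always lands on the same task."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


class PartitionTask:
    """Holds the running state for the keys of one partition."""

    def __init__(self):
        self.state = {}   # key -> running total, kept local to the task
        self.offset = 0   # number of events applied so far

    def process(self, key, amount):
        self.state[key] = self.state.get(key, 0.0) + amount
        self.offset += 1

    def snapshot(self):
        # In practice this blob would be written to remote persistent storage.
        return json.dumps({"offset": self.offset, "state": self.state})

    @classmethod
    def restore(cls, blob):
        data = json.loads(blob)
        task = cls()
        task.offset = data["offset"]
        task.state = data["state"]
        return task


events = [("book_a", 10.0), ("book_b", 5.0), ("book_a", -4.0), ("book_b", 1.0)]

task = PartitionTask()
for key, amount in events[:2]:
    task.process(key, amount)
blob = task.snapshot()              # periodic snapshot
for key, amount in events[2:]:
    task.process(key, amount)       # the task then fails after these events

recovered = PartitionTask.restore(blob)
for key, amount in events[recovered.offset:]:
    recovered.process(key, amount)  # replay only what followed the snapshot
```

After replay, `recovered.state` matches the state the failed task had reached, and only the events after the snapshot had to be reprocessed.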

Let’s re-examine the steps of batch and on-demand systems described above and see how they apply to streaming architectures.

  1. The arrival of the request at the API gateway: With streaming, there is no request. Pipelines are running and consuming published data. Of course, there is an orchestration layer to start/stop these pipelines. However, since we’re likely running cloud infrastructure and using Infrastructure as Code, the orchestrator is also a processing component that reacts to submissions asking for specific risk flows.
  2. Enrichment/de-referencing: A transform/load pipeline turns incoming data into risk-model objects, abstracting data for Compute and Aggregate processes and isolating changes in the incoming data format.
  3. Splitting and grouping 
  4. Calculation
  5. Sending results to a Data Warehouse or Hyper Cube: Publish and persist results.
  6. Analysis: Depending on requirements, the streaming pipeline executes real-time business-intelligence logic.

The Path to Unified Batch Streaming

One of the most elegant aspects of modern streaming architectures is that they treat a batch as a bounded stream.

As a result, we can now support batch and streaming use cases within the same overarching risk architecture.
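A small sketch makes the point concrete: the same processing logic can consume a finite collection (a batch) or a never-ending source (a stream), because both are simply iterables of events. The function name and the sample trades are hypothetical.

```python
from itertools import cycle, islice
from typing import Dict, Iterable, Iterator, Tuple


def running_exposure(
    trades: Iterable[Tuple[str, float]]
) -> Iterator[Tuple[str, float]]:
    """Yield the updated exposure per book after each trade."""
    totals: Dict[str, float] = {}
    for book, notional in trades:
        totals[book] = totals.get(book, 0.0) + notional
        yield book, totals[book]


# Batch: a bounded stream, consumed to completion.
batch = [("fx", 100.0), ("rates", 50.0), ("fx", -30.0)]
batch_results = list(running_exposure(batch))

# Streaming: an unbounded source, consumed incrementally as events arrive.
unbounded = running_exposure(cycle([("fx", 1.0)]))
first_three = list(islice(unbounded, 3))
```

Only the source differs between the two cases; the risk logic in `running_exposure` is shared, which is the efficiency a unified architecture is after.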

The diagram below demonstrates a streaming risk architecture on AWS.


Risk architectures have always been the ideal cloud use case since they require large amounts of intermittent computing resources and are ideally suited to the elasticity of the cloud.

At Ness, we’ve seen many financial institutions starting to leverage the public cloud for their workloads. Undoubtedly, cloud usage will increase as enterprises become comfortable with the cloud’s advanced capabilities.

Throughout this discussion, we’ve pointed out that traditional batch risk architectures can now be primarily run on serverless and fully managed cloud offerings, increasing their efficiency and scalability.

Now, we can even build near-time and high-volume event-based risk systems and traditional batch risk systems for our clients on the same managed architecture.