Implementing Algo Strategies on FPGAs – Part 2

Author: Sanjay Shah, CTO, NanoSpeed
Date: 18 September 2014

While FPGAs offer a vast concurrent resource to develop your trading strategies straight in the FPGA fabric, a system with FPGA embedded processors offers an easier transition.

This is part 2 in a series on implementing algo strategies on FPGAs. Part 1 examined which strategies could – and should – be implemented on FPGAs.

In transistor count, the largest FPGAs now rival or exceed the largest CPUs. With thousands of simple embedded arithmetic logic units, they provide a vast concurrent resource for algo strategies. However, programming FPGAs is still largely the domain of specialist FPGA engineers working in hardware description languages such as VHDL or Verilog, and those engineers can be expensive and hard to find.

But there is a less painful approach – porting your C++ algo strategies onto FPGAs that already have processors embedded. This can reduce latency significantly, because the strategy runs on the same chip as the market data and order-entry logic.

FPGAs such as Altera’s Arria V SoC and Xilinx’s Zynq already embed a dual-core ARM Cortex-A9 processor. Whilst the processor may not be hugely powerful, it does offer nanosecond-scale access to and from the FPGA fabric logic via ARM’s AXI4 (the fourth generation of the Advanced eXtensible Interface). This allows us to develop trading architectures such as this:

Figure: FPGA Embedded ARM Processor
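
The processor-to-fabric access described above is typically exposed to software as memory-mapped registers. The following is a minimal C++ sketch under stated assumptions: the register block `OrderEntryRegs`, its field layout, the bit flags and the `send_order` helper are all hypothetical and are not taken from any vendor design. On real hardware the pointer would be obtained by mapping the block's physical base address, and memory barriers and cache attributes would need attention; both are left out here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register layout of an order-entry block in the FPGA fabric.
// Field order and meanings are illustrative assumptions only.
struct OrderEntryRegs {
    volatile std::uint32_t price;     // limit price in ticks
    volatile std::uint32_t quantity;  // order size
    volatile std::uint32_t control;   // bit 0: 1 = buy, 0 = sell; bit 1: send
    volatile std::uint32_t status;    // bit 0: set by the fabric while busy
};

constexpr std::uint32_t kSideBuy = 1u << 0;
constexpr std::uint32_t kSend    = 1u << 1;
constexpr std::uint32_t kBusy    = 1u << 0;

// Writes one order into the register block.
// Returns false without touching the registers if the block reports busy.
inline bool send_order(OrderEntryRegs* regs, std::uint32_t price,
                       std::uint32_t quantity, bool is_buy) {
    assert(regs != nullptr);
    if (regs->status & kBusy) {
        return false;
    }
    regs->price = price;
    regs->quantity = quantity;
    // The control word is written last; in this sketch that write is what
    // tells the (hypothetical) fabric block to transmit the order.
    regs->control = kSend | (is_buy ? kSideBuy : 0u);
    return true;
}
```

Because the helper only takes a pointer, the same code can be exercised against an ordinary struct in a unit test on a desktop machine before it is pointed at the fabric.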

For low to medium complexity strategies such as stochastic modelling, Monte Carlo simulations and short call/put ladder spreads, this architecture becomes a completely self-contained trading platform. FPGA building blocks such as market data feed handlers, the order book, risk checks and TCP/UDP stacks provide the conduit for the trader’s algo strategies. To a large extent, they can be transparent to the trader, who can concentrate on porting existing C++ strategies, or developing new ones, on the ARM processor.
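
To illustrate how the fabric blocks can stay transparent to the strategy code, here is a minimal C++ sketch of a strategy that only sees a top-of-book snapshot and emits an order. Everything in it is an assumption made for illustration: the `TopOfBook` and `Order` types, the `SpreadCapture` class and its quoting rule are hypothetical, and a real platform would define its own interfaces.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot as an order-book block in the fabric might present it.
struct TopOfBook {
    std::int64_t bid_price;  // in ticks
    std::int64_t ask_price;  // in ticks
    std::uint32_t bid_size;
    std::uint32_t ask_size;
};

enum class Side { Buy, Sell };

struct Order {
    Side side;
    std::int64_t price;      // in ticks
    std::uint32_t quantity;
};

// Toy strategy: when the spread is wide enough, bid one tick above the best bid.
class SpreadCapture {
public:
    SpreadCapture(std::int64_t min_spread_ticks, std::uint32_t clip_size)
        : min_spread_ticks_(min_spread_ticks), clip_size_(clip_size) {
        // A spread of at least two ticks keeps the improved bid below the ask.
        assert(min_spread_ticks >= 2);
    }

    // Called on every book update. Fills `out` and returns true when the
    // strategy wants to send an order; returns false otherwise.
    bool on_book_update(const TopOfBook& book, Order& out) const {
        if (book.bid_size == 0 || book.ask_size == 0) {
            return false;  // one side of the book is empty
        }
        const std::int64_t spread = book.ask_price - book.bid_price;
        if (spread < min_spread_ticks_) {
            return false;  // spread too tight to improve the bid
        }
        out = Order{Side::Buy, book.bid_price + 1, clip_size_};
        return true;
    }

private:
    std::int64_t min_spread_ticks_;
    std::uint32_t clip_size_;
};
```

The strategy contains no feed parsing, networking or risk logic; in the architecture described above those would sit in the fabric on either side of this callback.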

The development environment closely resembles the one most software strategy developers already know, and supports functions such as tracing and debugging:

Figure: Embedded Processor Development Environment

Altera’s new Arria 10 FPGAs are around the corner. With the processor running at 1.5GHz and DDR4 memory access at a similar frequency, processor system performance is adequate for low to medium complexity strategies, though well below that of Xeon-class processors. The biggest advantage of these embedded processors is their very fast access to the other blocks in the FPGA: there is no hop across the PCIe bus, which saves many microseconds. In Q1 2015, Altera will be bringing out Stratix 10 FPGAs with a much more powerful embedded processor – the quad-core ARM Cortex-A53:

Figure: More powerful processors embedded in next gen FPGAs

This will considerably speed up the low to medium complexity strategies you may have developed on the ARM Cortex-A9 and allow more complex trading strategies to be ported onto the FPGA, giving your strategies a straightforward upgrade path. In terms of performance, these processors, running at 2.5GHz, are closer to the Xeon processors found in today’s servers.

For the ultimate performance of your algos, you need to offload them partially or fully to the FPGA logic fabric using hardware description languages such as VHDL or Verilog.


Whilst FPGAs offer a vast concurrent resource to develop your trading strategies straight in the FPGA fabric, a system with FPGA embedded processors offers an easier transition. It allows strategies to be developed in more familiar languages such as C++, using more standard software-based development environments. Because the embedded processors currently on the market are less powerful than server CPUs, you may only be able to offload some of your strategies, but you can partition your algos so that the latency-critical ones run on the embedded processor. This will give you significant speed advantages where they are most needed, and an efficient, easy-to-implement solution until more powerful embedded processors become available next year.
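
One way to picture the partitioning described above is a simple greedy placement: rank strategies by how latency-sensitive they are and fill the embedded processor until its compute budget is used up. The C++ sketch below is only an illustration under stated assumptions. The `StrategyProfile` fields, the notion of a single scalar `embedded_capacity` and the `partition_strategies` function are hypothetical; in practice the numbers would come from profiling your own strategies.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class Target { EmbeddedProcessor, HostServer };

// Hypothetical per-strategy profile; both numbers would come from measurement.
struct StrategyProfile {
    std::string name;
    double latency_budget_us;  // tolerable reaction time in microseconds
    double compute_cost;       // relative processor load, arbitrary units
};

struct Placement {
    std::string name;
    Target target;
};

// Greedy partition: the most latency-critical strategies are considered first
// and placed on the embedded processor while capacity remains; everything
// else stays on the host server.
inline std::vector<Placement> partition_strategies(
        std::vector<StrategyProfile> strategies, double embedded_capacity) {
    assert(embedded_capacity >= 0.0);
    std::sort(strategies.begin(), strategies.end(),
              [](const StrategyProfile& a, const StrategyProfile& b) {
                  return a.latency_budget_us < b.latency_budget_us;
              });

    std::vector<Placement> placements;
    placements.reserve(strategies.size());
    double remaining = embedded_capacity;
    for (const StrategyProfile& s : strategies) {
        if (s.compute_cost <= remaining) {
            remaining -= s.compute_cost;
            placements.push_back(Placement{s.name, Target::EmbeddedProcessor});
        } else {
            placements.push_back(Placement{s.name, Target::HostServer});
        }
    }
    return placements;
}
```

A greedy pass like this is not guaranteed to make the best use of the capacity, but it keeps the rule easy to reason about: the tightest latency budgets get first claim on the embedded processor.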