Implementing Algo Strategies on FPGAs – Part 1

Author: Sanjay Shah, CTO, NanoSpeed
Date: 12 August 2014

Is it worth implementing algo strategies on FPGAs? The answer is that it depends on the strategies.

Today’s state-of-the-art processors, such as Intel’s Xeon Phi, have more than 5 billion transistors. In comparison, the largest FPGAs today have about 6.8 billion transistors. Ever since FPGAs first appeared in the 1980s, their transistor counts have grown faster year-on-year than those of CPUs.

An FPGA consists of logic cells in a core logic fabric, together with memory and other blocks, all connected by programmable interconnect channels.

Figure: FPGAs: Under the Bonnet

Over the years, quite a sizable library of IP has been amassed for FPGAs, including open-source IP from communities such as OpenCores. In a similar way to the software community, FPGA designers can use this pre-tested IP to build their designs rapidly.

A recent trend has seen FPGAs combine programmable logic blocks and interconnects with embedded CPUs and related peripherals for a complete “System on a Chip” (SoC). I will be exploring the use of these embedded CPUs for algo strategies in Part 2 of this series.

For now, I want to explore another exciting building block prevalent in today’s FPGAs: the “digital signal processing” (DSP) block, which contains ALUs (arithmetic logic units). There can be hundreds, or even thousands, of programmable DSP blocks on each FPGA.

So now there are thousands of small processors, all able to perform calculations of small to medium complexity in parallel. This is vastly more powerful than the parallelism provided by today’s cutting-edge CPUs and GPUs. In fact, if you’re doing options price modelling or stochastic risk modelling on FPGAs, you can get more than a 100-fold increase in performance compared to GPUs and even more compared to CPUs.

Figure: DSP Block in FPGA
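To make the parallelism argument concrete, here is a minimal sketch in Python (not HDL) of the kind of workload that splits cleanly across many small compute units. It assumes a European call option priced by Monte Carlo under geometric Brownian motion. The function names, parameters and the "lane" structure are hypothetical and chosen only for illustration. Each lane has its own random number generator and its own accumulator, so on real hardware the lanes could run side by side; here they simply run one after another.

```python
import math
import random


def lane_payoff_sum(seed, n_paths, s0, strike, rate, vol, t):
    """One independent lane: simulate n_paths terminal prices and sum the call payoffs."""
    rng = random.Random(seed)  # private generator: no state shared between lanes
    drift = (rate - 0.5 * vol * vol) * t
    diffusion = vol * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)                      # standard normal draw
        s_t = s0 * math.exp(drift + diffusion * z)   # terminal price under GBM
        total += max(s_t - strike, 0.0)              # call payoff, accumulated locally
    return total


def price_call(n_lanes, paths_per_lane, s0, strike, rate, vol, t):
    """Combine the lanes with a single final reduction and discount the mean payoff."""
    # Assumption for illustration: lane i is seeded with i. On an FPGA the lanes
    # would run concurrently; in this sketch they run sequentially.
    total = sum(
        lane_payoff_sum(seed, paths_per_lane, s0, strike, rate, vol, t)
        for seed in range(n_lanes)
    )
    return math.exp(-rate * t) * total / (n_lanes * paths_per_lane)
```

Because each lane owns its generator and its accumulator, the only communication is the final sum. That independence is what lets this style of calculation scale with the number of compute units available.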

Along with the DSP blocks, the other major factor in this performance gain is on-chip memory. FPGAs have built-in distributed RAM that is extremely fast, allowing an aggregate bandwidth of 100 TB/s to be achieved at the data-path level.
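A figure of this size is best read as the sum over many small memories that are all accessed in the same clock cycle. The arithmetic can be sketched as below. The block count, port width and clock frequency are hypothetical round numbers picked for illustration, not figures from any vendor datasheet.

```python
def aggregate_bandwidth_bytes_per_s(n_memories, bits_per_access, clock_hz):
    """Total bandwidth when every memory delivers one access per clock cycle."""
    bits_per_second = n_memories * bits_per_access * clock_hz
    return bits_per_second / 8


# Hypothetical device: 10,000 small memories, 64 bits per access, 500 MHz clock.
example = aggregate_bandwidth_bytes_per_s(10_000, 64, 500_000_000)
example_tb_per_s = example / 1e12  # terabytes per second (decimal)
```

The point of the sketch is the scaling: the total grows linearly with the number of memories that can be read at once, which is why memory spread across the fabric can outrun a single wide external memory interface.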

FPGA designs are written in hardware description languages (HDLs) such as Verilog and VHDL. There are some C-to-Verilog translation tools on the market seemingly offering a “silver bullet” for software engineers who want to move algos to FPGAs. Although Verilog’s syntax clearly borrows from C, designing for an FPGA is a paradigm shift from traditional algo implementations targeting CPUs and even GPUs. Throughout the design process, the designer needs to be thinking at several levels: the temporal domain, optimization methods and hardware. It’s debatable whether these translation tools can ever be as good as hand-written Verilog code in terms of area and latency optimization.
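Thinking in the temporal domain means asking what every register holds on every clock edge, rather than what a program does line by line. The sketch below models this in Python for a hypothetical two-stage pipeline that computes a*b + c: a multiplier feeds one register and an adder feeds the next. It is a software model for intuition only, not Verilog and not the output of any translation tool.

```python
def simulate_pipeline(inputs):
    """Clock a two-stage multiply-then-add pipeline, one input per cycle.

    inputs: list of (a, b, c) tuples.
    Returns a list of (cycle, result) pairs, where cycle is the clock edge
    on which the result was captured in the output register.
    """
    stage1 = None  # register after the multiplier: (a * b, c)
    stage2 = None  # register after the adder: a * b + c
    outputs = []
    stream = list(inputs) + [None]  # one idle cycle to drain the last item
    for cycle, item in enumerate(stream):
        # Work out the next value of each register from the CURRENT values,
        # then update both together, as hardware registers do on a clock edge.
        next_stage2 = None if stage1 is None else stage1[0] + stage1[1]
        next_stage1 = None if item is None else (item[0] * item[1], item[2])
        stage1, stage2 = next_stage1, next_stage2
        if stage2 is not None:
            outputs.append((cycle, stage2))
    return outputs
```

Feeding in three inputs back to back shows the two properties a designer trades off: the first result is captured on the second clock edge (latency), and a new result follows on every edge after that (throughput). Both stages are busy at the same time on different data, which has no direct equivalent in sequential C code.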

The designer does have a helping hand from the vast library of pre-tested IP that is available from the FPGA vendors and from the designer community.

It’s all very well working with these hugely complex FPGA designs, but they are useless if they are not well verified. With clever use of SystemVerilog and the verification methodologies built on top of it, such as OVM and UVM, the verification task is less time-consuming than the traditional test-bench-based approach, which used to take twice as long as the design itself. Some cutting-edge firms are also using Python-based verification methodologies to reduce the time further. Even so, if the design takes two months, expect to spend two months verifying it. This is considerably more than for purely software-based systems.
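The core idea behind these methodologies is to drive the design with corner cases plus randomized stimulus and compare every result against an independent reference model. Here is a minimal sketch of that scoreboard idea in plain Python. It does not use UVM or any real verification library, and the 16-bit saturating adder standing in for the design under test is a hypothetical example.

```python
import random

WIDTH = 16
MAX_VALUE = (1 << WIDTH) - 1


def reference_sat_add(a, b):
    """Reference model: unsigned saturating add, written the obvious way."""
    return min(a + b, MAX_VALUE)


def design_sat_add(a, b):
    """Stand-in for the design under test: bit-by-bit ripple-carry add with saturation."""
    result, carry = 0, 0
    for bit in range(WIDTH):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit          # sum bit
        carry = (x & y) | (carry & (x ^ y))       # carry out of this bit
    return MAX_VALUE if carry else result         # saturate on overflow


def run_scoreboard(n_random, seed=1):
    """Apply corner cases plus random stimulus; return (checks run, mismatching inputs)."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    corner_cases = [(0, 0), (MAX_VALUE, 0), (MAX_VALUE, 1), (MAX_VALUE, MAX_VALUE)]
    random_cases = [
        (rng.randint(0, MAX_VALUE), rng.randint(0, MAX_VALUE))
        for _ in range(n_random)
    ]
    stimulus = corner_cases + random_cases
    mismatches = [
        (a, b) for a, b in stimulus
        if design_sat_add(a, b) != reference_sat_add(a, b)
    ]
    return len(stimulus), mismatches
```

The two models are deliberately written in different styles, so that a mistake in one is unlikely to be mirrored in the other. In a real flow the stand-in function would be replaced by the simulated HDL design.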

The design is mapped to the FPGA logic, DSPs and RAM by using vendor-specific tools. After mapping, the design is “placed-and-routed” to allocate the logic to the specific locations and to use the configurable routing resources to interconnect the blocks in the design. Then there is the “timing closure” process to ensure that the design meets certain performance criteria. Finally, the FPGA programming file is generated. This whole targeting process can take anything from minutes to hours, depending on the design complexity.
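The performance criterion at the heart of timing closure is usually that every register-to-register path settles within one clock period. A simplified sketch of that check follows. It ignores setup time, clock skew and other terms a real timing analyser accounts for, and the path names and delays are invented for illustration.

```python
def setup_slack_ns(clock_mhz, path_delay_ns):
    """Simplified slack: clock period minus path delay. Negative means the path fails."""
    period_ns = 1000.0 / clock_mhz
    return period_ns - path_delay_ns


def failing_paths(clock_mhz, path_delays_ns):
    """Return the names of paths whose delay exceeds the clock period, sorted by name."""
    return sorted(
        name for name, delay in path_delays_ns.items()
        if setup_slack_ns(clock_mhz, delay) < 0
    )


# Hypothetical post-route delays in nanoseconds.
paths = {"mult_to_acc": 3.2, "book_lookup": 4.6, "fifo_read": 1.1}
too_slow_at_250 = failing_paths(250, paths)  # 4.0 ns period
too_slow_at_200 = failing_paths(200, paths)  # 5.0 ns period
```

When a path fails, the designer either lowers the target clock or shortens the path, typically by inserting a pipeline register, and then re-runs place-and-route. That loop is a large part of why the targeting process can stretch to hours.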


Using FPGAs for algo strategies creates a large and massively concurrent compute resource that is able to give a 100- to 1,000-fold increase in performance. So if it’s performance gains you’re after, it’s worth using FPGAs. Apart from stochastic modelling and Monte Carlo simulations, FPGAs are a good fit for proprietary options strategies, FX strategies and even back-testing strategies. And the power? That’s typically about 25 W! The main caveat is that you would have to become proficient in writing your strategies in Verilog or VHDL.

Some FPGAs have embedded processors that are getting pretty powerful now. A less painful approach may be to port your C++ strategies onto the FPGA’s embedded processors. I will explore this approach in Part 2.