Code Offer

Code Is Poetry!

QWFIX C++ Equities, Trading, Research and Management

QWFIX C++ Equities is the latest development of the QWFIX product line that targets the US domestic equities trading. Initially we will support five exchanges (NYSE, ARCA, NASDAQ, BATS and Direct Edge).

The Challenge

Equities trading faces a few challenges compared to futures trading. First of all, the algorithm may have to keep track of every market data message, filter out the ones it is not interested in and track the books of the “watch list” and the queue position of its own orders all the time.

Considering the amount of data involved (5-10GB per exchange per day), and the trading pattern (portfolio of tens or even hundreds of stocks), the key is to build a low latency feed handler that can reduce the tick-to-trade latency.

Feed handler also has to interact with the order management system, because it needs to keep track of order queue positions as well.

The Philosophy

The key is efficiency. Building efficient trading system will improve operation efficiency (less machines and lower co-location and connectivity cost) and trading profit (being ahead of others to grab the liquidity). Doing so will require efficient management of design and implementation.

The Speed War

In a top options market making firm, they use one machine to only trade one symbol, basically trying their best to reduce any possible latency.

In our research, we can see occasionally a possible miss-priced liquidity is taken within less than 100 microsecond round trip, immediately after it was posted. The market data won’t lie, one can not hide his/her trade.

Assuming your trading system is 200 nanosecond slower than the competitor’s system per stock in term of market data handling. Assuming your algorithm has to keep track of 100 stocks. So your system is on average 10 microsecond slower than competitor’s system, which is enough to make a difference when competing on same or similar signals.

Usually a top high frequency trading firm uses one machine only to trade about 20-30 stocks, with many outgoing order entry connectivity to send orders in round robin.

The Product – True Lock Free System

The QWFIX C++ Equities system is true lock free. Apart from optimizing our system for cache locality and pipeline friendly, we made it almost completely lock-free.

The only place we use lock is to lock guard sending portfolio of orders from several threads, and we used a spin lock to do it. Apart from that single lock, the entire system is completely lock-free, even free from atomic instructions.

The result is amazing, the system can run through a whole day’s worth of historical data (about 10GB), while keeping track of the book of top 600 active stocks and simulate order queue position, within about a minute.

Our design gives user to ability to explore maximum parallelism. Usually one core is dedicated to one thread for one algorithm. Market data update and retrieval can be done in parallel within difference threads and the system won’t be slowed down at any rate no matter how many threads are simultaneously retrieving the market data.

The only place that requires the spin lock is the order management part. However that effect is minimum. First of all, the API is designed to take a portfolio of orders at a time. Secondly, the latency introduced by that single lock is just about 100-200 nanoseconds if there is no contention.

Management

The QWFIX C++ Equities system shares the same management console with our futures product. More UI widgets are added for equities trading management.

Research

The API is carefully designed to unify the research and production environment. The same algorithm implementation can be directly used to simulate trading with historical market data. That method is used as a final step of the strategy development.

In another word, developer use our API to develop algorithms. Before using real money to trade, we can run the algorithm in “simulator mode” with historical data. It will accurately simulate the algorithm. It will be the same API and same code for both production and simulation. Cost and risk will be significantly reduced in this way.

For the development in early stages, we use HDF5 to store the market data with different levels of details. For example, top of book, top n levels of book, etc. The data set will be used by languages such as Python, C++, Java or .Net etc for research purposes.

Development Status

QWFIX Equities is expected to be in production within two months. We are working on development, testing and research in parallel.

Order Management in Nanoseconds

On one hand, if we compete on speed, there will only be one winner in the market. On the other hand, speed only matters if people are competing on the same (or similar) signal. That’s why even if people don’t even know when they are going to need to be fast, they can’t afford to lose the speed advantage.

Technically speaking, FIX protocol is difficult to be optimized for speed. It is simply not designed for low latency trading, in terms of both encoding/decoding protocol and session protocol. However, we can still manage to optimize the FIX based order handling to achieve low latency. After all, the exchange match engine will most likely put the packets it receives earlier to the front of the order book queue.

A good design should achieve not only high performance, but also usability (programming  friendliness) and low maintenance. The design of QWFIX has achieved all those goals without sacrificing any.

The key is the Order Management API (OMS API). It completely hides the low level message encoding/decoding and session management. Application only need to deal with orders (create, cancel or modify) while still enjoying sub-microsecond latency. The framework supports not just the FIX protocol, but virtually any protocol used in any exchange in this world (such as OUCH or ETS).

Another challenge during the design time is the support for multiple exchanges. Each exchange needs special implementations more or less. We managed to make developers feel as comfortable as possible working with QWFIX. In our design, there is only one universal Order class. There are two types of OrderManager class that work together. One type is exchange independent. Another type is exchange dependent that provides exchange specific order handling functionalities.

The OMS API is designed to fit into the entire QWFIX HFT framework. The entire system is not only optimized for latency, but also for throughput as well.

Benchmark:

We used CME NewOrderSingle FIX message to perform the benchmark.

On our 3.33 GHz  Intel Xeon server, we achieved 0.8707 us (870 ns) per message for the following operations combined:

  • Create a FIX NewOrderSingle Message
  • Set fields in the message
  • Booking keeping with OMS API
  • Encode message to FIX data stream
  • Release the FIX message

Note 0.8707 us is the maximum latency. Because we can use up to two threads per outgoing FIX session, the actual throughput won’t be limited by that latency number if the network is not the bottleneck.

I can further improve the performance if our system’s performance is challenged by any other system.

Faster FAST Decoder – Updated

Posted on February 9, 2011 by

After I posted the blog about the “faster FAST decoder“, I made some further improvement. Here is the latest result using the same CME data file.

Total time = 11.797371 seconds;
Packets/entry/maxEntries = 34217764/165204011/116;
0.344773 microsecond per packet; 0.071411 microsecond per entry.

Tested on the same 3.3 GHz Xeon.

Jeff made some improvement and now he can process the same file within 9.5 seconds on a 3.3 GHz Core i7. So his result is still about 20% better.

After talking with him I realized our decoders have different user requirements. I tried to make my decoder more general purpose at the cost of a little bit of performance.

All I can say is given the same user requirement I can make an implementation as fast as his. But I would like to stay with my current normalized classes.

My implementation will be able to fill the decoded fields in a predefined class, without any assumption about the message template. Every field has a bool type presence flag that occupies 8 bytes in the structure. Message is managed efficiently within an object pool and every time before the message is returned to the pool it has to be “reseted”, i.e. the presence flags has to be memset to false.

Below is the definition of the two most used messages in CME, MDIncRefresh and MDEntries. Note is was automatically generated by my system. Actually my system can generate nice code like this from any given FAST template.

The decoder knows how to efficiently fill in the structure. It can take many statically generated template handlers as well as dynamically generated handlers (during runtime). Of course statically generated handlers performs better but dynamically generated handler guarantees the flexibility and robustness. A unique “signature” will be generated along with each different version of FAST template to guarantee that the system can dynamically detect template version change at startup.

Another nice feature about this design is that the decoder can ignore some fields one is not interested in, with a little bit of performance improvement. For example, I never cared about the TradeDate field because for the instruments I am interested in it is just the current session date.

Unfortunately this flexibility does result in some cost in term of performance. Commenting out the memset message reset will have 1.5 seconds of improvement using the same CME data. Not to mention the cost of setting the “presence flags” during decoding could be similar or even higher.

Overall I am quite happy with my FAST decoder. It can decode at sustainable throughput of over 200MB/s.

Faster FAST Decoder – Market Data Handler in Nanoseconds

Posted on February 7, 2011 by

It all started from a linkedin post. A post about the performance of a particular FAST decoder turned into a series of comments with claims and questions.

One particular comment caught my interest, from Mr. Jeffrey M. Birnbaum. He posted the benchmark of his decoder with some exceptional numbers. Although it raised a lot of doubts I know it is not impossible. As a matter of fact I wrote in a previous comment with some thoughts about how I can do to improve my decoder. It is just some thoughts and I never bothered to implement it. Jeffery’s work inspired me to start implementing my ideas.

So I spent a whole weekend on the improvement. The result is quite significant. I see a three times performance improvement with my new implementation. So I emailed Jeffrey with my result and thanked him for his inspiration. Jeff kindly returned my email. And I learned two things from his email. First of all, it appears that his method is different than mine. Secondly, he is extremely confident about the superiority of his decoder.

So I decided to take his challenge and enter into a friendly match against him. Just like Jeff said, “I like a little competition and in the end we will make each other a bit better”. I have to say further improvement is not easy. I know what to do but it’s all tedious work and it takes efforts to get there. To make long story short, I managed to make some time for the work and right now, after a week, I have some result.

Before we go any further, let me first explain the CME market data I used. CME disseminates both quote change and trade information in FAST format. The FAST message (or segment) is wrapped in UDP multicast packets. Currently one UDP packet only contains one FAST message. The FAST encoding method can compress information with very high compression ratio.

The structure of FAST message is very flexible and quite complicated. It is basically a tree structure with predefined schema (FAST template in XML format). The most commonly used message is called “IncrementalRefresh”. The ”IncrementalRefresh” may contain several entries with quote and trade changes.

For example, if the current market best bid and offer is:
Offer:    129000 X 200
Bid:        128975 X 50
And there are two orders sitting at best bid with 25 contracts each. If someone sent a market order to take 50 contracts from bid side. It may result in an ”IncrementalRefresh” with 3 “MDEntries”:
Trade  25@128975; Trade  25@128975; Quote Remove Bid Level 1

There may be some confusions about FAST benchmark result because some refer “message” as “UDP packet” and some refer ”message” as each “Fast Segment”. In the example above, if ”message” is referred to as each “Fast Segment”, then the total messages is 3; otherwise it is 1.

Jeff suggests that “MB/sec is the only fair comparison” and I agree with him.

Now here comes my result, tested on my 3.33GHz Xeon server with GCC 4.4.0.

Total time = 14.743588 seconds;
Packets/Entries/MaxEntriesPerPacket = 34217764/165204011/116;
0.430875 microsecond per packet; 0.089245 microsecond per entry.

Benchmarked using a file with size of 2,838,336,532 bytes. In the benchmark result listed above, the “packet” refers to UDP packets from CME. Entry only refers to MDEntry contained in the packet (not including the MDIncRefresh or MDSnapshotFullRefresh body).

We can tell that for that day there are an average of  (0.43/0.089) = 4.83 MDEntries per UDP packet. However the biggest message contains 116 entries in a single packet.

Can I further improve it? Of course I can! Now I have optimized the integer, PMAP, and stream handling. String handling is only partially optimized. And dictionary is not optimized at all (it is still using several std::vector arrays). I can easily squeeze dozens of extra nanoseconds from each entry’s processing.

The beauty about FAST protocol is that once the implementation is done, one only needs to get a template file from exchanges that supports FAST based market data (Arca book, Eures, EBS etc), in order to decode the market data from the exchange. So yes, my implementation is universal.

So what does it mean:

  1. For some strategies, if one always needs to compete with others on certain amount of shares, it could mean responding many microseconds faster than competitors under certain market condition.
  2. We’ve heard of performance claim of FPGAs. But both Jeff and I agree it is a bit overrated. It will be difficult for FPGA to match the performance of efficiently implemented pure software decoder, if not impossible. Now the biggest bottleneck is TCP/UDP network processor. It might be something FPGA can do to make significant improvement. But unfortunately I haven’t seen any FPGA product going to that direction.

May the best impl win! And happy trading.

P.S. In case you are interested in the “friendly match”between Jeff and me, all I can tell is that Jeff’s original number is even slightly better than mine (he is using a 2.99 GHz Core i7 and a different data set). While I’m still waiting for his updated result, I would say it will be difficult for any of us to blow the doors off the other’s.

Update: As of Feb. 8th 2011, At midnight, Jeff sent me his up-to-date benchmark result with the same data on a 2.93 GHz Core I7. It is 12.5 seconds to process the whole file! In this case I will be obliged to further improve my implementation. Hopefully I can get it done by tomorrow. Let the game continue! And stay tuned!

QWFIX C++ HFT CME, HFT Trading, Research and Management

Posted on January 18, 2011 by

QWFIX HFT CME is an entire suite of high frequency trading platform, fully certified, running in production for CME trading. The main features include:

  • QWFIX C++: Pure C++ algorithm execution engine
    • Sub 5 microsecond tick-to-trade latency
    • Simple, flexible order management (OMS) API
    • Strategy engine API; implement and QA a new algorithm within 2 days
  • GUI based configuration, real-time monitoring and research tools
  • Research tools
    • Realistic C++ based online simulator (test and QA your production code with real-time market data)
    • Realistic .Net based offline simulator API (test your algorithm with historical market data)
    • Post-trade analyzer API (analyze your historical trade log)
    • GUI based historical tick-by-tick market data analyzer

QWFIX C++ API

The design and implementation of QWFIX focuses on simplicity without sacrificing performance.

Performance

QWFIX uses the following techniques to achieve sub 5 microsecond tick-to-trade performance.

  • Lock free threading model: All performance critical code uses spin locks only
  • Customized memory management: All performance critical code uses customized and lock-free memory management with preallocated memory pool
  • Advanced Architecture: Threading model, I/O model and memory management are designed to collaborate to achieve best overall performance
  • OS/Hardware optimization: Designed to work with Linux RT Kernel and Solarflare network card with OpenOnLoad driver

Programmer Friendly API

QWFIX C++ API is extremely simple and easy to use. It usually takes one day to implement a new algorithm, test and QA it in the next day, and put it in production in the third day.

Order Management API

One order manager is created for each instrument on each trading day. API is provided to send, cancel and modify orders. Static information, such as Symbol and SecurityDesc, is only initialized once when the order manager is created, in order to improve performance.

Strategy API

Multiple strategies can run simultaneously. Each strategy is identified by a unique number (integer, I usually use a date number such as 20110101).

Implementing a new strategy is extremely easy. Programmer only need to subclass a Strategy object and implement their own event handlers.

 

High Performance Logging

Logging is fast and easy. It takes two lines of code to add an entry to system log. It takes about 200 nanosecond and will be written to disk asynchronously. The logged entry can be viewed with GUI real-time management tool; or retrieved with post trade analysis API.

Configuration & Management GUI

Enterprise Manager

Enterprise Manager is used to configure the QWFIX system. It features the following:

  • Customizing FIX schema (dictionary). CME may occasionally change the requirement for FIX message by adding or modifying required tags in message
  • Customizing FIX sessions. Specify the initiator and acceptor CompID, IP addresses, scheduler and other parameters
  • Customizing FIX engine.

QWTradeMonitor

QWTradeMonitor is used to manage automated trading process in real-time.

QWTradeMonitor can also be used as a debugging tool during algorithm implementation.

Real-time Log Alerts

The real-time logs/alerts generated by remote process can be monitored with the GUI tool. Note the same logs can also be retrieved through post-trade API programmatically.

FIX Message Viewer

Each FIX message communication can be viewed with full details, including 3 timestamps. Please not how fast it is from before message creation (timstamp1) to the send() call returns (timestamp3) for the logon message. Note it takes a little longer to process Logon message. For order related messages it only takes about 2-3 microseconds.

Real-time Remote Market Data

Market data can be displayed along with brief order information (total orders and total quantities) at each price level.

Real-time Orders, Execution, Position and P&L

Every order, every execution and every field in the FIX message can be monitored in real-time, along with real-time position and P&L.

Real-time Control

Remote trading strategies can be fully controlled by QWTradeMonitor.

Researching Tools

C++ Online Simulator

QWFIX C++ online simulator can be used to realistically test and QA production automated trading system.

.Net Offline Simulator

.Net offline simulator can be used to realistically research trading strategies.

.Net Post Trade Analysis API

.Net post trade API can be used to do post trade analysis. It takes about 10 lines of code to retrieve every detail of the entire order flow. Orders can be grouped by different strategy. Each order or execution can be matched against the exact market data at the moment when it was processed.

Every status change can be further analyzed visually with the GUI based research tool QWTradeAnalyzer.

GUI Based QWTradeAnalyzer

QWTradeAnalyzer can be used to analyze tick-by-tick historical market data.

  • Multiple instruments can be synchronized by timestamps to explore spread/cross-asset-class trading opportunity
  • Orders and executions can be synchronized with tick data for trading strategy research or post-trade analysis.

Leave a comment

Information

This entry was posted on November 21, 2012 by .