Code To Glory: How Intel And Kx Systems Beat The Benchmarks

Intel® Xeon Phi™ Real-Time Analytics Optimisations Target Financial Services

Key Takeaways

  1. Real-time analysis of massive market data volumes delivers profitable insights

  2. Accelerated query performance is the ultimate resource to identify trading opportunities

  3. Coding for parallelism and vectorisation is time well spent, delivering exceptional results

Why companies adopt specific technologies over others involves a number of factors, among them how well a new system will embed within an existing IT infrastructure. It would be naïve to think that the focus is on benchmarks alone, but in the Financial Services Industry (FSI), where analytics and high-frequency trading can mean the difference between profit and loss, real-time performance is a key selling point.

There are tools from organisations such as the Securities Technology Analysis Center* (STAC) that can benchmark platforms designed for market trading, but access to these resources commands a membership fee. For developers, system builders and others with shallower pockets who simply want to demonstrate the efficacy of a specific platform configuration, a rather more creative approach is necessary.

Indeed, when Big Data consultant Mark Litwintschik decided to write a blog benchmarking query response times using a wide range of high performance data storage and processing systems, he was well aware that the only way to conduct these tests reliably would be to use the same data on each system. Clearly, access to live market data would be a big ask and, being constantly variable, would be too inconsistent to produce meaningful benchmarks.

As luck would have it, he had found a suitable hoard of data available to download freely from GitHub. This was trading data of a different sort: a record of 1.1 billion taxi journeys across New York City from 2009 to 2015, which Todd W. Schneider had analysed. Although Schneider details his own methods using PostgreSQL* and PostGIS*, Litwintschik had other ideas and decided this dense dataset of around 600GB would suit the kind of rapid database querying that systems perform during online analytical processing (OLAP) tasks.
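The like-for-like principle described above, running the same query against the same data on every system, can be sketched in a few lines. This is a minimal illustration and not Litwintschik's actual harness: it uses Python's built-in sqlite3 module as a stand-in engine, and the `trips` table, its columns and the four sample rows are hypothetical.

```python
import sqlite3
import time


def build_sample_db(rows):
    """Load identical rows into a fresh in-memory database (stand-in engine)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: a tiny subset of what a taxi-trip table might hold.
    conn.execute(
        "CREATE TABLE trips (cab_type TEXT, passenger_count INTEGER, total_amount REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def time_query(conn, sql, repeats=3):
    """Run the same query several times and keep the best wall-clock time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return result, best


# Hypothetical sample data; a real benchmark would load the full dataset.
SAMPLE_ROWS = [
    ("yellow", 1, 10.0),
    ("yellow", 2, 14.5),
    ("green", 1, 7.25),
    ("yellow", 1, 12.0),
]

# An OLAP-style aggregation: count trips per cab type.
QUERY = "SELECT cab_type, COUNT(*) FROM trips GROUP BY cab_type ORDER BY cab_type"

conn = build_sample_db(SAMPLE_ROWS)
result, seconds = time_query(conn, QUERY)
print(result, f"best of 3: {seconds:.6f}s")
```

Keeping the data and the query text fixed means that any difference in the measured time can be attributed to the platform rather than to the workload.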


Litwintschik takes great pains to detail the variety of approaches he uses to ingest the data in a form best suited to each system, ranging from remote cloud facilities to GPU- and CPU-based hardware platforms, with networked or local data storage. So far, he’s tested close to 20 systems including Amazon Athena*, Google BigQuery*, Elasticsearch*, EMR*, MapD*, PostgreSQL and Redshift*. He concedes that these are not like-for-like comparisons, but he could see patterns emerging, with MapD systems configured with Intel® Xeon® server hardware combined with Nvidia* GPUs outperforming the cloud platforms, which perhaps isn’t so surprising.

More recently though, there have been some surprises. Litwintschik had the chance to benchmark the Intel® Xeon Phi™ processor using the Kx Systems* software combo of kdb+ database engine and its integral query language q. Used extensively in Financial Services, kdb+/q can handle huge data volumes and treats data resident either in-memory or stored on SSD (or disk) as one resource. It’s a feature that enables real-time analytics to be performed on both current and historical data for a variety of applications critical to stock market trading.

From the very beginning, Litwintschik observed that the platform, comprising four Xeon Phi 7210 servers each with 96GB of DDR4 RAM, 16GB MCDRAM and 1.2TB of local SSD storage, had commendable qualities, noting that the 30-minute data load time into the internal format of kdb+ was “one of the fastest load times” he had recorded for this dataset.

By design, Kx Systems’ kdb+/q can tap into the configuration options within the Xeon Phi platform to take full advantage of its unique features to manage workloads efficiently and minimise potential I/O bottlenecks. Certainly, the results reflected these optimisations, as Litwintschik remarked: “These query times are by far the fastest I've ever seen on any CPU-based system and the third query finished faster than on any other system I've ever benchmarked before. It's amazing to see what is possible when data locality is as optimised as it is in this setup.”

Should this be surprising? It really depends on whether you can be sure of the benchmark testing integrity, especially where trading systems are concerned, as Ian Lloyd, Technical Account Manager for Financial Services at Intel explains:

“The vast majority of workloads within the city are going to be on Intel Xeon servers, not on Xeon Phi. These data centres amount to large homogeneous estates and the banks like to keep it that way, as it’s a lot easier for the facility operators. Intel has a team dedicated to finance that helps the banks achieve these optimisations running on the estate that they already use.”

It’s a nice touch but not without its constraints. Rather than being handed a bank’s precious intellectual property in its entirety, Intel’s expert team examines select code extracts. They undertake threading and vectorisation work and then present the code back to the bank, detailing exactly what was done in order to enhance the performance. As a result, when these findings are more widely applied, dramatic speed improvements are not unusual, demonstrating that, just by doing this work, there’s a way to optimise existing hardware.

As Lloyd observes, Litwintschik’s testing reflects similar efforts: “What’s interesting about these benchmarking results is they highlight how you can tap into the extraordinary performance of the Xeon Phi with some relatively straightforward hardware configuration choices. However, these platform capabilities need to be matched by adept coding to fully optimise a system.”

Lloyd readily admits that enhancing code takes more than a few simple tweaks, and prefers to compare the effort involved to that required to utilise a GPU-based system, which, lest we forget, being an accelerator/co-processor, will always require a host CPU.

“If you’re going to land your workload on a GPU then, typically, you’re going to use a proprietary programming language in order to do it. So you’re moving away from standard x86 code and from a processor-based system that’s bootable in its own right. If you put in that same effort on your code with Xeon Phi, you’ll get the benefit of the parallelism. Keep your software on Xeon and Xeon Phi, then, not only is it essentially the same source code, the compiled binaries are the same as well – they’ll run on both.”

Being able to maintain an x86 code stream certainly has its appeal, but it would seem that performance optimisations will only arise if you’re prepared to roll up your sleeves and put in the hours. But where to start? Lloyd offers some suggestions: “The two key things you need to do are: thread your code so you’re increasing the parallelism and then start using the vectors. Vectors are on both Xeon and Xeon Phi. The more parallelism and vectorisation that you do, the better it will perform on Xeon Phi. Likewise, on standard Xeon processors, so much so that you may not even need to buy Xeon Phi (or any other accelerator). With a lot of the newer, higher core count Xeons, there is a lot of performance being left on the table by not using the threads and vectors.”
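The first of Lloyd's two steps, threading the code, usually follows a split, compute and merge pattern. The sketch below shows that structure only, under stated assumptions: the function names are hypothetical, and in standard CPython the global interpreter lock means these threads will not actually speed up a pure-Python sum. In compiled x86 code, each chunk would run on its own core and the inner summation loop would be the candidate for vectorisation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(chunk):
    """Work done independently per chunk; this is the loop a compiler would vectorise."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4):
    """Compute a mean by aggregating each chunk separately, then merging."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = chunked(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical fare amounts, used only to exercise the function.
fares = [1.0, 2.0, 3.0, 4.0, 5.0]
print(parallel_mean(fares, workers=2))
```

The design point is that each chunk's partial result is independent of the others, so the merge step is the only place where the threads' outputs meet.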

If you really want to push ahead, Lloyd also recommends using Intel® Parallel Studio, a suite of applications featuring Intel® Compiler that, when the latest Intel® Skylake Xeon processors appear, will offer immediate support for this new micro-architecture. In finance, maths libraries play a significant role in performing algorithmic functions, and a component worth linking to is the Intel Math Kernel Library (MKL), which is optimised to use the best instructions within the processor.

Also available separately, Intel® VTune™ Amplifier is another extremely useful tool that highlights hotspots in the code, as Lloyd explains: “VTune allows you to see where your code is spending the majority of its time. If you’re continually running round in a single thread, it’ll point you to that loop and make recommendations, such as parallelise or unroll the loop. By following that up, you can benefit from the parallelism of the processors and spread that single thread over multiple threads.”
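The idea of hotspot profiling that Lloyd describes can be tried on a small scale without VTune. The sketch below is not VTune and does not use its interface; it swaps in Python's built-in cProfile module as a stand-in, and the `workload`, `slow_loop` and `quick_step` functions are hypothetical. It shows a profiler pointing at the single-threaded loop that dominates the run time.

```python
import cProfile
import io
import pstats


def slow_loop(n):
    """A deliberately heavy single-threaded loop: the expected hotspot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def quick_step():
    """A cheap function, included for contrast in the report."""
    return sum(range(100))


def workload():
    quick_step()
    return slow_loop(200_000)


# Profile one call of the workload and capture the report as text.
profiler = cProfile.Profile()
answer = profiler.runcall(workload)

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # top five entries by cumulative time
report = buffer.getvalue()
print(report)
```

Once the report identifies the loop where the time goes, that loop is the one worth restructuring for parallel execution.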

Utilising the massive parallelism available with Xeon Phi’s multicore processor is where Kx Systems has been focusing its efforts, as it maintains its position as a best-in-class financial services analytics application. In the process, kdb+/q, an amazingly lightweight binary of just 587KB, demonstrates that coding is key to maximising the potential of Intel’s latest HPC platform.

As the dogged big data benchmarking blogger Mark Litwintschik concludes: “It's great to see such strong performance from the Xeon Phi CPUs. These are early days in the battle between GPUs and CPUs for supremacy in OLAP workloads. It will be interesting to see how it plays out.”

*Trademarks are the property of their owners

