Companies adopt specific technologies over others for a number of reasons, among them how well a new system will embed within an existing IT infrastructure. It would be naïve to think the focus is on benchmarks alone, but in the Financial Services Industry (FSI), where analytics and high-frequency trading can signal the difference between profit and loss, real-time performance is a key selling point.
There are tools from companies such as the Securities Technology Analysis Center* (STAC) that can benchmark platforms designed for market trading, but access to these resources commands a membership fee. Developers, system builders, and others with shallower pockets, who simply want to demonstrate the efficacy of a specific platform configuration, need a rather more creative approach.
Indeed, when big data consultant Mark Litwintschik decided to write a series of blog posts benchmarking query response times across a wide range of high-performance data storage and processing systems, he was well aware that the only way to conduct these tests reliably would be to run the same data on each system. Clearly, access to live market data would be a big ask and, being constantly variable, too inconsistent to produce meaningful benchmarks.
As luck would have it, he found a suitable hoard of data freely downloadable from GitHub. This was trading data of a different sort: the records Todd W. Schneider had used to analyse 1.1 billion taxi journeys across New York City from 2009 to 2015. Although Schneider details his own methods using PostgreSQL* and PostGIS*, Litwintschik had other ideas and decided this dense dataset of around 600GB would suffice for the kind of rapid database querying that systems perform during online analytical processing (OLAP) tasks.
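Litwintschik's posts record the exact queries he ran on each platform; purely as an illustration of what timing an OLAP-style aggregation looks like, the minimal Python sketch below runs a single GROUP BY query against a local SQLite database and reports its wall-clock response time. The database file, the `trips` table, and its column names are assumptions made for this sketch, not details taken from the benchmarks themselves.

```python
import sqlite3
import time

# Illustrative OLAP-style aggregation over a hypothetical `trips` table;
# the table and column names are assumptions, not drawn from the article.
QUERY = """
    SELECT cab_type,
           COUNT(*)          AS journeys,
           AVG(total_amount) AS avg_fare
    FROM trips
    GROUP BY cab_type
"""

def time_query(db_path: str, sql: str) -> float:
    """Run one query and return its wall-clock response time in seconds."""
    conn = sqlite3.connect(db_path)
    try:
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch everything so timing covers the full result
        return time.perf_counter() - start
    finally:
        conn.close()

if __name__ == "__main__":
    print(f"Query completed in {time_query('taxi.db', QUERY):.3f}s")
```

The same pattern, issuing an identical query against identically loaded data and measuring elapsed time, is what makes results comparable across otherwise very different storage and processing systems.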