Download the Benchmark at GitHub
A key-value store is similar to a file system where a filename (key) is used to access shared data (values), but differs from a file system by adding a rich set of commands to modify or process data independently of the requester. Specialization of key-value stores has led to two dominant tools, Redis and Memcached, with different features and execution models.
The common functionality of reading and writing data allows the two applications to be compared to each other and to other load-store models like EMS. However, unlike Memcached and Redis, EMS has no server process for the program to communicate with; instead, EMS is based directly on the CPU's cache-coherent memory, where synchronization is a property of the data, not the tasks.
EMS extends the basic load/store model with synchronization primitives to support parallel programming. Each memory access transitions the data through a finite state machine designed for parallel computing.
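The finite state machine can be sketched in plain JavaScript. This is a minimal single-threaded mock of full/empty semantics, not the EMS API itself; the class `FullEmptyCell` and its method names (modeled on full/empty-style operations) are illustrative assumptions:

```javascript
// Minimal single-threaded mock of full/empty memory semantics.
// Each cell is in one of two states, EMPTY or FULL.  A read that
// requires FULL ("read, leave empty") and a write that requires
// EMPTY ("write, mark full") form a finite state machine that
// serializes producers and consumers on the data itself.
class FullEmptyCell {
  constructor() {
    this.state = "EMPTY";
    this.value = undefined;
  }
  // Write when EMPTY, transition to FULL (a real runtime would block
  // instead of throwing until the cell becomes EMPTY).
  writeEF(v) {
    if (this.state !== "EMPTY") throw new Error("would block: cell is FULL");
    this.value = v;
    this.state = "FULL";
  }
  // Read when FULL, transition back to EMPTY.
  readFE() {
    if (this.state !== "FULL") throw new Error("would block: cell is EMPTY");
    this.state = "EMPTY";
    return this.value;
  }
}

const cell = new FullEmptyCell();
cell.writeEF(42);          // EMPTY -> FULL
const v = cell.readFE();   // FULL  -> EMPTY
```

Because the state lives with the cell, no separate lock object or server process mediates the handoff between a producer and a consumer.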
Any number, kind, or combination of programs may use EMS memory simultaneously. A hybrid system running multiple web servers, search and recommendation engines, key-value stores, and message queues can be hosted on a single multi-core system, as shown in the following figure. Shared memory allows key-value pairs to be shared between different types of servers, effectively acting as a protocol translator. Because memory is not rate limited or a potential bottleneck, access to EMS data does not need to be rationed or performed asynchronously.
In addition to utilizing multi-core CPUs, hybrid systems may also exploit GPUs to power the number-crunching analytics of recommendation engines, and manycore accelerators to perform unstructured searching and analytics. Software services may communicate with other local processes via EMS shared memory, or use a network protocol to communicate with local or remote processes. Tuned hardware-software platforms can be time-shared by any number or combination of services or users, possibly using isolation and encryption provided directly by the OS or hardware.
The experiment writes data to a key-value store and then reads it back, timing each operation and checking it for correctness. These experiments measure the latency of an operation (the elapsed time from when the client initiates the request until a response is received) and the capacity (the number of operations per second handled by the servers) of reading and writing a key-value pair.
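The measurement loop can be sketched as follows. This is a self-contained sketch that uses a plain `Map` as a stand-in for the key-value store; the real benchmarks issue the same write/read/verify sequence against Memcached, Redis, or EMS:

```javascript
// Sketch of the benchmark's timing loop: write all pairs, read them
// back, verify correctness, and record per-operation latency.
// `store` is a plain Map standing in for the key-value server.
function benchmark(store, pairs) {
  let minWriteNs = Infinity, minReadNs = Infinity;
  const t0 = process.hrtime.bigint();
  for (const [key, value] of pairs) {
    const s = process.hrtime.bigint();
    store.set(key, value);                    // the "write" operation
    minWriteNs = Math.min(minWriteNs, Number(process.hrtime.bigint() - s));
  }
  for (const [key, value] of pairs) {
    const s = process.hrtime.bigint();
    const got = store.get(key);               // the "read" operation
    minReadNs = Math.min(minReadNs, Number(process.hrtime.bigint() - s));
    if (got !== value) throw new Error("readback mismatch for " + key);
  }
  const elapsedSec = Number(process.hrtime.bigint() - t0) / 1e9;
  return {
    minWriteNs,                                  // best-case write latency
    minReadNs,                                   // best-case read latency
    opsPerSec: (2 * pairs.length) / elapsedSec   // capacity
  };
}

const samplePairs = Array.from({ length: 1000 }, (_, i) => ["key" + i, "val" + i]);
const result = benchmark(new Map(), samplePairs);
```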
50,000 key-value pairs are used; each key contains 100 randomly chosen characters, and each value contains a unique 4096-byte string. Both the client and server programs (except Redis) are parallel, and the number of processes is varied from 1 to 40 to measure performance at different scales. The server platform was a 32-core x86 server with 60GB of RAM.
Memcached is multithreaded and by default uses 4 threads. The benchmarks varied the number of server processes, but more than 4 threads did not yield more performance for either Memcached or EMScached, so those results are not reported. It was necessary to use EMS's parallel loops and synchronization primitives in both the Memcached and Redis client programs to create parallel worker tasks and enforce barrier synchronization, so the elapsed time of the different benchmark phases could be measured accurately.
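The phase-timing pattern the clients rely on can be sketched with plain promises. This is a stand-in for illustration only: the actual harness uses EMS's parallel loops and barriers across OS processes, and the function name `runPhases` is an assumption, not part of any API:

```javascript
// Sketch of barrier-delimited phase timing: no worker may begin
// phase 2 (reads) until every worker has finished phase 1 (writes),
// so each phase's elapsed time can be measured cleanly.
async function runPhases(nWorkers, phase1, phase2) {
  const t0 = Date.now();
  await Promise.all(Array.from({ length: nWorkers }, (_, id) => phase1(id)));
  const t1 = Date.now();        // barrier: all writes are complete here
  await Promise.all(Array.from({ length: nWorkers }, (_, id) => phase2(id)));
  const t2 = Date.now();        // barrier: all reads are complete here
  return { writeMs: t1 - t0, readMs: t2 - t1 };
}

runPhases(4, async () => {}, async () => {}).then(times => {
  // times.writeMs and times.readMs hold the two phase durations
});
```

Without the barrier, a fast worker's reads would overlap a slow worker's writes and the per-phase timings would blur together.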
Because no additional hardware resources must be allocated to use additional threads or processes, it is possible to oversubscribe the hardware with more threads than cores, relying on the operating system's time-sharing and load-balancing mechanisms to schedule task execution on the available resources. Bear in mind that the client and server processes of these experiments are running on the same system, so their resources must be summed: 16 servers and 16 clients make a total of 32 tasks; 32 clients and 16 servers make 48 tasks, oversubscribing the 32-core system by 50%.
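The task-count arithmetic above can be made explicit (a trivial sketch; the helper name is an assumption):

```javascript
// Total task count and oversubscription factor when clients and
// servers are co-located on one machine.
function oversubscription(cores, clients, servers) {
  const tasks = clients + servers;
  return { tasks, factor: tasks / cores };
}

oversubscription(32, 16, 16);  // { tasks: 32, factor: 1 }   fully subscribed
oversubscription(32, 32, 16);  // { tasks: 48, factor: 1.5 } 50% oversubscribed
```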
A context switch is already part of using the network, making oversubscription a natural way to mask the unavoidable latency of network communication. However, once there is enough work to mask all the latency, additional oversubscription typically has a detrimental effect on performance, increasing execution time. This penalty is highly dependent on how tasks are scheduled onto machine resources.
The system used for benchmarking is a virtual machine; it is not known whether the cores presented are separate hardware resources, Simultaneous Multithreaded (SMT) cores, or time-shared virtual cores. Depending on how processes are scheduled, oversubscription may add more hardware resources or share existing ones. For example, doubling the number of processes could double the amount of cache or memory bandwidth available to the entire job, or halve the effective cache size and bandwidth per core.
EMS local memory access has many times lower latency and higher bandwidth than data access via a network and server, permitting upper-layer functionality to be implemented on EMS in an interpreted language and still achieve performance rivaling optimized native code.
The time to perform each operation (from initiation to response) is measured and the minimum time is recorded as the minimum latency. This measurement demonstrates the fastest observed execution of the critical path through the client-server pair, and represents the performance potential of the existing code. This timing can also be used as the Tserial component of Amdahl's law (the portion which cannot be sped up with additional parallelism).
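Amdahl's law gives the resulting bound directly. The sketch below assumes the minimum observed latency has been converted into a serial fraction of the total per-operation time; the 10% figure is an illustrative assumption, not a measured value:

```javascript
// Amdahl's law: with serial fraction s of the single-task time,
// speedup on n workers is bounded by 1 / (s + (1 - s) / n).
function amdahlSpeedup(serialFraction, nWorkers) {
  return 1 / (serialFraction + (1 - serialFraction) / nWorkers);
}

// If the minimum latency implies 10% of each operation is
// un-parallelizable, 32 workers can deliver at most:
const bound = amdahlSpeedup(0.1, 32);   // ~7.8x
```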
When EMS is used for finding, synchronizing, and marshaling shared data, the performance of an untuned interpreted-language implementation of a key-value store can rival that of a highly optimized, natively compiled key-value store, while enjoying all the benefits of high-level rapid development tools. EMS encourages integration, customization, and acceleration of legacy applications into high-performance systems that are easier to scale, deploy, and maintain.