Extended Memory Semantics (EMS) complements serial programming with
transactional and other fine-grained synchronization capabilities
to support parallel programming.
Much of the challenge in implementing distributed and parallel programs derives from finding, marshaling, and synchronizing data. Extended Memory Semantics (EMS) unifies these tasks into a single programming and execution model. EMS implements a shared address space with a rich set of primitives for parallel access to data structures. It is not a source of parallelism itself; instead, it complements other parallel programming models and integrates shared memory data access and synchronization.
EMS leverages existing tool chains instead of replacing them and is compatible with legacy applications, libraries, frameworks, operating systems, and hardware. Because EMS implements memory independently of processes, it may persist independently of any application, and its state may be replicated, archived, or forked. Applications may attach to and detach from the memory in much the same way applications use a shared database or filesystem.
EMS makes shared memory parallelism possible in Node.js (and soon Python). It is a unified programming and execution model that addresses several challenges of parallel programming:
- Allows any number or kind of processes to share objects
- Manages synchronization and object coherency
- Implements persistence to NVM and secondary storage
- Provides dynamic load-balancing between processes
- May substitute or complement other forms of parallelism
EMS internally stores tags that are used for synchronization of
user data, allowing synchronization to happen independently of
the number or kind of processes accessing the data. The tags
can be thought of as being in one of three states: Empty,
Full, or Read-Only. The EMS primitives enforce
atomic access through automatic state transitions.
`readFE` means "read when full and mark empty",
`writeEF` means "write when empty and mark full",
`writeXF` means "write unconditionally and mark full", and so on. In the simplest case, full/empty tags are used to block readers until the memory is marked full by a writer thread, which itself blocks until the memory is marked empty. This is effectively a dataflow or producer-consumer execution model that may involve any number of producers and consumers.
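These semantics can be modeled in ordinary Node.js. The sketch below is illustrative only and assumes nothing from the EMS API: a hypothetical `FECell` class stands in for a single EMS array element, with an in-process async wait list in place of EMS's shared-memory blocking.

```javascript
// Illustrative model of EMS full/empty tag semantics in plain Node.js.
// A real EMS array stores these tags in shared memory; here a single
// in-process cell with an async wait list stands in for one element.
class FECell {
    constructor() {
        this.full = false;       // tag state: starts Empty
        this.value = undefined;
        this.waiters = [];       // resolvers parked until the tag changes
    }
    _wake() { this.waiters.splice(0).forEach((w) => w()); }
    async _until(pred) {
        while (!pred()) await new Promise((res) => this.waiters.push(res));
    }
    // "Read when full and mark empty" -- blocks until a writer fills the cell
    async readFE() {
        await this._until(() => this.full);
        this.full = false;
        this._wake();
        return this.value;
    }
    // "Write when empty and mark full" -- blocks until a reader empties the cell
    async writeEF(v) {
        await this._until(() => !this.full);
        this.value = v;
        this.full = true;
        this._wake();
    }
    // "Write unconditionally and mark full"
    writeXF(v) { this.value = v; this.full = true; this._wake(); }
}

// Producer-consumer: the consumer blocks until the producer writes.
async function demo() {
    const cell = new FECell();
    const consumer = cell.readFE();   // parks: the cell is still Empty
    await cell.writeEF(42);           // fills the cell, wakes the reader
    return consumer;                  // resolves to 42
}
demo().then((v) => console.log(v));   // prints 42
```

Any number of producers and consumers can share the cell: each wake-up re-checks the tag in a loop, so exactly one blocked reader wins each transition to Full.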
When the `require('ems')(...)` statement is executed by a program,
EMS first creates a shared memory region used to rendezvous and
communicate with the other EMS threads, then, using Node's built-in
process facilities, creates the additional threads, which execute
using one of two execution models: fork-join or Bulk Synchronous Parallel (BSP).
Under BSP, every thread invokes the same script as the master thread,
whereas fork-join execution wraps a parallel region around a function.
Under BSP, all threads enter the main program and execute every statement, synchronizing at barriers, unless statements are explicitly skipped. Fork-join parallelism has a single entry point and executes sequentially until a parallel region is started with `ems.parallel(func)`, analogous to OpenMP's `#pragma omp parallel` directive.
`ems.parForEach(func)` loops distribute iterations among the threads using one of several load-balancing scheduling options.
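One common load-balancing option is dynamic self-scheduling, where each thread claims the next chunk of iterations with an atomic fetch-and-add. The sketch below is a single-process model of that idea, not the EMS implementation: `parForEachSim` and its round-robin "workers" are hypothetical stand-ins.

```javascript
// Illustrative sketch of dynamic loop scheduling: workers claim fixed-size
// chunks of the iteration space from a shared counter using fetch-and-add,
// the primitive a parallel-for can use for load balancing across processes.
function parForEachSim(start, end, nWorkers, chunk, body) {
    let next = start;                                   // shared iteration counter
    const faa = (n) => { const old = next; next += n; return old; };
    const active = new Set([...Array(nWorkers).keys()]);
    // Round-robin turns stand in for concurrent workers: on each turn a
    // worker atomically claims the next chunk and executes its iterations.
    while (active.size > 0) {
        for (const w of [...active]) {
            const lo = faa(chunk);
            if (lo >= end) { active.delete(w); continue; }
            for (let i = lo; i < Math.min(lo + chunk, end); i++) body(i, w);
        }
    }
}

// Count how many of iterations 0..99 each of 4 workers executed
const counts = [0, 0, 0, 0];
parForEachSim(0, 100, 4, 8, (i, w) => counts[w]++);
console.log(counts);   // → [ 28, 24, 24, 24 ]
```

Because chunks are claimed on demand rather than pre-assigned, a worker that finishes its chunk early simply claims another, which is what balances uneven per-iteration costs.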
These experiments were run in January 2016 on an Amazon EC2 instance:
c4.8xlarge (132 ECUs, 36 vCPUs, 2.9 GHz Intel Xeon E5-2666 v3, 60 GiB memory)
A benchmark similar to STREAM gives the maximum rate at which EMS double-precision floating-point operations can be performed.
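The EMS distribution ships its own benchmark; for illustration only, a plain-Node analogue of the STREAM triad kernel (the same read-read-write access pattern) might look like the following. The function name `triad` and its parameters are hypothetical.

```javascript
// A plain-Node analogue of the STREAM triad kernel, a[i] = b[i] + q*c[i],
// which performs two floating-point operations per element.
function triad(n, trials) {
    const a = new Float64Array(n);
    const b = new Float64Array(n).fill(1.0);
    const c = new Float64Array(n).fill(2.0);
    const q = 3.0;
    let best = Infinity;                       // best (shortest) trial, in ns
    for (let t = 0; t < trials; t++) {
        const t0 = process.hrtime.bigint();
        for (let i = 0; i < n; i++) a[i] = b[i] + q * c[i];
        best = Math.min(best, Number(process.hrtime.bigint() - t0));
    }
    return (2 * n * 1e3) / best;               // MFLOP/s from the best trial
}
console.log(`${triad(1 << 20, 10).toFixed(1)} MFLOP/s`);
```

Reporting the best of several trials is standard practice for bandwidth kernels, since it filters out JIT warm-up and scheduler noise.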
These are the results of running the Word Count example on documents from Project Gutenberg: 2,981,712,952 words in several languages were parsed, totaling 12,664,852,220 bytes of text.
Immediate Transactions: Each process generates a transaction on integer data then immediately performs it.
Transactions from a Queue: One process generates the individual transactions and appends them to a work queue from which the other processes get work. Note: As the number of processes increases, the process generating the transactions and appending them to the work queue is starved out by the processes performing transactions, naturally maximizing the data access rate.
Immediate Transactions on Strings: Each process generates a transaction appending to a string, and then immediately performs the transaction.
Elem. Ref'd: Total number of elements read and/or written
Table Updates: Number of different EMS arrays (tables) written to
Trans. Performed: Number of transactions performed across all EMS arrays (tables)
Trans. Enqueued: Rate at which transactions are added to the work queue (only one generator thread in these experiments)
Transactional Memory (TM) provides atomic access to multiple shared objects in a manner similar to transactional databases. EMS implements mutual exclusion on specific data elements using the Full/Empty tags, and shared read-only access with a multiple readers-single writer tag.
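The sketch below models how full/empty tags can serialize a multi-element transaction: mark every participating element Empty to acquire it (in ascending index order, to avoid deadlock between transactions), apply the updates, then mark the elements Full to commit. `TaggedArray` and `transfer` are illustrative names, not EMS API.

```javascript
// Sketch of a transaction built on full/empty tags. A real EMS readFE
// blocks until an element is Full; this single-process stand-in throws
// instead, since there is no second thread to release the element.
class TaggedArray {
    constructor(n) {
        this.data = new Float64Array(n);
        this.full = new Array(n).fill(true);   // all elements start Full
    }
    acquire(i) {                               // readFE: take element, mark Empty
        if (!this.full[i]) throw new Error(`element ${i} already in a transaction`);
        this.full[i] = false;
        return this.data[i];
    }
    release(i, v) {                            // writeEF: store result, mark Full
        this.data[i] = v;
        this.full[i] = true;
    }
}

// Atomically move `amt` from account `from` to account `to`
function transfer(arr, from, to, amt) {
    const order = [from, to].sort((a, b) => a - b);      // fixed lock ordering
    const held = order.map((i) => [i, arr.acquire(i)]);  // acquire all elements
    const vals = new Map(held);
    vals.set(from, vals.get(from) - amt);
    vals.set(to, vals.get(to) + amt);
    for (const [i] of held) arr.release(i, vals.get(i)); // commit and release
}

const accts = new TaggedArray(4);
accts.data.fill(100);
transfer(accts, 0, 3, 25);
console.log(accts.data[0], accts.data[3]);   // 75 125
```

Acquiring elements in a fixed global order is the classic way to make such multi-lock transactions deadlock-free: two transactions touching the same elements always contend on the lowest-indexed one first.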
Parallel-safe stacks and queues are built-in intrinsics based on Full/Empty tags. Stacks and queues are by definition serial data structures: operations on them are ordered and serialized rather than concurrent. Although highly efficient, a shared resource like these can become a hot-spot when dozens of threads compete for access.
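A bounded queue built from full/empty tags can be sketched as follows: each slot carries its own tag, head and tail tickets would be claimed with fetch-and-add, and an operation on a slot in the wrong state blocks. This single-process model returns a failure value instead of blocking; `FEQueue` is a hypothetical name, not the EMS intrinsic.

```javascript
// Sketch of a bounded queue whose slots each carry a full/empty tag.
// In shared memory, the head/tail increments would be atomic fetch-and-adds
// and the tag checks would block; here they return failure values instead.
class FEQueue {
    constructor(n) {
        this.vals = new Array(n);
        this.full = new Array(n).fill(false);  // every slot starts Empty
        this.head = 0;                         // next slot to dequeue from
        this.tail = 0;                         // next slot to enqueue into
        this.n = n;
    }
    // writeEF on the claimed slot: a real EMS enqueue would block, not fail
    enqueue(v) {
        const slot = this.tail % this.n;
        if (this.full[slot]) return false;     // queue full
        this.tail++;
        this.vals[slot] = v;
        this.full[slot] = true;
        return true;
    }
    // readFE on the claimed slot: a real EMS dequeue would block, not fail
    dequeue() {
        const slot = this.head % this.n;
        if (!this.full[slot]) return undefined; // queue empty
        this.head++;
        this.full[slot] = false;
        return this.vals[slot];
    }
}

const q = new FEQueue(4);
[1, 2, 3].forEach((v) => q.enqueue(v));
console.log(q.dequeue(), q.dequeue());   // 1 2
```

Because synchronization lives in the per-slot tags rather than in one global lock, an enqueue and a dequeue on different slots need not contend, though the shared head and tail counters remain a hot-spot under heavy contention.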
| | Strong Scaling | Weak Scaling |
|---|---|---|
| Scaling Goal | Solve the same problem, only faster | Solve a bigger problem in the same amount of time |
| Problem Size | Stays constant while the number of processors increases | Grows with the number of processors |
| Scaling is limited by | Inter-process communication | Problem size |
| Resiliency | A single failure causes the entire job to fail; SLAs achieved through efficient checkpoint-restart | Failed sub-tasks are detected and retried; SLAs achieved through fault resiliency |
| Programming Models | MPI, GASNet, Chapel, X10, Co-Array Fortran, UPC | Batch jobs, Map-Reduce |
EMS builds on three experimental computer architectures from the 1980s: the NYU Ultracomputer, the MIT J-Machine, and the Tera Multi-Threaded Architecture (MTA). Specifically, the Ultracomputer introduced combining networks as a basic architectural feature, the J-Machine made moving a task to data as easy as moving data to the processor, and the MTA used massive multithreading to mask latency and associated fine-grained synchronization with the data, not the tasks.
EMS should detect when mapped indexes are used and the data is initialized as
`empty`, because execution deadlocks: the index is now also marked
`empty`, meaning no writers may mark the data as full, but the empty target data is a barrier to
ever releasing the index's lock.
Other programs are included with the distribution for demonstration, test, and benchmarking purposes:
- Matrix multiply
- Sobel Filter
- MongoDB Replica Server