Multithreaded Unsupervised Clustering
3DRobotics  20150107
http://synsem.com/3DRobotics 
Automated, Unguided, Unsupervised Clustering
Performed Using Shared Memory Multithreaded Parallelism

▶ What Kind of Clustering? 
Categorization of arbitrary things by likeness

Cluster techniques are generic 

Enables categorization without knowing what you're looking for
▶ Machine Learning
▶ Market Segmentation
▶ Sentiment Analysis
▶ Allegiance Switching
▶ Regression Analysis
▶ Association Rules
▶ Structured Prediction
▶ Feature Learning
▶ Dimensionality Reduction
▶ Anomaly Detection
▶ Recommendation Engines
▶ Computer Vision
▶ Astronomy
▶ Agriculture
Execution is dominated by making comparisons, which happens in such great quantity that even simple comparisons can dominate execution time.

Optimization means reducing the number of comparisons performed
▶ Brute force alltoall comparisons are $$O(n^2)$$ 
Sort by distance, then by anchor
Ordered storage permits fast binary searches to find halfplanes
Sort Over Multiple Phases
▶ From least significant to most significant digits
▶ Serial execution of phases, but each sort phase may execute in parallel
▶ Phase sort may use any stable sorting technique
▶ Phases allow comparison of keys of arbitrary length
Synthesize a sorted list from a histogram
$$O(n+k)$$  Excellent performance when number of keys is small 
Challenges with parallelization: ▶ Hotspot on counts ▶ Finegrained synchronization needed for parallel prefix sum (offsets) ▶ Load imbalance on output fanout 
$$O(\log n)$$  Strong Scaling for reductions 
Challenges with parallelization: ▶ Finegrained synchronization during reductions ▶ Load imbalance ▶ Bandwidth to memory using strided access 



Scaling Goal  Solve the same problem, only faster  Solve a bigger problem in the same amount of time 
Problem Size  Stays constant while number of processors increases  Grows with the number of processors 
Scaling is limited by  Interprocess communication  Problem size 
Resiliency  Single failure causes entire job to fail, SLAs achieved through efficient checkpointrestart.  Failed subtasks are detected and retried. SLAs achieved through fault resiliency. 
Programming Models  MPI, GASNet, Chapel, X10, CoArray Fortran, UPC  Batch jobs, MapReduce 
Use Radix Sort for compound key
▶ Sort by distance to the needed precision
▶ Sort again by anchor
Collapse nested parallelism to 1D parallelism which is easier to manage
Pennant Tree
▶ Binary tree where one child is also the parent
▶ Two anchors which mutually form the smallest new anchor are merged
▶ Repeat until all anchors are merged
Cray XMP 
x86 Cluster 
Multithreading, Clustering, Sorting, Parallelism, Geometry
▶ Unsupervised Clustering is practical in many application domains
▶ The Anchors Hierarchy reduces the number of comparisons required
▶ Atomic ReadModifyWrite operations are critical for parallelism
▶ Adding work can make execution faster if used to implement parallelism
▶ Duality of clustering and computational geometry
Here is where we will say something.