A reading list for machine learning systems
Frameworks
- [VLDB '20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [NeurIPS '19] PyTorch: An Imperative Style, High-Performance Deep Learning Library
- [OSDI '18] Ray: A Distributed Framework for Emerging AI Applications
- [OSDI '16] TensorFlow: A System for Large-Scale Machine Learning
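To get a feel for the frameworks above, the Ray paper is the quickest to try hands-on: its core abstraction is a remote task/actor API. A minimal sketch of the task API (assumes `ray` is installed locally; the workload is a placeholder):

```python
import ray

ray.init()  # start a local Ray runtime

@ray.remote
def square(x):
    # runs as an asynchronous task on any worker in the cluster
    return x * x

# launch 8 tasks in parallel and gather their results
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```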
Parallelism & Distributed Systems
- [ICML '21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [OSDI '20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters
- [ATC '20] HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
- [NeurIPS '19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [SOSP '19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [SOSP '19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [EuroSys '19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [arXiv '18] Horovod: fast and easy distributed deep learning in TensorFlow
- [ATC '17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
- [EuroSys '16] STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning
- [EuroSys '16] GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-specialized Parameter Server
- [OSDI '14] Scaling Distributed Machine Learning with the Parameter Server
- [NIPS '12] Large Scale Distributed Deep Networks
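Several of the data-parallel papers above (PyTorch Distributed, Horovod, the parameter-server line of work) synchronize gradients across workers every step, either with an all-reduce or through a parameter server. A minimal single-node sketch of the all-reduce flavor using PyTorch's `DistributedDataParallel`; the model, data, and loop are placeholders, and real jobs are launched with `torchrun`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    # one process per GPU; NCCL performs the gradient all-reduce
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)     # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                                # placeholder data/loop
        x = torch.randn(32, 1024, device=rank)
        loss = ddp_model(x).sum()
        loss.backward()                                # gradients are all-reduced here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    # torchrun sets RANK and WORLD_SIZE for multi-process launches
    train(int(os.environ.get("RANK", 0)), int(os.environ.get("WORLD_SIZE", 1)))
```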
GPU Cluster Management
- [NSDI '22] MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters
- [OSDI '21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [NSDI '21] Elastic Resource Sharing for Distributed Deep Learning
- [OSDI '20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
- [OSDI '20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [NSDI '20] Themis: Fair and Efficient GPU Cluster Scheduling
- [EuroSys '20] Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning
- [NSDI '19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC '19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
- [OSDI '18] Gandiva: Introspective Cluster Scheduling for Deep Learning
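A recurring policy in these schedulers (Tiresias in particular) is some variant of least attained service: always run the job that has received the least GPU time so far, so short jobs are not starved behind long ones. A toy illustration of that idea only, not any paper's actual algorithm; the job list and quantum are made up:

```python
import heapq

def schedule(jobs, quantum=60):
    """Toy least-attained-service scheduler on one GPU.
    `jobs` maps job id -> remaining seconds of work."""
    attained = {j: 0 for j in jobs}
    heap = [(0, j) for j in jobs]            # (attained service, job id)
    heapq.heapify(heap)
    order = []
    while heap:
        served, j = heapq.heappop(heap)
        run = min(quantum, jobs[j])
        jobs[j] -= run
        attained[j] = served + run
        order.append((j, run))
        if jobs[j] > 0:
            heapq.heappush(heap, (attained[j], j))
    return order

# three jobs of very different lengths; the short job finishes first
print(schedule({"a": 30, "b": 600, "c": 120}))
```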
Memory Management for Machine Learning
- [ATC '22] Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory
- [HPCA '22] Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems
- [ASPLOS '20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS '20] SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [ISCA '19] Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
- [ISCA '18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP '18] SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
- [MICRO '16] vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
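A common idea in this section (Capuchin, SuperNeurons, Gist) is trading recomputation or data movement for activation memory. PyTorch ships a simple form of the recomputation half as `torch.utils.checkpoint`; the sketch below uses a placeholder network and is not any of the listed systems:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# a deep placeholder network whose activations would strain GPU memory
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(16)])
x = torch.randn(8, 4096, requires_grad=True)

# keep activations only at a few segment boundaries; everything else is
# recomputed during the backward pass (compute traded for memory)
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```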
Scheduling & Resource Management
- [EuroSys '22] Out-Of-Order BackProp: An Effective Scheduling Technique for Deep Learning
- [ATC '21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [NeurIPS '20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [OSDI '20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
- [MLSys '20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [EuroSys '18] Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
- [HPCA '18] Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
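Several papers here (Salus, Zico, Nimble, PipeSwitch) are about sharing a single GPU among concurrent DNN tasks. The raw mechanism they build on can be glimpsed with plain CUDA streams in PyTorch; the two models below are placeholders, and real systems add admission control, priorities, and memory isolation on top:

```python
import torch

assert torch.cuda.is_available()  # this sketch needs a GPU

model_a = torch.nn.Linear(2048, 2048).cuda()
model_b = torch.nn.Linear(2048, 2048).cuda()
x = torch.randn(64, 2048, device="cuda")

stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
stream_a.wait_stream(torch.cuda.current_stream())  # x is ready before use
stream_b.wait_stream(torch.cuda.current_stream())

# launch the two models on separate streams so their kernels can overlap
with torch.cuda.stream(stream_a):
    out_a = model_a(x)
with torch.cuda.stream(stream_b):
    out_b = model_b(x)

torch.cuda.synchronize()  # wait for both streams before using the outputs
```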
Serving Systems (& inference acceleration)
- [ATC '22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [OSDI '22] Achieving μs-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC '21] INFaaS: Automated Model-less Inference Serving
- [OSDI '20] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
- [ISCA '20] MLPerf Inference Benchmark
- [SOSP '19] Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
- [ISCA '19] MnnFast: a fast and scalable system architecture for memory-augmented neural networks
- [EuroSys '19] μLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization
- [EuroSys '19] GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks
- [OSDI '18] Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems
- [NSDI '17] Clipper: A Low-Latency Online Prediction Serving System
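A common thread in the serving papers (Clipper, Clockwork, Nexus) is dynamic batching: briefly holding concurrent requests so one batched forward pass amortizes GPU cost without blowing the latency SLO. A toy batching loop showing only the idea; the queue layout, model callable, and timeout are made up:

```python
import queue
import time

request_queue: queue.Queue = queue.Queue()   # items: {"input": ..., "reply": queue.Queue()}

def batching_loop(model, max_batch=32, max_wait_ms=5):
    """Toy dynamic batcher: block for the first request, wait briefly for more,
    then run one batched forward pass and route each output back to its caller."""
    while True:
        batch = [request_queue.get()]                     # block for at least one request
        deadline = time.monotonic() + max_wait_ms / 1000
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = model([req["input"] for req in batch])  # one batched inference
        for req, out in zip(batch, outputs):
            req["reply"].put(out)                         # per-request reply queue
```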
Very Large Models
- [arXiv '21] ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- [ATC '21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [FAST '21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
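The central observation in ZeRO-Offload is that optimizer state (for Adam, roughly 2x the parameter memory) can live in host memory while the GPU holds only parameters, gradients, and activations. A drastically simplified sketch of that data movement in plain PyTorch, not the DeepSpeed implementation:

```python
import torch

# parameters and activations stay on the GPU; Adam's state lives on the CPU
model = torch.nn.Linear(4096, 4096).cuda()
cpu_params = [p.detach().cpu().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=1e-4)

for _ in range(10):                                   # placeholder training loop
    x = torch.randn(32, 4096, device="cuda")
    model(x).sum().backward()

    # 1) copy gradients to the host, 2) run the optimizer step there,
    # 3) copy the updated weights back to the GPU
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.detach().cpu()
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p)
    model.zero_grad(set_to_none=True)
```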
Deep Learning Recommendation Models
- [OSDI '22] Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update
- [EuroSys '22] Fleche: An Efficient GPU Embedding Cache for Personalized Recommendations
- [ASPLOS '22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [HPCA '22] Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation
- [MLSys '21] TT-Rec: Tensor Train Compression for Deep Learning Recommendation Model Embeddings
- [HPCA '21] Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training
- [HPCA '21] Understanding Training Efficiency of Deep Learning Recommendation Models at Scale
- [ISCA '20] DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
- [HPCA '20] The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
- [MICRO '19] TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning
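The recommendation-model papers above revolve around enormous embedding tables for sparse categorical features; the lookup they shard, cache, and compress is essentially an embedding-bag reduction. A toy-scale sketch of that lookup (table sizes are made up; production tables run to billions of rows, which is the whole point of these papers):

```python
import torch

# one embedding table per sparse feature
tables = torch.nn.ModuleList([
    torch.nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=64, mode="sum")
    for _ in range(4)
])

# a batch of 8 samples, each with a variable-length list of ids per feature
ids = torch.randint(0, 100_000, (20,))
offsets = torch.tensor([0, 3, 5, 9, 11, 14, 16, 18])  # start index of each sample

pooled = [table(ids, offsets) for table in tables]     # 4 tensors of shape (8, 64)
dense_input = torch.cat(pooled, dim=1)                 # concatenated for the dense MLP
print(dense_input.shape)                               # torch.Size([8, 256])
```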
Hardware Support for ML
- [ISCA '18] A Configurable Cloud-Scale DNN Processor for Real-Time AI
- [ISCA '17] In-Datacenter Performance Analysis of a Tensor Processing Unit
ML at Mobile & Embedded Systems
- [MobiCom '20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
- [RTSS '19] Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference
- [ASPLOS '17] Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
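The device/cloud papers above (Neurosurgeon, SPINN) split a network at a layer boundary so the device runs the front of the model and uploads only one intermediate tensor. A minimal sketch of that partitioning; the model and split point are placeholders, whereas the papers choose the split by profiling latency and bandwidth:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 32, 10),
)
split_point = 4                       # hypothetical fixed split
on_device, on_cloud = model[:split_point], model[split_point:]

x = torch.randn(1, 3, 32, 32)
intermediate = on_device(x)           # runs on the phone; only this tensor is uploaded
result = on_cloud(intermediate)       # runs on the server
print(result.shape)                   # torch.Size([1, 10])
```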
ML Techniques for Systems
- [ICML '20] An Imitation Learning Approach for Cache Replacement
- [ICML '18] Learning Memory Access Patterns