Machine Learning Systems

KungFu: Making Training in Distributed Machine Learning Adaptive

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system …
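
One concrete instance of this coupling is the linear scaling rule: when the global batch size grows with the number of workers, the learning rate is commonly rescaled in proportion so that convergence behaviour is preserved. A minimal sketch of that arithmetic, with illustrative constants (the helper below is an assumption for exposition, not part of KungFu's API):

```python
# Linear scaling rule sketch: grow the learning rate with the global
# batch size. BASE_* values are illustrative, not from the paper.

BASE_LR = 0.1          # learning rate tuned for the base batch size
BASE_BATCH_SIZE = 256  # global batch size that BASE_LR was tuned for

def scaled_learning_rate(num_workers: int, per_worker_batch: int) -> float:
    """Scale the learning rate linearly with the global batch size."""
    global_batch = num_workers * per_worker_batch
    return BASE_LR * global_batch / BASE_BATCH_SIZE

for n in (1, 2, 4, 8):
    print(f"{n} workers -> lr = {scaled_learning_rate(n, 64):.4f}")
```

The point the abstract makes is visible even here: a system-level decision (how many workers) silently changes a hyper-parameter choice (the effective learning rate), so the two cannot be configured independently.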

Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources

To achieve higher utilisation, cloud providers offer VMs with GPUs as lower-cost transient cloud resources. Transient VMs can be revoked at short notice and vary in their availability. This poses challenges to distributed machine learning (ML) jobs, …
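
For context, the usual baseline for tolerating revoked VMs is periodic checkpointing with restart on a fresh worker set. A self-contained sketch of that baseline with a toy model (this illustrates the problem setting only; it is not Spotnik's mechanism):

```python
import pickle

# Checkpoint/restart baseline for transient VMs: persist training state
# periodically so a revoked job can resume. Toy model for illustration.

CKPT_PATH = "model.ckpt"

def init_model():
    return {"w": 0.0}

def train_step(model):
    model["w"] += 0.01  # stand-in for one SGD step
    return model

def load_checkpoint():
    try:
        with open(CKPT_PATH, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["model"]
    except FileNotFoundError:
        return 0, init_model()  # first run, or no checkpoint survived

def save_checkpoint(step, model):
    with open(CKPT_PATH, "wb") as f:
        pickle.dump({"step": step, "model": model}, f)

step, model = load_checkpoint()     # resume after any revocation
while step < 1_000:
    model = train_step(model)
    step += 1
    if step % 100 == 0:             # checkpoint rarely to amortise cost
        save_checkpoint(step, model)
```

Even this toy exposes the tension that transient resources sharpen: checkpoint too often and the I/O cost dominates training; too rarely and each revocation discards significant work.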

CrossBow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers

Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training …
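
Parallel synchronous SGD, as referenced here, shards each batch across GPUs, computes gradients in parallel on identical weights, and applies one averaged update. A NumPy sketch of that data flow (pure illustration on CPU, not CrossBow's implementation):

```python
import numpy as np

# Synchronous SGD across k simulated "GPUs": shard the batch, compute
# per-shard gradients on the same weights, average, update once.

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 3.0, 0.5])
w = np.zeros(4)        # shared model weights (linear least squares)
lr, k = 0.1, 4         # learning rate; number of simulated GPUs

def grad(w, X, y):
    """Gradient of mean squared error for the model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

for _ in range(200):
    X = rng.normal(size=(64, 4))              # one global batch
    y = X @ true_w
    shards = zip(np.array_split(X, k), np.array_split(y, k))
    g = np.mean([grad(w, Xs, ys) for Xs, ys in shards], axis=0)
    w -= lr * g                               # single synchronous update

print(w)  # converges towards true_w
```

Note how adding GPUs under this scheme either shrinks each shard or grows the effective batch size; the small-batch question in the title refers to exactly this trade-off.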

Taming Hyper-parameters in Deep Learning Systems

Deep learning (DL) systems expose many tuning parameters (“hyper-parameters”) that affect the performance and accuracy of trained models. Increasingly, users struggle to configure hyper-parameters, and a substantial portion of time is spent tuning …
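
The simplest tuning loop users resort to is a search over hyper-parameter configurations, with random search a common baseline. A self-contained sketch in which the full training-and-validation run is replaced by a stand-in objective (the names and the response surface are assumptions for illustration):

```python
import random

# Random-search baseline over two hyper-parameters. objective() stands
# in for "train the model, return validation accuracy"; its shape is
# invented for this example.

def objective(lr: float, batch_size: int) -> float:
    return 1.0 - abs(lr - 0.1) - abs(batch_size - 128) / 1024

random.seed(0)
best = (float("-inf"), None)
for _ in range(50):
    lr = 10 ** random.uniform(-4, 0)             # log-uniform LR sample
    bs = random.choice([32, 64, 128, 256, 512])  # typical batch sizes
    best = max(best, (objective(lr, bs), (lr, bs)))

score, (lr, bs) = best
print(f"best score {score:.3f} at lr={lr:.4g}, batch={bs}")
```

Each probe of the real objective costs a full training run, which is why tuning consumes so much of users' time and why systems support for it matters.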

TensorLayer: A Versatile Library for Efficient Deep Learning Development

Recently, we have observed emerging uses of deep learning techniques in multimedia systems. Developing a practical deep learning system is arduous and complex. It involves labor-intensive tasks for constructing sophisticated neural networks, …
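
As a flavour of the library, the sketch below follows the multi-layer-perceptron example from TensorLayer's TensorFlow 1.x-era tutorials; the layer API changed in later releases, so treat the exact signatures as version-dependent:

```python
import tensorflow as tf
import tensorlayer as tl

# MLP definition in the style of TensorLayer's classic MNIST tutorial
# (TensorFlow 1.x graph mode; later TensorLayer versions differ).

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.int64, shape=[None], name='y_')

network = tl.layers.InputLayer(x, name='input')
network = tl.layers.DenseLayer(network, n_units=800, act=tf.nn.relu, name='relu1')
network = tl.layers.DenseLayer(network, n_units=10, act=tf.identity, name='output')

y = network.outputs
cost = tl.cost.cross_entropy(y, y_, name='cost')
```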

Optimizing Network Performance in Distributed Machine Learning

To cope with the ever-growing availability of training data, there have been several proposals to scale machine learning computation beyond a single server and distribute it across a cluster. While this can reduce training time, the …
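
A back-of-the-envelope calculation makes the scaling concern concrete: if every worker pushes gradients to and pulls weights from a central parameter server each iteration, the server's link carries 2 × n × 4p bytes per iteration for n workers and p float32 parameters. A sketch with illustrative numbers (all figures are assumptions, not measurements from the paper):

```python
# Communication-cost arithmetic for parameter-server training.
# All numbers below are illustrative assumptions.

params = 60_000_000      # e.g. a ~60M-parameter model
bytes_per_param = 4      # float32
workers = 16
link_gbps = 10           # server NIC bandwidth in Gbit/s

# per iteration: each worker pushes gradients and pulls fresh weights
bytes_per_iter = 2 * workers * params * bytes_per_param
seconds_lower_bound = bytes_per_iter * 8 / (link_gbps * 1e9)

print(f"{bytes_per_iter / 1e9:.2f} GB per iteration; "
      f">= {seconds_lower_bound:.2f} s on one {link_gbps} Gbit/s link")
```

With these numbers the single link alone imposes several seconds per iteration, typically far longer than the GPU compute it synchronises, which is why network performance becomes the limiting factor as clusters grow.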