Machine Learning Systems

Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update

Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus promptly serving new users and content. Existing DLRSs, however, fail to do so. They train/validate models offline and broadcast entire models to global inference …
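
To make the contrast concrete, the sketch below shows one way an inference fleet could receive incremental parameter deltas rather than a full model broadcast. It is a minimal illustration under assumed semantics; the names ship_delta and apply_delta are hypothetical, not Ekko's API.

```python
import numpy as np

# Hypothetical sketch of delta-based model updates; ship_delta/apply_delta
# are illustrative names, not Ekko's API.

def ship_delta(old_params, new_params, threshold=1e-6):
    """Collect only the parameters that changed meaningfully since last sync."""
    delta = {}
    for key, new in new_params.items():
        diff = new - old_params[key]
        if np.abs(diff).max() > threshold:
            delta[key] = diff
    return delta

def apply_delta(params, delta):
    """Apply a sparse delta on an inference replica."""
    for key, diff in delta.items():
        params[key] = params[key] + diff
    return params
```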

A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning

Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures wherein the outer-loop meta-learner guides the inner-loop gradient-based reinforcement learner to achieve fast adaptations. In this paper, we develop a …
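
As a concrete picture of such a two-level procedure, here is a minimal MAML-style sketch in PyTorch: the inner loop takes one differentiable gradient step per task, and the meta-gradient flows back through that step via create_graph. The quadratic losses are toy stand-ins; the paper's bias analysis concerns estimators of exactly this kind of meta-gradient.

```python
import torch

# Toy inner/outer losses; in GMRL these would be RL objectives.
def inner_loss(params, task):
    return ((params - task) ** 2).sum()

def outer_loss(params, task):
    return ((params - 2 * task) ** 2).sum()

meta_params = torch.zeros(3, requires_grad=True)
inner_lr, meta_lr = 0.1, 0.01

for task in [torch.ones(3), -torch.ones(3)]:
    # Inner loop: one differentiable gradient step on the task loss.
    g = torch.autograd.grad(inner_loss(meta_params, task),
                            meta_params, create_graph=True)[0]
    adapted = meta_params - inner_lr * g
    # Outer loop: the meta-gradient flows back through the inner update.
    meta_g = torch.autograd.grad(outer_loss(adapted, task), meta_params)[0]
    with torch.no_grad():
        meta_params -= meta_lr * meta_g
```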

TorchOpt: An Efficient Library for Differentiable Optimization

Recent years have witnessed a boom in differentiable optimization algorithms. These algorithms exhibit different execution patterns, and their execution requires massive computational resources that go beyond a single CPU and GPU. Existing …
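
For flavour, the snippet below follows TorchOpt's documented Optax-style functional API (init/update plus torchopt.apply_updates); treat it as an indicative sketch, since call signatures may differ across library versions.

```python
import torch
import torchopt

net = torch.nn.Linear(4, 1)
params = tuple(net.parameters())

optimizer = torchopt.adam(lr=1e-2)   # pure functional optimizer
opt_state = optimizer.init(params)   # optimizer state lives outside the model

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(net(x), y)

grads = torch.autograd.grad(loss, params)
updates, opt_state = optimizer.update(grads, opt_state)
params = torchopt.apply_updates(params, updates)
```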

MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment

Large-scale Bundle Adjustment (BA) requires massive memory and computation resources, which existing BA libraries struggle to provide. In this paper, we propose MegBA, a GPU-based distributed BA library. MegBA can provide massive …
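
At its core, BA is a large nonlinear least-squares problem over camera poses and 3D points. The toy sketch below, using SciPy rather than MegBA, shows only the shape of that problem, with a single 2D translation as the unknown.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy stand-in for BA's nonlinear least squares: recover one 2D
# translation from noisy observations. Real BA jointly refines 6-DoF
# camera poses and 3D points at a far larger scale.

points = np.random.rand(10, 2)
true_t = np.array([0.5, -0.2])
observations = points + true_t + 0.01 * np.random.randn(10, 2)

def residuals(t):
    # Reprojection-style residual: predicted minus observed, flattened.
    return ((points + t) - observations).ravel()

result = least_squares(residuals, x0=np.zeros(2))
print(result.x)  # close to true_t
```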

Efficient Reinforcement Learning Development with RLzoo

Many multimedia developers are exploring the adoption of Deep Reinforcement Learning (DRL) techniques in their applications. They, however, often find such an adoption challenging. Existing DRL libraries provide poor support for prototyping DRL agents …

Fast and Flexible Human Pose Estimation with HyperPose

Estimating human pose is an important yet challenging task in multimedia applications. Existing pose estimation libraries focus on reproducing standard pose estimation algorithms. When it comes to customising these algorithms for real-world …

KungFu: Making Training in Distributed Machine Learning Adaptive

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system …
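
The kind of online adaptation the paper targets can be pictured with the toy rule below, which grows or shrinks the batch size based on a monitored gradient-noise signal. This is an illustrative policy, not KungFu's API.

```python
# Illustrative adaptation rule, not KungFu's API: adjust the batch size
# online from a monitored gradient-noise signal.

def adapt_batch_size(batch_size, grad_noise_scale, target=1.0,
                     min_bs=32, max_bs=4096):
    """Grow the batch when gradients are noisy, shrink it when they are not."""
    if grad_noise_scale > target:
        return min(batch_size * 2, max_bs)
    if grad_noise_scale < target / 2:
        return max(batch_size // 2, min_bs)
    return batch_size
```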

Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources

To achieve higher utilisation, cloud providers offer VMs with GPUs as lower-cost transient cloud resources. Transient VMs can be revoked at short notice and vary in their availability. This poses challenges to distributed machine learning (ML) jobs, …
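
A basic coping strategy for revocations is to checkpoint on the provider's shutdown warning. The sketch below assumes the warning arrives as SIGTERM; that is an assumption about the cloud environment, not a detail from the paper.

```python
import pickle
import signal

# Assumes the cloud delivers its revocation warning as SIGTERM.

def install_revocation_handler(model_state, path="checkpoint.pkl"):
    def handler(signum, frame):
        with open(path, "wb") as f:
            pickle.dump(model_state, f)  # persist before the VM disappears
        raise SystemExit(0)
    signal.signal(signal.SIGTERM, handler)
```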

CrossBow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers

Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training …
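
The baseline the abstract refers to, parallel synchronous SGD, can be sketched in a few lines: each replica computes gradients on its shard, the gradients are averaged, and every replica applies the same update. CrossBow itself improves on this scheme; the code below is only the baseline, simulated in a single process.

```python
import torch

# Baseline parallel synchronous SGD, simulated in one process.

def sync_sgd_step(replicas, shards, lr=0.1):
    per_replica_grads = []
    for model, (x, y) in zip(replicas, shards):
        loss = torch.nn.functional.mse_loss(model(x), y)
        per_replica_grads.append(
            torch.autograd.grad(loss, tuple(model.parameters())))
    # Average each parameter's gradient across replicas.
    avg = [torch.stack(gs).mean(dim=0) for gs in zip(*per_replica_grads)]
    # Every replica applies the same averaged update, staying in sync.
    for model in replicas:
        with torch.no_grad():
            for p, g in zip(model.parameters(), avg):
                p -= lr * g
```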

Taming Hyper-parameters in Deep Learning Systems

Deep learning (DL) systems expose many tuning parameters (“hyper-parameters”) that affect the performance and accuracy of trained models. Increasingly, users struggle to configure hyper-parameters, and a substantial portion of time is spent tuning …
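
As a reference point for what tuning means mechanically, here is a bare-bones random search over two common hyper-parameters. The objective() argument stands in for a full train-and-validate run and is a placeholder, not an API from the paper.

```python
import random

# Minimal random search; objective(cfg) is a placeholder for a full
# training-plus-validation run returning a loss to minimise.

def random_search(objective, trials=20):
    best_score, best_cfg = float("inf"), None
    for _ in range(trials):
        cfg = {"lr": 10 ** random.uniform(-5, -1),
               "batch_size": random.choice([32, 64, 128, 256])}
        score = objective(cfg)
        if score < best_score:
            best_score, best_cfg = score, cfg
    return best_cfg
```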