There are many scenarios in which you may want to reduce model size and hardware utilization.
Our scenario centers on optimizing cloud deployments: finding a sensible trade-off between
computation cost and model quality when a large number of neural networks run
simultaneously. We want to better understand how established approaches such as
distillation, pruning, and quantization compare, and how they can be combined.
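To make one of these approaches concrete, here is a minimal sketch of uniform symmetric int8 weight quantization. All function names are illustrative, not from any particular framework; real libraries (e.g. PyTorch, TensorFlow) provide their own quantization APIs with per-channel scales and calibration.

```python
# Illustrative sketch: symmetric int8 quantization of a weight vector.
# A single float scale maps floats to the range [-127, 127].

def quantize_int8(weights):
    """Map float weights to int8 values plus a scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [x * scale for x in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Storage shrinks from 32-bit floats to 8-bit ints (plus one scale),
# at the cost of a rounding error of at most scale/2 per weight.
```

The trade-off this project studies shows up directly here: fewer bits per weight cut memory and compute, while the rounding error bounds how much model quality can degrade.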