In AI, a great deal of attention goes to optimising training, and there is far less information out there on optimising models for serving. Yet serving predictions is where an ML application actually earns its keep: the cost of serving predictions can be a major factor in the total return on investment of an ML application. In this post we show some methods for optimising TensorFlow models to meet performance requirements, to help you reduce the cost and increase the performance of your ML solution.
To optimize models for serving, some important aspects are:
- Model size – The larger the model, the more hardware it consumes: heavy models take more storage, memory and network bandwidth, while small, compact models load into memory faster. When we use hardware acceleration for prediction, we want to ensure the model fits inside the memory of the device. Model size matters most when we are serving the model on edge or mobile devices with restricted capabilities and hardware: we need the model to download as quickly as possible, use minimal bandwidth, and take up as small a memory and storage footprint as possible.
- Prediction speed – Prediction speed is another metric we care about for serving. When we perform predictions online, we usually want results returned as fast as possible. In many online applications, serving latency is critical to user experience and application requirements. But speed matters even when we process predictions in batch. Prediction speed also has a direct relationship to the cost of serving, since it is directly connected to how much infrastructure is needed.
- Prediction throughput – Apart from prediction speed, other attributes also come into play to determine throughput, including batching of predictions, hardware acceleration, load balancing and horizontal scaling of serving instances.
How does TensorFlow help in optimisation?
The TensorFlow Model Optimization Toolkit offers various techniques that can help us optimise our ML models. It helps in:
- Reducing latency and cost for inference for both cloud and edge devices (e.g. mobile, IoT).
- Deploying models on edge devices with restrictions on processing, memory and/or power-consumption.
- Reducing payload size for over-the-air model updates.
Some optimisation techniques
Quantization
Quantized models are those where we represent the model with lower-precision numbers, such as 8-bit integers instead of 32-bit floats.
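To make the idea concrete, here is a minimal sketch of affine 8-bit quantization in plain Python. This is the concept only, not TensorFlow's actual implementation; the `quantize`/`dequantize` helper names are ours for this example. (In TensorFlow Lite, post-training quantization is enabled by setting `converter.optimizations = [tf.lite.Optimize.DEFAULT]` on a `TFLiteConverter`.)

```python
def quantize(values, num_bits=8):
    """Affine quantization: map floats onto the integer range [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(-lo / scale)  # the integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
```

Each 8-bit value is a quarter the size of a 32-bit float, and the reconstruction error stays within one quantization step (`scale`), which is why accuracy usually degrades only slightly.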
Sparsity and pruning
Sparse models are those where connections between operators (i.e. neural network layers) have been pruned, introducing zeros into the parameter tensors. Pruning reduces model size, usually with only a small loss in accuracy.
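As an illustration of the principle, the sketch below prunes a flat list of weights by magnitude. The real toolkit prunes tensors gradually during training; the `magnitude_prune` helper here is a hypothetical name for this example only.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Ties at the threshold may prune slightly more than requested.
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

dense = [0.9, -0.05, 0.4, 0.02, -0.8, 0.1, 0.7, -0.3]
sparse = magnitude_prune(dense, sparsity=0.5)
```

The intuition is that small-magnitude weights contribute little to the output, so zeroing them changes predictions only slightly while the resulting runs of zeros compress very well.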
Clustering
Clustered models are those where the original model’s parameters are replaced with a smaller number of unique values.
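A rough sketch of the idea, using a simple 1-D k-means in plain Python: every weight is replaced by its nearest shared centroid, so only the centroids and per-weight indices need storing. (TensorFlow's weight clustering works per layer and is more sophisticated; the helper name below is ours.)

```python
def cluster_weights(weights, n_clusters, iters=25):
    """Replace each weight with the nearest of n_clusters shared centroids."""
    lo, hi = min(weights), max(weights)
    # spread initial centroids evenly over the weight range
    centroids = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]
    for _ in range(iters):
        # assign each weight to its nearest centroid
        assignment = [min(range(n_clusters), key=lambda c: abs(w - centroids[c]))
                      for w in weights]
        # move each centroid to the mean of its assigned weights
        for c in range(n_clusters):
            members = [w for w, a in zip(weights, assignment) if a == c]
            if members:  # leave empty clusters where they are
                centroids[c] = sum(members) / len(members)
    return [centroids[a] for a in assignment]

weights = [0.11, 0.09, -0.42, -0.40, 0.80, 0.78, 0.10, -0.41]
clustered = cluster_weights(weights, n_clusters=3)
```

Here eight distinct weights collapse to three unique values, which makes the tensor far more compressible at the cost of a small perturbation to each weight.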
The toolkit also provides experimental support for collaborative optimisation. This lets you combine several model compression techniques and simultaneously achieve improved accuracy through quantization-aware training.
Benefits of post-training quantization
- Up to 4x reduction in model size
- 10–50% faster execution for models that consist primarily of convolutional layers
- Up to 3x speed-up for RNN-based models
- Lower power consumption for most models, due to reduced memory and computation requirements
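The 4x figure follows directly from the storage arithmetic: each float32 parameter takes 4 bytes while an 8-bit integer takes 1. A quick back-of-the-envelope check (the parameter count below is made up for illustration, and the helper ignores model metadata):

```python
def model_size_mb(n_params, bits_per_param):
    """Approximate size of the parameter storage alone, ignoring metadata."""
    return n_params * bits_per_param / 8 / 1e6

n_params = 25_000_000            # e.g. a mid-sized CNN
fp32_mb = model_size_mb(n_params, 32)   # 100.0 MB
int8_mb = model_size_mb(n_params, 8)    # 25.0 MB
```

The same arithmetic explains the speed and power benefits: 4x less data moves through memory and caches per inference.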