Understanding what to do when encountering throttling limits in Azure when using App Layering.
Instructions
Azure imposes I/O limits on virtual disks and network adapters. If the IOps limit for one of these components is reached, Azure throttles the connection. In our experience it is typically the network connection that gets throttled. Deliberately limiting the connection speed prevents the I/O limit from being reached.
Prepare:
It is already best practice to deploy your ELM on a premium storage account. Please also set a premium storage account in your Azure connectors. This should be a different storage account from the one used by the ELM.
We will use Wondershaper, a GPL utility, to restrict the network interface speed. A copy is attached to this article. You may need to sign in to download it. If you are still unable to download the attachment, please contact support to request a copy.
Installing and configuring:
Use an SCP tool of your choice (e.g. WinSCP) to copy the utility to /home/<your_user>/ on the ELM.
Connect to the ELM with SSH and run:
- tar -zxvf wondershaper-1.4.1-CTX-Custom.tgz
- cd wondershaper
- sudo make install
- sudo systemctl enable --now wondershaper.service
To confirm that the limit has been applied, run:
- sudo wondershaper -s -a eth0
You should see the rates set in the output like this:
class htb 1:1 root rate 1024Mbit ceil 1024Mbit burst 1408b cburst 1408b
Sent 1690 bytes 7 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 158 ctokens: 158
class htb 1:10 parent 1:1 leaf 10: prio 1 rate 204800Kbit ceil 972800Kbit burst 1561b cburst 1459b
Sent 1690 bytes 7 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
lended: 7 borrowed: 0 giants: 0
tokens: 825 ctokens: 172
class htb 1:20 parent 1:1 leaf 20: prio 2 rate 409600Kbit ceil 972800Kbit burst 1536b cburst 1459b
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 484 ctokens: 203
class htb 1:30 parent 1:1 leaf 30: prio 3 rate 204800Kbit ceil 921600Kbit burst 1561b cburst 1382b
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 968 ctokens: 203
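The per-class figures in this output follow a simple pattern, which can help when checking the output after choosing a different speed. The sketch below is only an illustration: the percentages (20/40/20 for the rates, 95/95/90 for the ceilings) are inferred from the sample output above and not taken from the Wondershaper source, and shaper_classes is a hypothetical helper name. It treats the root rate of 1024Mbit as 1024000 Kbit, which is the value consistent with the class figures shown.

```shell
# Print the expected per-class rate and ceiling (in Kbit) for a given speed.
# Assumption: the percentages are inferred from the sample output in this
# article, not read from the Wondershaper source.
shaper_classes() {
    speed_kbit=$1
    echo "1:10 rate $((speed_kbit * 20 / 100))Kbit ceil $((speed_kbit * 95 / 100))Kbit"
    echo "1:20 rate $((speed_kbit * 40 / 100))Kbit ceil $((speed_kbit * 95 / 100))Kbit"
    echo "1:30 rate $((speed_kbit * 20 / 100))Kbit ceil $((speed_kbit * 90 / 100))Kbit"
}

# 1024000 Kbit reproduces the rate/ceil values in the sample output.
shaper_classes 1024000
```

If the values reported by wondershaper -s do not scale this way on your ELM, trust the tool's output over this sketch.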
To change the speed to something other than 1Gbit, open the configuration file as shown below and set DSPEED and USPEED to the desired speed in kilobits per second.
- sudo nano /etc/systemd/wondershaper.conf
- When done, press: ctrl + o
- Confirm the file name to save by pressing Enter.
- Now press ctrl + x to exit.
- Reboot the ELM (sudo reboot) or restart the service:
- sudo systemctl restart wondershaper
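If you would rather script the change than edit the file in nano, something along these lines may work. This is a minimal sketch under assumptions: it presumes each setting sits on its own line beginning with DSPEED= or USPEED= and that quoted values are accepted, neither of which has been verified against the custom build attached to this article. set_shaper_speed is a hypothetical helper name.

```shell
# Rewrite the DSPEED and USPEED lines of a wondershaper config file.
# Assumption: each setting is on its own line in NAME=value form.
# Arguments: config file path, download speed (Kbit), upload speed (Kbit).
set_shaper_speed() {
    conf=$1; down_kbit=$2; up_kbit=$3
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file, then copy the contents back so
    # the original file keeps its owner and permissions.
    sed -e "s/^DSPEED=.*/DSPEED=\"${down_kbit}\"/" \
        -e "s/^USPEED=.*/USPEED=\"${up_kbit}\"/" \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Run as root, a call such as set_shaper_speed /etc/systemd/wondershaper.conf 512000 512000 would set both directions to 512000 Kbit. The service still needs to be restarted afterwards, as described above.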
PLEASE NOTE: The guide below is kept for context and in case it proves useful in the future. In practice, changing the disk and VM sizes as described below has not been found to significantly reduce the server busy errors. The steps above have had consistently positive results and should be attempted before any of these settings are changed.
Understanding throttling and how to identify the limits.
Azure will throttle disk or network I/O if the limits have been reached for the disks being accessed or for the total VM I/O.
This chart shows the IOps limits for each disk type:
https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits#managed-virtual-machine-disks
The repository disk on the ELM defaults to 512 GB, which is a P20 disk.
To see the live IOps usage on an ELM, log in to the ELM over SSH and run the following:
# sudo dstat -d --disk-tps
This shows four columns and prints a new line once a second. The first two columns are the read and write speed of the disks on the ELM, and the last two show TPS (transactions per second). These figures might not line up exactly with the IOps registered by Azure, but they should be close. The read/write speed can also be compared against the bandwidth limits listed in the chart above. Depending on the I/O size, you could hit a bandwidth throttling limit before the IOps throttling limit.
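To make the last point concrete, the sketch below works out which limit a steady workload would reach first for a given average I/O size. It is only an illustration: first_limit is a hypothetical helper, the 2300 IOps and 150 MB/s figures in the usage lines are example values that should be replaced with the ones from the chart for your disk, and it treats 1 MB as 1024 KB.

```shell
# Report which disk limit a steady workload reaches first.
# Arguments: IOps limit, throughput limit in MB/s, average I/O size in KB.
first_limit() {
    iops_limit=$1; mbps_limit=$2; io_kb=$3
    # Throughput (KB/s) the disk would carry if it ran at its full IOps limit.
    kbps_at_iops_limit=$((iops_limit * io_kb))
    if [ "$kbps_at_iops_limit" -gt $((mbps_limit * 1024)) ]; then
        echo bandwidth
    else
        echo iops
    fi
}

# Example limits only; take the real values from the chart.
first_limit 2300 150 8      # small I/O: the IOps limit is reached first
first_limit 2300 150 256    # large I/O: the bandwidth limit is reached first
```

The average I/O size can be estimated from the dstat output by dividing the read or write speed by the matching TPS column.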
How to increase resources to match the I/O limits.
If dstat shows the bandwidth or IOps going over the disk's limits, then expand the size of the repo disk.
To do this, select the ELM in the Azure portal and stop the VM. Once it has been deallocated, select Disks and then select the 'repo' disk. This will give you the option to resize the disk. Reference the disk size chart here:
https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits#managed-virtual-machine-disks
Set the size of the disk to a size within the IOps profile you are looking for based on the dstat results.
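The same portal steps can be approximated with the Azure CLI. The sketch below is an assumption-laden illustration and not part of the original procedure: resize_repo_disk is a hypothetical helper, and the resource group, VM, and disk names in the usage line are placeholders. By default it only prints the az commands, and it runs them only when APPLY=1 is set.

```shell
# Print (default) or run the Azure CLI calls that resize a managed disk.
# Set APPLY=1 to execute the commands instead of printing them.
# Arguments: resource group, VM name, disk name, new size in GB.
resize_repo_disk() {
    rg=$1; vm=$2; disk=$3; size_gb=$4
    run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }
    # The VM is deallocated first, matching the portal steps above.
    run az vm deallocate --resource-group "$rg" --name "$vm" &&
    run az disk update --resource-group "$rg" --name "$disk" --size-gb "$size_gb" &&
    run az vm start --resource-group "$rg" --name "$vm"
}

# Placeholder names; prints the three commands without running them.
resize_repo_disk my-rg my-elm my-elm-repo 1024
```

Review the printed commands against your own resource names before running with APPLY=1.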
Network Limits
If you are hitting network limits, then you will need to increase the VM size set for the ELM to raise its overall I/O limits.
This is done by selecting the ELM in the Azure portal -> Size -> then select a new hardware type with enough IOps.
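For those who prefer the Azure CLI to the portal, a rough equivalent is sketched below. It is an illustration under assumptions: resize_elm_vm is a hypothetical helper, the names in the usage line are placeholders, and Standard_D8s_v3 is just an example size, not a recommendation. As written it only prints the commands, and it runs them only when APPLY=1 is set.

```shell
# Print (default) or run the Azure CLI calls that change a VM's size.
# Set APPLY=1 to execute the commands instead of printing them.
# Arguments: resource group, VM name, target VM size.
resize_elm_vm() {
    rg=$1; vm=$2; size=$3
    run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }
    # List the sizes this VM can move to, then request the resize.
    run az vm list-vm-resize-options --resource-group "$rg" --name "$vm" --output table &&
    run az vm resize --resource-group "$rg" --name "$vm" --size "$size"
}

# Placeholder names and an example size; prints the commands only.
resize_elm_vm my-rg my-elm Standard_D8s_v3
```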
If you would prefer to keep the VM size the same (maybe there are cost concerns) but reduce the network bandwidth requirements, then you will need to externally limit the available bandwidth to the ELM.
Update: See the steps at the top of this guide for limiting bandwidth at the ELM interface side.