How To Troubleshoot High Packet or Management CPU Issue on Citrix ADC


Article ID: CTX570268


Description

CPU is a finite resource, and like any resource it has a capacity limit. The NetScaler appliance has two kinds of CPUs in general: the Management CPU and the Packet CPU(s).


Instructions

The Management CPU is responsible for processing all management traffic on the appliance, while the Packet CPU(s) handle all data traffic, for example TCP and SSL.

When diagnosing a complaint involving high CPU, start by gathering the following fundamental facts:

  1. CPUs impacted: nsppe (one or all) and management.
  2. Approximate time stamp and duration.

The output of the following commands is essential for troubleshooting high CPU issues:

  • Output of the top command: shows the CPU utilization percentage of the processes running on the NetScaler.
  • Output of the stat system memory command: shows the memory utilization percentage, which can also contribute to CPU utilization.
  • Output of the stat system cpu command: shows the current total CPU utilization on the appliance.

Sample output of the stat cpu command:

> stat cpu

CPU statistics

ID         Usage
1             29
 

The above output indicates that there is only one CPU (used for both management and data traffic), with CPU ID 1, and that its utilization is 29%.

There are also appliances with multiple cores (nCore), where more than a single core is allocated to the appliance; on these, multiple CPU IDs appear in the "stat system cpu" output.
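As a rough illustration (not a NetScaler command), the per-ID utilization from a "stat cpu"-style listing can be averaged with a few lines of shell. The sample file below is hypothetical; on an appliance you would save the real command output instead.

```shell
# Hypothetical "stat cpu"-style output saved to a file; an nCore
# appliance would show one row per CPU ID.
cat <<'EOF' > /tmp/statcpu.txt
ID         Usage
1             29
2             41
EOF

# Average the Usage column, skipping the header row
awk 'NR > 1 { sum += $2; n++ } END { printf "%.1f\n", sum / n }' /tmp/statcpu.txt
# prints 35.0
```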

*Note: High CPU seen when running the "top" command does not impact the performance of the box, and it does not mean that the NetScaler is running at high CPU or consuming all of the CPU. The NetScaler kernel runs on top of BSD, and that is what top is reporting; although the kernel process appears to be using the full CPU, it actually is not.
 

The following steps can help in understanding the CPU usage further:

  1. Check the following counters to understand CPU usage.

    CLASSIC:
    master_cpu_use
    cc_appcpu_use filter=cpu(0)
    (If AppFW or CMP is configured, then looking at slave_cpu_use also makes sense for classic)

    nCORE:
    (For an 8 Core system)
    mgmt_cpu_use (CPU0 - nscollect runs here)
    master_cpu_use (average of cpu(1) thru cpu(7))
    cc_cpu_use filter=cpu(1)
    cc_cpu_use filter=cpu(2)
    cc_cpu_use filter=cpu(3)
    cc_cpu_use filter=cpu(4)
    cc_cpu_use filter=cpu(5)
    cc_cpu_use filter=cpu(6)
    cc_cpu_use filter=cpu(7)
     

  2. How do I look at CPU use for a particular CPU?
    Use the nsconmsg command, search for cc_cpu_use, and grep for the CPU you are interested in.
    The output will look like the following:

    Index  rtime  totalcount-val  delta  rate/sec  symbol-name&device-no
      320      0             209     15         2  cc_cpu_use cpu(8)
      364      0             205     -6         0  cc_cpu_use cpu(8)
      375      0             222     17         2  cc_cpu_use cpu(8)
      386      0             212    -10        -1  cc_cpu_use cpu(8)
      430      0             216      6         0  cc_cpu_use cpu(8)
      440      0             201    -15        -2  cc_cpu_use cpu(8)
      450      0             208      7         1  cc_cpu_use cpu(8)
      461      0             202     -6         0  cc_cpu_use cpu(8)
      471      0             209      7         1  cc_cpu_use cpu(8)
      482      0             238     29         4  cc_cpu_use cpu(8)
      492      0             257     19         2  cc_cpu_use cpu(8)
  • Look at the totalcount (third) column and divide it by 10 to get the CPU percentage. For example, in the last line above, 257 implies that 257/10 = 25.7% of CPU(8) is used.
    Run the following commands to investigate the nsconmsg counters for a CPU issue:

    nsconmsg -K newnslog -g cpu_use -s totalcount=600 -d current
    nsconmsg -K newnslog -d current | grep cc_cpu_use
  • Look at the traffic, memory, and CPU in conjunction. Sustained high CPU usage can indicate that the appliance is hitting platform limits. Try to understand whether the CPU has gone up because of traffic, and if so, whether it is genuine traffic or some sort of attack.
  • Check the profiler output to understand what is consuming the CPU.
    For details on the profiler output and logs, refer to the following article:
    https://support.citrix.com/article/CTX212480

  • The CPU counters mentioned in the following article provide further detail:
    https://support.citrix.com/article/CTX133887
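The divide-by-10 rule from the step above can be scripted. This is only a sketch over hypothetical saved cc_cpu_use lines, not an official NetScaler tool; on an appliance the lines would come from the nsconmsg commands shown earlier.

```shell
# Hypothetical cc_cpu_use lines saved from nsconmsg output
cat <<'EOF' > /tmp/cc_cpu_use.txt
  482      0             238     29         4  cc_cpu_use cpu(8)
  492      0             257     19         2  cc_cpu_use cpu(8)
EOF

# totalcount-val is the third column; dividing it by 10 gives the
# CPU utilization percentage for that sample
awk '/cc_cpu_use/ { printf "%s %.1f%%\n", $7, $3 / 10 }' /tmp/cc_cpu_use.txt
# prints:
# cpu(8) 23.8%
# cpu(8) 25.7%
```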


Profiling FAQs

1. What is Constant profiling?

This refers to running the CPU profiler at all times, from the moment the NetScaler device comes up. The profiler is invoked at boot time and keeps running. Whenever any PE's associated CPU exceeds 90%, the profiler captures the data into a set of files.
 

2. Why is this needed?

This was necessitated by issues seen at some customer sites and in internal tests. With customer issues, it is hard to go back and ask the customer to run the profiler when the issue occurs again. Hence the need for an always-running profiler that can show the functions triggering high CPU. With this feature, the profiler is always running and the data is captured when high CPU usage occurs.
 

3. Which releases/builds contain this feature?

TOT (Crete) 44.2+
9.3 - all builds
9.2 52.x +
Only nCore builds are affected.
 

4. How do we know the profiler is already running?

Run the ps command to check whether nsproflog and nsprofmon are running. The number of nsprofmon processes should equal the number of PEs running.

root@nc1# ps -ax | grep nspro
36683 p0 S+ 0:00.00 grep nspro
79468 p2- I 0:00.01 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79496 p2- I 0:00.00 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79498 p2- I 0:00.00 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79499 p2- I 0:00.00 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79502 p2- S 33:46.15 /netscaler/nsprofmon -s cpu=3 -ys cpuuse=800 -ys profmode=cpuuse -O -k /v
79503 p2- S 33:48.03 /netscaler/nsprofmon -s cpu=2 -ys cpuuse=800 -ys profmode=cpuuse -O -k /v
79504 p2- S 32:20.63 /netscaler/nsprofmon -s cpu=1 -ys cpuuse=800 -ys profmode=cpuuse -O -k /v
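The check above can be scripted by counting nsprofmon entries. The sketch below runs against a hypothetical saved ps listing so it is self-contained; on a live appliance you would pipe ps -ax directly instead.

```shell
# Hypothetical saved "ps -ax" listing (trimmed); on the appliance,
# use: ps -ax | grep -c '/netscaler/nsprofmon'
cat <<'EOF' > /tmp/ps.out
79502 p2- S 33:46.15 /netscaler/nsprofmon -s cpu=3 -ys cpuuse=800 -ys profmode=cpuuse
79503 p2- S 33:48.03 /netscaler/nsprofmon -s cpu=2 -ys cpuuse=800 -ys profmode=cpuuse
79504 p2- S 32:20.63 /netscaler/nsprofmon -s cpu=1 -ys cpuuse=800 -ys profmode=cpuuse
EOF

# The count should match the number of packet engines (here, 3)
grep -c '/netscaler/nsprofmon' /tmp/ps.out
# prints 3
```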
 

5. Where is the profiler data?

The profiled data is collected in the /var/nsproflog directory. Here is a sample listing of the files in that folder. At any point in time, the currently running files are newproflog_cpu_<penum>.out. Once the data in these files exceeds 10 MB, they are archived into a tar file and compressed. The rollover mechanism is similar to the one used for newnslog files.

newproflog.0.tar.gz newproflog.5.tar.gz newproflog.old.tar.gz
newproflog.1.tar.gz newproflog.6.tar.gz newproflog_cpu_0.out
newproflog.2.tar.gz newproflog.7.tar.gz nsproflog.nextfile
newproflog.3.tar.gz newproflog.8.tar.gz nsproflog_options
newproflog.4.tar.gz newproflog.9.tar.gz ppe_cores.txt

The current data is always captured in newproflog_cpu_<ppe number>.out. Once the profiler is stopped, the newproflog_cpu_* files will be archived into newproflog.(value in nsproflog.nextfile-1).tar.gz.
 

6. What is nsprofmon and what's nsproflog.sh?

nsprofmon is the binary that interacts with the PE, retrieves the profiler records, and writes them to files. It has a myriad of options that are hard to remember. The wrapper script nsproflog.sh is easier to use and remember. Going forward, the wrapper script is recommended if the task is limited to collecting CPU usage data.
 

7. Should I use nsprofmon or nsproflog.sh?

In earlier releases (9.0 and earlier), nsprofmon was heavily used internally and by the support groups, and some internal scripts used by devtest refer to nsprofmon. It is recommended to use nsproflog.sh if the task is limited to collecting CPU usage data.
 

8. Will the existing scripts be affected?

Existing scripts are affected only if they try to invoke the profiler. See the next question.
 

9. What if I want to start the profiler with a different set of parameters?

Only one instance of the profiler can run at any time. If the profiler is already running (invoked at boot time by constant profiling) and it is invoked again, it flags an error and exits.

root@nc1# nsproflog.sh cpuuse=900 start
nCore Profiling
Another instance of profiler is already running.
If you want to run the profiler at a different CPU threshold, please stop the current profiler using
# nsproflog.sh stop
... and invoke again with the intended CPU threshold. Please see nsproflog.sh -h for the exact usage.


Similarly, nsprofmon is also modified to check if another instance is running. If it is, it exits flagging an error.

If the profiler needs to be run again with a different CPU usage threshold (for example, 80%), the running instance needs to be stopped and the profiler invoked again:

root@nc1# nsproflog.sh stop
nCore Profiling
Stopping all profiler processes
Removing buffer for -s cpu=1
Removing profile buffer on cpu 1 ...  Done.
Saved profiler capture data in newproflog.5.tar.gz
Setting minimum lost CPU time for NETIO to 0 microsecond ...  Done.
Stopping mgmt profiler process
 
root@nc1# nsproflog.sh cpuuse=800 start
 

10. How do I view the profiler data?

In /var/nsproflog, unzip and untar the desired tar archive. Each file in this archive should correspond to each PE.

Caution: When older files are unzipped and untarred in place, the files from the archive overwrite the current ones, because the names stored inside the tar archive are the same as those the currently running profiler keeps writing to. To avoid this, unzip and untar into a temporary directory.
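A minimal sketch of the safe extraction workflow follows. The archive here is simulated so the commands are self-contained; on the appliance you would point tar at an existing archive under /var/nsproflog instead of creating one.

```shell
# Simulate an archived capture file so the sketch is self-contained;
# on the appliance the newproflog.N.tar.gz archives already exist.
workdir=$(mktemp -d)
echo "sample profiler records" > "$workdir/newproflog_cpu_0.out"
tar -czf "$workdir/newproflog.5.tar.gz" -C "$workdir" newproflog_cpu_0.out

# Extract into a separate temporary directory so nothing overwrites
# the live newproflog_cpu_*.out files
extractdir=$(mktemp -d)
tar -xzf "$workdir/newproflog.5.tar.gz" -C "$extractdir"
ls "$extractdir"
# prints newproflog_cpu_0.out
```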

The simplest way to view the profiled data is:

# nsproflog.sh kernel=/netscaler/nsppe display=newproflog_cpu_<ppe number>.out

 

11. How do we collect this data for analysis?

The showtech script has been modified to collect the profiler data. When a customer issue arrives, /var/nsproflog can be checked to see whether the profiler has captured any data.
 

12. Anything else that I need to know?

Collecting traces and collecting profiler data are mutually exclusive. When nstrace.sh is run to collect traces, the profiler is automatically stopped, and it is restarted when nstrace.sh exits. Profiler data is therefore not available for the period during which traces were collected.
 

13. What commands get executed when profiler is started?

Initialization:

For each CPU, the following commands are executed initially:

nsapimgr -c
nsapimgr -ys cpuuse=900
nsprofmon -s cpu=<cpuid> -ys profbuf=128 -ys profmode=cpuuse

Capturing:

For each CPU, the following are executed:

nsapimgr -c
nsprofmon -s cpu=<cpuid> -ys cpuuse=900 -ys profmode=cpuuse -O -k /var/nsproflog/newproflog_cpu_<cpuid>.out -s logsize=10485760 -ye capture

After the above, the nsprofmon processes keep running until any one of the capture buffers is full.

nsproflog.sh waits for any of the above child processes to exit.

Stopping:

Kill all nsprofmon processes (killall -9 nsprofmon)

For each CPU, the following commands are executed:

nsprofmon -s cpu=<cpuid> -yS profbuf

Profiler capture files are then archived, and the minimum lost CPU time for NETIO is reset:

nsapimgr -ys lctnetio=0

Issue/Introduction

This article outlines the steps to troubleshoot high management or packet CPU usage on the NetScaler appliance.