Apache SkyWalking – Profiling

Blog: Activating Automatical Performance Analysis -- Continuous Profiling

Sun, 25 Jun 2023 00:00:00 +0000

Background

In previous articles, We have discussed how to use SkyWalking and eBPF for performance problem detection within processes and networks. They are good methods to locate issues, but still there are some challenges:

The timing of the task initiation: It’s always challenging to address the processes that require performance monitoring when problems occur. Typically, manual engagement is required to identify processes and the types of performance analysis necessary, which cause extra time during the crash recovery. The root cause locating and the time of crash recovery conflict with each other from time to time. In the real case, rebooting would be the first choice of recovery, meanwhile, it destroys the site of crashing.
Resource consumption of tasks: The difficulties to determine the profiling scope. Wider profiling causes more resources than it should. We need a method to manage resource consumption and understand which processes necessitate performance analysis.
Engineer capabilities: On-call is usually covered by the whole team, which have junior and senior engineers, even senior engineers have their understanding limitation of the complex distributed system, it is nearly impossible to understand the whole system by a single one person.

The Continuous Profiling is a new created mechanism to resolve the above issues.

Automate Profiling

As profiling is resource costing and high experience required, how about introducing a method to narrow the scope and automate the profiling driven by polices creates by senior SRE engineer? So, in 9.5.0, SkyWalking first introduced preset policy rules for specific services to be monitored by the eBPF Agent in a low-energy manner, and run profiling when necessary automatically.

Policy

Policy rules specify how to monitor target processes and determine the type of profiling task to initiate when certain threshold conditions are met.

These policy rules primarily consist of the following configuration information:

Monitoring type: This specifies what kind of monitoring should be implemented on the target process.
Threshold determination: This defines how to determine whether the target process requires the initiation of a profiling task.
Trigger task: This specifies what kind of performance analysis task should be initiated.

Monitoring type

The type of monitoring is determined by observing the data values of a specified process to generate corresponding metrics. These metric values can then facilitate subsequent threshold judgment operations. In eBPF observation, we believe the following metrics can most directly reflect the current performance of the program:

Monitor Type	Unit	Description
System Load	Load	System load average over a specified period.
Process CPU	Percentage	The CPU usage of the process as a percentage.
Process Thread Count	Count	The number of threads in the process.
HTTP Error Rate	Percentage	The percentage of HTTP requests that result in error responses (e.g., 4xx or 5xx status codes).
HTTP Avg Response Time	Millisecond	The average response time for HTTP requests.

Monitoring network type metrics is not as simple as obtaining basic process information. It requires the initiation of eBPF programs and attaching them to the target process for observation. This is similar to the principles of network profiling task we introduced in the previous article, except that we no longer collect the full content of the data packets. Instead, we only collect the content of messages that match specified HTTP prefixes.

By using this method, we can significantly reduce the number of times the kernel sends data to the user space, and the user-space program can parse the data content with less system resource usage. This ultimately helps in conserving system resources.

Metrics collector

The eBPF agent would report metrics of processes periodically as follows to indicate the process performance in time.

Name	Unit	Description
process_cpu	(0-100)%	The CPU usage percent
process_thread_count	count	The thread count of process
system_load	count	The average system load for the last minute, each process have same value
http_error_rate	(0-100)%	The network request error rate percentage
http_avg_response_time	ms	The network average response duration

Threshold determination

For the threshold determination, the judgement is made by the eBPF Agent based on the target monitoring process in its own memory, rather than relying on calculations performed by the SkyWalking backend. The advantage of this approach is that it doesn’t have to wait for the results of complex backend computations, and it reduces potential issues brought about by complicated interactions.

By using this method, the eBPF Agent can swiftly initiate tasks immediately after conditions are met, without any delay.

It includes the following configuration items:

Threshold: Check if the monitoring value meets the specified expectations.
Period: The time period(seconds) for monitoring data, which can also be understood as the most recent duration.
Count: The number of times(seconds) the threshold is triggered within the detection period, which can also be understood as the total number of times the specified threshold rule is triggered in the most recent duration(seconds). Once the count check is met, the specified Profiling task will be started.

Trigger task

When the eBPF Agent detects that the threshold determination in the specified policy meets the rules, it can initiate the corresponding task according to pre-configured rules. For each different target performance task, their task initiation parameters are different:

On/Off CPU Profiling: It automatically performs performance analysis on processes that meet the conditions, defaulting to 10 minutes of monitoring.
Network Profiling: It performs network performance analysis on all processes in the same Service Instance on the current machine, to prevent the cause of the issue from being unrealizable due to too few process being collected, defaulting to 10 minutes of monitoring.

Once the task is initiated, no new profiling tasks would be started for the current process for a certain period. The main reason for this is to prevent frequent task creation due to low threshold settings, which could affect program execution. The default time period is 20 minutes.

Data Flow

The figure 1 illustrates the data flow of the continuous profiling feature:

Figure 1: Data Flow of Continuous Profiling

eBPF Agent with Process

Firstly, we need to ensure that the eBPF Agent and the process to be monitored are deployed on the same host machine, so that we can collect relevant data from the process. When the eBPF Agent detects a threshold validation rule that conforms to the policy, it immediately triggers the profiling task for the target process, thereby reducing any intermediate steps and accelerating the ability to pinpoint performance issues.

Sliding window

The sliding window plays a crucial role in the eBPF Agent’s threshold determination process, as illustrated in the figure 2:

Figure 2: Sliding Window in eBPF Agent

Each element in the array represents the data value for a specified second in time. When the sliding window needs to verify whether it is responsible for a rule, it fetches the content of each element from a certain number of recent elements (period parameter). If an element exceeds the threshold, it is marked in red and counted. If the number of red elements exceeds a certain number, it is deemed to trigger a task.

Using a sliding window offers the following two advantages:

Fast retrieval of recent content: With a sliding window, complex calculations are unnecessary. You can know the data by simply reading a certain number of recent array elements.
Solving data spikes issues: Validation through count prevents situations where a data point suddenly spikes and then quickly returns to normal. Verification with multiple values can reveal whether exceeding the threshold is frequent or occasional.

eBPF Agent with SkyWalking Backend

The eBPF Agent communicates periodically with the SkyWalking backend, involving three most crucial operations:

Policy synchronization: Through periodic policy synchronization, the eBPF Agent can keep processes on the local machine updated with the latest policy rules as much as possible.
Metrics sending: For processes that are already being monitored, the eBPF Agent periodically sends the collected data to the backend program. This facilitates real-time query of current data values by users, who can also compare this data with historical values or thresholds when problems arise.
Profiling task reporting: When the eBPF detects that a certain process has triggered a policy rule, it automatically initiates a performance task, collects relevant information from the current process, and reports it to the SkyWalking backend. This allows users to know when, why, and what type of profiling task was triggered from the interface.

Demo

Next, let’s quickly demonstrate the continuous profiling feature, so you can understand more specifically what it accomplishes.

Deploy SkyWalking Showcase

SkyWalking Showcase contains a complete set of example services and can be monitored using SkyWalking. For more information, please check the official documentation.

In this demo, we only deploy service, the latest released SkyWalking OAP, and UI.

export SW_OAP_IMAGE=apache/skywalking-oap-server:9.5.0
export SW_UI_IMAGE=apache/skywalking-ui:9.5.0
export SW_ROVER_IMAGE=apache/skywalking-rover:0.5.0

export FEATURE_FLAGS=mesh-with-agent,single-node,elasticsearch,rover
make deploy.kubernetes

After deployment is complete, please run the following script to open SkyWalking UI: http://localhost:8080/.

kubectl port-forward svc/ui 8080:8080 --namespace default

Create Continuous Profiling Policy

Currently, continues profiling feature is set by default in the Service Mesh panel at the Service level.

Figure 3: Continuous Policy Tab

By clicking on the edit button aside from the Policy List, the polices of current service could be created or updated.

Figure 4: Edit Continuous Profiling Policy

Multiple polices are supported. Every policy has the following configurations.

Target Type: Specifies the type of profiling task to be triggered when the threshold determination is met.
Items: For profiling task of the same target, one or more validation items can be specified. As long as one validation item meets the threshold determination, the corresponding performance analysis task will be launched.
1. Monitor Type: Specifies the type of monitoring to be carried out for the target process.
2. Threshold: Depending on the type of monitoring, you need to fill in the corresponding threshold to complete the verification work.
3. Period: Specifies the number of recent seconds of data you want to monitor.
4. Count: Determines the total number of seconds triggered within the recent period.
5. URI Regex/List: This is applicable to HTTP monitoring types, allowing URL filtering.

Done

After clicking the save button, you can see the currently created monitoring rules, as shown in the figure 5:

Figure 5: Continuous Profiling Monitoring Processes

The data can be divided into the following parts:

Policy list: On the left, you can see the rule list you have created.
Monitoring Summary List: Once a rule is selected, you can see which pods and processes would be monitored by this rule. It also summarizes how many profiling tasks have been triggered in the last 48 hours by the current pod or process, as well as the last trigger time. This list is also sorted in descending order by the number of triggers to facilitate your quick review.

When you click on a specific process, a new dashboard would show to list metrics and triggered profiling results.

Figure 6: Continuous Profiling Triggered Tasks

The current figure contains the following data contents:

Task Timeline: It lists all profiling tasks in the past 48 hours. And when the mouse hovers over a task, it would also display detailed information:
1. Task start and end time: It indicates when the current performance analysis task was triggered.
2. Trigger reason: It would display the reason why the current process was profiled and list out the value of the metric exceeding the threshold when the profiling was triggered. so you can quickly understand the reason.
Task Detail: Similar to the CPU Profiling and Network Profiling introduced in previous articles, this would display the flame graph or process topology map of the current task, depending on the profiling type.

Meanwhile, on the Metrics tab, metrics relative to profiling policies are collected to retrieve the historical trend, in order to provide a comprehensive explanation of the trigger point about the profiling.

Figure 7: Continuous Profiling Metrics

Conclusion

In this article, I have detailed how the continuous profiling feature in SkyWalking and eBPF works. In general, it involves deploying the eBPF Agent service on the same machine where the process to be monitored resides, and monitoring the target process with low resource consumption. When it meets the threshold conditions, it would initiate more complex CPU Profiling and Network Profiling tasks.

In the future, we will offer even more features. Stay tuned!

Twitter, ASFSkyWalking
Slack. Send Request to join SkyWalking slack mail to the mail list(dev@skywalking.apache.org), we will invite you in.
Subscribe to our medium list.

Zh: 自动化性能分析——持续剖析

Sun, 25 Jun 2023 00:00:00 +0000

背景

在之前的文章中，我们讨论了如何使用 SkyWalking 和 eBPF 来检测性能问题，包括进程和网络。这些方法可以很好地定位问题，但仍然存在一些挑战：

任务启动的时间: 当需要进行性能监控时，解决需要性能监控的进程始终是一个挑战。通常需要手动参与，以标识进程和所需的性能分析类型，这会在崩溃恢复期间耗费额外的时间。根本原因定位和崩溃恢复时间有时会发生冲突。在实际情况中，重新启动可能是恢复的第一选择，同时也会破坏崩溃的现场。
任务的资源消耗: 确定分析范围的困难。过宽的分析范围会导致需要更多的资源。我们需要一种方法来管理资源消耗并了解哪些进程需要性能分析。
工程师能力: 通常由整个团队负责呼叫，其中有初级和高级工程师，即使是高级工程师也对复杂的分布式系统有其理解限制，单个人几乎无法理解整个系统。

持续剖析（Continuous Profiling） 是解决上述问题的新机制。

自动剖析

由于性能分析的资源消耗和高经验要求，因此引入一种方法以缩小范围并由高级 SRE 工程师创建策略自动剖析。因此，在 9.5.0 中，SkyWalking 首先引入了预设策略规则，以低功耗方式监视特定服务的 eBPF 代理，并在必要时自动运行剖析。

策略

策略规则指定了如何监视目标进程并确定在满足某些阈值条件时应启动何种类型的分析任务。

这些策略规则主要包括以下配置信息：

监测类型: 这指定了应在目标进程上实施什么样的监测。
阈值确定: 这定义了如何确定目标进程是否需要启动分析任务。
触发任务: 这指定了应启动什么类型的性能分析任务。

监测类型

监测类型是通过观察指定进程的数据值来生成相应的指标来确定的。这些指标值可以促进后续的阈值判断操作。在 eBPF 观测中，我们认为以下指标最能直接反映程序的当前性能：

监测类型	单位	描述
系统负载	负载	在指定时间段内的系统负载平均值。
进程 CPU	百分比	进程的 CPU 使用率百分比。
进程线程计数	计数	进程中的线程数。
HTTP 错误率	百分比	导致错误响应（例如，4xx 或 5xx 状态代码）的 HTTP 请求的百分比。
HTTP 平均响应时间	毫秒	HTTP 请求的平均响应时间。

指标收集器

eBPF 代理会定期报告以下进程度量，以指示进程性能：

名称	单位	描述
process_cpu	(0-100)%	CPU 使用率百分比
process_thread_count	计数	进程中的线程数
system_load	计数	最近一分钟的平均系统负载，每个进程的值相同
http_error_rate	(0-100)%	网络请求错误率百分比
http_avg_response_time	毫秒	网络平均响应持续时间

阈值确定

对于阈值的确定，eBPF 代理是基于其自身内存中的目标监测进程进行判断，而不是依赖于 SkyWalking 后端执行的计算。这种方法的优点在于，它不必等待复杂后端计算的结果，减少了复杂交互所带来的潜在问题。

通过使用此方法，eBPF 代理可以在条件满足后立即启动任务，而无需任何延迟。

它包括以下配置项：

阈值: 检查监测值是否符合指定的期望值。
周期: 监控数据的时间周期（秒），也可以理解为最近的持续时间。
计数: 检测期间触发阈值的次数（秒），也可以理解为最近持续时间内指定阈值规则触发的总次数（秒）。一旦满足计数检查，指定的分析任务将被开始。

触发任务

当 eBPF Agent 检测到指定策略中的阈值决策符合规则时，根据预配置的规则可以启动相应的任务。对于每个不同的目标性能任务，它们的任务启动参数都不同：

On/Off CPU Profiling: 它会自动对符合条件的进程进行性能分析，缺省情况下监控时间为 10 分钟。
Network Profiling: 它会对当前机器上同一 Service Instance 中的所有进程进行网络性能分析，以防问题的原因因被收集进程太少而无法实现，缺省情况下监控时间为 10 分钟。

一旦任务启动，当前进程将在一定时间内不会启动新的剖析任务。主要原因是为了防止因低阈值设置而频繁创建任务，从而影响程序执行。缺省时间为 20 分钟。

数据流

图 1 展示了持续剖析功能的数据流：

图 1: 持续剖析的数据流

eBPF Agent进行进程跟踪

首先，我们需要确保 eBPF Agent 和要监测的进程部署在同一台主机上，以便我们可以从进程中收集相关数据。当 eBPF Agent 检测到符合策略的阈值验证规则时，它会立即为目标进程触发剖析任务，从而减少任何中间步骤并加速定位性能问题的能力。

滑动窗口

滑动窗口在 eBPF Agent 的阈值决策过程中发挥着至关重要的作用，如图 2 所示：

图 2: eBPF Agent 中的滑动窗口

数组中的每个元素表示指定时间内的数据值。当滑动窗口需要验证是否负责某个规则时，它从最近的一定数量的元素 (period 参数) 中获取每个元素的内容。如果一个元素超过了阈值，则标记为红色并计数。如果红色元素的数量超过一定数量，则被认为触发了任务。

使用滑动窗口具有以下两个优点：

快速检索最近的内容：使用滑动窗口，无需进行复杂的计算。你可以通过简单地读取一定数量的最近数组元素来了解数据。
解决数据峰值问题：通过计数进行验证，可以避免数据点突然增加然后快速返回正常的情况。使用多个值进行验证可以揭示超过阈值是频繁还是偶然发生的。

eBPF Agent与OAP后端通讯

eBPF Agent 定期与 SkyWalking 后端通信，涉及三个最关键的操作：

策略同步：通过定期的策略同步，eBPF Agent 可以尽可能地让本地机器上的进程与最新的策略规则保持同步。
指标发送：对于已经被监视的进程，eBPF Agent 定期将收集到的数据发送到后端程序。这就使用户能够实时查询当前数据值，用户也可以在出现问题时将此数据与历史值或阈值进行比较。
剖析任务报告：当 eBPF 检测到某个进程触发了策略规则时，它会自动启动性能任务，从当前进程收集相关信息，并将其报告给 SkyWalking 后端。这使用户可以从界面了解何时、为什么和触发了什么类型的剖析任务。

演示

接下来，让我们快速演示持续剖析功能，以便你更具体地了解它的功能。

部署 SkyWalking Showcase

SkyWalking Showcase 包含完整的示例服务，并可以使用 SkyWalking 进行监视。有关详细信息，请查看官方文档。

在此演示中，我们只部署服务、最新发布的 SkyWalking OAP 和 UI。

export SW_OAP_IMAGE=apache/skywalking-oap-server:9.5.0
export SW_UI_IMAGE=apache/skywalking-ui:9.5.0
export SW_ROVER_IMAGE=apache/skywalking-rover:0.5.0

export FEATURE_FLAGS=mesh-with-agent,single-node,elasticsearch,rover
make deploy.kubernetes

部署完成后，请运行以下脚本以打开 SkyWalking UI：http://localhost:8080/。

kubectl port-forward svc/ui 8080:8080 --namespace default

创建持续剖析策略

目前，持续剖析功能在 Service Mesh 面板的 Service 级别中默认设置。

图 3: 持续策略选项卡

通过点击 Policy List 旁边的编辑按钮，可以创建或更新当前服务的策略。

图 4: 编辑持续剖析策略

支持多个策略。每个策略都有以下配置。

Target Type：指定符合阈值决策时要触发的剖析任务的类型。
Items：对于相同目标的剖析任务，可以指定一个或多个验证项目。只要一个验证项目符合阈值决策，就会启动相应的性能分析任务。
1. Monitor Type：指定要为目标进程执行的监视类型。
2. Threshold：根据监视类型的不同，需要填写相应的阈值才能完成验证工作。
3. Period：指定你要监测的最近几秒钟的数据数量。
4. Count：确定最近时间段内触发的总秒数。
5. URI 正则表达式/列表：这适用于 HTTP 监控类型，允许 URL 过滤。

完成

单击保存按钮后，你可以看到当前已创建的监控规则，如图 5 所示：

图 5: 持续剖析监控进程

数据可以分为以下几个部分：

策略列表：在左侧，你可以看到已创建的规则列表。
监测摘要列表：选择规则后，你可以看到哪些 pod 和进程将受到该规则的监视。它还总结了当前 pod 或进程在过去 48 小时内触发的性能分析任务数量，以及最后一个触发时间。该列表还按触发次数降序排列，以便你快速查看。

当你单击特定进程时，将显示一个新的仪表板以列出指标和触发的剖析结果。

图 6: 持续剖析触发的任务

当前图包含以下数据内容：

任务时间轴：它列出了过去 48 小时的所有剖析任务。当鼠标悬停在任务上时，它还会显示详细信息：
1. 任务的开始和结束时间：它指示当前性能分析任务何时被触发。
2. 触发原因：它会显示为什么会对当前进程进行剖析，并列出当剖析被触发时超过阈值的度量值，以便你快速了解原因。
任务详情：与前几篇文章介绍的 CPU 剖析和网络剖析类似，它会显示当前任务的火焰图或进程拓扑图，具体取决于剖析类型。

同时，在 Metrics 选项卡中，收集与剖析策略相关的指标以检索历史趋势，以便在剖析的触发点提供全面的解释。

图 7: 持续剖析指标

结论

在本文中，我详细介绍了 SkyWalking 和 eBPF 中持续剖析功能的工作原理。通常情况下，它涉及将 eBPF Agent 服务部署在要监视的进程所在的同一台计算机上，并以低资源消耗监测目标进程。当它符合阈值条件时，它会启动更复杂的 CPU 剖析和网络剖析任务。

在未来，我们将提供更多功能。敬请期待！

Twitter：ASFSkyWalking
Slack：向邮件列表 (dev@skywalking.apache.org) 发送“Request to join SkyWalking Slack”，我们会邀请你加入。
订阅我们的 Medium 列表。

Blog: Pinpoint Service Mesh Critical Performance Impact by using eBPF

Tue, 05 Jul 2022 00:00:00 +0000

Content

Background

Apache SkyWalking observes metrics, logs, traces, and events for services deployed into the service mesh. When troubleshooting, SkyWalking error analysis can be an invaluable tool helping to pinpoint where an error occurred. However, performance problems are more difficult: It’s often impossible to locate the root cause of performance problems with pre-existing observation data. To move beyond the status quo, dynamic debugging and troubleshooting are essential service performance tools. In this article, we’ll discuss how to use eBPF technology to improve the profiling feature in SkyWalking and analyze the performance impact in the service mesh.

Trace Profiling in SkyWalking

Since SkyWalking 7.0.0, Trace Profiling has helped developers find performance problems by periodically sampling the thread stack to let developers know which lines of code take more time. However, Trace Profiling is not suitable for the following scenarios:

Thread Model: Trace Profiling is most useful for profiling code that executes in a single thread. It is less useful for middleware that relies heavily on async execution models. For example Goroutines in Go or Kotlin Coroutines.
Language: Currently, Trace Profiling is only supported in Java and Python, since it’s not easy to obtain the thread stack in the runtimes of some languages such as Go and Node.js.
Agent Binding: Trace Profiling requires Agent installation, which can be tricky depending on the language (e.g., PHP has to rely on its C kernel; Rust and C/C++ require manual instrumentation to make install).
Trace Correlation: Since Trace Profiling is only associated with a single request it can be hard to determine which request is causing the problem.
Short Lifecycle Services: Trace Profiling doesn’t support short-lived services for (at least) two reasons:
1. It’s hard to differentiate system performance from class code manipulation in the booting stage.
2. Trace profiling is linked to an endpoint to identify performance impact, but there is no endpoint to match these short-lived services.

Fortunately, there are techniques that can go further than Trace Profiling in these situations.

Introduce eBPF

We have found that eBPF — a technology that can run sandboxed programs in an operating system kernel and thus safely and efficiently extend the capabilities of the kernel without requiring kernel modifications or loading kernel modules — can help us fill gaps left by Trace Profiling. eBPF is a trending technology because it breaks the traditional barrier between user and kernel space. Programs can now inject bytecode that runs in the kernel, instead of having to recompile the kernel to customize it. This is naturally a good fit for observability.

In the figure below, we can see that when the system executes the execve syscalls, the eBPF program is triggered, and the current process runtime information is obtained by using function calls.

Using eBPF technology, we can expand the scope of Skywalking’s profiling capabilities:

Global Performance Analysis: Before eBPF, data collection was limited to what agents can observe. Since eBPF programs run in the kernel, they can observe all threads. This is especially useful when you are not sure whether a performance problem is caused by a particular request.
Data Content: eBPF can dump both user and kernel space thread stacks, so if a performance issue happens in kernel space, it’s easier to find.
Agent Binding: All modern Linux kernels support eBPF, so there is no need to install anything. This means it is an orchestration-free vs an agent model. This reduces friction caused by built-in software which may not have the correct agents installed, such as Envoy in a Service Mesh.
Sampling Type: Unlike Trace Profiling, eBPF is event-driven and, therefore, not constrained by interval polling. For example, eBPF can trigger events and collect more data depending on a transfer size threshold. This can allow the system to triage and prioritize data collection under extreme load.

eBPF Limitations

While eBPF offers significant advantages for hunting performance bottlenecks, no technology is perfect. eBPF has a number of limitations described below. Fortunately, since SkyWalking does not require eBPF, the impact is limited.

Linux Version Requirement: eBPF programs require a Linux kernel version above 4.4, with later kernel versions offering more data to be collected. The BCC has documented the features supported by different Linux kernel versions, with the differences between versions usually being what data can be collected with eBPF.
Privileges Required: All processes that intend to load eBPF programs into the Linux kernel must be running in privileged mode. As such, bugs or other issues in such code may have a big impact.
Weak Support for Dynamic Language: eBPF has weak support for JIT-based dynamic languages, such as Java. It also depends on what data you want to collect. For Profiling, eBPF does not support parsing the symbols of the program, which is why most eBPF-based profiling technologies only support static languages like C, C++, Go, and Rust. However, symbol mapping can sometimes be solved through tools provided by the language. For example, in Java, perf-map-agent can be used to generate the symbol mapping. However, dynamic languages don’t support the attach (uprobe) functionality that would allow us to trace execution events through symbols.

Introducing SkyWalking Rover

SkyWalking Rover introduces the eBPF profiling feature into the SkyWalking ecosystem. The figure below shows the overall architecture of SkyWalking Rover. SkyWalking Rover is currently supported in Kubernetes environments and must be deployed inside a Kubernetes cluster. After establishing a connection with the SkyWalking backend server, it saves information about the processes on the current machine to SkyWalking. When the user creates an eBPF profiling task via the user interface, SkyWalking Rover receives the task and executes it in the relevant C, C++, Golang, and Rust language-based programs.

Other than an eBPF-capable kernel, there are no additional prerequisites for deploying SkyWalking Rover.

CPU Profiling with Rover

CPU profiling is the most intuitive way to show service performance. Inspired by Brendan Gregg‘s blog post, we’ve divided CPU profiling into two types that we have implemented in Rover:

On-CPU Profiling: Where threads are spending time running on-CPU.
Off-CPU Profiling: Where time is spent waiting while blocked on I/O, locks, timers, paging/swapping, etc.

Profiling Envoy with eBPF

Envoy is a popular proxy, used as the data plane by the Istio service mesh. In a Kubernetes cluster, Istio injects Envoy into each service’s pod as a sidecar where it transparently intercepts and processes incoming and outgoing traffic. As the data plane, any performance issues in Envoy can affect all service traffic in the mesh. In this scenario, it’s more powerful to use eBPF profiling to analyze issues in production caused by service mesh configuration.

Demo Environment

If you want to see this scenario in action, we’ve built a demo environment where we deploy an Nginx service for stress testing. Traffic is intercepted by Envoy and forwarded to Nginx. The commands to install the whole environment can be accessed through GitHub.

On-CPU Profiling

On-CPU profiling is suitable for analyzing thread stacks when service CPU usage is high. If the stack is dumped more times, it means that the thread stack occupies more CPU resources.

When installing Istio using the demo configuration profile, we found there are two places where we can optimize performance:

Zipkin Tracing: Different Zipkin sampling percentages have a direct impact on QPS.
Access Log Format: Reducing the fields of the Envoy access log can improve QPS.

Zipkin Tracing

Zipkin with 100% sampling

In the default demo configuration profile, Envoy is using 100% sampling as default tracing policy. How does that impact the performance?

As shown in the figure below, using the on-CPU profiling, we found that it takes about 16% of the CPU overhead. At a fixed consumption of 2 CPUs, its QPS can reach 5.7K.

Disable Zipkin tracing

At this point, we found that if Zipkin is not necessary, the sampling percentage can be reduced or we can even disable tracing. Based on the Istio documentation, we can disable tracing when installing the service mesh using the following command:

istioctl install -y --set profile=demo \
   --set 'meshConfig.enableTracing=false' \
   --set 'meshConfig.defaultConfig.tracing.sampling=0.0'

After disabling tracing, we performed on-CPU profiling again. According to the figure below, we found that Zipkin has disappeared from the flame graph. With the same 2 CPU consumption as in the previous example, the QPS reached 9K, which is an almost 60% increase.

Tracing with Throughput

With the same CPU usage, we’ve discovered that Envoy performance greatly improves when the tracing feature is disabled. Of course, this requires us to make trade-offs between the number of samples Zipkin collects and the desired performance of Envoy (QPS).

The table below illustrates how different Zipkin sampling percentages under the same CPU usage affect QPS.

Zipkin sampling %	QPS	CPUs	Note
100% (default)	5.7K	2	16% used by Zipkin
1%	8.1K	2	0.3% used by Zipkin
disabled	9.2K	2	0% used by Zipkin

Access Log Format

Default Log Format

In the default demo configuration profile, the default Access Log format contains a lot of data. The flame graph below shows various functions involved in parsing the data such as request headers, response headers, and streaming the body.

Simplifying Access Log Format

Typically, we don’t need all the information in the access log, so we can often simplify it to get what we need. The following command simplifies the access log format to only display basic information:

istioctl install -y --set profile=demo \
   --set meshConfig.accessLogFormat="[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE%\n"

After simplifying the access log format, we found that the QPS increased from 5.7K to 5.9K. When executing the on-CPU profiling again, the CPU usage of log formatting dropped from 2.4% to 0.7%.

Simplifying the log format helped us to improve the performance.

Off-CPU Profiling

Off-CPU profiling is suitable for performance issues that are not caused by high CPU usage. For example, when there are too many threads in one service, using off-CPU profiling could reveal which threads spend more time context switching.

We provide data aggregation in two dimensions:

Switch count: The number of times a thread switches context. When the thread returns to the CPU, it completes one context switch. A thread stack with a higher switch count spends more time context switching.
Switch duration: The time it takes a thread to switch the context. A thread stack with a higher switch duration spends more time off-CPU.

Write Access Log

Enable Write

Using the same environment and settings as before in the on-CPU test, we performed off-CPU profiling. As shown below, we found that access log writes accounted for about 28% of the total context switches. The “__write” shown below also indicates that this method is the Linux kernel method.

Disable Write

SkyWalking implements Envoy’s Access Log Service (ALS) feature which allows us to send access logs to the SkyWalking Observability Analysis Platform (OAP) using the gRPC protocol. Even by disabling the access logging, we can still use ALS to capture/aggregate the logs. We’ve disabled writing to the access log using the following command:

istioctl install -y --set profile=demo --set meshConfig.accessLogFile=""

After disabling the Access Log feature, we performed the off-CPU profiling. File writing entries have disappeared as shown in the figure below. Envoy throughput also increased from 5.7K to 5.9K.

Conclusion

In this article, we’ve examined the insights Apache Skywalking’s Trace Profiling can give us and how much more can be achieved with eBPF profiling. All of these features are implemented in skywalking-rover. In addition to on- and off-CPU profiling, you will also find the following features:

Continuous profiling, helps you automatically profile without manual intervention. For example, when Rover detects that the CPU exceeds a configurable threshold, it automatically executes the on-CPU profiling task.
More profiling types to enrich usage scenarios, such as network, and memory profiling.

Blog: SkyWalking Python Agent Supports Profiling Now

Sun, 12 Sep 2021 00:00:00 +0000

The Java Agent of Apache SkyWalking has supported profiling since v7.0.0, and it enables users to troubleshoot the root cause of performance issues, and now we bring it into Python Agent. In this blog, we will show you how to use it, and we will introduce the mechanism of profiling.

How to use profiling in Python Agent

This feature is released in Python Agent at v0.7.0. It is turned on by default, so you don’t need any extra configuration to use it. You can find the environment variables about it here.

Here are the demo codes of an intentional slow application.

import time

def method1():
    time.sleep(0.02)
    return '1'

def method2():
    time.sleep(0.02)
    return method1()

def method3():
    time.sleep(0.02)
    return method2()

if __name__ == '__main__':
    import socketserver
    from http.server import BaseHTTPRequestHandler

    class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

        def do_POST(self):
            method3()
            time.sleep(0.5)
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write('{"song": "Despacito", "artist": "Luis Fonsi"}'.encode('ascii'))

    PORT = 19090
    Handler = SimpleHTTPRequestHandler

    with socketserver.TCPServer(("", PORT), Handler) as httpd:
        httpd.serve_forever()

We can start it with SkyWalking Python Agent CLI without changing any application code now, which is also the latest feature of v0.7.0. We just need to add sw-python run before our start command(i.e. sw-python run python3 main.py), to start the application with python agent attached. More information about sw-python can be found there.

Then, we should add a new profile task for the / endpoint from the SkyWalking UI, as shown below.

We can access it by curl -X POST http://localhost:19090/, after that, we can view the result of this profile task on the SkyWalking UI.

The mechanism of profiling

When a request lands on an application with the profile function enabled, the agent begins the profiling automatically if the request’s URI is as required by the profiling task. A new thread is spawned to fetch the thread dump periodically until the end of request.

The agent sends these thread dumps, called ThreadSnapshot, to SkyWalking OAPServer, and the OAPServer analyzes those ThreadSnapshot(s) and gets the final result. It will take a method invocation with the same stack depth and code signature as the same operation, and estimate the execution time of each method from this.

Let’s demonstrate how this analysis works through the following example. Suppose we have such a program below and we profile it at 10ms intervals.

def main():
    methodA()

def methodA():
    methodB()

def methodB():
    methodC()
    methodD()

def methodC():
    time.sleep(0.04)

def methodD():
    time.sleep(0.06)

The agent collects a total of 10 ThreadSnapShot(s) over the entire time period(Diagram A). The first 4 snapshots represent the thread dumps during the execution of function C, and the last 6 snapshots represent the thread dumps during the execution of function D. After the analysis of OAPServer, we can see the result of this profile task on the SkyWalking Rocketbot UI as shown in the right of the diagram. With this result, we can clearly see the function call relationship and the time consumption situation of this program.

Diagram A

You can read more details of profiling theory from this blog.

We hope you enjoy the profile in the Python Agent, and if so, you can give us a star on Python Agent and SkyWalking on GitHub.

Blog: Apache SkyWalking: Use Profiling to Fix the Blind Spot of Distributed Tracing

Mon, 13 Apr 2020 00:00:00 +0000

This post originally appears on The New Stack

This post introduces a way to automatically profile code in production with Apache SkyWalking. We believe the profile method helps reduce maintenance and overhead while increasing the precision in root cause analysis.

Limitations of the Distributed Tracing

In the early days, metrics and logging systems were the key solutions in monitoring platforms. With the adoption of microservice and distributed system-based architecture, distributed tracing has become more important. Distributed tracing provides relevant service context, such as system topology map and RPC parent-child relationships.

Some claim that distributed tracing is the best way to discover the cause of performance issues in a distributed system. It’s good at finding issues at the RPC abstraction, or in the scope of components instrumented with spans. However, it isn’t that perfect.

Have you been surprised to find a span duration longer than expected, but no insight into why? What should you do next? Some may think that the next step is to add more instrumentation, more spans into the trace, thinking that you would eventually find the root cause, with more data points. We’ll argue this is not a good option within a production environment. Here’s why:

There is a risk of application overhead and system overload. Ad-hoc spans measure the performance of specific scopes or methods, but picking the right place can be difficult. To identify the precise cause, you can “instrument” (add spans to) many suspicious places. The additional instrumentation costs more CPU and memory in the production environment. Next, ad-hoc instrumentation that didn’t help is often forgotten, not deleted. This creates a valueless overhead load. In the worst case, excess instrumentation can cause performance problems in the production app or overload the tracing system.
The process of ad-hoc (manual) instrumentation usually implies at least a restart. Trace instrumentation libraries, like Zipkin Brave, are integrated into many framework libraries. To instrument a method’s performance typically implies changing code, even if only an annotation. This implies a re-deploy. Even if you have the way to do auto instrumentation, like Apache SkyWalking, you still need to change the configuration and reboot the app. Otherwise, you take the risk of GC caused by hot dynamic instrumentation.
Injecting instrumentation into an uninstrumented third party library is hard and complex. It takes more time and many won’t know how to do this.
Usually, we don’t have code line numbers in the distributed tracing. Particularly when lambdas are in use, it can be difficult to identify the line of code associated with a span. Regardless of the above choices, to dive deeper requires collaboration with your Ops or SRE team, and a shared deep level of knowledge in distributed tracing.

Regardless of the above choices, to dive deeper requires collaboration with your Ops or SRE team, and a shared deep level of knowledge in distributed tracing.

Profiling in Production

Introduction

To reuse distributed tracing to achieve method scope precision requires an understanding of the above limitations and a different approach. We called it PROFILE.

Most high-level languages build and run on a thread concept. The profile approach takes continuous thread dumps. We merge the thread dumps to estimate the execution time of every method shown in the thread dumps. The key for distributed tracing is the tracing context, identifiers active (or current) for the profiled method. Using this trace context, we can weave data harvested from profiling into existing traces. This allows the system to automate otherwise ad-hoc instrumentation. Let’s dig deeper into how profiling works:

We consider a method invocation with the same stack depth and signature (method, line number etc), the same operation. We derive span timestamps from the thread dumps the same operation is in. Let’s put this visually:

Above, represents 10 successive thread dumps. If this method is in dumps 4-8, we assume it started before dump 4 and finished after dump 8. We can’t tell exactly when the method started and stopped. but the timestamps of thread dumps are close enough.

To reduce overhead caused by thread dumps, we only profile methods enclosed by a specific entry point, such as a URI or MVC Controller method. We identify these entry points through the trace context and the APM system.

The profile does thread dump analysis and gives us:

The root cause, precise to the line number in the code.
Reduced maintenance as ad-hoc instrumentation is obviated.
Reduced overload risk caused by ad-hoc instrumentation.
Dynamic activation: only when necessary and with a very clear profile target.

Implementing Precise Profiling with Apache SkyWalking 7

Distributed profiling is built-into Apache SkyWalking application performance monitoring (APM). Let’s demonstrate how the profiling approach locates the root cause of the performance issue.

final CountDownLatchcountDownLatch= new CountDownLatch(2);
 
threadPool.submit(new Task1(countDownLatch));
threadPool.submit(new Task2(countDownLatch));
 
try {
   countDownLatch.await(500, TimeUnit.MILLISECONDS);
} catch (InterruptedExceptione) {
}

Task1 and Task2 have a race condition and unstable execution time: they will impact the performance of each other and anything calling them. While this code looks suspicious, it is representative of real life. People in the OPS/SRE team are not usually aware of all code changes and who did them. They only know something in the new code is causing a problem.

To make matters interesting, the above code is not always slow: it only happens when the condition is locked. In SkyWalking APM, we have metrics of endpoint p99/p95 latency, so, we are easy to find out the p99 of this endpoint is far from the avg response time. However, this is not the same as understanding the cause of the latency. To locate the root cause, add a profile condition to this endpoint: duration greater than 500ms. This means faster executions will not add profiling load.

This is a typical profiled trace segment (part of the whole distributed trace) shown on the SkyWalking UI. We now notice the “service/processWithThreadPool” span is slow as we expected, but why? This method is the one we added the faulty code to. As the UI shows that method, we know the profiler is working. Now, let’s see what the profile analysis result say.

This is the profile analysis stack view. We see the stack element names, duration (include/exclude the children) and slowest methods have been highlighted. It shows clearly, “sun.misc.Unsafe.park” costs the most time. If we look for the caller, it is the code we added: CountDownLatch.await.

The Limitations of the Profile Method

No diagnostic tool can fit all cases, not even the profile method.

The first consideration is mistaking a repeatedly called method for a slow method. Thread dumps are periodic. If there is a loop of calling one method, the profile analysis result would say the target method is slow because it is captured every time in the dump process. There could be another reason. A method called many times can also end up captured in each thread dump. Even so, the profile did what it is designed for. It still helps the OPS/SRE team to locate the code having the issue.

The second consideration is overhead, the impact of repeated thread dumps is real and can’t be ignored. In SkyWalking, we set the profile dump period to at least 10ms. This means we can’t locate method performance issues if they complete in less than 10ms. SkyWalking has a threshold to control the maximum parallel degree as well.

Understanding the above keeps distributed tracing and APM systems useful for your OPS/SRE team.

How to Try This

Everything we discussed, including the Apache SkyWalking Java Agent, profile analysis code, and UI, could be found in our GitHub repository. We hope you enjoyed this new profile method, and love Apache SkyWalking. If so, give us a star on GitHub to encourage us.

SkyWalking 7 has just been released. You can contact the project team through the following channels:

Follow SkyWalking twitter.
Subscribe mailing list: dev@skywalking.apache.org. Send to dev-subscribe@kywalking.apache.org to subscribe to the mail list.

Co-author Sheng Wu is a Tetrate founding engineer and the founder and VP of Apache SkyWalking. He is solving the problem of observability for large-scale service meshes in hybrid and multi-cloud environments.

Adrian Cole works in the Spring Cloud team at VMware, mostly on Zipkin

Han Liu is a tech expert at Lagou. He is an Apache SkyWalking committer

Apache SkyWalking – Profiling

Blog: Activating Automatical Performance Analysis -- Continuous Profiling

Background

Automate Profiling

Policy

Monitoring type

Network related monitoring

Metrics collector

Threshold determination

Trigger task

Data Flow

eBPF Agent with Process

Sliding window

eBPF Agent with SkyWalking Backend

Demo

Deploy SkyWalking Showcase

Create Continuous Profiling Policy

Done

Conclusion

Zh: 自动化性能分析——持续剖析

背景

自动剖析

策略

监测类型

相关网络监测

指标收集器

阈值确定

触发任务

数据流

eBPF Agent进行进程跟踪

滑动窗口

eBPF Agent与OAP后端通讯

演示

部署 SkyWalking Showcase

创建持续剖析策略

完成

结论

Blog: Pinpoint Service Mesh Critical Performance Impact by using eBPF

Content

Background

Trace Profiling in SkyWalking

Introduce eBPF

eBPF Limitations

Introducing SkyWalking Rover

CPU Profiling with Rover

Profiling Envoy with eBPF

Demo Environment

On-CPU Profiling

Zipkin Tracing

Zipkin with 100% sampling

Disable Zipkin tracing

Tracing with Throughput

Access Log Format

Default Log Format

Simplifying Access Log Format

Off-CPU Profiling

Write Access Log

Enable Write

Disable Write

Conclusion

Blog: SkyWalking Python Agent Supports Profiling Now

How to use profiling in Python Agent

The mechanism of profiling

Blog: Apache SkyWalking: Use Profiling to Fix the Blind Spot of Distributed Tracing

Limitations of the Distributed Tracing

Profiling in Production

Introduction

Implementing Precise Profiling with Apache SkyWalking 7

The Limitations of the Profile Method

How to Try This