使用 gdb 解析 ppp driver crash log
由于项目整合,经过一次大版本升级后的项目,继承了大家族中许多新的特性,然而在正常功能测试中崩溃了,而且还是kernel crash, 会导致reboot的那种。本文就此问题重现、调试分析过程予以归纳总结。
问题描述
PPTP 拨号上网模式下,Router 在添加特定静态路由后crash. 添加路由的原因在本文讨论内容中不重要,因此略过。
问题再现
实际测试过程中寻找重现方法耗时较长,经过大量实验后简化如下:
- 连接 WAN Port
- PPTP 拨号上网
拨号成功后,查看 WAN
口及 ppp0
的 IP.
root@model:$ ifconfig brwan
brwan Link encap:Ethernet HWaddr 00:13:2F:34:42:59
inet addr:10.0.0.84 Bcast:10.156.23.255 Mask:255.255.254.0
inet6 addr: fe80::203:7fff:fe94:229/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3547 errors:0 dropped:541 overruns:0 frame:0
TX packets:174 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:354037 (345.7 KiB) TX bytes:16378 (15.9 KiB)
root@model:$ ifconfig ppp0
ppp0 Link encap:Point-to-Point Protocol
inet addr:192.168.0.234 P-t-P:192.168.0.1 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1436 Metric:1
RX packets:5 errors:0 dropped:0 overruns:0 frame:0
TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:3
RX bytes:84 (84.0 B) TX bytes:233 (233.0 B)
确认拨号成功,同时查看一下当前路由。
root@model:$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 0.0.0.0 0.0.0.0 U 0 0 0 ppp0
10.0.0.1 10.0.23.254 255.255.255.255 UGH 0 0 0 brwan
10.0.0.2 10.0.23.254 255.255.255.255 UGH 0 0 0 brwan
10.0.22.0 0.0.0.0 255.255.254.0 U 0 0 0 brwan
192.168.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 ppp0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 br0
- 添加以下静态路由
route add -net 10.0.23.10 netmask 255.255.255.255 dev ppp0
再次查看路由,可以看到静态路由已经添加成功。
root@model:$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 0.0.0.0 0.0.0.0 U 0 0 0 ppp0
10.0.0.1 10.0.23.254 255.255.255.255 UGH 0 0 0 brwan
10.0.0.2 10.0.23.254 255.255.255.255 UGH 0 0 0 brwan
10.0.22.0 0.0.0.0 255.255.254.0 U 0 0 0 brwan
10.0.23.10 0.0.0.0 255.255.255.255 UH 0 0 0 ppp0
192.168.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 ppp0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 br0
前面为什么说是特定路由呢?因为 10.0.23.10
正是 PPTP Server 的 IP 地址,不加该路由前,访问 PPTP Server 的数据包会从 brwan
出去,添加后将从 ppp0
出去,这是个关键点。
添加路由后,等待10秒左右,kernel 将会 crash, 并且会打印出以下 crash log, 这也是本文的研究对象。
INFO: rcu_preempt self-detected stall on CPU { 0} (t=2100 jiffies g=4282090 c=4282089 q=40763)
CPU: 0 PID: 10879 Comm: pppd Tainted: P 3.14.77 #2
[<c021eb54>] (unwind_backtrace) from [<c021bc4c>] (show_stack+0x10/0x14)
[<c021bc4c>] (show_stack) from [<c03b2b08>] (dump_stack+0x78/0x98)
INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 1, t=2102 jiffies, g=4282090, c=4282089, q=40778)
Task dump for CPU 0:
pppd R running 0 10879 1 0x00000002
[<c020fc4c>] (__schedule) from [<d922e000>] (0xd922e000)
[<c03b2b08>] (dump_stack) from [<c026e390>] (rcu_check_callbacks+0x230/0x6a0)
[<c026e390>] (rcu_check_callbacks) from [<c02391c0>] (update_process_times+0x38/0x58)
[<c02391c0>] (update_process_times) from [<c0276a58>] (tick_sched_timer+0x44/0x74)
[<c0276a58>] (tick_sched_timer) from [<c024a874>] (__run_hrtimer+0x50/0xc8)
[<c024a874>] (__run_hrtimer) from [<c024b0b4>] (hrtimer_interrupt+0x12c/0x288)
[<c024b0b4>] (hrtimer_interrupt) from [<c04e2828>] (arch_timer_handler_virt+0x28/0x30)
[<c04e2828>] (arch_timer_handler_virt) from [<c0268d28>] (handle_percpu_devid_irq+0x68/0x84)
[<c0268d28>] (handle_percpu_devid_irq) from [<c0265698>] (generic_handle_irq+0x20/0x30)
[<c0265698>] (generic_handle_irq) from [<c0218e74>] (handle_IRQ+0x68/0x90)
[<c0218e74>] (handle_IRQ) from [<c0208514>] (gic_handle_irq+0x3c/0x5c)
[<c0208514>] (gic_handle_irq) from [<c0209580>] (__irq_svc+0x40/0x70)
Exception stack(0xd922fd58 to 0xd922fda0)
fd40: de01cadc 00000000
fd60: 0000013c 0000013b d7e444c0 de01cae8 de01cadc 00000000 ffffffff ffffffff
fd80: 00000001 d7e44514 00000000 d922fda0 bf11d598 c0212a44 20000013 ffffffff
[<c0209580>] (__irq_svc) from [<c0212a44>] (_raw_spin_lock_bh+0x48/0x5c)
[<c0212a44>] (_raw_spin_lock_bh) from [<bf11d598>] (ppp_release_channels+0xa4/0x5cc [ppp_generic])
[<bf11d598>] (ppp_release_channels [ppp_generic]) from [<bf11f91c>] (ppp_input+0x590/0x658 [ppp_generic])
[<bf11f91c>] (ppp_input [ppp_generic]) from [<bf11fb4c>] (ppp_start_xmit+0x168/0x174 [ppp_generic])
[<bf11fb4c>] (ppp_start_xmit [ppp_generic]) from [<c05305dc>] (dev_hard_start_xmit+0x2d4/0x420)
[<c05305dc>] (dev_hard_start_xmit) from [<c0546d2c>] (sch_direct_xmit+0x6c/0x1b0)
[<c0546d2c>] (sch_direct_xmit) from [<c0530918>] (__dev_queue_xmit+0x1f0/0x418)
[<c0530918>] (__dev_queue_xmit) from [<c0571da0>] (ip_finish_output+0x43c/0x49c)
[<c0571da0>] (ip_finish_output) from [<bf064f7c>] (pptp_xmit+0x49c/0x51c [pptp])
[<bf064f7c>] (pptp_xmit [pptp]) from [<bf120ea4>] (ppp_ioctl+0xe20/0xe7c [ppp_generic])
[<bf120ea4>] (ppp_ioctl [ppp_generic]) from [<bf120ff0>] (ppp_write+0xe0/0x104 [ppp_generic])
[<bf120ff0>] (ppp_write [ppp_generic]) from [<c02c1698>] (vfs_write+0xcc/0x160)
[<c02c1698>] (vfs_write) from [<c02c1ba0>] (SyS_write+0x40/0x80)
[<c02c1ba0>] (SyS_write) from [<c0208c80>] (ret_fast_syscall+0x0/0x38)
BUG: soft lockup - CPU#1 stuck for 22s! [ifconfig:6333]
解析 crash log
针对以上crash log, 我们要找出 Root Cause,需要分析其中的关键信息,然后从代码中找到对应位置。
CPU: 0 PID: 10879 Comm: pppd Tainted: P 3.14.77 #2
首先,最简单是查看用户空间的信息,可以看到,Comm 后输出的是 pppd
, 显然与 pppd
有关,pppd 作为用户空间的一个 Daemon, 用于处理 pptp, l2tp 以及 pppoe 相关功能,并且会与 kernel 中的 ppp driver 通信。
但是使用 strace pid-of-pppd
并不能很好的查找到问题所在,因为 crash 发生时,strace 也无法打印信息。
继续查看后面的log,下面这段log像是进程调度相关的信息,并不能直观看出是什么问题。
[<c020fc4c>] (__schedule) from [<d922e000>] (0xd922e000)
[<c03b2b08>] (dump_stack) from [<c026e390>] (rcu_check_callbacks+0x230/0x6a0)
[<c026e390>] (rcu_check_callbacks) from [<c02391c0>] (update_process_times+0x38/0x58)
[<c02391c0>] (update_process_times) from [<c0276a58>] (tick_sched_timer+0x44/0x74)
[<c0276a58>] (tick_sched_timer) from [<c024a874>] (__run_hrtimer+0x50/0xc8)
[<c024a874>] (__run_hrtimer) from [<c024b0b4>] (hrtimer_interrupt+0x12c/0x288)
[<c024b0b4>] (hrtimer_interrupt) from [<c04e2828>] (arch_timer_handler_virt+0x28/0x30)
[<c04e2828>] (arch_timer_handler_virt) from [<c0268d28>] (handle_percpu_devid_irq+0x68/0x84)
[<c0268d28>] (handle_percpu_devid_irq) from [<c0265698>] (generic_handle_irq+0x20/0x30)
[<c0265698>] (generic_handle_irq) from [<c0218e74>] (handle_IRQ+0x68/0x90)
[<c0218e74>] (handle_IRQ) from [<c0208514>] (gic_handle_irq+0x3c/0x5c)
[<c0208514>] (gic_handle_irq) from [<c0209580>] (__irq_svc+0x40/0x70)
Exception stack(0xd922fd58 to 0xd922fda0)
继续往后看,下面这部分出现了 ppp 相关的信息,其中 ppp_generic 正是 ppp driver 对应的内核模块 ppp_generic.ko
, 而 pptp 对应的是内核模块 pptp.ko
. 所以可以大致确定是 ppp driver 引发的 crash.
fd40: de01cadc 00000000
fd60: 0000013c 0000013b d7e444c0 de01cae8 de01cadc 00000000 ffffffff ffffffff
fd80: 00000001 d7e44514 00000000 d922fda0 bf11d598 c0212a44 20000013 ffffffff
[<c0209580>] (__irq_svc) from [<c0212a44>] (_raw_spin_lock_bh+0x48/0x5c)
[<c0212a44>] (_raw_spin_lock_bh) from [<bf11d598>] (ppp_release_channels+0xa4/0x5cc [ppp_generic])
[<bf11d598>] (ppp_release_channels [ppp_generic]) from [<bf11f91c>] (ppp_input+0x590/0x658 [ppp_generic])
[<bf11f91c>] (ppp_input [ppp_generic]) from [<bf11fb4c>] (ppp_start_xmit+0x168/0x174 [ppp_generic])
[<bf11fb4c>] (ppp_start_xmit [ppp_generic]) from [<c05305dc>] (dev_hard_start_xmit+0x2d4/0x420)
[<c05305dc>] (dev_hard_start_xmit) from [<c0546d2c>] (sch_direct_xmit+0x6c/0x1b0)
[<c0546d2c>] (sch_direct_xmit) from [<c0530918>] (__dev_queue_xmit+0x1f0/0x418)
[<c0530918>] (__dev_queue_xmit) from [<c0571da0>] (ip_finish_output+0x43c/0x49c)
[<c0571da0>] (ip_finish_output) from [<bf064f7c>] (pptp_xmit+0x49c/0x51c [pptp])
[<bf064f7c>] (pptp_xmit [pptp]) from [<bf120ea4>] (ppp_ioctl+0xe20/0xe7c [ppp_generic])
[<bf120ea4>] (ppp_ioctl [ppp_generic]) from [<bf120ff0>] (ppp_write+0xe0/0x104 [ppp_generic])
[<bf120ff0>] (ppp_write [ppp_generic]) from [<c02c1698>] (vfs_write+0xcc/0x160)
[<c02c1698>] (vfs_write) from [<c02c1ba0>] (SyS_write+0x40/0x80)
[<c02c1ba0>] (SyS_write) from [<c0208c80>] (ret_fast_syscall+0x0/0x38)
然后从其中的 _raw_spin_lock_bh
可以猜出可能是死锁(dead lock)问题.
[<c0212a44>] (_raw_spin_lock_bh) from [<bf11d598>] (ppp_release_channels+0xa4/0x5cc [ppp_generic])
但是具体怎么造成死锁的,以及和添加的静态路由有没有关系?继续 debug.
printk
根据以上 log 可知,ppp_generic
会通过 ppp_write
调用 ppp_ioctl
继而调用 pptp_xmit
, 那么先来看看内核驱动代码中的 ppp_write
.
- drivers/net/ppp/ppp_generic.c
static ssize_t ppp_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct ppp_file *pf = file->private_data;
struct sk_buff *skb;
ssize_t ret;
if (!pf)
return -ENXIO;
ret = -ENOMEM;
skb = alloc_skb(count + pf->hdrlen, GFP_KERNEL);
if (!skb)
goto out;
skb_reserve(skb, pf->hdrlen);
ret = -EFAULT;
if (copy_from_user(skb_put(skb, count), buf, count)) {
kfree_skb(skb);
goto out;
}
skb_queue_tail(&pf->xq, skb);
switch (pf->kind) {
case INTERFACE:
ppp_xmit_process(PF_TO_PPP(pf));
break;
case CHANNEL:
ppp_channel_push(PF_TO_CHANNEL(pf));
break;
}
ret = count;
out:
return ret;
}
这里并没有直接调用 ppp_ioctl
, 但是可以知道发送数据需要通过 ppp_xmit_process
或者 ppp_channel_push
完成。
switch (pf->kind) {
case INTERFACE:
printk("Using ppp_xmit_process...\n");
ppp_xmit_process(PF_TO_PPP(pf));
break;
case CHANNEL:
printk("Using ppp_channel_push...\n");
ppp_channel_push(PF_TO_CHANNEL(pf));
break;
}
为此,使用 printk
打印 log 确定是经过 ppp_channel_push
. 那么再来看看这个函数长啥样。
/*
* Try to send data out on a channel.
*/
static void
ppp_channel_push(struct channel *pch)
{
struct sk_buff *skb;
struct ppp *ppp;
spin_lock_bh(&pch->downl);
if (pch->chan) {
while (!skb_queue_empty(&pch->file.xq)) {
skb = skb_dequeue(&pch->file.xq);
if (!pch->chan->ops->start_xmit(pch->chan, skb)) {
/* put the packet back and try again later */
skb_queue_head(&pch->file.xq, skb);
break;
}
}
} else {
/* channel got deregistered */
skb_queue_purge(&pch->file.xq);
}
spin_unlock_bh(&pch->downl);
/* see if there is anything from the attached unit to be sent */
if (skb_queue_empty(&pch->file.xq)) {
read_lock_bh(&pch->upl);
ppp = pch->ppp;
if (ppp)
ppp_xmit_process(ppp);
read_unlock_bh(&pch->upl);
}
}
注意这里有个加锁操作 spin_lock_bh(&pch->downl)
, 这里先不管。继续加printk,确定是从下面的语句访问的 pptp_xmit
.
if (!pch->chan->ops->start_xmit(pch->chan, skb)) {
按这个思路可以把执行顺序列出来。
- ppp_write [ppp_generic] <- ppp_channel_push [ppp_generic]
- ppp_channel_push [ppp_generic] <- start_xmit [ppp_generic]
- start_xmit [ppp_generic] <- pptp_xmit [pptp]
- pptp_xmit [pptp] <- ip_local_out []
- ip_local_out [] <- ...
- ... <- ppp_start_xmit [ppp_generic]
但是其中有部分不是 ppp driver 的code, 所以中间部分就没细究。最终是又回到 ppp_generic. 到这还是无法看出哪里出的问题。当时着急出release,又有了临时的workaroud,就把这个问题搁置了。
gdb
最后还是组长通过 gdb 破案了,不过注意需要使用 toolchain 里的 gdb.
- gdb: ./staging_dir/toolchain-.../bin/arm-openwrt-linux-dgb
- ppp_generic.ko: build_dir/target.../linux-ipq806x/linux-.../drivers/net/ppp/ppp_generic.ko
使用 gdb 中的 "list" 指令,简写 "l" 也可以,后面接地址、函数名或者行数
(gdb) list *Addr
(gdb) list FUNCTION
(gdb) list LINE
使用该指令,以 crash log 中给出的地址为参数,执行结果如下:
$ ./staging_dir/toolchain-.../bin/arm-openwrt-linux-dgb \
build_dir/target.../linux-ipq806x/linux-.../drivers/net/ppp/ppp_generic.ko
(gdb) list *(ppp_write+0xe0)
0x3ff0 is in ppp_write (drivers/net/ppp/ppp_generic.c:568).
563 case INTERFACE:
564 ppp_xmit_process(PF_TO_PPP(pf));
565 break;
566 case CHANNEL:
567 ppp_channel_push(PF_TO_CHANNEL(pf));
568 break;
569 }
570
571 ret = count;
572
(gdb) list *(ppp_ioctl+0xe20)
0x3ea4 is in ppp_channel_push (drivers/net/ppp/ppp_generic.c:1811).
1806
1807 spin_lock_bh(&pch->downl);
1808 if (pch->chan) {
1809 while (!skb_queue_empty(&pch->file.xq)) {
1810 skb = skb_dequeue(&pch->file.xq);
1811 if (!pch->chan->ops->start_xmit(pch->chan, skb)) {
1812 /* put the packet back and try again later */
1813 skb_queue_head(&pch->file.xq, skb);
1814 break;
1815
}
(gdb) list *(ppp_release_channels+0xa4)
0x598 is in ppp_push (drivers/net/ppp/ppp_generic.c:1540).
1535 /* not doing multilink: send it down the first channel */
1536 list = list->next;
1537 pch = list_entry(list, struct channel, clist);
1538
1539 spin_lock_bh(&pch->downl);
1540 if (pch->chan) {
1541 if (pch->chan->ops->start_xmit(pch->chan, skb))
1542 ppp->xmit_pending = NULL;
1543 } else {
1544 /* channel got unregistered */
(gdb)
使用 gdb 可以快速定位到出错代码,从上面的解析结果可以看出,在以下两个函数
- ppp_channel_push
- ppp_push
中都调用了 spin_lock_bh(&pch->downl)
, ppp_channel_push 加锁后正在等待数据发送完成,而发送过程中又调用了 ppp_push ,但是该锁已经锁住了, ppp_push 只能一直等待,最终导致两处都在无限等待中从而陷入死锁。
ok,到此基本真相大白了,结合设定的静态路由来说。当用户向 PPTP server 请求数据时, 驱动会调用 ppp_channel_push
发送数据, 由于添加了静态路由,数据会走 ppp0, 于是又重新调用了 ppp_push
, 导致形成了一个环。
当然这并不是linux kernel 的bug,因为所用kernel经过了 qualcomm 的修改,事实证明公板也存在这个问题,所以应该是某部分修改引入的。于是乎,甩锅~
总结
经过这次的 crash 调试,学习了 ppp driver 的部分代码,积累了处理此类 crash 问题的经验,进一步学习了 gdb 调试方法。
此外,
- crash log 指示的函数未必就是真正出现问题的函数,编译优化可能导致某些Symbols 乱序,具体以 gdb 结果为准;
- FW 中的 binary 通常不包含 Symbols, 所以使用 gdb 时需要对编译目录中的 binary 进行调试;
- 触发 crash 的是定时发送的 LCP echo request 数据包;
- PPTP 使用 GRE(Generic Routing Encapsulation) 封装 ppp0 数据包,经由以太网络接口 brwan 进行IP网络数据收发。
参考
版权声明:本博客所有文章除特殊声明外,均采用 CC BY-NC 4.0 许可协议。转载请注明出处 litreily的博客!