 |
» |
|
|
|
Example 1 |  |
The following core dump was obtained while using a modified
version of the sp driver, which is described in example #2 in the strdb section of this chapter. On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace.
The interesting portion of msgbuf for this dump is: msgbuf+10/s . . . interrupt type 15, pcsq.pcoq = 0.3b2cc, isr.ior = 0.0 Data page fault on interrupt stack B2352A HP-UX () #1: Fri Aug 14 00:49:59 PDT 1992 panic: (display==0xbf00, flags==0x0) Interrupt PC-Offset Stack Trace (read across, most recent is 1st): 0x0013e81c 0x000cddb8 0x000bc93c 0x0003b2cc 0x0012e2bc 0x0016b350 End Of Stack
|
First we translate the hex stack trace in the panic message
into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic
message stack trace, we get the following symbolic stack trace: panic+40: addil 800,dp interrupt+7E8: rsm 1,r0 $ihndlr_rtn: rsm 1,r0 sp_timeout+2C: ldws 0(arg3),arg2 softclock+94: b,n softclock+30 external_interrupt+350: ldil 261000,r22
|
The address where the illegal data access occurred is sp_timeout+2C. The isr.ior in the panic message indicates that the data address
that caused the panic is 0.0, and the instruction at sp_timeout+2C is ldws 0(arg3),arg2, so arg3 must have been 0 at the time of the panic. So
we are probably dereferencing a null pointer. Our first task is
to find out which pointer this is. To do this we need to know which
source code line sp_timeout+2C corresponds to. Here is the source code for sp_timeout(): struct sp { unsigned sp_state; queue_t *sp_rdq; mblk_t *mp; mblk_t *last_mp; }; static sp_timeout(lp) struct sp *lp; { mblk_t *temp; unsigned int s; if (lp->sp_state & SPOPEN) { /* Put message on driver's read queue */ s = splstr(); temp = lp->mp; lp->mp = lp->mp->b_next; if (lp->mp == NULL) lp->last_mp = NULL; temp->b_next = NULL; putq(lp->sp_rdq,temp); splx(s); } }
|
Here is the relevant portion of the assembly code. The instruction
which caused the panic is marked with an "*." sp_timeout,20?ia # adb command sp_timeout: # adb's response sp_timeout: stw rp,-14(sp) sp_timeout+4: stwm r3,40(sp) sp_timeout+8: stw r4,-3C(sp) sp_timeout+0xC: or arg0,r0,r3 sp_timeout+10: ldws 0(r3),arg1 sp_timeout+14: bb,>=,n arg1,31,sp_timeout+58 sp_timeout+18: bl tmxlwsrv+6C,rp (splstr) sp_timeout+1C: or r0,r0,r0 sp_timeout+20: or ret0,r0,r4 sp_timeout+24: ldws 8(r3),arg1 sp_timeout+28: ldws 8(r3),arg3 *sp_timeout+2C: ldws 0(arg3),arg2 sp_timeout+30: stws arg2,8(r3)
|
 |
At sp_timeout+0xC, arg0, which corresponds to the source code
variable lp is moved to r3. We know arg0 is lp, because lp is the first argument to sp_timeout(). sp_timeout+0x14 looks like the if statement in the source
code, because bb is a branch instruction. sp_timeout+0x18 is the call to splstr(). sp_timeout+0x28 loads arg3 with the memory contents at location
r3 + 0x8. arg3 is the source code variable lp->mp. We can guess this because mp is 8 bytes from
the start of lp, according to the declaration for the struct sp. So our problem is that lp->mp is NULL. We want to confirm this, and want to
look at the rest of *lp. To do so, we need to find the value of r3 at
the time of the panic. We may be able to extract this information
from the stack if we know the value of sp at the time of the panic. To get this information,
we do a manual stack back-trace. See "Manual Stack Back-Tracing" for
details on how this is done. The resulting table is shown below: sp | pcoqh | Procedure
Address | Frame
Size |
---|
0x1fdb80 0x1fdb50 0x1fdad0 0x16860 0x167e0 0x16560 0x16330 0x162f0
|
| 0x24b34 0xc8f48 0xc8938 0x13e8cc 0xcddb8 0xbc93c 0x3b2cc 0x12e2bc
|
| doadump+0xec panic_boot+0x354 boot+0x24 panic+0xf0 interrupt+0x7e8 $ihndlr_rtn sp_timeout+0x2c softclock+0x94
|
| 0x30 0x80 0x100 0x80 0x280 0x230 0x40 0x80
|
|
Now that we have the values of sp, we want to look into the stack of the procedure
above sp_timeout() in the stack trace to find what value that procedure
saved in its stack for r3. In this case, the procedure above sp_timeout() is $ihndlr_rtn. $ihndlr_rtn is one of the low-level kernel utility procedures
which is hand-coded in assembly and does not create a normal stack
frame. Instead it creates a "save state" area,
which contains the values of all the registers at the time the trap
or interrupt took place. The structure save_state is defined in /usr/include/machine/save_state.h. The general registers are stored first, and are
located at "top of save state area" + "register number" *
4. For example, r3 will be 3*4 = 12 off of the beginning of the save
state area. To find the top of the save state area, subtract the
size of the save_state structure from the value of sp for $ihndlr_rtn:  |
0x16560-0x230>sp # set sp to top of trap save state <sp/X 16330: 0xF000009 # first word of save state area <sp+0xC/X # find contents of r3 (lp) at sp + 3*4 icsBase+33C: 24C258 24C258+0x8/X # find 8 off of r3 (lp->mp) sp_sp+18: 0 # lp->mp is NULL 0x24c258/4X # look at all of lp: # state *sp_rdq *mp *last_mp sp_sp+10: 1 1040C00 0 10F7C00
|
We can also use strdb to look at lp. (See the strdb section of this chapter for details.) There may
be several instances of the sp driver, each with a different minor number, so
we must look at each one until we find the instance whose q_ptr is the same as the address we have for lp. lp is a pointer to the sp driver's private data, which is also pointed to
by q_ptr. The strdb STREAMS/UX subsystem la command will tell us what minor numbers are in
use for the sp driver: :la sp sp MAJOR = 115 ACTIVE Minor 2 Stream head RQ = 0x0810eb000 ACTIVE Minor 1 Stream head RQ = 0x081107a00 ACTIVE Minor 0 Stream head RQ = 0x0810ebe00
|
The strdb STREAMS/UX subsystem command lm will show us what modules may have been pushed
into the stream above the sp driver: :lm sp 0 STREAM Head lmodc Driver sp
|
In this case, the panicking stream happens to correspond to
the sp with minor number 1. From the strdb STREAMS/UX subsystem, we use ":qh sp
1" to get to the read queue of the stream head containing sp driver with minor number 1. Then the o command to get to the write queue of the stream
head. Next the n command twice to get from the stream head through
the module lmodc to the driver sp. Here is the display of the q information for driver sp, minor number 1. Note that q_ptr is 0x24c258, which is the address of lp. :qh sp 1 struct queue 0x1040c74 q_qinfo = 0x1e545c q_pad1[2] = 00 q_first = 0x0q_other = 0x1040c00 q_last = 0x0 q_next = 0x0 q_link = 0x0 q_ptr = 0x24c258 q_count = 0 q_flag = 0x1128 QWANTR QUSE QOLD QSYNCH q_minpsz = 0 q_maxpsz = 256 q_hiwat = 0x8000 q_lowat = 0x4000 q_bandp = 0x105fd40 q_nband = 1 q_pad1[0] = 00 q_pad1[1] = 00
|
Now that we have reached the queue structure for the panicking sp driver instance, we can use strdb or adb to examine its contents. Using the strdb command :b, we can look at q_ptr, and see that its mp field (the third word) is NULL:  |
:b 0x24c258 0x0024c258 00 00 00 01 01 04 0c 00 00 00 00 00 01 0f 7c 00 | ................ 0x0024c268 00 00 00 01 01 0f 8e 00 00 00 00 00 00 00 00 00 | ................ 0x0024c278 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c288 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c298 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c2a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c2b8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c2c8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c2d8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c2e8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c2f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c308 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c318 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c328 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 0x0024c338 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
 |
|
At this point, we have probably learned all that we can from
the dump and must turn to the source code to discover the cause
of this problem. We next examine the code carefully everywhere that lp->mp is updated or should be updated. Because sp driver's put routine, spput(), should be updating lp->mp, we look at it first. static spput(q, mp) queue_t *q; mblk_t *mp; { struct sp *lp; unsigned int s; switch (mp->b_datap->db_type) { case M_DATA: case M_PROTO: case M_PCPROTO: s = splstr(); lp = q->q_ptr; if (!lp->last_mp) lp->last_mp = mp; else lp->last_mp->b_next = mp; splx(s); timeout(sp_timeout,lp,1); break; default: printf("Routine spput: Should not be here\n"); break; } }
|
Note that spput() never updates lp->mp. It just adds the new message to the tail of the
list using lp->last_mp. But once sp_timeout() has processed the last message on the list and
set lp->mp to NULL, spput() will never update lp->mp to point at the next message it receives. This causes sp_timeout() to be called with lp->mp == NULL. If we change spput() if statement to properly update lp->mp as shown below, this panic will be fixed. if (!lp->mp) /* * head of list is NULL so list is empty -- put new message * at head of list */ lp->mp = mp; else /* * list is not empty -- put new message at tail of list */ lp->last_mp->b_next = mp; /* * update list tail pointer to point to new message */ lp->last_mp = mp;
|
Example 2 |  |
The following core dump was obtained while using a modified
version of the sp driver, which is described in example #2 in the strdb section of this chapter. On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace.
The interesting portion of msgbuf for this dump is: msgbuf+0xc/s . . . trap type 15, pcsq.pcoq = 0.3b584, isr.ior = 0.0 B2352A HP-UX () #1: Fri Aug 14 00:49:59 PDT 1992 panic: (display==0xbf00, flags==0x0) Data segmentation fault PC-Offset Stack Trace (read across, most recent is 1st): 0x0013e81c 0x000cc108 0x000bd3f4 0x0003b584 0x00049a48 0x0004bd0c 0x0002f7d4 0x00046178 0x00049a48 0x000460d0 0x00046594 0x0012cc10 0x000bedd0 0x00024cf0 End Of Stack |
First we translate the hex stack trace in the panic message
into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic
message stack trace, we get the following symbolic stack trace: panic+40: addil 800,dp trap+0xA28: b trap+0xF18 $call_trap+20: rsm 1,r0 spput+4C: stws r31,0(r1) csq_lateral+80: b,n csq_lateral+8C puthere+4C: ldw -54(sp),rp lmodcsrv+5C: bl getq,rp sq_wrapper+50: ldw -54(sp),rp csq_lateral+80: b,n csq_lateral+8C runq_run+58: b,n runq_run+74 str_sched_daemon+264: b str_sched_daemon+160
|
The address where the illegal data access occurred is spput+4C. The isr.ior in the panic message indicates that the data address
that caused the panic is 0.0, and the instruction at spput+4C is stws r31,0(r1), so r1 must have been 0 at the
time of the panic. We are probably dereferencing a null pointer.
Our first task is to find out which pointer this is. To do this
we need to know to which source code line spput+4C corresponds to. Here is the source code for spput(): struct sp { unsigned sp_state; queue_t *sp_rdq; mblk_t *mp; mblk_t *last_mp; }; static spput(q, mp) queue_t *q; mblk_t *mp; { struct sp *lp; unsigned int s; switch (mp->b_datap->db_type) { case M_DATA: case M_PROTO: case M_PCPROTO: lp = q->q_ptr; if (!lp->mp) lp->mp = mp; else lp->last_mp->b_next = mp; lp->last_mp = mp; timeout(sp_timeout,lp,1); break; default: printf("Routine spput: Should not be here\n"); break; } }
|
Here is the relevant portion of the assembly code. The instruction
where the panic occurred is marked with an "*". spput,40?ia spput: spput: stw rp,-14(sp) spput+4: ldo 40(sp),sp spput+8: or arg1,r0,r31 spput+0xC: ldw 14(r31),r22 spput+10: ldbs 0xD(r22),arg1 spput+14: ldo -41(r0),arg2 spput+18: ldo -41(arg1),arg3 spput+1C: combt,=,n arg2,arg3,spput+30 spput+20: ldo -40(r0),ret1 spput+24: combt,=,n ret1,arg3,spput+30 spput+28: ldo 42(r0),r19 spput+2C: combf,=,n r19,arg3,spput+78 spput+30: ldw 14(arg0),arg1 spput+34: ldws 8(arg1),ret0 spput+38: comibf,=,n 0,ret0,spput+48 spput+3C: stws r31,8(arg1) spput+40: b spput+54 spput+44: stws r31,0xC(arg1) spput+48: ldws 0xC(arg1),r1 * spput+4C: stws r31,0(r1) spput+50: stws r31,0xC(arg1) spput+54: ldil 3B000,rp spput+58: ldo 298(rp),r20 spput+5C: extru,= r20,1F,1,r21 spput+60: ldw -4(dp),r21 spput+64: ldo 1(r0),arg2 spput+68: bl spclose+0xB4,rp (timeout)
|
 |
First, we try to get a general idea where spput+0x4C falls in the source code. It occurs before the
call to timeout() at spput+0x68. The pattern of combt and combf instructions from spput+0x1C to spput+0x2C correspond to the switch statement in the
source code. We guess this by noticing that we have loaded a value
into arg3 which we compare against three different values, which
resembles the first three case statements in the switch statement.
It is unlikely that the default case of the switch statement, which
just does a printf(), would cause the system to panic. spput+0x4C is probably in the source code in the case
statement for M_DATA, M_PROTO, and M_PCPROTO. The comibf instruction at spput+0x38 must correspond to the if (!lp->mp)
source statement, because it is a conditional branch statement,
and it is comparing a register to 0 (zero). We may be able to determine whether we executed the "if" clause
or the "else" clause of the if statement, based
on the fact that we know we executed spput+0x4C (because a trap
occurred while executing it). The comibf instruction branches to its target address if
the condition it is checking is false. This comibf instruction compares ret0 to zero. If ret0 equals
zero, comibf will not branch, and execution will continue to spput+0x3C and spput+0x40. spput+0x40 is an unconditional branch to spput+0x54, which is past spput+0x4C. Therefore, if ret0 had been zero, we never
would have executed spput+0x4C. So ret0 was not zero. Since we know that
the comibf instruction corresponds to "if (!lp->mp)," we know
that lp->mp was not NULL, and the comibf instruction branches to spput+0x48 if lp->mp is not NULL, we can be
confident that spput+0x48 and spput+0x4C are part of the else clause of the if statement, which consists of one statement, "lp->last_mp->b_next
= mp;." Now we know which source code line we panicked on. We need
to determine which source code pointer the register r1 corresponds
to, because dereferencing r1 is what caused the panic. To do this,
we work backwards from spput+0x4C to see where r1's contents came from. On spput+0x48, r1 gets loaded from arg1 + 0xC. Now we look
backward to see where arg1 came from. It is tempting to assume that
arg1 is the second argument to spput, which is mp. But at spput+0x10, arg1 is the target of a load, so at spput+0x48 arg1 does not contain mp. It is also tempting
to look at spput+0x44 for the origins of arg1's contents, because
that instruction has arg1 as its target. But because we took the comibf at spput+0x38, we must have branched around spput+0x44, so we can ignore this instruction. Looking
further backward to spput+0x30, arg1 gets loaded from arg0 + 0x14. arg0
has not been the target of a load instruction since the beginning
of spput, so it must still contain the first argument to spput, q. Looking at the source code, the only time that q is referenced is to set lp in the statement before the if. So arg1 must correspond to lp. Looking at the source code line where the panic occurred, "lp->last_mp->b_next
= mp," and the assembly code lines spput+0x48 and spput+0x4C, it appears that spput+0x48 is setting r1 to lp->last_mp, and spput+0x4C is attempting to put the contents of r31 into
memory location r1 + 0, which must be "lp->last_mp->b_next". So our problem is that lp->last_mp is NULL. It may
help us to look at the rest of *lp, and to do so we need to find the value of arg1
at the time of the panic. We may be able to extract this information
from the stack if we know the value of sp at the time of the panic.
To get this information, we do a manual stack back-trace. See "Manual
Stack Back-Tracing" for details on how this is done. The
resulting table is shown below: sp | pcoqh | Procedure
Address | Frame
Size |
---|
0x1fdb80 0x1fdb50 0x1fdad0 0x7ffe6f88 0x7ffe6f08 0x7ffe6e08 0x7ffe6bd8 0x7ffe6b98
|
| 0x24b34 0xc8f48 0xc8938 0x13e8cc 0xcc108 0xbd3f4 0x3b584 0x49a48
|
| doadump+0xEC panic_boot+354 boot+0x24 panic+0xf0 trap+0xf18 $call_trap spput+0x4c csq_lateral+0x80
|
| 0x30 0x80 0x80 0x80 0x100 0x230 0x40 0x80
|
|
Now that we have the values of sp, we want to look into the stack frame of the procedure
above spput() in the stack trace, to find what value that procedure
saved in its stack for arg1. In this case, the procedure above spput() is $call_trap. $call_trap is one of the low-level kernel utility procedures
which is hand-coded in assembly and does not create a normal stack
frame. Instead it creates a "save state" area,
which contains the values of all the registers at the time the trap
or interrupt took place. The structure save_state is defined in /usr/include/machine/save_state.h. The general registers are stored first, and are
located at "top of save state area" + "register number" *
4. So, for example, arg1, which is also known as r25, will be 25*4
= 100 off of the beginning of the save state area. To find the top
of the save state area, subtract the size of the save_state structure (0x230 in release 9.0) from the value
of sp for $call_trap:  |
0x7ffe6e08-0x230>sp # set sp to top of trap save state <sp/X 7FFE6BD8: 0xF000009 # first word of save state area 0x7ffe6bd8+0x4/X 7FFE6BDC: 0 # find contents of r1 (lp->last_mp) # at sp + 1*4. NULL, as we thought 0x7ffe6bd8+0x64/X # find contents of arg1 (lp) at # sp + 25*4. 7FFE6C38: 0xFFFFFFBF
|
0xFFFFFFBF is a very unlikely value for lp. It is more likely that the contents of arg1 were
changed in the process of taking a trap. The four arg registers
are considered scratch registers, and the trap path is very likely
to have overwritten these registers before it created the save state area.
However, there is an alternative way to find out the value of lp. If we can determine what the procedure that called spput() set arg0 to before the call, we will know the
value of q, and lp is q->q_ptr.  |
csq_lateral+40,15?ia csq_lateral+40: ldws 8(r3),arg2 csq_lateral+44: depi -1,1E,1,arg2 csq_lateral+48: stws arg2,8(r3) csq_lateral+4C: bl csq_turnover+108,rp (UNCRIT) csq_lateral+50: or r6,r0,arg0 csq_lateral+54: ldw 10(r5),ret1 csq_lateral+58: comibt,=,n 0,ret1,csq_lateral+68 csq_lateral+5C: ldw 10(r5),arg0 csq_lateral+60: ldw 1C(arg0),r19 csq_lateral+64: bb,<,n r19,18,csq_lateral+84 csq_lateral+68: ldw 1C(r5),arg1 csq_lateral+6C: stw r0,1C(r5) csq_lateral+70: ldw 14(r5),r6 csq_lateral+74: ldw 18(r5),arg0 * csq_lateral+78: ble 0(sr4,r6) csq_lateral+7C: or r31,r0,rp csq_lateral+80: b,n csq_lateral+8C
|
 |
The procedure which called spput() is csq_lateral(). The point where the call was made is marked with
an asterisk. Note that the procedure call here is made using the
instruction ble instead of the usual instruction bl. This is because csq_lateral does not know the name of the procedure it is
going to call. csq_lateral() is passed a structure which contains the address
of a procedure to call and the arguments with which to call it.
Because the compiler cannot tell at compile time how far away in
the executable image the procedure address is, it must use a branch
and link external, ble, instruction in order to be sure it will be able
to reach the procedure address being branched to. At csq_lateral+0x74, arg0 is loaded from r5 + 0x18. So if we
can find out what value r5 had at that point, we can determine the
value of q. r5 is a callee save register, so there is a chance
that spput saved r5 in its stack frame. We look at the first
few instructions of spput: spput/6i spput: spput: stw rp,-14(sp) ldo 40(sp),sp or arg1,r0,r31 ldw 14(r31),r22 ldbs 0xD(r22),arg1 ldo -41(r0),arg2
|
We see that spput did not save r5. Callee registers are only saved
if the callee plans to overwrite the register. So we cannot get
r5 from spput's stack frame, but if spput did not save r5 that means it did not overwrite it;
therefore, the value for r5 in the save state area will be the same
value that r5 had at csq_lateral+0x74. Look at 4*5 into the save state area: <sp+0x14/X # sp + 4*5 == r5 7FFE6BEC: 11002A0 11002A0+0x18/X # q is r5 + 0x18 11002B8: 10EE674 10EE674+0x14/X # lp is q + 0x14 10EE688: 24C278 24C278+0xC/X # lp->last_mp = lp + 0xC sp_sp+3C: 0 # lp->last_mp is NULL 0x24c278/4X # look at all of lp: # state sp_rdq mp last_mp sp_sp+30: 1 10EE600 0 0
|
 |
Note that at the point the panic occurred, lp->mp was NULL, even though we can be sure that at the
time we checked lp->mp at instruction spput+0x38, lp->mp was not NULL. How can this
be true? As we saw in the previous example, sp_timeout() modifies the lp structure, and it runs out of timeout. In other
words, spput() calls timeout() to schedule sp_timeout() to run after a specified amount of time. At each
system clock tick, the kernel examines the list of procedures created
by timeout() and schedules those procedures whose time has
expired to run. Because a clock tick is a high level interrupt,
it can occur at any time, and may suspend spput() if it is running. A clock tick may have occurred
between spput+0x38 and spput+0x4C, allowing sp_timeout() to run and set lp->mp to NULL. In order to prevent this, we need to protect
access to the lp structure by using splstr() around all critical sections of code in the sp driver which manipulate lp. So spput() source code should be changed as shown below:  |
case M_DATA: case M_PROTO: case M_PCPROTO: /* * Use splstr() to protect access to q->q_ptr area from * interrupts which may schedule sp_timeout(). */ s = splstr(); lp = q->q_ptr; if (!lp->mp) lp->mp = mp; else lp->last_mp->b_next = mp; /* * Return to previous interrupt level */ splx(s);
|
In order to protect access to q->q_ptr, sp_timeout() must also call splstr() before it accesses q->q_ptr. The source
code for sp_timeout() in the first example in this section shows the
correct use of splstr(). See the STREAMS/UX synchronization section of Chapter 3 for guidelines
on protecting module and driver critical sections. Example 3 |  |
The following core dump was obtained while using a modified
version of the sp driver, which is described in example #2 in the strdb section of this chapter. On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace.
The interesting portion of msgbuf for this dump is: msgbuf+0xc/s . . . trap type 15, pcsq.pcoq = 0.9ea14, isr.ior = 0.d @(#)9245XA HP-UX (A.09.00) #0: Thu Aug 13 23:17:54 PDT 1992 panic: (display==0xbf00, flags==0x0) Data segmentation fault PC-Offset Stack Trace (read across, most recent is 1st): 0x0016da70 0x000e5a68 0x000d34cc 0x0009ea14 0x00099714 0x0009 2fdc 0x0006e0c8 0x0006dbb8 0x0006d2a8 0x001954e8 0x00194fa4 0x000b 7e24 0x001846d4 0x00181730 0x00156538 0x00156af8 0x001567b8 0x000e 6d80 0x000d3aac End Of Stack
|
First we translate the hex stack trace in the panic message
into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic
message stack trace, we get the following symbolic stack trace: panic+30: addil -1000,dp trap+0xADC: b trap+1004 $call_trap+20: rsm 1,r0 flushq+60: ldbs 0xD(r21),r22 q_free+1C: ldw -0xA4(sp),r31 osr_pop_subr+0xB44: b osr_pop_subr+0xB4C osr_close_subr+4D8: stw ret0,-40(sp) pse_close+8A0: stw ret0,-3C(sp) hpstreams_close+58: stw ret0,-40(sp) call_open_close+448: or ret0,r0,r3 closed+138: or ret0,r0,r5 ufs_close+11C: movb,tr r0,ret0,ufs_close+15C vn_close+24: ldw -54(sp),rp vno_close+50: addil -59800,dp closef+0xE8: ldw 18(r3),arg0 exit+2B4: bl uffree,rp rexit+20: ldw -54(sp),rp syscall+2A4: ldhs 0(r9),r19
|
 |
The address where the illegal data access occurred is flushq+0x60. The isr.ior in the panic message indicates that the data address
that caused the panic is 0.d, and the instruction at flushq+0x60 is ldbs 0xD(r21),r22, so r21 must have been
0 at the time of the panic. So we are probably dereferencing a null
pointer. Our first task is to find out which pointer this is. To
do this we need to know which variable r21 was supposed to contain.
We do not have source code for flushq(), because it is a STREAMS/UX internal procedure,
but we do know from its man page what arguments it takes, and we
do have the assembly version of the code. Here is the relevant portion
of the assembly. The instruction where the panic occurred is marked
with an "*".  |
flushq,20?ia flushq: flushq: stw rp,-14(sp) flushq+4: ldo 40(sp),sp flushq+8: stw arg0,-64(sp) flushq+0xC: stw arg1,-68(sp) flushq+10: ldw -68(sp),r20 flushq+14: zdepi 1,10,1,r21 flushq+18: and r20,r21,r22 flushq+1C: stw r22,-3C(sp) flushq+20: ldw -68(sp),r31 flushq+24: addil -8000,r0 flushq+28: ldo -1(r1),r19 flushq+2C: and r31,r19,r20 flushq+30: stw r20,-68(sp) flushq+34: ldw -64(sp),r21 flushq+38: ldws 4(r21),r22 flushq+3C: stw r22,-34(sp) flushq+40: ldw -34(sp),r1 flushq+44: comibt,=,n 0,r1,flushq+120 flushq+48: or r0,r0,r0 flushq+4C: ldw -34(sp),r31 flushq+50: ldws 0(r31),r19 flushq+54: stw r19,-38(sp) flushq+58: ldw -34(sp),r20 flushq+5C: ldw 14(r20),r21 * flushq+60: ldbs 0xD(r21),r22 flushq+64: stw r22,-40(sp) flushq+68: ldw -68(sp),r1
|
We can find flushq()'s calling sequence in its man page in SVR4PG: void flushq(queue_t *q, int flag) It is more likely that *q or one of its members is NULL than the parameter
flag being the cause of our problem. We will trace the use of the
first argument, originally in arg0, through flushq, to see how it might be related to the contents
of r21. At flushq+0x8, arg0 is pushed onto the stack at offset sp
- 0x64. Neither arg0 nor -64(sp) is referenced again until flushq+0x34. At flushq+0x34, r21 is loaded with -64(sp), so at this point r21 contains *q. At flushq+0x38, r22 is loaded from memory location 4 + r21.
Looking at the structure definition for queue_t, found in /usr/include/sys/stream.h, we see that the second word in a queue_t structure,
which would be found at memory location r21 + 4, is the q_first pointer.  |
struct queue { struct qinit * q_qinfo; /* procedures and limits for queue */ struct msgb * q_first; /* head of message queue */ struct msgb * q_last; /* tail of message queue */ struct queue * q_next; /* next QUEUE in Stream */ struct queue * q_link; /* link to scheduling queue */ caddr_t q_ptr; /* to private data structure */ ulong q_count; /* weighted count of characters on q */ ulong q_flag; /* QUEUE state */ long q_minpsz; /* min packet size accepted */ long q_maxpsz; /* max packet size accepted */ ulong q_hiwat; /* high water mark, for flow control */ ulong q_lowat; /* low water mark */ struct qband * q_bandp; /* band information */ unsigned char q_nband; /* number of bands */ unsigned char q_pad1[3]; /* reserved */ struct queue * q_other; /* pointer to other Q in queue pair */ QUEUE_KERNEL_FIELDS };
|
So r22 now contains q->q_first. At flushq+0x3C, r22 is stored back in the stack, at sp - 0x34. At this point, it may be useful to try and work backwards
from flushq+0x5C, where r21 gets loaded from 0x14 + r20, because
at the next instruction, flushq+0x60, we know that r21 is NULL. We notice that
at flushq+0x58, r20 is loaded from sp - 0x34. At flushq+0x3C, we know that sp - 0x34 was q->q_first. Checking the instructions between flushq+0x3C and flushq+0x58 shows that sp - 0x34 has not been stored to by any of these
instructions, only loaded from. So at flushq+0x58, r20 is loaded with q->q_first. At flushq+0x5C, r21 is loaded with some field of q->q_first. Looking at the structure definition for struct msgb, also found in /usr/include/sys/stream.h, we find that the sixth word of the msgb structure, which would be found at memory location
r20 + 5 words == r20 + 0x14, is b_datap.  |
struct msgb { struct msgb * b_next; /* next message on queue */ struct msgb * b_prev; /* previous message on queue */ struct msgb * b_cont; /* next message block of message */ unsigned char * b_rptr; /* first unread data byte in buffer */ unsigned char * b_wptr; /* first unwritten data byte */ struct datab * b_datap; /* data block */ unsigned char b_band; /* message priority */ unsigned char b_pad1; unsigned short b_flag; /* message flags */ long b_pad2; MSG_KERNEL_FIELDS };
|
So our problem is that q->q_first->b_datap is NULL. We want to confirm this, and to look
at the rest of the q structure. To do that we need to find the value
of sp - 0x64 at the time of the panic. We may be able
to extract this information from the stack if we know the value
of sp at time of the panic. To get this information,
we do a manual stack back-trace. See "Manual Stack Back-Tracing" for
details on how this is done. The resulting table is shown below: sp | pcoqh | Procedure
Address | Frame
Size |
---|
0x2418c0 0x241890 0x2417d0 0x7ffe7750 0x7ffe7710 0x7ffe7650 0x7ffe7420
|
| 0x1c374 0xdfcd0 0xdf3a8 0x16db14 0xe5a68 0xd34cc 0x9ea14
|
| doadump+0xec panic_boot+0x354 boot+0x34 panic+0xd4 trap+0xadc $call_trap flushq+0x60
|
| 0x30 0xc0 0x80 0x40 0xc0 0x230 0x40
|
|
Now that we have the values of sp for flushq, we know the q address we are interested in is at 0x7ffe7420
- 0x64: 0x7ffe7420-0x64/X 7FFE73BC: 5E9C00
|
Looking at the first few words of the q structure, we can determine the value of q_first, which is the second word: 5E9C00/4X 5E9C00: 294160 5D8C00 6C1880 0
|
Looking at q_first, we can see that the sixth word, b_datap, is NULL: 5D8C00/8X 5D8C00: 646480 0 646400 644000 6440D1 0 0 0
|
We can also use strdb to look at q and q_first. See the strdb section of this chapter for more information.
Because there may be several instances of the sp driver, each with a different minor number, we
must look at each one until we find the stream which contains a
queue whose address is the same as the address we have for q. The strdb STREAMS/UX subsystem la command will tell us what minor numbers are in
use for the sp driver: :la sp sp MAJOR = 115 ACTIVE Minor 0x000013 Stream head RQ = 0x00607b00 ACTIVE Minor 0x000012 Stream head RQ = 0x00605c00
|
These instances of sp are far fewer than we had expected. lm on minor number 0x12 shows that lmodc has already been popped off the stream: :lm sp 0x12 STREAM Head Driver sp
|
and using :qh sp 0x12, and o and n as needed to traverse all
the queues in this stream shows that none of these queues have address
0x5e9c00. lm on sp 0x13 shows that lmodc is still pushed above sp on this stream, but traversing all the queues
in this stream shows that none of them are the queue we are looking
for. We can use the strdb primary mode :x command to format q_first as a struct msgb to confirm our finding from adb that q->q_first->b_datap is NULL. (We find the structure type for q_first from /usr/include/sys/stream.h). :x msgb 0x5d8c00 struct msgb 0x5d8c00 S:1 b_next = 0x646480 b_prev = 0x0 b_cont = 0x646400 b_rptr = 0x644000 b_wptr = 0x6440d1 b_datap = 0x0 b_band = 0 b_pad1 = 00 b_flag = 0x0 b_pad2 = 0
|
b_datap could be NULL because its resources have been
freed, or it could be NULL because the data structure was corrupted
in some way. To try to narrow this down, we want to look at the
message buffer b_cont. If its b_datap is also NULL, the possibility of corruption becomes
less likely. We can use :x msgb 0x646400 to format the b_cont field of q->q_first. It is easier, however, to see if there is a navigation
key available for the b_cont field. "?" lists the available
navigation keys: navigation for structure msgb 'n' = b_next (msgb) 'p' = b_prev (msgb) 'm' = b_rptr (b_rptr) 'c' = b_cont (msgb) 'd' = b_datap (datab)
|
Using the c navigation key, we see that b_datap for b_cont is also NULL. This makes it very likely that this
message has already been freed. struct msgb 0x646400 S:2 b_next = 0x5d8c00 b_prev = 0x0 b_cont = 0x0 b_rptr = 0x651400 b_wptr = 0x6517e1 b_datap = 0x0 b_band = 0 b_pad1 = 00 b_flag = 0x0 b_pad2 = 0
|
Now we try to get information about the queue which was pointing
to this message at the time of the panic. We use :x to format 0x5e9c00 as a queue structure to see
what information it may still contain. :x queue 0x5e9c00 struct queue 0x5e9c00 S:3 q_qinfo = 0x294160 q_pad1[0] = 00 q_first = 0x5d8c00 q_pad1[1] = 00 q_last = 0x6c1880 q_pad1[2] = 00 q_next = 0x0 q_other = 0x5e9c74 q_link = 0x0 q_ptr = 0x0 q_count = 24896 q_flag = 0x1135 QREADR QFULL QWANTW QUSE QOLD QSYNCH q_minpsz = 0 q_maxpsz = 256 q_hiwat = 0x8000 q_lowat = 0x4000 q_bandp = 0x539d00 q_nband = 1
|
Note that this is a read queue whose q_next pointer is NULL. This implies that this queue
is not a connected part of a stream, and is in the process of being
closed. To find out what driver or module this queue is being used
by, we want to look at q_qinfo. We could use :x qinit 0x294160, or look for an
appropriate navigation key: ? navigation for structure queue 'i' = q_qinfo (qinit) 'm' = q_first (msgb) 'z' = q_last (msgb) 'n' = q_next (queue) 'l' = q_link (queue) 'b' = q_bandp (qband) 'o' = q_other (queue)
|
We use the i navigation key to print the following: struct qinit 0x294160 S:4 qi_putp = 0x785ac qi_srvp = 0x78794 qi_qopen = 0x7841c qi_qclose = 0x78490 qi_qadmin = 0x0 qi_minfo = 0x294148 qi_mstat = 0x0
|
Using the adb i command, we can find out the name of the qi_putp routine: 0x785ac/i lmodcput: lmodcput: stw rp,-14(sp)
|
This means the module lmodc was using the queue on which the panic occurred.
We can double check this by looking at the qi_minfo structure in strdb. Again, we can either use :x module_info 0x294148,
or we could see if there is a navigation key available for qi_minfo: ? navigation for structure qinit 'i' = qi_minfo (module_info) 's' = qi_mstat (module_stat)
|
Using the qinit i navigation key to print the module_info structure: struct module_info 0x294148 S:5 mi_idnum = 0x3ec mi_idname = 0x23a0a8 mi_minpsz = 0 mi_maxpsz = 256 mi_hiwat = 0x8000 mi_lowat = 0x4000
|
and using the adb s command to print mi_idname as a string: 0x23a0a8/s lmcinfo+10: lmodc
|
So we had the panic occur on an lmodc read queue which was in the process of being closed.
Our stack trace confirms this. We are making the exit system call,
close all open file descriptors and as part of process clean-up.
The last close of a stream causes each module and driver to be popped
and its resources freed, including its message buffers. Whenever a
panic occurs which involves b_datap being NULL, the cause is usually that the buffer
has already been freed but a pointer to it was not zeroed out, and
a module or driver continues to access the buffer through this non-zeroed
pointer. The best way to find the cause of this problem is to look
through the source code for all calls to freemsg() or freeb(), and check that all pointers to the buffer being
freed are zeroed out.  |
static spclose(q) queue_t *q; { struct sp *lp; unsigned int s; mblk_t *mp, *t_mp; lp = (struct sp *) (q->q_ptr); /* Free messages queued by spput() on interim mesg queue. */ s = splstr(); mp = lp->mp; while (mp != NULL) { t_mp = mp; mp = mp->b_next; freemsg(t_mp); } splx(s); flushq(WR(q), 1); q->q_ptr = NULL; }
|
For the sp driver, we found that spclose() calls freemsg(): freemsg() is called to free all messages held in the interim
message queue in our private data area, but we do not zero out the
pointers lp->mp or lp->last_mp, which point to the head and tail of the private interim
queue. A call to sp_timeout() may still be pending in the timeout queue. When sp_timeout() is executed, because lp->mp is non-NULL, it will call putq() to pass lp->mp up to sp's read queue, where sp's service routine will call putnext() to put it in lmodc's read queue. When flushq() is called on lmodc's read queue, it tries to free this already freed
message, causing a trap type 15 panic on the NULL b_datap. Adding the following code to spclose() will fix this problem: . . . freemsg(t_mp); } splx(s); /* * NULL out list pointers to insure the messages they point to * will not be freed twice. */ lp->mp = NULL; lp->last_mp = NULL; flushq(WR(q), 1); q->q_ptr = NULL; }
|
|