An example of inspecting an EMC Smarts Core file

The contract I am currently employed on has the new Smarts/IONIX IP AM/PM/SPM and SAM version 9 software which we are extending with Dynamic model, ASL, Perl, SqLite -- etc.

Every now and then it rolls over, throws its legs in the air and dies a dramatic death - dropping a core file and filling the log file with stack traces in the process. Of course, we do the usual thing of gathering these up, opening an SR on the EMC PowerLink site and seeking their assistance which they are pleased to give and they usually manage to find a solution for us - so many thanks to them for that!

But sometimes the problem is caused by something we've done with the dynamic model, and being of an investigatory mind set I have been interested to see how far I can go in understand the debug output, and getting useful info from the core files myself. Today, for the first tine I have manged to work out one simple little trick that allows me to get some useful info for a rather specific crash in the SAM domain.

I'm sharing this here in case it proves useful, or someone else (like me) likes to dig in a little deeper from time to time.

So here's the issue..

  1. We fire up SAM -- fine.
  2. We fire up IP AM/PM/SPM - that too is fine
  3. A few minutes later SAM dies and drops a core file - it seems to be at the point when the IP->SAM synchronization finishes.

And the question is of course: "why?" what have I done to break it?

The platform

These notes refer to the 64 bit SAM version 9 running on RedHat Linux. Some of this will work for other similar platforms, but if you are running on Windows - then you'll need to ask someone else how to do this.

Start with the log file.

We'll need to get some basic understanding of the stack traces in the log file. If you look in one after a crash, you'll see lots of lines that look like this...

#24 0x00002aaaaf50c315 in ASL_ValueObjectOperation::valueOf (this=0x2aaabdedc4e0, parser=0x2aaabf4fb060) at
/work/blackcurrent/DMT-9.0.1.X/15/smarts/ga/parser/ASL_Value.c:1815
#25 0x00002aaaaf4e57a0 in ASL_ParseActionExpr::perform (this=<value optimized out>, parser=0x2aaabf4fb060)
at /work/blackcurrent/DMT-9.0.1.X/15/smarts/ga/parser/ASL_ParseAction.c:578
#26 0x00002aaaaf4e97fb in ASL_ParseActionList::perform (this=0x2aaabf4faaa0, parser=0x2aaabf4fb060) at /
work/blackcurrent/DMT-9.0.1.X/15/smarts/ga/parser/ASL_ParseAction.c:385

These are grouped together into blocks - one per thread in the domain manager. The different blocks all start with the thread number header, and line number "#0" - like this..

Thread 23 (Thread 1205887613 (LWP 7912)):
#0 0x00000032a820d91b in read () from /lib64/libpthread.so.0

The first thing to do is to look for the block that contains the text "sm_POSIXstacktrace" quite near it's top - this is the block for the thread that has crashed. Now scan through the lines in the block looking for a class name or function (operation) that might give you some clues as to what the thread was doing at the time.

It helps to declutter the text if you remove the part of the lines that starts with the word " at ". The text you are throwing away tells you what line number and what C source file the code is in - but since we don't have access to the sources this is of no use to us. You can also throw away the hex stack address at the start of the line (which is also of limited use). Here is the block for my example, "decluttered" in this way with sm_POSIXstacktrace entry highlighted in red..

Thread 2 (Thread 1205856576 (LWP 7807)):
#0 - read () from /lib64/libpthread.so.0
#1 - _lineread (fd=4, buf=0x47df81b0 "", bufsz=<value optimized out>)
#2 - sm_POSIXstacktrace (fd=-1, stdinfo=<value optimized out>, line=452, file=0x2b78647bd9a8 "/work/blackcurrent/DMT-9.0.1.X/19/smarts/platform/misc/posix/POSIX_stackdump.c")
#3 - sm_LINUXstacktrace (fd=-1, stdinfo=0)
#4 - fatalHandler (sig=11, info=0x47dfb5e0, context=<value optimized out>)
#5 <signal handler called>
#6 - arena_get2 () from /lib64/libc.so.6
#7 - malloc () from /lib64/libc.so.6
#8 - __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#9 - _nss_ldap_inc_depth () from /lib64/libnss_ldap.so.2
#10 - _nss_ldap_enter () from /lib64/libnss_ldap.so.2
#11 - fork () from /lib64/libc.so.6
#12 - sm_si_command_openv (argv=0x2aaabf68c350, options=<value optimized out>, user=<value optimized out>, stream=0x47dfbee0)
#13 - sm_si_command_open (command=<value optimized out>, options=0x2aaab49ca94d "b", user=0x2b78647bddfd "CURRENT", stream=0x47dfbee0)
#14 - run_script (scriptName=@0x47dfc380, directoryName=@0x47dfc370, interpreterName=@0x47dfc360, parameters=@0x47dfc720, stdindata=@0x47dfc710, test=0, trace=0)
#15 - ACT_Script::run_ex (this=<value optimized out>, parameters=@0x47dfc720, stdindata=@0x47dfc710)
#16 - ACT_Script::run (this=0x4, parameters=@0x0, stdindata=@0x0)
#17 - ACT_Script::run_st (ref=<value optimized out>, ret_val=<value optimized out>, args=<value optimized out>) at ACT_Script.h:553
#18 - MR_OpDef::do_Invoke (this=<value optimized out>, ref=<value optimized out>, ret_val=@0x0, argValues=@0x0)
#19 - MR_ClassDef::do_OpInvoke (this=<value optimized out>, op_name=@0x47dfce80, ref={<MR_ObjRef> = {<MR_ObjId> = {_Self = 0x47dfc9f0, static objectGettingDeleted = 0x0}, static empty_table = {<CI_HashTable_U<MR_ObjConstRef,MR_modification_coll_pset*>> = {<CI_Table_U<MR_ObjConstRef,MR_modification_coll_pset*>> = {_vptr.CI_Table_U = 0x2aaaac866750}, _rep = {_size = 0, _buckets = 0x2aaaaba76c00, _pivot = 0, _k = 1, _flags = 0 '\0', _threshold = 16}}, _ModifiedPSets = {_vptr.CI_Vector_U = 0x2aaaac866db0, _data = 0x2aaaaba76c80, _dSize = 1, _size = 1}, _NumDeletions = 0}}, <No data fields>}, ret_val=@0x47dfcb40, args=@0x47dfcae0)
#20 - MR_DynamicAccess::invokeOperation (instance=<value optimized out>, op_name=@0x47dfce80, ret_val=@0x47dfcb40, args=@0x47dfcae0)
#21 - SM_RepositoryServer::invokeOperation (this=<value optimized out>, cls=@0x47dfceb0, instance=@0x47dfcea0, op=@0x47dfce80, args=@0x2aaabf6891d0)
#22 - SM_ServerInterface::invokeOperation (this=<value optimized out>, className=@0x4, instanceName=@0x0, opName=@0x0, args=@0x1)
#23 - ASL_ValueObjectOperation::asAnyVal (this=0x2aaabdedc4e0, parser=<value optimized out>)
#24 - ASL_ValueObjectOperation::valueOf (this=0x2aaabdedc4e0, parser=0x2aaabf4fb060)
#25 - ASL_ParseActionExpr::perform (this=<value optimized out>, parser=0x2aaabf4fb060)
#26 - ASL_ParseActionList::perform (this=0x2aaabf4faaa0, parser=0x2aaabf4fb060)
#27 - ASL_Rule::match (this=0x2aaabf4faf20, parser=0x2aaabf4fb060, skipAhead=0, result=0x0, args=0x0)
#28 - ASL_Parser::parse (this=0x2aaabf4fb060)
#29 - GA_DriverRunner::run (this=0x2aaabf4fb440)
#30 - CI_ClosureM0<CI_Runner>::apply (this=0x0)
#31 - CI_Thread::_thread_run (arg=0x2aaabf4fb440)
#32 - _thread_run_wrapper (arg=0x0)
#33 - thread_wrapper (data=0x2aaabc7d5560)
#34 - start_thread () from /lib64/libpthread.so.0
#35 - clone () from /lib64/libc.so.6

This listing shows us the stack of nested functions called by the crashed thread. The thread starts by calling "clone" (at the bottom of the listing) which calls "start_thread" which in turn calls "thread_wrapper" .. etc - working up the listing and stack. So at the time the sm_POSIXstacktrace was called, the thread was 34 function levels deep in nesting.

Now run your eye up and down the list to see what you can recognize. Well .. you can see a call to ASL_Parser::parse at line 28 - so could guess the thread was running an ASL script. Looking up from there, you see the entries that suggest a class operation is being "invoked" between lines 22 and 18. Lines 17 and 16 seem to suggest that the ACT_Script class's "run" operation is being called - and a quick check with the perl dashboard documentation (or dmctl) confirms that this function does indeed exist and is callable from ASL.

A question worth answering at this point is: what script is ACT_Script trying to run?

Going a little further up the stack - like #14 looks interesting. It still refers to a function that (by it's name) seems to be something to do with running a script, but also it has named arguments that have memory addresses associated with them. One of these is called "scriptName".

#14  - run_script (scriptName=@0x47dfc380, directoryName=@0x47dfc370, interpreterName=@0x47dfc360,
parameters=@0x47dfc720, stdindata=@0x47dfc710, test=0, trace=0)

When I saw this, I guessed that this scriptName was a string data type. If this were pure "C" then I would have expected the address 0x47dfc380 to point directly to the name.

Turn to the core file

The next step was to run the following linux command (with the core number replaced with the domain's "pid" - which you can find be searching for "pid=" in the log file)..

gdb --core=/core.1234

If you've never used gdb before - there is plenty of help available with the "help" command, it's man and info pages and on the internet.

I started by assuming that the address I just noted points to a "C" null-terminated string. If I was right, I should have been able to read it using the "x" command. Here's the command and the result I got..

(gdb) x/1s 0x47dfc380
 0x47dfc380:   "\76h"

That didn't look like any kind of scriptName to me - so my next guess was that this address pointed to a structure that in turn pointed to the name. Here's the command I used to check this idea...

(gdb) x/4a 0x47dfc380
 0x47dfc380:     0x2aaabcedcba0  0x2aaaac8a1ee4
 0x47dfc390:     0x0     0x2aaaadf9cf98

I retried the first command but with the new address "0x2aaabcedcba0" instead - still nothing useful, so I tried the address dump instead..

(gdb) x/4a 0x2aaabcedcba0
 0x2aaabcedcba0: 0xc     0x2aaa00000003
 0x2aaabcedcbb0: 0x2aaabcedcbb8  0x756f72672d6c6674

The first value displayed was too small to be an address - maybe it was the string's length? The second value didnt look like an address because it had too many zeros in it - which was too regular for my taste. So my next thought was to try a third dump using the third value as the address and see what that gave ..

(gdb) x/1s 0x2aaabcedcbb8
 0x2aaabcedcbb8:  "abc-group.sh"

Great! It looked like I had found it.

So -- here's the steps to use again next time..

  1. Identify a meaningful function name with named parameters that have addresses - and which you'd expect to contain a useful string. Lets call that address "A"
  2. Display the address stored at that address using the gdb command "x/4a A" (replace "A" with the address)
  3. Take the first address now displayed (call it "B") and repeat the "x" command using the new address.. "x/4a B"
  4. Check that the first value shown is sensible for the string length you're expecting. If so ..
  5. Take the 3rd value shown as address "C" and using the command "x/1s C" to display the string there.

Here's the whole sequence for our worked example (using colours to show how addresses were taken from the output of one command and used as the input of the next).

(gdb) x/4a 0x47dfc380
 0x47dfc380:     0x2aaabcedcba0  0x2aaaac8a1ee4
 0x47dfc390:     0x0     0x2aaaadf9cf98
(gdb) x/4a 0x2aaabcedcba0
 0x2aaabcedcba0: 0xc     0x2aaa00000003
 0x2aaabcedcbb0: 0x2aaabcedcbb8  0x756f72672d6c6674
(gdb) x/1s 0x2aaabcedcbb8
 0x2aaabcedcbb8:  "abc-group.sh"

This brief research told me that the domain crashed while running the "abc-group.sh" script. I found where in my code this happens, commented it out and re-tested to see whether SAM held up better - which it did.

So that gave me a temporary work-around to the problem while the EMC engineers dug into why that part of the domain's code causes this problem under these circumstances - and hopefully came up with a patch.

Scroll to Top