Oh No! My BI Server has Core Dumped!
As well as supporting client's development teams, we often come in and help support teams in client organizations who are taking on Oracle BI and EPM Suite. One such client I was working with this week wanted me to help them with an OBIEE diagnosis checklist, where their staff could follow a series of steps to find out why their OBIEE environment was no longer available. In this case, they'd followed my instructions on working out which OBIEE component (or set of components) was down, but now they wanted a series of guidelines on why they were down.
To take an example, it's become clear that for whatever reason, the BI Server has gone down and people can no longer log in. The same may happen with the Presentation Server, which usually has the effect of making the J2EE server hosting the Presentation Services Plugin report an error message rather than bring up the dashboard login. Now in most cases you can restart the service or rerun the startup scripts under Linux/Unix, but how do you go about working out why they went down and try to fix things?
First off, unless you've done something obvious or you've got hardware problems, for example, if your BI Server or BI Presentation Server are crashing it's usually time to raise an SR and call in Oracle Support. If you look through metalink3 you'll see a fair few references to the BI Server and Presentation Server crashing, and they are usually due to bugs in the software that are then fixed in subsequent releases of OBIEE. There may well be workarounds or even back-ported fixes, but in most cases you'll need Oracle Support to first diagnose the problem and then either provide an interim patch or tell you to upgrade to the latest version (10.1.3.4.1 at the time of writing this article). If you're on an earlier release, such as one of the 10.1.3.3.x releases there are a whole bunch of issues you can hit over such things as scripts in connection pools, pivot tables and nested composite views that are fixed in 10.1.3.4.1, and even with 10.1.3.4 you find Oracle notes on metalink3 saying that time series functions, for example, are buggy in this release and only properly work in 10.1.3.4.1 (which is why time dimensions only appeared in the BI Apps 7.9.6, for example).
So how do you know if it's time to call Oracle support? Well on the principle of keeping it simple at the start, the first place I'd look for diagnostic information are the logs generated by the applications. Assuming you are on Linux, I'd take a look at the $ORACLEBI/server/log:
[oracle@ora11g Log]$ tail -n20 /u01/app/oracle/product/OracleBI/server/Log/NQServer.log 2009-08-06 22:29:42 [58001] Query Cache Statistics -- Hits:0(0%), Qualified Misses:1(100%), Unqualified Misses:0(0%). 2009-08-06 22:29:42 [14058] Unloaded all subject areas. 2009-08-07 07:24:34 [36007] Loading repository /u01/app/oracle/product/OracleBI/server/Repository/samplesales.rpd. 2009-08-07 07:24:34 [14055] Loading subject area: Sample Sales ... 2009-08-07 07:24:35 [14056] Finished loading subject area: Sample Sales. 2009-08-07 07:24:35 [58002] Query Cache loaded with 1 entries from saved cache files. 2009-08-07 07:24:35 [43030] : Oracle BI Server started. Version: 10.1.3.4.1.090414.1900. 2009-08-07 07:44:09 [43031] : Oracle BI Server shutdown. 2009-08-07 07:44:09 [58001] Query Cache Statistics -- Hits:0(0%), Qualified Misses:0(0%), Unqualified Misses:0(0%). 2009-08-07 07:44:09 [14058] Unloaded all subject areas.OK, so looking at this log file shows that the BI Server closed down cleanly, which means that something (actually me, in this case) shut it down gracefully. This doesn't tell us why, but it's at least a clue.
Under Linux and Unix each of the OBIEE server processes has an ".out.log" file that provides console messages from the application. If we take a look at this we can see the closedown again.
[oracle@ora11g Log]$ tail -n20 /u01/app/oracle/product/OracleBI/server/Log/nqsserver.out.log Oracle BI Server starting... rm: cannot remove `/u01/app/oracle/product/OracleBIData/tmp/nQS_*.TMP': No such file or directory nqsserver: Oracle BI Server started. Version: 10.1.3.4.1.090414.1900. nqsserver: Oracle BI Server shutdown. Oracle BI Server starting... nqsserver: Oracle BI Server started. Version: 10.1.3.4.1.090414.1900. nqsserver: Oracle BI Server shutdown.So this tells us two things. Firstly, the shutdown was graceful and also that it happened at 07.44am - if we have intermittent shutdowns the first log file will be useful in recording when they happen.
It's a similar story for the BI Presentation Server as well.
[oracle@ora11g Log]$ tail -n20 /u01/app/oracle/product/OracleBIData/web/log/sawlog0.log Type: Information Severity: 30 Time: Fri Aug 7 07:24:49 2009 File: project/sawserver/sawserver.cpp Line: 459 Properties: ThreadID-3086690528 Location: saw.sawserver Oracle BI Presentation Services have started successfully. --------------------------------------- Type: Information Severity: 30 Time: Fri Aug 7 07:51:31 2009 File: project/sawserver/sawserver.cpp Line: 604 Properties: ThreadID-3086690528 Location: saw.sawserver Oracle BI Presentation Services are shutting down. ---------------------------------------So again, the Presentation Server has shut down, cleanly and with no error message in the file (again, because I shut it down just before). If there are errors with the Presentation Server you'll often see messages in here, and you can in fact turn up the logging level in this file to see more detail (though typically you'd only really do this on the request of Oracle support).
So far then, we know that the BI Server and Presentation Server have gone down, cleanly, but we don't know why. At this point I'd check that there isn't something in my internal processes that brings these down, and I'd check my product versions against known bugs listed on metalink3 and upgrade to 10.1.3.4.1 if I'm not on that release already.
But what about more sudden crashes, which typically don't leave any entries in the log files as the BI Server doesn't have time to write them? Let's simulate this with a "kill -6" on the nqsserver process:
[oracle@ora11g setup]$ ps -ef | grep nqsserver oracle 7578 1 1 07:59 pts/2 00:00:01 /u01/app/oracle/product/OracleBI/server/Bin/nqsserver -quiet oracle 7803 5928 0 08:00 pts/2 00:00:00 grep nqsserver [oracle@ora11g setup]$ kill -6 7578First of all I'd do another "ps -ef" to list out the processes and check that in this case, the BI Server has gone down.
[oracle@ora11g setup]$ ps -ef | grep nqsserver oracle 7809 5928 0 08:00 pts/2 00:00:00 grep nqsserverIndeed it has. So do the log files show anything?
[oracle@ora11g setup]$ tail -n20 /u01/app/oracle/product/OracleBI/server/Log/nqsserver.out.log Oracle BI Server starting... rm: cannot remove `/u01/app/oracle/product/OracleBIData/tmp/nQS_*.TMP': No such file or directory nqsserver: Oracle BI Server started. Version: 10.1.3.4.1.090414.1900. nqsserver: Oracle BI Server shutdown. Oracle BI Server starting... nqsserver: Oracle BI Server started. Version: 10.1.3.4.1.090414.1900. nqsserver: Oracle BI Server shutdown. Oracle BI Server starting... rm: cannot remove `/u01/app/oracle/product/OracleBIData/tmp/nQS_*.TMP': No such file or directory nqsserver: Oracle BI Server started. Version: 10.1.3.4.1.090414.1900.No, as far as the BI Server log file is concerned, the server is still up. What about the alert log for the OS?
[root@ora11g setup]# grep nqsserver /var/log/messages [root@ora11g setup]#Nothing. Quite often you will find useful information here, but in this case we've just got the mystery "BI Server has crashed and left no trace" error. This is, however, were core dump files come in useful. Now most Linux setups I've worked with don't have core dumps enabled by default, and you can check this by running "ulimit -a" on your system.
[root@ora11g setup]# ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited max nice (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 32768 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 65536 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 max rt priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 16384 virtual memory (kbytes, -v) unlimited file locks (-x) unlimitedWhat this is saying is that the maximum core dump ("core file size") is 0, which means they won't be produced. To enable these, I need to edit the /etc/security/limits.conf file and ideally set the value to "unlimited". You can also enable core dumps in your session by running the ulimit command directly, like this
[oracle@ora11g ~]$ ulimit -c unlimitedThen I find the PID for the nqsserver process, and then kill it with an invalid memory reference signal to create a core dump.
[oracle@ora11g ~]$ ps -ef | grep nqsserver oracle 6275 1 18 09:29 pts/2 00:00:00 /u01/app/oracle/product/OracleBI/server/Bin/nqsserver -quiet oracle 6391 4591 0 09:29 pts/2 00:00:00 grep nqsserver [oracle@ora11g ~]$ kill -SIGSEGV 6275 [oracle@ora11g ~]$ ps -ef | grep nqsserver oracle 6411 4591 0 09:30 pts/2 00:00:00 grep nqsserver [oracle@ora11g ~]$ ls /u01/app/oracle/product/OracleBI/server/Log core.6275 NQQuery.log NQServer.log nqsserver.out.logAnd there ("core.6275") is our core dump. Whilst processes generally core dump in the directory that the process is running in, the Oracle BI Server core dumps into the $ORACLEBI/server/Log directory instead.
So, what now? Well generally if you are getting core dumps, as I said it's time to call Oracle support and get them to take a look. When you raise the SR though, what they'll generally get you to do is to analyze the core dump file with the "gdb" utility, which produces diagnostic data that can be useful to product engineering. You can do this yourself if you already have gdb installed, and take a look at what caused previous core dumps.
[oracle@ora11g ~]$ gdb GNU gdb 6.6 Copyright (C) 2006 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "". (gdb) core /u01/app/oracle/product/OracleBI/server/Log/core.7225 (no debugging symbols found) Using host libthread_db library "/lib/libthread_db.so.1". Core was generated by `/u01/app/oracle/product/OracleBI/server/Bin/nqsserver -quiet'. Program terminated with signal 11, Segmentation fault. #0 0x00572402 in __kernel_vsyscall () (gdb) quit [oracle@ora11g ~]$So as you would expect seeing that I caused the segmentation fault, this is what is listed in the core dump extract. Sometimes this file can be useful even to yourself, as it can reference actions (writing to tables, for example) that you can trace back to something you've just added to the RPD, but usually it's something that Oracle support will use to work out if you've hit a bug.