Wednesday, December 31, 2008

Did I get my money's worth?

At the end of November I submitted a purchase order for my 2009 annual Oracle support. I passed the six-figure mark several years ago and am well on my way to topping the next hundred thousand next year. Here is my report from Metalink of the number of incidents and close times for the year:

2. Close Rates (closed calls only)

Time        Incidents       %   Cumulative Total
4 h                 1    1.41               1.41
12 h                1    1.41               2.82
24 h                0    0.00               2.82
48 h                3    4.23               7.05
72 h                1    1.41               8.46
1 Week              2    2.82              11.28
2 Weeks             3    4.23              15.51
4 Weeks            40   56.34              71.85
>4 Weeks           20   28.17             100.02

Total Incidents: 71


That doesn't count the 5 TARs I still have open, one of them since May. 84% of my TARs took four weeks or more to close. Let me say that again: FOUR WEEKS OR MORE. Each TAR cost me approximately $2,500 (roughly my annual support bill spread across 71 closed incidents).

So, Oracle Community, did I get adequate return on my investment?

(Oh, and BTW, I love how my cumulative total adds up to 100.02%)

Monday, December 29, 2008

The Bug, part II

AH HA! You guys took the bait!

Granted, this is a bug in the particular JDBC driver I am using. The 10.2.0.2 driver straight out of the box has bug 4711863, which can insert bad data when you use Java 1.5. In fact, I proved it with a scenario similar to:
drop table tablea
/
drop table tableb
/
create table tablea (
  record_id number(38) primary key,
  price     number)
/
create table tableb (
  record_id number(38),
  price     number)
/
-- fill tablea with small fractional prices
begin
  for i in 1..100 loop
    insert into tablea values (i, 0 + (i * .0000001));
    commit;
  end loop;
end;
/

quit


Then my Java program sucks data from tableA and inserts it into tableB. Inside the program, I convert a Double to BigDecimal using code like:

String insertString = "insert into tableB values (?, ?)";
// prepareStatement, not prepareCall -- this is a plain insert, not a stored call
PreparedStatement stmt2 = this.dbConnection.prepareStatement(insertString);
stmt2.setInt(1, rs.getInt("RECORD_ID"));
// new BigDecimal(double) expands the nearest binary double exactly --
// this is the conversion that trips bug 4711863 on the way in
stmt2.setBigDecimal(2, new BigDecimal(rs.getDouble("PRICE")));
stmt2.executeUpdate();

I query the data from tableB and boy am I surprised:

SQL> select * from tableb where price <> price*1;

RECORD_ID PRICE
-------------------- --------------------
2 .00000019RS
3 .00000029RS
4 .00000039RS
5 .00000049RS
6 .00000059RS
7 .00000069RS
8 .00000079RS
9 .00000089RS
10 .00000099RS

So maybe it's a SQL*Plus setting or something. I try:

SQL> set numwidth 20
SQL> /

RECORD_ID PRICE
-------------------- --------------------
2 .00000019RS
3 .00000029RS
4 .00000039RS
5 .00000049RS
6 .00000059RS
7 .00000069RS
8 .00000079RS
9 .00000089RS
10 .00000099RS


Same deal.

I go back and verify my source data from tableA, and sure enough it's correct:

SQL> select * from tablea where record_id = 2;

RECORD_ID PRICE
--------------- ---------------
2 .0000002

SQL> l
1* select record_id, price, dump(price) dmp from tableb where price <> price * 1
SQL> /

RECORD_ID PRICE DMP
--------- -------------------- ----------------------------------------
2 .00000019RS Typ=2 Len=21: 189,20,246,100,100,100,100
,100,100,100,10,50,63,24,66,18,73,52,74,
72,23

3 .00000029RS Typ=2 Len=21: 189,30,246,100,100,100,100
,100,100,99,65,25,44,36,48,77,59,78,61,5
7,85

4 .00000039RS Typ=2 Len=21: 189,40,246,100,100,100,100
,100,100,99,19,100,25,48,31,36,46,4,48,4
3,46

5 .00000049RS Typ=2 Len=21: 189,50,246,100,100,100,100
,100,100,98,74,75,6,60,13,95,32,30,35,29
,7

6 .00000059RS Typ=2 Len=21: 189,60,246,100,100,100,100
,100,100,98,29,49,87,71,96,54,18,56,22,1
4,69

7 .00000069RS Typ=2 Len=21: 189,70,246,100,100,100,100
,100,100,97,84,24,68,83,79,13,4,82,8,100
,30

8 .00000079RS Typ=2 Len=21: 189,80,246,100,100,100,100
,100,100,97,38,99,49,95,61,71,91,7,95,85
,92

9 .00000089RS Typ=2 Len=21: 189,90,246,100,100,100,100
,100,100,96,93,74,31,7,44,30,77,33,82,71
,53

10 .00000099RS Typ=2 Len=21: 189,100,246,100,100,100,10
0,100,100,96,48,49,12,19,26,89,63,59,69,
57,14


OK, so the JDBC driver isn't 100% correct when you convert between BigDecimal and Double. Big whoop. And in fact, the aforementioned bug is fixed if you apply the appropriate patch.
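For what it's worth, the client-side dodge is to never round-trip the value through a binary double on its way to the bind. A minimal sketch of my own (assuming the driver's BigDecimal fetch path is unaffected):

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CopyPrices {
    // Copy rows from tableA to tableB without the lossy double round-trip.
    static void copy(Connection dbConnection, ResultSet rs) throws Exception {
        PreparedStatement stmt2 =
            dbConnection.prepareStatement("insert into tableB values (?, ?)");
        while (rs.next()) {
            stmt2.setInt(1, rs.getInt("RECORD_ID"));
            // getBigDecimal() keeps the decimal value exact; new BigDecimal(double)
            // first expands the nearest binary double to dozens of digits.
            stmt2.setBigDecimal(2, rs.getBigDecimal("PRICE"));
            stmt2.executeUpdate();
        }
    }
}

If you really must start from a double, BigDecimal.valueOf(rs.getDouble("PRICE")) also avoids the long binary expansion, since it goes through Double.toString().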

My problem is with the fact that Oracle (the database software) let a bad piece of data in. Oracle says the client should type the data correctly when binding. I agree. But I also think the server shouldn't let non-numeric data into numeric columns, regardless. I contend that this is a bug on the server side as well as on the client side. And this is only one specific case of a Java program inserting bad data; I have some tables with non-numeric characters in numeric columns that were never touched by Java.

Monday, December 22, 2008

Where is the bug?

So, if I have a table with a column defined as NUMBER(10) and try to insert the value 'ABCDEFG' into it, I get an "ORA-01858: a non-numeric character was found where a numeric was expected" error. Fair enough; I expect the database to do type checking and keep an invalid number out of the field.

In addition, if I try to insert an 11-digit number, I get an "ORA-01438: value larger than specified precision allowed for this column" error.
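Concretely, the sort of sanity check I mean (table and column names are mine):

SQL> create table t (n number(10));
SQL> insert into t values ('ABCDEFG');    -- rejected: non-numeric data
SQL> insert into t values (12345678901);  -- rejected: ORA-01438, 11 digits into NUMBER(10)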

So far so good.

If an invalid number, such as "-0" or "17~:49" got into the database via JDBC, would it be a client problem, a server problem, or both?

Friday, December 19, 2008

11g

I was asked the other day when I was going to start writing about 11g.

Quite frankly, I literally just got everything off 9i last weekend. The final thorn in my side, Oracle Applications, was started at the close of business on Friday and lingered into the wee hours of Saturday morning. It took about 18 months, and my standard install of 10g has 23 patches applied, but I finally have a consistent environment on 10g.

And let's face it, 10g is a pretty darn good product once it's patched. The combination of 10g and the Linux 2.6 kernel is a rock-solid platform with a laundry list of features you'll probably never need.

I'm keeping abreast of what's new in 11g, and I haven't really seen anything compelling enough to pique my curiosity to even install it. Maybe my opinion will change when the calendar flips over and we have some of that mythical "free" time, but for now, I'll be sticking to 10g topics.

Wednesday, December 17, 2008

Phone Interview tips for DBAs

You only get one chance to make a first impression. I would think the following tips would be common sense, yet I am surprised every day.
  1. When on a phone interview, don't put the interviewer on hold.
  2. I don't mind calling you on your cell, but if the call drops, you're done.
  3. I know you might be nervous, but don't cut me off in mid-sentence.
  4. When you take my call, don't be driving in a car.
  5. I don't expect you to pick up the phone on the first ring, but please pick it up. If I get your Voicemail, I'll try you again in five minutes. After that, you're on your own.

Thursday, December 04, 2008

Database Administrator

Job Description
Interactive Brokers Group, an industry leader in direct-access brokerage, is seeking a motivated Database Administrator to provide day-to-day production support for our highly automated, mission-critical systems. This position requires interacting with very technical users, big-picture analysis, and acute problem-solving skills.


Required Skills
  • 3+ years' experience as an Oracle DBA in a production environment.
  • Linux and/or Solaris
  • Day-to-day Database Administration.
  • Backup and Recovery using Recovery Manager.
  • Oracle 10g (at least 2 years)
  • Ability to quickly diagnose and resolve performance problems using standard tools (explain plan, OWI, Profiler, Statspack)
  • Some experience with partitioning
  • Self-starter with initiative and desire to learn
  • Ability to work quickly in a fast-paced environment.
  • Rotating on-duty coverage is expected.
  • Ability to communicate clearly, constructively and concisely to peers and end-users.
  • Ability to work independently with minimal supervision and drive issues to closure.

Desired Skills
  • Experience with Oracle Applications R12
  • Automation of DBA tasks (Perl, CGI, Ksh)
  • Some experience in 11g
  • DR Strategies including Standby Database
  • Financial Industry Background
  • Oracle Connection Manager


Job Code: JH-DA-CT

Interactive Brokers Group LLC

Apply Now

Tuesday, October 14, 2008

v$session.audsid

I am working on a project that requires me to get the name of the connecting program during logon to the database.

"What an ideal place for a logon trigger!", I said to myself.

I thought the task would be quite simple: just get SYS_CONTEXT('USERENV','SESSIONID') and use that to query v$session. I knew that v$session.audsid is not unique, since the background processes show "0", but every other process gets a number that is unique for that moment in time. Or so I thought.
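For reference, a minimal sketch of the sort of trigger I mean (the trigger name is mine, and the owner needs select privileges on v_$session):

create or replace trigger capture_program_on_logon
after logon on database
declare
  v_program v$session.program%type;
begin
  select program
    into v_program
    from v$session
   where audsid = sys_context('USERENV', 'SESSIONID');
  -- do something useful with v_program here, e.g. stamp an audit table
end;
/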

I tested my logon trigger like crazy in the development environment. Then I put it in the QA environment for a few days to make sure it was OK. No problems whatsoever during development or QA.

The first day I put my logon trigger in production, sure enough, it failed. It threw a plain old PL/SQL error (ORA-01422) indicating two rows were returned where a single row was expected.

Hmmm, that didn't happen in development.

After about 10 days of working on it here and there, I found that the threads of an RMAN backup all share the same v$session.audsid: 4294967295, which is -1 stored as an unsigned 32-bit value.

While just some monitoring stuff was running:

system@jh01.dev> l
1 select sid, serial#, username, program, audsid
2* from v$session where machine = 'barney'
SID SERIAL# USERNAME PROGRAM AUDSID
---------- ---------- ------------ ---------------------------------------- ----------
170 1 oracle@barney (PMON) 0
159 1 oracle@barney (CJQ0) 0
169 1 oracle@barney (PSP0) 0
168 1 oracle@barney (MMAN) 0
167 1 oracle@barney (DBW0) 0
166 1 oracle@barney (DBW1) 0
165 1 oracle@barney (DBW2) 0
164 1 oracle@barney (DBW3) 0
152 1 oracle@barney (ARC1) 0
153 1 oracle@barney (ARC0) 0
157 1 oracle@barney (MMNL) 0
158 1 oracle@barney (MMON) 0
160 1 oracle@barney (RECO) 0
162 1 oracle@barney (CKPT) 0
163 1 oracle@barney (LGWR) 0
161 1 oracle@barney (SMON) 0
139 65 JEFFH perl@barney (TNS V1-V3) 1600567
145 139 JEFFH perl@barney (TNS V1-V3) 1600566
143 47 JEFFH perl@barney (TNS V1-V3) 1600572
146 2 oracle@barney (q000) 0
149 1 oracle@barney (QMNC) 0
156 6 oracle@barney (q001) 0

While RMAN was running with two threads:
system@jh01.dev> /

SID SERIAL# USERNAME PROGRAM AUDSID
---------- ---------- ------------ ---------------------------------------- ----------
95 3229 SYS rman@barney (TNS V1-V3) 4294967295
170 1 oracle@barney (PMON) 0
158 1 oracle@barney (MMON) 0
169 1 oracle@barney (PSP0) 0
168 1 oracle@barney (MMAN) 0
167 1 oracle@barney (DBW0) 0
166 1 oracle@barney (DBW1) 0
165 1 oracle@barney (DBW2) 0
164 1 oracle@barney (DBW3) 0
163 1 oracle@barney (LGWR) 0
152 1 oracle@barney (ARC1) 0
153 1 oracle@barney (ARC0) 0
157 1 oracle@barney (MMNL) 0
159 1 oracle@barney (CJQ0) 0
160 1 oracle@barney (RECO) 0
162 1 oracle@barney (CKPT) 0
161 1 oracle@barney (SMON) 0
118 30350 SYS rman@barney (TNS V1-V3) 4294967295
145 139 JEFFH perl@barney (TNS V1-V3) 1600566
143 47 JEFFH perl@barney (TNS V1-V3) 1600572
139 65 JEFFH perl@barney (TNS V1-V3) 1600567
146 2 oracle@barney (q000) 0
149 1 oracle@barney (QMNC) 0
156 6 oracle@barney (q001) 0

Interesting?
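For what it's worth, the dodge I'd try next is to key off the session's SID instead of the AUDSID; 10g exposes it through USERENV, and SID is unique at any moment in time (a sketch, untested against an RMAN workload):

select program
  from v$session
 where sid = to_number(sys_context('USERENV', 'SID'));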

Tuesday, July 29, 2008

10.2.0.4 Post Mortem

In the end, my 10.2.0.4 SGA problem came down to the _db_block_cache_protect parameter. It seems that setting that parameter in 10.2.0.4 maps the SGA to /tmp/shm instead of real memory. The immediate cause of my:

ORA-27102: out of memory

Linux-x86_64 Error: 28: No space left on device

was that I didn't have enough space mounted at /tmp/shm. When I did allocate enough space, I got an ORA-07445 [skgmfixup_scaffolding()+129].

Anybody who deals with Oracle Support knows that getting a different error message is essentially the kiss of death for your TAR. True to form, a new TAR was created for the new error message and the attempts to close the original TAR started. In the end, they would rather provide a workaround ("don't use _db_block_cache_protect; 10.2 is much more stable anyway") than a solution.

Unfortunately, the only help I can provide you is to not use _db_block_cache_protect in 10.2.0.4 with big SGAs.

Tuesday, July 22, 2008

The 10.2.0.4 16g Solution

If you've been following along, I've had quite a tough time with 16G SGAs on 10.2.0.4. With help from Kevin Closson and Oracle Support, we found that the init.ora parameters "_db_block_cache_protect=true", "_db_block_check_for_debug=true", and "db_block_checking=true" prevent 10.2.0.4 from allocating a 16G SGA. Not a problem in 10.2.0.3, mind you, but definitely a problem in 10.2.0.4.

In the end, it wasn't NUMA after all.

Click the 10.2.0.4 label below to follow all threads on this issue.

Tuesday, July 15, 2008

The great NUMA debate, part V

There were some questions about whether I had actually turned NUMA off in one of my previous tests. I repeated the test with similar results.

If I looked at my /proc/cmdline, I got:
ro root=LABEL=/ console=ttyS1 pci=nommconf elevator=deadline selinux=0 numa=off

Then, I looked at my boot messages and saw:
NUMA turned off

Lastly, my numactl --hardware showed:
available: 1 nodes (0-0)
node 0 size: 34815 MB
node 0 free: 31507 MB

I still couldn't start a 16GB SGA. Interestingly enough, I couldn't start a 4G SGA either! I had to go back to booting without numa=off. The saga continues...

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.

Wednesday, July 02, 2008

The moment you've all been waiting for...

No, no, not the Brangelina twins announcement, the numactl output.

$ ps -ef | grep oracle

oracle 13771 2418 0 Jun24 pts/0 00:00:00 -ksh
oracle 13772 13771 0 09:24 pts/0 00:00:00 ksh -i
oracle 13775 13772 0 09:24 pts/0 00:00:00 ps -ef
oracle 13776 13772 0 09:24 pts/0 00:00:00 grep oracle

$ numactl --hardware

available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 7854 MB
node 1 size: 8191 MB
node 1 free: 5725 MB
node 2 size: 8191 MB
node 2 free: 6757 MB
node 3 size: 8191 MB
node 3 free: 6671 MB

OK, so maybe not everybody was waiting for that. Oracle Support requested an strace of the startup command, so I had to bring the db down anyway. The strace was a good idea; they'll be able to see the system calls being made and such. Maybe we'll get some progress yet...

If you're following along at home: Part I, Part II, Part III

More info from the comments:

I shutdown everything again:
available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 7854 MB
node 1 size: 8191 MB
node 1 free: 4846 MB
node 2 size: 8191 MB
node 2 free: 6744 MB
node 3 size: 8191 MB
node 3 free: 6646 MB

No surprises there (except, of course, for the strangeness KC pointed out before). Then I decided to startup nomount with a 10.2.0.3 $OH and a 16g SGA:

available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 7660 MB
node 1 size: 8191 MB
node 1 free: 4315 MB
node 2 size: 8191 MB
node 2 free: 6708 MB
node 3 size: 8191 MB
node 3 free: 6646 MB

Very unexpected. It looks like whatever numactl is showing is not affected by a 16G SGA.

I changed my $OH back to 10.2.0.4 and started a 12g SGA:
available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 25 MB
node 1 size: 8191 MB
node 1 free: 257 MB
node 2 size: 8191 MB
node 2 free: 6740 MB
node 3 size: 8191 MB
node 3 free: 6642 MB

This is more like what I expected. You can see the 12G SGA leaves nodes 0 and 1 with next to nothing free. If we believe the previous output, that's my 12G of RAM: (7854 - 25) + (4846 - 257) ≈ 12418 MB gone from nodes 0 and 1.

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.

Tuesday, June 24, 2008

Part III

I escalated my issue to a duty manager and explained that in my environment those types of changes are very difficult. He kicked it back to development for alternate solutions.

In the background, I managed to secure a test box for a couple of hours that I could play with. While researching how to turn NUMA off, I ran across this on Red Hat's site:
The Red Hat Enterprise Linux 4 Update 1 kernel automatically disables NUMA optimizations (numa=off) by default on systems using the AMD64 dual core processor. This ensures stable operation across multiple systems where each has a different system BIOS implementation for reporting dual core processors.

Hmm, that's fishy. Why would they ask me to turn it off if the default value is already off? So I did it anyway.

Same thing, good old ORA-27102: out of memory.

As long as I had the box booted with numa=off, I might as well try _enable_numa_optimization=false, right? No dice, same error.

Last, but not least, I tried their suggestion of setting vm.nr_hugepages=0 in the sysctl.conf. But I found out that's already my value:
sysctl -a | grep vm.nr_huge
vm.nr_hugepages = 0

OK, so now what?

Maybe 11g is an option...

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.

Monday, June 23, 2008

Something to be aware of, Part II

We've passed the one-month anniversary of Something to be aware of (10.2.0.4) and still have no meaningful results. I've been going back and forth with Oracle Support about switching NUMA off. A synopsis:

OS: Switch NUMA off and retest.
Me: But that never changed.
OS: Switch NUMA off and retest.
Me: We take kernel changes very seriously. It will take approximately 2 months to re-certify that configuration. Besides, the NUMA settings never changed.
OS: Switch NUMA off and retest.
Me: Please direct me to the note that says 10.2.0.4 requires NUMA changes.
OS: Switch NUMA off and retest.
Me: I've already proven 10.2.0.3 works as expected in this configuration, NUMA settings never changed.
OS: Switch NUMA off and retest.


My next step is talking to a Duty manager. I really don't want to do that.

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.

Thursday, June 19, 2008

Backing up at Home

I'm pretty particular about my backups at work. In fact, we have a setup where if a nightly backup fails, we get a page in the middle of the night so we can address the problem.

At home, it's a different story. I've got a Windows 2000 PC for which I haul out the CD-Rs a couple of times a year and back up what I think are the important files. At first it wasn't so bad: a couple of CDs and I was done. But pictures from a digital camera eat up space at a pretty decent rate. The last backup I did took 18 CD-Rs.

About the same time, I got the opportunity to help someone with their own backup issues. They were getting ready to retire a computer and wanted a way to move about 60G of data from one machine to another. So I investigated some options and found a nice 320G SimpleTech USB drive at Costco. We set up the drive and the ArcSoft TotalMedia Backup software in about 20 minutes and completed a test backup in another hour or so.

I was so impressed, that I bought one myself. For $89, I just couldn't go wrong.

So far, my impression is that TotalMedia can get the basics done. However, as a backup professional, I need the ability to tweak things. I'm still doing a backup every day, but I'm looking at other options. Stay tuned for updates.

Monday, June 16, 2008

Microsoft Vista Oy Vey!

I needed a new laptop for my home. OK, maybe "need" is not the correct word, but you get the picture.

So I did what I always do: I navigated to dell.com and purchased an Inspiron 1525 with a 15" screen, 4G of RAM, and Microsoft Windows Vista Home. I went through all the normal setup steps and had it on my wireless network in about 15 minutes.

So far, so good. Then I tried to add a network printer [Accept or Deny] that was attached [Accept or Deny] to my W2K desktop. Since W2K doesn't serve the drivers like XP, I had to install the drivers [Accept or Deny] for my printer before I could print. Then I had to tweak my firewall settings [Accept or Deny] so that I could print to my network printer. I used to think that the Apple Commercial was exaggerating, but it's not.

Then, to add insult to injury, I tried to print a document. First I had to activate Microsoft Word, which was simple. I typed in some text and...

Wait a second, where's the "save" button...

And what happened to the "File" menu...

Arg, even "Help" isn't on the menu any more.

I finally figured out that you have to click that stupid looking button at the top before you can "Save", "Print", "Open" or do anything you are used to.

Great way to enhance my productivity. Now it takes me 5 minutes to open and print a document instead of two mouse clicks. Word 2007 is a step backwards.

Tuesday, May 13, 2008

Something to be aware of (10.2.0.4)

My standard install of 10.2.0.3 included 23 patches at last count. When the 10.2.0.4 patchset came out a few weeks ago, I checked it and decided that since all 23 of my patches were included in 10.2.0.4, it would be a good idea to upgrade.

I first got underway in my development environment on x86_64, where I have several small dbs with 2G SGAs on a single box. The patch/upgrade process went flawlessly and I encountered no issues at all.

I usually like to run a patch in development for a month or so before I apply it to my production environment. However, I was fighting a nasty bug on one of my 10.2.0.3 production dbs that was fixed in 10.2.0.4, and a backport to 10.2.0.3 wasn't available at the time. So after a week in development, I decided to upgrade that production db to 10.2.0.4. That db usually runs with a 16G SGA, and when I tried to start the instance with the 10.2.0.4 software, I got the venerable:

SQL> startup
ORA-27102: out of memory
Linux-x86_64 Error: 28: No space left on device

OK, so I know that usually means my kernel parameters are off somewhere. It was a Saturday and I didn't have a sysadmin available, so I bumped the SGA down to 12G; the instance started right away and upgraded fine. I figured there was some shared memory parameter I would research on Monday and we'd be back in business the next week.

I looked at Note 301830.1 thinking maybe my shmall parameter was off, but sure enough I had enough configured. I submitted a TAR and basically let Oracle Support stew on it for a few days.

In the meantime, another DBA in my group got the task of setting up a new x86_64 box with 10.2.0.4 so we could move a db. He got the software installed OK but couldn't start an instance with an SGA of more than about 2G. The kernel parameters were exactly the same as on my other box, yet he still couldn't start an instance with any decent amount of memory.

This was just too coincidental. So we played around with the 10.2.0.4 installation a little more and found that 10.2.0.4 was allocating multiple shared memory segments instead of one big segment. I used ipcs to find the shmid for the SGA and then used pmap to discover that, to allocate a 2G SGA, 10.2.0.4 used about 260 shared memory segments of between 5M and 15M each.
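If you want to poke at this on your own box, this is the sort of inspection I mean (the process name is a placeholder for any background process of your instance):

$ ipcs -m                             # shared memory segments: shmid, bytes, nattch
$ pmap -x $(pgrep -f ora_smon_MYSID)  # per-mapping sizes for one server process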

Then we installed 10.2.0.3 and tried to start with the SAME EXACT parameter file, and sure enough, one shared memory segment of 2G. In fact, we could allocate almost all the way up to our shmall in one segment.

I'm bug 7016155, and I know others have run into this problem as well. I'm sure bug 7019967 thinks he's alone in this, but he's not. As usual, my bug has been sitting out there for about 12 days and nobody has looked at it.

Part II

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.

Thursday, April 10, 2008

RMAN-08137 Followup

Before I got a chance to apply my 10.2.0.4 patch, my problem fixed itself! Well, maybe not fixed itself, but it certainly went away.

As I was reviewing my backup log from last night, I noticed that my RMAN-08137 error was gone. I didn't do anything different, nor did I change anything. In fact, yesterday's backup had the same RMAN-08137 error, just as it had for the last 12 days.

So what was different? My control_file_record_keep_time was set to 1, so any record of my archived logs should surely have been gone way before now, right?

No. The control_file_record_keep_time parameter specifies the minimum number of days that recyclable records stay in the controlfile, not an absolute limit. Sure enough, although my control_file_record_keep_time was set to 1, the oldest record in my v$archived_log was from 12 days ago. All the records from the day I had the problem were gone.
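A quick way to see how long the records actually linger, using columns straight from v$archived_log:

SQL> show parameter control_file_record_keep_time
SQL> select min(completion_time), max(completion_time) from v$archived_log;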

Theoretically, I guess I could have switched my logfile a couple thousand times after a day and that would have cleared the issue up.

Wednesday, April 02, 2008

RMAN-08137: WARNING: archive log not deleted as it is still needed

A couple of days ago, the network link to my standby databases went dark in the middle of the night. My dbs took notice and started generating log shipping errors in the alert.log. About 10 minutes later, I got a page from my standby db saying his gap was more than the acceptable level and could I please do something. No problem, I just turned log shipping off until the link came back up. The link came back up later in the day, I resumed log shipping, and most of my dbs recognized the gap and started resending logs.

The operative word being "most".

One of my dbs was shipping an archived redo log when the link went down. When I resumed log shipping, Oracle thought he was still shipping a log and thought the gap started at the next log sequence number. He shipped the current logs just fine, but just wouldn't resolve the gap.

"No worries", I thought, "I'll just rcp the missing file and register them with the standby control file and I'll be on my merry way."

Sure enough, the standby started recovering immediately and caught up in short order. I thought everything was dandy until I got a page on Monday morning (2 AM, thank you very much) saying my log_archive_dest was about to fill up on the primary.

Our standard protocol for freeing up the log_archive_dest is to run an archivelog backup with DELETE INPUT specified. I kicked off the backup, killed the monitor that watched the log_archive_dest, and went back to bed. I set the alarm for 1 hour later just to make sure the backup was done.

When I got back up, the archivelog backup wasn't done and the log_archive_dest was at 99%. I then looked in the directory and saw logs that were 3 and 4 days old, which I knew was not right. I killed the backup, deleted the 4-day-old logs, crosschecked, and restarted the backup.

When I got into the office, my main task was to figure out why this happened. When I looked at my RMAN message log, I noticed that nearly every log being backed up was accompanied by a warning:
RMAN-08137: WARNING: archive log not deleted as it is still needed

Hmm, that's not cool. Maybe my standby was really out of sync and I just didn't know it. Sometimes the primary will ship logs and the standby will accept but not apply them because it is missing a log. But that wasn't my case; the standby was within one or two logs of the primary.

When I looked at v$archived_log, the "applied" flag indicated that the primary thought a bunch of logs had not been applied to the standby. I knew different, as the standby was reasonably current.

At that point, I needed assistance. I filed a TAR and within about 2 hours found out it is a bug (bug 4538727) that is fixed in the 10.2.0.4 patchset. Since the patchset had just come out, I hadn't gotten a chance to apply it to anything, so I asked for workarounds.

"None", they say.

Theoretically, I suppose I could recreate my control files and my standby control files. I'm not really excited about doing that.

Until I figure out what to do, I'll just have to use RMAN to delete my logs "backed up 1 times" and crosscheck to manage my log_archive_dest space.
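In RMAN terms, that boils down to roughly this (assuming disk backups; adjust the device type to taste):

RMAN> crosscheck archivelog all;
RMAN> delete archivelog all backed up 1 times to device type disk;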

My Resolution, sort of.

Wednesday, March 26, 2008

10.2.0.4 Patchset is out

Just noticed that the 10.2.0.4 patchset is out.

Tuesday, March 11, 2008

Maybe I'm starting to get it

Yes, another R12 post. It's all I've been doing lately.

When I first installed 11.5.3, I just went with the default two-tier architecture: Web and Forms on the middle tier, and Reports and CCM on the db host. After a while, I realized that was a mistake for my environment; I should have put everything but the db on the middle tier. Live and learn, I guess.

R12 was my chance to partially correct that mistake. I had a grand plan to migrate my middle-tier services to a hot x86_64 box and leave my db on Solaris. I figured that would give me an easier path if I needed to scale out either tier.

Except Oracle doesn't have a supported upgrade path across multiple platforms yet.

So I bounced a couple of ideas off a very helpful Oracle Support tech, and he prefaced each one with "This isn't supported, but you might try...". I finally settled on a plan and executed it.

First, I collapsed my 11.5.9 tiers onto a single Solaris box. I used the Metalink whitepaper and rapidclone to get everything onto my db box.

Then, I upgraded to 12.0.0 and applied the 12.0.4 RUP. At that point, I had a working 12.0.4 system with all my middle tier services on Solaris.

Next, I did a fresh install of 12.0.0 on a Linux box, telling the configuration that I was going to have two tiers with the db on Solaris. Then I deleted my network topology and reran autoconfig, which populated the db piece of my topology.

The next step was to apply the 12.0.4 patch to the Linux box. I did this with adpatch, using the "nodatabaseportion" option because I knew I had already applied the object changes to my db.
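That is, something like:

$ adpatch options=nodatabaseportion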

The final step was running autoconfig on the middle-tier.

Except for a few glitches here and there, it was a surprisingly straightforward split. Either I'm starting to get this Oracle Apps stuff, or Oracle is finally getting serious about their configuration management.

Tuesday, February 19, 2008

I'm In!

I went round and round with Oracle Support, creating spin-off TARs and transferring the TAR from one group to another. I finally found an article that said one of the 11.5.x versions was really executing fnd_web_sec.validate_login() under the covers.

Might as well give that a try, SINCE I'M STILL WAITING.

So I ran the function on my GUEST user and sure enough, it came back with a "Y" which indicated the GUEST user and password were correct.

I then tried the same for my SYSADMIN user and the function returned "N".

Hmm, that's interesting, let me try it again.

Sure enough, my old SYSADMIN password didn't work for some reason. I posted the results in the TAR, and the analyst suggested I change my SYSADMIN password with FNDCPASS.
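For reference, the FNDCPASS incantation is along these lines (all passwords are placeholders):

$ FNDCPASS apps/<appspwd> 0 Y system/<systempwd> USER SYSADMIN <newpwd>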

I changed it, and was in!

If this little tidbit of the upgrade is published somewhere, it sure wasn't obvious to me.

Of course, now I have to figure out why anything using the forms servlet won't come up. And how to use WebADI. And why my printers aren't defined.

But at least I'm in!

Friday, February 08, 2008

Fun & Games, Part III

The main TAR had been in "Review" status for 24 hours, so I duty managed it. I usually don't like to duty manage things because if you cry wolf enough, eventually nobody will listen.

The analyst updated the TAR and wanted the results from about a dozen URLs along with another half dozen log files.

Aha, trying to bury me, are you?

After an hour of trying the URLs (90% of which were from 11.5.9 and didn't work in R12), I assembled the output and packaged it in a nice neat .tar file. As expected, the TAR went back to "Review" status.

Yesterday I had something else to do, so I wasn't very diligent in following up. Which was OK, because there was nothing to follow up on. My TAR remained in "Review" status for the whole day.

This morning I had some free time, so I duty managed my TAR again. Within 20 minutes, the analyst handed the TAR off to another group by creating a "spinoff" and marked the original TAR as "Awaiting Internal Response".

It will end up being a 2 second fix, I just know it. A week later, and I'm still waiting for the slightest bit of help.

Wednesday, February 06, 2008

The Fun & Games Continued

Eventually, my TAR from R12 Upgrade Fun & Games suggested I rerun autoconfig for my DB tier, and I ran into another problem. Since it was not related to the original TAR, I opened a new one and put the original TAR in "Waiting for Customer" state until the new problem could be solved. Interestingly enough, the root cause of the new problem was that my $ORACLE_HOME was in a different location than rapidWiz thought it should be, so rapidWiz put my newly created context file in the wrong location. I was trying to run adconfig.sh with an 11.5.9 context file, which kept giving me an error about version compatibility. Fair enough; I was back on track in about two days.

So I went back to my original TAR and posted that, while I had run autoconfig on my db tier and bounced all my services, it still didn't fix my issue.

And I wait...

While waiting, I found Steven Chan's blog, which pointed me to Oracle Application Server Diagnostic Tools and Log Files in Applications R12 (ML Note 454178.1). Definitely something to test, SINCE I'M JUST WAITING.

...to be continued... (hopefully)

Monday, February 04, 2008

R12 Upgrade Fun & Games

So I've been toying around with an Oracle R12 upgrade lately. The first attempt went miserably wrong, with all sorts of issues when I applied RUP3.

Then I decided that upgrading my middle tier to Linux x86_64 sounded like a good idea. Except it wasn't. You see, first off, 11.5.9 isn't supported on x86_64, so you can't add a node to your existing 11.5.9 instance. And you can't install an R12.0 base on x86_64 and then upgrade your 11.5.9 db, because that too is not supported. I ended up collapsing my two-tier environment to a single tier so I could upgrade to R12, with the plan of adding an x86_64 R12 node later on. That, I double-checked, is supported.

I performed the 11.5.9 to R12.0.0 upgrade again on my test system and applied RUP4. Surprisingly, I only encountered about four problems. Two were not really problems because I wasn't running those modules, and the other two I researched on Metalink and quickly resolved.

When I brought the whole thing up (on a single tier), all the middle-tier services came up just fine. Yes, I was surprised too.

The first time I tried to log in is when my problems started. The ominous "Login failed. Please verify your login information or contact the system administrator" message came up. Hey, I am the system administrator; WTF am I supposed to do from here? No errors, no messages, no nothing.

No problem, I'll just run the standard diagnostic tests I used in 11.5.9 to see where my problem is. Wait, there's no FND_WEB.Ping anymore. Nor are there any isItWorking servlets.

Finally, I find http://hostname/OA_HTML/jsp/fnd/aolj_conn_test.jsp and run through the tests. Surprisingly, everything but the AOL/J connection test works.

"Aha, I'll provide that information to Oracle Support and they'll be able to tell me where to look right away.", I thought to myself.

"Please upgrade to the latest RUP" was their response.

Me: "Errr, I am on RUP4."

Them: "Please rerun autoconfig."

Me. "Errr, done, but that doesn't solve my problem. Besides, my AOL/J connection test doesn't work. What does AOL/J do and why would it fail with this error?"

Wait.

Wait.

Wait.

Me: "Um, this is kind of urgent. When I try to login to the forms server directly I get 'APP-FND-01542: This Applications Server is not authorized to access this database.'"


Wait.

Wait.

Me: "I found that if I change my s_appserverid_authentication to OFF in the context file I get a different error logging into the forms server. Now I'm getting 'Error: validateLogin returned false; failed to login as XXX, please try again'"

Wait.

Wait.

Me: "Is anyone looking at this TAR?"

Them: "Sorry for the delay, I am looking at your log files"

Wait.

Wait.

...to be continued.

Thursday, January 10, 2008

Patch 4502962

Seems as though there's no version of the R12.AD.A minipack for Linux x86_64. Yet, since no binaries are changed by this patch, you can run the 32-bit version without a problem.

Friday, January 04, 2008

R12 Upgrade, Attempt #2

My first upgrade attempt was a total bomb. The R12 portion went OK: only three minor problems, which I corrected as the patching took place. It took about three days, but it finished with only 40 or so invalid objects (which I consider pretty good considering the umpteen thousand that are in the database).

Instead of just going with 12.0.0, I decided to apply RUP3. Might as well be up to date when you upgrade, right? While applying the RUP3 patch, it filled up my filesystem. I started deleting logfiles so I could continue but couldn't keep up with adpatch. It left my db in a state of total chaos. Did I mention I didn't take a backup after my R12.0.0 upgrade completed?

I noticed during the R12.0.0 upgrade that I was using a decent amount of CPU on the middle tier while my db tier was busy but not pegged. So I swapped out my Solaris middle tier for an x86_64 middle tier to see if that would speed up the process.

One thing that came up during the initial install was that adrunias.sh was trying to start Java from /tmp/RapidInstall/jre/Linux. The rapidwiz kept throwing an RW-50010 error, which was totally nondescript. I finally ran adrunias.sh with ksh -v to see what it was really doing and found it was looking for the JRE in /tmp/RapidInstall/jre/Linux, but the only directory was /tmp/RapidInstall/jre/Linux_x64. A soft link cured that problem and I continued on my way. Let's see where this takes us...
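That is, something like:

$ ln -s /tmp/RapidInstall/jre/Linux_x64 /tmp/RapidInstall/jre/Linux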