Friday, December 15, 2006

What is the optimizer doing?

Interesting problem here. I have a table in one of my Oracle 9.2.0.5 databases on Solaris called ACTIVITY_RTAB:

SQL> desc activity_rtab;
Name Null? Type
----------------- -------- --------------
ID NOT NULL NUMBER(19)
FLAG1 NOT NULL CHAR(1)
MONEY_AMOUNT NOT NULL MONEYAMOUNT_T
DATE1 NOT NULL DATE
DATE2 NOT NULL DATE
DATE3 NOT NULL DATE
DATE_MATCHED NOT NULL DATE
FLAG2 NOT NULL CHAR(1)
STR1 NOT NULL VARCHAR2(100)
FLAG3 NOT NULL CHAR(1)
ID2 NOT NULL NUMBER(19)
STR2 NOT NULL VARCHAR2(10)
ID3 NOT NULL NUMBER(19)
FLAG4 NOT NULL CHAR(1)
DATE4 NOT NULL DATE
DATE5 NOT NULL DATE
ID4 NOT NULL NUMBER(38)
DATE6 NOT NULL DATE

This table has two indicies; a Primary Key on ID, and a NON-UNIQUE index on DATE_MATCHED. The data in DATE_MATCHED is pretty skewed. I have about 10K rows per day in this table, except for the date '31-DEC-4712', which I have about 355K rows.

select
before_on,
avg(rows_per_day) avg_rows_per_day
from (
select
case
when date_matched < '31-dec-4712' then 'before'
else 'on'
end before_on,
trunc(date_matched) dm,
count(*) rows_per_day
from activity_rtab
group by
case
when date_matched < '31-dec-4712' then 'before'
else 'on'
end,
trunc(date_matched)
)
group by before_on
/

BEFORE AVG_ROWS_PER_DAY
------ -----------------
before 9413.75
on 355606.00

Queries using the index on DATE_MATCHED index are somewhat quick for any day other than '31-DEC-4712'. This behaviour doesn't really surprise me since I'll be fetching 355K rows basically a single block at a time.

select
count(*)
from activity_rtab
where date_matched = '31-DEC-4712';

COUNT(*)
----------
355606

Elapsed: 00:00:03.48

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=8 Card=1 Bytes=8)
1 0 SORT (AGGREGATE)
2 1 INDEX (RANGE SCAN) OF 'IX_ACTIVITY_DATE_MATCHED' (NON-UNIQUE) (Cost=29 Card=99 Bytes=792)
Statistics
----------------------------------------------------------
56 recursive calls
0 db block gets
1031 consistent gets
0 physical reads
0 redo size
381 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

I know what you're thinking, "OK, so what's your problem?". In order to do some maintenance, I did a CTAS and created a new table in the same tablespace. I put both indicies on the new table and ran a quick query.

select count(*) from jh$activity_rtab where date_matched = '31-DEC-4712';

COUNT(*)
----------
355606

Elapsed: 00:00:00.53

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=8 Card=1 Bytes=8)
1 0 SORT (AGGREGATE)
2 1 INDEX (RANGE SCAN) OF 'JH$DATE_MATCHED' (NON-UNIQUE) (Cost=29 Card=9867 Bytes=78936)

Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
949 consistent gets
0 physical reads
0 redo size
381 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed


Huh? Why is this returning in less than a second where the query on the real table takes 3? At first, I thought something must be wrong with the storage so I moved the table and rebuilt the indexes. Same thing.

Then I looked closer at the plan and noticed the query on JH$ACTIVITY_RTAB had "Card=9867" and the query against ACTIVITY_RTAB had "Card=99". Aha! A statistics problem. I calculated statistics on both tables using the following commands:

exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'activity_rtab',cascade=>true);


No dice. The query on JH$ACTIVITY_RTAB was still significantly faster than ACTIVITY_RTAB. Then I decided to take my own advice and ran a 10046 trace on both queries. The TKPROF confirmed my observations, but still offered no clues as to what the real problem was.

select count(*)
from
jh$activity_rtab where date_matched = '31-DEC-4712'


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 3 0.00 0.01 0 0 0 0
Execute 3 0.00 0.00 0 0 0 0
Fetch 6 0.92 0.91 0 2838 0 3
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 12 0.92 0.92 0 2838 0 3

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 42

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE
355606 INDEX RANGE SCAN JH$DATE_MATCHED (object id 43980)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 6 0.00 0.00
SQL*Net message from client 6 0.00 0.00
********************************************************************************

select count(*)
from
ACTIVITY_RTAB where date_matched = '31-DEC-4712'


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 2 0.01 0.00 0 0 0 0
Execute 2 0.00 0.00 0 0 0 0
Fetch 4 6.05 6.22 0 2030 0 2
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 8 6.06 6.23 0 2030 0 2

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 42

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE
355606 INDEX RANGE SCAN IX_ACTIVITY_DATE_MATCHED (object id 37979)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 4 0.00 0.00
SQL*Net message from client 4 0.00 0.00
********************************************************************************

At this point, I had a hunch that it was an optimizer problem but couldn't convince myself. I couldn't get around the fact that both tables had nearly exact characteristics but the queries performed radically different. Time for a 10053 trace on both tables:

Table stats Table: JH$ACTIVITY_RTAB Alias: JH$ACTIVITY_RTAB
TOTAL :: CDN: 7538301 NBLKS: 121877 AVG_ROW_LEN: 109
-- Index stats
INDEX NAME: JH$DATE_MATCHED COL#: 9
TOTAL :: LVLS: 2 #LB: 19996 #DK: 764 LB/K: 26 DB/K: 253 CLUF: 194025
INDEX NAME: JH$PK COL#: 1
TOTAL :: LVLS: 2 #LB: 16754 #DK: 7538301 LB/K: 1 DB/K: 1 CLUF: 908531
_OPTIMIZER_PERCENT_PARALLEL = 0
***************************************
SINGLE TABLE ACCESS PATH
Column: DATE_MATCH Col#: 9 Table: JH$ACTIVITY_RTAB Alias: JH$ACTIVITY_RTAB
NDV: 764 NULLS: 0 DENS: 1.3089e-03 LO: 2453031 HI: 3442447
NO HISTOGRAM: #BKT: 1 #VAL: 2
TABLE: JH$ACTIVITY_RTAB ORIG CDN: 7538301 ROUNDED CDN: 9867 CMPTD CDN: 9867
Access path: tsc Resc: 11723 Resp: 11723
Access path: index (iff)
Index: JH$DATE_MATCHED
TABLE: JH$ACTIVITY_RTAB
RSC_CPU: 0 RSC_IO: 1925
IX_SEL: 0.0000e+00 TB_SEL: 1.0000e+00
Access path: iff Resc: 1925 Resp: 1925
Access path: index (equal)
Index: JH$DATE_MATCHED
TABLE: JH$ACTIVITY_RTAB
RSC_CPU: 0 RSC_IO: 29
IX_SEL: 0.0000e+00 TB_SEL: 1.3089e-03
BEST_CST: 8.00 PATH: 4 Degree: 1
***************************************
OPTIMIZER STATISTICS AND COMPUTATIONS
***************************************
GENERAL PLANS
***********************
Join order[1]: JH$ACTIVITY_RTAB[JH$ACTIVITY_RTAB]#0
Best so far: TABLE#: 0 CST: 8 CDN: 9867 BYTES: 78936
Final - All Rows Plan:
JOIN ORDER: 1
CST: 8 CDN: 9867 RSC: 7 RSP: 7 BYTES: 78936
IO-RSC: 7 IO-RSP: 7 CPU-RSC: 1 CPU-RSP: 1
PLAN
Cost of plan: 8
Operation...........Object name.....Options.........Id...Pid..
SELECT STATEMENT 0
SORT AGGREGATE 1
INDEX JH$DATE_MATCHED RANGE SCAN 2 1
BASE STATISTICAL INFORMATION
***********************
Table stats Table: ACTIVITY_RTAB Alias: ACTIVITY_RTAB
TOTAL :: CDN: 7538301 NBLKS: 121875 AVG_ROW_LEN: 109
-- Index stats
INDEX NAME: ACTIVITY_PK COL#: 1
TOTAL :: LVLS: 2 #LB: 16748 #DK: 7538301 LB/K: 1 DB/K: 1 CLUF: 908561
INDEX NAME: IX_ACTIVITY_DATE_MATCHED COL#: 9
TOTAL :: LVLS: 2 #LB: 20112 #DK: 764 LB/K: 26 DB/K: 253 CLUF: 194021
_OPTIMIZER_PERCENT_PARALLEL = 0
***************************************
SINGLE TABLE ACCESS PATH
Column: DATE_MATCH Col#: 9 Table: ACTIVITY_RTAB Alias: ACTIVITY_RTAB
NDV: 764 NULLS: 0 DENS: 1.3089e-03 LO: 2453031 HI: 3442447
NO HISTOGRAM: #BKT: 1 #VAL: 2
TABLE: ACTIVITY_RTAB ORIG CDN: 7538301 ROUNDED CDN: 99 CMPTD CDN: 99
Access path: tsc Resc: 11723 Resp: 11723
Access path: index (iff)
Index: IX_ACTIVITY_DATE_MATCHED
TABLE: ACTIVITY_RTAB
RSC_CPU: 0 RSC_IO: 1936
IX_SEL: 0.0000e+00 TB_SEL: 1.0000e+00
Access path: iff Resc: 1936 Resp: 1936
Access path: index (equal)
Index: IX_ACTIVITY_DATE_MATCHED
TABLE: ACTIVITY_RTAB
RSC_CPU: 0 RSC_IO: 29
IX_SEL: 0.0000e+00 TB_SEL: 1.3089e-03
BEST_CST: 8.00 PATH: 4 Degree: 1
***************************************
OPTIMIZER STATISTICS AND COMPUTATIONS
***************************************
GENERAL PLANS
***********************
Join order[1]: ACTIVITY_RTAB[ACTIVITY_RTAB]#0
Best so far: TABLE#: 0 CST: 8 CDN: 99 BYTES: 792
Final - All Rows Plan:
JOIN ORDER: 1
CST: 8 CDN: 99 RSC: 7 RSP: 7 BYTES: 792
IO-RSC: 7 IO-RSP: 7 CPU-RSC: 1 CPU-RSP: 1

The queries are basically the same except for that darn Cardinality (CDN). I pulled out Jonathan Lewis' latest book and went through the calculations but still don't see how the CDN of 99 is being computed. I played around with histograms and different bucket sizes, but always came back to the simple premise of "If performance is fine with JH$ACTIVITY_RTAB, why does it suck with ACTIVITY_RTAB"?

I just don't understand it. For kicks, I tried the query with a +RULE hint and it returned in .48 seconds.

SQL> select /*+ rule */ count(*) from ACTIVITY_RTAB where date_matched = '31-DEC-4712';

COUNT(*)
----------
355606

Elapsed: 00:00:00.48

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: RULE
1 0 SORT (AGGREGATE)
2 1 INDEX (RANGE SCAN) OF 'IX_ACTIVITY_DATE_MATCHED' (NON-UNIQUE)


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1015 consistent gets
0 physical reads
0 redo size
381 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

I'm really perplexed at this point. Sure, I have two workarounds; use the +RULE hint or drop the table and recreate it. I'm not a big fan of hinting since hinted queries seem to be the first queries that "go bad" when you upgrade. Dropping the table seems reasonable, but since I don't know how I got into this situation, I'm not sure how to detect it. Sort of like choosing between a root canal and a colonoscopy.

An update is here.

3 comments:

Anonymous said...

Just some idle curiosity - did you run the num_distinct query on p. 42 of Jonathan's book for your two tables?

Anonymous said...

Jeff,
I'll take a bet that your production system has a constraint on that column:
date_matched = trunc(date_matched).

This is extracted by the optimizer and used as a predicate in the query, and (incorrectly) introduces a 1% factor to the selectivity (hence 9876 -> 99).

BUT - the predicate gets checked for every row, and checking date = trunc(date) 355,000 times is probably going to account for 5 seconds of CPU.

Warning - always look at the dbms_xplan output (or at least the filter_predicates and access_predicates) when investigating execution plan problems.

The bit that really puzzles me is that the range scan line has a cost of 29, but the query has a cost of 8. I'm still thinking about that - but did you see anything in the trace file that says 'prefetch enabled'.

Jeff Hunter said...

I'm glad I don't bet with you. Indeed, it was a constraint on date_matched that was causing this slowdown:
CONSTRAINT_NAME C SEARCH_CONDITION
------------------------------ - ------------------------------------------------------------
ACTIVITY_DATE_MATCHED C date_matched = TRUNC( date_matched )

SQL> select count(*) from activity_rtab where date_matched = '31-dec-4712';

COUNT(*)
----------
370319

Elapsed: 00:00:01.02
SQL> select count(*) from activity_rtab where date_matched = '31-dec-4712';

COUNT(*)
----------
370319

Elapsed: 00:00:01.01
SQL> alter table activity_rtab drop constraint ACTIVITY_DATE_MATCHED;

Table altered.

Elapsed: 00:00:00.37
SQL> select count(*) from activity_rtab where date_matched = '31-dec-4712';

COUNT(*)
----------
370319

Elapsed: 00:00:00.08
SQL> select count(*) from activity_rtab where date_matched = '31-dec-4712';

COUNT(*)
----------
370319

Elapsed: 00:00:00.07

Thanks for all your help!! You're a real life-saver.