A few basic notes on Oracle Database Administration.

Thursday, September 25, 2008

A lazy log writer

I've been laughing because I live 1 block from Moscone Center. It was closer for me to walk between the conference and my house than it was to walk between some of the sessions. Today, I saw the other side of that coin. I got ordered back to work and missed the last day of the conference, (and Chen Shapira's presentation!!!)

What was going on at work? Not much - as in not what should have been. I like looking at the performance monitor on Enterprise Manager for a quick glance at what is going on. It wasn't a pretty picture.



Clicking on the 'Blocking Sessions' tab I saw that that the log writer session was blocking various other sessions.

I went into the alert log and was pointed to a log writer trace file. Inside the trace file I found

*** 2008-09-25 15:28:24.239
Warning: log write time 15460ms, size 6999KB

*** 2008-09-25 15:28:24.836
Warning: log write time 590ms, size 6898KB

*** 2008-09-25 15:28:29.852
Warning: log write time 5020ms, size 6999KB

I looked at metalink and got

" The above warning messages has been introduced in 10.2.0.4 patchset. This warning message will be generated only if the log write time is more than 500 ms and it will be written to the lgwr trace file .

"These messages are very much expected in 10.2.0.4 database in case the log write is more than 500 ms. This is a warning which means that the write process is not as fast as it intented to be . So probably you need to check if the disk is slow or not or for any potential OS causes. "

We just upgraded to 10.2.0.4. Our storage is across the great divide at another company. We are often short of answers other than, "Everything is configured correctly". With quite a bit of work we have gotten a pair of LUNS allocated for redo logs.

As a test, I moved the redo logs from the SAN to a NSF drive - NOT one that should be allocated to redo. Here was the immediate result:



The log writer waits stopped. Compliments from management. A request from management to storage management to move the entire database to this kind of storage, everyone is happy

almost.

Tonight I looked at the trace file:

*** 2008-09-25 22:53:34.154
Warning: log write time 750ms, size 0KB
*** 2008-09-25 22:53:35.943
Warning: log write time 1770ms, size 28KB
*** 2008-09-25 22:53:39.889
Warning: log write time 940ms, size 0KB

Log writer is taking forever, and it isn't even doing anything!

To be continued..... and detective suggestions welcome!








9 comments:

Fitaloon said...

Could be something as simple as write cache on the disk having some sort of problem. I've known slow writes because of failed backup batteries for the write cache.

Anonymous said...

Pleasae expand on 'NSF drive' did you mean NAS or some other format?

girlgeek said...

Hi anonymous -
We moved the log files to a randomly selected 'other' file, just to see if the specific SAN drive was the problem. I suspect that the drive selected was a NAS - disk array. Since it didn't improve the issue at all, I have now moved the files back to the SAN where they belong. Either way, write time is abismal.
Thank you for your comment. I sure don't know what is wrong.
-Claudia

Hi Fitaloon,
I have passed on your suggestion of a place to look to my managers. It sure appears that the problem is someplace other than the disks themselves. One of my coworkers asked was wondering if the 50K SAN is connected to the 50K Unix machine by a 5 cent scsi cable. (humor) Now my problem is to get across the wall from database to hardware to have someone check out your very good suggestion. That may prove to be the hardest problem of all.
Thanks for your comment. I appreciate ideas.
-Claudia

Anonymous said...

Hi GirlGeek/Claudia, Did you ever get your lazy lgwr to work faster. I'm running into the same problems, however I know what the problem is it's is this thing called NetApps... junk... We have proved over and over again this thing can't hand the loads we toss at it. We have identified more bugs for NetApps to fix and etc. The latest was a fix in 7.3 ontap to be more aggressive with the aging process of the filer cache. anyway log story is, we have 212 second writes of 2kbs to the nvram of the filer and yes it's not all the filer but getting some of these other folks to step up for the storage and network is difficult for various reasons. so your not alone...

thanks
-dennis.

Helga said...

I think that you may be interested in another application that quickly eliminates data corruption issues in database files, please take a look at restore sql mdf tool and let me know what do you think

stephen said...

good morning, Steve here, I am having the same issue with most of my databases, all running Netapp over NFS. Our systems people do not see any problem on the Netapp side - but they may not know what to look for - the IOPS are not high, and neither are the CPU loads on the filers.

So, we don't know what else to check. Netapp support is being
unhelpful - they went so far as suggest that the Oracle log writer times are not correct. hmmm.

Wondering if you found anything...

Thanks, Steve

stephen said...

good morning, Steve here, I am having the same issue with most of my databases, all running Netapp over NFS. Our systems people do not see any problem on the Netapp side - but they may not know what to look for - the IOPS are not high, and neither are the CPU loads on the filers.

So, we don't know what else to check. Netapp support is being
unhelpful - they went so far as suggest that the Oracle log writer times are not correct. hmmm.

Wondering if you found anything...

Thanks, Steve

Lori said...

We are having the same issue. We just migrated 20+ database, a 5 node rac cluster etc. to Linux x8664 rhel5, Oracle 11Gr1. It is a mess. Some days all is well - still lgwr waits of 500ms, 600ms, 1000ms about every 5-10 min, but what kills us is the 47000ms waits that go on for about 10 min and the entire system stops. Did anybody find any solution? We have oracle, the hardware vendors everyone we can think of engaged. I'm sure the problem isn't oracle...but this is a tough one.

sap ecc 7.0 said...

Very helpful post you have provided here. I have not faced this problem so far but after knowing the problem and its recommended solution I will remember this point and take care of it.