2020-04-21

log file sync by Tanel Poder

Since “log file sync” is not directly an I/O wait event (the I/O happens indirectly under the hood), but rather a “wait for LGWR to acknowledge/advance the redo persisting position”, there may be multiple reasons for the delays. I’m repeating some of what others have said here too:

  • LGWR could be waiting for a remote ACK (synchronous Data Guard) - that should show up as an LNS* wait or something like that though
  • Process slowness due to CPU overload/scheduling latency or swapping - in both cases LGWR may do its work fast, but it takes a while for the foreground process to wake up and end the wait event (you could check the alert.log for any reports about switching the LGWR mode around that time)
  • Bug/missed post due to switching between LGWR progress polling vs. waiting for LGWR post
  • ASM instance communication issues (ASM communication can be on the critical path when creating/changing files on ASM and even for reading, when the ASM metadata cache in your local instance doesn’t have what it needs)
  • So, the physical redo log file disk write is not necessarily the biggest contributor to the log file sync wait
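
A quick sanity check for that last point is to compare the latency histograms of “log file sync” vs “log file parallel write”. A minimal sketch using V$EVENT_HISTOGRAM (note the counts are cumulative since instance startup, so ideally compare two snapshots taken around the problem time):

    SELECT event, wait_time_milli, wait_count
    FROM   v$event_histogram
    WHERE  event IN ('log file sync', 'log file parallel write')
    ORDER  BY event, wait_time_milli;

If the parallel write buckets stay low while log file sync shows a long tail, the extra time is going somewhere other than the redo log disk write.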

You’ll need to measure what LGWR itself was doing during the problem time

  • Since you’re on 19c (12c or later), LGWR has slave processes too and they may be doing most of the work
  • You’ll need to look into both the LGWR process and any other LG* ones too
  • Since you already have (Oracle) Snapper running and writing to trace, you should have the data
  • The lgwr keyword in the @snapper ash,stats,trace 10 999999 lgwr syntax includes the LGWR slaves too
  • As LGWR slaves have their own sessions (and SIDs), you should run ashtop etc. and check whether there’s any BLOCKING_SESSION value listed for LGWR and its slaves (see the example after this list)
  • If LGWR and the slaves were listed idle during the problem time (while others waited for log file sync event at the same time) and if there was no huge OS level CPU/memory shortage at the time, this looks like a background <-> foreground communication problem.
  • This (and any potential ASM issues) could probably be troubleshot via KST tracing, KSR channels and X$TRACE - some of it is enabled by default, like I showed in a previous hacking session: https://www.youtube.com/watch?v=mkmvZv58W6w
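
For example (a sketch only - assuming the usual ashtop.sql parameter order of group-by columns, filter predicate, from-time and to-time, and that the LG* background processes show up with a “(LG...)” suffix in the program column; adjust the time range to your problem window):

    -- find LGWR and its LG* slave SIDs (join session to process via paddr)
    SELECT s.sid, s.serial#, p.pname
    FROM   v$session s, v$process p
    WHERE  s.paddr = p.addr
    AND    p.pname LIKE 'LG%';

    -- top ASH activity and any blockers for those processes in the last hour
    @ashtop session_id,program,event,blocking_session "program like '%(LG%'" sysdate-1/24 sysdate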

Snapper

Paste the full Snapper output (of one time snapshot from the problem time) here if you can - we’ll be able to see what the slaves were doing

  • You’d need to include not only the LGWR SID but also any LG* slave SIDs
  • Also we’ll see better LGWR delay metrics from V$SESSTAT that go beyond just wait events
  • Thinking of metrics like these - snapper already handles them (if you look into the rightmost column)
  • redo synch time overhead (usec) - “FG wakeup overhead per log file sync”
  • redo write time - accounts for the end-to-end redo write time (not just the log file parallel write syscall), reported “per redo write”
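
If you want to eyeball just these two counters outside Snapper, a minimal sketch against V$SYSSTAT (the values are cumulative since instance startup, so you’d take two samples and diff them - which is what Snapper already does for you):

    SELECT name, value
    FROM   v$sysstat
    WHERE  name IN ('redo synch time overhead (usec)', 'redo write time');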

ASH wait chains

Regarding the ASH wait chains error on 19c - I’m aware of this but forgot to look into it

  • It has (probably) something to do with GV$ views (or PX slaves used for PDB V$ queries) and possibly the connect by I’m using
  • Does it work in the CDB (if multitenant)?
  • Does it work with lower optimizer features enabled? (see the sketch after this list)
  • As we are already manually looking into what LGWR and its slaves are doing, there’s no need for the wait chains script here
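
If you do want to dig into the script error later, a quick test of the lower-optimizer-features question is to change it just for your session before rerunning the script (the version value below is an arbitrary example):

    ALTER SESSION SET optimizer_features_enable = '11.2.0.4';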

Linux Process Snapper - pSnapper

  • If you want to see the full “wait” states of Linux processes/threads, use the -a option.
  • It’s somewhat like always showing “idle” threads too
  • But for Linux apps it’s harder to determine what’s idle and what’s some network RPC wait, so I currently just do what the Linux “load” metric does
  • It just shows threads in Runnable and Disk/Uninterruptible sleep states by default (-a shows all states, including Sleep and kernel-idle threads etc)
  • So the Linux pSnapper “idle” means a somewhat different thing than Oracle “idle” - say LGWR was waiting for a Data Guard ACK: that would not be shown as idle in Oracle, but at the Linux process level it’s “just some network socket wait” and you’d need the -a option to show all of those
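
For example, something like the following (a sketch - if I recall the options right, -G adds extra sample columns such as the current syscall and kernel wait channel; you’d then look at the ora_lg* lines in the output):

    sudo psn -a -G syscall,wchan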

Were LGWR/LG* processes mostly idle during the problem time?

  • As I kept reading the thread, it looks like ASH reported LGWR being mostly idle during the problem time
  • This is not exactly the same pattern, but I’ve seen cases in the past where everyone’s waiting for LGWR and LGWR doesn’t seem to be busy and doesn’t show any significant waits
  • It turned out that LGWR was actually waiting for CKPT to make progress - not waiting synchronously, but just going to sleep and “checking again later” in a loop
  • I wrote about this here: https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
  • So it may still make sense to check what LGWR/LG* processes are doing even if they don’t seem to be very active (e.g. do you see an occasional controlfile read wait by LGWR etc)
  • SQL*Trace would capture all activity of LGWR, but may slow things down further, so you could also sample & spool V$SESSION_WAIT_HISTORY, which keeps a log of the last 10 waits of every session in memory (see the example query below)
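
A minimal sketch of that sampling (run and spool it repeatedly during the problem time; the subquery just picks up LGWR and its LG* slaves, and WAIT_TIME is in centiseconds):

    SELECT sid, seq#, event, wait_time, wait_count
    FROM   v$session_wait_history
    WHERE  sid IN (SELECT s.sid
                   FROM   v$session s, v$process p
                   WHERE  s.paddr = p.addr
                   AND    p.pname LIKE 'LG%')
    ORDER  BY sid, seq#;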