Michael,

Thanks for the explanation.
I tried to expand the code flow in below and help the discussion..

TimerInterruptHandler()
    gBS->RaiseTPL (HIGH)
    gBS->RestoreTPL (APPLICATION) // expand in below.
        For Tpl = {NOTIFY, CALLBACK}:          ---- [for-loop]
            if PendingBit is not set:
                continue
            gCurrentTpl = Tpl
            EnableInterrupt()                  ---- [1]
            CoreDispatchEventNotifies(Tpl) // expand in below.
                gBS->RaiseTPL (HIGH)
                gBS->RestoreTPL (Tpl)
                NotifyFunction()               ---- [2]
                gBS->RaiseTPL (HIGH)
                gBS->RestoreTPL (Tpl)
        End-For
        gCurrentTpl = APPLICATION              ---- [3]
        EnableInterrupt()                      ---- [4]
IRET


  1.  Agree that the stack overflow could happen in real platform.
The interrupt-enabled env when CPU runs gBS->RestoreTPL(APPLICATION)  could be 3 cases: When gCurrentTpl is NOTIFY, CALLBACK or APPLICATION.
Let's name them as "env:NOTIFY", "env:CALLBACK" and "env:APPLICATION".
CPU enters "env:NOTIFY" and "env:CALLBACK" in [1].
CPU enters "env:APPLICATION" in [4], or [3] when the interrupt is already enabled in the [for-loop].

When interrupt happens in "env:NOTIFY", the inner interrupt handler calls gBS->RestoreTPL(NOTIFY). The interrupt-enabled env in the inner RestoreTPL(NOTIFY) is "env:NOTIFY" only.
When interrupt happens in "env:CALLBACK", the inner interrupt handler calls gBS->RestoreTPL(CALLBACK). The interrupt-enabled env in the inner RestoreTPL(CALLBACK) can be: "env:NOTIFY" and "env:CALLBACK".

So, the interrupt re-entrance we want to avoid is "env:NOTIFY"  -> "env:NOTIFY", or "env:CALLBACK" -> "env:CALLBACK", or "env:APPLICATION" -> "env:APPLICATION". Because it's endless.
NestedTplInterruptLib was written to avoid it.

However, it's ok for: "env:APPLICATION" -> "env:CALLBACK" -> "env:NOTIFY", or "env:CALLBACK" -> "env:NOTIFY".

In real platform, it's possible that interrupt happens just after [4], and in worst case, the RestoreTpl() call in the inner timer interrupt handler is interrupted after [4] again, and again.


  1.  Some questions on NestedInterruptTplLib.

  1.  Can we remove DisableInterruptsOnIret()? That means the inner interrupt handler would returns to the outer world with interrupt enabled and TPL==HIGH. But I don't see any issue with that.
  2.  If DxeCore can be changed, do you have an easier-to-understand solution? It really took me 2 days to understand why NestedInterruptTplLib is written in today's way.


thanks,
ray
________________________________
From: Michael Brown <mcb30@ipxe.org>
Sent: Wednesday, January 17, 2024 6:46:59 PM
To: devel@edk2.groups.io <devel@edk2.groups.io>; Ni, Ray <ray.ni@intel.com>; Laszlo Ersek <lersek@redhat.com>; kraxel@redhat.com <kraxel@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>; Kinney, Michael D <michael.d.kinney@intel.com>; Desimone, Nathaniel L <nathaniel.l.desimone@intel.com>; Kumar, Rahul R <rahul.r.kumar@intel.com>
Subject: Re: [edk2-devel] [PATCH 1/6] UefiCpuPkg/LocalApicTimerDxe: Duplicate OvmfPkg/LocalApicTimerDxe driver

On 17/01/2024 07:11, Ni, Ray wrote:
> The above flow shows endless re-entrance of timer interrupt handler.
>
> But, my question is: above flow only can happen in real platform when the below 4 steps occupies more time than the timer period (usually 10ms).
>              [Timer Interrupt #2]1. RaiseTPL (HIGH) from NOTIFY causing CPU interrupt be disabled.
>              [Timer Interrupt #2]2. Send APIC EOI (ACK the interrupt received so APIC can continue generate interrupts)
>              [Timer Interrupt #2]3. Call DxeCore::CoreTimerTick()
>              [Timer Interrupt #2]4. RestoreTPL (NOTIFY) from HIGH. No callback runs as no callback can be registered at TPL > NOTIFY. In the end of RestoreTPL(), CPU interrupt is enabled.
>
> But, in my opinion, it's impossible.

As is thoroughly documented in NestedInterruptRestoreTpl(), the
potential for unbounded stack consumption arises when an interrupt
occurs after the point that RestoreTPL() completes dispatching all
notifications but before the IRET (or equivalent) instruction pops the
original stack frame.

Since dispatching notifications can take an unbounded amount of time,
there is absolutely no guarantee that this will be less than 10ms after
the previous interrupt.  It could easily be 30 seconds later.

The problematic flow is a subtle variation on what you described:

   [IRQ#1] timer interrupt at TPL_APPLICATION
     [ISR#1] RaiseTPL from TPL_APPLICATION -> TPL_HIGH_LEVEL
     [ISR#1] Send APIC EOI
     [ISR#1] Call CoreTimerTick()
     [ISR#1] RestoreTPL from TPL_HIGH_LEVEL -> TPL_APPLICATION
       [ISR#1] Callbacks for TPL_NOTIFY are run
       [ISR#1] Callbacks for TPL_CALLBACK are run
       ... these may take several *seconds* to complete, during
           which further interrupts are raised, the details of
           which are not shown here...
       [ISR#1] TPL is now restored to TPL_APPLICATION
       [IRQ#N] timer interrupt at TPL_APPLICATION
         [ISR#N] RaiseTPL from TPL_APPLICATION -> TPL_HIGH_LEVEL
         ... continues ...

The root cause is that the ISR reaches a state in which:

   a) an arbitrary amount of time has elapsed since the triggering
      interrupt (due to unknown callbacks being invoked, which may
      themselves wait for further timer interrupts), and

   b) the TPL has been fully restored back to the TPL at the point
      the triggering interrupt occurred (i.e. TPL_APPLICATION in
      this example), and

   c) the timer interrupt source is enabled, and

   d) CPU interrupts are enabled

At this point, there is nothing preventing another interrupt from
occurring.  It will occur at TPL_APPLICATION and it will be one stack
frame deeper than the previous interrupt at TPL_APPLICATION.

Rinse and repeat, and you have unbounded stack consumption.

Hence the requirement for NestedInterruptTplLib, even on physical hardware.

Michael


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114045): https://edk2.groups.io/g/devel/message/114045
Mute This Topic: https://groups.io/mt/103734961/7686176
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/leave/12367111/7686176/1913456212/xyzzy [rebecca@openfw.io]
-=-=-=-=-=-=-=-=-=-=-=-