public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
* [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.
@ 2023-04-28  9:10 Gerd Hoffmann
  2023-04-28 11:26 ` Michael Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Gerd Hoffmann @ 2023-04-28  9:10 UTC (permalink / raw)
  To: devel
  Cc: Oliver Steffen, Jiewen Yao, Ard Biesheuvel, Pawel Polawski,
	Gerd Hoffmann, Jordan Justen, Michael Brown, Laszlo Ersek

OVMF can't guarantee that the ASSERT() doesn't happen.  Misbehaving
EFI applications can trigger this.  So log a warning instead and try
to continue.

Reproducer: Fetch windows 11 22H2 iso image, boot it in qemu with OVMF.

Traced to BootServices->Stall() being called with IPL=TPL_HIGH_LEVEL
and Interrupts /enabled/ while windows is booting.

Cc: Michael Brown <mcb30@ipxe.org>
Cc: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 OvmfPkg/Library/NestedInterruptTplLib/Tpl.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c b/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
index e19d98878eb7..cd65a5c995a0 100644
--- a/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
+++ b/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
@@ -39,7 +39,15 @@ NestedInterruptRaiseTPL (
   //
   ASSERT (GetInterruptState () == FALSE);
   InterruptedTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL);
-  ASSERT (InterruptedTPL < TPL_HIGH_LEVEL);
+  if (InterruptedTPL >= TPL_HIGH_LEVEL) {
+    DEBUG ((
+      DEBUG_WARN,
+      "%a: Called at IPL %d, trying to fixup and continue...\n",
+      __func__,
+      InterruptedTPL
+      ));
+    InterruptedTPL = TPL_HIGH_LEVEL - 1;
+  }
 
   return InterruptedTPL;
 }
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.
  2023-04-28  9:10 [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged Gerd Hoffmann
@ 2023-04-28 11:26 ` Michael Brown
  2023-04-28 13:38   ` Gerd Hoffmann
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Brown @ 2023-04-28 11:26 UTC (permalink / raw)
  To: Gerd Hoffmann, devel
  Cc: Oliver Steffen, Jiewen Yao, Ard Biesheuvel, Pawel Polawski,
	Jordan Justen, Laszlo Ersek

On 28/04/2023 10:10, Gerd Hoffmann wrote:
> OVMF can't guarantee that the ASSERT() doesn't happen.  Misbehaving
> EFI applications can trigger this.  So log a warning instead and try
> to continue.
> 
> Reproducer: Fetch windows 11 22H2 iso image, boot it in qemu with OVMF.
> 
> Traced to BootServices->Stall() being called with IPL=TPL_HIGH_LEVEL
> and Interrupts /enabled/ while windows is booting.

Do we know how the system ended up in a state of being at TPL_HIGH_LEVEL 
with interrupts enabled?  This ought not to be possible: the code in 
EDK2 will (as far as I can tell) always maintain the invariant that 
interrupts are disabled while at TPL_HIGH_LEVEL.

> --- a/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
> +++ b/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
> @@ -39,7 +39,15 @@ NestedInterruptRaiseTPL (
>     //
>     ASSERT (GetInterruptState () == FALSE);
>     InterruptedTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL);
> -  ASSERT (InterruptedTPL < TPL_HIGH_LEVEL);
> +  if (InterruptedTPL >= TPL_HIGH_LEVEL) {
> +    DEBUG ((
> +      DEBUG_WARN,
> +      "%a: Called at IPL %d, trying to fixup and continue...\n",
> +      __func__,
> +      InterruptedTPL
> +      ));
> +    InterruptedTPL = TPL_HIGH_LEVEL - 1;
> +  }

This workaround looks wrong to me: it will cause the subsequent call to 
NestedInterruptRestoreTPL() to drop the TPL back down to 
TPL_HIGH_LEVEL-1, which is lower than it would have been when the 
interrupt occurred.  The completed hardware interrupt would then result 
in an overall change of TPL, which cannot be correct.

Thanks,

Michael


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.
  2023-04-28 11:26 ` Michael Brown
@ 2023-04-28 13:38   ` Gerd Hoffmann
  2023-04-28 14:50     ` [edk2-devel] " Michael Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Gerd Hoffmann @ 2023-04-28 13:38 UTC (permalink / raw)
  To: Michael Brown
  Cc: devel, Oliver Steffen, Jiewen Yao, Ard Biesheuvel, Pawel Polawski,
	Jordan Justen, Laszlo Ersek

On Fri, Apr 28, 2023 at 11:26:32AM +0000, Michael Brown wrote:
> On 28/04/2023 10:10, Gerd Hoffmann wrote:
> > OVMF can't guarantee that the ASSERT() doesn't happen.  Misbehaving
> > EFI applications can trigger this.  So log a warning instead and try
> > to continue.
> > 
> > Reproducer: Fetch windows 11 22H2 iso image, boot it in qemu with OVMF.
> > 
> > Traced to BootServices->Stall() being called with IPL=TPL_HIGH_LEVEL
> > and Interrupts /enabled/ while windows is booting.
> 
> Do we know how the system ended up in a state of being at TPL_HIGH_LEVEL
> with interrupts enabled?  This ought not to be possible: the code in EDK2
> will (as far as I can tell) always maintain the invariant that interrupts
> are disabled while at TPL_HIGH_LEVEL.

Not clear.  Have not found anything in edk2 either.  With a breakpoint
at BootServices->Stall (CoreStall function) I can see it being called
with IPL=TPL_HIGH_LEVEL, most likely from windows boot code, stack
traces point to locations in low memory where no edk2 code is located.

After running for a while (and probably loading files from cdrom to
memory) these calls suddenly start happen with interrupts enabled.

I suspect the windows boot loader does something fishy here, but I can't
proof it, I have not yet pinned down the exact location where interrupts
get enabled while running at IPL=TPL_HIGH_LEVEL (which is what I suspect
is happening, but of course this is not the only possible theory how
that ASSERT got triggered).

Not fully sure how to best continue debugging this, I don't think gdb
can set an watchpoint on eflags.if ...

> > --- a/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
> > +++ b/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
> > @@ -39,7 +39,15 @@ NestedInterruptRaiseTPL (
> >     //
> >     ASSERT (GetInterruptState () == FALSE);
> >     InterruptedTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL);
> > -  ASSERT (InterruptedTPL < TPL_HIGH_LEVEL);
> > +  if (InterruptedTPL >= TPL_HIGH_LEVEL) {
> > +    DEBUG ((
> > +      DEBUG_WARN,
> > +      "%a: Called at IPL %d, trying to fixup and continue...\n",
> > +      __func__,
> > +      InterruptedTPL
> > +      ));
> > +    InterruptedTPL = TPL_HIGH_LEVEL - 1;
> > +  }
> 
> This workaround looks wrong to me: it will cause the subsequent call to
> NestedInterruptRestoreTPL() to drop the TPL back down to TPL_HIGH_LEVEL-1,
> which is lower than it would have been when the interrupt occurred.  The
> completed hardware interrupt would then result in an overall change of TPL,
> which cannot be correct.

The idea was to set InterruptedTPL to the highest level which is allowed
to run with interrupts enabled and continue running with interrupts
enabled, hoping that things get back into normal once the next timer
interrupt arrives.

Which -- now that I'm thinking about it again -- is clearly bogus, we
are in the timer interrupt so IRQs have already been disabled at that
point, so the fixup idea doesn't make much sense.

So just leave InterruptedTPL alone and hope for the best?

take care,
  Gerd


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [edk2-devel] [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.
  2023-04-28 13:38   ` Gerd Hoffmann
@ 2023-04-28 14:50     ` Michael Brown
  2023-05-03  6:27       ` Gerd Hoffmann
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Brown @ 2023-04-28 14:50 UTC (permalink / raw)
  To: devel, kraxel
  Cc: Oliver Steffen, Jiewen Yao, Ard Biesheuvel, Pawel Polawski,
	Jordan Justen, Laszlo Ersek

On 28/04/2023 14:38, Gerd Hoffmann wrote:
> I suspect the windows boot loader does something fishy here, but I can't
> proof it, I have not yet pinned down the exact location where interrupts
> get enabled while running at IPL=TPL_HIGH_LEVEL (which is what I suspect
> is happening, but of course this is not the only possible theory how
> that ASSERT got triggered).
> 
> Not fully sure how to best continue debugging this, I don't think gdb
> can set an watchpoint on eflags.if ...

In the absence of any better ideas, I'd be tempted to run QEMU via TCG 
instead of KVM, and hack the STI instruction definition to check the 
location in guest memory where gEfiCurrentTpl gets stored in your test 
build.  (It's brute force debugging, but it should find the culprit.)

>> This workaround looks wrong to me: it will cause the subsequent call to
>> NestedInterruptRestoreTPL() to drop the TPL back down to TPL_HIGH_LEVEL-1,
>> which is lower than it would have been when the interrupt occurred.  The
>> completed hardware interrupt would then result in an overall change of TPL,
>> which cannot be correct.
> 
> The idea was to set InterruptedTPL to the highest level which is allowed
> to run with interrupts enabled and continue running with interrupts
> enabled, hoping that things get back into normal once the next timer
> interrupt arrives.
> 
> Which -- now that I'm thinking about it again -- is clearly bogus, we
> are in the timer interrupt so IRQs have already been disabled at that
> point, so the fixup idea doesn't make much sense.
> 
> So just leave InterruptedTPL alone and hope for the best?

I think so.  CoreRestoreTpl(TPL_HIGH_LEVEL) should be a no-op, so it 
would be reasonable to expect that 
NestedInterruptRestoreTPL(TPL_HIGH_LEVEL, ...) should also be a no-op.

 From a quick check of the implementation, I'm pretty sure this will be 
the case.  However, the logic is already (necessarily) pretty complex 
and so I am not 100% sure of this.  The reasoning behind the logic in 
NestedInterruptTplLib relies on certain axioms and invariants (e.g. that 
there are a finite number of distinct TPLs, and that there are certain 
places within the code that further interrupts provably cannot occur), 
and so it gets very difficult to reason about when one of those 
invariants is violated, as it seems to be in this situation.

If it seems to work, then I think it would be plausible to replace the

   ASSERT (InterruptedTPL < TPL_HIGH_LEVEL);

with a warning message, and to return with InterruptedTPL unmodified. 
But I'd prefer to understand how this invariant violation arises in the 
first place.

Good luck!

Michael


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [edk2-devel] [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.
  2023-04-28 14:50     ` [edk2-devel] " Michael Brown
@ 2023-05-03  6:27       ` Gerd Hoffmann
  0 siblings, 0 replies; 5+ messages in thread
From: Gerd Hoffmann @ 2023-05-03  6:27 UTC (permalink / raw)
  To: Michael Brown
  Cc: devel, Oliver Steffen, Jiewen Yao, Ard Biesheuvel, Pawel Polawski,
	Jordan Justen, Laszlo Ersek

On Fri, Apr 28, 2023 at 02:50:04PM +0000, Michael Brown wrote:
> On 28/04/2023 14:38, Gerd Hoffmann wrote:
> > I suspect the windows boot loader does something fishy here, but I can't
> > proof it, I have not yet pinned down the exact location where interrupts
> > get enabled while running at IPL=TPL_HIGH_LEVEL (which is what I suspect
> > is happening, but of course this is not the only possible theory how
> > that ASSERT got triggered).
> > 
> > Not fully sure how to best continue debugging this, I don't think gdb
> > can set an watchpoint on eflags.if ...
> 
> In the absence of any better ideas, I'd be tempted to run QEMU via TCG
> instead of KVM, and hack the STI instruction definition to check the
> location in guest memory where gEfiCurrentTpl gets stored in your test
> build.  (It's brute force debugging, but it should find the culprit.)

Not so easy as tcg generates code for cli+sti, so there isn't a function
I could add logging hacks to.

> From a quick check of the implementation, I'm pretty sure this will be the
> case.  However, the logic is already (necessarily) pretty complex and so I
> am not 100% sure of this.  The reasoning behind the logic in
> NestedInterruptTplLib relies on certain axioms and invariants (e.g. that
> there are a finite number of distinct TPLs, and that there are certain
> places within the code that further interrupts provably cannot occur), and
> so it gets very difficult to reason about when one of those invariants is
> violated, as it seems to be in this situation.
> 
> If it seems to work,

It seems to work, and the warning is printed exactly once during boot.

> with a warning message, and to return with InterruptedTPL unmodified. But
> I'd prefer to understand how this invariant violation arises in the first
> place.

Root cause not found yet, but all debugging so far didn't found any
problems in edk2/ovmf, so my theory still is that windows does something
fishy here.

take care,
  Gerd


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-05-03  6:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-28  9:10 [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged Gerd Hoffmann
2023-04-28 11:26 ` Michael Brown
2023-04-28 13:38   ` Gerd Hoffmann
2023-04-28 14:50     ` [edk2-devel] " Michael Brown
2023-05-03  6:27       ` Gerd Hoffmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox