From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
 by mx.groups.io with SMTP id smtpd.web10.19984.1682689102652238732
 for <devel@edk2.groups.io>;
 Fri, 28 Apr 2023 06:38:23 -0700
Authentication-Results: mx.groups.io;
 dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=SQxj88zY;
 spf=pass (domain: redhat.com, ip: 170.10.129.124, mailfrom: kraxel@redhat.com)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1682689101;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=g1jBybuVQn0dteDoP4Z+rOywzQr/VTwxE196pedxZAA=;
	b=SQxj88zY+jb795b1JWCdi54nC2jYhxamT+L4e0B+OWZF/TIyhNQJ93Cl889qvKu1a2HMv2
	dztJqyzO4lZdpiZlIiEaBIbDUOeRlgLxciQ6yupNO2Uh5cz6/bk9A/7o//seuTdYYQrjyh
	vaX3Q6ZPSDTC6C1dPfuzS8rV0Zm8cCc=
Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com
 [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-669-6ywQCza1N1iU57a7bXNcoA-1; Fri, 28 Apr 2023 09:38:18 -0400
X-MC-Unique: 6ywQCza1N1iU57a7bXNcoA-1
Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 24F70381D4D6;
	Fri, 28 Apr 2023 13:38:18 +0000 (UTC)
Received: from sirius.home.kraxel.org (unknown [10.39.195.158])
	by smtp.corp.redhat.com (Postfix) with ESMTPS id B9FBA492C13;
	Fri, 28 Apr 2023 13:38:04 +0000 (UTC)
Received: by sirius.home.kraxel.org (Postfix, from userid 1000)
	id 12BB61800082; Fri, 28 Apr 2023 15:38:03 +0200 (CEST)
Date: Fri, 28 Apr 2023 15:38:03 +0200
From: "Gerd Hoffmann" <kraxel@redhat.com>
To: Michael Brown <mcb30@ipxe.org>
Cc: devel@edk2.groups.io, Oliver Steffen <osteffen@redhat.com>, 
	Jiewen Yao <jiewen.yao@intel.com>, Ard Biesheuvel <ardb+tianocore@kernel.org>, 
	Pawel Polawski <ppolawsk@redhat.com>, Jordan Justen <jordan.l.justen@intel.com>, 
	Laszlo Ersek <lersek@redhat.com>
Subject: Re: [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.
Message-ID: <em2rxaumz2dymkq6m44g273gywt4jtx2obuaiogqbkidcfotbv@xs2gwgv67djp>
References: <20230428091019.1506923-1-kraxel@redhat.com>
 <01020187c79d5eef-190d7d28-3c30-442a-913b-4dae66b71839-000000@eu-west-1.amazonses.com>
MIME-Version: 1.0
In-Reply-To: <01020187c79d5eef-190d7d28-3c30-442a-913b-4dae66b71839-000000@eu-west-1.amazonses.com>
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.9
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Apr 28, 2023 at 11:26:32AM +0000, Michael Brown wrote:
> On 28/04/2023 10:10, Gerd Hoffmann wrote:
> > OVMF can't guarantee that the ASSERT() doesn't happen.  Misbehaving
> > EFI applications can trigger this.  So log a warning instead and try
> > to continue.
> > 
> > Reproducer: Fetch windows 11 22H2 iso image, boot it in qemu with OVMF.
> > 
> > Traced to BootServices->Stall() being called with IPL=TPL_HIGH_LEVEL
> > and Interrupts /enabled/ while windows is booting.
> 
> Do we know how the system ended up in a state of being at TPL_HIGH_LEVEL
> with interrupts enabled?  This ought not to be possible: the code in EDK2
> will (as far as I can tell) always maintain the invariant that interrupts
> are disabled while at TPL_HIGH_LEVEL.

Not clear.  Have not found anything in edk2 either.  With a breakpoint
at BootServices->Stall (CoreStall function) I can see it being called
with IPL=TPL_HIGH_LEVEL, most likely from windows boot code, stack
traces point to locations in low memory where no edk2 code is located.

After running for a while (and probably loading files from cdrom to
memory) these calls suddenly start happen with interrupts enabled.

I suspect the windows boot loader does something fishy here, but I can't
proof it, I have not yet pinned down the exact location where interrupts
get enabled while running at IPL=TPL_HIGH_LEVEL (which is what I suspect
is happening, but of course this is not the only possible theory how
that ASSERT got triggered).

Not fully sure how to best continue debugging this, I don't think gdb
can set an watchpoint on eflags.if ...

> > --- a/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
> > +++ b/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c
> > @@ -39,7 +39,15 @@ NestedInterruptRaiseTPL (
> >     //
> >     ASSERT (GetInterruptState () == FALSE);
> >     InterruptedTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL);
> > -  ASSERT (InterruptedTPL < TPL_HIGH_LEVEL);
> > +  if (InterruptedTPL >= TPL_HIGH_LEVEL) {
> > +    DEBUG ((
> > +      DEBUG_WARN,
> > +      "%a: Called at IPL %d, trying to fixup and continue...\n",
> > +      __func__,
> > +      InterruptedTPL
> > +      ));
> > +    InterruptedTPL = TPL_HIGH_LEVEL - 1;
> > +  }
> 
> This workaround looks wrong to me: it will cause the subsequent call to
> NestedInterruptRestoreTPL() to drop the TPL back down to TPL_HIGH_LEVEL-1,
> which is lower than it would have been when the interrupt occurred.  The
> completed hardware interrupt would then result in an overall change of TPL,
> which cannot be correct.

The idea was to set InterruptedTPL to the highest level which is allowed
to run with interrupts enabled and continue running with interrupts
enabled, hoping that things get back into normal once the next timer
interrupt arrives.

Which -- now that I'm thinking about it again -- is clearly bogus, we
are in the timer interrupt so IRQs have already been disabled at that
point, so the fixup idea doesn't make much sense.

So just leave InterruptedTPL alone and hope for the best?

take care,
  Gerd