From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by mx.groups.io with SMTP id smtpd.web10.19984.1682689102652238732 for ; Fri, 28 Apr 2023 06:38:23 -0700 Authentication-Results: mx.groups.io; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=SQxj88zY; spf=pass (domain: redhat.com, ip: 170.10.129.124, mailfrom: kraxel@redhat.com) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1682689101; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=g1jBybuVQn0dteDoP4Z+rOywzQr/VTwxE196pedxZAA=; b=SQxj88zY+jb795b1JWCdi54nC2jYhxamT+L4e0B+OWZF/TIyhNQJ93Cl889qvKu1a2HMv2 dztJqyzO4lZdpiZlIiEaBIbDUOeRlgLxciQ6yupNO2Uh5cz6/bk9A/7o//seuTdYYQrjyh vaX3Q6ZPSDTC6C1dPfuzS8rV0Zm8cCc= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-669-6ywQCza1N1iU57a7bXNcoA-1; Fri, 28 Apr 2023 09:38:18 -0400 X-MC-Unique: 6ywQCza1N1iU57a7bXNcoA-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 24F70381D4D6; Fri, 28 Apr 2023 13:38:18 +0000 (UTC) Received: from sirius.home.kraxel.org (unknown [10.39.195.158]) by smtp.corp.redhat.com (Postfix) with ESMTPS id B9FBA492C13; Fri, 28 Apr 2023 13:38:04 +0000 (UTC) Received: by sirius.home.kraxel.org (Postfix, from userid 1000) id 12BB61800082; Fri, 28 Apr 2023 15:38:03 +0200 (CEST) Date: Fri, 28 Apr 2023 15:38:03 +0200 From: "Gerd Hoffmann" To: Michael Brown Cc: devel@edk2.groups.io, Oliver Steffen , Jiewen Yao , Ard Biesheuvel , Pawel Polawski , Jordan Justen , Laszlo Ersek Subject: Re: [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged. Message-ID: References: <20230428091019.1506923-1-kraxel@redhat.com> <01020187c79d5eef-190d7d28-3c30-442a-913b-4dae66b71839-000000@eu-west-1.amazonses.com> MIME-Version: 1.0 In-Reply-To: <01020187c79d5eef-190d7d28-3c30-442a-913b-4dae66b71839-000000@eu-west-1.amazonses.com> X-Scanned-By: MIMEDefang 3.1 on 10.11.54.9 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Apr 28, 2023 at 11:26:32AM +0000, Michael Brown wrote: > On 28/04/2023 10:10, Gerd Hoffmann wrote: > > OVMF can't guarantee that the ASSERT() doesn't happen. Misbehaving > > EFI applications can trigger this. So log a warning instead and try > > to continue. > > > > Reproducer: Fetch windows 11 22H2 iso image, boot it in qemu with OVMF. > > > > Traced to BootServices->Stall() being called with IPL=TPL_HIGH_LEVEL > > and Interrupts /enabled/ while windows is booting. > > Do we know how the system ended up in a state of being at TPL_HIGH_LEVEL > with interrupts enabled? This ought not to be possible: the code in EDK2 > will (as far as I can tell) always maintain the invariant that interrupts > are disabled while at TPL_HIGH_LEVEL. Not clear. Have not found anything in edk2 either. With a breakpoint at BootServices->Stall (CoreStall function) I can see it being called with IPL=TPL_HIGH_LEVEL, most likely from windows boot code, stack traces point to locations in low memory where no edk2 code is located. After running for a while (and probably loading files from cdrom to memory) these calls suddenly start happen with interrupts enabled. I suspect the windows boot loader does something fishy here, but I can't proof it, I have not yet pinned down the exact location where interrupts get enabled while running at IPL=TPL_HIGH_LEVEL (which is what I suspect is happening, but of course this is not the only possible theory how that ASSERT got triggered). Not fully sure how to best continue debugging this, I don't think gdb can set an watchpoint on eflags.if ... > > --- a/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c > > +++ b/OvmfPkg/Library/NestedInterruptTplLib/Tpl.c > > @@ -39,7 +39,15 @@ NestedInterruptRaiseTPL ( > > // > > ASSERT (GetInterruptState () == FALSE); > > InterruptedTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL); > > - ASSERT (InterruptedTPL < TPL_HIGH_LEVEL); > > + if (InterruptedTPL >= TPL_HIGH_LEVEL) { > > + DEBUG (( > > + DEBUG_WARN, > > + "%a: Called at IPL %d, trying to fixup and continue...\n", > > + __func__, > > + InterruptedTPL > > + )); > > + InterruptedTPL = TPL_HIGH_LEVEL - 1; > > + } > > This workaround looks wrong to me: it will cause the subsequent call to > NestedInterruptRestoreTPL() to drop the TPL back down to TPL_HIGH_LEVEL-1, > which is lower than it would have been when the interrupt occurred. The > completed hardware interrupt would then result in an overall change of TPL, > which cannot be correct. The idea was to set InterruptedTPL to the highest level which is allowed to run with interrupts enabled and continue running with interrupts enabled, hoping that things get back into normal once the next timer interrupt arrives. Which -- now that I'm thinking about it again -- is clearly bogus, we are in the timer interrupt so IRQs have already been disabled at that point, so the fixup idea doesn't make much sense. So just leave InterruptedTPL alone and hope for the best? take care, Gerd