From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=2607:f8b0:4864:20::d2e; helo=mail-io1-xd2e.google.com; envelope-from=ard.biesheuvel@linaro.org; receiver=edk2-devel@lists.01.org Received: from mail-io1-xd2e.google.com (mail-io1-xd2e.google.com [IPv6:2607:f8b0:4864:20::d2e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 5C39721A07A92 for ; Wed, 3 Oct 2018 10:35:11 -0700 (PDT) Received: by mail-io1-xd2e.google.com with SMTP id t7-v6so5682923ioj.13 for ; Wed, 03 Oct 2018 10:35:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=rSmXnGXn87X9jYmEEHrValMdDQyUzx+cUqOkwtWZ1co=; b=XQcTCh89rlhshQdFmgr8GI9gMgEL0/hpb90vPbziNLQp3pg3AU3kw4S7SZcz+gbbi1 dOh3wO0YT5uS0CZfy8bEDJbZ/9hvFxHuSELYWwfEok+IOPy4NFkqL0FIImPQCGURfcn6 dhSQ+FMTS+CZlDbudx9stLEMoqfyP1AFc6Yuk= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=rSmXnGXn87X9jYmEEHrValMdDQyUzx+cUqOkwtWZ1co=; b=JVyRNLJUlxHa8cw+RTensEKSPZX8E+nxSqyIlmzFr0lG/Grm/VeLNP2nLgdC+P5fY0 16r46X8nCdRlw9OKeawiUQ3LJrqpN++lgQ58HdT+g+uSOcbqs9QviqftRjUc1gSvhFUy /w9t9PCNoUz2no9xssOH0WALvnespVy5s1XnwpWHdfJhQxFFQEhSX+e1eglqQ+Sybptb gIuRLxTtnbykXQ4+7nvXZVgOnFr5Q7F8rL8Ih+27NrZHhg2cJ8eYYZeSZbui4IZ+fhsd zmQbCiZQce+qb0icyzuxvQ167dYabtBq3Ui+Cgkikh4iqg78yt/o3p+IQeVlHxQ4uLJh 7ecQ== X-Gm-Message-State: ABuFfojtOybZVH5728ZLnwcLtFIh4nU10XVP6Lc7pTxcE50iWRx0GfAY DbDd/bT1Gzdej70s8aLRhjZX6o7tBkfiB42JK2azcw== X-Google-Smtp-Source: ACcGV63X3PufAYep2ZhUiBC3y8n2W7GrFg4EicVcZ14ZBDGwIs6YXyDaURaUYtAOvvJMIAmZTfWc2pHw96SoShTXOhA= X-Received: by 2002:a6b:be83:: with SMTP id o125-v6mr1805766iof.173.1538588110024; Wed, 03 Oct 2018 10:35:10 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a6b:5910:0:0:0:0:0 with HTTP; Wed, 3 Oct 2018 10:35:09 -0700 (PDT) In-Reply-To: <3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com> References: <3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com> From: Ard Biesheuvel Date: Wed, 3 Oct 2018 19:35:09 +0200 Message-ID: To: aaron.young@oracle.com, Leif Lindholm Cc: Laszlo Ersek , "edk2-devel@lists.01.org" , Brijesh Singh Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 03 Oct 2018 17:35:11 -0000 Content-Type: text/plain; charset="UTF-8" (+ Leif) On 3 October 2018 at 18:56, wrote: > > Hello, > > I have some more information on this issue that I wanted to share. > > The SNP client appears to be GRUB - i.e. the ASSERT() is asserted just > after PXE code has loaded GRUB, presented the GRUB menu and a boot option is > selected. > > And, I have confirmed that it, as predicted, attempts to call > SNP->Transmit() twice with the same Buffer - without successfully reaping > the Buffer via SNP->GetStatus() in between calls - thus tripping the > ASSERT() in SnpSharedHelpers.c:VirtioNetMapTxBuf(). > > This was quite easy to see since GRUB appears to use only a single Buffer > for which to transmit packets a few bytes at a time. i.e. it just calls > SNP->Transmit() and then calls SNP->GetStatus() over and over again with the > same buffer to transmit its packets. > > From my debug, I can see it work successfully most of the time (i.e. > SNP->Transmit() is called successfully with a corresponding subsequent call > to SNP->GetStatus() which returns the Buffer). However, occasionally, > SNP->GetStatus() will return TxBuf=NULL (indicating that the Buffer has not > yet been transmitted). In this case, GRUB appears to attempt to call > SNP->GetStatus() *twice* before giving up (possibly due to a timer which I > can see in the GRUB code) and then attempts to transfer the Buffer again - > resulting in the ASSERT(). > > So, this explains the regression since the old code would allow the same > Buffer to to be re-transmitted. > > As a quick test, I coded up a quick workaround in VirtioNetMapTxBuf() to > walk through Dev->TxBufCollection to see if the Buffer is present - i.e. has > already been transmitted (and not yet reaped via SNP->GetStatus()) - and if > so, just return EFI_SUCCESS from VirtioNetTransmit(). This allowed the PXE > boot/installation to succeed. I'm not advocating this as a solution, but is > an interesting data point. > > Any suggestion where we go from here? > > Is there any way to make the code more tolerant of these re-transmissions > (like the code was before)? > > I'm not confident that declaring GRUB as not spec compliant is the answer. > > Thanks, > > -Aaron > > > > On 08/31/18 06:17, Laszlo Ersek wrote: >> >> On 08/30/18 22:17, aaron.young@oracle.com wrote: >>> >>> We have encountered what we believe to be a regression in the >>> OvmfPkg/VirtioNetDxe driver. >>> >>> This regression causes PXE boot to fail with the following ASSERT: >>> >>> ASSERT >>> >>> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184): >>> ((BOOLEAN)(0==1)) >>> >>> The failure was encountered on a aarch64/Armv8 system (Ampere server) >>> using the latest mainline (edk2.git/master) code as of 8-28-2018. >>> >>> On a hunch, I reverted the following patchset which resulted in PXE >>> boot being successful - which suggests this patchset is likely the >>> culprit: >>> >>> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html >>> >>> The command used to initiate the VM PXE boot/install is: >>> >>> virt-install --name guest1 --vcpus 16 --ram 32000 --arch aarch64 >>> --disk guest1.img,size=100 --network >>> type=direct,source=enP4p1s0,model=virtio --pxe >>> >>> NOTE that I also tried simply removing the offending ASSERT() and PXE >>> boot still fails with multiple errors (exceptions of some sort that I >>> didn't log). >>> >>> >>> What I'd like to know: >>> >>> 1. Is this a known failure? >> >> Please read the comment in the context of the failed assertion. If this >> assertion is triggered, then that's an error in the code that consumes >> the SNP instance provided by VirtioNetDxe. The client code invokes >> behavior that is undefined by the UEFI spec. >> >> You can find more details at: >> >> >> http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com >> >>> 2. Was aarch64 given much testing with this new patchset? >> >> Yes. I described the tests that I had done before pushing the patch set >> here: >> >> >> http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com >> >> >> http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com >> >> >> And most recently (independently of the patch set you refer to), I >> successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6, >> HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9 >> ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13): >> >> http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com >> >>> 3. Is reverting this patchset a safe workaround for now until a fix is >>> found? Or is there another simple workaround perhaps? >> >> Reverting the set should make the symptom go away, but that just papers >> over the undefined behavior in the protocol caller. >> >> The goal of the series was to enable VirtioNetDxe to function correctly >> on top of: >> - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that >> - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL >> underlying the virtio transport, such that >> - the platform provides an IOMMU protocol implementation for >> PciHostBridgeDxe, which underlies PciBusDxe. >> >> If this use case does not apply to your platform, then you can revert >> the patch set. >> >> In particular, in the ArmVirtQemu platform, we don't hook >> "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into >> "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe >> there will not wait for another platform driver to decide dynamically >> whether the platform has an actual IOMMU protocol or not. And, >> accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu. >> >> (Obviously the patch set was extensively regression-tested to confirm it >> was a no-op on platforms where the above use case wouldn't apply.) >> >>> 4. Any hunches what the issue could be? >> >> Code that uses VirtioNetDxe's SNP protocol interface invokes undefined >> behavior; please see my answer to your Q1. >> >>> Thanks for any info/help/suggestions, >> >> I'd like to give you a hint on debugging this further as well: >> >> (1) in the VirtioNetTransmit() function, please add DEBUG messages both >> before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer" >> (with %p). After, log "DeviceAddress" (with "%Lx"). >> >> (2) In the VirtioNetGetStatus() function, please add DEBUG messages both >> before and after the call to VirtioNetUnmapTxBuf(). Before, log >> "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p"). >> >> In each of (1) and (2), the values printed before/after should be >> identical (= identity-mapping for DMA). >> >> Please write a python script that parses the log generated by (1) and >> (2). The python script should maintain an associative array where the >> key type is "address", and the value type is "integer" (starting value 0). >> >> For each "Buffer" address printed by (1), the python script should >> increase the counter for that address. >> >> For each "*TxBuf" address printed by (2), the python script should >> decrease the counter for that address. >> >> If any counter goes above 1, then the client code invokes undefined >> behavior. If any counter goes under 0, then we messed up something. If >> all counters only take values 0 and 1, but you still hit the assert, >> then again we messed up something. >> >> The idea is to check whether: >> >> - the SNP consumer code tries to Transmit() the same buffer for a second >> time *before* it gets it back from GetStatus() -- this is the error that >> I'm currently suspecting, >> >> - or we messed up something (return a buffer that hasn't been queued for >> transmission, or else we mistake a not-yet-queued buffer for an already >> queued buffer when it is Transmit()ted for the first time). >> >> >> ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT() >> will disappear from VirtioNetMapTxBuf(), and Transmit() will simply >> return EFI_DEVICE_ERROR. This is consistent with your last paragraph -- >> i.e., "boot still fails with multiple errors" -- and the reason for >> returning this error is that the SNP consumer really tries to invoke >> undefined behavior. It's not always possible to detect UB, but whenever >> we can, without a huge performance hit, then in DEBUG builds it's >> appropriate to trip an assert, and in RELEASE builds, to return an error. >> >> Thanks >> Laszlo >> >>> -Aaron Young >>> >> Thanks >> Laszlo > >