From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=141.146.126.78; helo=aserp2120.oracle.com; envelope-from=aaron.young@oracle.com; receiver=edk2-devel@lists.01.org Received: from aserp2120.oracle.com (aserp2120.oracle.com [141.146.126.78]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 29109211616A5 for ; Wed, 3 Oct 2018 17:00:08 -0700 (PDT) Received: from pps.filterd (aserp2120.oracle.com [127.0.0.1]) by aserp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id w93Nxrff030554; Thu, 4 Oct 2018 00:00:02 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=subject : to : cc : references : from : message-id : date : mime-version : in-reply-to : content-type : content-transfer-encoding; s=corp-2018-07-02; bh=jv3JC3Cs1o1dT5j4Tjwhdukw4wvJHKi+SCMDtkO2/ng=; b=VftFf5+Zb6UWlPyWT8a3aP6wbLmL/iqvjp4aDFFNIOhpaLsRAtiLSVtdHdYXOm+lY2OX zKePzil+PAEkcYjdwWC9rWxW+7yeJFtaYCj90uPxxIKzYb4iGw0UNQC7d/wZ9g1AfOA3 Tbr9QurrBW7C9UdbIabCjlsoozGAB46SaH1I1+ZDHokGSpDpr5LcXlfk1Y161i2MfoMg hX3tUF/uzdM8/eumcAdHJp2YdJwNtosMizexy0WDQJoWsEpCDEarK2BETNMrnDmy+n0u E4EazXaKnf/6PSy81j7Dp4YFTrjMnuzHkiVxQMebSvuwreAbp0pN2fhBK/JcQV2jf08t rw== Received: from userv0022.oracle.com (userv0022.oracle.com [156.151.31.74]) by aserp2120.oracle.com with ESMTP id 2mt1bq7c15-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 04 Oct 2018 00:00:02 +0000 Received: from userv0122.oracle.com (userv0122.oracle.com [156.151.31.75]) by userv0022.oracle.com (8.14.4/8.14.4) with ESMTP id w93NxuWF025387 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 3 Oct 2018 23:59:56 GMT Received: from abhmp0016.oracle.com (abhmp0016.oracle.com [141.146.116.22]) by userv0122.oracle.com (8.14.4/8.14.4) with ESMTP id w93NxtsQ009612; Wed, 3 Oct 2018 23:59:56 GMT Received: from [10.159.144.187] (/10.159.144.187) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 03 Oct 2018 23:59:55 +0000 To: Ard Biesheuvel , Leif Lindholm Cc: Laszlo Ersek , "edk2-devel@lists.01.org" , Brijesh Singh References: <3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com> From: aaron.young@oracle.com Organization: Oracle Corporation Message-ID: Date: Wed, 3 Oct 2018 16:59:54 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.0 MIME-Version: 1.0 In-Reply-To: X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9035 signatures=668707 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1810030219 Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 04 Oct 2018 00:00:09 -0000 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US I am suspecting that this patch to GRUB is the cause of a Buffer being re-transmitted before reaping the Buffer via SNP->GetStatus(): https://git.centos.org/blob/rpms!grub2.git/1065bd29e776aef83f927747882140dcb6fd5cde/SOURCES!0183-efinet-retransmit-if-our-device-is-busy.patch So, to reproduce the issue, the GRUB used via PXE boot needs to include this patch. Thanks, -Aaron On 10/03/18 10:35, Ard Biesheuvel wrote: > (+ Leif) > > On 3 October 2018 at 18:56, wrote: >> Hello, >> >> I have some more information on this issue that I wanted to share. >> >> The SNP client appears to be GRUB - i.e. the ASSERT() is asserted just >> after PXE code has loaded GRUB, presented the GRUB menu and a boot option is >> selected. >> >> And, I have confirmed that it, as predicted, attempts to call >> SNP->Transmit() twice with the same Buffer - without successfully reaping >> the Buffer via SNP->GetStatus() in between calls - thus tripping the >> ASSERT() in SnpSharedHelpers.c:VirtioNetMapTxBuf(). >> >> This was quite easy to see since GRUB appears to use only a single Buffer >> for which to transmit packets a few bytes at a time. i.e. it just calls >> SNP->Transmit() and then calls SNP->GetStatus() over and over again with the >> same buffer to transmit its packets. >> >> From my debug, I can see it work successfully most of the time (i.e. >> SNP->Transmit() is called successfully with a corresponding subsequent call >> to SNP->GetStatus() which returns the Buffer). However, occasionally, >> SNP->GetStatus() will return TxBuf=NULL (indicating that the Buffer has not >> yet been transmitted). In this case, GRUB appears to attempt to call >> SNP->GetStatus() *twice* before giving up (possibly due to a timer which I >> can see in the GRUB code) and then attempts to transfer the Buffer again - >> resulting in the ASSERT(). >> >> So, this explains the regression since the old code would allow the same >> Buffer to to be re-transmitted. >> >> As a quick test, I coded up a quick workaround in VirtioNetMapTxBuf() to >> walk through Dev->TxBufCollection to see if the Buffer is present - i.e. has >> already been transmitted (and not yet reaped via SNP->GetStatus()) - and if >> so, just return EFI_SUCCESS from VirtioNetTransmit(). This allowed the PXE >> boot/installation to succeed. I'm not advocating this as a solution, but is >> an interesting data point. >> >> Any suggestion where we go from here? >> >> Is there any way to make the code more tolerant of these re-transmissions >> (like the code was before)? >> >> I'm not confident that declaring GRUB as not spec compliant is the answer. >> >> Thanks, >> >> -Aaron >> >> >> >> On 08/31/18 06:17, Laszlo Ersek wrote: >>> On 08/30/18 22:17, aaron.young@oracle.com wrote: >>>> We have encountered what we believe to be a regression in the >>>> OvmfPkg/VirtioNetDxe driver. >>>> >>>> This regression causes PXE boot to fail with the following ASSERT: >>>> >>>> ASSERT >>>> >>>> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184): >>>> ((BOOLEAN)(0==1)) >>>> >>>> The failure was encountered on a aarch64/Armv8 system (Ampere server) >>>> using the latest mainline (edk2.git/master) code as of 8-28-2018. >>>> >>>> On a hunch, I reverted the following patchset which resulted in PXE >>>> boot being successful - which suggests this patchset is likely the >>>> culprit: >>>> >>>> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html >>>> >>>> The command used to initiate the VM PXE boot/install is: >>>> >>>> virt-install --name guest1 --vcpus 16 --ram 32000 --arch aarch64 >>>> --disk guest1.img,size=100 --network >>>> type=direct,source=enP4p1s0,model=virtio --pxe >>>> >>>> NOTE that I also tried simply removing the offending ASSERT() and PXE >>>> boot still fails with multiple errors (exceptions of some sort that I >>>> didn't log). >>>> >>>> >>>> What I'd like to know: >>>> >>>> 1. Is this a known failure? >>> Please read the comment in the context of the failed assertion. If this >>> assertion is triggered, then that's an error in the code that consumes >>> the SNP instance provided by VirtioNetDxe. The client code invokes >>> behavior that is undefined by the UEFI spec. >>> >>> You can find more details at: >>> >>> >>> http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com >>> >>>> 2. Was aarch64 given much testing with this new patchset? >>> Yes. I described the tests that I had done before pushing the patch set >>> here: >>> >>> >>> http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com >>> >>> >>> http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com >>> >>> >>> And most recently (independently of the patch set you refer to), I >>> successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6, >>> HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9 >>> ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13): >>> >>> http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com >>> >>>> 3. Is reverting this patchset a safe workaround for now until a fix is >>>> found? Or is there another simple workaround perhaps? >>> Reverting the set should make the symptom go away, but that just papers >>> over the undefined behavior in the protocol caller. >>> >>> The goal of the series was to enable VirtioNetDxe to function correctly >>> on top of: >>> - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that >>> - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL >>> underlying the virtio transport, such that >>> - the platform provides an IOMMU protocol implementation for >>> PciHostBridgeDxe, which underlies PciBusDxe. >>> >>> If this use case does not apply to your platform, then you can revert >>> the patch set. >>> >>> In particular, in the ArmVirtQemu platform, we don't hook >>> "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into >>> "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe >>> there will not wait for another platform driver to decide dynamically >>> whether the platform has an actual IOMMU protocol or not. And, >>> accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu. >>> >>> (Obviously the patch set was extensively regression-tested to confirm it >>> was a no-op on platforms where the above use case wouldn't apply.) >>> >>>> 4. Any hunches what the issue could be? >>> Code that uses VirtioNetDxe's SNP protocol interface invokes undefined >>> behavior; please see my answer to your Q1. >>> >>>> Thanks for any info/help/suggestions, >>> I'd like to give you a hint on debugging this further as well: >>> >>> (1) in the VirtioNetTransmit() function, please add DEBUG messages both >>> before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer" >>> (with %p). After, log "DeviceAddress" (with "%Lx"). >>> >>> (2) In the VirtioNetGetStatus() function, please add DEBUG messages both >>> before and after the call to VirtioNetUnmapTxBuf(). Before, log >>> "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p"). >>> >>> In each of (1) and (2), the values printed before/after should be >>> identical (= identity-mapping for DMA). >>> >>> Please write a python script that parses the log generated by (1) and >>> (2). The python script should maintain an associative array where the >>> key type is "address", and the value type is "integer" (starting value 0). >>> >>> For each "Buffer" address printed by (1), the python script should >>> increase the counter for that address. >>> >>> For each "*TxBuf" address printed by (2), the python script should >>> decrease the counter for that address. >>> >>> If any counter goes above 1, then the client code invokes undefined >>> behavior. If any counter goes under 0, then we messed up something. If >>> all counters only take values 0 and 1, but you still hit the assert, >>> then again we messed up something. >>> >>> The idea is to check whether: >>> >>> - the SNP consumer code tries to Transmit() the same buffer for a second >>> time *before* it gets it back from GetStatus() -- this is the error that >>> I'm currently suspecting, >>> >>> - or we messed up something (return a buffer that hasn't been queued for >>> transmission, or else we mistake a not-yet-queued buffer for an already >>> queued buffer when it is Transmit()ted for the first time). >>> >>> >>> ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT() >>> will disappear from VirtioNetMapTxBuf(), and Transmit() will simply >>> return EFI_DEVICE_ERROR. This is consistent with your last paragraph -- >>> i.e., "boot still fails with multiple errors" -- and the reason for >>> returning this error is that the SNP consumer really tries to invoke >>> undefined behavior. It's not always possible to detect UB, but whenever >>> we can, without a huge performance hit, then in DEBUG builds it's >>> appropriate to trip an assert, and in RELEASE builds, to return an error. >>> >>> Thanks >>> Laszlo >>> >>>> -Aaron Young >>>> >>> Thanks >>> Laszlo >>