From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=156.151.31.85; helo=userp2120.oracle.com; envelope-from=aaron.young@oracle.com; receiver=edk2-devel@lists.01.org Received: from userp2120.oracle.com (userp2120.oracle.com [156.151.31.85]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id B648020337348 for ; Wed, 3 Oct 2018 09:56:43 -0700 (PDT) Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id w93GsAaQ005230; Wed, 3 Oct 2018 16:56:39 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=subject : to : cc : references : from : message-id : date : mime-version : in-reply-to : content-type : content-transfer-encoding; s=corp-2018-07-02; bh=q2Kck52m/0CO11rSrki/JPNojGK3+TQiFT1m7gJ40OY=; b=bS4wKQSoiIUpArREoBaJLnKFQbQejgYD2APha6lhDW3K2NZefWdn3qtYgjvj1fM4YI17 GUZ5q6Am6Okn+kb5xtQman75BGWO4Y6YftvcGVpb76pAVrem3ItnVHzZfr3oQFP7GVbm 5PsGsju/gYzPCvuTZZuwsRJ+R7KBsWT/1ppTmbiZpjQCiCGsy8pzaP/cwd1gCb6YwJjH VsAuowMVzST8i9BzKMEZwZXwBuUWX9NaGuBQqEj9aJWIWpd6AJYyBBJyWDlr27mvtANc Yqf0yiRHzfBrN/d0tX7OSPvKWf+78E0MKWNnj0kFgKZ9RsLDvJmIH5ku2YK9s7aZ5S4L YA== Received: from userv0022.oracle.com (userv0022.oracle.com [156.151.31.74]) by userp2120.oracle.com with ESMTP id 2mt21r5arj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 03 Oct 2018 16:56:39 +0000 Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by userv0022.oracle.com (8.14.4/8.14.4) with ESMTP id w93Gudl4010810 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 3 Oct 2018 16:56:39 GMT Received: from abhmp0019.oracle.com (abhmp0019.oracle.com [141.146.116.25]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id w93GubK3021246; Wed, 3 Oct 2018 16:56:37 GMT Received: from [10.159.144.187] (/10.159.144.187) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 03 Oct 2018 16:56:37 +0000 To: Laszlo Ersek , edk2-devel@lists.01.org Cc: brijesh.singh@amd.com, Ard Biesheuvel References: From: aaron.young@oracle.com Organization: Oracle Corporation Message-ID: <3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com> Date: Wed, 3 Oct 2018 09:56:36 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.0 MIME-Version: 1.0 In-Reply-To: X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9035 signatures=668707 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1807170000 definitions=main-1810030160 Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 03 Oct 2018 16:56:44 -0000 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US Hello, I have some more information on this issue that I wanted to share. The SNP client appears to be GRUB - i.e. the ASSERT() is asserted just after PXE code has loaded GRUB, presented the GRUB menu and a boot option is selected. And, I have confirmed that it, as predicted, attempts to call SNP->Transmit() twice with the same Buffer - without successfully reaping the Buffer via SNP->GetStatus() in between calls - thus tripping the ASSERT() in SnpSharedHelpers.c:VirtioNetMapTxBuf(). This was quite easy to see since GRUB appears to use only a single Buffer for which to transmit packets a few bytes at a time. i.e. it just calls SNP->Transmit() and then calls SNP->GetStatus() over and over again with the same buffer to transmit its packets. From my debug, I can see it work successfully most of the time (i.e. SNP->Transmit() is called successfully with a corresponding subsequent call to SNP->GetStatus() which returns the Buffer). However, occasionally, SNP->GetStatus() will return TxBuf=NULL (indicating that the Buffer has not yet been transmitted). In this case, GRUB appears to attempt to call SNP->GetStatus() *twice* before giving up (possibly due to a timer which I can see in the GRUB code) and then attempts to transfer the Buffer again - resulting in the ASSERT(). So, this explains the regression since the old code would allow the same Buffer to to be re-transmitted. As a quick test, I coded up a quick workaround in VirtioNetMapTxBuf() to walk through Dev->TxBufCollection to see if the Buffer is present - i.e. has already been transmitted (and not yet reaped via SNP->GetStatus()) - and if so, just return EFI_SUCCESS from VirtioNetTransmit(). This allowed the PXE boot/installation to succeed. I'm not advocating this as a solution, but is an interesting data point. Any suggestion where we go from here? Is there any way to make the code more tolerant of these re-transmissions (like the code was before)? I'm not confident that declaring GRUB as not spec compliant is the answer. Thanks, -Aaron On 08/31/18 06:17, Laszlo Ersek wrote: > On 08/30/18 22:17, aaron.young@oracle.com wrote: >> We have encountered what we believe to be a regression in the >> OvmfPkg/VirtioNetDxe driver. >> >> This regression causes PXE boot to fail with the following ASSERT: >> >> ASSERT >> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184): >> ((BOOLEAN)(0==1)) >> >> The failure was encountered on a aarch64/Armv8 system (Ampere server) >> using the latest mainline (edk2.git/master) code as of 8-28-2018. >> >> On a hunch, I reverted the following patchset which resulted in PXE >> boot being successful - which suggests this patchset is likely the culprit: >> >> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html >> >> The command used to initiate the VM PXE boot/install is: >> >> virt-install --name guest1 --vcpus 16 --ram 32000 --arch aarch64 >> --disk guest1.img,size=100 --network >> type=direct,source=enP4p1s0,model=virtio --pxe >> >> NOTE that I also tried simply removing the offending ASSERT() and PXE >> boot still fails with multiple errors (exceptions of some sort that I >> didn't log). >> >> >> What I'd like to know: >> >> 1. Is this a known failure? > Please read the comment in the context of the failed assertion. If this > assertion is triggered, then that's an error in the code that consumes > the SNP instance provided by VirtioNetDxe. The client code invokes > behavior that is undefined by the UEFI spec. > > You can find more details at: > > http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com > >> 2. Was aarch64 given much testing with this new patchset? > Yes. I described the tests that I had done before pushing the patch set > here: > > http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com > > http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com > > > And most recently (independently of the patch set you refer to), I > successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6, > HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9 > ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13): > > http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com > >> 3. Is reverting this patchset a safe workaround for now until a fix is >> found? Or is there another simple workaround perhaps? > Reverting the set should make the symptom go away, but that just papers > over the undefined behavior in the protocol caller. > > The goal of the series was to enable VirtioNetDxe to function correctly > on top of: > - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that > - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL > underlying the virtio transport, such that > - the platform provides an IOMMU protocol implementation for > PciHostBridgeDxe, which underlies PciBusDxe. > > If this use case does not apply to your platform, then you can revert > the patch set. > > In particular, in the ArmVirtQemu platform, we don't hook > "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into > "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe > there will not wait for another platform driver to decide dynamically > whether the platform has an actual IOMMU protocol or not. And, > accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu. > > (Obviously the patch set was extensively regression-tested to confirm it > was a no-op on platforms where the above use case wouldn't apply.) > >> 4. Any hunches what the issue could be? > Code that uses VirtioNetDxe's SNP protocol interface invokes undefined > behavior; please see my answer to your Q1. > >> Thanks for any info/help/suggestions, > I'd like to give you a hint on debugging this further as well: > > (1) in the VirtioNetTransmit() function, please add DEBUG messages both > before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer" > (with %p). After, log "DeviceAddress" (with "%Lx"). > > (2) In the VirtioNetGetStatus() function, please add DEBUG messages both > before and after the call to VirtioNetUnmapTxBuf(). Before, log > "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p"). > > In each of (1) and (2), the values printed before/after should be > identical (= identity-mapping for DMA). > > Please write a python script that parses the log generated by (1) and > (2). The python script should maintain an associative array where the > key type is "address", and the value type is "integer" (starting value 0). > > For each "Buffer" address printed by (1), the python script should > increase the counter for that address. > > For each "*TxBuf" address printed by (2), the python script should > decrease the counter for that address. > > If any counter goes above 1, then the client code invokes undefined > behavior. If any counter goes under 0, then we messed up something. If > all counters only take values 0 and 1, but you still hit the assert, > then again we messed up something. > > The idea is to check whether: > > - the SNP consumer code tries to Transmit() the same buffer for a second > time *before* it gets it back from GetStatus() -- this is the error that > I'm currently suspecting, > > - or we messed up something (return a buffer that hasn't been queued for > transmission, or else we mistake a not-yet-queued buffer for an already > queued buffer when it is Transmit()ted for the first time). > > > ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT() > will disappear from VirtioNetMapTxBuf(), and Transmit() will simply > return EFI_DEVICE_ERROR. This is consistent with your last paragraph -- > i.e., "boot still fails with multiple errors" -- and the reason for > returning this error is that the SNP consumer really tries to invoke > undefined behavior. It's not always possible to detect UB, but whenever > we can, without a huge performance hit, then in DEBUG builds it's > appropriate to trip an assert, and in RELEASE builds, to return an error. > > Thanks > Laszlo > >> -Aaron Young >> > Thanks > Laszlo