From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=66.187.233.73; helo=mx1.redhat.com; envelope-from=lersek@redhat.com; receiver=edk2-devel@lists.01.org Received: from mx1.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 9E5902110BD45 for ; Fri, 31 Aug 2018 06:17:57 -0700 (PDT) Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 9E1597DAC2; Fri, 31 Aug 2018 13:17:56 +0000 (UTC) Received: from lacos-laptop-7.usersys.redhat.com (ovpn-120-58.rdu2.redhat.com [10.10.120.58]) by smtp.corp.redhat.com (Postfix) with ESMTP id 9817F63A73; Fri, 31 Aug 2018 13:17:55 +0000 (UTC) To: aaron.young@oracle.com, edk2-devel@lists.01.org Cc: brijesh.singh@amd.com, Ard Biesheuvel References: From: Laszlo Ersek Message-ID: Date: Fri, 31 Aug 2018 15:17:54 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: X-Scanned-By: MIMEDefang 2.79 on 10.11.54.5 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.2]); Fri, 31 Aug 2018 13:17:56 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.2]); Fri, 31 Aug 2018 13:17:56 +0000 (UTC) for IP:'10.11.54.5' DOMAIN:'int-mx05.intmail.prod.int.rdu2.redhat.com' HELO:'smtp.corp.redhat.com' FROM:'lersek@redhat.com' RCPT:'' Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Aug 2018 13:17:58 -0000 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit On 08/30/18 22:17, aaron.young@oracle.com wrote: >   We have encountered what we believe to be a regression in the > OvmfPkg/VirtioNetDxe driver. > >   This regression causes PXE boot to fail with the following ASSERT: > > ASSERT > [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184): > ((BOOLEAN)(0==1)) > >   The failure was encountered on a aarch64/Armv8 system (Ampere server) > using the latest mainline (edk2.git/master) code as of 8-28-2018. > >   On a hunch, I reverted the following patchset which resulted in PXE > boot being successful - which suggests this patchset is likely the culprit: > > https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html > >   The command used to initiate the VM PXE boot/install is: > >   virt-install --name guest1  --vcpus 16 --ram 32000 --arch aarch64 > --disk guest1.img,size=100 --network > type=direct,source=enP4p1s0,model=virtio --pxe > >   NOTE that I also tried simply removing the offending ASSERT() and PXE > boot still fails with multiple errors (exceptions of some sort that I > didn't log). > > >   What I'd like to know: > >   1. Is this a known failure? Please read the comment in the context of the failed assertion. If this assertion is triggered, then that's an error in the code that consumes the SNP instance provided by VirtioNetDxe. The client code invokes behavior that is undefined by the UEFI spec. You can find more details at: http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com >   2. Was aarch64 given much testing with this new patchset? Yes. I described the tests that I had done before pushing the patch set here: http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com And most recently (independently of the patch set you refer to), I successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6, HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9 ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13): http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com >   3. Is reverting this patchset a safe workaround for now until a fix is > found? Or is there another simple workaround perhaps? Reverting the set should make the symptom go away, but that just papers over the undefined behavior in the protocol caller. The goal of the series was to enable VirtioNetDxe to function correctly on top of: - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL underlying the virtio transport, such that - the platform provides an IOMMU protocol implementation for PciHostBridgeDxe, which underlies PciBusDxe. If this use case does not apply to your platform, then you can revert the patch set. In particular, in the ArmVirtQemu platform, we don't hook "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe there will not wait for another platform driver to decide dynamically whether the platform has an actual IOMMU protocol or not. And, accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu. (Obviously the patch set was extensively regression-tested to confirm it was a no-op on platforms where the above use case wouldn't apply.) >   4. Any hunches what the issue could be? Code that uses VirtioNetDxe's SNP protocol interface invokes undefined behavior; please see my answer to your Q1. >   Thanks for any info/help/suggestions, I'd like to give you a hint on debugging this further as well: (1) in the VirtioNetTransmit() function, please add DEBUG messages both before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer" (with %p). After, log "DeviceAddress" (with "%Lx"). (2) In the VirtioNetGetStatus() function, please add DEBUG messages both before and after the call to VirtioNetUnmapTxBuf(). Before, log "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p"). In each of (1) and (2), the values printed before/after should be identical (= identity-mapping for DMA). Please write a python script that parses the log generated by (1) and (2). The python script should maintain an associative array where the key type is "address", and the value type is "integer" (starting value 0). For each "Buffer" address printed by (1), the python script should increase the counter for that address. For each "*TxBuf" address printed by (2), the python script should decrease the counter for that address. If any counter goes above 1, then the client code invokes undefined behavior. If any counter goes under 0, then we messed up something. If all counters only take values 0 and 1, but you still hit the assert, then again we messed up something. The idea is to check whether: - the SNP consumer code tries to Transmit() the same buffer for a second time *before* it gets it back from GetStatus() -- this is the error that I'm currently suspecting, - or we messed up something (return a buffer that hasn't been queued for transmission, or else we mistake a not-yet-queued buffer for an already queued buffer when it is Transmit()ted for the first time). ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT() will disappear from VirtioNetMapTxBuf(), and Transmit() will simply return EFI_DEVICE_ERROR. This is consistent with your last paragraph -- i.e., "boot still fails with multiple errors" -- and the reason for returning this error is that the SNP consumer really tries to invoke undefined behavior. It's not always possible to detect UB, but whenever we can, without a huge performance hit, then in DEBUG builds it's appropriate to trip an assert, and in RELEASE builds, to return an error. Thanks Laszlo >   -Aaron Young > Thanks Laszlo