From: Laszlo Ersek
To: Evgeny Yakovlev
Cc: edk2-devel@ml01.01.org, eyakovlev@virtuozzo.com, den@virtuozzo.com, Jeff Fan
Date: Wed, 23 Nov 2016 17:54:41 +0100
Subject: Re: OvmfPkg: VM crashed trying to write to RO memory from CommonInterruptEntry
References: <2340021c-4bcb-2622-07a8-6e6173f94d81@redhat.com> <9fcf577d-cf9e-db6f-c0f8-6842baf8bb83@redhat.com>
List-Id: EDK II Development

On 11/23/16 09:37, Evgeny Yakovlev wrote:
> You are right of course about the old tree, no objections here.
> I will try to advocate for an update, however I am pretty sure we're
> stuck with our version for some time at least.
>
> Still, my original question was whether it is normal for OVMF Sec/Pei
> stage to have its stack so close to 0x100000

The SEC phase, and the part of the PEI phase that runs before installing
and migrating to the permanent PEI RAM, use the range 8256KB..8288KB as
heap and stack (half/half, 16 KB and 16 KB). This range is above 8MB.
The address you specify is 1MB. I don't think we ever set up such a
stack intentionally.

It is possible that the failure occurs in one of the AP startup routines
(the APs start in real mode), and the exception handler is executed by
the AP. To know more, the serial port and debug port outputs would be
interesting; but, I should note, based on past experience, when the APs
run into such issues in their startup routines, the final symptoms are
usually garbage / chaotic. These issues are usually fixed by preventing
the APs from wandering off into the woods in the first place (for
example, nailing down race conditions between BSP and APs), and edk2 has
seen a lot of improvements in that area, with the arrival of MpInitLib.

> and/or why interrupt
> handler in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 does not switch
> to a separate stack.

I hope Jeff can provide some insight here.

Thanks
Laszlo

> Code in UefiCpuPkg/Library/CpuExceptionHandlerLib/X64 hasn't been
> touched for 2 years so our version is still relevant.
>
> 2016-11-22 19:58 GMT+03:00 Laszlo Ersek:
>
> On 11/22/16 14:58, Evgeny Yakovlev wrote:
> > Wow, that is more than I expected :)
> >
> >> I wonder if you started to see this issue very recently.
> > Very recently, however we use a pretty old OVMF build, circa 2015
>
> Ugh. Please update OVMF first... A whole lot of things have changed in
> edk2 in this year.
>
> >
> >> OVMF debug log
> > Sorry, we hadn't had it enabled when VM crashed and these crashes are
> > very rare.
> > We will try to capture it when it happens again
> >
> >> - your host CPU model,
> > cpu family : 6
> > model : 42
> > model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> > stepping : 7
> >
> >> - the host kernel (KVM) version,
> > Our kernel is roughly based on RHEL7.2 (kernel version 3.10.0-327.36.1). We
> > also have some upstream KVM patches backported.
> >
> >> - the guest CPU model,
> > -cpu
> > SandyBridge,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,-arat,-xsaveopt,-xgetbv1,-vmx,-xsavec,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer,hv_reset,hv_crash
> >
> >> - the guest CPU topology.
> > 8 sockets, 1 core per socket, 1 thread per core
> >
> > Hope that helps!
>
> The fact that you are using 8 VCPUs is definitely relevant. However, I
> don't think it would make sense to try to analyze any errors with an
> OVMF / edk2 tree this old. Please try to reproduce the issue with a
> fresh build from master.
>
> Thanks!
> Laszlo
>
> > 2016-11-22 16:41 GMT+03:00 Laszlo Ersek:
> >
> >> Hello Evgeny,
> >>
> >> On 11/22/16 13:57, Evgeny Yakovlev wrote:
> >>> We are running windows UEFI-based VMs on QEMU/KVM with OvmfPkg.
> >>>
> >>> Very rarely we are experiencing a crash when VM tries to write to RO
> >>> memory very early during UEFI boot process.
> >>>
> >>> Crash happens when VM tries to execute this code in interrupt handler:
> >>> https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuExceptionHandlerLib/X64/ExceptionHandlerAsm.asm#L244-L246
> >>>
> >>> fxsave [rdi], where RDI = 0xffe60
> >>>
> >>> Which is bad - it points to ISA BIOS F-segment area.
> >>>
> >>> This memory was mapped by qemu for read only access, which is
> >>> reflected in KVM EPT:
> >>> 00000000000e0000-00000000000fffff (prio 1, R-): isa-bios
> >>>
> >>> This is a very early IRQ0 interrupt, presumably during early
> >>> initialization phase (Sec or Pei).
> >>>
> >>> Looks like CommonInterruptHandler does not switch to a separate stack
> >>> and works on interrupted context's stack, which was fairly close to
> >>> 1MB boundary when IRQ0 fired (RSP around 1002c0). When
> >>> CommonInterruptEntry reached highlighted code it subtracted 512 bytes
> >>> from current RSP which dropped to 0xffe60, below 1MB and into QEMU RO
> >>> region.
> >>>
> >>> We were figuring out how to best fix this. Possible solutions are to
> >>> switch to a separate stack in CommonInterruptEntry, relocate early
> >>> OvmfPkg stack to somewhere farther away from 1MB, to run with
> >>> interrupts disabled until we reach a later phase or maybe something
> >>> else.
> >>>
> >>> Any comments would be very appreciated!
> >>
> >> I wonder if you started to see this issue very recently.
> >>
> >> I suspect (hope!) that the symptoms you are experiencing are a
> >> consequence of a bug in UefiCpuPkg that I've debugged and fixed just
> >> today. (I hope to post the patches today.)
> >>
> >> While testing those patches on your end will of course tell us if your
> >> issue has the same root cause, you could gather a few more symptoms
> >> even before I get around posting the patches. The bug that I'm working
> >> on has extremely varied crash symptoms (basically the APs wander off
> >> into the weeds), and some of those symptoms have involved
> >> CpuExceptionHandlerLib. The point is, by the time we get into
> >> CpuExceptionHandlerLib, all is lost -- it is executing on an AP whose
> >> state is corrupt anyway. The fxsave symptom is a red herring, most
> >> likely.
> >>
> >> CpuExceptionHandlerLib works fine otherwise, especially when invoked
> >> from the BSP -- we've used the output dumped by CpuExceptionHandlerLib
> >> to the serial port several times to track down issues.
> >>
> >> So, my request is that you please capture the OVMF debug log (please
> >> see the "OvmfPkg/README" file for how). I'm curious if it crashes
> >> where and how I suspect it crashes.
> >>
> >> Also, it would help if you provided
> >> - your host CPU model,
> >> - the host kernel (KVM) version,
> >> - the guest CPU model,
> >> - the guest CPU topology.
> >>
> >> Thanks!
> >> Laszlo
> >>
> > _______________________________________________
> > edk2-devel mailing list
> > edk2-devel@lists.01.org
> > https://lists.01.org/mailman/listinfo/edk2-devel
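
The address arithmetic discussed in this thread can be sanity-checked with a
short Python sketch. It only restates figures quoted in the messages (the
8256KB..8288KB SEC/early-PEI range, the isa-bios range reported by QEMU, and
RDI = 0xffe60 at the faulting fxsave). The constant and function names are
illustrative assumptions, not identifiers from edk2 or QEMU, and the sketch
says nothing about how the stack pointer ended up near 1MB.

```python
KB = 1024
MB = 1024 * KB

# Figures quoted in the thread (names are hypothetical, for illustration).
TEMP_RAM_START = 8256 * KB    # start of SEC / early PEI heap+stack range
TEMP_RAM_END = 8288 * KB      # end of that range (exclusive)
ISA_BIOS_START = 0x000E0000   # read-only isa-bios region, first byte
ISA_BIOS_LAST = 0x000FFFFF    # read-only isa-bios region, last byte
FAULT_RDI = 0xFFE60           # RDI at the faulting "fxsave [rdi]"
FXSAVE_AREA_SIZE = 512        # fxsave stores a 512-byte area


def in_temp_ram(addr):
    """True if addr lies in the SEC / early PEI heap+stack range."""
    return TEMP_RAM_START <= addr < TEMP_RAM_END


def in_isa_bios(addr):
    """True if addr lies in the read-only isa-bios region."""
    return ISA_BIOS_START <= addr <= ISA_BIOS_LAST


def bytes_below_1mb(addr, size):
    """Number of bytes of [addr, addr + size) that fall below 1MB."""
    return max(0, min(addr + size, 1 * MB) - addr)


if __name__ == "__main__":
    print("temp RAM range:", hex(TEMP_RAM_START), "..", hex(TEMP_RAM_END))
    print("temp RAM size (KB):", (TEMP_RAM_END - TEMP_RAM_START) // KB)
    print("fault address in temp RAM:", in_temp_ram(FAULT_RDI))
    print("fault address in isa-bios:", in_isa_bios(FAULT_RDI))
    print("fxsave bytes below 1MB:",
          bytes_below_1mb(FAULT_RDI, FXSAVE_AREA_SIZE))
```

Under these figures the faulting address is roughly 7MB below the intended
SEC/PEI range and inside the read-only region, which is consistent with the
remark that such a stack is not set up intentionally.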