From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pbonzini@redhat.com>
Received: from mx5-phx2.redhat.com (mx5-phx2.redhat.com [209.132.183.37])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id 6C3D681C8A
 for <edk2-devel@ml01.01.org>; Wed,  9 Nov 2016 03:23:32 -0800 (PST)
Received: from zmail13.collab.prod.int.phx2.redhat.com
 (zmail13.collab.prod.int.phx2.redhat.com [10.5.83.15])
 by mx5-phx2.redhat.com (8.14.4/8.14.4) with ESMTP id uA9BNXFv048390;
 Wed, 9 Nov 2016 06:23:33 -0500
Date: Wed, 9 Nov 2016 06:23:31 -0500 (EST)
From: Paolo Bonzini <pbonzini@redhat.com>
To: Laszlo Ersek <lersek@redhat.com>
Cc: Jiewen Yao <jiewen.yao@intel.com>, edk2-devel@ml01.01.org,
 Michael D Kinney <michael.d.kinney@intel.com>,
 Feng Tian <feng.tian@intel.com>, Jeff Fan <jeff.fan@intel.com>,
 Star Zeng <star.zeng@intel.com>
Message-ID: <867102584.11653909.1478690611541.JavaMail.zimbra@redhat.com>
In-Reply-To: <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com>
References: <1478251854-14660-1-git-send-email-jiewen.yao@intel.com>
 <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com>
MIME-Version: 1.0
X-Originating-IP: [10.4.164.1, 10.5.101.130]
X-Mailer: Zimbra 8.0.6_GA_5922 (ZimbraWebClient - FF49 (Linux)/8.0.6_GA_5922)
Thread-Topic: Enable SMM page level protection.
Thread-Index: KQWMDEodUYoSJIG8Qu8AtnU/w9DBUQ==
Subject: Re: [PATCH V2 0/6] Enable SMM page level protection.
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: EDK II Development  <edk2-devel.lists.01.org>
List-Unsubscribe: <https://lists.01.org/mailman/options/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=unsubscribe>
List-Archive: <http://lists.01.org/pipermail/edk2-devel/>
List-Post: <mailto:edk2-devel@lists.01.org>
List-Help: <mailto:edk2-devel-request@lists.01.org?subject=help>
List-Subscribe: <https://lists.01.org/mailman/listinfo/edk2-devel>,
 <mailto:edk2-devel-request@lists.01.org?subject=subscribe>
X-List-Received-Date: Wed, 09 Nov 2016 11:23:32 -0000
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


>   * Second, the instruction that causes things to blow up is <0f aa>,
>     i.e., RSM. I have absolutely no clue why RSM is executed:

It's probably not RSM.  RSM is probably the last instruction executed
before, and it's still in the buffer because, as you said, there's no
way that you can fetch an instruction while CR3 points into SMM.

My first thought was that the MMU is for some reason out of contact
with reality, but actually the CR3 write is correct:

             CPU-24446 [002] 39841.871040: kvm_exit:             reason CR_ACCESS rip 0x9f05e info 103 0
             CPU-24446 [002] 39841.871040: kvm_cr:               cr_write 3 = 0x7ff7f000

and it's coming from the stub as well.  So the second thought was that
the wakeup buffer has the wrong CR3 put into the wakeup buffer's Cr3 location.
I'm glad I kept looking because it was much more entertaining.  Especially
knowing that I (probably) will not have to fix it. :)

The basic idea for debugging was to look for interesting events and
use 0x402 writes to correlate them to the debug log.  For example, most
accesses to 0x9f??? are obviously not traced by KVM, but the first ones
are:

31519-              CPU-24444 [006] 39841.783344: kvm_exit:             reason EPT_VIOLATION rip 0x855d82 info 181 0
31520:              CPU-24444 [006] 39841.783344: kvm_page_fault:       address 9f000 error_code 181
280224-             CPU-24444 [006] 39841.860940: kvm_exit:             reason EPT_VIOLATION rip 0x7ffd0d15 info 182 0
280225:             CPU-24444 [006] 39841.860940: kvm_page_fault:       address 9f000 error_code 182

(The number is just the line number in the trace).  Luckily your machine
didn't have EPT accessed/dirty bits, so KVM trapped both the first read
and the first write.

The read is at

WakeupBufferStart = 9F000, WakeupBufferSize = 1000

but it's not too interesting.  The second is a good one to start debugging
because it's from SMRAM (though not from SMM, since the first kvm_enter_smm
happens later at 305930).  So it makes sense that it writes an SMRAM CR3.
There is a write to the debug log just before, at 279993, and it writes
"SmmRestoreCpu()".  As expected, the write is followed by a flurry of MSR
writes, the APIC programming at 280131, so I am pretty sure that the write to
mExchangeInfo->Cr3 comes from PrepareApStartupVector.

FWIW, I first looked at the call chain up from BackupAndPrepareWakeupBuffer,
but that led me nowhere for an hour.  So I was a bit lucky indeed. :)

Anyhow, SmmRestoreCpu is the SmmS3ResumeEntryPoint for S3Resume2Pei, and
indeed, earlier in the log you have this debugging output from S3Resume2Pei:

SMM S3 CR3                      = 7FF7F000

Doh, maybe I should have looked at the log before the trace.  Who knows.
Anyway, the SMM_S3_RESUME_STATE is initialized by InitSmmS3ResumeState,
so the CR3 is the one that is initialized by InitSmmS3Cr3 in
UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmProfileArch.c.  At this point I
was still thinking that this CR3 was wrong, but by looking at the
places where SMM is entered, and correlating that with debug log writes,
the puzzle was relatively easy to solve:

1) SMBASE relocation, done by SmmRestoreCpu:

305930:             CPU-24445 [005] 39841.871264: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x30000
306000:             CPU-24445 [005] 39841.871318: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
306051:             CPU-24446 [002] 39841.871349: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x30000
306108:             CPU-24446 [002] 39841.871390: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
306161:             CPU-24447 [004] 39841.871421: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x30000
306218:             CPU-24447 [004] 39841.871463: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
306254:             CPU-24444 [006] 39841.871473: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x30000
306311:             CPU-24444 [006] 39841.871512: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000

2) S3ResumeExecuteBootScript (again, the previous 0x402 write ends
at 334597 and promptly gives us a clue):

334698:             CPU-24445 [005] 39841.882706: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
334699:             CPU-24447 [004] 39841.882706: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
334741:             CPU-24444 [006] 39841.882723: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
334742:             CPU-24446 [002] 39841.882724: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
334875:             CPU-24444 [006] 39841.882755: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000

Here I think that it's where things go awry.  The lines after
S3ResumeExecuteBootScript() are

   Close all SMRAM regions before executing boot script
   Lock all SMRAM regions before executing boot script

and indeed the first is at 334898, immediately after VCPU0 leaves
SMM.  But, closing and locking of SMRAM happens while the APs are
still in SMM!  The BSP instead goes on merrily and, after the debug
log has "PeiMpInitLib: CpuMpEndOfPeiCallback () invoked" (0x402
write ends at 364869) we have another access to 0x9f000, this time a
write.  It's RestoreWakeupBuffer:

364908-             CPU-24444 [006] 39841.890320: kvm_exit:             reason EPT_VIOLATION rip 0x855d82 info 182 0
364909:             CPU-24444 [006] 39841.890320: kvm_page_fault:       address 9f000 error_code 182

Again VCPUs 1..3 are still in SMM, but the BSP couldn't care less. :)

We're only 35% through the trace but we're actually close to the end.
At 365704 OVMF says it's transferring control to the Linux's wakeup
vector, and Linux takes control real soon:

365805:             CPU-24444 [006] 39841.890477: kvm_exit:             reason CR_ACCESS rip 0x9aec5 info 4 0
365807:             CPU-24444 [006] 39841.890477: kvm_cr:               cr_write 4 = 0xb0
365817:             CPU-24444 [006] 39841.890479: kvm_entry:            vcpu 0

We don't even need to look closer at what happens after this point,
as we can imagine that the APs are just waiting for something to happen.
But if you do look, all you see is reads to the PMTimer, which makes sense.
And a while after, once they are fed up, they bring VCPU 0 back to SMM:

994855               CPU-24446 [000] 39841.982774: kvm_apic:             apic_write APIC_ICR = 0x4200
994856               CPU-24447 [002] 39841.982774: kvm_apic:             apic_write APIC_ICR = 0x4200
994857               CPU-24445 [005] 39841.982774: kvm_apic:             apic_write APIC_ICR = 0x4200
994858               CPU-24446 [000] 39841.982774: kvm_apic_ipi:         dst 0 vec 0 (SMI|physical|assert|edge|dst)
994859               CPU-24445 [005] 39841.982774: kvm_apic_ipi:         dst 0 vec 0 (SMI|physical|assert|edge|dst)
994860               CPU-24447 [002] 39841.982774: kvm_apic_ipi:         dst 0 vec 0 (SMI|physical|assert|edge|dst)
994861               CPU-24446 [000] 39841.982775: kvm_apic_accept_irq:  apicid 0 vec 0 (SMI|edge)
994862               CPU-24445 [005] 39841.982775: kvm_apic_accept_irq:  apicid 0 vec 0 (SMI|edge)
994863               CPU-24447 [002] 39841.982775: kvm_apic_accept_irq:  apicid 0 vec 0 (SMI|edge)

The rendezvous completes, the APs can finally leave SMM but all they can do
is meet their fate and crash horribly:

994869               CPU-24444 [006] 39841.982776: kvm_exit:             reason EXTERNAL_INTERRUPT rip 0xffffffff813a9548 info 0 800000fd
...
994880               CPU-24444 [006] 39841.982777: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
995135:             CPU-24444 [006] 39841.982821: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
995136:             CPU-24445 [005] 39841.982821: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
995137:             CPU-24446 [000] 39841.982821: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
995138:             CPU-24447 [002] 39841.982821: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
995148:             CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
995152:             CPU-24446 [000] 39841.982828: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL

I hope you enjoyed it more than the poor APs. :)

Paolo

>     (a) in the area that used to host the AP startup routine for the MP
>     services PPI -- note that we also have "Transfer to 16bit OS waking
>     vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
>     area completeley! --,
> 
>     (b) and why *after* all four VCPUs have just left SMM, together.
> 
>   * The RSM instruction is handled successfully elsewhere, for example
>     when all four VCPUs leave SMM, at the bottom of the diagram above:
> 
> > CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
> > CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
> > CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
> > CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa
> 
>   * The guest-phys address 7ff7f000 that we see just before the error:
> 
> > CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000
> > error_code 83
> > CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000
> > error_code 83
> > CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
> > CPU-24444 [006] 39841.982827: kvm_exit:             reason
> > EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc
> > CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
> > CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason
> > KVM_EXIT_INTERNAL_ERROR (17)
> 
>     can be found higher up in the trace; namely, it is written to CR3
>     several times. It's the root of the page tables.
> 
>   * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.
> 
> * I also tried the "info tlb" monitor command, via "virsh
>   qemu-monitor-command --hmp", while the guest was auto-paused after the
>   crash.
> 
>   I cannot provide results: QEMU appeared to return a message that would
>   be longer than 16MB after encoding by libvirt, and libvirt rejected
>   that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).
> 
>   Anyway, the KVM trace, and the QEMU register dump, look consistent
>   with what Paolo said about "Code=?? ?? ??...":
> 
>     The question marks usually mean that the page tables do not map a
>     page at that address.
> 
>   CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM
>   (SMM=0). We can't translate *any* guest-virtual address, as we can't
>   even begin walking the page tables.
> 
> Thanks
> Laszlo
>