From mboxrd@z Thu Jan 1 00:00:00 1970
Authentication-Results: mx.groups.io;
 dkim=missing;
 spf=pass (domain: redhat.com, ip: 209.132.183.28, mailfrom: lersek@redhat.com)
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 by groups.io with SMTP; Wed, 31 Jul 2019 11:58:31 -0700
Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id DAEB95859E;
 Wed, 31 Jul 2019 18:58:30 +0000 (UTC)
Received: from lacos-laptop-7.usersys.redhat.com (ovpn-116-110.ams2.redhat.com [10.36.116.110])
 by smtp.corp.redhat.com (Postfix) with ESMTP id 6D2C15D6A7;
 Wed, 31 Jul 2019 18:58:29 +0000 (UTC)
Subject: Re: [edk2-devel] [Patch 0/2] UefiCpuPkg: Default avoid print.
To: "Brian J. Johnson" , devel@edk2.groups.io, Eric Dong
Cc: Ray Ni , Michael Kinney
References: <20190731073502.24640-1-eric.dong@intel.com>
 <3a28f2c6-6ef1-c830-b3c6-3cf69c5ca60f@hpe.com>
From: "Laszlo Ersek"
Message-ID: <93d4a7e7-9b86-b3c9-a476-ac1a40dc4723@redhat.com>
Date: Wed, 31 Jul 2019 20:58:28 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <3a28f2c6-6ef1-c830-b3c6-3cf69c5ca60f@hpe.com>
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Wed, 31 Jul 2019 18:58:30 +0000 (UTC)
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit

On 07/31/19 18:34, Brian J. Johnson wrote:

> I do wonder if there would be a clean way to let a DebugLib instance
> itself declare that AP_DEBUG() is safe. That way a platform would
> only need to override the DebugLib instance in the DSC file, rather
> than both the instance and the PCD. (I know, I'm nitpicking.) A
> library can't override PCDs in its calling modules, of course.
> I suppose the AP_DEBUG() macro could call a new DebugLib entry point
> to test for AP safety before doing anything else, say
> DebugPrintOnApIsSafe(). Or it could even be a global CONST BOOLEAN
> defined by the library. But that would require all DebugLib instances
> to change, which is something you were trying to avoid.

That's right -- I tried to imagine some other approaches, but they'd
need new DebugLib functions, and likely force platforms to write new
code.

> However, it's not always practical to track down all uses of DEBUG().
> An AP can easily call a library routine which uses DEBUG() rather than
> AP_DEBUG(), buried under several layers of transitive library
> dependencies. In other words, it's not always practical to determine
> ahead of time if a given DEBUG() call may be done on an AP.

This problem is valid IMO, but I think its scope is a lot wider than
just DebugLib.

Assume the programmer is looking at a function that may be invoked on an
AP, and they are about to call a function for taking care of a specific
sub-task. If the programmer cannot prove the thread-safety of the
*entire* call tree underneath the function call they are about to add,
they simply must not add the call. The thread-safety of the DebugLib
instance in use is just a part of the thread-safety of said call tree.

Put differently, code that runs on APs must be extremely self-contained;
I'd rule out any and all lib classes from direct use unless a specific
library instance, advertising thread safety, would be chosen in the
platform DSC file.

But, if we adopted this approach, we could even introduce a new
AP-oriented library *class* for debug messages, offer a Null
implementation in edk2, and ask platforms to bring their own.

> I know that AP code runs in a very restricted environment and that
> people who use MpServices are supposed to understand the
> repercussions, but it gets very difficult when libraries are
> involved.
> :(

Exactly -- the first restriction people should understand is, "stay away
from libraries as much as you can".

> So would a better solution be to modify the common unsafe DebugLib
> instances to have DebugPrintEnabled() return FALSE on APs? That would
> probably require a new BaseLib interface to determine if the caller is
> running on the BSP or an AP.

I agree that "AmIAnAP()" would be a pre-requisite.

> (For IA32/X64 this isn't too hard -- it just needs to check a bit in
> the local APIC.

Still not trivial, as some DebugLib instances might want to target
runtime drivers (or even SMM drivers). For runtime drivers the
complication is that a runtime (virtual address) mapping for the LAPIC
MMIO range would be needed (if I understand correctly anyway). And for
both runtime and SMM drivers, it could be a problem that on physical
hardware, the MMIO range of the LAPIC can be moved (reprogrammed) to a
different base address, possibly by the OS too.

I could be quite confused about this, of course; I don't eat LAPICs for
breakfast :) I just recall an SMM firmware vulnerability that was in
part based on moving the LAPIC base address elsewhere. Hm... googling
suggests the attack was called "The Memory Sinkhole".

> I have no idea about other architectures.) That wouldn't solve the
> problem everywhere -- anyone using a custom DebugLib would have to
> update it themselves. But it would solve it solidly in the majority
> of cases.
>
> Thoughts?

My fear might not be reasonable, but I feel quite uncomfortable about
LAPIC accesses in DebugLib APIs. The information ("BSP or AP") is safer
to determine at the call site, I think, even if it takes more human
work.

I could very well be biased. In OvmfPkg we have a minuscule amount of
code that runs on APs, and even that code is written with total
minimalism in mind.

Leaping to a different topic...

Years ago I was tracking down an MTRR setup bug in the Xen hypervisor
(as it was shipped as a part of RHEL5).
It is necessary to set up MTRRs identically on all CPUs, plus it has to
be done while all CPUs are in a "pen" doing nothing but setting up
MTRRs. The bug was exactly in that part of the code (running
simultaneously on more than a hundred CPUs). It was impossible to print
anything to the serial console -- first because it would be unreadable
for humans, and second because the delays would perturb the buggy
behavior.

In the end I had to introduce an array of per-CPU debug structures where
each CPU would record its own view (snapshot) of a shared resource -- a
shared resource that should have been protected by mutual exclusion
between the CPUs. After the CPUs left the "pen" (with the invalid MTRR
configuration established), I'd use the BSP to dump the array. That
showed me that some CPUs had overlapping / inconsistent views of the
shared resource between each other. This proved that the mutual
exclusion primitive didn't work as expected -- it turns out that the
semaphore (or spinlock, not sure) in question used an INT8 counter,
which overflowed when more than 127 CPUs contended for the resource.

I'm sure I'm misremembering parts of this story (from several years'
distance), but the moral is that debugging in a multiprocessing
environment may easily require its own dedicated infrastructure.

In edk2, we don't have anything like that, I think. Could we build it,
sufficiently generally? Like, prepare a log buffer for each CPU before
calling StartupAllAps(), log only to that buffer during the concurrent
execution, and finally dump the buffers?

I guess if we don't *reach* the "finally" part, we could still dump the
RAM and investigate the log buffers that way... Dumping the RAM is
certainly an option for virtual machines, but it might be viable for
physical setups too (JTAG, debug agent... dunno).

Sorry about the wild speculation :)

Thanks
Laszlo
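
A minimal sketch of the per-CPU log buffer idea above, in plain hosted C
rather than edk2 code. Every name here (AP_LOG_BUFFER, ApLogInit,
ApLogPrint, ApLogText, ApLogDump) and both size constants are
hypothetical; nothing like this exists in edk2 as far as I know. It
assumes each CPU already knows its own index, and that vsnprintf() is
safe to call concurrently, which a real AP-side formatter would have to
guarantee on its own.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define AP_LOG_MAX_CPUS  4    /* hypothetical; size for the real CPU count */
#define AP_LOG_BYTES     256  /* hypothetical per-CPU capacity, incl. NUL */

typedef struct {
  size_t Used;                /* bytes stored so far, excluding the NUL */
  char   Text[AP_LOG_BYTES];
} AP_LOG_BUFFER;

static AP_LOG_BUFFER mApLog[AP_LOG_MAX_CPUS];

/* BSP only: clear all buffers before the APs are started. */
void ApLogInit (void)
{
  memset (mApLog, 0, sizeof mApLog);
}

/* Called by CPU number CpuIndex only. It writes to that CPU's own buffer
   and to nothing else, so no locking is needed. Output that does not fit
   is silently truncated. */
void ApLogPrint (size_t CpuIndex, const char *Format, ...)
{
  AP_LOG_BUFFER *Log;
  size_t        Room;
  int           Written;
  va_list       Args;

  assert (CpuIndex < AP_LOG_MAX_CPUS);
  Log  = &mApLog[CpuIndex];
  Room = sizeof Log->Text - Log->Used;  /* at least 1, for the NUL */

  va_start (Args, Format);
  Written = vsnprintf (Log->Text + Log->Used, Room, Format, Args);
  va_end (Args);

  if (Written < 0) {
    Log->Text[Log->Used] = '\0';        /* formatting error: drop message */
  } else if ((size_t)Written >= Room) {
    Log->Used += Room - 1;              /* truncated: buffer is now full */
  } else {
    Log->Used += (size_t)Written;
  }
}

/* Read access to one CPU's log, for the BSP or for a test. */
const char *ApLogText (size_t CpuIndex)
{
  assert (CpuIndex < AP_LOG_MAX_CPUS);
  return mApLog[CpuIndex].Text;
}

/* BSP only, after all APs have finished: print the non-empty buffers. */
void ApLogDump (FILE *Out)
{
  size_t Index;

  for (Index = 0; Index < AP_LOG_MAX_CPUS; Index++) {
    if (mApLog[Index].Used > 0) {
      fprintf (Out, "CPU %zu: %s\n", Index, mApLog[Index].Text);
    }
  }
}
```

The BSP would call ApLogInit() before starting the APs, each AP would
call ApLogPrint() with its own index only, and the BSP would call
ApLogDump() once the APs are done. Because the array is a plain static
object, its contents could also be located in a RAM dump if the final
step is never reached.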