KVM API Integration

Christopher Pelloux edited this page Feb 26, 2022 · 38 revisions

Description

The goal of this effort is to implement KVM support in MicroV.

  • For more information on KVM, please see this.
  • For more information on MicroV, please see this.

MicroV

High Level Components

KVM normally consists of a userspace application that performs emulation and handles some VMExits (not all). Traditionally this application is QEMU, but it could also be kvmtool from Google or rust-vmm from Amazon's Firecracker (just to name the big ones). In phase 1, we will focus on QEMU, ensuring MicroV is executing VMs properly on AMD hardware. Future phases will add support for rust-vmm which requires additional KVM APIs to be implemented to work properly.

The phase 1 high level components are:

  • Bareflank Microkernel: This is what will run in Ring 0 of VMX-root. Its only job is to load the MicroV extension and handle privileged operations, memory management, state transition, policy enforcement, etc. It doesn't actually implement a hypervisor, but it is aware of HVE tasks that an extension needs help with.
  • MicroV: This is a Bareflank Microkernel extension that runs in userspace of VMX root. It is the thing that implements the hypervisor (sort of, more on that later). MicroV's main job is to implement the MicroV ABI Spec, which defines how a VMX nonroot kernel and userspace communicate with MicroV to cooperatively implement a complete VMM. For Phase 1, the MicroV ABI Spec is a hypercall interface, most of which just calls into the Microkernel to handle state reads/writes. MicroV will also trap on VMExits and, for the most part, return the VMExit information to the kernel in VMX nonroot, which in turn will hand the information to userspace in VMX nonroot to actually handle, meaning for most operations MicroV is just a pass-through mechanism. In future Phases, MicroV will also have to handle LAPIC and IOAPIC emulation, which is needed to support rust-vmm and is generally better for performance.
  • KVM Driver: This is a wrapper driver designed to emulate the actual KVM driver. All of the existing KVM tools expect to run from VMX root and simply IOCTL into the Linux kernel to implement Guest VM support. With MicroV, Linux is running in VMX nonroot, and therefore the Linux kernel cannot handle most of the APIs userspace will be asking of it. Instead, this driver will simply forward the IOCTL to MicroV using the MicroV ABI Spec.

Step 1 (quick demo)

The first step will be to get the following working: https://zserge.com/posts/kvm/

This will require that we verify that this works with regular KVM. Once that is done, we can use this to ensure that MicroV can run the same code. Getting MicroV to run such a simple example will ensure all of the basics are in place and working. This includes the following KVM APIs:

  • SHIM_INIT
  • SHIM_FINI
  • KVM_CREATE_VM
  • KVM_CREATE_VCPU
  • KVM_SET_USER_MEMORY_REGION
  • KVM_GET_VCPU_MMAP_SIZE
  • KVM_GET_REGS
  • KVM_SET_REGS
  • KVM_GET_SREGS
  • KVM_SET_SREGS
  • KVM_RUN

When userspace calls an IOCTL, it will end up in the entry.c code in the shim driver. The entry.c code will call the appropriate dispatch_xxx function. These dispatch functions DO NOT IMPLEMENT the hypercall. All they do is execute copy_to_user, copy_from_user to/from a struct on the stack and then call the appropriate handle_xxx function in the shim's src directory. For an example of the copy functions, see the following: https://elixir.bootlin.com/linux/v5.13-rc7/source/virt/kvm/kvm_main.c#L3504

The handle_xxx functions actually implement the KVM IOCTL. The handle_xxx functions CANNOT CALL LINUX APIS. This code is designed to be common between all operating systems. If a Linux API is needed, it should use a platform_xxx function to do it. If a platform_xxx function is missing, please reach out over Slack with what you think should be added. We will need to work together to determine what the API should look like to ensure it will work with other operating systems. In general, the handle_xxx functions will be making the actual calls to mv_xxx hypercalls, implementing KVM IOCTLs that are shim only (meaning there are no hypercalls to make), and handling KVM to MicroV ABI conversions as needed.

For KVM_RUN, we will need the following exit reasons implemented:

  • KVM_EXIT_IO

From a MicroV point of view, we will need the following ABIs defined and implemented:

  • mv_id_op_version
  • mv_handle_op_open_handle
  • mv_handle_op_close_handle
  • mv_pp_op_get_shared_page_gpa
  • mv_pp_op_set_shared_page_gpa
  • mv_vm_op_create_vm
  • mv_vm_op_mmio_map
  • mv_vm_op_mmio_unmap
  • mv_vp_op_create_vp
  • mv_vs_op_create_vs
  • mv_vs_op_reg_get
  • mv_vs_op_reg_set
  • mv_vs_op_reg_get_list
  • mv_vs_op_reg_set_list
  • mv_vs_op_run
  • mv_vs_op_gla_to_gpa

SHIM_INIT:
This will call mv_id_op_version and make sure the version is correct. If it is, it will call mv_handle_op_open_handle to get a handle. It will then, on each PP, call mv_pp_op_set_shared_page_gpa to set the GPA of the PP's shared page. This shared page will be used to pass non-register based arguments between MicroV and the shim. For example, mv_vs_op_run takes a structure, as there is more data than can fit in the registers alone. The shared page is used for all of this.

SHIM_FINI:
Calls mv_handle_op_close_handle

KVM_CREATE_VM:
Calls mv_vm_op_create_vm. This will return a VMID. The shim will have to create an FD for userspace software. Any time that FD is used, it will need to use the VMID associated with the FD for MicroV hypercalls.

KVM_CREATE_VCPU:
Calls mv_vp_op_create_vp and mv_vs_op_create_vs. MicroV has a VP and a VS. Most of the time, you will just work with the VS. For this first step, think of them as the same thing. These will return a VPID and a VSID. The shim will have to create an FD for userspace software. Any time that FD is used, it will need to use the VPID or VSID associated with the FD for MicroV hypercalls.

KVM_SET_USER_MEMORY_REGION:
Calls mv_vm_op_mmio_map. Will have to perform GVA to GPA conversions using virt_to_phys from the Linux kernel. DO NOT USE mv_vs_op_gva_to_gla or mv_vs_op_gla_to_gpa. Those hypercalls are only needed for integration testing, and they should not be used by the shim unless we have to implement KVM_TRANSLATE in the future. This IOCTL might also have to break this one call into multiple hypercalls if it cannot fit into a single hypercall. Also, this hypercall might perform a continuation as it is slow, so the IOCTL might have to handle this. Here are a couple of important notes:

  • Userspace for a normal VM might ask the shim to map gigabytes of memory. The memory set here is the "physical" memory that the guest will use, so it is the guest's representation of RAM. Like actual MMIO on x86, there might be multiple regions. Some are RAM, some are memory mapped PCI devices, etc. So userspace may actually call this more than once. That is why KVM has this idea of "slots". We will have to implement this.
  • The memory region provided by userspace CANNOT BE PAGED OUT. This is important. The shim driver will have to "lock" this memory. Not sure how this is done from the Linux kernel, but we need to determine what Linux APIs are used to take a userspace buffer of memory and tell the kernel that it cannot be paged out.
  • Userspace will provide a "userspace" virtual address and a size. This is not a "kernel" virtual address. It is a "userspace" virtual address (thanks to Meltdown, they are not the same). MicroV only talks "guest physical addresses (GPAs)". The shim driver is executing in the kernel, which is in a VM, so the kernel's idea of a physical address is a GPA. So, all the shim has to do is a userspace address (i.e., virtual address) to physical address (i.e., GPA) translation.
  • On Linux, the virt_to_phys function may or may not work. Linux sadly has a million ways to translate from virt to phys depending on how the memory was allocated. What the right APIs are will need to be figured out. Again, look at the KVM driver as it has to do this already. Since all of this code will be in the "src" directory, it has to be cross platform, so platform_virt_to_phys should be updated and used here.
  • The userspace provided memory region is virtually contiguous. This does not mean it is physically contiguous. This means that the shim driver will need to start with the userspace provided virtual address, look up its physical address and record it in an MDL entry. It will then have to add 4k (i.e., 0x1000) to the virtual address, and look up the physical address. If the physical address is still physically contiguous, the MDL entry's size can be increased by 4k. If the physical address is not physically contiguous, a new MDL entry will have to be created. This process is then repeated until the entire userspace memory buffer has been translated for every 4k page. This ensures that we are keeping memory in the MDL that will be provided to MicroV as physically contiguous as possible. With any luck, the MDL will actually describe 2M or higher physical pages which can save memory on the MicroV side of things. If we want to get fancy, we could even loop through the MDL for every translation to ensure that the current page being translated does not already exist in an MDL entry. For example, if we see 0x1000, 0x3000, and 0x2000, these addresses all look like they are NOT physically contiguous, but we know they are, they are just not in the right order. Once all of the MDL entries are recorded, we could loop through all of the entries and see if we can combine them, ensuring that we provide MicroV with an MDL with the least number of entries (which really ensures that physical memory is as contiguous as possible).
  • Since it is possible that every virtual address to physical address translation will be completely random, we might end up with an MDL that has one entry for every 4k virtual address being set. The shim may therefore have to call MicroV's hypercall several times, as you can only fit a limited number of MDL entries in the shared page. To handle this, the shim should reserve a buffer of memory to fit buffer/4k entries (which is the worst case). All of the MDL entries should be calculated and combined. Once that is done, the shim can loop through the entries, adding them to the shared page. Once the shared page is full, mv_vm_op_mmio_map is called. From there, the shim starts at the beginning of the shared page and continues to add MDL entries and the process is repeated until MicroV has been given all of them. The reason we should calculate all of the MDL entries first, and then start making calls to MicroV is to ensure that we have translated all of the virtual to physical addresses, and combined as many as possible.
  • Mapping memory is slow. For phase 1, we will likely just deal with this. But, in the future, MicroV will likely track how much time it is taking to complete this hypercall. If it takes too long, interrupts will begin to pile up and cause problems. The "HyperV Top Level Specification" defines about how long this can take before bad things happen. To ensure the hypercall only takes a certain amount of time, once the time has elapsed, MicroV will return from the hypercall with a RETRY failure code. If the shim sees this failure code, it needs to execute this hypercall again, with the same exact parameters. Literally just run the hypercall again. MicroV will continue where it left off. The reason MicroV returns with the error code is that once it returns, it is likely that the shim will not actually execute, and instead, the Root VM will have to handle a bunch of interrupts. Once these are done, the kernel will continue to execute the shim, and since the shim will see the RETRY status code, it will simply execute the hypercall again, and MicroV will continue where it left off. This provides MicroV with a means to give the root VM some time to do some housekeeping for long running hypercalls.
  • The ABI actually provides a means for the shim to tell MicroV that it will handle a continuation when it wants to, using a hypercall flag. This allows the shim to make other hypercalls before performing the continuation in case it has housekeeping to do as well. This flag exists, but we don't have any plans to support it right now. We simply added it in case it is needed, as Xen currently has this.
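
The page-by-page walk and the chunked, retried submission described in the notes above can be sketched in C. This is only a sketch under stated assumptions: struct mdl_entry, mdl_coalesce, mdl_submit, the MV_STATUS_* values, the mmio_map_fn callback and demo_mmio_map are hypothetical stand-ins and not the MicroV ABI. The per-page physical addresses are assumed to have been translated already (for example by platform_virt_to_phys), and only the in-order coalescing step is shown, not the optional out-of-order combine pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_4K 0x1000ULL

/* Hypothetical status codes; the real values come from the MicroV ABI. */
#define MV_STATUS_SUCCESS 0
#define MV_STATUS_RETRY 1

/* Hypothetical MDL entry: one physically contiguous range. */
struct mdl_entry {
    uint64_t gpa;
    uint64_t size;
};

/* Hypothetical stand-in for a wrapper around mv_vm_op_mmio_map. */
typedef int (*mmio_map_fn)(const struct mdl_entry *entries, size_t count);

/*
 * Walk the per-page physical addresses in virtual order. Grow the current
 * entry by 4k while the next page is physically adjacent, otherwise start
 * a new entry. `mdl` must have room for `pages` entries (the worst case).
 * Returns the number of entries written.
 */
size_t mdl_coalesce(const uint64_t *page_gpas, size_t pages, struct mdl_entry *mdl)
{
    size_t n = 0;

    assert(page_gpas != NULL && mdl != NULL);

    for (size_t i = 0; i < pages; ++i) {
        if (n > 0 && mdl[n - 1].gpa + mdl[n - 1].size == page_gpas[i]) {
            mdl[n - 1].size += PAGE_4K;
        } else {
            mdl[n].gpa = page_gpas[i];
            mdl[n].size = PAGE_4K;
            ++n;
        }
    }

    return n;
}

/*
 * Hand the finished MDL to the hypercall in chunks of at most `per_call`
 * entries (whatever fits in the shared page). A RETRY status means "run
 * the same call again with the same parameters".
 */
int mdl_submit(const struct mdl_entry *mdl, size_t n, size_t per_call, mmio_map_fn map)
{
    assert(per_call > 0 && map != NULL);

    for (size_t i = 0; i < n; i += per_call) {
        size_t count = (n - i < per_call) ? (n - i) : per_call;
        int status;

        do {
            status = map(&mdl[i], count);
        } while (status == MV_STATUS_RETRY);

        if (status != MV_STATUS_SUCCESS) {
            return status;
        }
    }

    return MV_STATUS_SUCCESS;
}

/* Demo-only fake hypercall: asks for one retry per chunk, then succeeds. */
static size_t demo_calls;
static size_t demo_entries_mapped;

int demo_mmio_map(const struct mdl_entry *entries, size_t count)
{
    (void)entries;
    if ((demo_calls++ % 2) == 0) {
        return MV_STATUS_RETRY;
    }
    demo_entries_mapped += count;
    return MV_STATUS_SUCCESS;
}
```

In the real shim the chunk size would be derived from how many entries fit in the shared page, and the callback would copy each chunk into that page before issuing the hypercall.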

KVM_GET_VCPU_MMAP_SIZE:
Returns the size of the KVM_RUN struct. No hypercalls are needed for this.

KVM_GET_REGS:
Calls mv_vs_op_reg_get_list. This hypercall allows you to fill in a list of the registers that you want and it will return their value or it will return an error code. Simply ask for the registers that KVM wants for this hypercall.

KVM_SET_REGS:
Calls mv_vs_op_reg_set_list. This hypercall allows you to fill in a list of the registers that you want to set and it will set their values or it will return an error code. Simply ask to set the registers that KVM wants for this hypercall.

KVM_GET_SREGS:
Calls mv_vs_op_reg_get_list and mv_vs_op_msr_get_list. These hypercalls allow you to fill in a list of the registers and MSRs that you want and they will return their values or an error code. Simply ask for the registers and MSRs that KVM wants for this IOCTL.

KVM_SET_SREGS:
Calls mv_vs_op_reg_set_list and mv_vs_op_msr_set_list. These hypercalls allow you to fill in a list of the registers and MSRs that you want to set and they will set their values or return an error code. Simply ask to set the registers and MSRs that KVM wants for this IOCTL.

KVM_RUN:
Calls mv_vs_op_run. This IOCTL will have to translate between MicroV's ABI and KVM's. The following are some important notes:

  • The way that KVM_RUN works is that when the userspace application is ready to run the VM, it will do so by making a call to KVM_RUN which runs a VCPU. When guest SMP support is finally added, you would actually have more than one thread (one for each VCPU), and KVM_RUN is executed by each thread. When MicroV detects that there is something for the userspace app to complete, it will return from KVM_RUN with an exit reason and, in some cases, information needed to handle the exit. Once the exit has been handled by the userspace application, it will execute KVM_RUN again, and the process repeats until it is time to kill the VM, or the VM kills itself.
  • MicroV will have something similar to KVM_EXIT_IO, but it will not have anything for KVM_EXIT_SHUTDOWN. KVM_EXIT_SHUTDOWN is a Linux specific thing that the shim will have to implement. How this is done is currently unknown. The application that runs the VM will be in an endless loop, so I assume that a CTRL+C would be used to stop the VM. How the shim driver catches this and returns from KVM_RUN with KVM_EXIT_SHUTDOWN is unknown. Look at the KVM driver to see how it handles this.

Step 2 (QEMU demo)

The second step will be to get QEMU working without the need for MicroV to handle interrupts (meaning no IRQCHIP). https://github.com/qemu/qemu

Like step one, this will not include guest SMP support. Since QEMU will be handling guest interrupts and LAPIC, IOAPIC, PIC and PIT emulation, the guest will be slower than we would like, but this demo will provide the ability to run a full Linux Ubuntu 18.04/20.04 VM. This demo will include the ability to run a guest VM on any root VP, including root VP migration (meaning the root OS can move the guest VM from one root VP to another). The following additional IOCTLs will need to be implemented:

  • KVM_GET_API_VERSION
  • KVM_CHECK_EXTENSION
  • KVM_GET_MSR_INDEX_LIST
  • KVM_GET_MSRS
  • KVM_SET_MSRS
  • KVM_SET_TSS_ADDR
  • KVM_SET_IDENTITY_MAP_ADDR
  • KVM_IOEVENTFD
  • KVM_GET_SUPPORTED_CPUID
  • KVM_SET_CPUID2
  • KVM_SET_MP_STATE
  • KVM_GET_MP_STATE
  • KVM_GET_FPU
  • KVM_SET_FPU
  • KVM_GET_TSC_KHZ
  • KVM_SET_CLOCK
  • KVM_GET_CLOCK
  • KVM_INTERRUPT

Is this still needed?

  • KVM_SET_SIGNAL_MASK

For KVM_CHECK_EXTENSION, we will need the following implemented:

  • KVM_CAP_EXT_CPUID: return 1
  • KVM_CAP_NR_VCPUS: return mv_pp_op_online_pps()
  • KVM_CAP_NR_MEMSLOTS: return MICROV_MAX_SLOTS
  • KVM_CAP_GET_TSC_KHZ: return 1
  • KVM_CAP_MAX_VCPUS: return MICROV_MAX_VCPUS
  • KVM_CAP_MAX_VCPU_ID: return INT16_MAX
  • KVM_CAP_TSC_DEADLINE_TIMER: return 1
  • KVM_CAP_USER_MEMORY: return 1
  • KVM_CAP_DESTROY_MEMORY_REGION_WORKS: return 1
  • KVM_CAP_JOIN_MEMORY_REGIONS_WORKS: return 1
  • KVM_CAP_SET_TSS_ADDR: return 1
  • KVM_CAP_MP_STATE: return 1
  • KVM_CAP_IMMEDIATE_EXIT: return 1

For KVM_RUN, we will need the following exit reasons implemented:

  • KVM_EXIT_INTR
  • KVM_EXIT_MMIO
  • KVM_EXIT_SYSTEM_EVENT

From a MicroV point of view, we will need the following ABIs defined and implemented:

  • mv_pp_op_msr_get_supported
  • mv_pp_op_cpuid_get_supported
  • mv_pp_op_tsc_get_khz
  • mv_pp_op_tsc_set_khz
  • mv_vs_op_msr_get
  • mv_vs_op_msr_set
  • mv_vs_op_msr_get_list
  • mv_vs_op_msr_set_list
  • mv_vs_op_cpuid_get
  • mv_vs_op_cpuid_set
  • mv_vs_op_cpuid_get_list
  • mv_vs_op_cpuid_set_list
  • mv_vs_op_fpu_get
  • mv_vs_op_fpu_set
  • mv_vs_op_mp_state_get
  • mv_vs_op_mp_state_set
  • mv_vs_op_clock_get
  • mv_vs_op_clock_set
  • mv_vs_op_interrupt
  • mv_vs_op_tsc_get_khz

In addition, MicroV will need to add support for the following additional features:

  • vm switch support
  • control register emulation
  • msr emulation
  • cpuid emulation
  • proper io emulation
  • proper mmio emulation
  • interrupt management
  • clock support
  • mtrr support
  • EPT/NPT enabled in the root VM

KVM_GET_API_VERSION:
Shim Only: Return whatever KVM currently returns.

KVM_CHECK_EXTENSION:
Shim Only: Return non-supported for everything except the extensions listed above.

KVM_GET_MSR_INDEX_LIST:
Call mv_pp_op_msr_get_supported and translate as needed.

KVM_SET_TSS_ADDR:
Call mv_vs_op_reg_set for mv_reg_t_tr_base

KVM_IOEVENTFD:
Shim Only: This is probably the most complicated modification to the shim that is needed for step 2. The notes about how it works can be seen in the following link: https://patchwork.kernel.org/project/kvm/patch/[email protected]/

This is a VM IOCTL, not a VCPU IOCTL. Thankfully, we store the VM associated with a VCPU in the shim_vcpu_t, so when an exit occurs, we can get to the VM object to support the proper execution of this API. One thing that is not clear is how KVM handles the performance of this API. Specifically, it seems KVM is likely storing a list, and must parse at O(N) to determine which event needs to be triggered. If this is true, we will need a static array in the shim_vm_t of kvm_ioeventfd structs (size MICROV_MAX_IOEVENTFDS which will need to be added to the CMake config), and then we would just loop through it on every IO/MMIO exit to determine if there is an event to trigger. This seems terrible because it would add an O(N) search to every PIO/MMIO exit, so research into how this might be handled by KVM should be done. Either way, there is nothing to do on the MicroV side of things to support this.

Also, userspace will likely be "waiting" on the event in a thread, so our integration tests will need some thread logic to support the proper testing of this.
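
The static-array approach described above amounts to a linear scan on every PIO/MMIO exit. A minimal sketch of that scan follows, assuming a trimmed-down, hypothetical registration record (struct ioeventfd_slot) rather than the real kvm_ioeventfd layout. The flag macros are local illustrative definitions, and the matching rules (exact address and length, optional data match) are a simplification and may not cover everything KVM does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local, illustrative flag definitions (not taken from linux/kvm.h). */
#define IOEVENTFD_FLAG_DATAMATCH (1U << 0)
#define IOEVENTFD_FLAG_PIO (1U << 1)

/* Hypothetical registration record kept in an array in the shim_vm_t. */
struct ioeventfd_slot {
    uint64_t addr;      /* PIO port or MMIO address being watched */
    uint64_t datamatch; /* only checked when FLAG_DATAMATCH is set */
    uint32_t len;       /* access width in bytes */
    uint32_t flags;
    int32_t fd;         /* eventfd to signal on a match */
    int in_use;
};

/*
 * O(N) scan run on every PIO/MMIO write exit. Returns the index of the
 * first matching slot, or -1 if the exit must be returned to userspace.
 */
int ioeventfd_find(const struct ioeventfd_slot *slots, size_t count,
                   uint64_t addr, uint32_t len, uint64_t data, int is_pio)
{
    assert(slots != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        const struct ioeventfd_slot *s = &slots[i];
        int slot_is_pio = (s->flags & IOEVENTFD_FLAG_PIO) != 0U;

        if (!s->in_use || slot_is_pio != (is_pio != 0)) {
            continue;
        }
        if (s->addr != addr || s->len != len) {
            continue;
        }
        if ((s->flags & IOEVENTFD_FLAG_DATAMATCH) != 0U && s->datamatch != data) {
            continue;
        }
        return (int)i;
    }

    return -1;
}
```

If the scan turns out to be too slow, the same interface could sit on top of a sorted array or hash keyed by address without changing the callers.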

KVM_GET_SUPPORTED_CPUID:
Call mv_pp_op_cpuid_get_supported and translate as needed.

KVM_GET_TSC_KHZ:
Call mv_pp_op_tsc_get_khz

KVM_SET_CPUID2:
Call mv_vs_op_cpuid_set

KVM_SET_MSRS:
Call mv_vs_op_msr_set_list

KVM_SET_SIGNAL_MASK:
Shim Only: TBD

KVM_GET_CLOCK:
Call mv_vs_op_clock_get

KVM_SET_FPU:
Call mv_vs_op_fpu_set

KVM_SET_MP_STATE:
Call mv_vs_op_mp_state_set

KVM_GET_MP_STATE:
Call mv_vs_op_mp_state_get

KVM_GET_FPU:
Call mv_vs_op_fpu_get

KVM_GET_MSRS:
Call mv_vs_op_msr_get_list

KVM_SET_CLOCK:
Call mv_vs_op_clock_set

KVM_INTERRUPT:
Call mv_vs_op_interrupt

mv_pp_op_msr_get_supported:
This calls pp_pool.msrs_get_supported(), which needs to be given an RDL so that it can return the results. In general, this should only return the MSRs defined in AMD's VMCB (for both Intel and AMD), as well as the APIC BASE. If others are needed they can be added, but support for these MSRs will have to be added to the VS using the emulated_msrs_t.

mv_pp_op_cpuid_get_supported:
This calls pp_pool.cpuid_get_supported(), which needs to be given a CDL so that it can return the results. Which bits should be enabled in CPUID is currently in Boxy (do not use MicroV/mono for this as it supports features we want to add in step 4).

mv_pp_op_tsc_get_khz:
The first thing to do is ensure that an invariant TSC is supported. This is CPUID code that must be added to the PP init logic, which means that the PP init code will need to be modified to return an error code (don't forget to add a bsl::finally to release() on error). Other CPUID checks can be added here too.

This calls pp_pool.tsc_get(). The code for getting the TSC frequency can be found in Boxy. It is a combination of CPUID (0x15 and 0x16), an MSR and some custom CPUID leaves from VMware. We need to be able to support VMware. To do this, we can use CPUID, and if the 0x40000XXX leaf returns a proper frequency and we are actually on VMware, we can return the frequency reported by VMware. The rest of the frequency information can be found in CPUID or some MSRs (sadly, on Intel, this information is all over the place). There are a couple of CPUs that need this frequency information hardcoded into the kernel. See the Linux kernel for which ones, as CPUID reports 0, and so do the MSRs, but the TSC is invariant. This is a bug with Intel on that one.

As for mv_pp_op_tsc_set_khz, which we do not need right now, but might want in the future, this is only needed for migration when going from a fast CPU to a slower CPU. In this case, you can use TSC scaling, which allows you to set the CPU frequency to something slower than what hardware reports.

mv_vs_op_msr_get:
mv_vs_op_msr_set:
mv_vs_op_msr_get_list:
mv_vs_op_msr_set_list:
All four of these simply make their way from the dispatch logic to the vs_t. See the RDL code in the dispatch helpers, and the VS dispatch logic on how to sanitize the hypercall before handling this. The dispatch logic should have an infinitely wide contract and handle all possible inputs, good and bad, before the vs_t ever gets called. The vs_t should declare all assumptions using narrow contracts, and make them as tight as possible to ensure that any mistakes in the dispatch logic are caught while debugging.

MSRs that are defined in AMD's VMCB are handled by the Microkernel, and so calls to the microkernel can be made to get/set these values. The APIC BASE needs to go to the emulated_lapic_t for set/get. All other MSRs that might be needed would go to the emulated_msrs_t, where the MSR would be stored.

Any MSR that is not on this list either needs to be added, or it needs to inject a GPF. If any other MSRs are needed, it is possible that we simply need to add them to the emulated_msrs_t, but do nothing with them. Meaning, just because the VM needs the MSR does not mean that we need to support the features that the MSR is asking for. Instead, we can just allow the VM to read/write to the MSR, and throw an error if it configures a bit that we know would have to be supported by the VMM. In other cases, the VM might actually need support, and the VMM will have to handle it. With Boxy, the only MSRs were the feature MSRs, which we made sure nothing was turned on that we needed to handle, and some PERF MSRs that can be faked as they are not supported, so there really were no MSRs that had to be supported. To get this to work however, we had a custom kernel with most features removed from the kernel, which we cannot do as we need to support Ubuntu 20.04, so it is possible that some MSRs will have to be added.

NOTE: Any MSR that does not throw a GPF must be added to the supported list in the PP found above.

mv_vs_op_cpuid_get:
mv_vs_op_cpuid_set:
mv_vs_op_cpuid_get_list:
mv_vs_op_cpuid_set_list:
All four of these simply make their way from the dispatch logic to the vs_t. See the RDL code in the dispatch helpers, and the VS dispatch logic on how to sanitize the hypercall before handling this. The dispatch logic should have an infinitely wide contract and handle all possible inputs, good and bad, before the vs_t ever gets called. The vs_t should declare all assumptions using narrow contracts, and make them as tight as possible to ensure that any mistakes in the dispatch logic are caught while debugging.

The CPUID bits that need to be supported can be found in Boxy. Do not use MicroV/mono for this as features like AVX that we want to support in Step 4 are used in Mono already. Boxy has the more minimal feature set defined.

get()/get_list() simply return whatever the guest would see. Meaning, if the guest ran CPUID, the value that the vs_t would return is what is returned here, just using a mv_cdl_entry_t and mv_cdl_t.

set()/set_list() allow userspace to change which features are supported. If the call attempts to enable a bit, an error must be returned as this is not supported by either MicroV or KVM. You can only disable bits. If a bit is disabled, it is possible that we might have to change some inner workings of MicroV, although at the moment, none come to mind.
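
The "disable only" rule reduces to a single mask check per CPUID register. A minimal sketch, assuming a hypothetical helper cpuid_reg_set that works on one 32-bit register value at a time (the real code would apply it to each register of each mv_cdl_entry_t):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: apply a requested CPUID register value to the value
 * the guest currently sees. Bits may only be cleared, never set.
 * Returns 0 on success, or -1 (leaving *current untouched) if the request
 * tries to enable a bit that is not currently enabled.
 */
int cpuid_reg_set(uint32_t *current, uint32_t requested)
{
    assert(current != NULL);

    if ((requested & ~(*current)) != 0U) {
        return -1;
    }

    *current = requested;
    return 0;
}
```

Rejecting the whole request rather than silently masking it keeps the behaviour predictable for userspace: either every requested bit is honoured or nothing changes.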

NOTE: Any CPUID feature that does not return 0 must be added to the supported list in the PP found above.

mv_vs_op_fpu_get:
mv_vs_op_fpu_set:
This just calls vs_pool.fpu_get(), vs_pool.fpu_set(). There is also no reason not to implement the XSAVE part of this as well as it is basically the same thing. The idea with these is to just copy the state save area, and return what mode the vs_t is in so that userspace knows how to parse the shared page.

Code will have to be added to the PP to ensure that XSAVE support is provided, and that the size of the XSAVE region is smaller than a page. The hypercall ABI will be future proof, allowing you to ask for more than one page if needed, but in general, XSAVE will likely always be a single page.

mv_vs_op_mp_state_get:
mv_vs_op_mp_state_set:
This just calls vs_pool.mp_state_get(), vs_pool.mp_state_set(). Every VS (root and guest) starts in mv_mp_state_t_initial. Once a VS is run, it is set to mv_mp_state_t_running. Eventually the VS will hit a state where it cannot process anything until an interrupt is fired. On x86, there are a couple of ways this can happen, but the one to start with (as MWAIT will be disabled) is the HLT instruction. Once HLT is executed, the VS enters mv_mp_state_t_wait. If a CLI/HLT occurs, we enter a shutdown event (as the only thing that can get the VS to execute at that point is an NMI). On x86, we can also tell the VS to enter an INIT state by setting the MP state to mv_mp_state_t_init. When this happens, on Intel we need to set the guest activity state (see Bareflank v2.1 for details). There is nothing to do in the INIT state but wait for a SIPI. Once we see SIPI, the VS needs to be initialized to its initial SIPI state. And then again, we enter mv_mp_state_t_running on the next execution. So the FSM would look like:

              ---------------------
              |                   |
              v                   |
   initial -> running -> wait     |
   |   ^ ^    |    ^     |        |
   |   | |    |    |     |        |
   |   | ------    -------        |
   |   -------------------        |
   |                              |
   ---> init -> sipi --------------

For now, since we are not supporting SMP, we will always start in mv_mp_state_t_initial, and once we see a run command, we will enter the mv_mp_state_t_running state and stay there until we see a HLT, which will move us into the wait state. We move back to the running state once we are run again.
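
The transitions named in the text can be written down as a small transition function. This is a sketch, not the MicroV implementation: the enum names and values, the event names and mp_state_next are hypothetical, and only the transitions spelled out above are modelled (initial to running on run, running to wait on HLT, wait to running on run, initial to init, init to sipi, sipi to running on run). Any other event leaves the state unchanged.

```c
#include <assert.h>

/* Hypothetical mirror of the mv_mp_state_t states named in the text. */
enum mp_state {
    MP_STATE_INITIAL,
    MP_STATE_RUNNING,
    MP_STATE_WAIT,
    MP_STATE_INIT,
    MP_STATE_SIPI
};

/* Hypothetical events that drive the state machine. */
enum mp_event {
    MP_EVENT_RUN,      /* the VS is run */
    MP_EVENT_HLT,      /* the guest executed HLT */
    MP_EVENT_SET_INIT, /* mp_state_set() asked for the INIT state */
    MP_EVENT_SIPI      /* a SIPI was seen */
};

/* Returns the next state; events that do not apply leave the state as is. */
enum mp_state mp_state_next(enum mp_state state, enum mp_event event)
{
    assert((int)state >= (int)MP_STATE_INITIAL && (int)state <= (int)MP_STATE_SIPI);

    switch (state) {
    case MP_STATE_INITIAL:
        if (event == MP_EVENT_RUN) {
            return MP_STATE_RUNNING;
        }
        if (event == MP_EVENT_SET_INIT) {
            return MP_STATE_INIT;
        }
        break;
    case MP_STATE_RUNNING:
        if (event == MP_EVENT_HLT) {
            return MP_STATE_WAIT;
        }
        break;
    case MP_STATE_WAIT:
        if (event == MP_EVENT_RUN) {
            return MP_STATE_RUNNING;
        }
        break;
    case MP_STATE_INIT:
        if (event == MP_EVENT_SIPI) {
            return MP_STATE_SIPI;
        }
        break;
    case MP_STATE_SIPI:
        if (event == MP_EVENT_RUN) {
            return MP_STATE_RUNNING;
        }
        break;
    }

    return state;
}
```

Without SMP, only the initial, running and wait states are ever reached, so the init and sipi arms can be left unexercised until AP bring-up is added.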

Once we implement UEFI support, the UEFI driver responsible for starting up MicroV (which is not the loader in the Hypervisor repo as the examples in the hypervisor repo cannot start an OS from UEFI, so there needs to be a second, MicroV specific application that does this), will call mv_vs_op_mp_state_set(), and set the state of each AP to INIT. From there, it will start the OS. The OS will boot each AP, and transition the AP through the INIT/SIPI process until it is running.

When we add SMP support, the same thing will occur, but from a QEMU point of view. QEMU will create the APs, and then set their states to INIT. Each AP will have its own LAPIC ID, and MicroV will see the signals for INIT/SIPI to these APs coming from the guest OS, and INIT/SIPI will be executed. The code for how INIT/SIPI is done is the same between UEFI and SMP. It's just who starts the process that changes, and in the UEFI case, the INIT/SIPIs are real, going to real LAPIC IDs. In the guest SMP case, the INIT/SIPI will actually be sent to the PP, but which PP gets this signal will depend on how we translate the fake LAPIC IDs to real ones.

The hardest part about SMP support is not INIT/SIPI. It is interrupts and race conditions. Once we add SMP support, any mods to the guest's EPT/NPT will need to be remotely flushed (meaning any PP that has executed the VS must be flushed just in case the TLB has entries from the VM). Likely the best way to handle this is to keep track of which PPs a VM has touched. We already have an active array in the VM, we just need to add a dirty array as well. The other issue with interrupts is when you need to deliver an interrupt, like an INIT/SIPI, IPI, external interrupt, or whatever to a specific VS, that interrupt needs to be physically sent to the PP that the VS is assigned to. You CANNOT modify a VS unless that VS is on your PP, or it can be migrated to your PP first. So, for example, suppose an interrupt that should be given to VS 1, assigned to PP 1, is taken on PP 0. MicroV will see the trap on PP 0. But the interrupt needs to be given to VS 1. You either need to migrate VS 1 to PP 0 (bad idea), or you need to IPI PP 1 so that you can trap on PP 1 and then queue the interrupt.

The other complicated part about SMP support is race conditions. Once you add SMP support, any missing locks will result in race condition based corruption or deadlock. DO NOT create global locks. For example, in the example above, we IPI PP 1 and then queue the interrupt. We do not lock the queue and then add the interrupt from PP 0. This is because adding the lock would create a global lock that would reduce performance as the total number of VMs increases. This is a mistake that Xen has made which is a problem that does not need to exist. Instead, IPI the interrupt to PP 1. Once the PP can take the interrupt, it will be seen on PP 1, and we can queue without a lock, because if you are adding to the queue on PP 1, it means MicroV (and not a guest) is executing on PP 1, and no race condition can occur. Same thing with interrupt windows. Don't try to inject unless you are in a window. This ensures that you are never adding to the queue, and taking from the queue at the same time. It means there is an extra VMExit, but it keeps things really simple and avoids races.

To handle IPIs BTW, repurpose INIT. Do not add IPI support to the Microkernel unless absolutely needed. It is either just a memory access or an MSR write, and so there is no performance improvement for it being in the microkernel. The only reason to add IPI support to the Microkernel would be because maps are being abused between PPs and MicroV is trashing the TLB, in which case you would need it, but this should be avoided instead. Also note that how INIT was handled in MicroV/mono is not the right way to do it (lessons learned). Instead of a global lock, each VS should have a mailbox which is locked. One VS can lock the mailbox of another VS, and then add a message and unlock. Then the VS can IPI the other VS and you are done. If the VS sees an INIT, it reads from the mailbox and performs the action. You might ask... what about a real INIT. Well, the BSP needs to trap on attempts to send an INIT, and then IPI this INIT using the mailbox instead. Traps on INIT will be needed anyways, so this ensures that INIT is always used for the mailbox, and then INIT and SIPI commands move through this mailbox instead of the normal means. By doing this, external interrupts will not need to be turned on in the root VM unless PCI pass-through is needed (and the VISR approach can be used to prevent that too, but I would start by trapping on external interrupts and then adding the VISR optimization after it is working).
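
The per-VS mailbox idea can be sketched with a small lock per mailbox instead of one global lock. Everything here is hypothetical (struct vs_mailbox, the single-message slot, the function names) and uses C11 atomics as a stand-in for whatever lock primitive MicroV actually provides. Sending the INIT/IPI after posting is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-VS mailbox: one lock and one message slot per VS. */
struct vs_mailbox {
    atomic_flag lock;
    int has_msg;
    uint64_t msg;
};

void mailbox_init(struct vs_mailbox *mb)
{
    assert(mb != NULL);
    atomic_flag_clear(&mb->lock);
    mb->has_msg = 0;
    mb->msg = 0;
}

static void mailbox_lock(struct vs_mailbox *mb)
{
    /* Spin until this mailbox (and only this mailbox) is ours. */
    while (atomic_flag_test_and_set_explicit(&mb->lock, memory_order_acquire)) {
    }
}

static void mailbox_unlock(struct vs_mailbox *mb)
{
    atomic_flag_clear_explicit(&mb->lock, memory_order_release);
}

/*
 * Sender side: lock the destination's mailbox, add the message, unlock.
 * The caller would then IPI the destination. Returns -1 if the previous
 * message has not been consumed yet.
 */
int mailbox_post(struct vs_mailbox *mb, uint64_t msg)
{
    int ret = -1;

    assert(mb != NULL);
    mailbox_lock(mb);
    if (!mb->has_msg) {
        mb->msg = msg;
        mb->has_msg = 1;
        ret = 0;
    }
    mailbox_unlock(mb);

    return ret;
}

/* Receiver side, run when the INIT is seen. Returns -1 if the box is empty. */
int mailbox_take(struct vs_mailbox *mb, uint64_t *msg)
{
    int ret = -1;

    assert(mb != NULL && msg != NULL);
    mailbox_lock(mb);
    if (mb->has_msg) {
        *msg = mb->msg;
        mb->has_msg = 0;
        ret = 0;
    }
    mailbox_unlock(mb);

    return ret;
}
```

Because each lock only guards one VS's mailbox, contention stays between the sender and that one receiver and does not grow with the total number of VMs.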

mv_vs_op_clock_get:
mv_vs_op_clock_set:
Ok, so these are not going to be fun. See the following code for this: https://github.com/Bareflank/boxy/blob/master/bfvmm/src/hve/arch/intel_x64/virt/vclock.cpp#L57

There is also code here that QEMU will be doing that might help make sense of this: https://github.com/Bareflank/boxy/blob/master/bfexec/src/main.cpp#L125

The issue here is that MicroV (like KVM) has no idea what the wallclock is. Meaning, it does not know what time it is. It can read the TSC however. So, to keep track of time, we need a way to pair the current time (like 5:30pm) with the current TSC value. The wallclock however has NS resolution, so we have to be really quick. In other words, you read the wallclock, and then immediately read the TSC value. In Boxy, we had to do this a couple of times to ensure that we were not interrupted. In the Shim, we can simply turn preemption off while we do this, so that loop is not needed.

Once you have a wallclock and TSC value, you can now keep track of time. When the vs_t is created, we need to simply read the TSC value, and set the wallclock to 0 NS from epoch. If someone asks for the time using clock_get(), we need to take the current TSC, and calculate the difference between the TSC when the wallclock was set, and now. The total number of ticks in the TSC can be used to calculate how many NS have passed because we require an invariant TSC.

So the basic idea here is:

  • On init, set wallclock to 0 NS, current TSC
  • On clock_set(), set the wallclock to NS,TSC provided by the call to clock_set().
  • On clock_get(), current TSC - TSC when the wallclock was set == ticks that have occurred since the wallclock was set. Use the TSC frequency to calculate the number of NS that have occurred since the wallclock was set. wallclock NS + the NS we just calculated == current time in NS.

Oh... but it gets more complicated than this. The entire reason that bsl::safe_integral was added was because of these calculations. Because the wallclock is in NS, the TSC calculations from TSC to NS will overflow a 64-bit integer. The link provided above has all of the details on how to do this math without running into this overflow. Read it, learn it, understand it well.
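
To make the init/clock_set()/clock_get() steps concrete, here is a minimal sketch. It is not the Boxy implementation linked above. It assumes the TSC frequency is known in kHz and avoids the overflow by splitting the division into a quotient and a remainder, so that ticks is never multiplied by 1,000,000,000. The names emulated_clock_t fields, ticks_to_ns, clock_init, clock_set and clock_get are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-vs_t clock state: the wallclock/TSC pair described above.
struct emulated_clock_t {
    std::uint64_t wallclock_ns;    // NS since epoch when the pair was taken
    std::uint64_t tsc_at_set;      // TSC value read alongside the wallclock
    std::uint64_t tsc_khz;         // invariant TSC frequency in kHz (assumed known)
};

// Converts TSC ticks to NS. ticks / tsc_khz is whole milliseconds, and the
// remainder is always smaller than tsc_khz, so neither product overflows
// for realistic frequencies. The result is rounded down.
constexpr std::uint64_t ticks_to_ns(std::uint64_t ticks, std::uint64_t tsc_khz) {
    return ((ticks / tsc_khz) * 1000000U) + (((ticks % tsc_khz) * 1000000U) / tsc_khz);
}

// On init: wallclock is 0 NS, paired with the current TSC.
inline emulated_clock_t clock_init(std::uint64_t tsc_now, std::uint64_t tsc_khz) {
    assert(tsc_khz != 0U);
    return emulated_clock_t{0U, tsc_now, tsc_khz};
}

// On clock_set(): store the NS,TSC pair provided by the caller.
inline void clock_set(emulated_clock_t &clock, std::uint64_t ns, std::uint64_t tsc) {
    clock.wallclock_ns = ns;
    clock.tsc_at_set = tsc;
}

// On clock_get(): wallclock NS + NS elapsed since the wallclock was set.
inline std::uint64_t clock_get(emulated_clock_t const &clock, std::uint64_t tsc_now) {
    return clock.wallclock_ns + ticks_to_ns(tsc_now - clock.tsc_at_set, clock.tsc_khz);
}
```

For example, with a 3 GHz TSC (tsc_khz of 3,000,000), a year of ticks is roughly 9.5e16, which already overflows when multiplied by 1e9 but converts cleanly with the split form.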

Finally, there are issues with how Sleep/Resume is handled in the Root VM. When you put a computer to sleep, the RDTSC value is reset to 0. The guest however has not been put to sleep, and instead it has been paused. This is what the TSC Offset is for. When you wake from sleep/resume, you basically set the wallclock again. The problem is, the TSC value will have gone backwards in time. If the guest were to read the TSC, it would see this and panic (literally, the Linux kernel will panic). We have two options: we could trap RDTSC and adjust the result, or we can use the TSC Offset. The offset allows us to add a positive/negative value to the existing TSC so that if the guest expects the TSC to be larger than it actually is, we can make this correction without having to trap. Since we currently do not support sleep/resume, this is a non-issue until after Step 4. Just shut the VM down on sleep/resume or other events.

mv_vs_op_inject_exception:
mv_vs_op_queue_interrupt:
These allow us to inject an exception or queue an interrupt in a vs_t. Exceptions cannot be masked, so we always "inject" them, meaning the guest will see the exception on the entry right away. External interrupts can be masked, which means that they need to be queued until an "interrupt window" is open.

How this is handled on Intel vs AMD is totally different. On Intel, you wait for an interrupt window and then inject like you would an exception. On AMD, you add a virtual interrupt. AMD's approach would seem better on paper at first, but it is not. The problem with AMD's approach is that there is no simple means to know when the next interrupt needs to be injected.

On Intel, you leave the VMExit on interrupt window enabled. When you inject an interrupt, the window is closed, so the processor is free to handle the interrupt as needed without a VMExit occurring. Once the OS executes an IRET, the window is open again, and you get a VMExit. You can then check your queue, and repeat.

On AMD, there is no interrupt window VMExit (sort of), so the idea was you would trap on IRET to know when you can inject again. There are a ton of issues with this. The trap on IRET traps BEFORE the IRET completes, which means you need to single step IRET before the window is actually open. This single step is non-trivial as you have to emulate all sorts of horrible edge cases, some of which likely are impossible to handle properly. Trapping on IRET is also not supported with SEV.

To deal with this, we can use the VIRQ VMExit to simulate the VMExit on open window. What you do is queue a vector (any vector, so you might as well pick a fake one). Just before this vector would be injected you get a VMExit. From there, you can use eventinj to inject the vector you actually want to inject, leaving the VIRQ programmed with garbage. This will cause a trap the next time the window is open, and from there you can check your queue just like with Intel. In other words, you use the VIRQ fields solely to ensure that you get a VIRQ VMExit so that you can use it like an interrupt window VMExit on Intel.
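
The queue-then-inject-on-window flow can be sketched as follows. This is an illustration only: interrupt_queue_t, vs_controls_t, queue_interrupt and handle_interrupt_window are hypothetical names, and the window_exit_armed flag stands in for whatever the hardware needs (interrupt-window exiting on Intel, a pending dummy VIRQ on AMD). Disarming the window exit when the queue is empty is an assumption of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-vs_t queue. Only the PP running the VS touches it,
// which is why there is no lock here.
class interrupt_queue_t {
public:
    bool push(std::uint8_t vector) {
        if (m_count == m_vectors.size()) {
            return false;
        }
        m_vectors[(m_head + m_count) % m_vectors.size()] = vector;
        ++m_count;
        return true;
    }

    bool pop(std::uint8_t &vector) {
        if (m_count == 0U) {
            return false;
        }
        vector = m_vectors[m_head];
        m_head = (m_head + 1U) % m_vectors.size();
        --m_count;
        return true;
    }

    bool empty() const { return m_count == 0U; }

private:
    std::array<std::uint8_t, 64> m_vectors{};
    std::size_t m_head{};
    std::size_t m_count{};
};

// Hypothetical stand-in for the VMCS/VMCB fields involved.
struct vs_controls_t {
    bool window_exit_armed;          // request a VMExit when the window opens
    bool has_injection;              // an event is programmed for the next entry
    std::uint8_t injected_vector;    // the vector programmed for injection
};

// mv_vs_op_queue_interrupt: never inject here, only queue and arm the exit.
inline bool queue_interrupt(interrupt_queue_t &queue, vs_controls_t &ctl, std::uint8_t vector) {
    if (!queue.push(vector)) {
        return false;
    }
    ctl.window_exit_armed = true;
    return true;
}

// Window VMExit handler: inject one vector, stay armed if more are queued.
inline void handle_interrupt_window(interrupt_queue_t &queue, vs_controls_t &ctl) {
    std::uint8_t vector{};
    if (queue.pop(vector)) {
        ctl.injected_vector = vector;
        ctl.has_injection = true;
    }
    ctl.window_exit_armed = !queue.empty();
}
```

Because injection only ever happens from the window handler, the queue is never added to and taken from at the same time, at the cost of one extra VMExit per interrupt.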

vm switch support:
Any state that is managed by MicroV needs to be swapped when a VM switch occurs. Meaning, if we go from guest to root, or root to guest, we need to swap this state.

We will need to add DR0-3 and CR8 to the microkernel's state so that they are handled properly without the extra overhead of a syscall. This will also ensure that there are no issues with a guest VM messing around with what these values are while the VMM is running.

So this just leaves MSRs and XSAVE. There are APIs for getting/setting these already, and XSAVE can be run from userspace.

control register emulation:
Issues with the control registers include:

  • On Intel only, CR4.VMXE being disabled. The Microkernel already prevents this.
  • The cache disable and write through bits being set. Again, the Microkernel already prevents this from happening.
  • On Intel only, the PE/PG bits not being able to be cleared when unrestricted mode is enabled. This is also already handled by the Microkernel.
  • Switching from 64bit mode and properly handling EFER. This is not handled by the Microkernel as it has no means to do this. The code for doing this can be seen here:

https://github.com/Bareflank/hypervisor/blob/d2203d8fa44339e1c7dd0ce8264f5248b43a7648/bfvmm/src/hve/arch/intel_x64/vmexit/control_register.cpp#L178

msr emulation:
Any MSR defined in the VMCB is handled by the Microkernel. APIC BASE needs to be in the emulated_lapic_t, so that just leaves any MSR that we need to support that is not in this list. For most MSRs we can just give the VM a GPF. Remember that the OS will look for MSR support by reading an MSR and waiting for a GPF. So if we do not send a GPF, the OS will think the MSR is supported. For some MSRs, the guest will expect that the MSR actually exists and works. In these cases, we will need to have MSR registers in the emulated_msrs_t so that the guest can read/write them. Most of the time, that is all we need to do. Just save anything that is set(), and return it on get(). It is possible that setting an emulated MSR should change how MicroV works, or in more rare cases, the MSR will actually have to be written to hardware.

Having an MSR change how the hardware works or is configured wasn't an issue with Boxy, but it might be with QEMU, as we need to handle Windows, and Windows might need more MSRs that we have not seen.

Loading an MSR into hardware is unlikely. Boxy and MicroV/mono have a bunch of this that they had to do because MSRs like LSTAR were not used by the hypervisor. This is not the case with the Microkernel, so these MSRs are already handled on every VMExit anyways.

In general, it is expected that there is nothing to do here, other than if an MSR is needed, add it to the emulated_msrs_t, and allow the guest to get()/set() it.
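
A minimal sketch of the "save on set(), return on get(), GPF otherwise" behavior. The class layout, the add() function and the use of std::map are assumptions made for illustration; only the emulated_msrs_t name and the get()/set() idea come from the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical emulated_msrs_t: holds only the MSRs the guest expects to
// exist. Anything else is reported as missing so the caller can inject a GPF.
class emulated_msrs_t {
public:
    // Registers an MSR that the guest is allowed to read/write.
    void add(std::uint32_t msr, std::uint64_t initial) { m_msrs[msr] = initial; }

    // Returns false if the MSR is not emulated (caller should inject a GPF).
    bool get(std::uint32_t msr, std::uint64_t &val) const {
        auto const it = m_msrs.find(msr);
        if (it == m_msrs.end()) {
            return false;
        }
        val = it->second;
        return true;
    }

    // Returns false if the MSR is not emulated (caller should inject a GPF).
    bool set(std::uint32_t msr, std::uint64_t val) {
        auto const it = m_msrs.find(msr);
        if (it == m_msrs.end()) {
            return false;
        }
        it->second = val;
        return true;
    }

private:
    std::map<std::uint32_t, std::uint64_t> m_msrs;
};
```

An MSR that has to change MicroV's behavior, or be written to hardware, would need extra handling on top of this; the sketch only covers the common store-and-return case.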

cpuid emulation:
The emulated_cpuid_t needs to handle CPUID based on how the APIs need to function. In general, the Boxy code here is what we need, with one minor difference. https://github.com/Bareflank/boxy/blob/master/bfvmm/src/hve/arch/intel_x64/emulation/cpuid.cpp#L79

The difference is, this code was hardcoded, and instead, we need to store the feature bits and return them so the APIs can turn off feature bits that QEMU does not want to support.
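
One way to store the feature bits instead of hardcoding them is sketched below. The member functions (add_supported, restrict_to, get) are hypothetical, and the sketch assumes that userspace may only clear bits that MicroV already exposes. ANDing is only meaningful for registers that hold feature bits; leaves that hold values (such as the vendor string) would need different handling.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

struct cpuid_regs_t {
    std::uint32_t eax;
    std::uint32_t ebx;
    std::uint32_t ecx;
    std::uint32_t edx;
};

// Hypothetical emulated_cpuid_t that stores per-leaf feature bits.
class emulated_cpuid_t {
public:
    // Records what MicroV itself is willing to expose for a leaf/subleaf.
    void add_supported(std::uint32_t leaf, std::uint32_t subleaf, cpuid_regs_t const &regs) {
        m_leaves[std::make_pair(leaf, subleaf)] = regs;
    }

    // Applies what userspace wants. Bits can be turned off, never on.
    bool restrict_to(std::uint32_t leaf, std::uint32_t subleaf, cpuid_regs_t const &wanted) {
        auto const it = m_leaves.find(std::make_pair(leaf, subleaf));
        if (it == m_leaves.end()) {
            return false;
        }
        it->second.eax &= wanted.eax;
        it->second.ebx &= wanted.ebx;
        it->second.ecx &= wanted.ecx;
        it->second.edx &= wanted.edx;
        return true;
    }

    // What the guest sees on CPUID. Unknown leaves read as all zeros.
    cpuid_regs_t get(std::uint32_t leaf, std::uint32_t subleaf) const {
        auto const it = m_leaves.find(std::make_pair(leaf, subleaf));
        if (it == m_leaves.end()) {
            return cpuid_regs_t{0U, 0U, 0U, 0U};
        }
        return it->second;
    }

private:
    std::map<std::pair<std::uint32_t, std::uint32_t>, cpuid_regs_t> m_leaves;
};
```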

proper io emulation:
We need to handle IN and INS/OUTS. Note that for the guest VM, we do not actually read a port. We simply tell QEMU.

proper mmio emulation:
We need to implement support for flags. Otherwise QEMU will have no way of trapping on memory regions for devices it is emulating. We also need to test overlapping memory in the integration tests.

interrupt management:
We need to add an interrupt queue to the vs_t and trap on external interrupts. All external interrupts can be injected directly into the root, as we know that the window will be open. But QEMU will want to inject interrupts and exceptions into the guest VM, and so a queue will be needed to ensure that is possible. In general, this code is really simple.

clock support:
Each vs_t will need an emulated_clock_t, so that it can track time. See the above for more details on this.

vp migration:
Still not sure if this is working perfectly. Issues have been seen, but they may have been issues with compilation, and not migration (meaning the hypervisor compiled weird, as a simple recompile fixed the two issues that were seen, going from reproducing each time the code was run to never reproducing again).

We will have to translate TSC values with the TSC offset. To handle this, the shim will have to set the wallclocks using the clock set ABI, which will have a feature for doing the initial TSC calculations. This will basically force each CPU to spin on the shared page. Once written to, each vs_t will read the TSC and record the value. When a VS is migrated, it will have to look up what the TSC value is on the to/from PPs and adjust the TSC offset as needed so that the guest's view of the TSC is the same. This is needed because most CPUs have invariant TSCs, but they do not have stable TSCs, meaning their values are not the same on all cores, and there is no way for the OS to ensure that the TSC value is always the same and perfect without support for this from hardware.
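
The offset adjustment on migration can be sketched like this, assuming every PP recorded its TSC at (close to) the same instant when the shared page was written, and that all TSCs tick at the same rate. The names max_pps, tsc_samples_t and migrate_tsc_offset are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical number of PPs, only for the sketch.
constexpr std::size_t max_pps = 4U;

// TSC value each PP recorded when the shared page was written.
using tsc_samples_t = std::array<std::uint64_t, max_pps>;

// The guest sees host TSC + offset. To keep that view unchanged when moving
// from one PP to another, the offset absorbs the difference between the two
// PPs' TSCs. Unsigned wrap-around is intentional: it lets the offset act as
// a negative value when the destination TSC is ahead.
inline std::uint64_t migrate_tsc_offset(
    tsc_samples_t const &samples,
    std::uint64_t offset,
    std::size_t from_pp,
    std::size_t to_pp) {
    assert(from_pp < samples.size());
    assert(to_pp < samples.size());
    return offset + samples[from_pp] - samples[to_pp];
}
```

For example, if PP 0 recorded 1000 and PP 1 recorded 400, a VS with an offset of 50 on PP 0 needs an offset of 650 on PP 1 so that the guest still reads 1050 at the sample point.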

TLB control:
Any changes to the following need a TLB flush

  • CR0.PG, CR0.WP, CR0.CD, CR0.NW
  • CR3
  • CR4.PGE, CR4.PAE, CR4.PSE
  • EFER.NXE, EFER.LMA, EFER.LME
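
The list above can be expressed as a set of bit masks. The sketch below uses the architectural bit positions for each field; tlb_state_t and needs_tlb_flush are hypothetical names, and how the flush itself is performed is outside the scope of the example.

```cpp
#include <cassert>
#include <cstdint>

// CR0.PG (31), CR0.CD (30), CR0.NW (29), CR0.WP (16)
constexpr std::uint64_t cr0_flush_mask =
    (1ULL << 31U) | (1ULL << 30U) | (1ULL << 29U) | (1ULL << 16U);

// CR4.PGE (7), CR4.PAE (5), CR4.PSE (4)
constexpr std::uint64_t cr4_flush_mask = (1ULL << 7U) | (1ULL << 5U) | (1ULL << 4U);

// EFER.NXE (11), EFER.LMA (10), EFER.LME (8)
constexpr std::uint64_t efer_flush_mask = (1ULL << 11U) | (1ULL << 10U) | (1ULL << 8U);

// Hypothetical snapshot of the registers that matter for TLB control.
struct tlb_state_t {
    std::uint64_t cr0;
    std::uint64_t cr3;
    std::uint64_t cr4;
    std::uint64_t efer;
};

// Returns true if going from prev to next changes any field in the list.
constexpr bool needs_tlb_flush(tlb_state_t const &prev, tlb_state_t const &next) {
    return (((prev.cr0 ^ next.cr0) & cr0_flush_mask) != 0U) ||
           (prev.cr3 != next.cr3) ||
           (((prev.cr4 ^ next.cr4) & cr4_flush_mask) != 0U) ||
           (((prev.efer ^ next.efer) & efer_flush_mask) != 0U);
}
```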

Step 3 (rust-vmm demo)

The third step will be to get rust-vmm working. This will require MicroV to handle interrupts, meaning MicroV will have to emulate the LAPIC, IOAPIC, PIC and PIT. https://github.com/rust-vmm/vmm-reference

Like the previous demos, this will not include guest SMP support. PCI pass-through will however be included to ensure that the emulated devices provided by MicroV are working properly. To start, PCI pass-through will only include NICs. To support PCI pass-through, MicroV will enable external interrupt exiting in the root VM. Future versions of MicroV might include a root VM driver to trap on PCI pass-through specific interrupts and forward them to the correct guest VMs as needed to allow MicroV to disable external interrupt handling from the root VM. But for now, external interrupt exiting will be used to simplify this demo. The following additional IOCTLs will need to be implemented:

  • KVM_CREATE_IRQCHIP
  • KVM_IRQ_LINE
  • KVM_GET_IRQCHIP
  • KVM_SET_IRQCHIP
  • KVM_SET_GSI_ROUTING
  • KVM_GET_LAPIC
  • KVM_SET_LAPIC
  • KVM_SIGNAL_MSI
  • KVM_CREATE_PIT2
  • KVM_GET_PIT2
  • KVM_SET_PIT2
  • KVM_IRQFD

Step 4 (higher TRL demo)

The fourth step will be to raise the TRL of the previous demos and include lockdown support (i.e., deprivilege the root VM). This will include support for additional test systems and PCI pass-through devices. All unit and integration testing will be complete as well. To deprivilege the root VM, a complete analysis will need to be made as to what memory and additional guest state, like general purpose and system registers, QEMU and rust-vmm require. All additional resources will be locked down to deny the root VM access to them. This lock-down should take place before the first mv_vs_op_run is called.

Future Steps

In the future, the following will also be added:

  • Guest SMP support
  • Support for Intel
  • Support for ARM (aarch64, ServerReady only)
  • Support for RISC-V (tbd)
  • Support for Windows guest VMs
  • Support for RTOS guest VMs
  • Support for Windows root VM
  • Support for RTOS root VM
  • Support for AMD nested virtualization
  • Support for Intel nested virtualization
  • Support for device domains

Additional Notes

There are some IOCTLs that we are not sure are needed. If they are, they should be removed from this list and added to the lists above as needed. These include:

  • KVM_GET_VCPU_EVENTS
  • KVM_SET_VCPU_EVENTS
  • KVM_ENABLE_CAP
  • KVM_CREATE_DEVICE
  • KVM_GET_DEVICE_ATTR
  • KVM_SET_DEVICE_ATTR
  • KVM_HAS_DEVICE_ATTR
  • KVM_GET_XSAVE
  • KVM_SET_XSAVE
  • KVM_GET_XCRS
  • KVM_SET_XCRS

IOCTLs that we believe are not needed are as follows. Again, if these are needed, they should be removed from this list and added to the lists above. Keep in mind that if they are needed, it is possible that we are not providing the right capabilities to software, and we might simply need to tweak the capabilities instead. Finally, some of these might be needed if we add support for nested virtualization, SEV/TDX, or VM migration.

  • KVM_GET_DIRTY_LOG
  • KVM_XEN_HVM_CONFIG
  • KVM_SET_BOOT_CPU_ID
  • KVM_SET_TSC_KHZ
  • KVM_NMI
  • KVM_KVMCLOCK_CTRL
  • KVM_SET_GUEST_DEBUG
  • KVM_SMI
  • KVM_X86_SET_MCE
  • KVM_MEMORY_ENCRYPT_OP
  • KVM_MEMORY_ENCRYPT_REG_REGION
  • KVM_MEMORY_ENCRYPT_UNREG_REGION
  • KVM_HYPERV_EVENTFD
  • KVM_GET_NESTED_STATE
  • KVM_SET_NESTED_STATE
  • KVM_REGISTER_COALESCED_MMIO
  • KVM_UNREGISTER_COALESCED_MMIO
  • KVM_CLEAR_DIRTY_LOG
  • KVM_GET_SUPPORTED_HV_CPUID
  • KVM_SET_PMU_EVENT_FILTER