Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/azure: move image-specific configs from azure-common to azure-image, fix console output #359365

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

codgician
Copy link
Member

@codgician codgician commented Nov 26, 2024

[azure-common.nix]

  • Moved image specific configs (namely file system configs and grub configs) out of azure-common.nix, because these settings are only intended for images generated using azure-image.nix. This will make azure-common.nix more generic so folks generating azure image using other methods (e.g. disko) can leverage it.
  • Removed reference of headless.nix to make the console work (which was left out by mistake from nixos/azure: add Gen 2 VM, aarch64 and accelerated networking support #333508). Thanks @nh2 for the great finding.
  • Enabled cloud-init by default. Otherwise, the current waagent would try ifupdown for networking management, which is not available in NixOS ecosystem.

[azure-image.nix]

  • Included image specific configs (mentioned as above).
  • Fixed grub console output for Gen 2 VM by forcing grub to run in text mode.
  • Added option for customizing root partition label.
  • Updated the default configuration.nix copied into the image (azure-config-user.nix), prompting user to set the correct vmGeneration before rebuilding. Doing this instead of dynamically generating to remain compatibility, as I found some open-source configurations on GitHub are referencing this file directly.

[azure-agent.nix]

  • Removed assertion of Network Manager, because the current version of waagent works well with Network Manager.
  • Removed reference of python39 because the current version of waagent is depending on a newer Python version which already exist in the environment given how it is packaged.

Changes in this PR is testable via flake codgician/azure-aarch64-nixos (instructions for generating images provided in README.md). Tested for both Gen 2 grub and Gen 2 systemd-boot on aarch64.

image

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Nov 26, 2024
@codgician codgician requested a review from nh2 November 26, 2024 18:42
@codgician codgician self-assigned this Nov 26, 2024
@codgician codgician force-pushed the fix-azure-modules branch 2 times, most recently from 9e744ad to 3ede656 Compare November 27, 2024 13:23
@codgician codgician changed the title nixos/azure: move image-specific configs from azure-common to azure-image, and code cleanups nixos/azure: move image-specific configs from azure-common to azure-image, fix console output Nov 27, 2024
@codgician codgician marked this pull request as ready for review November 27, 2024 17:14
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Nov 28, 2024
Copy link
Contributor

@nh2 nh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, but the release notes is a blocker.

Note I have not tested it, only read the code.

@github-actions github-actions bot added 8.has: documentation This PR adds or changes documentation 8.has: changelog labels Nov 29, 2024
@codgician
Copy link
Member Author

Generally LGTM, but the release notes is a blocker.

Note I have not tested it, only read the code.

Thanks @nh2 for the insights, I have made improvements according to comments and updated the release note.

@AkechiShiro
Copy link
Contributor

AkechiShiro commented Nov 29, 2024

I've been trying to test it but I'm running in an issue (server side from Microsoft, I've been asked by Microsoft to open a support ticket for their team to further investigate their server side errors) azcopy is failling with 500 when I try to copy the vhd to a managed disk.

If you have any recommended way of uploading the vhd @codgician feel free to let me know I'll try without azcopy

@codgician
Copy link
Member Author

I've been trying to test it but I'm running in an issue (server side from Microsoft, I've been asked by Microsoft to open a support ticket for their team to further investigate their server side errors) azcopy is failling with 500 when I try to copy the vhd to a managed disk.

If you have any recommended way of uploading the vhd @codgician feel free to let me know I'll try without azcopy

I also use azcopy for copying, the detailed command could be found here (you may need to change account name / container name / subscription id) accordingly.

Uploading can also be done on Azure Portal. In storage account page -> Storage browser -> Blob containers, select your container and click "Upload" for uploading blob, just make sure the type is selected as "Page Blob".

image

@codgician
Copy link
Member Author

Rebased with latest master branch to resolve merge conflicts.

@nh2
Copy link
Contributor

nh2 commented Nov 30, 2024

Added fileSystems.<name>.label in the latest iteration.

@codgician I cannot see that in your latest push yet, it still shows device = "/dev/disk/by-label/... for me in multiple places.

@codgician
Copy link
Member Author

codgician commented Dec 1, 2024

Added fileSystems.<name>.label in the latest iteration.

@codgician I cannot see that in your latest push yet, it still shows device = "/dev/disk/by-label/... for me in multiple places.

The device definition has to be preserved because it is directly referenced if boot.growPartition is enabled:

rootDevice="${config.fileSystems."/".device}"
. If removed it leads to build failure.

A line of inherit (cfg) label; is added beneath for users needing to access the label.

@codgician
Copy link
Member Author

codgician commented Dec 12, 2024

Rebased with latest master branch (where services.waagent option module is introduced and replacing azure-agent.nix). Test flake also updated.

@AkechiShiro
Copy link
Contributor

AkechiShiro commented Dec 15, 2024

I've been able to do a quick test (systemd-boot x86) but I'm running into a few issues, I've used an old VHD (built before your last comment) and a freshly built one, I've noted the following :

  • Channel nixos and stateVersion set to 24.11 but uname -a shows I'm under 25.05 (waagent seems to run fine).
  • systemd-boot menu at boot is indeed usable in the serial console
  • Result generated is no longer a disk.vhd for the test flake but more something like nixos-image-azure-25.05.20241212.ec68a18-x86_64-linux.vhd (building a configuration with stateVersion set to 24.11 or 25.05 doesn't change the output format, it still build a nixos-image-azure-25.05.date.<hash>-x86_64-linux.vhd.

sudo nixos-rebuild switch yields this error (note nix-channel --update does work and nix-shell -p do work) :

while calling the 'findFile' builtin
         at /nix/store/pc1d5jpff9zqnkks7nkkk3snms7h57m6-source/nixos/default.nix:1:60:
            1| { configuration ? import ./lib/from-env.nix "NIXOS_CONFIG" <nixos-config>
             |                                                            ^
            2| , system ? builtins.currentSystem

       error: file 'nixos-config' was not found in the Nix search path (add it using $NIX_PATH or -I)

NIX_PATH is set to nixpkgs=flake:nixpkgs:/nix/var/nix/profiles/per-user/root/channels, I tried setting it to nixpkgs=flake:nixpkgs:/nix/var/nix/profiles/per-user/root/channels:nixos-config=/etc/nixos/configuration.nix but this still did not work.

configuration.nix contains exactly what is in nixos/modules/virtualisation/azure-config-user.nix, I'm not sure how you were able to nixos-rebuild switch successfully.

@codgician did you export specific NIX_PATH variable or am I testing in the wrong way ? (I'm using a bit of a modified version for this script : https://github.com/rudesome/nixos-on-azure/blob/master/upload-image.sh)

I don't see where the original configuration.nix that I had in the test repository is on the machine when I boot it, but it does seem to affect the image.

If I switch the stateVersion to 25.05 (and then build a new ISO with that configuration) :

  • waagent crashes at the moment => making the VM unable to join the Internet (logging into the serial console is still possible),
  • nixos channel is still set to 24.11
  • I believe one more patch/substitution is necessary to point it towards the proper path to openssl, there might be a way to configure this in the new waagent service you have introduced.
INFO Daemon Daemon Clean protocol and wireserver endpoint
INFO Daemon Daemon Wire server endpoint:168.63.129.16
INFO Daemon Daemon Fabric preferred wire protocol version:2015-04-05
INFO Daemon Daemon Wire protocol version:2012-11-30
INFO Daemon Daemon Server preferred version:2015-04-05
ERROR Daemon Daemon Event: name=WALinuxAgent, op=UnhandledError, message=[Errno 2] No such file or directory: '/usr/bin/openssl'
Traceback (most recent call last):
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/daemon/main.py", line 83, in run
    self.daemon(child_args)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/daemon/main.py", line 144, in daemon
    self.provision_handler.run()
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/pa/provision/cloudinit.py", line 53, in run
    self.protocol_util.get_protocol()  # Trigger protocol detection
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/protocol/util.py", line 299, in get_protocol
    protocol = self._detect_protocol(save_to_history=save_to_history, init_goal_state=init_goal_state)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/protocol/util.py", line 220, in _detect_protocol
    protocol.detect(init_goal_state=init_goal_state, save_to_history=save_to_history)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/protocol/wire.py", line 84, in detect
    cryptutil.gen_transport_cert(trans_prv_file, trans_cert_file)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/cryptutil.py", line 47, in gen_transport_cert
    shellutil.run_command(cmd)
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 288, in run_command
    return __run_command(command_action=command_action, command=command, log_error=log_error, encode_output=encode_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 190, in __run_command
    return_code, stdout, stderr = command_action()
                                  ^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 255, in command_action
    process = _popen(command, stdin=popen_stdin, stdout=stdout, stderr=stderr, shell=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/89p16phbqbxqlj54km8knsa2fkhhhjr1-waagent-2.12.0.2/lib/python3.12/site-packages/azurelinuxagent/common/utils/shellutil.py", line 398, in _popen
    process = subprocess.Popen(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/zv1kaq7f1q20x62kbjv6pfjygw5jmwl6-python3-3.12.7/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/nix/store/zv1kaq7f1q20x62kbjv6pfjygw5jmwl6-python3-3.12.7/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/openssl'
, duration=0
WARNING Daemon Daemon Daemon ended with exception -- Sleep 15 seconds and restart daemon

I believe adding this line into the postPatch of waagent would fix the issue :

   substituteInPlace azurelinuxagent/common/conf.py \
      --replace-fail "/usr/bin/openssl" ${openssl}/bin/openssl

@codgician
Copy link
Member Author

codgician commented Dec 16, 2024

@AkechiShiro Thank you for taking the time for testing. ❤️

Channel nixos and stateVersion set to 24.11 but uname -a shows I'm under 25.05 (waagent seems to run fine).

The reason you are seeing 24.11 is because the default system channel is still set to 24.11 as of today in master branch, see: https://github.com/NixOS/nixpkgs/blob/a9a3e341c6bc326e5a3de9324c33ba283f7ce242/nixos/modules/config/nix-channel.nix#L73C1-L74C1

Result generated is no longer a disk.vhd for the test flake but more something like nixos-image-azure-25.05.20241212.ec68a18-x86_64-linux.vhd (building a configuration with stateVersion set to 24.11 or 25.05 doesn't change the output format, it still build a nixos-image-azure-25.05.date.-x86_64-linux.vhd.

This is introduced recently by PR #359339. Since then the image name would follow image.fileName configuration. I have updated my flake and specified image.fileName = "disk.vhd" to avoid confusion.

sudo nixos-rebuild switch yields this error (note nix-channel --update does work and nix-shell -p do work) :

The /etc/nixos/configuration.nix inside the image is copied from azure-config-user.nix and does not reflect the actual configuration of the generated image. Thus, in my testing I have been rebuilding from flake instead of directly running sudo nixos-rebuild switch. I am open to any suggestions on improving this experience.

did you export specific NIX_PATH variable or am I testing in the wrong way ? (I'm using a bit of a modified version for this script : https://github.com/rudesome/nixos-on-azure/blob/master/upload-image.sh)

I did not export specific NIX_PATH (at least my modifications does not introduce any new exports). I think this is irrelevant with the script you are using and you are testing the right way.

waagent crashes at the moment => making the VM unable to join the Internet (logging into the serial console is still possible),

I added a default OS.OpensslPath configuration in the latest iteration to fix this.

image

@AkechiShiro
Copy link
Contributor

Thanks a lot @codgician for the amazing work on bringing more Azure support for NixOS, I think it would be great to not rely on flakes until they're stable, I'm not sure if there is a way to expose some kind of flag to let me use configuration.nix in order to use channels + npins for instance for dependencies.

I'll attempt to do a retest the week-end that is coming (on x86, arm, and also run GRUB).

Would you have also any idea on the process in Azure/MS side to sign the Lanzaboote shim ? If we could get SecureBoot + TPM on NixOS image in Azure, as well it would be very nice.

@codgician
Copy link
Member Author

Thanks a lot @codgician for the amazing work on bringing more Azure support for NixOS, I think it would be great to not rely on flakes until they're stable, I'm not sure if there is a way to expose some kind of flag to let me use configuration.nix in order to use channels + npins for instance for dependencies.

I'll attempt to do a retest the week-end that is coming (on x86, arm, and also run GRUB).

Would you have also any idea on the process in Azure/MS side to sign the Lanzaboote shim ? If we could get SecureBoot + TPM on NixOS image in Azure, as well it would be very nice.

It is possible to add custom Secure Boot key for Azure image via ARM template, referring: https://learn.microsoft.com/en-us/azure/virtual-machines/trusted-launch-secure-boot-custom-uefi. The custom key binds to an Azure Compute Gallery Image, requiring users to do some manual setups.

@codgician
Copy link
Member Author

codgician commented Dec 29, 2024

Just some nitpicking :D Thank you for your awesome work! Some maybe out-of-scope suggestions:

I think networking.useNetworkd needs lib.mkDefault true, because not setting one throws a warning:

evaluation warning: The combination of `systemd.network.enable = true`, `networking.useDHCP = true` and `networking.useNetworkd = false` can cause both networkd and dhcpcd to manage the same interfaces. This can lead to loss of networking. It is recommended you choose only one of networkd (by also enabling `networking.useNetworkd`) or scripting (by disabling `systemd.network.enable`)

I also consider making virtualisation.azure.acceleratedNetworking true by default, it doesn't seem to break things: https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview?tabs=redhat#limitations-and-constraints

For networking.useNetworkd, the module did not force it to true because I found waagent also works well with NetworkManager, so I'd prefer to leave this option to end-user. I did a quick search and did not see any azure modules setting systemd.network.enable = true; by default, but please let me know if I've missed anything.

And for making virtualisation.azure.acceleratedNetworking true by default, this is a good point, but I think we can do that in future PRs (after this change reaches more audience for testing).

And thank you for your awesome suggestions on code improvement, I have applied them in the latest commit.

@wegank wegank removed the 12.approvals: 2 This PR was reviewed and approved by two reputable people label Dec 30, 2024
@usertam usertam added the 12.approvals: 1 This PR was reviewed and approved by one reputable person label Jan 3, 2025
@wegank wegank added 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people and removed 12.approvals: 1 This PR was reviewed and approved by one reputable person labels Jan 3, 2025
@SamLukeYes
Copy link
Member

For networking.useNetworkd, the module did not force it to true because I found waagent also works well with NetworkManager, so I'd prefer to leave this option to end-user. I did a quick search and did not see any azure modules setting systemd.network.enable = true; by default, but please let me know if I've missed anything.

It is enabled by cloud-init module:

systemd.network.enable = lib.mkIf cfg.network.enable true;

@codgician
Copy link
Member Author

For networking.useNetworkd, the module did not force it to true because I found waagent also works well with NetworkManager, so I'd prefer to leave this option to end-user. I did a quick search and did not see any azure modules setting systemd.network.enable = true; by default, but please let me know if I've missed anything.

It is enabled by cloud-init module:

systemd.network.enable = lib.mkIf cfg.network.enable true;

Thank you @SamLukeYes, I have added networking.useNetworkd = true; and rebased in the latest iteration.

@wegank wegank removed the 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people label Jan 12, 2025
@usertam
Copy link
Contributor

usertam commented Jan 13, 2025

This PR has been stalled for little over a month now, with 3 approvals and no breaking changes (this PR is mostly additions). Can we get this merged soon? @nh2 @wegank

Copy link
Member

@SamLukeYes SamLukeYes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine for me so far

@wegank wegank added the 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people label Jan 14, 2025
Copy link
Contributor

@AkechiShiro AkechiShiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot for the work @codgician

@codgician codgician mentioned this pull request Jan 20, 2025
13 tasks
nixos/modules/virtualisation/waagent.nix Outdated Show resolved Hide resolved
@codgician
Copy link
Member Author

@ofborg build waagent

@wegank wegank removed the 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people label Jan 22, 2025
@wegank wegank added the 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people label Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: changelog 8.has: documentation This PR adds or changes documentation 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants