Introduction
Secure, reproducible operating system for single-board computers, with atomic A/B OTA updates, automatic rollback, and container-based application deployment.
AtomixOS turns compact SBC hardware into a repeatable appliance platform: immutable OS images are built from Nix, delivered through rollback-safe updates, and extended at runtime with operator-provisioned containers and optional Nixstasis-based remote management.
Why AtomixOS?
Remote embedded devices that receive over-the-air updates face a fundamental reliability problem: if an update fails
mid-write or the new image doesn’t boot correctly, the device is bricked. Traditional package-manager approaches (e.g.,
apt upgrade) have a measurable failure rate from power loss, partial writes, and dependency conflicts.
AtomixOS eliminates this class of failure through:
- Atomic A/B updates – installs to the inactive slot pair while the active slot stays online; no partial state
- Automatic rollback – U-Boot boot-count logic falls back to the previous working slot after 3 consecutive boot failures
- Hardware watchdog (currently disabled on Rock64) – integration and tests are in place; runtime enablement is pending final boot-stability validation on hardware
- Local health-check confirmation – commits new slots only after verifying that all services and containers are healthy for a sustained 60-second window
- Signed RAUC bundles – reproducible, CA-signed `.raucb` artifacts built from the Nix flake
- Read-only root filesystem – squashfs rootfs with OverlayFS (tmpfs upper layer) prevents runtime drift; every boot starts from a known-good state
Supported Hardware
| Board | SoC | Architecture | Storage |
|---|---|---|---|
| Rock64 | RK3328 | aarch64 | 16 GB eMMC |
Key Properties
- Reproducible – the entire system image is built from a single Nix flake with pinned inputs; same flake, same image
- Immutable – the squashfs root filesystem is read-only; writable state lives on a dedicated `/data` partition
- Testable – a NixOS VM integration test suite covers the update lifecycle, provisioning paths, forensic log durability, network security, and rollback behavior without physical hardware
- EN18031 compliant – ships without default credentials; per-device credentials are provisioned at factory time; IP forwarding is disabled by default
Network Role
Each AtomixOS device acts as a gateway between an isolated LAN and the internet:
- WAN (eth0): DHCP client, deny-by-default inbound; application/VPN ports are provisioned explicitly
- LAN (eth1): Provisioned static IP, runs DHCP/DNS server (dnsmasq) and NTP server (chrony) for local devices
- No routing: IP forwarding is disabled; LAN devices have zero internet access
- Remote management: Nixstasis-hosted management and SSH key-only access; bootstrap stays LAN-local
Quick Start
# Build the flashable disk image set
mise run build
# Flash to eMMC (macOS)
mise run flash /dev/disk4
# Run all E2E tests
mise run e2e
# Run all E2E tests inside a Lima VM
mise run e2e --lima
See Building and Provisioning for detailed instructions.
Architecture
AtomixOS combines several architectural patterns to achieve reliable over-the-air updates on embedded hardware:
- A/B partition scheme with paired boot and rootfs slots
- Read-only squashfs rootfs with OverlayFS root (squashfs lower + tmpfs upper) for runtime state
- U-Boot boot-count rollback with watchdog integration (currently disabled on Rock64 during development)
- Network isolation with no IP forwarding between WAN and LAN interfaces
- EN18031-compliant authentication with no embedded credentials
This chapter covers each of these in detail. For the rationale behind specific design choices, see Design Decisions.
Partition Layout
The Rock64’s 16 GB eMMC uses a fixed A/B partition layout with raw U-Boot at the beginning and a persistent data
partition at the end. The flash image carries slot A only; initrd systemd-repart creates slot B and /data on first
boot.
General host and application logging stays tmpfs-first during runtime and is
forwarded through an rsyslog RAM queue before buffered appends land in
/data/logs.
Layout
| Offset | Size | Content | Filesystem | Notes |
|---|---|---|---|---|
| 0 | 16 MB | U-Boot | raw | idbloader @ sector 64, u-boot.itb @ sector 16384 |
| 16 MB | 128 MB | boot-a | vfat | kernel Image, initrd, DTB, boot.scr |
| 144 MB | 1024 MB | rootfs-a | squashfs | zstd compressed, 1 MB blocks; used as OverlayFS lower layer |
| 1168 MB | 128 MB | boot-b | vfat | created on first boot by initrd systemd-repart |
| 1296 MB | 1024 MB | rootfs-b | -- | created on first boot by initrd systemd-repart |
| 2320 MB | remaining | data | f2fs | created on first boot by initrd systemd-repart |
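The offsets in the layout follow directly from the region sizes. A minimal Python sketch of that arithmetic, useful as a sanity check when a slot size changes (the `REGIONS` list and `region_offsets` helper are illustrative names, not part of the AtomixOS build):

```python
# Region sizes in MB, in on-disk order, copied from the layout above.
# None marks the data partition, which takes whatever space remains.
REGIONS = [
    ("u-boot", 16),
    ("boot-a", 128),
    ("rootfs-a", 1024),
    ("boot-b", 128),
    ("rootfs-b", 1024),
    ("data", None),
]


def region_offsets(regions):
    """Return {name: start offset in MB} by accumulating region sizes."""
    offsets = {}
    cursor = 0
    for name, size_mb in regions:
        offsets[name] = cursor
        if size_mb is not None:
            cursor += size_mb
    return offsets
```

Calling `region_offsets(REGIONS)` reproduces the Offset column: 0, 16, 144, 1168, 1296, and 2320 MB.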
Slot Pairing
RAUC manages two slot pairs. Each pair contains a boot partition and a rootfs partition that are always written together atomically:
| Slot | Boot Partition | Rootfs Partition |
|---|---|---|
| A | boot-a (p1) | rootfs-a (p2) |
| B | boot-b (p3) | rootfs-b (p4) |
An update writes the new kernel/DTB to the inactive boot partition and the new squashfs to the inactive rootfs partition. The active slot pair is never modified during an update.
U-Boot Region
U-Boot occupies the first 16 MB of the eMMC as raw data (no partition). The RK3328 boot ROM loads the initial bootloader from fixed sector offsets:
| Component | Sector Offset | Byte Offset | Description |
|---|---|---|---|
| `idbloader.img` | 64 | 32 KB | First-stage loader (TPL + SPL) |
| `u-boot.itb` | 16384 | 8 MB | U-Boot proper (FIT image) |
U-Boot environment is stored in SPI flash exposed to Linux as /dev/mtd0 at offset 0x140000 with size 0x2000.
AtomixOS uses this single SPI environment for RAUC boot variables instead of raw eMMC environment writes.
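The byte offsets for the bootloader components are the sector offsets multiplied by the sector size. A short sketch, assuming 512-byte sectors (an assumption that is consistent with the 32 KB and 8 MB figures above); `sector_to_bytes` is an illustrative helper:

```python
SECTOR_SIZE = 512  # bytes per sector; assumed, matches the byte offsets above

IDBLOADER_SECTOR = 64      # idbloader.img
UBOOT_ITB_SECTOR = 16384   # u-boot.itb


def sector_to_bytes(sector):
    """Convert a sector offset on the eMMC to a byte offset."""
    return sector * SECTOR_SIZE
```

`sector_to_bytes(IDBLOADER_SECTOR)` is 32768 bytes (32 KB) and `sector_to_bytes(UBOOT_ITB_SECTOR)` is 8388608 bytes (8 MB).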
Data Partition
The flashable image leaves the space after rootfs-a unallocated. On first boot,
initrd systemd-repart creates boot-b, rootfs-b, and /data there before the
live system is mounted. This avoids repartitioning from the switched-root system
while still preserving the inactive slot and /data across all updates and
rollbacks.
Contents created during provisioning:
/data/
.completed_first_boot First-boot sentinel
config/
admin-signers Admin SSH keys trusted for config re-apply signatures
ssh-authorized-keys/<user> Per-user SSH authorized keys
nixstasis/ Planned enrollment key and agent state
openvpn/client.conf OpenVPN recovery tunnel config (optional)
containers/ Reserved for future application workloads
logs/ Buffered general host and application logs appended from rsyslog
Logging Tiers
AtomixOS uses two runtime logging tiers with different durability goals:
| Tier | Location | Contents |
|---|---|---|
| Tier 1 | journald runtime | General host and container logs, tmpfs-first (`Storage=volatile`, runtime capped) |
| Tier 2 | /data/logs | Buffered rsyslog appends for bounded durable host and application diagnostics |
Network Topology
Each AtomixOS device can use two Ethernet interfaces to keep LAN-side services isolated from WAN-side management and application ingress.
Interface Roles
flowchart LR
WAN["WAN<br/>internet"] --> ETH0["eth0<br/>DHCP client<br/>Deny-by-default inbound"]
LAN["LAN<br/>isolated devices"] --> ETH1["eth1<br/>Provisioned static IP<br/>DHCP/DNS: dnsmasq<br/>NTP: chrony"]
subgraph DEVICE["AtomixOS device"]
direction TB
ETH0
CORE["No IP forwarding<br/>FORWARD chain: DROP all"]
ETH1
APPS["Provisioned application containers<br/>No packet forwarding"]
end
ETH0 -. provisioned inbound ports .-> APPS
APPS -. local service access .-> ETH1
WAN Interface (eth0)
- Mapped to the onboard RK3328 GMAC via systemd `.link` file (platform path `platform-ff540000.ethernet`)
- DHCP v4 client via systemd-networkd
- Uses DHCP-provided DNS servers
- Firewall drops new inbound traffic by default
- Provisioned firewall state may open application or VPN ports from `/data/config/firewall-inbound.json`
LAN Interface (eth1)
- USB Ethernet adapter (any supported chipset: r8152, ax88179, cdc_ether)
- Static IP: provisioned LAN gateway, falling back to `172.20.30.1/24`
- Runs dnsmasq DHCP server from the provisioned range, with fallback `172.20.30.10–172.20.30.254`
- Runs chrony NTP server for the provisioned LAN subnet, with fallback `172.20.30.0/24`
- Runs gateway-local DNS only: dnsmasq serves local names on `53` and does not forward upstream
Isolation Model
IP forwarding is explicitly disabled at the kernel level:
boot.kernel.sysctl = {
"net.ipv4.ip_forward" = 0;
"net.ipv6.conf.all.forwarding" = 0;
};
The nftables FORWARD chain has a drop policy with no exceptions. LAN devices get DHCP, DNS, NTP, SSH, and first-boot
bootstrap access on eth1, but no packet-level internet routing. WAN application or VPN exposure is created only from
provisioned firewall state.
NIC Naming
Deterministic interface naming uses systemd .link files rather than udev rules:
| Link File | Match | Name |
|---|---|---|
| `10-onboard-eth` | Platform path platform-ff540000.ethernet | eth0 |
| `20-usb-eth` | USB Ethernet drivers (r8152, ax88179, cdc_ether) | enabled as modules in Rock64 kernel config |
| WiFi | Unsupported until hardware selection | not part of current Rock64 image |
The onboard Ethernet is always eth0 regardless of USB device enumeration order. USB Ethernet adapters receive
kernel-assigned names (e.g., eth1, eth2).
Firewall Summary
| Interface | Direction | Allowed Ports |
|---|---|---|
| eth0 (WAN) | Inbound | provisioned firewall ports only |
| eth0 (WAN) | Inbound | TCP 22 (SSH) – only with flag file |
| eth1 (LAN) | Inbound | open by default; explicit lan scope switches to allowlisted ports only |
| tun0 (VPN) | Inbound | TCP 22 (SSH) |
| any | Forward | DROP (no exceptions) |
Provisioned WAN and optional restrictive LAN ports come from /data/config/firewall-inbound.json. SSH on WAN is
controlled by the presence of /data/config/ssh-wan-enabled. See the Firewall module for
implementation details.
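The WAN rules in the summary reduce to a small decision: provisioned ports are always allowed, and TCP 22 is added only while the flag file exists. A hedged Python sketch of that logic (not the actual nftables module; `wan_inbound_tcp_ports` is a hypothetical helper, and the provisioned ports are passed in directly because the schema of `firewall-inbound.json` is not described here):

```python
from pathlib import Path

SSH_WAN_FLAG = Path("/data/config/ssh-wan-enabled")


def wan_inbound_tcp_ports(provisioned, flag=SSH_WAN_FLAG):
    """Return the TCP ports accepted on eth0.

    provisioned: iterable of ports taken from the provisioned firewall state.
    flag: path of the flag file that enables SSH on WAN when present.
    """
    ports = set(provisioned)
    if flag.exists():
        ports.add(22)  # SSH on WAN is opt-in via the flag file
    return ports
```

With no flag file present, `wan_inbound_tcp_ports({443})` returns only the provisioned port; creating the flag file adds port 22.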
Update & Rollback Flow
AtomixOS uses RAUC for A/B slot management combined with U-Boot boot-count logic and watchdog integration for automatic recovery from failed updates.
Normal Update Cycle
sequenceDiagram
participant Upgrade as os-upgrade.service
participant RAUC
participant Boot as U-Boot
participant Verify as os-verification.service
Upgrade->>RAUC: Poll update server with compact X-Device-ID and install new bundle
RAUC->>RAUC: Write boot + rootfs to the inactive slot pair
RAUC->>Boot: Set BOOT_ORDER=B A and BOOT_B_LEFT=3
Boot->>Boot: Reboot into updated slot and decrement BOOT_B_LEFT
Boot->>Verify: Start updated system
Verify->>Verify: Check network, services, and 60s stability
alt Checks pass
Verify->>RAUC: rauc status mark-good
RAUC->>Boot: Commit updated slot
else Checks fail across 3 boots
Verify-->>Boot: Exit non-zero
Boot->>Boot: Keep decrementing boot counter
Boot->>Boot: Fall back to previous good slot
end
Boot-Count Mechanism
U-Boot maintains three environment variables for slot selection:
| Variable | Purpose | Example |
|---|---|---|
| `BOOT_ORDER` | Slot priority (first = preferred) | "A B" |
| `BOOT_A_LEFT` | Remaining boot attempts for slot A | 3 |
| `BOOT_B_LEFT` | Remaining boot attempts for slot B | 3 |
On each boot, U-Boot RAUC bootmeth selects the slot and decrements the boot attempt counter before loading
boot.scr. The script:
- Reads the bootmeth-provided boot partition and root partition variables
- Sets `rauc.slot` and `atomixos.lowerdev` for the selected slot
- Loads kernel, initrd, and DTB from that slot’s boot partition
- Boots with `root=fstab` so initrd mounts the selected squashfs by lower device
If a slot’s counter reaches 0, RAUC bootmeth skips it and tries the next slot in BOOT_ORDER. This ensures automatic
rollback after 3 consecutive boot failures.
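The selection rule can be modelled in a few lines of Python. This is a simplified simulation of the behavior described above, not U-Boot's implementation; the dictionary stands in for the stored environment, and `select_slot` and `mark_good` are illustrative names:

```python
def select_slot(env):
    """Pick the first slot in BOOT_ORDER with attempts left and use one attempt.

    Returns the slot name, or None when every counter is exhausted.
    Mutates env in place, like a bootloader updating its stored environment.
    """
    for slot in env["BOOT_ORDER"].split():
        key = f"BOOT_{slot}_LEFT"
        if env[key] > 0:
            env[key] -= 1
            return slot
    return None


def mark_good(env, slot, attempts=3):
    """Model a successful health check: restore the slot's attempt counter."""
    env[f"BOOT_{slot}_LEFT"] = attempts
```

Starting from `{"BOOT_ORDER": "B A", "BOOT_A_LEFT": 3, "BOOT_B_LEFT": 3}` and never calling `mark_good`, four consecutive calls to `select_slot` return `B`, `B`, `B`, then `A`: the fourth boot falls back to the previous slot.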
Health Check Details
The os-verification.service performs these checks before committing a slot:
- Service checks: dnsmasq and chronyd must be active
- Network checks: eth0 must have a WAN IP; eth1 must have the expected LAN gateway IP
- Sustained check: all above conditions must hold for 60 seconds (checked every 5 seconds) to catch restart loops
Only after all checks pass does the service run rauc status mark-good, which resets the boot counter and commits the
slot.
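The sustained check is what distinguishes a stable system from one caught between restarts. A minimal Python sketch of such a loop, assuming a caller-supplied `check` function that bundles the service and network checks (the real `os-verification` script may be structured differently; `sustained_healthy` is an illustrative name):

```python
import time


def sustained_healthy(check, window_s=60, interval_s=5, sleep=time.sleep):
    """Return True only if check() stays truthy for the whole window.

    check is polled every interval_s seconds; the first failure aborts
    immediately so a restart loop cannot be committed as a good slot.
    """
    elapsed = 0
    while elapsed < window_s:
        if not check():
            return False
        sleep(interval_s)
        elapsed += interval_s
    # One last poll at the end of the window.
    return bool(check())
```

The `sleep` parameter is injectable so the loop can be exercised in tests without waiting a real minute. Polling at both ends of the window is a choice made in this sketch.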
First Boot Exception
On initial device provisioning, first-boot.service writes the sentinel file
(/data/.completed_first_boot) after successful provisioning import/validation and marks the slot good only when RAUC
is enabled. After this, all subsequent boots use the full health-check path.
Watchdog Integration (currently disabled on Rock64 during development)
The RK3328 hardware watchdog (dw_wdt) integration is implemented with these target settings:
- Runtime watchdog: 30 seconds – if systemd hangs, the device reboots
- Reboot watchdog: 10 minutes – if a reboot hangs, the watchdog forces a hard reset
These target settings are not enabled in the current release. When enabled later, both scenarios feed into the boot-count rollback path: repeated unsuccessful boots decrement the selected slot counter until U-Boot returns to the previous slot.
Update Polling
The os-upgrade.service runs on a systemd timer:
- First check: 5 minutes after boot
- Subsequent checks: every 1 hour (configurable)
- Random delay: up to 10 minutes (prevents thundering herd across fleet)
The service queries the update server with the compact lowercase 12-hex eth0 MAC as X-Device-ID and the current
version. If a newer bundle is available, it downloads to /data, installs via rauc install, and reboots. The hawkBit
path is reserved for future implementation and is not an operational update mode in the current image.
Authentication (EN18031)
AtomixOS ships with no embedded credentials. EN18031 compliance requires that each device has unique credentials provisioned at factory time – there are no default passwords or shared secrets.
Provisioning State
Persisted device-local state lives on /data:
| Item | Storage Path | Notes |
|---|---|---|
| Admin signer keys | /data/config/admin-signers | Admin SSH keys trusted for config re-apply |
| User SSH public keys | /data/config/ssh-authorized-keys/<user> | Per-user LAN/VPN SSH access |
| Nixstasis registration key | /data/config/nixstasis/registration-key | Planned persistent device enrollment credential |
| Nixstasis agent state | /data/config/nixstasis/ | Planned client state, tunnel config, and metadata |
Authentication Flows
SSH Access
- LAN (eth1): Key-only authentication via SSH public key
- VPN (tun0): Key-only authentication via SSH public key
- WAN (eth0): Disabled by default; enabled only when `/data/config/ssh-wan-enabled` flag file exists
Physical Recovery
Rock64 keeps a separate physical break-glass path. If _RUT_OH_=1 is set in
U-Boot, the next boot starts a serial-only root autologin on ttyS2 and clears
that flag after use. This is a local recovery mechanism, not part of normal
network authentication.
Nixstasis Enrollment
The target remote-management model is Nixstasis-managed enrollment and access:
- The device identifies itself to Nixstasis using the `eth0` MAC address.
- Nixstasis checks that MAC against an approved inventory list.
- Approved devices receive a registration key and persist it on `/data`.
- Future device requests authenticate with that registration key.
- Nixstasis issues short-lived SSH credentials and establishes remote sessions over the reverse tunnel managed by the Nixstasis client.
The MAC address is an eligibility identifier, not a secret. The registration key is the first durable credential in the management flow.
Remote Management
Remote web access is intended to run from the Nixstasis environment rather than from services hosted directly on the device. The device remains responsible for SSH, LAN gateway services, update logic, and the Nixstasis client.
Device Identity
Each device is identified by the compact lowercase 12-hex MAC address of its onboard Ethernet (eth0). For example,
aa:bb:cc:dd:ee:ff becomes aabbccddeeff in the X-Device-ID header when polling for updates.
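A small Python sketch of that normalization. `device_id_from_mac` is an illustrative helper; accepting hyphen separators and uppercase input is an assumption of this sketch, not documented behavior:

```python
import re

_COMPACT_MAC = re.compile(r"[0-9a-f]{12}")


def device_id_from_mac(mac):
    """Normalize a MAC address to the compact lowercase 12-hex form."""
    compact = mac.strip().lower().replace(":", "").replace("-", "")
    if not _COMPACT_MAC.fullmatch(compact):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return compact
```

For example, `device_id_from_mac("aa:bb:cc:dd:ee:ff")` returns `"aabbccddeeff"`.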
SSH Configuration
services.openssh = {
enable = true;
settings = {
PermitRootLogin = "no";
PasswordAuthentication = false;
};
};
User authorized keys are read from /data/config/ssh-authorized-keys/<user>, which is populated during provisioning.
Admin re-apply signer keys are stored separately in /data/config/admin-signers.
Nixstasis Enrollment
AtomixOS is moving toward a Nixstasis-managed enrollment and remote-access model.
Bootstrap Flow
- The device boots with no embedded remote-management credential.
- The Nixstasis client identifies the device using the `eth0` MAC address.
- Nixstasis checks that MAC against an approved inventory list.
- If approved, Nixstasis returns a registration key.
- The device persists that registration key on `/data` for future authenticated requests.
- Nixstasis can then issue short-lived SSH credentials and establish remote sessions through the reverse tunnel managed by the device client.
Trust Model
- The MAC address is an identifier, not a secret.
- Inventory approval determines whether a device is eligible to enroll.
- The registration key is the first durable management credential.
- Short-lived SSH credentials are issued dynamically by Nixstasis and expire automatically.
Device Responsibilities
AtomixOS remains responsible for:
- local LAN gateway services (`dnsmasq`, `chrony`, firewall)
- SSH access for LAN/VPN recovery
- RAUC update and rollback flow
- persistent storage of enrollment state on `/data`
Remote web management is intended to be hosted by Nixstasis rather than directly by the device.
Building
All build outputs target aarch64-linux. Builds require an aarch64-linux builder – either the nix-darwin
linux-builder (recommended on macOS), a Lima VM, or a native Linux system.
Prerequisites
- Nix with flakes enabled
- mise for task running (recommended)
- An `aarch64-linux` builder (nix-darwin `linux-builder`, Lima VM, or native)
Building with mise
# Install tools and hooks
mise install
# Check the flake evaluates cleanly
mise run check
# Build individual artifacts
mise run build:squashfs # result-squashfs/
mise run build:rauc-bundle # result-rauc-bundle/
mise run build:boot-script # result-boot-script/
# Build everything and retain the latest image/bundle roots under .gcroots/
mise run build
mise run build refreshes the rooted build outputs under .gcroots/, keeps the
latest two distinct images and the latest two RAUC bundles, and can optionally
copy the newest .img to an explicit output path with -o <path>.
Building via Lima VM
All build tasks accept --lima to run inside a Lima VM. This is useful when the Lima VM has a warm Nix store cache or
when the nix-darwin linux-builder is not configured.
# Build the retained artifacts inside the default Lima VM
mise run build -- --lima
# Use a specific Lima VM
mise run build -- --lima --vm my-builder
# Build everything via Lima
mise run build -- --lima
The task ensures the Lima VM is started before building. The macOS home directory is mounted at the same path inside Lima, so the flake path works unchanged.
Build Artifacts
| Artifact | mise Task | Nix Output | Description |
|---|---|---|---|
| Squashfs rootfs | build:squashfs | packages.aarch64-linux.squashfs | Compressed root filesystem (~300 MB) |
| RAUC bundle | build:rauc-bundle | packages.aarch64-linux.rauc-bundle | Signed .raucb for OTA updates |
| Boot script | build:boot-script | packages.aarch64-linux.boot-script | Compiled U-Boot boot.scr |
| Disk image | build | packages.aarch64-linux.image | Latest .img rooted under .gcroots/images/image.1/ |
Building with Nix Directly
# Build the flashable image
nix build .#image -o result-image
# Build only the squashfs
nix build .#squashfs -o result-squashfs
Image Naming
The flashable image filename includes the pinned NixOS release series from flake.nix:
- Current: `atomixos-25.11.img` (from `nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11"`)
- Pattern: `atomixos-<series>.img`
When you move to a new NixOS series (e.g., nixos-26.05), update flake.nix/flake.lock and rebuild. The image name
updates automatically.
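The naming pattern can be illustrated with a short Python sketch. How the flake actually derives the series is not shown here, so the regular expression and the `image_name` helper below are assumptions for illustration only:

```python
import re


def image_name(nixpkgs_url):
    """Derive atomixos-<series>.img from a nixpkgs flake URL ending in nixos-<series>."""
    match = re.search(r"/nixos-(\d+\.\d+)$", nixpkgs_url)
    if match is None:
        raise ValueError(f"no NixOS release series in {nixpkgs_url!r}")
    return f"atomixos-{match.group(1)}.img"
```

`image_name("github:NixOS/nixpkgs/nixos-25.11")` returns `"atomixos-25.11.img"`.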
Squashfs Size Constraint
The squashfs image must fit within the 1 GB rootfs partition slot. The build script enforces this with a size check – the build fails if the image exceeds the limit. The current NixOS closure compresses to approximately 300-400 MB.
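A size gate of this kind is simple to express. The following Python sketch shows the idea under the assumption that the limit equals the 1024 MB slot size; it is not the project's build script, and `check_squashfs_size` is an illustrative name:

```python
import os

ROOTFS_SLOT_BYTES = 1024 * 1024 * 1024  # 1024 MB rootfs slot


def check_squashfs_size(path, limit=ROOTFS_SLOT_BYTES):
    """Return the image size in bytes, or raise if it does not fit the slot."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(
            f"{path}: {size} bytes exceeds the {limit}-byte rootfs slot"
        )
    return size
```

Failing the build at this point is cheaper than discovering an oversized image when RAUC writes it to a device.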
To keep the closure small, the flake uses an overlay to strip unnecessary dependencies:
- `crun` is built without CRIU support (removes `criu` + `python3`, saving ~102 MB)
- Documentation, man pages, fonts, and XDG utilities are all disabled
- `security.sudo` is disabled (uses `run0` instead)
- `environment.defaultPackages` is emptied
Testing
The core mise run e2e task runs 9 NixOS VM integration tests that validate the RAUC update lifecycle, network
security, and rollback behavior. Additional provisioning and forensics checks are also available directly under the
flake checks.* outputs. Tests run on both Linux (TCG software emulation) and macOS (Apple Virtualization Framework).
Running Tests
Provisioning package
cd scripts/atomixos_provision
uv run --extra dev pytest
uv run --extra dev ruff check .
These tests cover the Litestar API, SSH-signature auth helpers, config parsing, bundle import, Quadlet rendering/sync, activation, job tracking, and service foundation modules.
All tests
mise run e2e
# Run all tests inside a Lima VM
mise run e2e --lima
mise run e2e --lima --vm my-builder
Individual tests
mise run e2e:rauc-slots # RAUC sees all 4 A/B slots after boot
mise run e2e:rauc-update # Bundle install writes to inactive slot pair, slot switches A->B
mise run e2e:rauc-rollback # Install to B, mark bad, verify rollback to A
mise run e2e:rauc-confirm # os-verification health checks pass, slot marked good (~3 min)
mise run e2e:rauc-power-loss # Crash VM mid-install, verify slot A intact after reboot
mise run e2e:rauc-watchdog # Freeze systemd to trigger watchdog, verify boot-count rollback
mise run e2e:firewall # 2-node test: WAN allows HTTPS/VPN, LAN allows SSH/DHCP/NTP
mise run e2e:network-isolation # 2-node test: LAN gets DHCP/NTP, cannot reach WAN
mise run e2e:ssh-wan-toggle # Flag file enables/disables SSH on WAN via nftables reload
# Run an individual test inside Lima
mise run e2e:rauc-slots --lima
Test Descriptions
| Test | Nodes | What it validates |
|---|---|---|
| `rauc-slots` | 1 | RAUC detects all 4 A/B slots after first-boot repartitioning creates boot-b/rootfs-b |
| `rauc-update` | 1 | Bundle install writes to inactive slot pair; slot switches from A to B |
| `rauc-rollback` | 1 | Install to slot B, mark bad, verify automatic rollback to slot A |
| `rauc-confirm` | 1 | Health checks pass within timeout, slot committed as good |
| `rauc-power-loss` | 1 | Crash VM mid-install, verify slot A is intact after reboot |
| `rauc-watchdog` | 1 | Freeze systemd to trigger watchdog reboot, verify boot-count rollback |
| `firewall` | 2 | WAN node can reach HTTPS (443) and VPN (1194); LAN node can reach SSH, DHCP, NTP; all other ports blocked |
| `network-isolation` | 2 | LAN node gets DHCP lease and NTP, cannot reach WAN addresses |
| `ssh-wan-toggle` | 1 | SSH on WAN blocked by default; enabled when flag file created; disabled when removed |
Platform Performance
The mise task wrappers auto-detect the platform and select the correct flake output.
| Test | macOS (apple-virt) | Linux (TCG, Lima) | Speedup |
|---|---|---|---|
| `rauc-slots` | 34s | 132s | 3.9x |
| `rauc-update` | 25s | 137s | 5.5x |
| `rauc-rollback` | 22s | 120s | 5.5x |
| `rauc-confirm` | 95s | 171s | 1.8x |
| `rauc-power-loss` | 46s | 184s | 4.0x |
| `rauc-watchdog` | 57s | 315s | 5.5x |
| `firewall` | 65s | 205s | 3.2x |
| `network-isolation` | 68s | – | – |
| `ssh-wan-toggle` | 35s | – | – |
| Total | ~7.5 min | ~21 min | ~3.7x |
The rauc-confirm test has the smallest speedup because most of its runtime is a fixed 60-second sustained health check
timer.
Interactive Debugging
Bundle Test VM
Use vm:bundle-test to boot an interactive AtomixOS VM for exercising real
config.toml bundles without physical hardware:
mise run vm:bundle-test
Each launch uses a fresh temporary VM disk image. The VM runner is still built through Nix and reused from the store when inputs have not changed, but runtime state from previous bundle tests is discarded when the VM exits.
The VM uses the QEMU hardware profile with eth0 as WAN and a second virtio NIC
as LAN. Host ports are forwarded for common operator workflows:
| Host URL/Port | Guest service |
|---|---|
| `ssh -p 10022 admin@127.0.0.1` | SSH |
| `http://127.0.0.1:8080` | Bootstrap/reapply UI |
| `http://127.0.0.1:8081` | Caddy HTTP |
| `https://127.0.0.1:8443` | Caddy HTTPS |
Build the runner without launching it:
mise run vm:bundle-test --build-only
Apply a bundle from the host:
tar --zstd -cvf config.tar.zst -C example/caddy-oidc .
curl -F config_file=@config.tar.zst http://127.0.0.1:8080/apply
For domain-based Caddy examples, map the example domain to localhost while testing from the host:
curl -k --resolve gateway.example.com:8443:127.0.0.1 \
https://gateway.example.com:8443/cockpit/
For browser testing, add a local hosts entry and include the forwarded HTTPS port in the URL:
127.0.0.1 gateway.example.com
Then browse to https://gateway.example.com:8443/.
The Cockpit container generates both the real device origin and the VM forwarded
HTTPS origin from GATEWAY_DOMAIN because Cockpit accepts space-separated
origins.
SSH is key-only and uses the provisioned /data/config/ssh-authorized-keys
state. On a fresh VM, use the serial console or apply a bundle containing an
admin key before expecting ssh -p 10022 admin@127.0.0.1 to succeed.
Launch an interactive QEMU VM with a Python REPL:
# Debug the default test (rauc-slots)
mise run e2e:debug
# Debug a specific test
mise run e2e:debug -t update
mise run e2e:debug -t confirm
mise run e2e:debug -t watchdog
# Keep VM state between runs
mise run e2e:debug -t slots --keep
Available test short names: slots, update, rollback, confirm, power-loss, watchdog, firewall, net-iso,
ssh-toggle.
Inside the REPL:
gateway.start() # boot the VM
gateway.wait_for_unit("multi-user.target")
gateway.succeed("rauc status") # run a command
gateway.shell_interact() # drop into a root shell
gateway.screenshot("name") # save a screenshot
# Ctrl+D to exit
Running Tests with Nix Directly
# Linux (TCG, no KVM required)
nix build .#checks.aarch64-linux.rauc-slots --no-link -L
# macOS (requires nix-darwin with linux-builder enabled)
nix build .#checks.aarch64-darwin.rauc-slots --no-link -L
# Local Darwin eval/builds that depend on nix/tests/rauc-qemu-config.nix should
# use a path flake ref so local files remain visible even if they are untracked.
nix build "path:$PWD#checks.aarch64-darwin.rauc-slots" --no-link -L
When iterating on a single Darwin check locally, evaluate and build the exact
derivation with the same path: flake ref:
drv=$(nix eval --raw "path:$PWD#checks.aarch64-darwin.rauc-slots.drvPath")
nix-store -r "$drv"
Test Architecture
Tests use the NixOS test framework (nixos-lib.runTest). Each test:
- Defines one or two virtual machines with the full AtomixOS service stack (using `hardware-qemu.nix` instead of `hardware-rock64.nix`)
- Boots the VM(s) and runs a Python test script that interacts via QEMU’s monitor interface
- Asserts on command output, service states, and network behavior
The QEMU target uses a custom RAUC backend that simulates U-Boot’s slot selection using files instead of environment
variables, allowing the full A/B update lifecycle to be tested without real hardware. The shared slot mapping for the
RAUC tests lives in nix/tests/rauc-qemu-config.nix.
Provisioning
Deploy AtomixOS to a Rock64 device by building a flashable disk image and writing it
to eMMC with dd (or mise run flash).
After Provisioning
On first boot:
- U-Boot loads `boot.scr` from boot-a, echoes build ID, boots the kernel with initrd
- The initrd mounts the selected squashfs slot at `/run/rootfs-base`, then `sysroot.mount` assembles `/` as OverlayFS with a tmpfs-backed upper/work directory under `/run/overlay-root`
- Initrd `systemd-repart` creates the `/data` partition (f2fs) on first boot using the remaining eMMC space
- Initrd persists a fresh-flash marker so switched-root provisioning can distinguish a new flash from a later reprovisioned `/data` wipe
- `first-boot.service` looks for `/boot/config.toml` only on a fresh flash, then USB `config.toml`, then starts the bootstrap web console on WAN and LAN port `8080`; after provisioning it narrows to the LAN gateway endpoint and waits indefinitely for operator input when no seed is present
- The imported config is validated, persisted under `/data/config/`, rendered into canonical Quadlet files, and synced into the active rootful and rootless Quadlet paths
- `first-boot.service` applies Quadlets, LAN settings, and provisioned firewall rules, then marks the RAUC slot as good only if those runtime apply steps succeed
- Network interfaces come up (eth0 via DHCP, eth1 static); `systemd-networkd-wait-online` uses 30s timeout with `anyInterface=true`
- Services start: dnsmasq, chrony, sshd, and the RAUC update timer when RAUC is enabled
The device is then ready to receive OTA updates and serve LAN clients.
For the canonical persisted state and runtime schemas, see Firmware Data Flow and Runtime Boundaries.
Reprovisioning
Wiping /data returns the device to the unprovisioned state without changing the A/B slot layout.
On the next boot:
- Initrd sees that `boot-b` already exists, so it does not mark the boot as a fresh flash
- `/boot/config.toml` is not replayed
- `first-boot.service` searches USB `config.toml` sources first
- If no USB seed is found, the bootstrap web console starts on WAN and LAN port `8080`
Imported operator state remains bounded to /data/config/, including the imported config.toml, rendered Quadlet
files, admin SSH authorized keys, and other provisioning-derived runtime inputs.
Provisioning Service API
The bootstrap console is backed by a long-lived Litestar service. API routes are grouped by domain but still wired explicitly by the app factory:
| Route | Behavior |
|---|---|
| `GET /api/health` | Returns service liveness. |
| `GET /api/nonce` | Issues a single-use nonce for SSH-signature authentication. |
| `POST /api/validate` | Validates a config.toml or config bundle without applying it. |
| `POST /api/config` | Accepts a config source and returns 202 Accepted with a job URL. |
| `GET /api/jobs/{id}` | Returns current provisioning job status, events, result, and rollback state. |
Mutating apply jobs are single-flight. Clients poll the returned job URL for progress and final status.
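Single-flight here means that at most one apply job runs at a time. One common way to get that property is a lock-guarded "active job" slot, sketched below in Python. This is an illustration of the pattern only; the `SingleFlightJobs` class is hypothetical, and how the real service responds to an overlapping submission is not specified here:

```python
import threading
import uuid


class SingleFlightJobs:
    """Allow at most one active job; overlapping submissions are refused."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = None  # id of the running job, or None when idle

    def try_start(self):
        """Return a new job id, or None if another job is still running."""
        with self._lock:
            if self._active is not None:
                return None
            self._active = uuid.uuid4().hex
            return self._active

    def finish(self, job_id):
        """Release the slot, but only for the job that currently holds it."""
        with self._lock:
            if self._active == job_id:
                self._active = None
```

A handler built on this would call `try_start()` before queueing work and report a conflict when it returns `None`; the worker calls `finish()` once the job reaches a final state.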
USB Recovery Mode
If the reset button is held from power-on for 5 seconds, U-Boot enters USB mass storage mode instead of booting Linux. The Rock64 OTG USB port then exposes the full eMMC as a removable disk, allowing the host to write a fresh image directly.
Flashable Disk Image
Build a complete .img file that can be written to eMMC (or SD card) using dd or any raw disk writer.
Build the Image
# Build with mise (stores the latest image under .gcroots/images/image.1)
mise run build
# Copy the latest image to a specific output path
mise run build -- -o atomixos-25.11.img
# Build via Lima VM
mise run build -- --lima
# Or with Nix directly (result stays in Nix store, symlinked to result-image/)
nix build .#image -o result-image
Flash to eMMC
macOS
Connect the eMMC module via a USB adapter. Identify the device (usually /dev/disk4):
diskutil list
Flash using the mise task:
# Auto-detect image, specify target disk
mise run flash /dev/disk4
# Specify image explicitly
mise run flash -i atomixos-25.11.img /dev/disk4
# Skip confirmation prompt
mise run flash -y /dev/disk4
The flash task automatically:
- Converts `/dev/diskN` to `/dev/rdiskN` (raw device for faster writes)
- Unmounts all partitions on the target disk
- Refuses to write to the macOS boot disk
- Runs `dd` with `bs=4M` and progress reporting
- Syncs and ejects when done
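The device-path conversion in the first step is a plain string rewrite. A Python sketch of it (the flash task's own implementation may differ; `to_raw_device` is an illustrative name):

```python
import re


def to_raw_device(device):
    """Map /dev/diskN to /dev/rdiskN; leave any other path unchanged."""
    match = re.fullmatch(r"/dev/disk(\d+)", device)
    if match:
        return f"/dev/rdisk{match.group(1)}"
    return device
```

`to_raw_device("/dev/disk4")` returns `"/dev/rdisk4"`, while a Linux path such as `/dev/mmcblk0` passes through untouched.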
Linux
# With mise
mise run flash -y /dev/mmcblk0
# With dd directly
sudo dd if=atomixos-25.11.img of=/dev/mmcblk0 bs=4M status=progress
sudo sync
What’s in the Image
The flashable image contains:
| Region | Content |
|---|---|
| Raw (0-16 MB) | U-Boot (idbloader + u-boot.itb) |
| Partition 1 (boot-a) | Kernel Image, initrd, DTB, boot.scr (vfat) |
| Partition 2 (rootfs-a) | Squashfs root filesystem |
The image intentionally does not include slot B or /data. On first boot,
initrd systemd-repart creates boot-b (vfat), rootfs-b, and /data
(f2fs) using the remaining eMMC space before the real system mounts it.
First Boot Provisioning
The flashable image method does not embed credentials in the image. After
flashing, the device boots into the local provisioning flow and imports operator
configuration into /data/config/ from one of these sources:
- /boot/config.toml on a fresh flash
- USB config.toml or supported config bundle
- the bootstrap web console on WAN and LAN port 8080 until initial provisioning completes
When a new config.toml is applied through one of those paths, the device
persists it under /data/config/, writes admin SSH authorized keys, renders the
declared Quadlet units, and continues first boot without requiring a second
reboot.
Reprovisioning is done by wiping /data and rebooting. Because initrd only
treats /boot/config.toml as a seed on a true fresh flash, reprovisioning uses
USB config.toml first and then falls back to the bootstrap UI instead of
replaying an old /boot/config.toml.
The image keeps root locked and does not ship a built-in operator account. On Rock64,
_RUT_OH_=1 enables a deterministic serial-only root recovery path on UART2
(ttyS2, 1.5 Mbaud) for the next boot.
LAN Range Configuration
The default LAN subnet is 172.20.30.0/24 with the gateway at 172.20.30.1. To change this, use the config:lan-range
mise task, which updates all configuration files in a single command.
Usage
mise run config:lan-range \
--gateway-cidr 10.50.0.1/24 \
--dhcp-start 10.50.0.10 \
--dhcp-end 10.50.0.254
What it Updates
The task modifies the following files to keep the LAN configuration consistent:
| File | What Changes |
|---|---|
| modules/networking.nix | eth1 static Address |
| modules/lan-gateway.nix | dnsmasq dhcp-range, gateway DHCP option (3), NTP DHCP option (42), chrony allow subnet |
| scripts/os-verification.sh | Expected eth1 IP in health checks |
After Changing
Rebuild:
mise run check
mise run build
Constraints
- Only /24 subnets are currently supported
- DHCP start and end addresses must be within the specified subnet
- The gateway address (first part of --gateway-cidr) is used as the static IP for eth1
Firmware Data Flow
AtomixOS keeps immutable firmware, provisioned runtime state, and update state in separate paths so A/B slot switches do not rewrite operator data.
Boot Flow
- U-Boot RAUC bootmeth selects the slot using BOOT_ORDER and BOOT_x_LEFT from the SPI environment.
- boot.scr loads kernel, initrd, and DTB from the selected boot partition.
- boot.scr passes root=fstab, rauc.slot, and atomixos.lowerdev to Linux.
- Initrd mounts the selected squashfs rootfs as /run/rootfs-base.
- sysroot.mount assembles / as OverlayFS with squashfs lowerdir and tmpfs upper/work dirs.
- Initrd systemd-repart creates missing boot-b, rootfs-b, and /data partitions on a fresh flash.
Provisioning Flow
Provisioning imports exactly one operator configuration into /data/config/ from /boot/config.toml on fresh flash, a
USB seed, a supported seed bundle, or the LAN bootstrap console.
Persisted outputs are:
| Output | Path |
|---|---|
| Imported source config | /data/config/config.toml |
| Managed users | /data/config/users.json |
| User SSH keys | /data/config/ssh-authorized-keys/<user> |
| WAN inbound policy | /data/config/firewall-inbound.json |
| LAN runtime settings | /data/config/lan-settings.json |
| OS upgrade settings | /data/config/os-upgrade.json |
| Required health units | /data/config/health-required.json |
| Rendered Quadlets | /data/config/quadlet/*.container |
| Quadlet runtime metadata | /data/config/quadlet-runtime.json |
| Managed user tracking | /data/config/managed-users.json |
| Bundle payload files | /data/config/files/ |
first-boot.service fails before RAUC slot confirmation if Quadlet sync, LAN runtime apply, or provisioned firewall apply
fails.
Re-Apply Flow
Mutating bootstrap POST paths on an already-provisioned device require SSH signature authentication. The operator
requests a nonce via GET /api/nonce, then signs a request-bound message containing the nonce, target path, and
SHA-256 digest of the submitted config payload (ssh-keygen -Y sign -n atomixos-reapply). The request includes the
nonce and base64 signature in the X-AtomixOS-Nonce and X-AtomixOS-Signature headers. Nonces are single-use and
expire after 5 minutes (configurable via ATOMIXOS_NONCE_TTL).
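How the request-bound message might be assembled before signing can be sketched as follows. The source states only which fields the message contains; the one-field-per-line layout and the `build_reapply_message` helper are assumptions for illustration, so the device's actual wire format must be checked before use.

```python
import hashlib


def build_reapply_message(nonce: str, target_path: str, payload: bytes) -> bytes:
    """Bind the nonce, target path, and payload digest into one message to sign."""
    digest = hashlib.sha256(payload).hexdigest()
    # Hypothetical layout: one field per line. The real format may differ.
    return f"{nonce}\n{target_path}\n{digest}\n".encode("utf-8")
```

The resulting bytes would then be signed with the operator key, for example by piping them to `ssh-keygen -Y sign -f <private-key> -n atomixos-reapply`, which signs standard input when no file argument is given.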
Re-apply uses atomic candidate promotion:
- Validate and render candidate config in /data/config-candidate/.
- Rename active /data/config to /data/config-rollback.
- Rename candidate to /data/config.
- Run activation services synchronously (user apply, Quadlet sync, LAN apply, firewall).
- On success, clean up /data/config-rollback.
- On failure, restore /data/config-rollback to /data/config and re-activate.
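The rename-based promotion and rollback sequence can be sketched as follows. This is an illustrative sketch rather than the project's implementation: the `promote_candidate` helper and its `activate` callback are hypothetical, validation and rendering of the candidate are assumed to have happened already, and discarding the failed candidate on rollback is an assumption.

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def promote_candidate(root: Path, activate: Callable[[Path], None]) -> bool:
    """Promote root/config-candidate to root/config; roll back if activation fails."""
    active = root / "config"
    candidate = root / "config-candidate"
    rollback = root / "config-rollback"

    os.rename(active, rollback)      # keep the previous config for rollback
    os.rename(candidate, active)     # candidate becomes the active config
    try:
        activate(active)             # run activation synchronously
    except Exception:
        # Assumption: the failed candidate is discarded before restoring.
        shutil.rmtree(active)
        os.rename(rollback, active)
        activate(active)             # re-activate the restored config
        return False
    shutil.rmtree(rollback)          # success: the rollback copy is no longer needed
    return True
```

Renaming directories on the same filesystem avoids a window in which `/data/config` holds a half-written mix of old and new files.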
POST /api/config is asynchronous for programmatic clients. It returns a typed response with job_id, state, and
job_url; the Location header points to the same job resource. The job records provisioning steps, service
deployment/status events, activation failures, final result, and rollback status.
First provisioning (no existing config.toml) remains unauthenticated and writes directly.
Managed Users Flow
atomixos-apply-users.service materializes managed users from /data/config/users.json on every boot and after
re-apply. It runs before sshd.service so accounts exist before SSH accepts connections. Admin users are added to the
wheel group. Users removed from the config are locked (expiredate=1, shell=/sbin/nologin). Protected image users
(root, appsvc) are never created or locked by this service.
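The decision logic described above can be sketched as a pure function. The `plan_user_changes` helper and the shape of its inputs are hypothetical; the source does not show the structure of `users.json`, so a mapping of user name to settings with an `isAdmin` flag is assumed.

```python
PROTECTED_USERS = frozenset({"root", "appsvc"})


def plan_user_changes(desired: dict[str, dict], previously_managed: set[str]) -> dict[str, list[str]]:
    """Decide which accounts to ensure, add to wheel, or lock."""
    ensure = sorted(u for u in desired if u not in PROTECTED_USERS)
    wheel = sorted(u for u in ensure if desired[u].get("isAdmin"))
    # Users that were managed before but are no longer in the config get locked.
    lock = sorted(
        u for u in previously_managed
        if u not in desired and u not in PROTECTED_USERS
    )
    return {"ensure": ensure, "wheel": wheel, "lock": lock}
```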
Update Flow
os-upgrade.service reads /data/config/os-upgrade.json and skips polling cleanly when no provisioned update server is
configured. When polling is configured, it sends
the compact lowercase 12-hex eth0 MAC in X-Device-ID, compares available bundle metadata with the booted version,
downloads the bundle to /data, installs it with RAUC, and reboots into the newly selected slot.
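A sketch of how the `X-Device-ID` value could be derived from a MAC address string follows. The `device_id_from_mac` helper is hypothetical and only illustrates the "compact lowercase 12-hex" format; the separators it strips are an assumption.

```python
import re


def device_id_from_mac(mac: str) -> str:
    """Normalize a MAC such as 'AA:BB:CC:00:11:22' to 'aabbcc001122'."""
    compact = re.sub(r"[:\-.]", "", mac.strip()).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", compact):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return compact
```

On Linux the input would typically come from `/sys/class/net/eth0/address`.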
os-verification.service commits a slot only after service, network, LAN, and required-unit checks remain healthy through
the sustained verification window.
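The idea of a sustained verification window can be sketched as a loop that requires every check to pass until the window has elapsed. The `healthy_for_window` helper is hypothetical, and treating a single failed check as an overall failure (rather than restarting the window) is an assumption about the real service.

```python
import time
from typing import Callable


def healthy_for_window(
    check: Callable[[], bool],
    window: float = 60.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if check() stays healthy for the whole window."""
    start = clock()
    while True:
        if not check():
            return False             # any unhealthy sample fails verification
        if clock() - start >= window:
            return True              # healthy for the full window: safe to commit
        sleep(interval)
```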
Firewall and LAN Apply Flow
lan-gateway-apply.service consumes /data/config/lan-settings.json, writes the eth1 network drop-in, updates dnsmasq
and chrony runtime snippets, and restarts the affected services. provisioned-firewall-inbound.service consumes
/data/config/firewall-inbound.json and applies the requested WAN and LAN nftables rules for the configured scopes.
WAN remains deny-by-default unless explicitly opened. LAN is open by default, but an explicit lan scope replaces that
default-open rule with an allowlist of the configured ports merged with platform-required LAN ports.
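The LAN rule selection can be sketched as follows. The `effective_lan_ports` helper is hypothetical, the set of platform-required ports is passed in because the source does not list it, and treating an empty port list the same as an absent lan scope is an assumption.

```python
from typing import Optional


def effective_lan_ports(configured: Optional[list[int]], platform_required: set[int]) -> Optional[list[int]]:
    """Return None for default-open LAN, or the merged allowlist of ports."""
    if not configured:
        return None                  # no explicit lan scope: LAN stays open
    # Explicit scope: configured ports merged with platform-required ports.
    return sorted(set(configured) | platform_required)
```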
Application Runtime Flow
Provisioned Quadlets are rendered under /data/config/quadlet/, mirrored into the active rootful or rootless systemd
Quadlet search path, and described by /data/config/quadlet-runtime.json. Rootless containers are constrained to pasta
networking with loopback publish rewrites; privileged rootful containers use host networking.
Runtime Boundaries
AtomixOS separates immutable platform code from operator-provisioned runtime behavior.
Immutable Platform
The image owns boot, kernel, initrd, RAUC, firewall defaults, SSH policy, local provisioning, LAN gateway services, OpenVPN recovery plumbing, and update confirmation logic. These live in the active squashfs slot and are replaced only by RAUC updates.
Persistent Operator State
/data/config/ owns runtime configuration imported during provisioning. RAUC slot writes do not modify /data.
Before initial provisioning, the bootstrap API is reachable on WAN and LAN and exposes POST /api/config for complete
config.toml files or supported config bundles. First-boot Boot UI submissions use a CSRF bootstrap token, not operator
authentication; first-boot programmatic /api/config submissions do not require that UI token. After provisioning, the
bootstrap API narrows to the LAN gateway endpoint. It uses the same validation, candidate promotion, activation, and
rollback path as the web console. Programmatic
clients receive 202 Accepted with job_id, initial state, job_url, and a Location: /api/jobs/{id} header, then
poll the job resource for final success, failure, rollback status, and service deployment events.
The API routes retain operation IDs and domain tags in code, and the production bootstrap service exposes live OpenAPI schema routes for online clients. Response bodies are typed in the provisioning package schemas while preserving the current JSON shapes.
The accepted config.toml schema is:
version = 1
[users.admin]
isAdmin = true
ssh_key = "ssh-ed25519 ..."
[network.firewall.inbound.wan]
tcp = [443]
udp = [1194]
[network.dnsmasq]
gateway_cidr = "172.20.30.1/24"
dhcp_start = "172.20.30.10"
dhcp_end = "172.20.30.254"
domain = "local"
gateway_aliases = ["atomixos"]
hostname_pattern = "atomixos-{mac}"
[network.ntp]
servers = ["time.cloudflare.com"]
[activation]
required = ["myapp"]
[containers.container.myapp]
privileged = false
[containers.container.myapp.Container]
Image = "ghcr.io/example/myapp:latest"
PublishPort = ["10080:8080"]
WAN ports stay deny-by-default unless listed. LAN stays open by default; if [network.firewall.inbound.lan] is
present with any ports, LAN switches to an explicit allowlist for only those ports. [network.dnsmasq] is optional;
omitted fields use the fallback LAN gateway contract. [network.ntp] is optional and defaults to Cloudflare NTP.
The machine-readable schema is committed at
schemas/config.schema.json and the import path validates against it before semantic checks.
Firewall JSON
/data/config/firewall-inbound.json is a JSON object with optional wan and lan objects. Each scope may contain
optional tcp and udp arrays of integer ports in 1..65535.
{
"wan": {
"tcp": [443],
"udp": [1194]
},
"lan": {
"tcp": [443]
}
}
Provisioned rules are added to WAN eth0 or LAN eth1 only when the matching scope is present. WAN remains
deny-by-default for new inbound traffic. LAN is open by default, but an explicit lan scope replaces that default-open
rule with the configured allowlist. Forwarding remains dropped.
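A structural check for this file can be sketched as follows. The `validate_firewall_inbound` helper is hypothetical, and rejecting unknown scopes or protocol keys is an assumption; the committed schema may be more permissive.

```python
import json


def validate_firewall_inbound(text: str) -> dict:
    """Parse firewall-inbound.json and check scopes, protocols, and port ranges."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("top level must be a JSON object")
    for scope, rules in doc.items():
        if scope not in ("wan", "lan"):
            raise ValueError(f"unknown scope: {scope}")
        if not isinstance(rules, dict):
            raise ValueError(f"{scope} must be an object")
        for proto, ports in rules.items():
            if proto not in ("tcp", "udp"):
                raise ValueError(f"unknown protocol: {proto}")
            if not isinstance(ports, list):
                raise ValueError(f"{scope}.{proto} must be an array")
            for port in ports:
                # bool is a subclass of int in Python, so reject it explicitly.
                if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
                    raise ValueError(f"invalid port in {scope}.{proto}: {port!r}")
    return doc
```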
LAN JSON
/data/config/lan-settings.json is generated from config.toml and includes the validated runtime fields consumed by
lan-gateway-apply.py.
{
"gateway_cidr": "172.20.30.1/24",
"gateway_ip": "172.20.30.1",
"subnet_cidr": "172.20.30.0/24",
"netmask": "255.255.255.0",
"dhcp_start": "172.20.30.10",
"dhcp_end": "172.20.30.254",
"domain": "local",
"hostname_pattern": "atomixos-{mac}",
"gateway_aliases": ["atomixos"]
}
The DHCP range must stay inside the /24 gateway subnet, must be ordered, and must not include the gateway IP.
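The derived fields and the range rules can be sketched with Python's `ipaddress` module. The `derive_lan_settings` helper is hypothetical and covers only the address-related keys; it is not the project's generator.

```python
import ipaddress


def derive_lan_settings(gateway_cidr: str, dhcp_start: str, dhcp_end: str) -> dict:
    """Derive gateway_ip, subnet_cidr, and netmask, and enforce the DHCP range rules."""
    iface = ipaddress.IPv4Interface(gateway_cidr)
    net = iface.network
    start = ipaddress.IPv4Address(dhcp_start)
    end = ipaddress.IPv4Address(dhcp_end)

    if net.prefixlen != 24:
        raise ValueError("only /24 subnets are supported")
    if start not in net or end not in net:
        raise ValueError("DHCP range must stay inside the gateway subnet")
    if start > end:
        raise ValueError("DHCP range must be ordered")
    if start <= iface.ip <= end:
        raise ValueError("DHCP range must not include the gateway IP")

    return {
        "gateway_cidr": gateway_cidr,
        "gateway_ip": str(iface.ip),
        "subnet_cidr": str(net),
        "netmask": str(iface.netmask),
        "dhcp_start": str(start),
        "dhcp_end": str(end),
    }
```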
Quadlet Safety Boundary
Provisioned containers are rendered into canonical Quadlet files under /data/config/quadlet/ before being synced into
Podman systemd search paths.
Rootful containers require privileged = true and are forced onto Network=host. Rootless containers use the appsvc
user, are forced onto Network=pasta, and non-loopback PublishPort binds are rewritten to 127.0.0.1.
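The network and publish rewrites can be sketched as follows. Both helpers are hypothetical, and the sketch handles only IPv4 `PublishPort` forms (`port`, `host:container`, `ip:host:container`, each with an optional `/protocol` suffix); bracketed IPv6 binds are rejected rather than rewritten.

```python
LOOPBACK = "127.0.0.1"


def rewrite_publish_port(spec: str) -> str:
    """Force a PublishPort value to bind on loopback."""
    ports, sep, proto = spec.partition("/")
    parts = ports.split(":")
    if len(parts) == 1:              # "8080": container port only
        parts = [LOOPBACK, "", parts[0]]
    elif len(parts) == 2:            # "10080:8080": no bind address given
        parts = [LOOPBACK, parts[0], parts[1]]
    elif len(parts) == 3:            # "ip:10080:8080": replace non-loopback binds
        if not parts[0].startswith("127."):
            parts[0] = LOOPBACK
    else:
        raise ValueError(f"unsupported PublishPort form: {spec!r}")
    return ":".join(parts) + sep + proto


def apply_boundary(container: dict, privileged: bool) -> dict:
    """Return a copy of the [Container] keys with the network mode forced."""
    out = dict(container)
    if privileged:
        out["Network"] = "host"      # rootful containers use host networking
    else:
        out["Network"] = "pasta"     # rootless containers use pasta
        if "PublishPort" in container:
            out["PublishPort"] = [rewrite_publish_port(p) for p in container["PublishPort"]]
    return out
```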
Bundle imports may include files/; Quadlet values may reference ${CONFIG_DIR} and ${FILES_DIR} to bind files from
/data/config/ without embedding host-specific absolute paths in the seed.
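Token expansion can be sketched with `string.Template`. The `expand_tokens` helper is hypothetical, and mapping `${FILES_DIR}` to the `files/` directory under the config directory is an assumption based on the persisted-outputs table.

```python
from string import Template


def expand_tokens(value: str, config_dir: str = "/data/config") -> str:
    """Replace ${CONFIG_DIR} and ${FILES_DIR}; leave unknown tokens untouched."""
    mapping = {"CONFIG_DIR": config_dir, "FILES_DIR": f"{config_dir}/files"}
    # safe_substitute keeps unrecognized placeholders instead of raising.
    return Template(value).safe_substitute(mapping)
```

Note that `string.Template` also expands the brace-less form `$FILES_DIR` and treats `$$` as an escaped dollar sign, which may or may not match the real renderer.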
Operational Unknowns
These items are intentionally outside the current firmware contract and must be resolved before changing the contract.
| Area | Current State | Resolution Needed |
|---|---|---|
| Active watchdog enforcement | Hardware driver is present; systemd manager watchdog settings are disabled | Complete Rock64 boot reliability validation, then enable RuntimeWatchdogSec=30s and RebootWatchdogSec=10min |
| USB WiFi | Kernel WiFi and Bluetooth stacks are disabled in the current image | Select supported hardware and firmware, then update kernel config, tests, and docs |
| hawkBit updates | useHawkbit disables polling and installs rauc-hawkbit-updater only | Define server configuration, credentials, systemd unit, and verification tests |
| Nixstasis client | Device-side state paths and management model are documented | Implement enrollment client, tunnel lifecycle, and credential rotation |
| Provisioned applications | AtomixOS renders and starts Quadlets from operator config | Define fleet policy for image provenance, registry auth, and rollout approval |
OIDC-Authenticated Device Management
This tutorial builds an OIDC-authenticated management stack on AtomixOS using three components:
- Caddy with AuthCrunch – reverse proxy with Microsoft Entra OIDC login and JWT-based authorization
- Cockpit-ws – browser-based device management console
- Admin-only route policy – allows administrators to reach Cockpit while leaving room for user-facing application routes
The result is a single sign-on flow: users authenticate once through Entra ID, and Caddy only exposes the Cockpit management console to admin users.
This tutorial is designed for local device management on a LAN. Caddy uses its internal certificate authority instead of Let’s Encrypt, so the device does not need a publicly routed domain or inbound internet access.
Prerequisites
The example bundle uses Microsoft Entra by default because Entra group claims map cleanly to admin/user roles. You can use any AuthCrunch-supported OIDC provider by changing the Caddyfile identity provider block, callback URI, and role mapping rules.
Microsoft Entra App Registration
- In the Azure portal, open Microsoft Entra ID > App registrations
- Select New registration
- Set the redirect URI to:
  https://<GATEWAY_DOMAIN>/auth/oauth2/azure/authorization-code-callback
- Note the Application (client) ID and Directory (tenant) ID
- Under Certificates & secrets, create a new client secret and copy its value
- Under Token configuration > Add groups claim, select Security groups
- Create two Entra security groups:
  - AtomixOS-Admins – full device administration
  - AtomixOS-Users – read-only monitoring access
- Assign users to the appropriate groups
Google OAuth Client
For Google, create an OAuth client in Google Cloud Console instead of an Entra app registration:
- Open APIs & Services > Credentials
- Create an OAuth client ID for a web application
- Add this authorized redirect URI:
  https://<GATEWAY_DOMAIN>/auth/oauth2/google/authorization-code-callback
- Note the client ID and client secret
- Decide how to assign admin access. Common options are a Google Workspace group claim, a hosted-domain claim, or an explicit email allow-list in the AuthCrunch transform rules.
Then replace the Entra identity provider block in the Caddyfile with a Google
provider block. The exact AuthCrunch driver options may vary by AuthCrunch
version; the important values are the provider name (google), realm
(google), client ID, client secret, and callback URI path.
oauth identity provider google {
realm google
driver google
client_id {env.GOOGLE_CLIENT_ID}
client_secret {env.GOOGLE_CLIENT_SECRET}
scopes openid email profile
}
Also update enable identity provider azure to enable identity provider google, and update the transform rules from match realm azure to match realm google.
Architecture
graph TD
lan((Local LAN browser)) -- "ports 80, 443" --> caddy
subgraph caddy["Caddy + AuthCrunch"]
ca1["/auth* → OIDC portal"]
ca2["/cockpit/* → admin-only reverse proxy"]
ca3["/app/* → user application routes"]
end
caddy -- "localhost:9090" --> cockpit
subgraph cockpit["Cockpit-ws"]
co1["--local-session"]
co2["cockpit-bridge"]
co3["host D-Bus and Podman sockets"]
end
Authentication Flow
- User navigates to https://<GATEWAY_DOMAIN>/cockpit/
- Caddy checks for a valid JWT cookie; if absent, redirects to /auth/
- AuthCrunch initiates Entra OIDC login
- After authentication, AuthCrunch maps Entra groups to roles:
  - AtomixOS-Admins group receives the authp/admin role
  - AtomixOS-Users group receives the authp/user role
- AuthCrunch issues a JWT cookie with the mapped roles
- Caddy validates the JWT and allows /cockpit/* only for authp/admin
- Cockpit runs behind Caddy with --local-session; Cockpit performs no second login and relies on Caddy for authentication and authorization
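The role mapping and route check in this flow can be modeled as follows. This is an illustrative model only, not AuthCrunch code: `map_roles`, `may_access`, and `ROUTE_POLICIES` are hypothetical names, the `/app/` entry is an assumed example route, and the model assumes every authenticated user receives `authp/user`, with admin group members additionally receiving `authp/admin`.

```python
# Route prefix -> roles allowed to reach it (first matching prefix wins).
ROUTE_POLICIES = [
    ("/cockpit/", {"authp/admin"}),
    ("/app/", {"authp/admin", "authp/user"}),
]


def map_roles(groups: set[str], admin_group: str = "AtomixOS-Admins") -> set[str]:
    """Map identity-provider groups to portal roles."""
    roles = {"authp/user"}           # base role for any authenticated user
    if admin_group in groups:
        roles.add("authp/admin")     # promote members of the admin group
    return roles


def may_access(path: str, roles: set[str]) -> bool:
    """Allow the request only if a policy matches and the roles intersect."""
    for prefix, allowed in ROUTE_POLICIES:
        if path.startswith(prefix):
            return bool(roles & allowed)
    return False                     # no policy: deny in this model
```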
Bundle Structure
example/caddy-oidc/
config.toml
files/
caddy/
Caddyfile
cockpit/
Containerfile
Substitute the placeholder values in config.toml, package this directory, and
provision the device. The Caddyfile and generated Cockpit configuration read
those values from container environment variables.
Placeholder Values
Replace these values before provisioning:
| Placeholder | Where | Description |
|---|---|---|
| <SSH_PUBLIC_KEY> | config.toml | Your SSH public key for admin access |
| <AZURE_TENANT_ID> | config.toml | Entra directory (tenant) ID |
| <AZURE_CLIENT_ID> | config.toml | App registration client ID |
| <AZURE_CLIENT_SECRET> | config.toml | App registration client secret |
| <JWT_SHARED_KEY> | config.toml | Shared HMAC-SHA256 signing key |
| <GATEWAY_DOMAIN> | config.toml | Local DNS name for the device |
| <ENTRA_ADMIN_GROUP_NAME> | config.toml | Entra group name for admin role |
If you switch to Google or another provider, replace the Azure placeholders with that provider’s client ID/secret variables and update the Caddyfile environment entries accordingly.
Generate the JWT shared key with:
openssl rand -base64 32
Configuration Files
Local DNS and TLS
The browser must resolve <GATEWAY_DOMAIN> to the device’s LAN address. Use one
of these local options:
- Add a DNS record on your LAN router or development DNS server
- Add a hosts-file entry on the workstation you use to manage the device
- Use another local name resolution mechanism that maps the name to the device IP address
For example, if the gateway is reachable at 172.20.30.1:
172.20.30.1 gateway.example.com
Caddy serves HTTPS for this name with tls internal. That avoids public ACME
validation and works even when the domain is not reachable from the internet.
Browsers will not trust Caddy’s local CA by default; either trust the Caddy root
CA from the caddy-data volume on your management workstation or accept the
browser warning for local testing.
config.toml
The config defines two rootful containers, a network, a volume, and a build:
version = 1
[users.admin]
isAdmin = true
ssh_key = "<SSH_PUBLIC_KEY>"
[network.firewall.inbound.wan]
tcp = [80, 443]
[network.ntp]
servers = ["time.cloudflare.com"]
[activation]
required = ["caddy-gateway", "cockpit-ws"]
# -- Networks --------------------------------------------------------
[containers.network.management]
[containers.network.management.Network]
Subnet = "10.89.1.0/24"
# -- Volumes ---------------------------------------------------------
[containers.volume.caddy-data]
[containers.volume.caddy-data.Volume]
Driver = "local"
# -- Builds ----------------------------------------------------------
[containers.build.cockpit-ws]
[containers.build.cockpit-ws.Build]
File = "${FILES_DIR}/cockpit/Containerfile"
ImageTag = "localhost/cockpit-ws:latest"
Network = "host"
# -- Containers ------------------------------------------------------
[containers.container.caddy-gateway]
privileged = true
[containers.container.caddy-gateway.Unit]
Description = "Caddy gateway with AuthCrunch OIDC"
[containers.container.caddy-gateway.Container]
Image = "ghcr.io/authcrunch/authcrunch:latest"
Environment = [
"GATEWAY_DOMAIN=<GATEWAY_DOMAIN>",
"AZURE_TENANT_ID=<AZURE_TENANT_ID>",
"AZURE_CLIENT_ID=<AZURE_CLIENT_ID>",
"AZURE_CLIENT_SECRET=<AZURE_CLIENT_SECRET>",
"ENTRA_ADMIN_GROUP_NAME=<ENTRA_ADMIN_GROUP_NAME>",
"JWT_SHARED_KEY=<JWT_SHARED_KEY>",
]
Volume = [
"${FILES_DIR}/caddy/Caddyfile:/etc/caddy/Caddyfile:ro",
"${FILES_DIR}/caddy/ui:/etc/caddy/ui:ro",
"caddy-data:/data",
]
[containers.container.caddy-gateway.Install]
WantedBy = ["multi-user.target"]
[containers.container.cockpit-ws]
privileged = true
[containers.container.cockpit-ws.Unit]
Description = "Cockpit web console behind OIDC"
After = ["cockpit-ws-build.service"]
Requires = ["cockpit-ws-build.service"]
[containers.container.cockpit-ws.Container]
Image = "localhost/cockpit-ws:latest"
Pull = "never"
PodmanArgs = ["--pid=host", "--privileged"]
Environment = [
"GATEWAY_DOMAIN=<GATEWAY_DOMAIN>",
]
Volume = [
"/run/dbus/system_bus_socket:/run/dbus/system_bus_socket",
"/run/podman/podman.sock:/run/podman/podman.sock",
"/run/systemd:/run/systemd",
"/run/udev:/run/udev:ro",
"/:/host",
"/var/log/journal:/var/log/journal:ro",
"/etc/os-release:/etc/os-release:ro",
]
[containers.container.cockpit-ws.Install]
WantedBy = ["multi-user.target"]
Key points:
- Caddy is privileged = true because it binds ports 80/443
- Cockpit-ws is privileged = true because it runs a local management session with host D-Bus, systemd, journal, and Podman sockets mounted into the container
- The cockpit-ws container depends on its build service via After and Requires
- Pull = "never" prevents Podman from trying to fetch the locally built localhost/cockpit-ws:latest tag from a registry
- The cockpit-ws build uses Network = "host" to avoid Podman build-time netavark/nftables setup on constrained device images
- The ${FILES_DIR} token is replaced at provision time with the path to the extracted bundle files
- GATEWAY_DOMAIN is passed to both containers; Caddy uses it for the site address and Cockpit uses it to generate the real-device and VM-forwarded origins in /etc/cockpit/cockpit.conf
- Caddy uses tls internal, so HTTPS is local-only and does not require public DNS or Let’s Encrypt validation
- The management network is defined for future use when containers move off host networking
Caddyfile
{
http_port 80
https_port 443
admin off
order authenticate before respond
order authorize before basicauth
security {
oauth identity provider azure {
realm azure
driver azure
tenant_id {env.AZURE_TENANT_ID}
client_id {env.AZURE_CLIENT_ID}
client_secret {env.AZURE_CLIENT_SECRET}
scopes openid email profile
}
authentication portal myportal {
crypto default token lifetime 3600
crypto key sign-verify {env.JWT_SHARED_KEY}
enable identity provider azure
ui {
theme basic
template login /etc/caddy/ui/login.template
template portal /etc/caddy/ui/portal.template
template generic /etc/caddy/ui/generic.template
custom css path /etc/caddy/ui/atomixos-auth.css
custom js path /etc/caddy/ui/atomixos-auth.js
static_asset "assets/images/atomixos-logo.png" "image/png" /etc/caddy/ui/atomixos-logo.png
static_asset "assets/images/cockpit.svg" "image/svg+xml" /etc/caddy/ui/cockpit.svg
static_asset "assets/images/microsoft-entra.svg" "image/svg+xml" /etc/caddy/ui/microsoft-entra.svg
static_asset "assets/images/user.svg" "image/svg+xml" /etc/caddy/ui/user.svg
links {
"Admin Console" /cockpit/ icon "las la-server"
}
}
transform user {
match realm azure
action add role authp/user
}
transform user {
match realm azure
match roles {$ENTRA_ADMIN_GROUP_NAME}
action add role authp/admin
}
}
authorization policy user-policy {
set auth url /auth/
set access_token cookie name AUTHP_ACCESS_TOKEN
crypto key verify {env.JWT_SHARED_KEY}
allow roles authp/admin authp/user
validate bearer header
inject headers with claims
}
authorization policy admin-policy {
set auth url /auth/
set access_token cookie name AUTHP_ACCESS_TOKEN
crypto key verify {env.JWT_SHARED_KEY}
allow roles authp/admin
validate bearer header
inject headers with claims
}
}
}
{$GATEWAY_DOMAIN} {
tls internal
redir / /cockpit/ 302
redir /cockpit /cockpit/ 302
header /auth/assets/* {
Cache-Control "no-store, no-cache, must-revalidate"
Pragma "no-cache"
Expires "0"
defer
}
route /auth* {
authenticate with myportal
}
route /cockpit/* {
header Cache-Control "no-store"
header Pragma "no-cache"
authorize with admin-policy
reverse_proxy localhost:9090 {
header_up Authorization "Bearer {http.request.cookie.AUTHP_ACCESS_TOKEN}"
}
}
# Add user-facing applications here. They can use user-policy to allow
# both admin and user roles.
# route /app/* {
# authorize with user-policy
# reverse_proxy localhost:8080
# }
}
Key points:
- The order directives register the authenticate and authorize handlers
- The identity provider block configures Entra OIDC via the azure driver
- The portal issues JWTs signed with the shared key
- The portal explicitly lists Cockpit as an application link; AuthCrunch does not discover Caddy routes automatically
- transform user blocks assign base roles (authp/user) and promote admin group members to authp/admin
- admin-policy restricts /cockpit/* to authp/admin
- user-policy is provided for user-facing applications that should allow both authp/admin and authp/user
- tls internal tells Caddy to issue a certificate from its local CA instead of using public ACME/Let’s Encrypt
- / redirects to /cockpit/, and /cockpit normalizes to /cockpit/
- GATEWAY_DOMAIN and ENTRA_ADMIN_GROUP_NAME come from container environment variables set in config.toml
Containerfile
FROM quay.io/fedora/fedora:42
RUN dnf install -y --setopt=install_weak_deps=False \
cockpit-bridge \
cockpit-files \
cockpit-podman \
cockpit-system \
cockpit-ws \
openssh-clients \
podman \
&& dnf clean all
COPY rootfs/ /
RUN chmod 0755 \
/usr/local/bin/cockpit-auth-atomixos \
/usr/local/bin/cockpit-beiboot-bridge \
/usr/local/bin/start-cockpit \
&& mkdir -p /usr/share/cockpit/branding/default /usr/share/cockpit/branding/fedora \
&& ln -sf ../../static/atomixos.css /usr/share/cockpit/branding/default/branding.css \
&& ln -sf ../../static/atomixos.css /usr/share/cockpit/branding/fedora/branding.css \
&& ln -sf ../../static/atomixos-logo.png /usr/share/cockpit/branding/default/logo.png \
&& ln -sf ../../static/atomixos-logo.png /usr/share/cockpit/branding/fedora/logo.png \
&& python3 /usr/local/bin/patch-cockpit.py
CMD ["/usr/local/bin/start-cockpit"]
The custom image adds Cockpit’s bridge and management modules, then starts
cockpit-ws with --local-session. Cockpit itself does not authenticate users;
Caddy’s admin-only OIDC policy protects the route. The startup command writes
/etc/cockpit/cockpit.conf from GATEWAY_DOMAIN, so the example only requires
editing config.toml.
Building and Applying
Package the bundle as a tarball:
# Edit config.toml with your values
tar --zstd -cvf config.tar.zst -C <repo>/example/caddy-oidc .
Apply to the device using the bootstrap server or USB provisioning. See Provisioning for details.
Cockpit-Podman
The Cockpit Podman integration (cockpit-podman) lets operators manage
containers through the Cockpit UI. In this example it is installed into the
Cockpit container and uses the mounted host Podman socket at
/run/podman/podman.sock.
A future NixOS module could make Cockpit a native host service instead of a containerized admin application:
{ pkgs, ... }:
{
environment.systemPackages = [ pkgs.cockpit-podman ];
}
This is outside the scope of the tutorial config bundle and requires rebuilding the AtomixOS base image.
Security Considerations
This tutorial uses HS256 (symmetric) JWT signing for simplicity. For production deployments:
- Use public DNS and public certificates if exposing the device outside a trusted local network. The tutorial intentionally uses Caddy internal TLS for local management, not internet deployment.
- Use asymmetric keys (RS256/ES256) instead of a shared HMAC secret. AuthCrunch supports RSA and ECDSA key pairs.
- Rotate secrets regularly. The JWT_SHARED_KEY and Azure client secret should be rotated on a schedule.
- Use secret files instead of environment variables for sensitive values. Podman supports --secret mounts that avoid exposing secrets in Quadlet files on disk.
- Pin image tags in production. The tutorial uses :latest for convenience; production should pin to specific versions.
- Restrict Cockpit access. The viewer user should have minimal permissions. Consider using Cockpit’s cockpit.conf [Ssh-Login] restrictions.
Hardware Testing
Source: HARDWARE-TEST-PLAN.md
This chapter provides the physical verification plan for Rock64 hardware testing. These tests cannot be run in QEMU and require a physical Rock64 board with eMMC, serial console, and network connectivity.
Prerequisites
- Rock64 v2 board with 16 GB eMMC module
- USB-to-serial adapter connected to UART2 (1.5 Mbaud)
- Supported USB Ethernet adapter for the eth1/LAN interface (r8152, ax88179_178a, or cdc_ether)
- Built disk image (atomixos-25.11.img)
- Built RAUC bundle (rock64.raucb)
- Network with DHCP and internet access (for WAN/eth0)
- A second device on the LAN subnet for client testing
Phase 1: Provisioning & First Boot
Test 1.1: Flash image and verify U-Boot output
# Flash the image
mise run flash /dev/disk4 # macOS
# or
sudo dd if=atomixos-25.11.img of=/dev/mmcblk0 bs=4M status=progress
# Connect serial console
screen /dev/tty.usbserial-DM02496T 1500000
Pass criteria:
- U-Boot banner appears on serial console
- bootflow scan finds boot.scr on boot-a
- Kernel loads and prints boot messages
- System reaches multi-user.target
- If /boot/config.toml or a USB seed is present, first-boot.service completes provisioning
- Without a seed, the bootstrap UI appears on WAN and LAN port 8080, and first boot waits indefinitely for operator input until a valid config is applied
Test 1.2: Verify first-boot service
systemctl status first-boot
[ -f /data/.completed_first_boot ] && cat /data/.completed_first_boot
[ -x "$(command -v rauc)" ] && rauc status
Pass criteria:
- With a seed config present, first-boot.service completed successfully
- Without a seed config, the bootstrap UI is reachable and first-boot.service remains waiting
- After provisioning succeeds, the sentinel exists at /data/.completed_first_boot
- On RAUC-enabled images, rauc status shows the booted slot as “good” after provisioning succeeds
Phase 2: Kernel & Hardware Detection
Test 2.1: eMMC and core hardware
dmesg | grep -i mmc
dmesg | grep -i dwmac
dmesg | grep -i ehci
dmesg | grep -i watchdog
lsblk
Pass criteria:
- eMMC detected as /dev/mmcblk1 (or mmcblk0 depending on boot media)
- Ethernet MAC driver (DWMAC/STMMAC) loaded
- USB host controller (EHCI/OHCI/XHCI) initialized
- Watchdog device (dw_wdt) registered
Test 2.2: USB Ethernet module
modprobe r8152 # or ax88179_178a/cdc_ether for your adapter
ip link show
Pass criteria:
- Supported USB Ethernet module loads without errors
- A second Ethernet interface appears in ip link
- USB WiFi and Bluetooth are not part of the current image contract
Phase 3: Network Configuration
Test 3.1: eth0 is onboard Ethernet
udevadm info /sys/class/net/eth0 | grep ID_PATH
ip addr show eth0
Pass criteria:
- eth0 matches the onboard GMAC (platform path platform-ff540000.ethernet)
- eth0 has a DHCP-assigned IP address
Test 3.2: DHCP server on LAN
Connect a client device to eth1 (USB Ethernet adapter).
# On the gateway
systemctl status dnsmasq
journalctl -u dnsmasq | tail -20
# On the LAN client
dhclient eth0 # or equivalent
ip addr show
Pass criteria:
- Client receives an IP in the 172.20.30.10-254 range
- Gateway is 172.20.30.1
- dnsmasq logs the DHCP transaction
Test 3.3: NTP server on LAN
# On the gateway
chronyc tracking
chronyc clients
# On the LAN client
ntpdate -q 172.20.30.1
Pass criteria:
- Chrony is synced to upstream NTP (or using local stratum 10 fallback)
- LAN client can query NTP from 172.20.30.1
Test 3.4: LAN isolation
# On the LAN client
ping -c 3 8.8.8.8 # should fail
curl https://example.com # should fail
ping -c 3 172.20.30.1 # should succeed
Pass criteria:
- LAN client cannot reach any internet address
- LAN client can reach the gateway
Phase 4: Firewall Verification
Test 4.1: WAN baseline port access
From an external machine (or the WAN side):
# These should fail until explicitly provisioned
curl -k https://<wan-ip>:443
nc -uz <wan-ip> 1194
# This should fail (connection refused/timeout)
ssh <wan-ip>
Pass criteria:
- HTTPS (443) is blocked until provisioned
- OpenVPN (1194) is blocked until provisioned
- SSH (22) is blocked
Test 4.2: SSH-on-WAN toggle
# Enable SSH on WAN
touch /data/config/ssh-wan-enabled
systemctl start ssh-wan-reload
# Test from WAN side
ssh admin@<wan-ip> # should now work
# Disable SSH on WAN
rm /data/config/ssh-wan-enabled
systemctl start ssh-wan-reload
# Test from WAN side
ssh admin@<wan-ip> # should fail again
Pass criteria:
- SSH is blocked by default
- Creating the flag file and reloading enables SSH
- Removing the flag file and reloading disables SSH
Phase 5: Services
Test 5.1: Update confirmation
systemctl restart os-verification
journalctl -u os-verification -f
Pass criteria:
- Local service and network checks pass
- 60-second sustained check completes
- Slot is marked as “good”
Phase 6: Authentication
Test 6.1: SSH key authentication
# From an external machine on the LAN
ssh -i ~/.ssh/id_ed25519 admin@172.20.30.1
# Password auth should remain disabled
auth_line="$({ ssh -vv -o PreferredAuthentications=none -o PubkeyAuthentication=no \
-o BatchMode=yes -o NumberOfPasswordPrompts=0 \
-o StrictHostKeyChecking=accept-new \
-o UserKnownHostsFile=/tmp/atomixos-rock64-known_hosts \
-o ConnectTimeout=10 admin@172.20.30.1 true; } \
2>&1 | grep 'Authentications that can continue:' | tail -n 1)"
[ -n "$auth_line" ] && ! printf '%s\n' "$auth_line" | grep -Fq 'password'
Pass criteria:
- Key-based authentication succeeds
- The auth-method probe exits successfully, confirming password is excluded
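The probe above greps the last "Authentications that can continue:" line from the ssh debug output. A minimal Python sketch of the same check follows; the function name is hypothetical, and unlike the substring grep it compares whole method names.

```python
def password_auth_offered(ssh_debug_output: str) -> bool:
    """Return True if the last advertised auth-method list includes 'password'."""
    marker = "Authentications that can continue:"
    lines = [line for line in ssh_debug_output.splitlines() if marker in line]
    if not lines:
        # Mirrors the `[ -n "$auth_line" ]` guard: no line means no verdict.
        raise ValueError("no auth-method line found; probe is inconclusive")
    methods = lines[-1].split(marker, 1)[1].strip().split(",")
    return "password" in [m.strip() for m in methods]
```

Feed it the combined stdout/stderr of the `ssh -vv` probe; the test passes when the function returns False.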
Test 6.2: Serial root recovery
# On the device
fw_setenv _RUT_OH_ 1
reboot
# `_RUT_OH_` should remain a serial-only recovery path
# On UART2/ttyS2 at 1500000 baud, expect serial root autologin on the next boot.
# From an external machine on the LAN after the reboot
ssh -i ~/.ssh/id_ed25519 admin@172.20.30.1
auth_line="$({ ssh -vv -o PreferredAuthentications=none -o PubkeyAuthentication=no \
-o BatchMode=yes -o NumberOfPasswordPrompts=0 \
-o StrictHostKeyChecking=accept-new \
-o UserKnownHostsFile=/tmp/atomixos-rock64-known_hosts \
-o ConnectTimeout=10 admin@172.20.30.1 true; } \
2>&1 | grep 'Authentications that can continue:' | tail -n 1)"
[ -n "$auth_line" ] && ! printf '%s\n' "$auth_line" | grep -Fq 'password'
# On the device after boot completes
fw_printenv -n _RUT_OH_ # expect: empty / unset
Pass criteria:
- _RUT_OH_ enables one-shot serial root autologin on UART2 only
- SSH behavior on the network is unchanged after the recovery boot
- _RUT_OH_ is cleared after use
Phase 7: RAUC Update Lifecycle
Test 7.1: RAUC status
rauc status
Pass criteria:
- Shows 4 slots (boot.0, rootfs.0, boot.1, rootfs.1)
- One pair is marked as booted and good
Test 7.2: Bundle install
# Copy bundle to device
scp rock64.raucb admin@172.20.30.1:/data/
# Install
rauc install /data/rock64.raucb
Pass criteria:
- Install completes without errors
- rauc status shows the inactive slot has been written
- BOOT_ORDER reflects the new slot priority
Test 7.3: Boot-count rollback
# After installing to slot B, intentionally corrupt it
dd if=/dev/zero of=/dev/mmcblk1p4 bs=1M count=1
# Reboot 3 times and observe the serial console
reboot
Pass criteria:
- Each boot attempt decrements BOOT_B_LEFT
- After 3 failures, U-Boot falls back to slot A
- Slot A boots successfully with the previous working image
Phase 8: Watchdog
Test 8.1: Hardware watchdog presence
dmesg | grep -i watchdog
ls /dev/watchdog*
Pass criteria:
- dw_wdt driver is loaded
- /dev/watchdog device exists
Test 8.2: Watchdog-triggered reboot
Deferred: active watchdog enforcement is disabled in the current release. Run this only after enabling the deferred
RuntimeWatchdogSec=30s target on a test device.
# Freeze PID 1 (systemd) to stop watchdog kicks
kill -STOP 1
# Wait 30+ seconds -- the hardware watchdog should force a reboot when enabled
Pass criteria:
- With the deferred target enabled, device reboots within ~30 seconds of the SIGSTOP
- Serial console shows watchdog reset
- U-Boot boot-count is decremented for the current slot
Task Checklist
| # | Test | Status |
|---|---|---|
| 1.1 | Flash + U-Boot output | |
| 1.2 | First-boot service | |
| 2.1 | eMMC + core hardware | |
| 2.2 | USB Ethernet module | |
| 3.1 | eth0 is onboard | |
| 3.2 | DHCP server on LAN | |
| 3.3 | NTP server on LAN | |
| 3.4 | LAN isolation | |
| 4.1 | WAN port access | |
| 4.2 | SSH-on-WAN toggle | |
| 5.1 | Update confirmation | |
| 6.1 | SSH key auth | |
| 6.2 | Serial root recovery | |
| 7.1 | RAUC status | |
| 7.2 | Bundle install | |
| 7.3 | Boot-count rollback | |
| 8.1 | Watchdog presence | |
| 8.2 | Watchdog reboot | |
NTP Settings
AtomixOS uses chrony as both an upstream NTP client on WAN and an NTP server for LAN clients. LAN clients receive the
gateway address through DHCP option 42 and should query the gateway instead of reaching public NTP servers directly.
Default Upstream
The default upstream is Cloudflare public NTP:
server time.cloudflare.com iburst
Cloudflare is the default because its NTP usage documentation
explicitly describes using time.cloudflare.com, the service is global anycast, and Cloudflare does not leap-smear time.
That non-smearing behavior matches standard NTP and keeps AtomixOS compatible with typical site-local NTP servers and
the NTP Pool.
If WAN is unavailable or upstream sync fails, chrony still serves LAN clients from local stratum 10. This keeps isolated
LAN devices moving forward, but the time is only as accurate as the gateway clock until upstream sync returns.
Leap Smearing Warning
Some public NTP providers, including Google Public NTP, use leap smearing. A leap smear spreads a leap-second adjustment over a window of time instead of exposing the leap second at one instant. That can be useful for large application fleets, but during the smear window the smeared clock intentionally differs from standard UTC.
Do not mix leap-smearing and non-leap-smearing NTP sources in the same chrony configuration. Mixing them can make valid time sources disagree, especially around leap-second events, and chrony may treat that as jitter or source instability.
If an operator chooses Google Public NTP, follow Google’s
configuration guidance and use only Google time sources such as
time1.google.com through time4.google.com. Do not combine those sources with Cloudflare, NTP Pool, DHCP-provided NTP,
or other standard non-smearing servers.
Operator Overrides
For production networks with an enterprise or site-local time service, prefer the local NTP service when it is reliable and managed. Keep all configured upstreams in the same leap-second behavior family: either all standard non-smearing sources or all sources from the same smearing provider.
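The "same leap-second behavior family" rule can be checked mechanically before an image is built. The sketch below is illustrative only: the function names are hypothetical, and it assumes that any host outside Google Public NTP is a standard non-smearing source, which an operator must confirm for their own servers.

```python
# Google Public NTP hostnames, which serve leap-smeared time.
GOOGLE_SMEARED = {
    "time.google.com",
    "time1.google.com",
    "time2.google.com",
    "time3.google.com",
    "time4.google.com",
}

def smear_family(host: str) -> str:
    """Classify a host; unknown hosts are ASSUMED to be standard (non-smearing)."""
    return "google-smear" if host.lower().rstrip(".") in GOOGLE_SMEARED else "standard"

def check_upstreams(hosts):
    """Return the single family in use, or raise if families are mixed."""
    families = {smear_family(h) for h in hosts}
    if len(families) > 1:
        raise ValueError(f"mixed leap-second families: {sorted(families)}")
    return next(iter(families), None)
```

For example, `check_upstreams(["time.cloudflare.com", "time1.google.com"])` raises, while a list containing only Google hosts or only standard hosts passes.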
AtomixOS currently sets the upstream in modules/lan-gateway.nix. After changing the upstream, rebuild and redeploy the
image, then verify synchronization with:
chronyc sources -v
chronyc tracking
timedatectl
Nix Flake Configuration
Source: docs/src/features/rock64-ab-image/design.md#nix-flake-config
Requirements
ADDED: Flake defines Rock64 NixOS configuration
The flake provides nixosConfigurations.rock64 targeting aarch64-linux (RK3328). The configuration includes all
service modules: systemd, openssh, chrony, dnsmasq, RAUC, nftables, watchdog, and the health-check/update services.
Scenario: Rock64 system evaluates cleanly
- Given the flake is checked with nix flake check
- Then nixosConfigurations.rock64 evaluates without errors
- And the system target is aarch64-linux
ADDED: Produces squashfs rootfs image
The flake builds a compressed squashfs root filesystem via packages.aarch64-linux.squashfs. The image must not exceed
the partition slot size (1 GB).
Scenario: Squashfs image fits within slot
- Given the squashfs is built with nix build .#squashfs
- Then the resulting image is less than or equal to 1 GB
- And it uses zstd compression with 1 MB block size
ADDED: Produces signed RAUC bundle
The flake builds a multi-slot RAUC bundle (.raucb) containing both boot
(kernel + initrd + DTB + boot.scr) and rootfs (squashfs) images,
signed with the project’s CA key.
Scenario: RAUC bundle is valid
- Given the bundle is built with nix build .#rauc-bundle
- Then the .raucb file passes rauc info --no-verify
- And it contains entries for both boot and rootfs slots
- And it is signed with the development CA certificate
ADDED: Stripped kernel with modular USB support
The kernel is configured with built-in drivers for essential hardware (eMMC, Ethernet, USB host, watchdog, squashfs, f2fs) and loadable modules for selected USB Ethernet and USB-serial peripherals. USB WiFi is unsupported until specific hardware and firmware are selected.
Scenario: Kernel has required drivers
- Given the NixOS configuration is evaluated
- Then the kernel includes MMC_DW_ROCKCHIP=y, STMMAC_ETH=y, DW_WATCHDOG=y, SQUASHFS=y
- And selected USB Ethernet drivers (USB_RTL8152, USB_NET_AX88179_178A, USB_NET_CDCETHER) are built as modules
Scenario: USB serial works for debugging
- Given a USB-serial adapter is plugged in
- When the ftdi_sio or cp210x module is loaded
- Then /dev/ttyUSB0 appears
ADDED: OpenVPN as system service
OpenVPN is included in the rootfs for recovery tunnel access. It does not auto-start; it requires a config file at
/data/config/openvpn/client.conf.
Scenario: OpenVPN service is conditional
- Given no OpenVPN config file exists on /data
- Then openvpn-recovery.service does not start
- When a config file is placed at /data/config/openvpn/client.conf
- And the service is started manually
- Then a tun0 interface appears
ADDED: QEMU testing target
The flake provides nixosConfigurations.rock64-qemu targeting aarch64-virt with virtio block devices. It shares the
full service configuration from base.nix but uses a custom RAUC backend (file-based) instead of U-Boot.
Scenario: QEMU target boots and runs tests
- Given the QEMU configuration is built
- When a test VM is started
- Then all services from base.nix are present
- And RAUC uses the custom file-based backend
Partition Layout Specification
Source: docs/src/features/rock64-ab-image/design.md#partition-layout
Requirements
ADDED: eMMC A/B layout
The 16 GB eMMC uses a fixed partition layout with raw U-Boot at the beginning, two pairs of A/B slots (boot + rootfs),
and a persistent data partition using all remaining space. The flash image contains slot A only; initrd
systemd-repart creates slot B and /data on first boot.
Scenario: Partition table matches specification
- Given a provisioned eMMC
- Then sfdisk -d shows 2 GPT partitions in the flashable image (boot-a, rootfs-a)
- And the raw region (0-16 MB) contains U-Boot
- And after the first successful boot, initrd systemd-repart has created three additional GPT partitions labeled boot-b, rootfs-b, and data
- And /data (f2fs) uses the remaining space
ADDED: Per-slot boot partitions
Each slot pair has its own boot partition (vfat) containing the kernel, initrd, DTB, and boot script. This ensures boot and rootfs are always consistent for a given slot.
Scenario: Boot partition contents match slot
- Given slot A is active
- Then boot-a contains Image, initrd, rk3328-rock64.dtb, and boot.scr
- And boot-b is absent before first boot or contains the other slot’s kernel afterward
ADDED: Flashable disk image
The build task produces a flashable .img file containing U-Boot, boot slot A, rootfs slot A, and a
remaining-space region reserved for first-boot creation of boot slot B, rootfs slot B, and /data by initrd
systemd-repart.
ADDED: U-Boot at RK3328 offsets
U-Boot is written as raw data (no partition) at the offsets expected by the RK3328 boot ROM:
- idbloader.img at sector 64 (byte offset 32 KB)
- u-boot.itb at sector 16384 (byte offset 8 MB)
Both come from the custom Rock64 U-Boot package built by this flake.
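The sector and byte offsets above are related by the 512-byte sector size. The following sketch reproduces that arithmetic and prints the kind of dd invocation commonly used to place such raw loaders; the target device path is a placeholder, and the actual image build in this flake may assemble the image differently.

```python
SECTOR_SIZE = 512  # bytes per sector, matching dd's default block size

# Start sectors taken from the specification above.
RAW_LOADERS = {"idbloader.img": 64, "u-boot.itb": 16384}

def byte_offset(sector: int) -> int:
    """Convert a 512-byte sector index into a byte offset."""
    return sector * SECTOR_SIZE

def dd_command(image: str, device: str) -> str:
    """Build a dd command line that writes `image` at its raw offset."""
    return f"dd if={image} of={device} seek={RAW_LOADERS[image]} conv=notrunc"

if __name__ == "__main__":
    for name, sector in RAW_LOADERS.items():
        print(f"{name}: sector {sector} = {byte_offset(sector) // 1024} KiB")
        print("  " + dd_command(name, "/dev/sdX"))  # /dev/sdX is a placeholder
```

Both loaders must end before the 16 MB mark, where the first GPT partition (boot-a) begins.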
Scenario: U-Boot loads from eMMC
- Given U-Boot is written at the correct offsets
- When the Rock64 powers on
- Then the serial console shows U-Boot initialization
- And bootflow scan finds boot.scr on boot-a
ADDED: /data survives updates
The /data partition is never modified by RAUC slot writes or slot switches. Container data, configuration, and
credentials persist across all updates and rollbacks.
Scenario: Persistent data survives update
- Given a file exists at /data/config/test-file
- When a RAUC update installs a new image and the device reboots
- Then /data/config/test-file still exists with the same content
ADDED: U-Boot env for slot selection
U-Boot environment variables (BOOT_ORDER, BOOT_A_LEFT, BOOT_B_LEFT) control which slot boots and how many attempts
remain. On Rock64 these are stored in SPI flash and are seeded safely from Linux on first boot when missing.
Scenario: Environment survives power loss
- Given BOOT_ORDER is set to "B A"
- When power is lost during env write
- Then U-Boot falls back to its compiled defaults or a previously valid SPI environment
RAUC Integration
Source: docs/src/features/rock64-ab-image/design.md#rauc-integration
Requirements
ADDED: A/B multi-slot configuration
RAUC system.conf defines two slot pairs (A and B), each containing a boot partition and a rootfs partition. The boot
partition is the parent; the rootfs partition inherits its slot assignment.
Scenario: RAUC sees all slots
- Given the device has booted
- When rauc status is run
- Then 4 slots are listed: boot.0 (A), rootfs.0 (A), boot.1 (B), rootfs.1 (B)
- And one pair is marked as “booted”
ADDED: U-Boot bootloader backend
RAUC uses the uboot backend to communicate slot priority and boot-count via U-Boot environment variables. On the QEMU
target, a custom backend simulates the same behavior using files.
Scenario: RAUC can switch slots
- Given slot A is active
- When rauc install writes a bundle to slot B
- Then RAUC sets BOOT_ORDER=B A and BOOT_B_LEFT=3
- And the next boot loads from slot B
ADDED: Bundle signature verification
RAUC verifies bundle signatures against the CA keyring (dev.ca.cert.pem). Unsigned or invalidly signed bundles are
rejected.
Scenario: Invalid bundle is rejected
- Given a .raucb bundle signed with a different key
- When rauc install is attempted
- Then the install fails with a signature verification error
- And no slot data is modified
ADDED: Writes to inactive slot only
RAUC only writes to the slot pair that is not currently booted. The active slot is never modified during an update.
Scenario: Active slot is protected
- Given slot A is booted
- When rauc install runs
- Then data is written to boot-b and rootfs-b only
- And boot-a and rootfs-a remain unchanged
ADDED: Bundle contains boot and rootfs
Each RAUC bundle contains two images: a vfat boot image (kernel + initrd + DTB + boot.scr) and the squashfs rootfs. Both are installed atomically to the target slot pair.
Scenario: Bundle structure
- Given a bundle is built with nix build .#rauc-bundle
- When rauc info is run on the bundle
- Then it shows an image for boot (type: raw) and an image for rootfs (type: raw)
- And the compatible field is rock64
ADDED: Update polling service
The os-upgrade.service polls an update server on a timer, downloads new bundles, and installs them via RAUC. The
hawkBit path is reserved for future server-push updates.
Scenario: Polling finds new version
- Given the update server has a newer bundle
- When the timer fires
- Then the bundle is downloaded to /data
- And rauc install is run
- And the device reboots into the new slot
ADDED: Swappable with hawkBit
The os-upgrade module has a useHawkbit option that disables the polling service and installs the
rauc-hawkbit-updater package. AtomixOS does not configure an operational hawkBit service in the current image.
Scenario: hawkBit mode
- Given os-upgrade.useHawkbit = true
- Then the os-upgrade polling timer is not created
- And the rauc-hawkbit-updater package is included in the system
- And no configured rauc-hawkbit-updater systemd service is created by AtomixOS
ADDED: NixOS RAUC module
RAUC is enabled via the upstream NixOS services.rauc module and wired from atomixos.rauc.* options. The rauc
client is available in the system environment.
Scenario: RAUC is available
- Given the device has booted
- When rauc --version is run
- Then a valid version string is returned
- And rauc.service is active
Boot & Rollback
Source: docs/src/features/rock64-ab-image/design.md#boot-rollback
Requirements
ADDED: U-Boot tracks boot attempts per slot
U-Boot maintains BOOT_A_LEFT and BOOT_B_LEFT counters (initial value: 3). RAUC bootmeth selects the slot and
decrements the counter before loading boot.scr.
Scenario: Counter decrements on boot
- Given BOOT_A_LEFT=3 and slot A is first in BOOT_ORDER
- When the device boots
- Then BOOT_A_LEFT is decremented to 2 before boot.scr loads the kernel
- And the SPI flash environment is updated
Scenario: Counter reaches zero
- Given
BOOT_A_LEFT=0 - When U-Boot attempts to boot slot A
- Then slot A is skipped
- And U-Boot tries the next slot in BOOT_ORDER
ADDED: Boot order reflects RAUC slot priority
When RAUC installs an update to slot B, it sets BOOT_ORDER=B A so the updated slot is tried first. When slot A is
installed, it sets BOOT_ORDER=A B.
Scenario: RAUC sets boot order
- Given slot A is active
- When a RAUC bundle is installed
- Then BOOT_ORDER changes to "B A"
- And BOOT_B_LEFT is set to 3
ADDED: Successful boot commits slot
After the health-check service passes, rauc status mark-good resets the boot counter for the current slot. This
prevents further rollback attempts.
Scenario: Health check commits slot
- Given the device booted into slot B with BOOT_B_LEFT=2
- When os-verification.service passes all checks
- Then rauc status mark-good is called
- And BOOT_B_LEFT is reset to 3
ADDED: Rollback recovers previous image
After 3 consecutive failed boots (counter reaches 0), U-Boot skips the failing slot and boots the previous working slot. The failed slot’s data is preserved for diagnostics but is not booted.
Scenario: Automatic rollback after 3 failures
- Given slot B was just installed with BOOT_ORDER=B A
- And slot B fails to boot 3 times (kernel panic, hang, or health check failure)
- Then BOOT_B_LEFT reaches 0
- And U-Boot boots slot A (the previous working image)
- And slot A still has its original content
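The counter and boot-order rules in this chapter can be modeled as a small state machine. The sketch below is a simplified model of the described behavior, not the actual U-Boot script or RAUC bootmeth; the function names are hypothetical, and the environment is represented as a plain dictionary.

```python
from typing import Dict, Optional

def select_slot(env: Dict[str, str]) -> Optional[str]:
    """Walk BOOT_ORDER, pick the first slot with attempts left, and decrement it."""
    for slot in env["BOOT_ORDER"].split():
        key = f"BOOT_{slot}_LEFT"
        if int(env.get(key, "0")) > 0:
            env[key] = str(int(env[key]) - 1)  # decremented before the kernel loads
            return slot
    return None  # no bootable slot remains

def mark_good(env: Dict[str, str], slot: str, attempts: int = 3) -> None:
    """Model `rauc status mark-good`: reset the counter for the booted slot."""
    env[f"BOOT_{slot}_LEFT"] = str(attempts)

if __name__ == "__main__":
    # State right after installing to slot B; slot B then never passes its health check.
    env = {"BOOT_ORDER": "B A", "BOOT_A_LEFT": "3", "BOOT_B_LEFT": "3"}
    print([select_slot(env) for _ in range(4)], env)
```

In this model, three unconfirmed boots exhaust slot B and the fourth boot selects slot A, which matches the rollback scenario above.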
ADDED: SPI flash U-Boot environment
The U-Boot environment is stored in SPI flash exposed to Linux as /dev/mtd0 at offset 0x140000 with size 0x2000.
AtomixOS does not store redundant U-Boot environment copies on eMMC.
Scenario: Userspace tools address SPI env
- Given the device has booted
- When /etc/fw_env.config is inspected
- Then it points to /dev/mtd0 0x140000 0x2000 0x1000
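The four fields in that line follow the fw_env.config convention: device, environment offset, environment size, and flash sector size. A minimal parser, written here as an illustrative sketch with a hypothetical function name, makes the values explicit.

```python
def parse_fw_env_config(text: str):
    """Parse fw_env.config lines into dicts; '#' starts a comment."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 fields: {raw!r}")
        numbers = [int(f, 0) for f in fields[1:]]  # base 0 accepts 0x... literals
        entries.append({
            "device": fields[0],
            "offset": numbers[0],
            "env_size": numbers[1],
            "sector_size": numbers[2] if len(numbers) > 2 else None,
        })
    return entries

if __name__ == "__main__":
    entry = parse_fw_env_config("/dev/mtd0 0x140000 0x2000 0x1000")[0]
    print(entry)
```

With the values above, the 8 KiB environment starts on a 4 KiB sector boundary and spans two flash sectors.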
Watchdog
Source: docs/src/features/rock64-ab-image/design.md#watchdog
Requirements
Current status: implementation hooks are present, but the Rock64 runtime watchdog is intentionally disabled during development. The scenarios below define the current release behavior and the deferred target settings.
ADDED: Hardware watchdog target is deferred
The RK3328 hardware watchdog (dw_wdt) target is documented, but systemd manager watchdog settings are not enabled in
the current release.
Scenario: Watchdog triggers on hang
- Given the current Rock64 image boots
- Then AtomixOS leaves RuntimeWatchdogSec unset
- And the deferred target remains RuntimeWatchdogSec=30s
ADDED: Reboot watchdog
A separate reboot watchdog (RebootWatchdogSec) remains deferred until Rock64 boot reliability validation approves active
watchdog enforcement.
Scenario: Reboot hang recovery
- Given the current Rock64 image boots
- Then AtomixOS leaves RebootWatchdogSec unset
- And the deferred target remains RebootWatchdogSec=10min
ADDED: Configurable timeouts
The watchdog timeouts are set in modules/watchdog.nix:
systemd.settings.Manager = {
# RuntimeWatchdogSec = "30s";
# RebootWatchdogSec = "10min";
};
- Runtime: 30 seconds – aggressive enough to catch hangs quickly, long enough to avoid false triggers during normal operation
- Reboot: 10 minutes – generous because clean shutdown may need time to stop containers
ADDED: Watchdog interacts with boot-count rollback
A watchdog reboot is indistinguishable from any other abnormal reboot from U-Boot’s perspective. Each watchdog-triggered reboot:
- Decrements the boot counter for the current slot
- If the counter reaches 0, the slot is skipped
- The previous working slot boots instead
This means a systemd hang on a newly updated slot will trigger automatic rollback within 3 watchdog cycles (approximately 90 seconds of watchdog timeout in total, plus the boot time of each attempt).
Scenario: Watchdog-triggered rollback
- Given slot B was just installed
- And slot B causes a systemd hang on every boot
- Then the watchdog reboots 3 times (30s each)
- And BOOT_B_LEFT decrements from 3 to 0
- And U-Boot falls back to slot A
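The rollback latency follows directly from the attempt count and the runtime watchdog timeout. The helper below is a rough estimate only; `boot_time_s` is a hypothetical parameter for the time each attempt needs to reach the hang, which this specification does not state.

```python
def rollback_time_seconds(attempts: int = 3,
                          watchdog_timeout_s: int = 30,
                          boot_time_s: int = 0) -> int:
    """Estimate the time until U-Boot gives up on a slot that hangs on every boot."""
    return attempts * (boot_time_s + watchdog_timeout_s)
```

With the deferred target values and boot time ignored, this yields 3 x 30 s = 90 s; any real boot time per attempt adds to that figure.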
Update Confirmation
Source: docs/src/features/rock64-ab-image/design.md#update-confirmation
Requirements
ADDED: Local health-check service
The os-verification.service runs after multi-user.target on every boot (except the first). It validates that the
system is healthy before committing the RAUC slot. No external network dependency is required for the check itself.
Scenario: Health check runs on update boot
- Given /data/.completed_first_boot exists (not first boot)
- When the device reaches multi-user.target
- Then os-verification.service starts
- And it checks service health
ADDED: Sustained health check
After initial checks pass, the service monitors for 60 seconds (checking every 5 seconds) to catch restart loops, transient service failures, network regressions, and required provisioned-unit failures.
Scenario: Restart loop detected
- Given dnsmasq.service passes the initial check
- But it crashes and restarts during the 60-second sustain window
- Then the sustained health check fails
- And the slot is not committed
Scenario: Network or required unit regression detected
- Given eth0, eth1, and provisioned required units pass the initial check
- But one check fails during the 60-second sustain window
- Then the sustained health check fails
- And the slot is not committed
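The sustain window described above (60 seconds, polled every 5 seconds, failing on the first bad sample) can be sketched as a polling loop. This is an illustrative model rather than the os-verification implementation; the check callables, and the injectable `sleep` and `clock` parameters used for testing, are assumptions.

```python
import time
from typing import Callable, Iterable

def sustained_check(checks: Iterable[Callable[[], bool]],
                    window_s: float = 60,
                    interval_s: float = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Return True only if every check passes at every sample in the window."""
    checks = list(checks)
    deadline = clock() + window_s
    while True:
        if not all(check() for check in checks):
            return False  # any failure leaves the slot uncommitted
        if clock() >= deadline:
            return True   # whole window observed healthy
        sleep(interval_s)
```

A caller would run `rauc status mark-good` only when this returns True. Detecting a restart loop additionally requires a check that notices restarts between samples, for example by comparing a unit's restart counter, which is left out of this sketch.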
ADDED: Successful confirmation commits slot
When all checks pass (services and sustained), the service runs rauc status mark-good to commit the current
slot. This resets the boot counter and prevents further rollback.
Scenario: Slot committed on success
- Given all health checks pass for 60 seconds
- When rauc status mark-good is called
- Then the booted slot is committed as “good”
- And BOOT_x_LEFT is reset to the maximum value
ADDED: Failed confirmation leaves slot uncommitted
If any check fails, the service exits non-zero. The slot remains uncommitted, and the boot counter continues to decrement on each subsequent boot until rollback occurs.
Scenario: Gradual rollback on failure
- Given health checks fail on every boot of slot B
- Then each boot decrements BOOT_B_LEFT
- And after 3 boots, U-Boot rolls back to slot A
LAN Gateway
Source: docs/src/features/rock64-ab-image/design.md#lan-gateway
Requirements
ADDED: Deterministic NIC naming
The onboard RK3328 GMAC is always named eth0 via a systemd .link file matching the platform path
(platform-ff540000.ethernet). USB Ethernet adapters receive kernel-assigned names. USB WiFi dongles are unsupported
until specific hardware and firmware are selected.
Scenario: Onboard Ethernet is eth0
- Given the Rock64 boots with the onboard Ethernet connected
- Then ip link shows eth0 as the onboard GMAC
- Regardless of USB device enumeration order
ADDED: eth0 as WAN (DHCP client)
The WAN interface acquires its address via DHCP v4. IPv6 RA is disabled. The DHCP-provided DNS servers are used.
Scenario: WAN gets DHCP address
- Given eth0 is connected to a network with a DHCP server
- When the device boots
- Then eth0 acquires an IPv4 address
- And DNS resolution works
ADDED: eth1 as LAN (static IP)
The LAN interface has a static IP from the provisioned LAN config. When no provisioned LAN config exists or it is
malformed, the fallback static IP is 172.20.30.1/24. It does not run a DHCP client.
Scenario: LAN has static IP
- Given the device has booted
- And /data/config/lan-settings.json contains gateway_ip
- Then ip addr show eth1 shows the configured gateway_ip with its provisioned prefix
Scenario: LAN uses fallback static IP
- Given the device has booted
- And no valid provisioned LAN config is available
- Then ip addr show eth1 shows 172.20.30.1/24
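The two scenarios above amount to "use the provisioned address if it parses, otherwise fall back". A minimal sketch of that decision follows. Only the `gateway_ip` key comes from this specification; the `prefix_length` key and the function name are hypothetical, and the real lan-settings.json schema may differ.

```python
import ipaddress
import json
from typing import Optional

FALLBACK = ipaddress.ip_interface("172.20.30.1/24")

def lan_interface(config_text: Optional[str]):
    """Return the LAN address from provisioned JSON, or the fallback if unusable."""
    try:
        cfg = json.loads(config_text)
        # 'prefix_length' is an assumed key name for the provisioned prefix.
        return ipaddress.ip_interface(f"{cfg['gateway_ip']}/{cfg['prefix_length']}")
    except (TypeError, ValueError, KeyError):
        # Missing file content, malformed JSON, missing keys, or invalid addresses.
        return FALLBACK
```

Passing `None` models a missing file; malformed JSON and invalid addresses take the same fallback path.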
ADDED: IP forwarding disabled
IP forwarding is disabled at the kernel level for both IPv4 and IPv6. The nftables FORWARD chain has a drop policy
with no exceptions. This creates a hard network boundary compliant with EN18031.
Scenario: No packet forwarding
- Given a LAN client sends a packet destined for the internet
- Then the packet is dropped at the gateway
- And it never reaches eth0
ADDED: DHCP server on LAN
dnsmasq runs on eth1 only. It assigns addresses from the provisioned DHCP range with a 24-hour lease and serves gateway-local DNS names without forwarding queries upstream.
Scenario: LAN client gets DHCP lease
- Given a client is connected to eth1
- When it sends a DHCP discover
- Then it receives an address in the provisioned DHCP range
- And the gateway is the provisioned gateway_ip
- And the DNS server is the provisioned gateway_ip
Scenario: LAN DNS stays local-only
- Given a client on the LAN queries the gateway DNS server
- When the query is for a configured gateway-local name
- Then dnsmasq returns the local gateway address
- And dnsmasq does not forward unknown names to upstream resolvers
ADDED: NTP server on LAN
chrony acts as both an NTP client (syncing from Cloudflare public NTP via WAN) and an NTP server for LAN clients. The
provisioned LAN subnet is allowed to query. When no valid provisioned LAN config exists, the fallback subnet is
172.20.30.0/24.
Scenario: LAN client syncs time
- Given a client on the LAN queries NTP at the provisioned gateway_ip
- Then it receives a valid time response
- And chrony is synced to an upstream NTP source
Scenario: Offline fallback
- Given the device has no WAN connectivity
- Then chrony uses local stratum 10 as a fallback
- And LAN clients still receive time (lower accuracy)
ADDED: nftables firewall
The firewall uses nftables with per-interface rules:
| Interface | Allowed Inbound |
|---|---|
| eth0 (WAN) | established/related; provisioned inbound only |
| eth1 (LAN) | UDP 53, UDP 67-68, UDP 123, TCP 22, TCP 53, TCP 8080, established/related |
| tun0 (VPN) | TCP 22, established/related |
| FORWARD | DROP all |
WAN application and VPN ports are opened only from /data/config/firewall-inbound.json by
provisioned-firewall-inbound.service. SSH on WAN is controlled by a dynamic nftables rule toggled via
/data/config/ssh-wan-enabled. Provisioned lan ports are appended to the platform-required LAN ports instead of
replacing them.
Scenario: WAN ports are closed before provisioning
- Given no provisioned firewall state exists
- When a connection is attempted to eth0 on TCP 443 or UDP 1194
- Then the packet is dropped
Scenario: Provisioned WAN ports are allowed
- Given /data/config/firewall-inbound.json includes TCP 443 and UDP 1194
- And provisioned-firewall-inbound.service has applied it
- Then inbound traffic on eth0 TCP 443 and UDP 1194 is accepted
Scenario: WAN SSH blocked by default
- Given no flag file exists at /data/config/ssh-wan-enabled
- When an SSH connection is attempted from the WAN
- Then the connection is rejected
Scenario: WAN SSH enabled with flag
- Given /data/config/ssh-wan-enabled is created
- And ssh-wan-reload.service is triggered
- Then SSH connections from the WAN are accepted
ADDED: WAN SSH toggle is manual only
Enabling SSH on WAN requires creating a flag file on the device (via LAN SSH or physical console). There is no automated mechanism to enable it remotely – this is a deliberate security constraint.
ADDED: Device identity via MAC address
The device is identified by the MAC address of eth0 (the onboard Ethernet). This MAC is stable across reboots and
updates, and is used as the X-Device-ID header when polling for updates.
Scenario: MAC-based identity
- Given eth0 has MAC aa:bb:cc:dd:ee:ff
- When os-upgrade.service polls for updates
- Then the request includes X-Device-ID: aabbccddeeff
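The scenario implies that the header value is the MAC address in lowercase with the colons removed. A small sketch of that normalization follows; the function name is hypothetical, and reading the address from /sys/class/net/eth0/address is one possible source, not necessarily what os-upgrade.service does.

```python
import re

MAC_PATTERN = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")

def device_id_from_mac(mac: str) -> str:
    """Turn 'aa:bb:cc:dd:ee:ff' into the 'aabbccddeeff' form used in X-Device-ID."""
    mac = mac.strip()
    if not MAC_PATTERN.fullmatch(mac):
        raise ValueError(f"not a colon-separated MAC address: {mac!r}")
    return mac.replace(":", "").lower()
```

For example, `device_id_from_mac("AA:BB:CC:DD:EE:FF")` gives the same identifier as the lowercase form, so the ID stays stable regardless of how the MAC is printed.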
Design Decisions
Source: docs/src/features/rock64-ab-image/design.md
This chapter documents the key architectural decisions made during the design of AtomixOS, including the rationale, alternatives considered, and known trade-offs.
Context
AtomixOS is a greenfield project for secure, reproducible single-board computer appliances deployed remotely. The initial hardware target is Rock64 (RK3328, aarch64) with 16 GB eMMC storage.
Decision 1: RAUC over SWUpdate
Choice: RAUC for A/B slot management.
Rationale: RAUC has native U-Boot integration, well-documented slot configuration, and a straightforward NixOS module. SWUpdate offers more flexibility (scripted handlers, delta updates) but adds complexity that isn’t needed for the current use case.
Trade-off: RAUC’s update model is image-based (full slot writes), which means no delta updates. A full rootfs write (~300 MB) takes longer than a delta, but is simpler and more reliable.
Decision 2: Squashfs rootfs
Choice: Read-only squashfs root filesystem with OverlayFS (tmpfs upper layer).
Rationale: Squashfs eliminates runtime drift – every boot starts from a known-good state. It compresses well (zstd,
1 MB blocks), fitting the NixOS closure into the 1 GB slot with room to spare. A single OverlayFS (squashfs lower +
tmpfs upper) set up in the initrd provides a unified writable root, which is required for systemd’s mount namespace
sandboxing (PrivateTmp, ProtectHome, etc.) to work correctly. Writable state lives on /data (f2fs).
Trade-off: Any runtime state not explicitly persisted to /data is lost on reboot. This is intentional for an
appliance but requires careful placement of writable directories.
Decision 3: Per-slot boot partitions
Choice: Each A/B slot has its own boot partition (vfat) containing the kernel, initrd, DTB, and boot script.
Rationale: Pairing boot and rootfs in the same slot ensures they are always consistent. If kernel and rootfs were in different slot pairs, a failed update could leave mismatched versions.
Alternative considered: Single shared boot partition with both kernels. Rejected because it creates a single point of failure and complicates the U-Boot boot script.
Decision 4: eMMC partition layout
Choice: Fixed layout: 16 MB raw U-Boot, 128 MB boot A/B, 1 GB rootfs A/B, remaining space for /data.
Rationale: 128 MB per boot slot provides ample space for the kernel (~25 MB compressed), initrd, DTB, and boot
script. 1 GB per rootfs slot gives 2-3x headroom over the current squashfs size (~300-400 MB). The /data partition
(~13.3 GB) holds containers, logs, and configuration.
Risk: If the NixOS closure grows beyond 1 GB, the rootfs slot size must be increased, which reduces /data space
and requires re-provisioning all devices.
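The fixed layout can be turned into concrete offsets to see how much space is left for /data. The sketch below reads "1 GB" as 1024 MiB and ignores GPT metadata and alignment padding, both of which are simplifying assumptions; the usable capacity of a nominal 16 GB eMMC also varies by part, so the /data figure it prints is indicative only.

```python
MIB = 1024 ** 2

# Region sizes from the layout decision, in MiB, in on-disk order.
FIXED_REGIONS = [
    ("u-boot-raw", 16),
    ("boot-a", 128),
    ("rootfs-a", 1024),
    ("boot-b", 128),
    ("rootfs-b", 1024),
]

def plan_layout(total_bytes: int):
    """Return (name, offset, size) tuples; /data takes whatever space remains."""
    rows, offset = [], 0
    for name, size_mib in FIXED_REGIONS:
        size = size_mib * MIB
        rows.append((name, offset, size))
        offset += size
    if offset > total_bytes:
        raise ValueError("device is too small for the fixed regions")
    rows.append(("data", offset, total_bytes - offset))
    return rows

if __name__ == "__main__":
    for name, offset, size in plan_layout(16 * 1024 ** 3):
        print(f"{name:12s} offset={offset // MIB:6d} MiB  size={size / 1024 ** 3:6.2f} GiB")
```

The fixed regions total 2320 MiB, so doubling the rootfs slots to 2 GB each would move the start of /data by a further 2048 MiB, which is why a slot resize requires re-provisioning.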
Decision 5: U-Boot from nixpkgs
Choice: Use pkgs.ubootRock64 from nixpkgs rather than a custom U-Boot build.
Rationale: The nixpkgs U-Boot package is tested, reproducible, and tracks upstream releases. Custom patches are applied via the kernel config (not U-Boot patches), keeping the build simple.
Trade-off: Limited to the U-Boot version and configuration in nixpkgs. The current version (2025.10) lacks
setexpr, requiring a manual if/elif chain for boot counter decrement.
Decision 6: Watchdog strategy
Choice: Defer active systemd hardware watchdog enforcement while keeping 30s runtime / 10min reboot timeouts as the target settings.
Rationale: Rock64 boot reliability validation is not complete. The target values remain documented, but the current
release leaves systemd.settings.Manager = { } to avoid watchdog-triggered reset loops during development.
Integration: Once enabled, watchdog reboots feed directly into the boot-count rollback path.
Decision 7: Local health-check (no phone-home)
Choice: os-verification.service runs local checks only. No external server is contacted for update confirmation.
Rationale: The device must be self-sufficient. If the WAN is down after an update, the device should still be able to commit the slot (or roll back) based on local service health. Phoning home would create a dependency on network availability during the critical confirmation window.
Decision 8: Optional Nixstasis-based remote management
Choice: Move remote web management out of the device image and support Nixstasis as an optional control plane.
Rationale: The Nixstasis client already establishes reverse tunnels and receives short-lived SSH credentials from the server. Hosting remote web management and the auth layer in Nixstasis removes first-boot registry pulls, reduces device complexity, and keeps the device focused on local gateway and update responsibilities.
Trade-off: Remote management now depends on successful Nixstasis enrollment and tunnel establishment. Local recovery falls back to SSH rather than an on-device HTTPS UI.
Decision 9: OpenVPN in rootfs
Choice: Include OpenVPN in the root filesystem (not as a container).
Rationale: OpenVPN provides a recovery tunnel for remote management. If it ran as a container and the container runtime failed, there would be no remote access. Including it in the rootfs ensures it survives container-layer failures.
Decision 10: Network isolation (no IP forwarding)
Choice: Disable IP forwarding at the kernel level. The nftables FORWARD chain drops all packets.
Rationale: EN18031 requires a hard network boundary. LAN devices must not be able to reach the internet. WAN application or VPN ports are opened only by provisioned firewall state. Packet forwarding between WAN and LAN stays disabled.
Decision 11: NIC naming via .link files
Choice: Use systemd .link files for deterministic NIC naming rather than udev rules.
Rationale: .link files are the native systemd-networkd mechanism and are processed earlier in boot than udev
rules. They match on stable platform paths (e.g., platform-ff540000.ethernet for the onboard GMAC), ensuring eth0 is
always the onboard Ethernet regardless of USB enumeration order.
Decision 12: nftables firewall
Choice: nftables with per-interface rules, replacing iptables.
Rationale: nftables is the modern Linux firewall framework with better performance and a cleaner rule syntax. The
NixOS networking.nftables module provides native integration.
Decision 13: hawkBit-ready architecture
Choice: Design the update system to be swappable between polling and hawkBit push models.
Rationale: The initial deployment uses simple HTTP polling (os-upgrade.service). As the fleet scales, migration to
hawkBit can provide centralized update management, rollout policies, and device inventory. The os-upgrade.useHawkbit
option currently reserves this path and installs the package, but AtomixOS does not configure an operational hawkBit
service yet.
Decision 14: QEMU testing target
Choice: Provide a rock64-qemu NixOS configuration that shares the full service stack with the hardware target but
uses virtio devices and a file-based RAUC backend.
Rationale: Hardware testing is slow and requires physical devices. QEMU testing validates all software behavior (RAUC lifecycle, firewall rules, health checks) in CI-friendly VMs. The custom RAUC backend simulates U-Boot’s slot selection using files.
Decision 15: EN18031 authentication
Choice: No default passwords, locked local root password, no built-in operator account, SSH key-only access, serial break-glass recovery, and optional Nixstasis-based remote management.
Rationale: The base image does not host the web management/authentication stack. SSH key-only access and locked passwords prevent brute-force attacks on the device, while Nixstasis handles remote management credentials outside the device image.
Decision 16: Squashfs closure optimization
Choice: Aggressive closure size reduction through overlays, disabled features, and stripped dependencies.
Techniques applied:
- crun without CRIU (removes python3, saving ~102 MB)
- Disabled: documentation, man pages, fonts, XDG, sudo, bash completion
- Emptied defaultPackages and fsPackages
- Disabled: bcache, kexec, LVM
Result: Approximately 27% reduction in closure size compared to a default NixOS system with the same services.
Decision 17: Two-tier runtime logging model
Choice: Use tmpfs-first journald during runtime for host and container log ingress, then drain it through an
rsyslog RAM queue that appends buffered logs to /data/logs.
Rationale: Making the full journal always persistent would increase steady-state eMMC wear. The selected design keeps
runtime logging memory-first, caps journal memory use, routes Podman logs through the same path, and still retains
broader diagnostics durably on /data/logs in larger sequential batches instead of many small writes.
Trade-off: This is a bounded-loss durability model rather than an always-durable one. Sudden power loss can still drop the newest in-memory journal or rsyslog queue entries, but routine runtime writes remain much friendlier to eMMC than fully persistent journal storage.
Risks and Trade-offs
| Risk | Mitigation |
|---|---|
| eMMC wear from frequent writes | /data uses f2fs (wear-leveling aware); squashfs slots are written only during updates |
| U-Boot env corruption | Single-copy environment storage; corruption is handled through normal recovery and reprovisioning flows |
| 1 GB rootfs slot too small | Current closure is ~300-400 MB; aggressive optimization keeps headroom |
| Missing or empty health-required list | first-boot.service commits only when RAUC is enabled; os-verification uses gateway health checks alone unless /data/config/health-required.json names additional required units |
| Provisioned application failure | OpenVPN in rootfs and SSH key-only access provide alternate recovery paths |
| No delta updates | Full-image updates are ~300 MB; acceptable on broadband WAN connections |
| No automatic WAN SSH | Deliberate security constraint; manual flag file required |
Planned Features
Project Overview
AtomixOS is a secure, reproducible operating system for single-board computers, built on
NixOS with atomic A/B OTA updates, automatic rollback, and a container-based application
deployment model. The system uses a read-only squashfs rootfs and operator-provisioned
Quadlet containers on a persistent /data partition.
Goals
- Ship a complete, reproducible embedded gateway firmware with zero default credentials
- Provide atomic, rollback-safe over-the-air updates for thousands of remote devices
- Allow operators to provision application containers, networks, and volumes via a single config.toml without touching the base image
- Support EN18031 compliance for network isolation, authentication, and audit
- Support optional Nixstasis-based remote management through enrollment and tunnels
- Deliver a working reference stack (Caddy + AuthCrunch + Cockpit-ws) demonstrating OIDC-authenticated device management through config.toml
Non-Goals
- Desktop or server NixOS distribution
- Multi-architecture support beyond aarch64 (Rock64 RK3328)
- Container orchestration (Kubernetes, Swarm) – Quadlet is the runtime
- Delta OTA updates (full image writes are the current model)
- On-device web management UI in the base image (remote management can be provided through optional Nixstasis integration)
- General-purpose firewall/router functionality (no IP forwarding, ever)
Global Constraints
- 16 GB eMMC with fixed A/B partition layout; rootfs slot is 1 GB max
- Squashfs root is read-only; all mutable state lives on /data (f2fs)
- EN18031: no default credentials, no IP forwarding, key-only SSH
- Provisioned containers must go through the Quadlet safety boundary (rootful=host network, rootless=pasta with loopback publish rewrites)
- config.toml is the single operator input; schema changes must not break existing configs
- RAUC bundles are signed; only CA-signed updates are accepted
- Hardware watchdog enforcement is deferred until boot-reliability validation completes
Cross-Cutting Decisions
- POST /api/config is the programmatic provisioning endpoint; same validation as the web console
- Fresh first-boot POST /api/config is intentionally tokenless for programmatic provisioning; the bootstrap token is a Boot UI CSRF control for /apply, not operator authentication
- Provisioned re-apply requires SSH signature authentication; /api/validate also requires SSH authentication
- Bootstrap exposure is WAN/LAN before initial provisioning and LAN-only after successful provisioning; runtime socket rebinding must use /run/systemd/system drop-ins because the rootfs is read-only
- quadlet-runtime.json tracks all rendered units (containers, networks, volumes) with mode (rootful/rootless) for sync-quadlet
- Network and volume Quadlet units are always rootful
- ${CONFIG_DIR} and ${FILES_DIR} tokens in Quadlet values are substituted at render time to /data/config and /data/config/files respectively
- Bundle imports support files/ directory for operator payload files
- Re-apply uses authentication, not a reset token
- Full /data wipe is separate from config re-apply
- WAN TCP 8080 is reserved for bootstrap exposure and cannot be configured as a provisioned WAN inbound rule
- The repository development RAUC CA is an explicit development convenience only; production fail-closed keyring enforcement remains planned
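The render-time token substitution listed above can be sketched in Python. This is an illustration under assumptions: the function name `substitute_tokens` and the example mount value are hypothetical, and only the two tokens and their target paths are taken from the list.

```python
# Token targets as stated in the cross-cutting decisions.
TOKENS = {
    "${CONFIG_DIR}": "/data/config",
    "${FILES_DIR}": "/data/config/files",
}


def substitute_tokens(value: str) -> str:
    """Replace each known token literally; any other text is left untouched."""
    for token, path in TOKENS.items():
        value = value.replace(token, path)
    return value


def render_values(values: dict) -> dict:
    """Apply substitution to every string value of a Quadlet-like key/value map."""
    return {
        key: substitute_tokens(val) if isinstance(val, str) else val
        for key, val in values.items()
    }
```

For example, a volume value such as `${FILES_DIR}/Caddyfile:/etc/caddy/Caddyfile:ro` would render with the host side under /data/config/files. Literal replacement is used here rather than a template engine so that unrelated `$` characters in operator values pass through unchanged.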
Open Questions
- Cockpit-podman host integration: cockpit-podman must be installed on the host (not in the cockpit-ws container) and communicates via cockpit-bridge. On AtomixOS the rootfs is read-only squashfs, so cockpit-podman would need to be in the NixOS closure. This means the base image must include it, which crosses the “no on-device web management” non-goal boundary. Alternative: treat cockpit-podman as an optional NixOS module that operators can enable.
- hawkBit integration: useHawkbit option exists but no operational service is configured. Needs server configuration, credentials, and verification tests before promotion.
- Nixstasis client: Enrollment, tunnel lifecycle, and credential rotation are documented but not implemented.
- USB WiFi: Kernel WiFi/Bluetooth stacks are disabled. Hardware selection needed before enablement.
- Active watchdog enforcement: Deferred pending Rock64 boot-reliability validation.
- Additional [network] properties: Evaluate adding dns_servers, dns_search_domains, default_gateway, and interfaces to the [network] section for operator-controlled DNS, default route, and NIC configuration. These keys are not currently consumed but may be needed for multi-NIC or custom DNS setups.
- User shell configuration: Allow operators to set shell = "zsh" or shell = "bash" per user in [users.<name>]. Currently admin users default to /bin/zsh and system accounts to /bin/sh, with no config override.
- Additional [activation] options: Evaluate adding activation controls beyond required, such as timeout_seconds for max wait/check windows, rollback_on_failure for whether to restore previous config, restart for an explicit ordered service restart list, settle_seconds before checking health, allow_degraded for services allowed to fail without rollback, and strategy = "rollback" | "keep-failed" | "manual-confirm".
Resolved Questions
- Cockpit-ws authentication boundary: Resolved by placing Cockpit behind Caddy/AuthCrunch and running cockpit-ws with --local-session. Caddy is the only public authentication and authorization boundary; /cockpit/* is restricted to authp/admin.
- Provisioning API foundation: Resolved by replacing the monolithic first-boot provisioner with the atomixos-provision Python package, Litestar API service, SSH signature authentication, single-flight apply jobs, live OpenAPI schema, crash-safe config promotion, activation health checks, and rollback handling. Future changes should build on the same validate, render, promote, activate, and rollback pipeline instead of adding parallel mutation paths.
- Bootstrap API and UI auth split: Resolved by keeping programmatic first-boot /api/config unauthenticated while requiring the Boot UI bootstrap token for browser form submission. After provisioning, unauthenticated mutation routes are unavailable and re-apply requires SSH signatures.
- Bootstrap exposure lifecycle: Resolved by keeping WAN bootstrap exposure only until initial provisioning completes, then rebinding the bootstrap socket to LAN through runtime systemd drop-ins and preserving WAN exposure while an initial promotion marker is pending.
Feature Map
caddy-authcrunch-cockpit-tutorial
- Status: completed
- Overview: Provides a comprehensive tutorial section in the documentation with a fully working config.toml bundle deploying Caddy with the AuthCrunch plugin for Microsoft Entra OIDC authentication, JWT token generation with OIDC group-to-role mapping, and Cockpit-ws for container management. The tutorial demonstrates the full power of the config.toml provisioning system including containers, networks, volumes, and bundle files.
- Requirements:
- Working config.toml with all required sections (users, network, health, containers)
- AuthCrunch container (ghcr.io/authcrunch/authcrunch) as rootful with host networking
- Caddyfile configuring Microsoft Entra OIDC provider, authentication portal, and authorization policies
- OIDC group mapping to local roles: authp/admin (sudoless admin) and authp/user (generic user) based on Entra security group membership
- JWT token generation with configurable lifetime and signing key
- Cockpit-ws container (quay.io/fedora/fedora) for device/container management, built from a custom Containerfile that adds Cockpit management modules
- Caddy-gated Cockpit local session: Caddy restricts /cockpit/* to authp/admin, and cockpit-ws runs --local-session behind the proxy – eliminates double authentication
- Quadlet .build support for building custom container images from Containerfiles
- Podman module integration so operators can manage provisioned pods from Cockpit
- Quadlet network definition for inter-container communication
- Quadlet volume definition for persistent Caddy state
- Bundle files/ directory with Caddyfile and cockpit.conf
- Clear documentation of Azure App Registration prerequisites
- Clear documentation of how to swap the Caddyfile identity provider block for Google or another OIDC provider
- Clear documentation of the authentication flow and role-based access
- Constraints:
- Must use only config.toml features that exist today or are added as part of this feature (containers, networks, volumes, builds, bundle files, ${CONFIG_DIR}/${FILES_DIR} tokens)
- Caddy must be rootful (needs host network for ports 80/443)
- Cockpit-ws uses --local-session behind Caddy/AuthCrunch (no double auth)
- Must not require changes to the AtomixOS base image or schema beyond .build support
- Tutorial values (tenant ID, client ID, domain) must use obvious placeholders
- Non-goals:
- Modifying the AtomixOS base image to include Cockpit or cockpit-podman
- Production-hardening the example (certificate pinning, secret rotation, HA)
- SAML providers (tutorial focuses on OIDC)
- Success criteria:
- An operator can copy the tutorial config, substitute their Azure/domain values, flash a device, and have a working OIDC-authenticated Caddy + Cockpit stack
- The tutorial config passes first-boot-provision validate
- Role mapping is demonstrated: Entra group A gets admin, group B gets user
- The tutorial clearly explains the powerful host socket mounts used by the admin Cockpit container
- Risks and tradeoffs:
- Cockpit local-session risk: Cockpit does not perform a second login. Caddy must remain the only public entry point and /cockpit/* must remain admin-only.
- AuthCrunch version churn: AuthCrunch/caddy-security evolves rapidly; Caddyfile syntax may change between versions.
- Entra group claim configuration: Requires Azure portal configuration (Token Configuration > Add groups claim) that is outside AtomixOS control.
- Cockpit package drift: Container-installed Cockpit modules may not match host service versions exactly; native host packaging can be added later if needed.
- Dependencies:
- Network and volume Quadlet support (completed: 85ec53c)
- Bundle file support with ${FILES_DIR} token substitution (completed)
- Container, network, volume rendering and sync (completed)
- Quadlet .build support (completed)
- Suggested validation:
- first-boot-provision validate on the tutorial config.toml
- NixOS VM test importing the tutorial bundle and verifying rendered Quadlet files
- Manual verification with a real Entra tenant (cannot be automated)
- Delivered in: docs/src/tutorials/oidc-device-management.md and example/caddy-oidc/
nixstasis-client
- Status: planned
- Overview: Implement the Nixstasis enrollment client that registers the device with the Nixstasis management server, establishes reverse tunnels, and manages short-lived SSH credentials.
- Requirements:
- Device identifies itself via eth0 MAC address
- Server checks MAC against approved inventory
- Approved devices receive and persist a registration key on
/data - Client establishes reverse tunnel for remote SSH sessions
- Credential rotation for the registration key
- Constraints:
- Must survive container-layer failures (lives in rootfs, not a container)
- Must work with key-only SSH authentication model
- Must not require default credentials
- Non-goals:
- Hosting web management UI on the device
- Fleet orchestration logic (server-side concern)
- Success criteria:
- Device enrolls with Nixstasis server using MAC-based eligibility
- Registration key persists across reboots and updates
- Reverse tunnel enables remote SSH access
- NixOS VM test covers enrollment and tunnel lifecycle
- Risks and tradeoffs:
- Depends on Nixstasis server API being stable and documented
- Tunnel reliability on unstable WAN connections
- Dependencies: None (can start independently)
- Suggested validation:
- NixOS VM test with mock Nixstasis server
- Integration test with real Nixstasis instance
- Suggested first workflow command: /start-feature nixstasis-client
hawkbit-updates
- Status: planned
- Overview: Configure the rauc-hawkbit-updater service for server-push OTA updates, replacing the simple HTTP polling model for fleet-scale deployments.
- Requirements:
- Define hawkBit server configuration and credential provisioning
- Create systemd unit for rauc-hawkbit-updater
- Integrate with existing RAUC slot management
- Add config.toml support for hawkBit server URL and credentials
- Constraints:
- Must coexist with polling mode (operator chooses one)
- Must not break existing os-upgrade.service behavior
- Credentials must not be embedded in the base image
- Non-goals:
- Running a hawkBit server (server-side concern)
- Delta updates
- Success criteria:
- Device registers with hawkBit server and receives push updates
- RAUC install and slot management work identically to polling mode
- NixOS VM test with mock hawkBit server
- Risks and tradeoffs:
- hawkBit server availability becomes a deployment dependency
- Additional credential management complexity
- Dependencies: None
- Suggested validation: NixOS VM test with mock hawkBit DDI API
- Suggested first workflow command: /start-feature hawkbit-updates
rauc-production-keyring-policy
- Status: planned
- Overview: Make RAUC production images fail closed unless a production keyring is configured, while keeping development and test images explicit about using the repository development CA.
- Requirements:
- Default production behavior must require atomixos.rauc.keyringCert
- Development/test images must explicitly opt into the repository development CA
- VM tests must set the development opt-in where needed
- Documentation must show production and development keyring examples
- Constraints:
- Must not break local VM development workflows
- Must preserve RAUC signed-bundle verification
- Must keep release image configuration auditable from Nix options
- Non-goals:
- Replacing RAUC
- Managing production CA issuance or rotation server-side
- Success criteria:
- A release image without keyringCert fails evaluation or build
- Development images continue to build only with an explicit dev-keyring opt-in
- Docs clearly state that the repository dev CA is never acceptable for production OTA
- Risks and tradeoffs:
- Existing ad hoc test images may need option updates
- Operators need a documented CA provisioning workflow before release builds
- Dependencies: RAUC module options from provisioning API service foundation
- Suggested validation: Nix evaluation tests for both fail-closed and dev opt-in modes
- Suggested first workflow command: /start-feature rauc-production-keyring-policy
provisioning-api-privilege-separation
- Status: planned
- Overview: Split the network-facing provisioning API process from privileged host mutation helpers. The web process should run unprivileged and call a narrow, auditable helper for config promotion, service activation, firewall changes, and socket rebinding.
- Requirements:
- Run the Litestar/uvicorn service as an unprivileged user
- Define a minimal privileged helper interface for apply/recover/activate actions
- Preserve single-flight apply semantics and job progress reporting
- Preserve first-boot bootstrap behavior and SSH-signed reapply behavior
- Ensure helper inputs are validated and scoped to /data/config
- Constraints:
- Must work with read-only rootfs and mutable /data
- Must avoid adding DB, Redis, or heavyweight IPC dependencies
- Must not regress first-boot operator workflow
- Non-goals:
- Full multi-tenant authorization model
- Remote fleet orchestration
- Success criteria:
- Compromise of the HTTP process does not directly grant root shell or arbitrary filesystem mutation
- Apply/recover/rollback paths still pass existing Python and Nix VM tests
- Systemd hardening is documented and enforced in the service unit
- Risks and tradeoffs:
- Helper boundary adds implementation and test complexity
- Progress reporting may need a simple IPC contract
- Dependencies: Provisioning API foundation
- Suggested validation: VM test proving unprivileged service can provision via helper
- Suggested first workflow command: /start-feature provisioning-api-privilege-separation
provisioning-api-live-schema-contract
- Status: planned
- Overview: Treat the live OpenAPI schema exposed by the provisioning service as a supported client contract, not incidental framework output.
- Requirements:
- Keep API routes documented with accurate request bodies, headers, responses, and error shapes
- Exclude Boot UI/static routes from the API schema unless deliberately documented
- Add tests that assert schema coverage for new API endpoints
- Preserve operation IDs and domain tags for client generation
- Constraints:
- Live schema exposure is intentional for online clients
- Must not expose inaccurate write-only implementation routes
- Must keep schema generation dependency-light
- Non-goals:
- Replacing config.toml as the canonical import/export artifact
- Adding OAuth/JWT solely for docs access
- Success criteria:
- Generated clients can submit config, poll jobs, validate config, and handle errors using the live schema
- CI fails when a new API route lacks schema assertions
- Risks and tradeoffs:
- Litestar defaults may need explicit overrides for raw binary endpoints
- Schema tests add maintenance cost but prevent client drift
- Dependencies: Provisioning API foundation
- Suggested validation: Python tests against /schema/openapi.json
- Suggested first workflow command: /start-feature provisioning-api-live-schema-contract
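A schema-coverage assertion of the kind this feature calls for could look like the sketch below. It assumes the OpenAPI document has already been fetched and parsed into a dictionary; the helper name `undocumented_routes` and the shape of the route list are hypothetical and not part of the provisioning service. It relies only on the standard OpenAPI layout, where `paths` maps each path to an object keyed by lowercase HTTP method.

```python
from typing import Dict, List, Tuple


def undocumented_routes(
    schema: Dict, routes: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return the (method, path) pairs that are missing from the schema
    or that lack an operationId needed for client generation."""
    paths = schema.get("paths", {})
    missing = []
    for method, path in routes:
        operation = paths.get(path, {}).get(method.lower())
        if operation is None or not operation.get("operationId"):
            missing.append((method, path))
    return missing
```

A test would call this with the list of routes the service is expected to expose and assert that the result is empty, so that adding a route without documenting it fails CI.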
typed-partial-provisioning-api
- Status: planned
- Overview: Add typed partial configuration endpoints for common operations while preserving config.toml and bundles as the canonical import/export/backup format. Partial changes must always produce a full desired state and reuse the existing validate, render, promote, activate, and rollback pipeline.
- Requirements:
- Add typed endpoints for users, network/LAN settings, container services, volumes, and firewall inbound rules in priority order
- Load current desired state, apply the typed patch, validate the full result, render a candidate, promote atomically, activate, and roll back on failure
- Return async jobs with progress just like full config submission
- Preserve config export/backup semantics after partial changes
- Constraints:
- Must not mutate derived files directly under /data/config
- Must not introduce a database or divergent state store
- Must keep full config import behavior authoritative
- Non-goals:
- Arbitrary JSON patch over internal rendered state
- Fleet-level orchestration
- Success criteria:
- Partial updates and full config imports converge on the same on-disk desired state
- Failed partial updates roll back identically to failed full imports
- Live OpenAPI accurately documents each typed endpoint
- Risks and tradeoffs:
- More API surface increases schema and validation maintenance
- Some edits may require restart ordering or health semantics not yet modeled
- Dependencies: Provisioning API foundation, live schema contract
- Suggested validation: Python tests for typed patch-to-full-state conversion plus VM tests for at least one user and one container partial update
- Suggested first workflow command: /start-feature typed-partial-provisioning-api
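The patch-to-full-state rule for this feature can be illustrated with a small Python sketch. Everything named here is hypothetical: the `UserPatch` type, the `ssh_keys` field, and the validation rule are stand-ins chosen for the example, not the project's schema. The point is the flow: copy the current desired state, apply the typed patch, validate the complete result, and only then hand a full candidate to the existing pipeline.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class UserPatch:
    """Hypothetical typed patch for one user entry."""
    name: str
    ssh_keys: Optional[List[str]] = None  # None means "leave unchanged"


def validate_full_state(state: Dict[str, Any]) -> None:
    """Stand-in for full-config validation: every user needs an SSH key."""
    for name, user in state.get("users", {}).items():
        if not user.get("ssh_keys"):
            raise ValueError(f"user {name!r} needs at least one SSH key")


def apply_user_patch(current: Dict[str, Any], patch: UserPatch) -> Dict[str, Any]:
    """Return a new, fully validated desired state; `current` is not mutated."""
    candidate = copy.deepcopy(current)
    user = candidate.setdefault("users", {}).setdefault(patch.name, {})
    if patch.ssh_keys is not None:
        user["ssh_keys"] = list(patch.ssh_keys)
    validate_full_state(candidate)  # validate the whole state, not just the patch
    return candidate
```

Because the function returns a complete state rather than a diff, the result can be rendered, promoted, and rolled back exactly like a full config import, which is what keeps the two paths convergent.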
boot-ui-htmx
- Status: planned
- Overview: Redesign the first-boot Boot UI as a small server-rendered HTMX interface while preserving the current upload/paste provisioning flow and bootstrap CSRF token controls.
- Requirements:
- Keep first-boot UI available only before provisioning completes
- Preserve upload and paste config paths
- Show async job progress using the returned job URL
- Reuse server-rendered fragments; no SPA/Vite dependency
- Maintain Host/Origin/Referer protections and bootstrap token checks
- Constraints:
- Must fit embedded rootfs constraints
- Must not add a separate frontend build pipeline unless justified
- Must not introduce unauthenticated post-provision mutation paths
- Non-goals:
- Full on-device management UI
- Replacing programmatic /api/config
- Success criteria:
- Operator can provision from desktop and mobile browsers
- UI reflects validation/apply progress and final forwarding URL
- UI tests cover first-boot only exposure and CSRF failure paths
- Risks and tradeoffs:
- More UI affordances increase bootstrap attack surface if not carefully scoped
- HTMX fragments must stay aligned with API/job behavior
- Dependencies: Provisioning API foundation
- Suggested validation: Python route tests and manual browser test in VM
- Suggested first workflow command: /start-feature boot-ui-htmx
watchdog-enforcement
- Status: deferred
- Overview: Enable hardware watchdog enforcement with RuntimeWatchdogSec=30s and RebootWatchdogSec=10min on Rock64.
- Requirements:
- Complete Rock64 boot-reliability validation
- Enable systemd manager watchdog settings
- Verify watchdog-triggered reboots feed into boot-count rollback
- Constraints:
- Must not cause false-positive reboot loops during normal operation
- Must be validated on physical hardware before enabling
- Non-goals: Software-only watchdog
- Success criteria:
- Watchdog reboots device within 30s of systemd hang
- 3 consecutive watchdog reboots trigger automatic slot rollback
- No false triggers during normal 72-hour soak test
- Risks and tradeoffs:
- Aggressive timeout may cause false triggers on slow boots
- Cannot be fully validated in QEMU
- Dependencies: Physical hardware availability for soak testing
- Suggested validation: 72-hour soak test on physical Rock64
- Suggested first workflow command: /start-feature watchdog-enforcement
usb-wifi
- Status: deferred
- Overview: Enable WiFi support for selected USB WiFi hardware.
- Requirements:
- Select supported USB WiFi chipset and firmware
- Enable kernel WiFi and Bluetooth stacks
- Add WiFi NIC to systemd .link naming
- Define WiFi role (WAN backup? LAN extension?)
- Constraints:
- Must not increase rootfs closure beyond 1 GB slot limit
- Firmware must be redistributable
- Non-goals: Access point mode (initially)
- Success criteria: WiFi interface comes up and connects to configured network
- Risks and tradeoffs:
- Firmware blob licensing and size
- WiFi reliability on embedded hardware
- Unclear network role
- Dependencies: Hardware selection
- Suggested validation: Hardware test with selected adapter
- Suggested first workflow command: /start-feature usb-wifi
config-reapply-improvements
- Status: planned
- Overview: Harden the existing config re-apply path (POST /api/config on the always-running bootstrap server) with authentication, atomic replacement, and rollback-on-failure. The basic re-apply mechanism already works: any POST overwrites /data/config and triggers quadlet-sync to restart services.
- Requirements:
- Authentication guard on the re-apply endpoint (not a reset token)
- Atomic replacement of /data/config (write to temp, swap on success)
- Validate new config before replacing old config
- Rollback to previous config if new config’s services fail to start
- Constraints:
- Must not touch /data outside of /data/config
- Must not break the existing unguarded first-provision flow on fresh devices
- Authentication mechanism must work on LAN-local without external dependencies
- Non-goals:
- Full /data wipe (separate operation)
- Partial config updates (always full replacement)
- Changing the existing provisioning flow for fresh devices
- Success criteria:
- Unauthenticated POST to /api/config is rejected on an already-provisioned device
- Authenticated POST atomically replaces config and restarts services
- Crash during replacement leaves previous config intact
- Failed service startup triggers automatic rollback to previous config
- Risks and tradeoffs:
- Container state (volumes, data) may be inconsistent after rollback
- Service downtime during re-apply is unavoidable
- Authentication mechanism choice affects operational complexity
- Dependencies: None (existing mechanism works; this is hardening)
- Suggested validation: NixOS VM test with sequential config imports, crash simulation, and rollback verification
- Suggested first workflow command: /start-feature config-reapply-improvements
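The "write to temp, swap on success" requirement can be sketched in Python for a single config file. This is a simplified illustration, not the provisioner's code: it handles one file rather than the whole /data/config directory, and the function names and the `.previous` backup suffix are assumptions. It relies on `os.replace`, which renames atomically when source and destination are on the same filesystem.

```python
import os
import shutil
import tempfile
from typing import Callable


def promote_config(target: str, new_text: str, validate: Callable[[str], None]) -> None:
    """Validate, then atomically replace `target`, keeping `<target>.previous`."""
    validate(new_text)  # raises before anything on disk changes
    directory = os.path.dirname(os.path.abspath(target))
    # The temp file lives in the same directory so the final rename stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".config-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new content durable before the swap
        if os.path.exists(target):
            shutil.copy2(target, target + ".previous")  # kept for rollback
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def rollback_config(target: str) -> None:
    """Restore the previous config, for example after services fail to start."""
    os.replace(target + ".previous", target)
```

A crash before `os.replace` leaves the old file untouched, and a crash after it leaves the complete new file, so a reader never observes a partially written config. Replacing a whole directory atomically needs a different technique, such as swapping a symlink, which this sketch does not cover.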
Features
This section records feature-level design notes, requirements, and task tracking.
The older OpenSpec change documents have been converted into this mdBook feature spec format so implementation notes, requirements, and task status live with the rest of the project documentation.
Feature: rock64-ab-image
Overview
Why
- Rock64 devices need a robust OTA update model that keeps the previous system bootable while a new image is written and verified.
- The device is also the LAN isolation boundary for downstream legacy equipment, so networking, authentication, and recovery behavior must be explicit parts of the platform design.
- Early exploration included an on-device Cockpit/Traefik management path and password-oriented fallback flows, but the implemented platform moved to a smaller appliance baseline: Podman stays on-device for workloads, while remote web management is expected to live in the Nixstasis environment instead of the device image itself.
What Changes
- Introduce a NixOS flake that builds the Rock64 system, flashable image, signed RAUC bundle, and QEMU test target
- Implement an A/B layout with per-slot boot partitions, squashfs rootfs slots, and /data created on first boot by initrd systemd-repart
- Use RAUC plus U-Boot bootmeth for slot switching and rollback, with rauc status mark-good used for confirmation
- Keep Podman, OpenVPN, OpenSSH, chrony, dnsmasq, nftables, and the update client in the device image
- Remove the local Cockpit/Traefik management stack from the final device design while preserving application workloads through Podman
- Use SSH-key-only operator access, locked local passwords, and _RUT_OH_ as a physical serial recovery path
- Support provisioning-aware first boot through the bounded config.toml contract defined by the first-boot-local-provisioning follow-on change
- Keep the update client swappable between the default os-upgrade polling path and a future hawkBit path
Capabilities
New Capabilities
- nix-flake-config: Rock64/QEMU flake outputs, stripped kernel, core runtime services, and build artifacts
- partition-layout: Flashable image layout with U-Boot raw region, slot A in the image, and slot B plus /data provisioned on first boot
- rauc-integration: RAUC system configuration, signed multi-slot bundle building, slot definitions, and update client integration
- boot-rollback: U-Boot boot-count logic and RAUC-driven slot confirmation / rollback behavior
- watchdog: Hardware-watchdog-oriented design, with QEMU validation and hardware re-enablement tracked separately
- update-confirmation: Local os-verification health checks before committing updated slots
- lan-gateway: Deterministic NIC naming, DHCP/NTP on LAN, nftables policy, and no packet forwarding between WAN and LAN
Follow-on Changes
- first-boot-local-provisioning: Refines the day-0 and reprovisioning flow, the config.toml contract, and the /data/config/ persistence boundary
- durable-journald-logs: Defines the current runtime log durability model and the still-incomplete initrd forensic redesign
Impact
- Affected code: flake.nix, shared/system modules, image and bundle derivations, U-Boot boot script, RAUC config, first-boot/update services, and QEMU tests
- Affected storage layout: raw U-Boot region, per-slot boot/rootfs partitions, and durable operator/runtime state on /data
- Affected operations: flash/build workflow, first boot, update confirmation, rollback, LAN gateway bring-up, and remote recovery
- Security: no embedded login credentials, key-only operator access, WAN SSH disabled by default, and no default packet forwarding between networks
Design
This document is maintained as the current source of truth for the foundational Rock64 A/B image design. Where the implementation diverged from early exploration, the current design is described directly and explicit divergence notes are kept only when they explain an important technical decision.
Context
AtomixOS is a secure, reproducible operating system for single-board computers. The initial target is Rock64 (RK3328, aarch64) hardware. The platform must tolerate failed updates and power loss without bricking the device, while keeping the base image small enough to fit two rootfs slots plus persistent state on 16 GB eMMC.
The implemented platform centers on:
- NixOS built as a read-only squashfs image
- RAUC-managed A/B updates
- U-Boot boot-count rollback
- Podman for on-device application workloads
- LAN gateway services (dnsmasq, chrony, nftables)
- SSH-key-only operator access and a physical serial recovery path
- Nixstasis-oriented remote management rather than a permanent local web-management stack inside the device image
Goals / Non-Goals
Goals
- Atomic A/B updates that only write the inactive slot pair
- Automatic rollback when a new slot fails to boot or cannot stay healthy
- A read-only appliance baseline with durable state isolated under /data
- Deterministic networking and strict LAN/WAN isolation behavior
- A small runtime closure that still supports Podman workloads and recovery access
- A QEMU target that shares the real configuration for rapid iteration and test coverage
Non-Goals
- Running a permanent Cockpit/Traefik management surface directly on the device
- Embedding device credentials or per-device secrets in the base image
- Generic provisioning engines such as cloud-init
- Server-side update infrastructure design
- Full logging durability redesign in this change
Decisions
1. Use RAUC for A/B updates
Decision: Use RAUC as the update framework for multi-slot installs, signature verification, and slot metadata.
Rationale: RAUC fits the NixOS ecosystem well, supports multi-image bundles, and integrates cleanly with the U-Boot boot-count model the device uses.
2. Use a read-only squashfs rootfs with OverlayFS at boot
Decision: The runtime system is a read-only squashfs lower layer combined with a tmpfs-backed OverlayFS upper/work layer assembled in initrd.
Rationale: This keeps the runtime root immutable, avoids drift across boots, and makes the A/B slot boundary easy to
reason about. Mutable state lives outside the rootfs on /data.
3. Partition the eMMC as raw U-Boot + per-slot boot/rootfs + /data
Decision: The flashable image contains raw U-Boot plus boot-a and rootfs-a. On first boot, initrd
systemd-repart creates boot-b, rootfs-b, and the persistent data partition from the remaining space.
Rationale: This keeps the shipped image small and deterministic while still resulting in a full A/B layout on-device.
Per-slot boot partitions avoid a shared /boot single point of failure.
The current target layout is:
0-16 MiB raw U-Boot region
128 MiB boot-a (vfat)
1024 MiB rootfs-a (squashfs/raw)
128 MiB boot-b (vfat, created on first boot)
1024 MiB rootfs-b (raw, created on first boot)
remainder /data (f2fs, created on first boot)
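The arithmetic behind this layout can be sketched as follows. This is an illustration only: the function name plan_layout and the 14,800 MiB usable capacity are assumptions (real 16 GB eMMC parts expose slightly different capacities), and the sketch ignores GPT headers and alignment padding.

```python
MIB_PER_REGION = [
    # (name, size in MiB), in on-disk order, taken from the target layout
    ("u-boot-raw", 16),
    ("boot-a", 128),
    ("rootfs-a", 1024),
    ("boot-b", 128),
    ("rootfs-b", 1024),
]


def plan_layout(disk_mib: int) -> list:
    """Return (name, start_mib, size_mib) rows; /data takes the remainder."""
    rows = []
    cursor = 0
    for name, size in MIB_PER_REGION:
        rows.append((name, cursor, size))
        cursor += size
    if cursor >= disk_mib:
        raise ValueError("disk too small for the fixed regions")
    rows.append(("data", cursor, disk_mib - cursor))
    return rows


if __name__ == "__main__":
    # Assumption: 14,800 MiB usable on a nominal 16 GB eMMC.
    for name, start, size in plan_layout(14_800):
        print(f"{name:<12} start={start:>6} MiB  size={size:>6} MiB")
```

The fixed regions add up to 2320 MiB, so most of the device remains available for /data.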
4. Use U-Boot bootmeth plus RAUC mark-good for slot commit
Decision: U-Boot bootmeth handles slot choice and boot-count decrement. Linux confirms a healthy slot with
rauc status mark-good.
Rationale: This matches the current Rock64 implementation and keeps the rollback model aligned with RAUC’s slot view.
Technical note: Earlier investigation explored alternatives because raw eMMC env writes were risky on this hardware
path. The current implementation relies on SPI-backed environment handling plus rauc status mark-good, which is what
the live system and tests now use.
5. Keep the device image small and remove local Cockpit/Traefik management
Decision: Keep Podman on-device for application workloads, but do not ship a local Cockpit/Traefik management stack as part of the final base image.
Rationale: The local web-management path added closure size and operational complexity that no longer matches the intended remote-management model. The current design expects remote web access to be hosted from the Nixstasis side, while the device itself remains focused on SSH, update logic, LAN gateway behavior, and workload runtime support.
6. Keep OpenVPN in the rootfs as a recovery path
Decision: OpenVPN remains a rootfs service for recovery-oriented remote access.
Rationale: It provides a durable management path independent of application containers and is useful when WAN SSH is disabled by policy.
7. Make first boot provisioning-aware
Decision: A valid provisioning import is part of the production first-boot contract, not a post-boot manual step.
Rationale: A device that boots Linux but lacks operator credentials and required workload intent is not actually ready
for deployment. The detailed contract is defined in the first-boot-local-provisioning follow-on change, but the core
foundational design is now provisioning-aware:
- fresh-flash detection happens in initrd
- first boot can import from /boot/config.toml, USB media, or a LAN-local bootstrap UI
- imported operator intent persists under /data/config/
- first boot calls rauc status mark-good only after provisioning import/validation succeeds
8. Use local health confirmation for updated slots
Decision: os-verification.service validates device-local health before committing an updated slot.
Rationale: Slot confirmation should not depend on external connectivity. The implemented checks are intentionally bounded:
- dnsmasq.service is active
- chronyd.service is active
- eth0 has a WAN IPv4 address
- eth1 is 172.20.30.1
- each unit listed in /data/config/health-required.json is active
- the checks stay healthy for a sustained 60-second window
If those checks pass, os-verification.service calls rauc status mark-good.
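One way to express the sustained-window rule is a polling loop that fails on the first unhealthy sample. The sketch below is not the actual os-verification implementation: the function name healthy_for_window, the 5-second poll interval, and the injectable clock and sleep parameters are assumptions made for illustration and testing.

```python
import time
from typing import Callable, Iterable

Check = Callable[[], bool]


def healthy_for_window(
    checks: Iterable[Check],
    window_s: float = 60.0,
    interval_s: float = 5.0,  # assumed poll interval, not from the design
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every check passes on every poll across the window."""
    checks = list(checks)
    deadline = clock() + window_s
    while True:
        if not all(check() for check in checks):
            return False  # a single failed sample leaves the slot uncommitted
        if clock() >= deadline:
            return True  # caller may now run `rauc status mark-good`
        sleep(interval_s)
```

Each entry in checks would wrap one of the bounded conditions listed above, so the commit decision never depends on external connectivity.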
9. Enforce deterministic networking and strict LAN/WAN separation
Decision: The onboard GMAC is always eth0, the USB LAN adapter becomes eth1, and packet forwarding stays off.
Rationale: The device identity, WAN policy, and LAN gateway behavior all depend on stable interface roles.
The effective network model is:
- eth0: WAN DHCP client
- eth1: LAN gateway at 172.20.30.1/24
- dnsmasq serves DHCP only on LAN
- chrony serves NTP only on LAN
- nftables allows WAN HTTPS/OpenVPN, LAN DHCP/NTP/SSH/bootstrap UI, and no forwarding
- WAN SSH stays off unless /data/config/ssh-wan-enabled exists
10. Use SSH-key-only operator access with physical serial recovery
Decision: Operator accounts are declared by config under [users.<name>], remain password-locked, and use SSH keys
from /data/config/ssh-authorized-keys/<user>. Root is also locked by default. _RUT_OH_ is a physical serial-only
recovery path, not a network authentication mode.
Rationale: This matches the implemented security posture and removes ambiguity around password-based operator access.
11. Keep the update client hawkBit-ready, but default to simple polling
Decision: os-upgrade.timer is the default update client. The design still reserves a future hawkBit path through a
configuration switch, but the current implementation keeps the simple polling path as the active one.
Rationale: The device-side architecture should not block future fleet-management integration, but the default runtime should stay small and directly testable.
12. Keep runtime log durability simple and bounded
Decision: Runtime logs use volatile journald plus buffered rsyslog writes to /data/logs, with a slot-local forensic
ring for key lifecycle events.
Rationale: This keeps the general logging path lightweight while still persisting important state transitions.
Technical note: The initrd forensic persistence path is currently disabled pending redesign. That follow-on work is
tracked in durable-journald-logs.
Risks / Trade-offs
- Watchdog on hardware is still staged: the design includes the hardware watchdog path, but live Rock64 enablement is still gated on stable hardware validation.
- Provisioning is now part of the first-boot success contract: this is correct for production, but it means invalid provisioning blocks slot confirmation.
- No local web-management stack in the base image: this reduces closure size and appliance complexity, but shifts remote-management responsibility to Nixstasis-hosted services.
- Full image updates consume more bandwidth than delta approaches: acceptable for the current phase.
- Initrd forensic durability is incomplete: runtime durability exists, but the earliest boot persistence path still needs a safer redesign.
Follow-on Changes
- first-boot-local-provisioning refines the provisioning contract, source-order logic, and /data/config/ layout
- durable-journald-logs refines the runtime log-durability model and tracks the incomplete initrd redesign
Requirements
boot-rollback
ADDED Requirements
Requirement: U-Boot tracks boot attempts per slot
U-Boot SHALL maintain a boot-attempt counter for each slot (BOOT_A_LEFT, BOOT_B_LEFT). On each boot attempt, the
counter for the selected slot SHALL be decremented. If the counter reaches zero, U-Boot SHALL fall back to the other
slot on the next boot.
Scenario: Boot counter decrements on each boot
- WHEN the device boots and the active slot has BOOT_A_LEFT=3
- THEN U-Boot decrements the slot counter before attempting the boot
Scenario: Slot switches when counter reaches zero
- WHEN the active slot’s boot counter reaches 0
- THEN U-Boot selects the other slot on the next boot
Requirement: U-Boot boot order reflects the next slot priority
U-Boot SHALL use BOOT_ORDER to determine slot priority, and RAUC installation SHALL make the newly written inactive
slot the next slot to attempt.
Scenario: RAUC install changes the preferred slot
- WHEN RAUC installs a bundle to slot B while slot A is active
- THEN the next boot attempts slot B before slot A
Requirement: Successful confirmation commits the slot with RAUC
After successful first-boot validation or post-update confirmation, Linux SHALL call rauc status mark-good for the
booted slot.
Scenario: First boot commits the slot after valid provisioning
- WHEN first-boot.service successfully imports and validates provisioning state
- THEN it calls rauc status mark-good for the booted slot
Scenario: Updated slot is committed after local verification
- WHEN os-verification.service confirms the booted slot is healthy
- THEN it calls rauc status mark-good for the booted slot
Requirement: Rollback preserves the previous working slot
If a newly installed slot cannot boot successfully or never reaches a committed state, U-Boot SHALL eventually fall back to the previous working slot.
Scenario: Failed update triggers automatic rollback
- WHEN a new image is installed to slot B and slot B fails repeatedly until its boot counter is exhausted
- THEN U-Boot falls back to slot A
Scenario: Previous slot remains intact
- WHEN the device rolls back from slot B to slot A
- THEN slot A still contains the previously working image because updates only write the inactive slot pair
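The rollback scenarios above can be traced with a small state model. This is a simplified sketch of the BOOT_ORDER and BOOT_x_LEFT behavior described in these requirements, not U-Boot or RAUC code; the class BootEnv and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ATTEMPTS = 3  # boot attempts granted to a freshly installed or confirmed slot


@dataclass
class BootEnv:
    """Simplified model of the U-Boot variables named in the requirements."""
    order: List[str] = field(default_factory=lambda: ["A", "B"])  # BOOT_ORDER
    left: Dict[str, int] = field(
        default_factory=lambda: {"A": ATTEMPTS, "B": ATTEMPTS}
    )  # BOOT_A_LEFT / BOOT_B_LEFT

    def select_slot(self) -> Optional[str]:
        """Pick the first slot in BOOT_ORDER with attempts left; decrement it."""
        for slot in self.order:
            if self.left[slot] > 0:
                self.left[slot] -= 1
                return slot
        return None  # no bootable slot remains

    def install_to(self, slot: str) -> None:
        """Model a RAUC install: the written slot is attempted first next boot."""
        self.order = [slot] + [s for s in self.order if s != slot]
        self.left[slot] = ATTEMPTS

    def mark_good(self, slot: str) -> None:
        """Model `rauc status mark-good`: restore the slot's attempt budget."""
        self.left[slot] = ATTEMPTS
```

In this model, installing to slot B and then booting three times without a mark_good call exhausts B's counter, and the fourth selection returns slot A, whose image was never written during the update.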
Requirement: Rock64 uses the active U-Boot environment path supported by the platform
The Rock64 rollback design SHALL use the platform’s active U-Boot environment path together with RAUC’s U-Boot backend rather than relying on ad hoc slot bookkeeping in Linux.
Scenario: Linux and U-Boot agree on slot identity
- WHEN Linux determines the booted slot and calls rauc status mark-good
- THEN the same slot identity is used by the U-Boot / RAUC rollback path
lan-gateway
ADDED Requirements
Requirement: Network interfaces are named deterministically
The NixOS configuration SHALL disable systemd predictable interface names and use systemd-networkd .link files to
assign deterministic names: the onboard RK3328 GMAC SHALL be eth0, and USB Ethernet adapters SHALL be eth1, eth2, and so on.
The onboard NIC SHALL be matched by its hardware platform path (platform-ff540000.ethernet). USB WiFi dongles are
not part of the current Rock64 support contract.
Scenario: Onboard NIC is always eth0
- WHEN the Rock64 boots with or without USB network adapters plugged in
- THEN the onboard RK3328 GMAC is named eth0 regardless of USB device enumeration order
Scenario: USB NIC is assigned sequential ethN name
- WHEN a USB ethernet adapter is plugged into the Rock64
- THEN it is assigned the next available ethN name (e.g., eth1)
Requirement: eth0 is configured as WAN interface
eth0 (onboard NIC) SHALL be configured as a DHCP client to obtain a WAN address from the upstream network.
Scenario: eth0 obtains WAN address
- WHEN the Rock64 boots and eth0 is connected to a network with a DHCP server
- THEN eth0 obtains an IP address via DHCP
Requirement: eth1 is configured as LAN interface with static IP
eth1 (USB NIC) SHALL be configured with the provisioned LAN gateway IP and prefix. If no valid provisioned LAN config
exists, it SHALL fall back to 172.20.30.1/24.
Scenario: eth1 has correct static address
- WHEN the Rock64 boots with a USB NIC plugged in and a valid LAN config is present
- THEN eth1 has the provisioned gateway IP and prefix
Scenario: eth1 falls back to default static address
- WHEN the Rock64 boots with no valid provisioned LAN config
- THEN eth1 has IP address 172.20.30.1 with netmask 255.255.255.0
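The fallback rule can be sketched as a small resolver. The JSON keys gateway_ip and prefix are hypothetical, since the real schema of /data/config/lan-settings.json is not shown in this document, and the function name resolve_lan_interface is likewise an illustration.

```python
import ipaddress
import json
from pathlib import Path

FALLBACK_LAN = ipaddress.ip_interface("172.20.30.1/24")


def resolve_lan_interface(settings_path: Path):
    """Return the provisioned eth1 address, or the fallback if config is unusable."""
    try:
        raw = json.loads(settings_path.read_text())
        # Hypothetical keys: the real lan-settings.json schema may differ.
        iface = ipaddress.ip_interface(f"{raw['gateway_ip']}/{raw['prefix']}")
    except (OSError, ValueError, KeyError, TypeError):
        return FALLBACK_LAN  # missing file, bad JSON, or invalid address
    if iface.version != 4:
        return FALLBACK_LAN
    return iface
```

Treating every parse or validation error as "no valid config" keeps the LAN reachable at a known address even when provisioning state is damaged.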
Requirement: IP forwarding is disabled
The kernel parameter net.ipv4.ip_forward SHALL be set to 0. No packet-level routing SHALL occur between any
interfaces. This provides the EN18031 compliance boundary for legacy LAN devices.
Scenario: No traffic is routed between interfaces
- WHEN a device on the LAN (172.20.30.x) sends a packet destined for a WAN address
- THEN the packet is dropped by the Rock64 kernel and never reaches eth0
Requirement: DHCP server runs on LAN interface
dnsmasq SHALL be configured to serve DHCP on eth1 (LAN) only. The DHCP pool SHALL use the provisioned LAN DHCP range, reserving lower addresses for static assignments.
Scenario: LAN device obtains IP via DHCP
- WHEN a device is connected to the switch on the LAN
- THEN it receives an IP address in the provisioned DHCP range from the Rock64’s DHCP server
Scenario: DHCP only serves LAN
- WHEN a DHCP request arrives on eth0 (WAN)
- THEN dnsmasq does not respond to it
Requirement: NTP server runs on LAN interface
chrony SHALL be configured as both an NTP client (syncing from WAN NTP servers via eth0) and an NTP server (serving time
to LAN devices on eth1). NTP service SHALL accept clients from the provisioned LAN subnet. When no valid provisioned LAN
config exists, it SHALL accept clients from the fallback 172.20.30.0/24 subnet.
Scenario: Rock64 syncs time from WAN
- WHEN the Rock64 boots with WAN connectivity
- THEN chrony synchronizes time from upstream NTP servers via eth0
Scenario: LAN device syncs time from Rock64
- WHEN a LAN device queries NTP at the provisioned LAN gateway IP
- THEN chrony responds with the current time
Scenario: NTP rejects non-LAN clients
- WHEN an NTP request arrives from a source outside the provisioned LAN subnet
- THEN chrony does not respond
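The client-acceptance rule amounts to a subnet membership test. The sketch below models that decision and renders the corresponding chrony allow directive; the function names are hypothetical, and how the real configuration is generated is not described here.

```python
import ipaddress

FALLBACK_SUBNET = ipaddress.ip_network("172.20.30.0/24")


def ntp_client_allowed(source_ip: str, lan_subnet=None) -> bool:
    """Return True if the NTP request source lies inside the LAN subnet."""
    network = lan_subnet or FALLBACK_SUBNET
    try:
        return ipaddress.ip_address(source_ip) in network
    except ValueError:
        return False  # unparseable source addresses are never served


def chrony_allow_line(lan_subnet=None) -> str:
    """Render the chrony `allow` directive for the effective LAN subnet."""
    return f"allow {lan_subnet or FALLBACK_SUBNET}"
```

For example, chrony_allow_line() yields "allow 172.20.30.0/24" when no provisioned subnet is supplied.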
Requirement: nftables firewall restricts traffic per interface
nftables SHALL be configured with the following rules:
eth0 (WAN) inbound: ALLOW established/related, DROP all else by default. Provisioned firewall state MAY add
application or VPN ports from /data/config/firewall-inbound.json under the wan scope. TCP/22 (SSH) is allowed only
if the flag file /data/config/ssh-wan-enabled exists.
eth1 (LAN) inbound: ALLOW all inbound traffic by default. If provisioned firewall state includes a lan scope with
TCP or UDP ports in /data/config/firewall-inbound.json, the provisioned LAN ports SHALL be appended to the
platform-required LAN ports instead of replacing them.
tun0 (VPN) inbound: ALLOW TCP/22 (SSH), ALLOW established/related, DROP all else.
FORWARD chain: DROP all (no inter-interface routing).
Scenario: WAN application ports are provisioned
- WHEN /data/config/firewall-inbound.json contains wan.tcp = [443]
- AND provisioned-firewall-inbound.service applies the provisioned state
- THEN HTTPS connections to eth0 on port 443 are accepted
Scenario: LAN application ports are provisioned
- WHEN /data/config/firewall-inbound.json contains lan.tcp = [443]
- AND provisioned-firewall-inbound.service applies the provisioned state
- THEN inbound connections to eth1 on TCP 443 are accepted
Scenario: LAN remains open without explicit LAN scope
- WHEN /data/config/firewall-inbound.json does not contain a lan scope with any ports
- THEN inbound connections to eth1 remain accepted by the default LAN-open rule
Scenario: Provisioned LAN ports append to required platform ports
- WHEN /data/config/firewall-inbound.json contains lan.tcp = [443]
- AND provisioned-firewall-inbound.service applies the provisioned state
- THEN inbound connections to eth1 on TCP 443 are accepted
- AND the platform-required LAN ports remain accepted
Scenario: WAN application ports are closed before provisioning
- WHEN no provisioned firewall state allows TCP/443 or UDP/1194
- THEN new inbound connections to eth0 on TCP/443 and UDP/1194 are dropped
Scenario: SSH is blocked on WAN by default
- WHEN an SSH connection is attempted to eth0 on port 22 and /data/config/ssh-wan-enabled does not exist
- THEN the connection is dropped
Scenario: SSH is allowed on WAN when flag is set
- WHEN an SSH connection is attempted to eth0 on port 22 and /data/config/ssh-wan-enabled exists
- THEN the connection is accepted
Scenario: SSH is always allowed on LAN
- WHEN an SSH connection is attempted to eth1 on port 22
- THEN the connection is accepted
Scenario: DNS is allowed on LAN
- WHEN a DNS query is sent to eth1 on TCP/53 or UDP/53
- THEN the packet is accepted
Scenario: Bootstrap UI is allowed on LAN by default
- WHEN a connection is made to eth1 on TCP/8080
- THEN the connection is accepted
Scenario: SSH is allowed over VPN
- WHEN an SSH connection is attempted via the tun0 interface on port 22
- THEN the connection is accepted
Scenario: No forwarding between interfaces
- WHEN any packet arrives that would be forwarded between interfaces
- THEN the packet is dropped by the FORWARD chain
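The inbound scenarios above can be summarized as a decision function for new connections. This is a model of the stated policy, not the nftables rule set itself: the function names are hypothetical, and the nested JSON shape ({"wan": {"tcp": [...]}}) is inferred from the wan.tcp notation used in the scenarios.

```python
import json
from pathlib import Path


def load_wan_ports(config_dir: Path) -> dict:
    """Read the provisioned `wan` scope, treating any damage as 'no ports'."""
    try:
        data = json.loads((config_dir / "firewall-inbound.json").read_text())
        wan = data.get("wan", {})
    except (OSError, ValueError, AttributeError):
        return {}
    return wan if isinstance(wan, dict) else {}


def inbound_allowed(iface: str, proto: str, port: int, config_dir: Path) -> bool:
    """Decide whether a NEW inbound connection is accepted on an interface."""
    if iface == "eth1":
        return True  # LAN is open by default
    if iface == "tun0":
        return proto == "tcp" and port == 22  # VPN carries SSH only
    if iface == "eth0":
        if proto == "tcp" and port == 22:
            # WAN SSH is gated solely by the operator flag file.
            return (config_dir / "ssh-wan-enabled").exists()
        ports = load_wan_ports(config_dir).get(proto, [])
        return isinstance(ports, list) and port in ports
    return False  # unknown interfaces get no inbound access
```

Established/related traffic and the FORWARD chain are outside this sketch; forwarding is dropped unconditionally in the requirements.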
Requirement: WAN SSH toggle is manual only
SSH access on eth0 (WAN) SHALL be controlled by the presence of the flag file /data/config/ssh-wan-enabled. In the
production design, this flag is an explicit operator-controlled toggle rather than an automatically managed runtime rule.
Scenario: Flag file enables WAN SSH
- WHEN /data/config/ssh-wan-enabled is created
- THEN the nftables rule for SSH on eth0 becomes active on the next firewall reload or reboot
Scenario: Flag file removal disables WAN SSH
- WHEN /data/config/ssh-wan-enabled is removed
- THEN SSH connections to eth0 are dropped on the next firewall reload or reboot
Requirement: Device identity uses eth0 MAC address
The device identity SHALL be derived from the MAC address of eth0 (onboard NIC). This address SHALL be readable from
/sys/class/net/eth0/address, normalized to compact lowercase 12-hex format, and used as the unique device identifier
for update confirmation, fleet management, and device registration.
Scenario: Device ID is consistent across reboots
- WHEN the device reboots or updates to a new slot
- THEN the device identity (eth0 MAC) remains the same
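The normalization step can be sketched as follows. The function names normalize_mac and device_id are hypothetical; only the sysfs path and the compact lowercase 12-hex format come from the requirement.

```python
import re
from pathlib import Path


def normalize_mac(raw: str) -> str:
    """Strip separators and whitespace, lowercase, and validate 12 hex digits."""
    compact = re.sub(r"[:\-.\s]", "", raw).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", compact):
        raise ValueError(f"not a 48-bit MAC address: {raw!r}")
    return compact


def device_id(sysfs_path: Path = Path("/sys/class/net/eth0/address")) -> str:
    """Read the eth0 MAC from sysfs and return it as the device identifier."""
    return normalize_mac(sysfs_path.read_text())
```

Because the value comes from the onboard NIC rather than from slot contents, it stays stable across reboots and A/B switches.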
nix-flake-config
ADDED Requirements
Requirement: The flake defines Rock64 and QEMU system configurations
The flake SHALL define nixosConfigurations.rock64 for the real Rock64 hardware target and
nixosConfigurations.rock64-qemu for the shared QEMU test target.
Scenario: Flake evaluates successfully
- WHEN nix flake check is run against the repository
- THEN the flake evaluates without errors
- AND both nixosConfigurations.rock64 and nixosConfigurations.rock64-qemu are present
Scenario: System configuration targets aarch64-linux
- WHEN the Rock64 configuration is built
- THEN its system outputs target aarch64-linux
Requirement: The flake produces a read-only squashfs root filesystem
The flake SHALL produce a squashfs image as packages.aarch64-linux.squashfs. The image SHALL contain the system
closure required for the appliance baseline and SHALL be sized to fit within the 1 GiB rootfs slot.
Scenario: Squashfs image builds successfully
- WHEN nix build .#squashfs is run
- THEN a squashfs image is produced
Scenario: Squashfs image fits the slot budget
- WHEN the squashfs image is built
- THEN the resulting image is no larger than the configured 1 GiB slot limit
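A slot-budget check of this kind can be a few lines in a build or CI step. This sketch assumes the image is available as a regular file; the function name slot_headroom is hypothetical, and how the flake actually enforces the limit is not described here.

```python
import os

SLOT_LIMIT_BYTES = 1024 * 1024 * 1024  # 1 GiB rootfs slot


def slot_headroom(image_path: str, limit: int = SLOT_LIMIT_BYTES) -> int:
    """Return the bytes left in the slot; raise if the image does not fit."""
    size = os.path.getsize(image_path)
    if size > limit:
        raise ValueError(f"image is {size - limit} bytes over the slot limit")
    return limit - size
```

Reporting headroom rather than a bare pass/fail makes closure growth visible before it becomes a build failure.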
Requirement: The flake produces a signed RAUC bundle and flashable image
The flake SHALL expose a signed RAUC bundle as packages.aarch64-linux.rauc-bundle and a flashable device image as
packages.aarch64-linux.image.
Scenario: RAUC bundle builds successfully
- WHEN nix build .#rauc-bundle is run
- THEN a signed .raucb file is produced that passes rauc info
Scenario: Flashable image builds successfully
- WHEN nix build .#image is run
- THEN a flashable Rock64 disk image is produced containing U-Boot, boot-a, and rootfs-a
Requirement: The configuration uses a stripped kernel with modular USB peripheral support
The Rock64 configuration SHALL use a stripped kernel profile with RK3328-required storage, networking, USB host, watchdog, squashfs, f2fs, and overlay support built in. Selected USB Ethernet and USB serial support SHALL be available as modules. USB WiFi support is not part of the current Rock64 image until specific hardware and firmware are selected.
Scenario: Kernel boots on Rock64 hardware
- WHEN the built kernel and DTB are loaded by U-Boot on a Rock64 board
- THEN the kernel boots and detects the required Rock64 hardware path
Scenario: Optional supported USB peripherals load on demand
- WHEN a supported USB Ethernet or USB serial device is connected
- THEN the matching kernel module can be loaded without rebuilding the image
Requirement: The device image includes the core appliance runtime
The Rock64 configuration SHALL include systemd, Podman, OpenSSH, OpenVPN, chrony, dnsmasq, nftables, and the services required for the A/B update system, first-boot provisioning flow, and LAN gateway role.
Scenario: Core runtime services are available after boot
- WHEN the Rock64 boots the built image
- THEN systemd is PID 1
- AND Podman is available for application workloads
- AND SSH, LAN gateway, and update services are present in the system configuration
Requirement: Local web management is not part of the base image
The base Rock64 image SHALL NOT require Cockpit or Traefik to be built into the system closure.
Scenario: Appliance baseline excludes local management stack
- WHEN the Rock64 image is built
- THEN the core platform remains bootable and manageable without a local Cockpit/Traefik stack in the image itself
Requirement: The flake exposes a QEMU testing target that shares the core configuration
The rock64-qemu target SHALL reuse the shared base system configuration while swapping only the hardware-specific
pieces needed for aarch64-virt test execution.
Scenario: QEMU target boots successfully
- WHEN the QEMU test target is built and run
- THEN the system boots with the shared AtomixOS runtime and test harness overrides
Scenario: QEMU target stays close to hardware target
- WHEN the Rock64 and QEMU configurations are compared
- THEN they share the same core service, firewall, and update logic while differing only in hardware/test-specific details
partition-layout
ADDED Requirements
Requirement: eMMC partition layout supports A/B boot and root filesystem slots
The eMMC SHALL be partitioned with the following layout: a raw U-Boot region in the first 16 MiB, two vfat boot
partitions (slot A and slot B, 128 MiB each), two squashfs root filesystem partitions (slot A and slot B, 1 GiB each),
and an f2fs /data partition consuming the remaining space. The flashable image contains slot A only (boot-a and
rootfs-a); slot B and /data are created on first boot by initrd systemd-repart.
Scenario: Partition table matches specification
- WHEN the flashable image is inspected before first boot
- THEN the GPT contains boot-a and rootfs-a, with U-Boot written at the offsets expected by the RK3328 boot ROM in the raw pre-partition region
- AND WHEN the device completes its first boot
- THEN initrd systemd-repart creates boot-b, rootfs-b, and an f2fs /data partition in the remaining space
Requirement: Per-slot boot partitions contain kernel and DTB
Each boot slot (A and B) SHALL contain the Linux kernel image and device tree blob for that slot’s corresponding rootfs. RAUC SHALL update the boot partition and rootfs partition atomically as part of a single bundle install.
Scenario: Boot partition matches its rootfs slot
- WHEN the device boots from slot A
- THEN U-Boot loads the kernel and DTB from boot slot A, which matches the kernel version in rootfs slot A
Scenario: Boot partition is updated atomically with rootfs
- WHEN a RAUC bundle is installed
- THEN both the boot partition and rootfs partition for the target slot are written as a single operation
Requirement: Flashable image and flash workflow deploy the initial slot-A system
The build outputs SHALL include a flashable image that writes U-Boot to the correct raw offset, populates boot-a with
the first kernel/initrd/DTB payload, writes the first squashfs image to rootfs-a, and leaves the remaining eMMC space
unallocated so initrd systemd-repart can create boot-b, rootfs-b, and /data on first boot.
Scenario: First boot after flashing
- WHEN the flashable image has been written to the device and the system reboots
- THEN U-Boot loads the kernel from boot slot A, initrd mounts rootfs slot A as the root filesystem, and the system reaches multi-user.target
Scenario: Flash workflow warns before overwriting an existing target
- WHEN the operator invokes the flashing workflow against an already-populated target device
- THEN the workflow requires explicit operator confirmation before overwriting the target
Requirement: U-Boot is written at correct RK3328 offset
The provisioning script SHALL write U-Boot (idbloader.img and u-boot.itb) to the eMMC at the offsets required by the
RK3328 boot ROM (sector 64 for idbloader, sector 16384 for u-boot.itb). U-Boot SHALL be sourced from the nixpkgs
ubootRock64 package.
Scenario: U-Boot loads from eMMC
- WHEN the Rock64 powers on with the provisioned eMMC
- THEN the RK3328 boot ROM finds and executes U-Boot from the expected eMMC offsets
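The sector offsets translate to byte offsets as shown below. The sketch assumes 512-byte sectors and uses the 16 MiB raw region from the partition layout as the upper bound; the helper names are hypothetical.

```python
SECTOR_BYTES = 512  # assumption: standard 512-byte logical sectors
RAW_REGION_BYTES = 16 * 1024 * 1024  # raw U-Boot region ahead of boot-a

# Start sectors required by the RK3328 boot ROM, per the requirement above.
START_SECTOR = {"idbloader.img": 64, "u-boot.itb": 16384}


def byte_offset(sector: int) -> int:
    """Convert a start sector to a byte offset on the eMMC."""
    return sector * SECTOR_BYTES


def payload_fits(name: str, size_bytes: int) -> bool:
    """Check that a payload ends before the next payload or the first partition."""
    start = byte_offset(START_SECTOR[name])
    later = [byte_offset(s) for s in START_SECTOR.values() if byte_offset(s) > start]
    limit = min(later + [RAW_REGION_BYTES])
    return start + size_bytes <= limit
```

Under these assumptions idbloader.img starts at 32 KiB and u-boot.itb at 8 MiB, which leaves u-boot.itb up to 8 MiB before boot-a begins.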
Requirement: /data partition survives updates
The /data partition SHALL NOT be modified by RAUC updates or rootfs slot switches. It SHALL persist across all updates and rollbacks.
Scenario: Data survives an A/B slot switch
- WHEN a file is written to /data, then an update switches the active slot from A to B
- THEN the file is still present and unmodified on /data after the slot switch
Requirement: Boot configuration uses U-Boot environment for slot selection
U-Boot SHALL use RAUC bootmeth environment variables (BOOT_ORDER, BOOT_A_LEFT, BOOT_B_LEFT) to select the next boot
slot. RAUC bootmeth SHALL provide the selected boot and root partition identities to boot.scr.
Scenario: U-Boot selects correct slot pair
- WHEN U-Boot reads BOOT_ORDER=A B and BOOT_A_LEFT=3
- THEN RAUC bootmeth selects slot A before loading boot.scr
- AND boot.scr loads kernel and DTB from boot slot A and passes slot A’s lower-device identity to initrd
rauc-integration
ADDED Requirements
Requirement: RAUC is configured with A/B multi-slot definitions
The RAUC system configuration (system.conf) SHALL define two slot pairs: boot slot A + rootfs slot A, and boot
slot B + rootfs slot B. Each slot pair SHALL be mapped to its respective eMMC partitions. The configuration SHALL
specify U-Boot as the bootloader backend.
Scenario: RAUC recognizes all slots
- WHEN rauc status is run on the device
- THEN the output lists boot slot A, boot slot B, rootfs slot A, and rootfs slot B with their partition device paths, and one slot pair is marked as active
Requirement: RAUC uses U-Boot bootloader backend
The RAUC configuration SHALL specify bootloader=uboot and configure the appropriate U-Boot environment variable names
for slot selection and boot-count tracking.
Scenario: RAUC install selects the newly written slot for the next boot
- WHEN a RAUC bundle is installed to the inactive slot pair
- THEN the newly written slot becomes the next slot attempted on reboot
Requirement: RAUC verifies bundle signatures before installation
RAUC SHALL be configured with a CA certificate (keyring) and SHALL reject any bundle not signed by a key trusted by
that CA. Unsigned or incorrectly signed bundles SHALL NOT be installed.
Scenario: Valid signed bundle installs successfully
- WHEN a bundle signed with a key trusted by the configured CA is provided to rauc install
- THEN RAUC verifies the signature, writes both boot and rootfs images to the inactive slot pair, and reports success
Scenario: Invalid signature is rejected
- WHEN a bundle with an invalid or untrusted signature is provided to rauc install
- THEN RAUC refuses to install and returns a signature verification error
Requirement: RAUC writes to the inactive slot pair only
RAUC SHALL always write updates to the slot pair that is NOT currently booted. It SHALL never overwrite the running boot partition or root filesystem.
Scenario: Update targets inactive slot pair
- WHEN the device is booted from slot pair A and rauc install is run
- THEN RAUC writes the new boot image to boot slot B and the new rootfs image to rootfs slot B (and vice versa)
Requirement: RAUC bundle contains boot and rootfs images
Each RAUC bundle (.raucb) SHALL contain two images: a boot partition image (kernel + DTB) and a rootfs image
(squashfs). RAUC SHALL write both images to their respective partitions in the target slot pair.
Scenario: Bundle contains both images
- WHEN the RAUC bundle is inspected with rauc info
- THEN the bundle manifest lists both a boot image and a rootfs image
Requirement: Default update client polls for new bundles
A systemd timer (os-upgrade.timer) SHALL periodically poll the update server for new RAUC bundles. When a new bundle
is available, the service SHALL download it and invoke rauc install.
Scenario: New bundle is detected and installed
- WHEN the update server has a bundle with a version newer than the currently installed version
- THEN the polling service downloads the bundle and triggers rauc install
Scenario: No update available
- WHEN the update server reports no newer version
- THEN the polling service exits cleanly and waits for the next timer interval
Scenario: Download failure is handled gracefully
- WHEN the download of a new bundle fails (network error, partial download)
- THEN the polling service logs the error, does not invoke rauc install, and retries at the next interval
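The three polling scenarios map onto a single cycle with three outcomes. This is a sketch under stated assumptions, not the os-upgrade implementation: versions are modeled as integer tuples, and fetch_latest, download, and install are hypothetical callables injected by the caller.

```python
from typing import Callable, Optional, Tuple

Version = Tuple[int, ...]


def poll_once(
    installed: Version,
    fetch_latest: Callable[[], Optional[Version]],
    download: Callable[[Version], str],
    install: Callable[[str], None],
    log: Callable[[str], None] = print,
) -> str:
    """Run one polling cycle and report what happened."""
    try:
        latest = fetch_latest()
        if latest is None or latest <= installed:
            return "up-to-date"  # exit cleanly; the timer schedules the next poll
        bundle_path = download(latest)
    except OSError as exc:
        # Network errors and partial downloads: log, skip install, retry later.
        log(f"update check failed: {exc}")
        return "retry-next-interval"
    install(bundle_path)  # e.g. a wrapper that runs `rauc install <path>`
    return "installed"
```

Keeping the install call outside the try block means a download failure can never hand a partial bundle to RAUC.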
Requirement: Update client is swappable with hawkBit
The design SHALL reserve a configuration switch for a future hawkBit-based update client while keeping the simple polling path as the implemented default.
Scenario: Simple polling is enabled by default
- WHEN the device boots with default configuration
- THEN os-upgrade.timer is active
- AND the simple polling path is the active update client
Scenario: hawkBit client can be enabled
- WHEN the NixOS configuration flag for hawkBit is set to true
- THEN the configuration reserves the hawkBit path instead of the default polling path
Requirement: NixOS RAUC module is enabled in configuration
The NixOS configuration SHALL enable the RAUC service via services.rauc with the appropriate compatible string and
CA certificate path.
Scenario: RAUC service is active after boot
- WHEN the device boots
- THEN the rauc systemd service is running and rauc status returns valid slot information
update-confirmation
ADDED Requirements
Requirement: os-verification.service validates local post-update health
A systemd oneshot service (os-verification.service) SHALL run after boot on systems that have already completed the
separate first-boot provisioning flow. It SHALL perform device-local health checks and SHALL NOT depend on external
network reachability for slot confirmation.
Scenario: Gateway services are validated
- WHEN os-verification.service runs after boot on a pending slot
- THEN it checks that dnsmasq.service and chronyd.service are active
- AND it checks that eth0 has a WAN IPv4 address
- AND it checks that eth1 matches the provisioned LAN gateway IP from /data/config/lan-settings.json
- AND it falls back to 172.20.30.1 when no valid provisioned LAN settings exist
Scenario: Service exits early for already-good slots
- WHEN the device boots a slot that RAUC already reports as good
- THEN os-verification.service exits without re-running the confirmation flow
Requirement: Provisioned health requirements come from /data/config/health-required.json
If /data/config/health-required.json exists, os-verification.service SHALL read it as the list of provisioned units
that must be active before the slot can be committed.
Scenario: Required provisioned units are active
- WHEN /data/config/health-required.json lists one or more provisioned units
- THEN os-verification.service checks that each corresponding ${name}.service is active
Scenario: Required provisioned unit is missing or inactive
- WHEN any unit named in /data/config/health-required.json is not active
- THEN os-verification.service exits with a non-zero status
- AND the slot remains uncommitted
Scenario: No explicit provisioned health requirements exist
- WHEN /data/config/health-required.json is absent or empty
- THEN os-verification.service uses the gateway health checks alone
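The mapping from health-required.json to unit names can be sketched as follows. The file schema is not given in this spec, so the example assumes a plain JSON array of names; required_units, units_healthy, and the is_active callback are hypothetical names used only for illustration.

```python
import json


def required_units(raw):
    """Map health-required.json content to systemd unit names.

    raw is the file text, or None when the file is absent. The file is
    assumed to hold a JSON array of names; the real schema may differ.
    """
    if raw is None or not raw.strip():
        return []  # absent or empty: gateway health checks alone apply
    names = json.loads(raw)
    return [f"{name}.service" for name in names]


def units_healthy(raw, is_active):
    """Return True when every required unit is reported active.

    is_active is a caller-supplied predicate, for example a wrapper
    around systemctl is-active.
    """
    return all(is_active(unit) for unit in required_units(raw))
```

With an absent or empty file the list is empty, so units_healthy returns True and only the gateway checks decide the outcome.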
Requirement: Sustained health check catches unstable services
After the initial checks pass, os-verification.service SHALL continue checking health for a sustained 60-second window
using a 5-second interval.
Scenario: Health remains stable for the sustained window
- WHEN all confirmation checks continue to pass for 60 seconds
- THEN the slot is eligible to be committed
Scenario: A required service becomes unhealthy during the sustained window
- WHEN dnsmasq.service, a required provisioned unit, or another required check fails during the 60-second window
- THEN os-verification.service exits with a non-zero status
- AND the slot remains uncommitted
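The sustained window can be sketched as a sampling loop. This is an illustration under assumptions: sustained_health is a hypothetical name, the sleep function is injectable so the loop can be tested without waiting, and sampling at both ends of the window (t=0 through t=60) is a choice made for the example rather than something the requirement specifies.

```python
import time


def sustained_health(check, window_s=60, interval_s=5, sleep=time.sleep):
    """Return True only if check() passes on every sample across the window.

    Samples are taken at t=0, interval_s, ..., window_s. The first failing
    sample ends the run immediately.
    """
    elapsed = 0
    while True:
        if not check():
            return False  # caller exits non-zero; the slot stays uncommitted
        if elapsed >= window_s:
            return True  # caller may now commit the slot
        sleep(interval_s)
        elapsed += interval_s
```

Failing fast on the first bad sample matches the scenario above: a service that flaps once during the window is enough to leave the slot uncommitted.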
Requirement: Successful confirmation commits the slot with RAUC
When the confirmation checks succeed, os-verification.service SHALL call rauc status mark-good for the booted slot.
Scenario: Slot is committed after successful checks
- WHEN all required checks pass for the sustained confirmation window
- THEN os-verification.service calls rauc status mark-good
- AND the booted slot becomes committed
Requirement: Failed confirmation leaves the slot pending rollback
If confirmation fails, the system SHALL NOT commit the slot.
Scenario: Repeated failed confirmation leads to rollback
- WHEN the device repeatedly boots an updated slot that never passes confirmation
- THEN the slot remains uncommitted
- AND the U-Boot / RAUC rollback path can eventually fall back to the previous working slot
Requirement: First boot uses a separate provisioning-aware commit path
Initial first boot SHALL be handled by first-boot.service, not os-verification.service.
Scenario: First boot is gated on valid provisioning
- WHEN the device boots for the first time after flash or reprovisioning
- THEN first-boot.service owns the provisioning import and validation flow
- AND the initial slot is committed only after valid provisioning state exists
watchdog
ADDED Requirements
Requirement: Hardware watchdog target is defined but deferred
The NixOS configuration SHALL keep RK3328 hardware watchdog manager settings disabled for the current release while
Rock64 boot reliability is validated. The deferred target settings are RuntimeWatchdogSec=30s and
RebootWatchdogSec=10min.
Scenario: Runtime watchdog is not enabled in the current release
- WHEN the current release boots
- THEN RuntimeWatchdogSec is not set by the AtomixOS watchdog module
Scenario: Runtime watchdog target applies once validation approves it
- WHEN boot-stability validation approves active watchdog enforcement
- THEN the deferred target is to set RuntimeWatchdogSec=30s
Requirement: Reboot watchdog target is deferred
The NixOS configuration SHALL not set RebootWatchdogSec in the current release. The deferred target is 10min.
Scenario: Reboot watchdog is not enabled in the current release
- WHEN the current release boots
- THEN RebootWatchdogSec is not set by the AtomixOS watchdog module
Requirement: Watchdog timeout is configured appropriately
The deferred target values SHALL remain documented as RuntimeWatchdogSec=30s and RebootWatchdogSec=10min.
Scenario: Watchdog manager settings are left empty in the current release
- WHEN the device boots the current release
- THEN the watchdog module leaves systemd.settings.Manager = { }
Requirement: Watchdog reset interacts with boot-count rollback
When the hardware watchdog triggers a reset, the subsequent reboot SHALL go through U-Boot’s normal boot sequence, which decrements the boot attempt counter. This means repeated watchdog-triggered resets on a bad image lead to automatic rollback.
Scenario: Watchdog reset leads to rollback after repeated failures
- WHEN a new image causes the system to hang on every boot attempt, triggering the watchdog each time
- THEN U-Boot’s boot counter decrements to zero and the device rolls back to the previous working slot
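The interaction between repeated resets and the boot counter can be simulated in a few lines. This is a sketch of the general BOOT_ORDER / BOOT_x_LEFT scheme named in this spec, not a transcription of the project's boot.cmd; select_slot and the dictionary layout are hypothetical.

```python
def select_slot(env):
    """Pick the slot to boot and decrement its remaining-attempts counter.

    env mimics the U-Boot variables BOOT_ORDER and BOOT_<slot>_LEFT.
    Returns the chosen slot name, or None if every counter is exhausted.
    """
    for slot in env["BOOT_ORDER"].split():
        key = f"BOOT_{slot}_LEFT"
        if env[key] > 0:
            env[key] -= 1  # consumed on every attempt, including watchdog resets
            return slot
    return None


# After an update, slot B is tried first with three attempts available.
env = {"BOOT_ORDER": "B A", "BOOT_A_LEFT": 3, "BOOT_B_LEFT": 3}
attempts = [select_slot(env) for _ in range(4)]
# Three hung boots exhaust B; the fourth attempt falls back to A.
```

Because the counter is decremented before the kernel starts, a hang that ends in a watchdog reset costs one attempt in the same way as any other failed boot. A slot that is later marked good would have its counter restored, which this sketch leaves out.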
Source Metadata
schema: spec-driven
created: 2026-04-09
Source
Converted from openspec/changes/rock64-ab-image/ during the OpenSpec-to-feature-spec migration.
Tasks: rock64-ab-image
1. Flake Bootstrap and Project Structure
- 1.1 Create flake.nix with nixpkgs input, aarch64-linux system, and nixosConfigurations.rock64 output stub
- 1.2 Create a shared NixOS module (modules/base.nix) for configuration shared between hardware and QEMU targets
- 1.3 Configure base NixOS system: systemd as init, locale, timezone, hostname, and minimal users
- 1.4 Enable core services: podman (virtualisation.podman), openssh (services.openssh)
- 1.5 Verify flake evaluates with nix flake check (cross-compile or native aarch64); verified in Lima VM (aarch64-linux), all outputs evaluate cleanly
2. Stripped Kernel Configuration
- 2.1 Create a custom kernel configuration for the RK3328 with built-in drivers: eMMC (dw_mmc), ethernet (stmmac), USB host (dwc2/xhci), watchdog (dw_wdt), squashfs, f2fs
- 2.2 Configure USB WiFi drivers (rtlwifi, ath9k_htc, mt76), Bluetooth (btusb), and USB serial (ftdi, cp210x) as modules (=m)
- 2.3 Include the RK3328 Rock64 device tree blob (rk3328-rock64.dtb)
- 2.4 Verify stripped kernel boots on Rock64 hardware and detects eMMC, ethernet, USB, and watchdog; verified via serial console: kernel 6.19.11 boots on Rock64, eMMC detected (mmcblk1 14.5 GiB, HS200 mode), ethernet (rk_gmac-dwmac + RTL8211F PHY), USB host controllers (dwc2, xhci, ehci, ohci), hardware watchdog (dw_wdt /dev/watchdog0, 30s timeout). Required fixes: initrd for MMC_BLOCK=m, partition offset fix (boot-a at 16 MiB), PARTLABEL root=, rootwait, ramdisk_addr_r override to 0x08000000
3. Remote Management Direction and OpenVPN
- 3.1 Keep Podman available in the device image as the application runtime while removing the local Cockpit/Traefik management path from the final design
- 3.2 Enable OpenVPN in the NixOS configuration as a systemd service for VPN recovery access
- 3.3 Shift remote web management toward Nixstasis-hosted services and document the enrollment / short-lived SSH model
4. Squashfs Image Build
- 4.1 Add a squashfs image derivation that packages the NixOS system closure (including kernel modules, Podman, OpenVPN, chrony, dnsmasq) into a read-only squashfs image with 1 MB block size
- 4.2 Expose the squashfs image as packages.aarch64-linux.squashfs in flake outputs
- 4.3 Verify the built squashfs image is under 1 GB; most recently 203 MiB after later image trimming work
- 4.4 Add a CI-friendly size check (script or assertion) that fails the build if squashfs exceeds 1 GB
4b. Flashable Disk Image and Build Tasks
- 4b.1 Create nix/image.nix derivation that assembles a flashable eMMC .img (GPT, U-Boot, boot-a vfat, rootfs-a squashfs) using mtools (no loop devices/mount needed in Nix sandbox)
- 4b.2 Create scripts/build-image.sh template with @variable@ placeholders for Nix substitute
- 4b.3 Expose the image as packages.aarch64-linux.image in flake outputs
- 4b.4 Add mise build tasks: check, build:squashfs, build:rauc-bundle, build:boot-script, and build (retains rooted artifacts and supports optional image copy-out)
- 4b.5 Create the flash/build workflow around .gcroots/images/image.1 and .mise/tasks/flash for safe device flashing from the latest built image
- 4b.6 Verify all flake outputs evaluate cleanly with nix flake check --no-build
- 4b.7 Verify nix build .#image produces a valid disk image; GPT partition table correct, U-Boot at sectors 64/16384, boot-a vfat contains Image (63 MB) + DTB + boot.scr, rootfs-a has valid squashfs (hsqs magic, 334 MB)
5. NIC Naming and Network Interface Configuration
- 5.1 Disable systemd predictable interface names (networking.usePredictableInterfaceNames = false)
- 5.2 Create systemd-networkd .link file matching the RK3328 GMAC platform path (platform-ff540000.ethernet) to name it eth0
- 5.3 Create .link files for USB ethernet (driver match → ethN) and WiFi dongles (type=wlan → wlanN)
- 5.4 Configure eth0 as DHCP client (WAN)
- 5.5 Configure eth1 with static IP 172.20.30.1/24 (LAN)
- 5.6 Verify on hardware: onboard NIC is always eth0 regardless of USB devices plugged in
- 5.7 Verify device identity: /sys/class/net/eth0/address returns the onboard MAC; validated from repeated serial-console ip add output showing the same stable eth0 MAC across boots (92:a2:18:4f:57:42)
6. LAN Gateway Services
- 6.1 Configure dnsmasq as DHCP server on eth1 only, pool 172.20.30.10-172.20.30.254, gateway 172.20.30.1
- 6.2 Configure chrony as NTP client (WAN servers via eth0) and NTP server (LAN clients on 172.20.30.0/24 via eth1)
- 6.3 Explicitly disable IP forwarding (net.ipv4.ip_forward = 0)
- [!] 6.4 Verify DHCP: connect a device to the LAN switch, confirm it gets an IP in the correct range
- [!] 6.5 Verify NTP: query 172.20.30.1 from a LAN device, confirm time response
- 6.6 Verify isolation: confirm a LAN device cannot reach any WAN address
7. Firewall Configuration
- 7.1 Configure nftables with WAN inbound rules: ALLOW tcp/443, ALLOW udp/1194 (OpenVPN), ALLOW established/related, DROP all else
- 7.2 Add conditional SSH rule for WAN: ALLOW tcp/22 only if /data/config/ssh-wan-enabled exists
- 7.3 Configure LAN inbound rules: ALLOW udp/67-68 (DHCP), ALLOW udp/123 (NTP), ALLOW tcp/22 (SSH), ALLOW established/related, DROP all else
- 7.4 Configure VPN (tun0) inbound rules: ALLOW tcp/22, ALLOW established/related, DROP all else
- 7.5 Configure FORWARD chain: DROP all
- 7.6 Create a systemd service or nftables hook that checks for the SSH-on-WAN flag file at boot and on firewall reload
- 7.7 Verify: HTTPS works on WAN, SSH blocked on WAN by default, SSH works on LAN and VPN, no forwarding between interfaces
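The per-zone inbound policy from tasks 7.1 to 7.4 can be modelled as a small lookup, which is useful when reasoning about what the 7.7 verification should observe. This is a model of the rules as listed, not the nftables ruleset itself; inbound_allowed and the zone labels are hypothetical names.

```python
def inbound_allowed(zone, proto, port, ssh_wan_enabled=False):
    """Decide whether a NEW inbound connection is accepted (sketch only).

    zone is "wan", "lan", or "vpn". Established/related traffic and the
    FORWARD chain (always drop) are outside this sketch.
    ssh_wan_enabled stands for the presence of /data/config/ssh-wan-enabled.
    """
    rules = {
        "wan": {("tcp", 443), ("udp", 1194)},
        "lan": {("udp", 67), ("udp", 68), ("udp", 123), ("tcp", 22)},
        "vpn": {("tcp", 22)},
    }
    # SSH on WAN is opened only while the flag file exists.
    if zone == "wan" and ssh_wan_enabled and (proto, port) == ("tcp", 22):
        return True
    # Anything not listed for the zone is dropped.
    return (proto, port) in rules.get(zone, set())
```

For example, inbound_allowed("wan", "tcp", 22) is False by default and becomes True only when ssh_wan_enabled is passed as True.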
8. eMMC Partition Layout and Provisioning
- 8.1 Create the provisioning/image path that produces a flashable eMMC layout with raw U-Boot, boot A, and rootfs A, leaving slot B and /data to initrd systemd-repart on first boot.
- 8.2 Add U-Boot writing step: dd idbloader.img to sector 64 and u-boot.itb to sector 16384 using ubootRock64 from nixpkgs
- 8.3 Create vfat filesystem on boot slot A, copy kernel image and DTB
- 8.4 Write the initial squashfs image to rootfs slot A partition
- 8.5 Configure systemd-repart to create f2fs /data partition on first boot (zero closure cost — binary already in systemd)
- 8.6 U-Boot environment defaults handled by boot.cmd script (lines 17-19: BOOT_ORDER=A B, BOOT_A_LEFT=3, BOOT_B_LEFT=3 when env unset)
- 8.7 Add idempotency check: detect if eMMC is already provisioned and prompt for confirmation before overwriting
- 8.8 Test provisioning script: device boots from eMMC into slot A and reaches multi-user.target
9. U-Boot Configuration and Boot-Count Logic
- 9.1 Verify ubootRock64 from nixpkgs produces idbloader.img and u-boot.itb suitable for RK3328 boot ROM (confirmed: idbloader.img 137 KiB, u-boot.itb 940 KiB, plus u-boot-rockchip.bin combined blob)
- 9.2 Write U-Boot boot script that reads BOOT_ORDER and BOOT_X_LEFT variables, decrements the counter, and selects the appropriate boot slot and rootfs partition
- 9.3 Configure redundant U-Boot environment storage. CHANGED: Rock64 U-Boot (rk3328_defconfig) does not enable CONFIG_ENV_REDUNDANT. Single 32 KB env at 0x3F8000. FAT flag file approach mitigates power-loss risk.
- 9.4 Test boot-count logic: simulate 3 consecutive failed boots on slot B and verify U-Boot falls back to slot A. BLOCKED: requires flashing and testing the latest image with FAT flag file support
10. RAUC System Configuration
- 10.1 Create RAUC system.conf defining two slot pairs (boot A + rootfs A, boot B + rootfs B) with eMMC partition device paths and bootloader=uboot
- 10.2 Enable the NixOS RAUC module: services.rauc.enable = true, set compatible = "rock64", configure CA certificate path
- 10.3 Generate a development CA keypair and signing key for RAUC bundle signing (store in certs/ with .gitignore for private keys)
- 10.4 Verify rauc status runs on device and shows all four slots (boot A, boot B, rootfs A, rootfs B) with correct partition paths; validated on hardware: boot.0=/dev/mmcblk1p1, rootfs.0=/dev/mmcblk1p2, boot.1=/dev/mmcblk1p3, rootfs.1=/dev/mmcblk1p4
11. RAUC Multi-Slot Bundle Building
- 11.1 Create a RAUC bundle derivation in the flake that wraps both the boot image (kernel + DTB) and the squashfs rootfs image into a single .raucb file, signed with the project CA key
- 11.2 Expose the bundle as packages.aarch64-linux.rauc-bundle in flake outputs
- 11.3 Verify the bundle with rauc info: signature valid (dev CA), manifest lists boot.vfat (134 MB) and rootfs.squashfs (350 MB), compatible=rock64, version=0.1.0
- 11.4 Test installing the bundle on device: rauc install writes both boot and rootfs to inactive slot pair, updates U-Boot env, device reboots into new slot
12. Watchdog Configuration
- 12.1 Add NixOS configuration for systemd watchdog: systemd.watchdog.runtimeTime = "30s" and systemd.watchdog.rebootTime = "10min"
- 12.2 Verify the RK3328 watchdog kernel driver loads on boot (/dev/watchdog exists); validated via rauc-watchdog E2E test: i6300esb driver loads, test -c /dev/watchdog passes, lsmod | grep i6300esb passes. Hardware driver (dw_wdt) to be confirmed on Rock64 hardware.
- 12.3 Verify systemd is kicking the watchdog: systemctl show -p RuntimeWatchdogUSec reports 30s; validated via rauc-watchdog E2E test: systemctl show -p RuntimeWatchdogUSec confirms watchdog active, kernel log shows Watchdog running with a hardware timeout of 10s (test uses 10s for speed; production uses 30s)
- 12.4 Test watchdog: trigger a simulated hang and verify the device reboots within the timeout window; validated via rauc-watchdog E2E test: gateway.crash() simulates watchdog-triggered reboot twice, boot-count decrements from 2→1→0, rollback to slot A occurs, slot B marked bad
- 12.5 Re-enable hardware watchdog on Rock64; currently disabled in modules/watchdog.nix pending stable boot confirmation on hardware. Restore RuntimeWatchdogSec = "30s" and RebootWatchdogSec = "10min".
13. Update Confirmation Service (os-verification)
- 13.1 Create os-verification.service systemd oneshot unit that runs after multi-user.target
- 13.2 Implement slot status check: query rauc status to determine if current slot is pending; if already marked good, exit immediately
- 13.3 Implement system health checks: verify eth0 has WAN address, eth1 is 172.20.30.1, dnsmasq running, chronyd running
- 13.4 Simplify confirmation to local gateway health checks only so slot confirmation does not depend on app containers or remote management services
- 13.5 Implement sustained health check: check every 5 seconds for 60 seconds and fail on local service instability
- 13.6 On any failure: exit non-zero, slot stays uncommitted
- 13.7 On sustained success: call rauc status mark-good to commit the slot
- 13.8 Add the confirmation service to the NixOS configuration
14. Update Polling Service (os-upgrade, hawkBit-Ready)
- 14.1 Create os-upgrade.timer and os-upgrade.service systemd units for periodic update polling
- 14.2 Implement polling logic: query update server for latest bundle version, compare against currently installed version
- 14.3 On new version available: download the .raucb bundle to a temp location on /data, invoke rauc install
- 14.4 Handle download failures gracefully: log error, clean up partial downloads, wait for next timer interval
- 14.5 Add rauc-hawkbit-updater as a disabled service in the NixOS configuration
- 14.6 Create a NixOS configuration option to toggle between simple polling and hawkBit client (mutually exclusive)
- 14.7 Verify default: os-upgrade.timer active, rauc-hawkbit-updater inactive; verified in systemd-nspawn: os-upgrade.timer active (waiting), no hawkbit service present
15. QEMU Testing Target
- 15.1 Create nixosConfigurations.rock64-qemu that imports the shared base module but targets aarch64-virt
- 15.2 Configure QEMU-specific overrides: virtual block devices for slots, software watchdog, virtual network interfaces
- 15.3 Expose a VM runner script via flake outputs (e.g., nix build .#rock64-qemu-vm && ./result/bin/run-vm)
- 15.4 Verify QEMU VM boots with the shared base system, firewall, network configuration, RAUC plumbing, and Podman available for application workloads; validated via systemd-nspawn: multi-user.target reached, nftables loaded, chronyd running, networkd running, podman available. dnsmasq/sshd expected failures in container (no eth1, host port 22 conflict)
- 15.5 Verify RAUC slot logic works in QEMU with virtual block devices; validated via nix build .#checks.aarch64-linux.rauc-slots: VM boots with 4 virtio disks, RAUC service starts (D-Bus), rauc status reports all 4 slots (boot.0/1, rootfs.0/1) with correct device paths (/dev/vdb-vde)
16. End-to-End Integration Testing
- 16.1 Flashable image boots on Rock64 and reaches multi-user.target after first-boot repartitioning creates the inactive slot and /data
- 16.2 Update test: build a v2 bundle, serve it from a test HTTP server, verify polling service downloads and installs it, device reboots into new slot with new kernel and rootfs; validated via nix build .#checks.aarch64-linux.rauc-update: builds signed test bundle (dev certs), copies into QEMU VM, rauc install writes boot.vfat to /dev/vdc and rootfs.img to /dev/vde, primary switches from A to B. Prerequisite: added custom bootloader backend (bootloader=custom in hardware-qemu.nix) that simulates U-Boot env via files in /var/lib/rauc
- 16.3 Confirmation test: verify os-verification.service checks system health and marks the slot good after successful update; validated via nix build .#checks.aarch64-linux.rauc-confirm: boots QEMU VM with RAUC + dnsmasq + chronyd + dummy eth1 (172.20.30.1), creates first-boot sentinel, runs os-verification service which checks all services/IPs, waits 60s sustained check, then calls rauc status mark-good to commit slot A
- 16.4 Hardware confirmation test: install an update on Rock64 and verify the local-only confirmation path commits the slot on real hardware
- 16.5 Rollback test: deploy a deliberately broken image, verify boot-count exhaustion triggers automatic rollback to previous slot pair; validated via nix build .#checks.aarch64-linux.rauc-rollback: installs bundle to slot B, marks B bad, re-activates A as primary, verifies A=good/primary and B=bad
- 16.6 Watchdog rollback test: deploy an image that causes a hang, verify watchdog fires and eventually triggers rollback; validated via nix build .#checks.aarch64-linux.rauc-watchdog: boots VM with i6300esb watchdog + RAUC custom backend, verifies watchdog device present and systemd kicking at 10s, installs bundle to slot B with boot-count=2, simulates two watchdog reboots via crash()/start(), verifies boot-count decrement (2 -> 1 -> 0), rollback to A, and slot B marked bad
- 16.7 Power-loss simulation: interrupt an update mid-write (pull power during rauc install), verify device boots from the previous good slot pair; validated via nix build .#checks.aarch64-linux.rauc-power-loss: installs 64 MB bundle, crashes VM mid-write via machine.crash(), reboots and verifies slot A still intact and RAUC functional
- 16.8 Network isolation test: verify LAN devices get DHCP and NTP but cannot reach WAN addresses; validated via nix build .#checks.aarch64-linux.network-isolation: 2-node VLAN test (gateway + lan client, redesigned from 3-node to avoid OOM under TCG). Gateway runs dnsmasq (bind-dynamic on eth2) + chrony, LAN client gets DHCP lease in 172.20.30.0/24, gateway NTP reachable, WAN isolation verified via ip_forward=0 + unreachable WAN host ping
- 16.9 Firewall test: verify WAN allows only HTTPS and VPN, LAN allows SSH/DHCP/NTP, no forwarding between interfaces; validated via nix build .#checks.aarch64-linux.firewall: 2-node VLAN test (gateway + probe with vlans=[1,2], redesigned from 3-node to avoid OOM under TCG). Uses inline nftables rules (eth1=WAN, eth2=LAN) with eth0 backdoor passthrough. Verifies port-level allow/deny from both WAN and LAN sides using ncat listeners
- 16.10 SSH-on-WAN toggle test: create/remove flag file, verify SSH access on WAN is enabled/disabled accordingly; validated via nix build .#checks.aarch64-linux.ssh-wan-toggle: creates /data/config/ssh-wan-enabled, reloads ssh-wan-reload service, verifies SSH reachable from WAN; removes flag, reloads, verifies SSH blocked again
17. Remote Access Architecture
- 17.1 Evaluate the initial local Cockpit/Traefik management path and prove out the Rock64 bring-up flow
- 17.2 Remove the local Cockpit/Traefik stack from the device image once the design shifted toward Nixstasis-hosted remote access
- 17.3 Document the Nixstasis-oriented remote access model: approved MAC-based enrollment, registration key persisted on /data, reverse tunnel, and short-lived SSH credentials
- 17.4 Keep Podman on-device for application workloads even though remote management is no longer hosted locally
17b. First-Boot Initialization
- 17b.1 Create modules/first-boot.nix: systemd oneshot service with ConditionPathExists=!/data/.completed_first_boot that runs on first boot only
- 17b.2 Create scripts/first-boot.sh to confirm the current slot, seed development-only auth helpers when enabled, and write /data/.completed_first_boot
- 17b.3 Add ConditionPathExists=/data/.completed_first_boot to os-verification.service so it skips on first boot (before sentinel exists)
- 17b.4 Remove first-boot dependence on local management containers so initial boot completes without image pulls
- 17b.5 Verify in test environments that first-boot.service creates the sentinel and os-verification.service remains skipped until subsequent boots
18. Authentication Provisioning
- 18.1 Persist imported operator SSH keys under /data/config/ssh-authorized-keys/<user> through the provisioning importer
- 18.2 Enforce SSH-key-only operator access with root and config-managed operator users password-locked by default
- 18.3 Validate imported provisioning state before first boot commits the slot
- 18.4 Verify on hardware that admin SSH key auth works, password auth remains rejected, and _RUT_OH_ stays a physical serial recovery path rather than a normal operator login mode
- 18.5 Verify no credentials exist in the squashfs image itself (EN18031 compliance); verified via source audit: hashedPasswordFile reads from /data at runtime (modules/base.nix:130), SSH authorized keys loaded from /data (modules/base.nix:161), no hashedPassword/password/initialPassword attributes anywhere, TLS certs and OpenVPN configs all reference /data, squashfs derivation (nix/squashfs.nix) packs only the NixOS system closure via closureInfo. The only crypto material in the image is the RAUC CA public certificate (required for bundle verification). RAUC signing private keys are build-time-only derivations, never in the system closure.
Feature: first-boot-local-provisioning
Overview
Why
The current image and first-boot flow still reflect a development-oriented model: first boot can commit the slot before
production access credentials and application stack configuration exist, and provisioning is split across ad hoc files
manually copied into /data/config/ after boot. This makes fresh installs, reprovisioning, and future Nixstasis
integration less coherent than they need to be.
This change defines a single local first-boot provisioning contract based on config.toml, so a freshly flashed or
reprovisioned device can acquire its managed operator users and Quadlet-managed application stack from a well-defined
source order without baking per-device secrets into the base image.
What Changes
- Add a first-boot local provisioning flow that discovers a single config.toml seed from /boot, then USB mass storage, then a local bootstrap web console
- Define a bounded config.toml schema for provisioning managed users, explicit activation requirements, and structured Quadlet unit definitions
- Persist the imported provisioning state under /data/config/, including the source config.toml and rendered Quadlet units
- Distinguish initial fresh-flash provisioning from reprovisioning using boot-b absent as the discriminator so /boot/config.toml is a day-0 seed source only
- Redefine the first-boot path so production slot confirmation happens after provisioning import and validation rather than unconditionally after Linux boots
- Introduce a constrained local bootstrap web console that can upload or paste an existing config.toml when no seed file is found
Capabilities
New Capabilities
- first-boot-local-provisioning: Source discovery, import, validation, reprovisioning behavior, and bootstrap UI for local provisioning from config.toml
Modified Capabilities
- partition-layout: Clarify that provisioned operator configuration persists under /data/config/, and that reprovisioning is driven by wiping /data while preserving the slot layout
- update-confirmation: Change the first-boot contract so production slot confirmation depends on successful local provisioning and explicit health requirements rather than unconditional first-boot commit
Impact
- Affected code: first-boot.service, scripts/first-boot.sh, provisioning tasks/scripts, bootstrap web management path, and the runtime consumers of /data/config/
- Affected storage layout: /data/config/ becomes the canonical home for imported config.toml and rendered Quadlet files
- Affected operational workflows: fresh flash, reprovisioning, bench setup, and future Nixstasis alignment
- Security: preserves the current no-secrets-in-image stance while making first-boot bootstrap behavior an explicit part of the device trust model
Design
Context
The current first-boot path is intentionally permissive: it commits the provisioned slot before production credentials,
application stack configuration, and health requirements necessarily exist. That was a pragmatic development step, but
it leaves production provisioning split across ad hoc files copied into /data/config/ after boot and does not align
cleanly with the later Nixstasis direction.
The current first-boot-local-provisioning work converged on a bounded local provisioning contract:
- a single config.toml artifact
- operator SSH keys as imported access material
- structured Quadlet definitions as the application runtime contract
- explicit health requirements in the same document
- /data/config/ as the durable home of imported operator configuration
The platform already has a natural day-0 discriminator, but it is only visible in the initrd: the flashed image
contains only slot A, so boot-b absent before initrd systemd-repart runs means a fresh flash. By the time the
switched-root first-boot.service executes, initrd repartitioning has already created boot-b, rootfs-b, and
/data, so the fresh-flash vs reprovision distinction must be detected in initrd and persisted for later consumption.
Goals / Non-Goals
Goals:
- Define a single local provisioning contract for fresh flash and reprovisioning
- Keep the base image generic and free of per-device secrets
- Persist all imported provisioning state under /data/config/
- Use structured TOML to describe Quadlet units without raw multiline blobs or arbitrary output filenames
- Make first-boot production slot confirmation depend on provisioning import and validation
- Keep the bootstrap UI constrained to provisioning upload/generation rather than general management
- Preserve a future path where Nixstasis delivers the same logical payload remotely
Non-Goals:
- Implement remote Nixstasis provisioning in this change
- Introduce a generic provisioning engine such as cloud-init
- Make compose the device runtime contract
- Turn the bootstrap web console into a long-lived management surface
- Solve every future application stack abstraction beyond the bounded config.toml contract
Decisions
1. Use a single config.toml provisioning contract
Decision: Local provisioning is driven by a single config.toml contract. The importer and bootstrap path may also
accept a supported archive bundle that carries that same config.toml plus optional files/ payloads.
Rationale: The device needs a narrow contract, not a generic provisioning engine. Keeping one logical contract makes fresh flash, USB reprovisioning, bootstrap upload, and eventual Nixstasis delivery easier to reason about, while the optional archive wrapper gives the importer a safe way to carry auxiliary files.
Alternatives considered:
- Cloud-init / NoCloud: rejected as too broad and too open-ended for an appliance-oriented provisioning boundary
- Multiple independent files under /data/config/: rejected because it encourages drift and weakens validation
- Compose file as the primary artifact: rejected because it does not cover credentials or health expectations and is not the preferred runtime primitive
2. Treat /boot as a day-0 seed source only
Decision: Fresh-flash detection happens in initrd before systemd-repart creates slot B. Initrd persists a marker
that the switched-root provisioning path consumes later. First boot searches provisioning sources in this order on a
fresh flash: /boot/config.toml, then USB, then bootstrap web console. Reprovisioning skips /boot entirely.
Rationale: /boot is convenient on a freshly flashed device because only slot A exists and the operator can place a
seed file alongside the flashed image. But replaying /boot/config.toml on every later reprovision would make stale
seed material unexpectedly authoritative. Using boot-b absent as the discriminator still gives the desired behavior,
but the check must happen in initrd where that condition is actually true.
Alternatives considered:
- Always search /boot first: rejected because stale seeds could silently replay after /data is wiped
- USB first always: rejected because it makes fresh flash more cumbersome than necessary
- Sentinel file to detect fresh flash: rejected because boot-b absent is simpler and based on the actual disk layout rather than mutable state
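The source ordering in this decision can be sketched as follows. The function names, the source labels, and the probe callback are hypothetical; the fresh_flash argument stands for the marker that initrd persists before systemd-repart creates slot B.

```python
def provisioning_sources(fresh_flash):
    """Return the ordered provisioning sources for this boot.

    /boot/config.toml is a day-0 seed only, so it is searched on a fresh
    flash and skipped entirely on reprovisioning.
    """
    sources = ["usb", "bootstrap-web-console"]
    if fresh_flash:
        sources.insert(0, "/boot/config.toml")
    return sources


def discover(fresh_flash, probe):
    """Return (source, config) for the first source that yields a config.

    probe(source) is a caller-supplied function returning the config text,
    or None when that source has nothing to offer.
    """
    for source in provisioning_sources(fresh_flash):
        config = probe(source)
        if config is not None:
            return source, config
    return None
```

On a reprovisioned device a stale /boot/config.toml is never probed, which is the replay risk the rationale above is guarding against.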
3. Represent Quadlet as structured TOML, not raw embedded blobs
Decision: config.toml uses the canonical shape containers.container.<name>.<section>, with [containers.container.<name>]
declaring privileged = true|false and the OS owning the rootful vs rootless runtime details.
Rationale: This preserves the Quadlet runtime model while keeping the provisioning artifact structured and validatable. The device can deterministically derive the rendered filename, active path, and runtime mode from the container name plus its privilege setting. Arrays in TOML cleanly map to repeated Quadlet directives.
Alternatives considered:
- content = """...""" raw Quadlet blobs: rejected because multiline embedded INI text is harder to validate and less appliance-friendly
- Generic file-write envelope: rejected because it reintroduces arbitrary paths and permissions into the contract
- Compose as canonical config: rejected because the device runtime should stay systemd + podman + Quadlet oriented
4. Keep health requirements explicit in the provisioning artifact
Decision: Health expectations are explicitly declared in config.toml rather than inferred from every rendered
Quadlet unit.
Rationale: Not every declared unit is necessarily health-critical, and implicit inference creates ambiguity for future helper units, one-shot setup units, or optional services. Explicit health requirements make the first-boot and update confirmation paths testable and predictable.
Alternatives considered:
- Infer health targets from all declared units: rejected because it couples runtime shape too tightly to health policy and makes optional/helper units awkward
5. Use /data/config as the canonical persisted provisioning boundary
Decision: All imported provisioning-derived operator configuration lives under /data/config/, including the source
config.toml, admin access material, and rendered Quadlet files.
Rationale: This keeps the purpose of /data legible: /data/config for operator/provisioner state, /data/logs
for diagnostics, /data/containers for runtime container state, and /data/rauc for lifecycle/update state.
Alternatives considered:
- Scatter files across /data: rejected because it weakens the reprovision boundary and makes imported state harder to reason about
- Keep Quadlet units outside /data/config/: rejected because they are provisioned operator intent, not transient runtime state
5a. Sync rendered Quadlet units into the active Quadlet paths at boot
Decision: Provisioning renders canonical Quadlet files under /data/config/quadlet/, then a dedicated boot-time
sync path copies rootful units into /etc/containers/systemd/ and rootless app units into the managed app user’s
Quadlet path, reloads systemd, and starts rendered services.
Rationale: /data/config/quadlet/ remains the durable operator-intent boundary, while the standard rootful and
rootless Quadlet discovery paths are still what the running system consumes. This keeps imported configuration
persistent without treating the active runtime paths as the source of truth.
Alternatives considered:
- Write only to /etc/containers/systemd/: rejected because it hides provisioned state outside the canonical /data/config/ persistence boundary
- Invent a custom generator input path: rejected because the standard system path already exists and keeps the change smaller
6. Make production first-boot commit provisioning-aware
Decision: Production first boot should import and validate provisioning state before the slot is marked good.
Rationale: A device that merely boots Linux but has no valid admin credentials or application stack is not actually ready for production use. This change redefines the production first-boot path from “Linux came up” to “minimum provisioned state exists and is coherent.”
Imported SSH keys and rendered Quadlet state live under /data/config/, and both can be consumed in the same boot.
That keeps the first-boot flow smaller and avoids introducing an extra reboot boundary before the device becomes
debuggable.
Alternatives considered:
- Keep unconditional first-boot commit forever: rejected because it bakes a development convenience into the production lifecycle contract
- Require remote phone-home before commit: rejected because local provisioning and confirmation should not depend on external availability
7. Narrow the bootstrap endpoint after provisioning
Decision: Before initial provisioning completes, the bootstrap web console listens on WAN and LAN interfaces. After
the first valid config is applied, LAN settings rebind the systemd socket to the configured LAN gateway address and port
8080; subsequent recovery and reprovisioning use authenticated API calls on the LAN endpoint only.
Rationale: Fresh devices may not have a known LAN address yet, so day-0 provisioning must be reachable on either interface. Once the operator-provided LAN config is active, narrowing the socket to the LAN gateway address removes WAN exposure from the long-lived recovery surface.
Alternatives considered:
- Always bind only on LAN: rejected because it makes first provisioning brittle before LAN settings are known
- Require operators to open the firewall manually: rejected because it makes the fallback path brittle during unprovisioned boot
7a. Applied configs should be shown and downloadable
Decision: When the bootstrap web console applies an uploaded or pasted config.toml, the UI should show the final
applied config.toml back to the operator and offer a direct download action for that exact artifact.
Rationale: The applied file is the actual operator-facing provisioning contract. Showing the final TOML immediately after apply makes the bootstrap flow easier to audit, makes the state reusable for later devices or reprovisioning, and avoids treating the form submission as a write-only UX.
Alternatives considered:
- Apply silently and show only success/failure: rejected because it hides the final contract the device actually accepted
- Offer download only as an optional later enhancement: rejected because the applied artifact is useful immediately in the same provisioning session
Risks / Trade-offs
- [Risk] Bootstrap UI becomes a second management plane -> Keep it narrowly scoped to upload/paste/apply during unprovisioned state only
- [Risk] Structured TOML diverges from raw Quadlet capabilities -> Start with a bounded supported subset and render deterministically; expand only when real needs appear
- [Risk] Reprovisioning may surprise operators by ignoring /boot/config.toml -> Document the fresh-flash vs reprovision distinction clearly and prefer USB/web for reprovision workflows
- [Risk] Existing first-boot assumptions in docs/tests drift from the new contract -> Update specs, docs, and tests together as part of the implementation change
- [Trade-off] Operators cannot paste raw Quadlet files verbatim -> Accept the translation cost in exchange for a structured, validatable provisioning contract
Migration Plan
- Introduce initrd fresh-flash detection and persist the result for the switched-root provisioning path.
- Introduce the new config.toml provisioning schema and source-order logic behind the first-boot provisioning path.
- Persist imported state under /data/config/, render the Quadlet files there, and sync them into the active rootful and rootless Quadlet paths.
- Update first-boot validation and confirmation behavior so production slot commit depends on successful provisioning import.
- Add bootstrap UI support as the final fallback when no local seed file exists, then rebind to the LAN bootstrap address after provisioning.
- Update docs and provisioning workflows to describe initrd fresh-flash detection, /boot initial seeding, USB reprovisioning, and /data wipe as the reprovision reset boundary.
Rollback remains straightforward during development: remove the new provisioning-aware commit gate and fall back to the
current unconditional first-boot path. Operationally, reprovisioning remains wipe /data plus reboot, after which the
device searches USB seed sources first and then falls back to the local bootstrap console without replaying
/boot/config.toml.
Open Questions
- What exact subset of Quadlet sections and directives should the first implementation support?
Requirements
first-boot-local-provisioning
ADDED Requirements
Requirement: First boot discovers a local provisioning seed in priority order
On an unprovisioned device, the provisioning flow SHALL search for a config.toml seed in a fixed priority order. The
system SHALL detect the fresh-flash case in initrd before systemd-repart creates slot B. On a fresh flash where
boot-b is absent at that stage, the system SHALL search /boot/config.toml first, then an attached USB mass storage
device containing config.toml, and finally fall back to a local bootstrap web console if no seed is found.
Scenario: Fresh flash uses boot partition seed first
- WHEN the device boots for the first time after a fresh flash and initrd detects that boot-b is absent before repartitioning
- THEN the provisioning flow checks /boot/config.toml before searching removable USB storage or starting the bootstrap web console
Scenario: USB seed is used when no boot seed exists
- WHEN the device is unprovisioned, boot-b is absent, and /boot/config.toml is missing
- THEN the provisioning flow searches attached USB mass storage for config.toml before starting the bootstrap web console
Scenario: Bootstrap console starts when no seed is found
- WHEN the device is unprovisioned and no config.toml is found on either /boot or attached USB mass storage
- THEN the device starts a local bootstrap web console for interactive provisioning
Requirement: Reprovisioning skips boot partition seed replay
If the device is reprovisioned by wiping /data after the slot layout already exists, the provisioning flow SHALL
distinguish that state from a fresh flash by using the initrd-detected fresh-flash marker. When the marker indicates
that slot B already existed before repartitioning, /boot/config.toml SHALL NOT be used as a provisioning seed, and
reprovisioning SHALL use USB seed discovery followed by the bootstrap web console.
Scenario: Reprovisioned device ignores boot partition seed
- WHEN the device boots with /data empty after a reprovision reset and the initrd fresh-flash marker indicates that slot B already existed
- THEN the provisioning flow skips /boot/config.toml and searches USB mass storage before starting the bootstrap web console
Scenario: Wiping /data returns the device to provisioning mode
- WHEN /data is wiped or reformatted on a device whose slot layout already includes boot-b
- THEN the next boot re-enters the local provisioning flow rather than treating the device as already provisioned
Requirement: config.toml defines bounded provisioning data
The local provisioning artifact SHALL be a single config.toml file containing only the bounded appliance provisioning
contract: managed users, provisioned LAN and WAN firewall inbound policy, optional LAN/NTP settings, optional OS
upgrade settings, explicit activation requirements, and structured Quadlet definitions. The accepted structure SHALL be
defined by a machine-readable schema that the import path validates before semantic normalization.
Scenario: Minimum valid config.toml includes admin access and stack definition
- WHEN a config.toml file is accepted for import
- THEN it includes at least one admin user SSH key, optional [network.firewall.inbound] tables, at least one Quadlet-defined application or service unit, and explicit activation requirements
Requirement: Provisioning renders firewall and LAN runtime state
The device SHALL render accepted provisioning input into JSON runtime state under /data/config/. Firewall inbound state
SHALL be written to /data/config/firewall-inbound.json as optional wan and lan objects, each containing optional
tcp and udp arrays of integer ports in 1..65535. If the lan object is omitted or contains no ports, LAN remains
open by default. If the lan object contains any ports, those ports SHALL be appended to the platform-required LAN
ports. LAN state SHALL be written to /data/config/lan-settings.json with the validated gateway CIDR, gateway IP,
subnet CIDR, netmask, DHCP range, DNS domain, hostname pattern, and gateway aliases. Optional OS upgrade state SHALL be
written to /data/config/os-upgrade.json when [os_upgrade] is present.
Scenario: Firewall inbound config is bounded
- WHEN provisioning imports [network.firewall.inbound.wan] or [network.firewall.inbound.lan] with TCP or UDP ports
- THEN the persisted firewall JSON contains only normalized integer port arrays under those scopes
- AND provisioned-firewall-inbound.service applies those ports to the matching interface rules for WAN and LAN
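As a rough illustration of the bounded firewall contract, the sketch below validates and normalizes the wan and lan scopes. The function name normalize_inbound is hypothetical, and treating "normalized" as sorted and de-duplicated is an assumption; the requirement only fixes the 1..65535 integer range and the wan/lan and tcp/udp structure.

```python
def normalize_inbound(inbound: dict) -> dict:
    """Validate and normalize a parsed [network.firewall.inbound] table.

    Assumption: normalization means sorted, de-duplicated integer ports.
    Raises ValueError for any port outside 1..65535 or of the wrong type.
    """
    normalized: dict = {}
    for scope in ("wan", "lan"):
        if scope not in inbound:
            continue  # both scopes are optional
        normalized[scope] = {}
        for proto in ("tcp", "udp"):
            if proto not in inbound[scope]:
                continue  # both protocol arrays are optional
            ports = inbound[scope][proto]
            for port in ports:
                # bool is a subclass of int in Python, so reject it explicitly.
                if isinstance(port, bool) or not isinstance(port, int):
                    raise ValueError(f"{scope}.{proto}: port must be an integer: {port!r}")
                if not 1 <= port <= 65535:
                    raise ValueError(f"{scope}.{proto}: port out of range: {port}")
            normalized[scope][proto] = sorted(set(ports))
    return normalized
```

The sketch deliberately leaves out the step that appends LAN ports to the platform-required LAN ports, because that port set is not listed in this section.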
Scenario: LAN range excludes gateway
- WHEN provisioning imports [network.dnsmasq] with a gateway CIDR and DHCP range
- THEN the DHCP range is rejected unless it is inside the gateway /24, ordered, and excludes the gateway IP
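The three range checks in this scenario can be expressed with Python's standard ipaddress module. This is a sketch under stated assumptions: validate_dhcp_range is a hypothetical name, the /24 is derived from the gateway IP, and "ordered" is read as start not exceeding end.

```python
import ipaddress


def validate_dhcp_range(gateway_cidr: str, start: str, end: str) -> None:
    """Raise ValueError unless the DHCP range passes all three checks."""
    gateway = ipaddress.ip_interface(gateway_cidr).ip
    # The scenario refers to "the gateway /24", so derive that network
    # from the gateway IP rather than trusting the supplied prefix length.
    subnet = ipaddress.ip_network(f"{gateway}/24", strict=False)
    low = ipaddress.ip_address(start)
    high = ipaddress.ip_address(end)

    if low not in subnet or high not in subnet:
        raise ValueError("DHCP range must be inside the gateway /24")
    if low > high:
        raise ValueError("DHCP range must be ordered (start before end)")
    if low <= gateway <= high:
        raise ValueError("DHCP range must exclude the gateway IP")
```

For example, a gateway of 192.168.50.1/24 with a range of 192.168.50.100 to 192.168.50.200 passes, while a range starting at 192.168.50.1 is rejected because it would hand out the gateway address.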
Requirement: config.toml expresses containers as structured TOML
The config.toml format SHALL represent Quadlet units as structured TOML tables rather than raw embedded multiline
Quadlet blobs. The canonical identity shape SHALL be containers.container.<name>.<section>, with a required
[containers.container.<name>] table that declares privileged = true|false. The device SHALL derive the rendered filename,
active path, and runtime mode from the container name plus that privilege flag.
Scenario: Structured Quadlet tables map to rendered unit files
- WHEN the provisioning flow reads [containers.container.traefik.Container]
- THEN it treats that table as the Container section of the rendered traefik.container Quadlet unit under the canonical /data/config/quadlet/ path
Scenario: TOML arrays render to repeated Quadlet directives
- WHEN a structured Quadlet table contains an array value such as Network = ["frontend", "backend"]
- THEN the rendered Quadlet unit contains repeated directives for that key rather than a single joined value
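The array-to-repeated-directive rule can be illustrated with a small renderer. This is only a sketch: render_quadlet is a hypothetical name, the input dict stands in for what a TOML parser would return for one [containers.container.<name>] table, and the image reference is invented for the example.

```python
def _format_value(value: object) -> str:
    """Render a TOML scalar as Quadlet directive text."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def render_quadlet(container: dict) -> str:
    """Render the section sub-tables of one container as unit file text."""
    lines: list[str] = []
    for section, directives in container.items():
        if not isinstance(directives, dict):
            continue  # e.g. privileged: contract metadata, not a Quadlet section
        lines.append(f"[{section}]")
        for key, value in directives.items():
            # A TOML array becomes one directive line per element.
            items = value if isinstance(value, list) else [value]
            for item in items:
                lines.append(f"{key}={_format_value(item)}")
        lines.append("")
    return "\n".join(lines)


# Parsed form of a hypothetical [containers.container.traefik] table.
traefik = {
    "privileged": False,
    "Container": {
        "Image": "registry.example/traefik:latest",  # placeholder image
        "Network": ["frontend", "backend"],
    },
}
print(render_quadlet(traefik))
```

The rendered text contains Network=frontend and Network=backend on separate lines, matching the scenario above rather than emitting a single joined value.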
Requirement: Imported provisioning state persists under /data/config
All persisted provisioning-derived configuration SHALL live under /data/config/. This SHALL include the imported
config.toml, the admin SSH authorized key material, and the rendered Quadlet unit files.
Scenario: Imported provisioning state is stored under /data/config
- WHEN the provisioning flow successfully imports a config.toml seed
- THEN the resulting durable operator configuration is written under /data/config/
Requirement: Newly imported provisioning is usable without an extra reboot
When a new config.toml is imported on an unprovisioned boot, the device SHALL continue first boot without requiring an
extra reboot before relying on the imported SSH authorized keys or other persisted provisioning-derived runtime state.
Scenario: Imported config is usable in the same boot
- WHEN the device imports a new config.toml from /boot, USB, or the bootstrap web console
- THEN it applies the imported SSH keys and runtime configuration during that same boot
Requirement: Rendered Quadlet configuration is activated through the standard system path
Rendered Quadlet files SHALL remain canonically stored under /data/config/quadlet/, and the boot process SHALL sync
them into the active Quadlet path for their runtime mode before starting provisioned services. Rootful units SHALL sync
under /etc/containers/systemd/, while rootless application units SHALL sync under the managed app user’s Quadlet path.
Scenario: Imported Quadlet files are synced into the active system path
- WHEN provisioning has rendered Quadlet files under /data/config/quadlet/
- THEN the system syncs rootful and rootless units into their respective active Quadlet paths, reloads systemd, and starts the rendered services from those active paths
Requirement: First boot blocks on required runtime apply steps
The first-boot completion gate SHALL require a discovered config.toml to import and validate successfully, then apply
the rendered runtime state before committing the RAUC slot. lan-gateway-apply.service and
provisioned-firewall-inbound.service failures SHALL prevent the completion sentinel and RAUC mark-good. Quadlet sync
failure SHALL be fatal when /data/config/health-required.json names required provisioned units.
Scenario: Runtime apply failure prevents slot commit
- WHEN a discovered config.toml imports and validates successfully
- AND LAN gateway apply or provisioned firewall apply fails
- THEN first boot does not write the completion sentinel
- AND the RAUC slot is not marked good
Scenario: Required Quadlet failure prevents slot commit
- WHEN a discovered config.toml imports and validates successfully
- AND Quadlet sync fails while health requirements name provisioned units
- THEN first boot does not write the completion sentinel
- AND the RAUC slot is not marked good
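The completion gate described by this requirement reduces to a small decision function. The sketch below is illustrative only: FirstBootResult and may_commit_slot are hypothetical names, and each field stands in for the outcome of the corresponding step (import and validation, lan-gateway-apply.service, provisioned-firewall-inbound.service, Quadlet sync, and the units named in /data/config/health-required.json).

```python
from dataclasses import dataclass, field


@dataclass
class FirstBootResult:
    """Hypothetical summary of the first-boot apply steps."""
    config_valid: bool
    lan_gateway_applied: bool
    firewall_applied: bool
    quadlet_synced: bool
    required_units: list[str] = field(default_factory=list)


def may_commit_slot(result: FirstBootResult) -> bool:
    """Return True only when the completion sentinel and RAUC mark-good are allowed."""
    if not result.config_valid:
        return False
    # LAN gateway and firewall apply failures always block the commit.
    if not (result.lan_gateway_applied and result.firewall_applied):
        return False
    # Quadlet sync failure is fatal only when required units are named.
    if not result.quadlet_synced and result.required_units:
        return False
    return True
```

Note the asymmetry: a failed Quadlet sync with an empty required-unit list still allows the commit, whereas a failed LAN gateway or firewall apply never does.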
Requirement: Quadlet runtime constraints are explicit
Provisioned containers SHALL be rendered into canonical Quadlet files before activation. Rootful containers require
privileged = true and SHALL be forced to Network=host. Rootless containers SHALL run as the managed app user, use
Network=pasta, and have non-loopback published ports rewritten to 127.0.0.1. Runtime metadata SHALL be persisted to
/data/config/quadlet-runtime.json.
Scenario: Rootless published ports bind to loopback
- WHEN a rootless provisioned container declares PublishPort = ["10080:80"]
- THEN the rendered Quadlet contains PublishPort=127.0.0.1:10080:80
Scenario: Privileged containers use host networking
- WHEN a provisioned container declares privileged = true
- THEN the rendered Quadlet contains Network=host
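The two scenarios above can be sketched as a rewrite step applied to a parsed Container table before rendering. The function names are hypothetical, and the sketch assumes IPv4 PublishPort values in either host:container or ip:host:container form; other forms are rejected rather than guessed at.

```python
def rewrite_publish_port(spec: str) -> str:
    """Force a rootless PublishPort value to bind on 127.0.0.1."""
    parts = spec.split(":")
    if len(parts) == 2:
        # host:container, no bind address given
        return f"127.0.0.1:{spec}"
    if len(parts) == 3:
        ip, host_port, container_port = parts
        if ip.startswith("127."):
            return spec  # already a loopback bind
        return f"127.0.0.1:{host_port}:{container_port}"
    raise ValueError(f"unsupported PublishPort form in this sketch: {spec!r}")


def apply_runtime_constraints(container: dict, privileged: bool) -> dict:
    """Return a copy of a Container table with the runtime constraints applied."""
    constrained = dict(container)
    if privileged:
        # Rootful containers are forced onto host networking.
        constrained["Network"] = "host"
    else:
        constrained["Network"] = "pasta"
        if "PublishPort" in container:
            constrained["PublishPort"] = [
                rewrite_publish_port(port) for port in container["PublishPort"]
            ]
    return constrained
```

Rewriting at render time, rather than rejecting non-loopback binds during validation, matches the requirement's wording that such ports are "rewritten to 127.0.0.1".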
Requirement: Bootstrap web console supports config upload
When no provisioning seed file is found, the device SHALL start a constrained local bootstrap web console. The console
SHALL support uploading an existing config.toml or supported config bundle.
Scenario: Applied config is shown back to the operator after apply
- WHEN an operator uploads or pastes a valid config.toml
- THEN the bootstrap UI shows the final applied config.toml content back to the operator
Scenario: Applied config can be downloaded after apply
- WHEN an operator uploads or pastes a valid config.toml
- THEN the bootstrap UI offers a direct download for that final applied config.toml
Requirement: Bootstrap endpoint supports programmatic config import
The bootstrap service SHALL expose a constrained local API endpoint that accepts a complete config.toml payload or a
supported config bundle for programmatic local import using the same validation and persistence path as the web console.
The programmatic endpoint SHALL be POST /api/config and return a JSON async job response for accepted submissions, or
a JSON validation-error response for rejected submissions. First-boot programmatic clients SHALL NOT require the Boot UI
CSRF token; provisioned reapply clients SHALL use SSH signature authentication.
Scenario: Programmatic upload returns an async job
- WHEN a local client POSTs config.toml to /api/config
- THEN the bootstrap service validates the payload and accepts an apply job
- AND the response is JSON containing job_id, initial state, and job_url
Requirement: Bootstrap endpoint narrows after initial provisioning
Before initial provisioning completes, the bootstrap API socket SHALL be reachable on WAN and LAN interfaces so operators can provision a device before LAN settings are known. After a valid provisioning config is applied, the service SHALL rebind to the configured LAN gateway IP and remain available only from the LAN interface for authenticated local recovery or reprovisioning. The first-boot web console SHALL be hidden after provisioning.
Scenario: Bootstrap console listens on WAN and LAN before provisioning
- WHEN the device starts the bootstrap web console
- THEN it is reachable on the bootstrap port from WAN and LAN interfaces until initial provisioning completes
Scenario: Bootstrap API remains available after provisioning
- WHEN a valid provisioning config has already been applied
- THEN the bootstrap API continues listening on the LAN bootstrap endpoint for authenticated recovery or reprovisioning
- AND the bootstrap API is no longer reachable from WAN
Scenario: Bootstrap console is hidden after provisioning
- WHEN a valid provisioning config has already been applied
- THEN the unauthenticated first-boot console is not served
Scenario: Existing config.toml can be uploaded through the bootstrap console
- WHEN an operator opens the bootstrap web console on an unprovisioned device
- THEN the console allows uploading an existing config.toml or supported config bundle for local import
Scenario: Programmatic client can upload config.toml directly
- WHEN a local client POSTs a complete config.toml payload or supported config bundle to the bootstrap API endpoint
- THEN the bootstrap service validates and imports it through the same path used by the web console upload flow
partition-layout
MODIFIED Requirements
Requirement: /data partition survives updates
The /data partition SHALL NOT be modified by RAUC updates or rootfs slot switches. It SHALL persist across all
updates and rollbacks. Provisioned operator configuration SHALL be stored under /data/config/, and wiping /data
SHALL reset the device to an unprovisioned state without removing the existing slot layout.
Scenario: Data survives an A/B slot switch
- WHEN a file is written to /data, then an update switches the active slot from A to B
- THEN the file is still present and unmodified on /data after the slot switch
Scenario: Wiping /data preserves slot layout but resets provisioning state
- WHEN /data is reformatted on a device whose boot-b and rootfs-b partitions already exist
- THEN the device retains its slot layout but re-enters the unprovisioned first-boot provisioning flow on the next boot
update-confirmation
MODIFIED Requirements
Requirement: Manifest-driven container health checks
If /data/config/config.toml exists, the confirmation service SHALL treat the explicit health requirements imported
from that provisioning artifact as the source of truth for required application units. The confirmation service SHALL
verify that each required container or service reaches its expected healthy running state before the slot can be
committed. If no valid provisioning state exists on a production first boot, the slot SHALL remain uncommitted.
Scenario: Provisioned health requirements define required units
- WHEN the confirmation service runs on a provisioned device with a valid imported config.toml
- THEN it reads the explicit health requirements derived from that provisioning state to determine which units must be healthy before committing the slot
Scenario: Missing provisioning state blocks production first-boot commit
- WHEN the device is in the production first-boot path and no valid local provisioning state has been imported
- THEN the slot remains uncommitted rather than being marked good unconditionally
Source Metadata
schema: spec-driven
created: 2026-04-27
Source
Converted from openspec/changes/first-boot-local-provisioning/ during the OpenSpec-to-feature-spec migration.
Tasks: first-boot-local-provisioning
1. Provisioning Contract
- 1.1 Define the supported config.toml schema for managed users, activation requirements, and structured container/Quadlet data
- 1.2 Define the TOML-to-Quadlet rendering rules, including how arrays map to repeated Quadlet directives
- 1.3 Define the canonical persisted layout under /data/config/, including the imported config.toml and rendered Quadlet unit files
2. First-Boot Source Discovery
- 2.1 Add initrd fresh-flash detection that checks whether boot-b is absent before repartitioning and persists a marker for the switched-root provisioning path
- 2.2 Implement provisioning source search in fresh-flash order: /boot/config.toml, then USB mass storage, then bootstrap web console
- 2.3 Implement reprovision source search in reset order: USB mass storage, then bootstrap web console
3. Import And Validation
- 3.1 Import a discovered config.toml into durable state under /data/config/
- 3.2 Render structured Quadlet definitions from config.toml into canonical files under /data/config/quadlet/
- 3.3 Validate the minimum provisioning contract: at least one admin SSH key, at least one Quadlet-defined service, and explicit health requirements
4. First-Boot Commit Behavior
- 4.1 Change the production first-boot path so slot confirmation happens only after successful provisioning import and validation
- 4.2 Update the confirmation/health path to consume explicit health requirements from imported provisioning state
- 4.3 Preserve a development-safe fallback strategy for existing development-mode workflows while the new production gate is introduced
5. Bootstrap Web Console
- 5.1 Add a constrained local bootstrap web console for unprovisioned devices when no seed file is found
- 5.2 Support uploading an existing config.toml through the bootstrap console
- 5.3 Support pasting a valid config.toml and applying it locally
- 5.4 Support programmatic local import of a complete config.toml through the bootstrap endpoint
- 5.5 After apply, show the final applied config.toml in the bootstrap UI and offer a download action for that artifact
- 5.6 Apply minimal AtomixOS branding to the bootstrap UI, including the logo and cobalt-blue theme styling aligned with the mdBook visual language
6. Reprovisioning And Documentation
- 6.1 Define and implement reprovisioning behavior so wiping /data returns the device to provisioning mode without replaying /boot/config.toml
- 6.2 Update OpenSpec/docs to describe /boot initial seeding, USB reprovisioning, bootstrap UI fallback, and the /data/config/ persistence boundary
- 6.3 Add focused validation coverage for fresh flash, reprovisioning, seed-source precedence, and TOML-to-Quadlet rendering
Feature: durable-journald-logs
Overview
Why
- The current root filesystem uses a tmpfs-backed overlay, so host logs disappear on reboot and power loss. That makes boot failures, failed updates, watchdog resets, and rollback events hard to reconstruct after the device recovers.
- At the same time, the image intentionally favors tmpfs-backed runtime state to reduce eMMC write amplification. The logging design needs to preserve that wear benefit while still keeping the most important forensic breadcrumbs durable.
What Changes
- Keep general host journald tmpfs-first during runtime as the log ingress point, without allowing journald itself to write routine logs directly to persistent media
- Add an rsyslog RAM queue behind volatile journald so general host and container logs are collected in memory and written to /data/logs in large, infrequent, sequential batches rather than in many small writes
- Flush the in-memory buffered log queue to /data/logs during orderly shutdown so the last clean shutdown captures the latest buffered host diagnostics
- Align Podman container logging with the same journald-plus-rsyslog buffering policy so routine application logs also remain memory-first during runtime while following the same large-batch persistent append path
- Document retention, durability guarantees, and the boundary between durable host forensics and buffered general host/application logs
Capabilities
Modified Capabilities
- durable-journald-logs: Tmpfs-first host journald feeding an in-memory rsyslog batch queue, with large sequential appends to /data/logs, orderly shutdown flush, and Podman logging aligned to the same buffering model
Impact
- Affected code: boot/update services, host logging configuration in the base system, and the /data log export path
- Affected docs: partition layout, boot/update architecture, and operational debugging guidance
- Operational impact: Devices keep normal host and application logging memory-first during runtime and append broader diagnostics to /data/logs in large, sequential batches plus orderly shutdown flushes
This change explicitly defines the broader durable logging path (tmpfs-first
journald plus /data) as a RAM-queued rsyslog batch append to /data/logs
rather than as timer-driven journal checkpoints.
Design
Context
The Rock64 image runs from a read-only squashfs with a tmpfs-backed overlay for mutable root state. That keeps the runtime root clean on every boot and minimizes steady-state eMMC wear, but it also means host logs are currently volatile unless they are copied elsewhere.
The failures this project cares about most are lifecycle failures: boot regressions, failed update confirmation, watchdog resets, rollback decisions, networking bring-up issues, and first-boot provisioning problems. Those events need post-reboot forensic visibility, especially when a device recovers only after a power cycle or slot rollback.
The device already has two different writable surfaces with different roles:
- /boot is mounted from the active slot’s FAT boot partition and is available as part of the boot/update path
- /data is the main persistent mutable partition for provisioned state and application data
Applications are expected to run in Podman, with container storage already
rooted on /data. That means this change does not need to invent a general
persistent application-data model. The problem here is host-platform forensics
under tight write-wear constraints.
Goals / Non-Goals
Goals:
- Preserve a bounded record of critical host lifecycle events across reboot, rollback, and as many power-loss cases as practical
- Keep normal host logging memory-first so the image retains the write-reduction benefits of its tmpfs-backed runtime model
- Reserve a fixed forensic budget on each boot slot with deterministic retention behavior
- Make the durability guarantees explicit: critical mirrored events are durable-first, general runtime logs are not
Non-Goals:
- Shipping a remote log collection pipeline or external log forwarding
- Making the full host journal power-loss durable
- Defining long-term application log retention policy beyond establishing the boundary with Podman logging
- Changing the overall partition layout or introducing a dedicated log partition
Decisions
1. Use a three-tier logging model
Choice: Split logging into three tiers with different durability and wear profiles.
Tier 0: Slot-local forensic black box on /boot
Tier 1: Volatile host journal in memory
Tier 2: RAM-queued batched log appends to /data/logs
Tier 0 exists for the small set of events that must survive reboot and should
be made as power-loss resistant as practical. Tier 1 keeps normal host logging
in RAM so the device does not continuously write routine logs to eMMC. Tier 2
captures the bounded /data export path for richer general diagnostics,
including Podman log traffic that is routed through journald and then queued in
RAM before being appended to /data/logs in large sequential batches.
Alternatives considered:
- Single persistent journald store: simpler conceptually, but undermines the memory-first wear model by making all host logging durable.
- No distinction between forensic and buffered logs: blurs the durability boundary and makes scope creep likely.
2. Reserve 28 MiB per boot slot for Tier 0 forensic storage
Choice: Dedicate up to 28 MiB of each 128 MiB boot partition to bounded forensic storage.
Rationale: The current kernel/initrd/DTB payload is well below the full
boot partition size, and the user explicitly wants a slot-local forensic
reserve that survives with the slot. 28 MiB is large enough for a substantial
lifecycle event history while still preserving generous headroom for future
kernel and initrd growth.
This storage should behave like a black box, not a general-purpose filesystem for arbitrary logs. A fixed budget makes retention deterministic and prevents forensic artifacts from crowding out boot assets.
Alternatives considered:
- Store Tier 0 only on /data: simpler long-term store, but loses the advantage of slot-local, early-available forensic state.
- Use a smaller budget: safer for boot growth but less useful for field debugging.
- Use a larger budget: possible, but starts trading too much future boot payload headroom for logs.
3. Runtime durability should come from buffered /data/logs appends
Choice: Keep journald volatile and use rsyslog with an in-memory queue
to append larger batches to /data/logs.
Rationale: The active runtime goal is to preserve the eMMC-wear benefits of
tmpfs-first logging while still making broader host and container diagnostics
available after clean operation and orderly shutdown. Buffered appends to
/data/logs provide a simpler path than a separate slot-local Tier 0 store and
match the logging model now used by the runtime services.
Alternatives considered:
- Make journald persistent directly: simpler pipeline, but increases write amplification during steady-state operation.
- Keep a separate slot-local forensic ring: more durable for a narrow set of events, but adds another logging path and extra boot-partition complexity.
with optional fields such as:
- result
- target_slot
- reason
- version
- device
- service
- attempt
- detail
The boot_id + seq pairing is intentional:
- boot_id groups events by boot session for cleaner forensic reading
- seq provides strict ordering even when timestamps are coarse, identical, or corrected later by NTP
This is stronger than timestamps alone and simpler than a single global persistent sequence spanning all boots.
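A minimal sketch of the boot_id + seq pairing follows. It assumes space-separated key=value fields with no whitespace inside values (the exact delimiter is not fixed here), and `Tier0Sequencer`, `parse_record`, and the slot name `A` are hypothetical names used only for illustration.

```python
from typing import Dict, Optional
import uuid


class Tier0Sequencer:
    """Hypothetical per-boot record builder; create one instance per boot session."""

    def __init__(self, slot: str, boot_id: Optional[str] = None) -> None:
        self.slot = slot
        self.boot_id = boot_id or uuid.uuid4().hex  # fresh boot_id for each boot
        self.seq = 0  # restarts for every new boot_id

    def record(self, ts: str, stage: str, event: str, **optional: str) -> str:
        """Return one single-line record; values must not contain whitespace."""
        self.seq += 1
        fields = {"boot_id": self.boot_id, "seq": str(self.seq), "ts": ts,
                  "slot": self.slot, "stage": stage, "event": event}
        fields.update(optional)
        return " ".join(f"{key}={value}" for key, value in fields.items())


def parse_record(line: str) -> Dict[str, str]:
    """Split a record back into its key/value fields."""
    return dict(field.split("=", 1) for field in line.split())


boot = Tier0Sequencer(slot="A", boot_id="boot1")
first = boot.record("2026-01-01T00:00:00Z", "boot", "boot-start")
second = boot.record("2026-01-01T00:00:00Z", "boot", "rootfs-mount-ok", result="ok")

# The timestamps are identical, yet seq still gives a strict order.
ordered = sorted(map(parse_record, [second, first]), key=lambda r: int(r["seq"]))
events = [r["event"] for r in ordered]
```

Because seq is only meaningful inside one boot_id, ordering across boots would still rely on where records sit in the store rather than on seq itself.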
Alternatives considered:
- JSON lines: more machine-friendly, but more verbose and more awkward to emit robustly from shell-heavy boot paths.
- Binary records: more deterministic, but far less debuggable in the field.
- Timestamp-only ordering: too weak for early boot and near-simultaneous events.
5. Mirror only critical lifecycle events into Tier 0
Choice: Limit Tier 0 to a narrow event taxonomy instead of trying to mirror the full journal.
Rationale: The goal is post-failure reconstruction, not durable storage for all log chatter. A smaller event vocabulary keeps write volume low and makes the black-box log more useful during triage.
The initial event taxonomy should include these stage names:
- initrd
- boot
- firstboot
- rauc
- verify
- rollback
- watchdog
- shutdown
The initial event set should include:
- initrd and boot progression markers
- active slot and rootfs selection markers
- /data mount success or failure
- first-boot start and completion
- RAUC install start, success, and failure
- update-confirmation start, success, and failure
- rollback detection
- watchdog-related reset markers or inferred reboot cause markers
- orderly shutdown flush begin and end
Representative event names include:
- boot-start
- lowerdev-selected
- rootfs-mount-ok
- rootfs-mount-failed
- userspace-start
- data-mount-ok
- data-mount-failed
- boot-complete
- start
- complete
- failed
- install-start
- install-complete
- install-failed
- mark-good-start
- mark-good-complete
- mark-good-failed
- detected
- slot-fallback
- boot-attempt-exhausted
- armed
- reboot-inferred
- flush-begin
- flush-end
- reboot-requested
- poweroff-requested
Alternatives considered:
- Mirror the whole journal: too write-heavy and defeats the purpose of volatile-first logging.
- Log only RAUC events: too narrow; boot and watchdog failures would still be opaque.
6. Tier 0 events should be written with durable-first semantics
Choice: Treat each Tier 0 write as an immediate durability event and flush it explicitly.
Rationale: The whole point of Tier 0 is surviving the cases where Tier 1 volatile logs disappear. Each critical event should therefore be written and flushed in a way that minimizes exposure to power loss.
This does not create a theoretical guarantee against every possible corruption mode, but it does create the strongest practical durability semantics in the current storage model.
Alternatives considered:
- Batch writes for efficiency: lower write overhead, but directly weakens the power-loss guarantee.
- Rely on periodic journal export only: leaves exactly the most important events vulnerable.
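One way to read "written and flushed" is an append followed by an explicit fsync before the caller continues. The sketch below assumes a plain append-only file; `append_durable` is a hypothetical helper, not the project's actual writer.

```python
import os


def append_durable(path: str, line: str) -> None:
    """Append one record and ask the kernel to push it to storage before returning."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = 0
        while written < len(data):  # os.write may write fewer bytes than asked
            written += os.write(fd, data[written:])
        os.fsync(fd)  # flush this file's data to the device
    finally:
        os.close(fd)
    # Best effort: also sync the parent directory so a newly created file's
    # directory entry is durable. Some filesystems may reject this call.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    except OSError:
        pass
    finally:
        os.close(dir_fd)
```

Each call costs a device flush, which is why this pattern only suits a small, event-limited log.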
7. Keep Tier 1 journald tmpfs-first with bounded loss
Choice: Keep normal host journald storage in tmpfs during runtime, bound
its runtime usage with an explicit cap, and treat it as the ingestion point for
an in-memory rsyslog queue rather than as the persistent log store.
Rationale: This preserves the eMMC-wear benefits of the tmpfs-backed system while reducing the blast radius of abrupt power loss for general diagnostics. Tier 0 still carries the always-durable lifecycle breadcrumbs, while Tier 1 provides the live message stream without allowing journald itself to emit many small persistent writes.
Alternatives considered:
- Fully persistent journald on /data: simpler durability story, but keeps routine host logging write-heavy all the time.
- Purely volatile journald forever: preserves wear benefits, but discards too much general diagnostic history on power loss and reboot.
- No journald cap: risks memory pressure from noisy services.
8. Use rsyslog as the Tier 2 RAM queue and batch writer
Choice: Introduce rsyslog behind volatile journald and configure it with
an in-memory queue that appends buffered log data to /data/logs in large,
infrequent, sequential writes.
Rationale: The point of the durable host log path is not merely to keep
logs in RAM longer. It is to transform many small writes into much larger,
more sequential writes that are friendlier to eMMC wear characteristics.
rsyslog provides a mature queueing model for this that journald alone does
not expose as clearly.
The batch queue should remain RAM-backed during normal operation. The
persistent path should be append-oriented, size-bounded, and rotated in larger
chunks under /data/logs. Orderly shutdown should flush queued log data so the
latest clean shutdown preserves the most recent buffered diagnostics.
This change does not introduce log2ram; the system already runs on a
tmpfs-backed overlay, so adding another general /var/log RAM shim would be
redundant complexity for this design.
Alternatives considered:
- Timer-driven journal checkpoints: better than fully persistent journald, but still less explicit about batching policy and sequential write behavior.
- Direct continuous journald persistence: simpler, but worse for write amplification.
- log2ram plus ad hoc file syncing: redundant with the existing overlay model and less targeted than an explicit queued logging layer.
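The batching policy described in this decision can be illustrated independently of rsyslog. The sketch below is not rsyslog configuration; it is a hypothetical `BatchAppender` that shows how many small log lines become one larger append once a size or age threshold is crossed. The threshold values are arbitrary placeholders.

```python
import time


class BatchAppender:
    """Hypothetical RAM queue that appends to one file in large batches."""

    def __init__(self, path, max_bytes=256 * 1024, max_age_s=300.0, clock=time.monotonic):
        self.path = path
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.clock = clock
        self._queue = []
        self._queued_bytes = 0
        self._oldest = None

    def enqueue(self, line):
        """Hold a line in RAM; write only when a threshold is crossed."""
        data = (line.rstrip("\n") + "\n").encode("utf-8")
        if not self._queue:
            self._oldest = self.clock()
        self._queue.append(data)
        self._queued_bytes += len(data)
        too_big = self._queued_bytes >= self.max_bytes
        too_old = self.clock() - self._oldest >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Append everything queued in one write; also called on orderly shutdown."""
        if not self._queue:
            return 0
        batch = b"".join(self._queue)
        with open(self.path, "ab") as handle:
            handle.write(batch)
        self._queue.clear()
        self._queued_bytes = 0
        return len(batch)
```

Anything still queued at an abrupt power loss is gone, which is the bounded-loss trade-off this tier accepts.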
9. Align Podman logging with the buffered journald policy
Choice: Set Podman’s container log_driver to journald so routine
application stdout and stderr follow the same tmpfs-first journald ingestion,
RAM-queued rsyslog buffering, large sequential /data/logs append path, and
shutdown-flush behavior as host logs.
Rationale: Applications run in Podman and their durable state already lives
on /data, but application stdout/stderr retention is a different question
from host lifecycle forensics. Pinning the log driver avoids drift in Podman
defaults and keeps application log behavior aligned with the explored logging
boundary instead of creating a separate file-backed log path with different
durability semantics.
Alternatives considered:
- Make app logs part of Tier 0: too broad and too write-heavy.
- Ignore app logs entirely: leaves an important design boundary undocumented.
Tier 1 / Tier 2 Resolution
The recovered explore session was most settled on the Tier 0 /boot forensic
model. The broader journald-to-/data path was clearly part of the intended
architecture, but some details remained less fully pinned down at explore time.
This change resolves the main Tier 2 open question explicitly:
- use volatile journald as the entry point
- use an in-memory rsyslog queue as the batching layer
- append to /data/logs in large sequential writes
- keep shutdown flush as a secondary durability improvement
The remaining implementation-time decisions are narrower:
- exact queue sizing and dequeue batch thresholds
- exact rotation and retention policy under /data/logs
- exact journald filtering and rate-limiting thresholds
- whether /data should also gain mount options such as noatime
Risks / Trade-offs
- FAT boot storage is not a perfect forensic medium -> Mitigate by keeping the Tier 0 format simple, bounded, append-oriented, and tolerant of a torn final line.
- Immediate durable writes still create some wear -> Acceptable because Tier 0 is intentionally tiny and event-limited.
- Slot-local boot logs may not follow the active slot after rollback -> This is partly a feature, because each slot preserves its own recent history; docs should make that mental model clear.
- Application log volume could pressure the buffered journal path -> Mitigate by pinning Podman to journald, keeping the host runtime cap explicit, and bounding the rsyslog RAM queue plus /data/logs rotation budget.
- An extra logging daemon increases moving parts -> Acceptable because it provides explicit queueing and batching behavior that directly serves the eMMC longevity goal.
- Metadata corruption could obscure the active segment -> Keep metadata minimal and recoverable by scanning segment files if needed.
Post-Review Hardening
The initial implementation satisfied the core change goals, but a later review found a small set of durability and correctness gaps that were fixed before closing validation.
- The active-slot forensic mount helper now verifies an actual mount via findmnt instead of assuming directory existence implies durable boot storage. This prevents Tier 0 writes from silently falling back to tmpfs.
- The initrd forensic helper now fails explicitly on missing boot-device or mount prerequisites instead of silently succeeding. This keeps early lifecycle markers aligned with the change’s durable-first intent.
- The update-confirmation path now logs mark-good-complete only on real success and logs mark-good-failed plus verify failed on failure in both the first-boot fallback and post-health-check confirmation paths.
- RAUC status parsing was corrected to read the keyed slot structure returned by rauc status --output-format=json, avoiding false “already good” or incorrect current-version decisions.
- Routine polling and “no update” style upgrade chatter was removed from Tier 0 so the durable forensic budget stays focused on high-value lifecycle evidence.
- Regression coverage was extended to include mount-selection behavior and an explicit negative mark-good confirmation path in rauc-confirm, including test harness steps needed to avoid stale cached RAUC state across phases.
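The mount check in the first item can be sketched as follows. This is an illustration, not the project's helper: `is_real_mount` is a hypothetical name, the `vfat` expectation is an assumption drawn from the FAT boot storage mentioned under Risks, and the `run` parameter exists only so the check can be exercised without a real mount.

```python
import subprocess


def is_real_mount(path, expected_fstypes=("vfat",), run=subprocess.run):
    """Return True only if path is itself a mountpoint with an expected filesystem."""
    result = run(
        ["findmnt", "--noheadings", "--output", "FSTYPE", "--mountpoint", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return False  # a plain directory, not a mountpoint
    return result.stdout.strip() in expected_fstypes
```

A directory that merely exists on the tmpfs-backed overlay fails this check, so a writer can refuse to log instead of silently losing records at reboot.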
Migration Plan
Existing devices pick up the new configuration on the next deployed image. The boot partitions gain a reserved forensic directory within the existing slot budget, and host lifecycle services begin mirroring critical events into that bounded store.
Rollback is straightforward: remove the Tier 0 writer and return to the prior general journald policy. No data migration is required for Tier 0 because the forensic store is bounded, slot-local, and self-contained.
Final Scope Notes
- The explored design did not choose always-persistent journald on /data. Instead it chose tmpfs-first journald feeding a RAM-queued rsyslog batch writer to /data/logs, plus orderly shutdown flush, while relying on the bounded Tier 0 /boot recorder for the critical always-durable lifecycle evidence.
- Tier 0 remains the power-loss-first forensic layer.
- Tier 1 remains memory-first during runtime.
- Tier 2 trades some immediate durability for much lower write amplification by appending buffered logs to /data/logs in larger sequential writes.
- Podman logging is pinned to journald so container log traffic follows the same buffered host logging path.
- Queue sizing, rate limits, and /data/logs retention remain tunable implementation details rather than core architecture changes.
Requirements
durable-journald-logs
ADDED Requirements
Requirement: Host journald is tmpfs-first during runtime
The system SHALL configure host journald to keep general runtime logs in volatile storage during normal runtime so routine host logging remains memory-first rather than continuously writing to persistent media.
Scenario: Runtime host logs stay memory-first
- WHEN the device writes a non-critical host journal entry during normal runtime
- THEN that entry is written into the volatile runtime journal rather than directly to persistent journal storage on /data
Requirement: Runtime journal usage is explicitly bounded
The system SHALL apply an explicit runtime journal size cap so memory-first logging does not grow without bound.
Scenario: Runtime journal stays within the configured cap
- WHEN runtime journal usage reaches the configured storage cap
- THEN journald rotates or removes older runtime journal data before exceeding that cap
Requirement: General logs are written to /data/logs in large sequential batches
The system SHALL use a RAM-queued batching layer behind volatile journald so
general host log data is appended to persistent storage under /data/logs in
large, infrequent, sequential writes rather than in many small direct writes.
Scenario: Buffered host logs are appended during runtime buffering flushes
- WHEN the device continues normal runtime logging and the buffering layer reaches its configured write threshold or flush interval
- THEN buffered general host journal data is appended to persistent storage under /data/logs in a large sequential write
Requirement: Buffered general logs are flushed to /data/logs on orderly shutdown
The system SHALL flush the current buffered general log state to persistent
storage under /data/logs during orderly shutdown so the latest clean shutdown
retains the most recent buffered host diagnostics.
Scenario: Orderly shutdown persists buffered host logs
- WHEN the device performs an orderly reboot or poweroff
- THEN the buffered general log queue is flushed to persistent storage under /data/logs before shutdown completes
Requirement: Container logs follow the same buffered journald boundary
The system SHALL configure Podman to use the journald log driver so routine
container stdout and stderr are recorded through journald instead of file-based
container logs, and SHALL keep those logs inside the same tmpfs-first,
RAM-queued, batched-append, and shutdown-flushed pipeline as other non-Tier 0
logs.
Scenario: Container logs are sent to journald
- WHEN a container writes to stdout or stderr during normal operation
- THEN that log traffic is emitted through journald and follows the same buffered runtime retention policy and batched /data/logs append path as other non-Tier 0 logs
forensic-log-durability
ADDED Requirements
Requirement: Critical lifecycle events are mirrored to slot-local forensic storage
The system SHALL mirror critical host lifecycle events into a bounded forensic store on the active boot slot so they remain available after reboot, slot rollback, and as many power-loss scenarios as practical. The initrd portion of this path remains incomplete until the early-boot persistence design is revised to avoid fragile direct boot-partition mounts during normal initrd execution.
Scenario: Critical boot event is retained after reboot
- WHEN the device records a critical boot or update lifecycle event and then reboots
- THEN that event remains available from the slot-local forensic store after the reboot
Scenario: Failed update leaves forensic evidence
- WHEN an update attempt fails and the device later rolls back to a previous slot
- THEN the affected slot retains its recent mirrored lifecycle events for forensic inspection
Requirement: Slot-local forensic storage is strictly bounded
The system SHALL cap slot-local forensic storage at 28 MiB per boot slot.
The system SHALL represent that budget as seven 4 MiB segment files plus
minimal metadata, and SHALL rotate or overwrite the oldest forensic records
when that limit is reached.
Scenario: Forensic store reaches capacity
- WHEN new mirrored lifecycle events would exceed the 28 MiB storage budget on a boot slot
- THEN the system retains newer events and removes or overwrites the oldest retained forensic records within that slot
Scenario: Segment rollover preserves bounded retention
- WHEN the active 4 MiB segment fills during normal operation
- THEN the system advances to the next segment, reuses the oldest segment when necessary, and continues writing without exceeding the 28 MiB slot budget
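The rollover scenario can be sketched as a ring of fixed-size files. The segment file names and the `SegmentRing` class are hypothetical, and the sketch keeps the active index in memory where the requirement calls for minimal on-disk metadata.

```python
import os

SEGMENT_COUNT = 7
SEGMENT_BYTES = 4 * 1024 * 1024  # 7 segments x 4 MiB = 28 MiB per slot


class SegmentRing:
    """Hypothetical bounded store: fixed-size segment files reused oldest-first."""

    def __init__(self, directory, count=SEGMENT_COUNT, segment_bytes=SEGMENT_BYTES):
        self.directory = directory
        self.count = count
        self.segment_bytes = segment_bytes
        self.active = 0  # a real store would persist this in its metadata

    def _path(self, index):
        return os.path.join(self.directory, f"segment-{index}.log")  # assumed name

    def append(self, line):
        data = (line.rstrip("\n") + "\n").encode("utf-8")
        if len(data) > self.segment_bytes:
            raise ValueError("record larger than one segment")
        path = self._path(self.active)
        size = os.path.getsize(path) if os.path.exists(path) else 0
        if size + len(data) > self.segment_bytes:
            # Active segment is full: advance and truncate the next one, which
            # is the oldest segment once the ring has wrapped.
            self.active = (self.active + 1) % self.count
            path = self._path(self.active)
            open(path, "wb").close()
        with open(path, "ab") as handle:
            handle.write(data)

    def total_bytes(self):
        paths = (self._path(i) for i in range(self.count))
        return sum(os.path.getsize(p) for p in paths if os.path.exists(p))
```

Because a record is never split across segments, total usage can sit slightly below the budget but cannot exceed count x segment_bytes.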
Requirement: Tier 0 records use boot-scoped ordering
The system SHALL encode each Tier 0 forensic record as a single-line key/value
record. Each record SHALL include boot_id, seq, ts, slot, stage, and
event. The system SHALL reset seq at the start of each new boot_id.
Scenario: Events within a boot are strictly ordered
- WHEN multiple Tier 0 events are written during the same boot session
- THEN their seq values increase monotonically within that boot_id
Scenario: New boot starts a new forensic sequence
- WHEN the device reboots into a new boot session on the same slot
- THEN the device writes records with a new boot_id and restarts seq from the beginning for that boot session
Requirement: Tier 0 event scope is limited to high-value lifecycle records
The system SHALL limit slot-local forensic storage to high-value lifecycle
events. Allowed Tier 0 stages SHALL include initrd, boot, firstboot,
rauc, verify, rollback, watchdog, and shutdown. Tier 0 events SHALL
cover boot progression, slot selection, /data mount outcome, update
lifecycle events, update-confirmation outcome, rollback detection,
watchdog-related reset markers, orderly shutdown flush markers, and managed
reboot or poweroff request markers where those flows are part of the system.
The initrd stage specifically requires redesign before this requirement can be
considered complete.
Scenario: Noisy routine logs are excluded from Tier 0
- WHEN ordinary service or application log traffic is emitted during normal runtime
- THEN that traffic is not mirrored wholesale into the slot-local forensic store
Scenario: Failed slot keeps its own forensic history
- WHEN the device boots into an updated slot, fails, and later rolls back to the previous slot
- THEN the failed slot retains its own recent Tier 0 forensic records on its boot partition for later inspection
Requirement: General host logging remains memory-first outside Tier 0
The system SHALL keep general host journald logging memory-first during runtime and SHALL reserve the slot-local forensic store for critical lifecycle evidence rather than for general-purpose log persistence.
Scenario: Runtime host logs are not automatically durable
- WHEN a non-critical host log entry is written only to the general runtime journal
- THEN that entry is not guaranteed to survive an abrupt power loss
Scenario: Tier 0 remains focused on critical evidence
- WHEN routine host or application log traffic is emitted during normal operation
- THEN that traffic is handled through the general volatile-journald plus RAM-queued batch logging path rather than being mirrored wholesale into the slot-local forensic store
Source Metadata
schema: spec-driven
created: 2026-04-25
Source
Converted from openspec/changes/durable-journald-logs/ during the OpenSpec-to-feature-spec migration.
Tasks: durable-journald-logs
1. Tier 0 Forensic Storage Design
- 1.1 Define the slot-local forensic storage layout under /boot with a hard 28 MiB per-slot retention budget
- 1.2 Implement the Tier 0 metadata and segment layout as meta plus seven 4 MiB segment files per boot slot
- 1.3 Implement Tier 0 records as single-line key/value records with required fields boot_id, seq, ts, slot, stage, and event
- 1.4 Implement per-boot boot_id + seq ordering so seq resets for each new boot session
- 1.5 Implement segment rollover that reuses the oldest segment without exceeding the 28 MiB slot budget
- 1.6 Define durable-write semantics for Tier 0 records so critical lifecycle events are flushed immediately and torn final lines are tolerated during readback
2. Critical Event Coverage
- 2.1 Redesign initrd Tier 0 forensic logging so early-boot markers are captured without relying on ad hoc boot-partition mounts during normal initrd execution
- 2.2 Wire /data mount outcome, first-boot, RAUC install, update confirmation, and shutdown flush or managed reboot markers into Tier 0 forensic logging using the defined stage/event taxonomy
- 2.3 Record concrete slot-transition, rollback, and watchdog lifecycle markers on the real device path rather than only in test scaffolding
- 2.4 Ensure Tier 0 captures enough information to reconstruct failed update and rollback flows without mirroring the whole journal
3. Buffered Runtime Logging Boundary
- 3.1 Keep general host journald tmpfs-first during runtime and set an explicit bounded runtime size cap
- 3.2 Add rsyslog behind volatile journald with a RAM-backed queue for general host logging
- 3.3 Append buffered general host logs to /data/logs in large, infrequent, sequential batches rather than many small direct writes
- 3.4 Flush the buffered general log queue to /data/logs during orderly shutdown
- 3.5 Pin Podman container logging to journald so application stdout and stderr follow the same buffered journald path
- 3.6 Validate Podman’s logging path and retention behavior under the chosen journald-plus-rsyslog buffering model
- 3.7 Define /data/logs rotation, retention, and append-file layout for the large-batch persistent path
- 3.8 Evaluate whether /data should gain mount options such as noatime to further reduce metadata writes on the persistent log path
- 3.9 Document the boundary between durable Tier 0 host forensics and buffered general host/application logs
4. Validation and Documentation
- 4.1 Verify the redesigned initrd Tier 0 path records the intended early-boot evidence without leaving failed initrd units on successful boots
- 4.2 Verify the bounded retention model overwrites old records without exceeding the per-slot 28 MiB budget
- 4.3 Verify critical Tier 0 events remain available after simulated failed update or rollback scenarios using the real forensic implementation rather than test-only stubs
- 4.4 Verify per-boot boot_id + seq ordering behaves correctly across reboot, slot switch, and rollback scenarios
- 4.5 Verify the rsyslog RAM queue appends buffered logs to /data/logs in large sequential batches during normal runtime
- 4.6 Verify orderly shutdown flush persists the latest buffered general logs to /data/logs
- 4.7 Verify Podman/application logs follow the intended memory-first and batched persistent retention path
- 4.8 Update architecture and operational docs to describe the three-tier logging model and the power-loss durability boundary
- 4.9 Record post-review hardening fixes and regression coverage for the redesigned initrd forensic path, mount selection, RAUC confirmation failure handling, and Tier 0 event filtering
Config Reapply Improvements
Summary
Harden the existing config.toml re-apply path and formalize the configuration contract. The feature keeps first-time
provisioning local and unblocked, but requires authenticated, validated, atomic replacement for already-provisioned
devices. It also moves OS/device settings into explicit top-level sections and nests all container-related Quadlet config
under [containers].
Project Plan Source
This feature is seeded from docs/src/planned-features.md entry config-reapply-improvements, plus the feature request
to restructure config.toml around [users], [network], and [containers], and to introduce an official schema.
Goals
- Reject unauthenticated POST /api/config requests on already-provisioned devices.
- Validate config.toml against an official schema before any persistent state is replaced.
- Replace /data/config atomically enough that crashes do not leave partially imported state.
- Roll back to the previous config when service activation fails after re-apply.
- Reserve top-level config.toml sections for OS/device configuration.
- Move all container, network, volume, and build Quadlet configuration under [containers].
- Introduce structured top-level [users] and [network] sections.
- Manage local users declared under [users.<name>].
- Preserve the fresh-flash provisioning path from /boot, USB, and the bootstrap UI.
Non-Goals
- Full /data wipe or factory reset behavior.
- Partial config updates; re-apply remains a full replacement operation.
- Changing the A/B update model or RAUC slot confirmation semantics beyond checking re-applied services.
- Adding remote fleet management or Nixstasis integration.
- Making config.toml a general-purpose Linux distribution configuration format.
Current Behavior
scripts/first-boot-provision.py owns config parsing, import, bootstrap UI, and POST /api/config. The current format
uses top-level [admin], [firewall], [activation], optional [lan], optional [os_upgrade], and top-level Quadlet
tables such as [container.<name>], [network.<name>], [volume.<name>], and [build.<name>].
The existing import path writes derived state under /data/config, including:
- config.toml
- admin-signers
- ssh-authorized-keys/admin
- firewall-inbound.json
- lan-settings.json
- os-upgrade.json
- quadlet/quadlet-runtime.json
The base image currently sets users.mutableUsers = false and declares only fixed service users such as appsvc.
OpenSSH already reads authorized keys from /data/config/ssh-authorized-keys/%u, but arbitrary config-declared
users do not exist unless a runtime apply step materializes them.
The planned feature states that basic re-apply already works by accepting a POST, overwriting /data/config, and running
Quadlet sync. This feature narrows that behavior into a safer state machine.
Config Contract
Top-Level Sections
Top-level sections are reserved for OS/device configuration:
- [users]
- [network]
- [activation]
- [os_upgrade]
- [containers]
The prior top-level [admin], [firewall], [lan], [container], [network] as Quadlet networks, [volume], and
[build] tables are rejected by the new schema. The schema uses version = 2 for this intentionally breaking config
shape. AtomixOS is still unreleased and in design/testing, so this feature does not need a compatibility or migration
path for earlier test configs. Existing examples and docs must be updated in the same unit of work.
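A minimal sketch of the top-level check, operating on an already parsed config mapping. It assumes `version` is itself a top-level key, which the text implies but does not state, and `check_top_level` is a hypothetical name. Detecting a legacy Quadlet-style [network.<name>] table needs validation of the keys inside [network] and is not shown.

```python
ALLOWED_TOP_LEVEL = {"version", "users", "network", "activation", "os_upgrade", "containers"}
LEGACY_TOP_LEVEL = {"admin", "firewall", "lan", "container", "volume", "build"}


def check_top_level(config):
    """Return path-specific errors for a parsed config.toml mapping."""
    errors = []
    if config.get("version") != 2:
        errors.append("version must be 2")
    for key in sorted(config):
        if key in LEGACY_TOP_LEVEL:
            errors.append(f"{key}: legacy top-level section is not accepted in version 2")
        elif key not in ALLOWED_TOP_LEVEL:
            errors.append(f"{key}: unknown top-level section")
    return errors
```

Returning a list instead of raising on the first problem lets the caller report every offending section in a single response.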
Users
[users] contains named local users. The implementation must manage declared users, including creating or updating local
accounts and their SSH authorized keys from the config.
Example:
[users]
[users.admin]
isAdmin = true
ssh_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCt5v7m8X9Zl5n"
[users.guest]
isAdmin = false
ssh_key = ""
Rules:
- isAdmin defaults to false.
- ssh_key defaults to an empty string.
- At least one admin user with a non-empty SSH public key is required before first boot can complete.
- Empty SSH keys are ignored, not written as authorized key lines.
- Admin users are members of wheel and podman (when those groups exist); non-admin users are not.
- Removed users from a re-applied config are disabled or locked rather than silently retaining access.
- Usernames must be validated against a narrow safe pattern and must not collide with protected system users or unmanaged existing local accounts.
- The existing password-locked, key-only SSH model remains mandatory for all managed users.
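Several of these rules can be checked before anything is written. The sketch below is an assumption-heavy illustration: the username pattern is one plausible "narrow safe pattern", the protected set contains only the two fixed users named in this document, and `validate_users` is a hypothetical function operating on the parsed [users] table.

```python
import re

USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,30}")  # assumed pattern
PROTECTED_USERS = frozenset({"root", "appsvc"})  # fixed system users named in this spec


def validate_users(users):
    """Return error strings for the parsed [users] table; empty means valid."""
    errors = []
    has_admin_key = False
    for name, entry in users.items():
        if not USERNAME_RE.fullmatch(name):
            errors.append(f"users.{name}: username does not match the allowed pattern")
        elif name in PROTECTED_USERS:
            errors.append(f"users.{name}: collides with a protected system user")
        is_admin = entry.get("isAdmin", False)  # isAdmin defaults to false
        ssh_key = entry.get("ssh_key", "").strip()  # ssh_key defaults to empty
        if is_admin and ssh_key:
            has_admin_key = True
    if not has_admin_key:
        errors.append("users: at least one admin user with a non-empty ssh_key is required")
    return errors
```

Collisions with unmanaged existing local accounts depend on the running system and would need a separate check at apply time.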
Because the root filesystem is an immutable squashfs with an ephemeral overlay, managed users must be derived from
persisted config on every boot or re-apply. The import path should write normalized user state under /data/config, and
a dedicated runtime apply step should materialize those users and groups before SSH access is expected. The apply step
must preserve fixed system users such as root and appsvc; operator accounts, including an optional admin username,
come from [users.<name>] config.
Network
[network] contains device networking, DNS, dnsmasq, and firewall configuration.
The implemented schema covers the LAN gateway and firewall controls this feature wires into runtime services:
- dnsmasq enablement and dnsmasq LAN configuration.
- upstream NTP servers for chrony, defaulting to Cloudflare NTP.
- Firewall rules equivalent to the current provisioned firewall model.
DNS servers, DNS search domains, arbitrary interface configuration, and default gateway configuration are deferred until runtime support is implemented.
The default network behavior remains the current LAN gateway design:
- eth0 is WAN.
- eth1 is LAN.
- dnsmasq is enabled by default.
- LAN gateway defaults to 172.20.30.1/24.
- DHCP serves the existing 172.20.30.10 through 172.20.30.254 range unless overridden.
- DHCP options 3, 6, and 42 point at the gateway IP.
- DNS remains gateway-local by default.
- NTP is served to LAN clients by chrony.
- IP forwarding remains disabled.
Containers
[containers] is the only top-level section for operator-provisioned Quadlet config. It contains nested sections for
container units and supporting units.
The canonical structure should be:
[containers.container.example]
privileged = false
[containers.container.example.Container]
Image = "docker.io/library/nginx:latest"
[containers.network.app]
[containers.network.app.Network]
Subnet = "10.89.0.0/24"
[containers.volume.data]
[containers.volume.data.Volume]
Driver = "local"
[containers.build.custom]
[containers.build.custom.Build]
File = "${FILES_DIR}/Containerfile"
ImageTag = "localhost/custom:latest"
Rules:
- Container units continue to use the existing rootful/rootless safety boundary.
- Network and volume Quadlet units remain rootful.
- ${CONFIG_DIR} and ${FILES_DIR} substitution behavior remains unchanged.
- quadlet-runtime.json remains the authoritative runtime metadata for sync.
Official Schema
The repository already has schemas/config.schema.json and a small in-repo schema validator in
scripts/first-boot-provision.py. This feature must replace that schema with the new canonical config.toml contract and
keep the in-repo validator unless implementation proves it cannot express a required rule. The schema should produce clear
path-specific errors such as network.interfaces.eth1.address must be a CIDR string.
Schema requirements:
- Validate allowed and required keys.
- Validate types, defaults, enums, and port ranges.
- Validate cross-field constraints, such as DHCP range matching the LAN subnet.
- Validate that required service names reference rendered Quadlet units.
- Validate that at least one admin SSH key exists.
- Reject legacy top-level config sections rather than silently accepting or migrating them.
- Be usable by first-boot-provision validate and by the bootstrap API before persistent writes.
Avoid adding a third-party schema dependency unless the in-repo validator cannot support a required rule within a small, auditable implementation.
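As an illustration of a cross-field rule with path-specific errors, the sketch below checks that a DHCP range sits inside the LAN subnet using only the standard library. The config paths passed as defaults (`network.interfaces.eth1.address` from the example error above, and a hypothetical `network.dhcp` table with `start` and `end` keys) are assumptions, as is the function name.

```python
import ipaddress


def check_lan_dhcp(address, start, end,
                   address_path="network.interfaces.eth1.address",
                   dhcp_path="network.dhcp"):
    """Return path-specific error strings; an empty list means the fields agree."""
    if "/" not in address:
        return [f"{address_path} must be a CIDR string"]
    try:
        gateway = ipaddress.ip_interface(address)
    except ValueError:
        return [f"{address_path} must be a CIDR string"]
    errors = []
    bounds = {}
    for key, value in (("start", start), ("end", end)):
        try:
            bounds[key] = ipaddress.ip_address(value)
        except ValueError:
            errors.append(f"{dhcp_path}.{key} must be an IP address")
            continue
        if bounds[key] not in gateway.network:
            errors.append(f"{dhcp_path}.{key} must be inside {gateway.network}")
    # Only compare the bounds once both are known to be valid and in-subnet.
    if not errors and bounds["start"] > bounds["end"]:
        errors.append(f"{dhcp_path}.start must not be above {dhcp_path}.end")
    return errors
```

With the documented defaults, `check_lan_dhcp("172.20.30.1/24", "172.20.30.10", "172.20.30.254")` returns an empty list.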
Reapply Flow
Fresh provisioning remains unauthenticated because the device has no prior operator credential. Re-apply on an already provisioned device must require authentication before accepting config bytes.
Proposed flow:
- Receive config.toml or supported config bundle.
- If /data/config/config.toml already exists, require LAN-local authentication.
- Unpack and validate the candidate config in a temporary directory outside active /data/config state.
- Render all derived state into a candidate config directory.
- Snapshot or rename the previous /data/config into a rollback location.
- Atomically promote the candidate directory into /data/config.
- Apply LAN, firewall, and Quadlet sync using the same services as boot.
- Confirm required services become healthy.
- Delete or age out rollback state only after successful apply.
- Restore the previous config and re-apply it if activation fails.
Authentication uses an SSH-key challenge-response with an existing admin SSH key. The device issues a nonce for a short validity window, and the operator signs a request-bound message containing the nonce, target path, and SHA-256 digest of the raw request body. The device verifies the signature against active admin signer keys before accepting or processing request content. This keeps re-apply LAN-local, avoids default credentials, and reuses the existing key-only operator trust model.
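The request-bound message and nonce window can be sketched without committing to a signature scheme. Everything named here is hypothetical: the newline-joined message layout, the 60-second window, and the `NonceStore` and `signing_message` helpers. Verifying the SSH signature against the admin signer keys is deliberately left out.

```python
import hashlib
import secrets
import time


class NonceStore:
    """Hypothetical single-use nonce tracker with a short validity window."""

    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._expiry = {}

    def issue(self):
        nonce = secrets.token_hex(16)
        self._expiry[nonce] = self.clock() + self.ttl_s
        return nonce

    def consume(self, nonce):
        """Return True once for an unexpired nonce, then forget it."""
        expiry = self._expiry.pop(nonce, None)
        return expiry is not None and self.clock() <= expiry


def signing_message(nonce, target_path, body):
    """Bind the signature to the nonce, the target path, and the exact body bytes."""
    digest = hashlib.sha256(body).hexdigest()
    return "\n".join([nonce, target_path, digest]).encode("utf-8")
```

Because the digest covers the raw body, the device can check the signature over `signing_message(nonce, "/api/config", body)` before it parses any TOML, and a captured signature cannot be replayed with different content or after the nonce is consumed.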
Failure Handling
- Invalid TOML or schema errors return a non-2xx response and leave active config untouched.
- Failed candidate rendering leaves active config untouched.
- Failed authentication returns a non-2xx response before parsing or processing request content.
- Crash before promotion leaves active config untouched.
- Crash after promotion but before confirmation must be recoverable on next boot or next apply by detecting incomplete re-apply state.
- Failed service activation restores previous config and reports the failed services.
- Rollback must not delete container volumes or arbitrary `/data` content.
Documentation Impact
Likely affected pages:
- `docs/src/provisioning.md`
- `docs/src/provisioning/lan-range.md`
- `docs/src/data-flow.md`
- `docs/src/runtime-boundaries.md`
- `docs/src/tutorials/oidc-device-management.md`
- `docs/src/specs/lan-gateway.md`
- `docs/src/specs/update-confirmation.md`
- `docs/src/code-reference/scripts.md`
- `docs/src/code-reference/modules.md`
- `docs/src/features/caddy-authcrunch-cockpit-tutorial/design.md`
- `schemas/config.schema.json`
- `modules/base.nix`
- `modules/first-boot.nix`
- `example/caddy-oidc/config.toml`
Validation Plan
- Unit tests for schema validation, defaults, and path-specific error messages.
- Unit tests for legacy top-level config tables being rejected.
- Tests for `[users]` admin key extraction and empty-key handling.
- Tests for managed user creation/update/disable behavior.
- Tests for managed users being re-materialized from `/data/config` after reboot.
- Tests for `[network]` defaults matching current LAN gateway behavior.
- Tests for SSH-key challenge-response authentication success and failure paths.
- Tests for candidate config rendering without touching active `/data/config`.
- VM test for authenticated re-apply success.
- VM test for unauthenticated re-apply rejection on an already-provisioned device.
- VM test for invalid config preserving previous state.
- VM test for activation failure rolling back to previous config.
Risks
- Restructuring `config.toml` intentionally breaks earlier test configs; examples and docs must be updated with the code.
- Runtime user management conflicts with the current `users.mutableUsers = false` posture unless implemented as an explicit apply service that safely materializes `/data/config` user state on each boot.
- Authentication design can become too complex for local recovery if it depends on external services.
- Atomic directory replacement on `/data` must be implemented carefully on F2FS.
- Service rollback can restore config files but cannot guarantee application-level container data consistency.
- Adding a third-party schema dependency may increase image closure size.
Config Reapply Improvements Tasks
T000 - Review and confirm feature spec
- Confirm the canonical `config.toml` top-level sections: `[users]`, `[network]`, `[activation]`, `[os_upgrade]`, and `[containers]`.
- Reject legacy top-level `[admin]`, `[firewall]`, `[lan]`, `[container]`, `[network]`, `[volume]`, and `[build]` config without migration because AtomixOS is unreleased.
- Use SSH-key challenge-response with an existing admin key for re-apply authentication.
- Manage declared `[users.<name>]` local users in this feature.
T010 - Define the official config schema
- Replace `schemas/config.schema.json` with the new canonical schema used by validation and documentation.
- Define `[users]` schema with default `isAdmin = false` and default empty `ssh_key`.
- Add username validation and reserved-system-user rejection.
- Define `[network]` schema for dnsmasq and firewall rules.
- Define `[network]` schema for DNS servers, search domains, interfaces, and default gateway once runtime support is implemented.
- Define `[containers]` schema for nested container, network, volume, and build Quadlet units.
- Add cross-field validation for admin SSH keys, LAN subnet, DHCP range, port ranges, and required service references.
- Ensure schema errors include precise config paths and actionable messages.
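As an illustration of path-specific errors for the `[users]` rules above, a small in-repo validator could look like the following sketch. The username pattern, the reserved-name set, and the "at least one admin with a key" cross-field rule are assumptions chosen for the example; the real schema may differ.

```python
import re

# Assumed POSIX-style username rule; not taken from the project schema.
USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")
# Assumed reserved names for illustration only.
RESERVED_USERS = frozenset({"root", "daemon", "nobody", "sshd", "appsvc"})


def validate_users(users: dict) -> list[str]:
    """Return a list of 'config.path: message' errors for a parsed [users] table."""
    errors = []
    for name, entry in users.items():
        path = f"users.{name}"
        if not USERNAME_RE.fullmatch(name):
            errors.append(f"{path}: invalid username")
        elif name in RESERVED_USERS:
            errors.append(f"{path}: reserved system user")
        # Defaults from the spec: isAdmin = false, ssh_key = "".
        if not isinstance(entry.get("isAdmin", False), bool):
            errors.append(f"{path}.isAdmin: expected a boolean")
        if not isinstance(entry.get("ssh_key", ""), str):
            errors.append(f"{path}.ssh_key: expected a string")
    has_admin_key = any(
        e.get("isAdmin", False) and e.get("ssh_key", "") for e in users.values()
    )
    if not has_admin_key:
        errors.append("users: at least one admin with a non-empty ssh_key is required")
    return errors
```

Returning every error with its config path, rather than raising on the first one, lets the API report all problems in a rejected config at once.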
T020 - Implement config parser restructure
- Update `first-boot-provision.py` to parse `[users]` instead of top-level `[admin]`.
- Persist normalized managed user state under `/data/config` for boot-time and re-apply materialization.
- Render managed user state and SSH authorized keys for all declared users.
- Add a runtime user apply service that materializes managed users from persisted config on boot and re-apply.
- Lock or disable managed users removed during config re-apply.
- Refuse to mutate unmanaged existing local accounts during runtime user apply.
- Update LAN settings parsing to consume `[network]` while preserving current defaults.
- Update firewall parsing to consume firewall rules under `[network]`.
- Update Quadlet rendering to consume `[containers.container]`, `[containers.network]`, `[containers.volume]`, and `[containers.build]`.
- Keep rendered persistent outputs compatible with existing runtime services unless those services are intentionally updated.
T030 - Harden re-apply authentication
- Detect already-provisioned devices by active persisted config state.
- Add nonce issuance for short-lived re-apply authentication challenges.
- Verify request-bound SSH signatures against active admin user keys before accepting candidate config bytes.
- Require authentication for mutating bootstrap POST paths when active config exists.
- Keep first provisioning unauthenticated for fresh devices without existing operator credentials.
- Add tests for unauthenticated rejection and authenticated acceptance.
T040 - Implement atomic candidate apply
- Validate and render candidate config in a temporary candidate directory.
- Prevent candidate validation/rendering from mutating active `/data/config`.
- Promote candidate config to `/data/config` with a crash-safe directory replacement strategy.
- Preserve the previous config in a rollback location until apply is confirmed.
- Clean up stale candidate and rollback state safely.
T050 - Implement rollback on failed activation
- Apply LAN settings, firewall state, and Quadlet sync after candidate promotion.
- Confirm required services reach the expected active state.
- Restore previous config if apply or service confirmation fails.
- Re-apply previous LAN, firewall, and Quadlet state after rollback.
- Preserve failed-candidate managed users through rollback long enough for the restored apply to lock them.
- Return clear API errors describing validation or activation failures.
T060 - Update examples and operator docs
- Update provisioning docs for the new `config.toml` structure.
- Update data-flow and runtime-boundary docs for candidate apply and rollback state.
- Update LAN/network docs for `[network]` defaults and overrides.
- Update Caddy/AuthCrunch/Cockpit tutorial config and docs to use `[containers]`.
- Update code-reference docs for parser, rendered files, and API behavior.
T070 - Add automated validation
- Add unit tests for schema defaults and invalid key rejection.
- Add unit tests for users/admin SSH key extraction.
- Add unit tests for managed user creation/update/disable behavior.
- Add boot or VM coverage proving managed users are materialized from `/data/config` after reboot (deferred: requires persistent VM disk).
- Add unit tests for network defaults, dnsmasq defaults, and firewall rule rendering.
- Add unit tests for nested `[containers]` Quadlet rendering.
- Add unit tests for SSH-key challenge-response authentication.
- Add VM or integration test for successful authenticated re-apply.
- Add VM or integration test for invalid config preserving active state.
- Add VM or integration test for activation failure rollback.
T999 - Final verification and release readiness
- Run the repository’s relevant formatting, unit, and Nix checks.
- Verify docs, examples, specs, and implementation all describe the same `config.toml` contract.
- Verify no unauthenticated re-apply path remains on already-provisioned devices.
- Verify first provisioning still works from `/boot/config.toml`, USB, and bootstrap UI.
- Record any intentionally deferred compatibility or migration work before merging.
Deferred work recorded at merge time:
- `[network]` schema fields for DNS servers, search domains, interfaces, and default gateway are deferred until runtime support is implemented.
- Managed user reboot materialization VM test deferred; requires persistent VM disk configuration.
- Provisioned-device re-apply authentication applies to all mutating bootstrap POST paths: `/api/config`, `/apply`, and `/generate`.
- Existing test devices with pre-`version = 2` state must be reprovisioned; no `/data/config/ssh-authorized-keys/admin` migration is included for unreleased config shapes.
Feature: provisioning-api-service
Overview
Build the provisioning implementation as a long-lived Litestar API service rather
than a one-off first-boot importer. The config bundle and config.toml remain the
bootstrap, backup, restore, and clone format, but runtime configuration changes
should increasingly flow through a typed API surface backed by the same validation,
candidate rendering, atomic promotion, activation, health-check, and rollback
pipeline.
The first step was replacing the monolithic first-boot-provision.py with the
atomixos-provision Python package, Litestar + uvicorn, module-level tests, SSH
signature authentication, async jobs, and structured deployment progress. The next
step is to harden the package into a service foundation that can support dynamic
partial reconfiguration without creating divergent mutation paths.
Source
docs/src/planned-features.md — originally tracked as “Bootstrap provisioning
subproject” / provision-restructure. Reframed as provisioning-api-service
after comparing the implementation against the Litestar fullstack reference
application at /Users/DeRoseR/workspace/personal/litestar-fullstack.
Goals
- Keep `config.toml` and config bundles as the canonical import/export format for first boot, backups, restore, and cloning deployments.
- Treat the running provisioning service as the canonical control plane for future dynamic changes.
- Ensure every mutation path uses the same state machine:
  - load current desired state
  - apply a full config import or typed partial change
  - validate the resulting full desired state
  - render candidate state
  - promote atomically
  - activate runtime services
  - report structured job progress
  - roll back on activation or required-health failure
- Keep the current Litestar + uvicorn foundation, SSH-signature authentication, first-boot auth bypass, socket activation, and single-flight job execution.
- Move from raw route functions returning open-ended dictionaries toward typed controllers, services, schemas, and exception handling suitable for a larger API.
- Preserve the current device constraints: small closure, read-only rootfs, F2FS `/data`, no default credentials, and no unnecessary database/Redis dependency.
Non-Goals
- Replacing `config.toml` or config bundles as bootstrap/backup/clone artifacts.
- Adding a database, Redis, SAQ, OAuth, JWT, or fleet-management dependency.
- Modifying the NixOS module interface beyond what’s needed for the new package.
- Boot UI redesign or HTMX integration (planned as follow-up feature).
- Multi-device orchestration.
- Implementing every partial config API in this feature. This feature establishes the service architecture those APIs should use.
Constraints
- Must fit within the existing 1 GB squashfs rootfs closure.
- Litestar + uvicorn must be available in nixpkgs or trivially packageable.
- Must preserve systemd socket activation (uvicorn accepts inherited fd via `LISTEN_FDS`/`LISTEN_PID` environment variables, matching current behavior).
- Must preserve the SSH signature authentication contract:
  - `GET /api/nonce` issues a single-use `secrets.token_urlsafe(32)` nonce (TTL 300s).
  - Signed message format: `"atomixos-reapply-v1\nnonce:{nonce}\npath:{request_path}\nsha256:{payload_sha256_hex}\n"`
  - Headers: `X-AtomixOS-Nonce` + `X-AtomixOS-Signature` (base64 SSH sig blob).
  - Verification via `ssh-keygen -Y verify` against `{config_root}/admin-signers`.
- Must preserve the first-boot provisioning flow without SSH signatures. The Boot UI form includes an in-memory bootstrap token to prevent cross-site form posts; this is a CSRF control, not operator authentication. Programmatic `/api/config` submissions do not require the Boot UI token before initial provisioning.
- No default credentials in any state.
- Python 3.11+ (uses `tomllib` from stdlib).
- Litestar + uvicorn are now part of the provisioning package closure; future service-foundation changes must avoid adding heavyweight runtime dependencies unless they solve a concrete device requirement.
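The signed message format in the authentication contract above can be assembled as in this sketch. The function name `signed_message` and the choice of UTF-8 encoding are assumptions; the field order and line layout follow the format string given in the contract.

```python
import hashlib


def signed_message(nonce: str, request_path: str, body: bytes) -> bytes:
    """Build the request-bound message that the operator signs.

    Binding the nonce, path, and body digest together means a captured
    signature cannot be reused for a different payload or endpoint.
    """
    payload_sha256_hex = hashlib.sha256(body).hexdigest()
    message = (
        "atomixos-reapply-v1\n"
        f"nonce:{nonce}\n"
        f"path:{request_path}\n"
        f"sha256:{payload_sha256_hex}\n"
    )
    return message.encode("utf-8")  # encoding is an assumption of this sketch
```

The client and the device must build byte-identical messages, so both sides would hash the raw request body exactly as sent rather than a re-serialized form.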
Architecture
Current Package Layout
scripts/atomixos_provision/
├── pyproject.toml
├── src/
│ └── atomixos_provision/
│ ├── __init__.py
│ ├── app.py # Litestar application factory, route wiring
│ ├── auth.py # SSH signature verification guard + nonce manager
│ ├── config.py # config.toml parsing and schema validation
│ ├── config_builder.py # Build config TOML from structured inputs (future use)
│ ├── quadlet.py # Quadlet unit rendering (container, network, volume, build)
│ ├── quadlet_sync.py # Copy rendered units to rootful/rootless target dirs
│ ├── activation.py # Activation script runner + service health checks + rollback
│ ├── jobs.py # Async job manager (single-flight, status tracking)
│ ├── provision.py # First-boot and re-apply orchestration
│ ├── bundle.py # Bundle import (tar extraction, file placement, tokens)
│ ├── ui.py # Boot UI HTML routes (/, /apply) — sync adapters
│ └── server.py # Uvicorn entry point, sd_listen_fds socket activation
├── tests/
│ ├── conftest.py
│ ├── test_auth.py
│ ├── test_config.py
│ ├── test_config_builder.py
│ ├── test_quadlet.py
│ ├── test_activation.py
│ ├── test_jobs.py
│ ├── test_provision.py
│ └── test_bundle.py
└── README.md # Developer notes (not user-facing docs)
Target Service Layout
The package should evolve toward explicit domain modules. Avoid the full
litestar-fullstack auto-discovery/plugin stack for now; explicit route wiring is
smaller, easier to audit, and better suited to an appliance. Adopt the separation
of concerns, not the whole dependency stack.
scripts/atomixos_provision/src/atomixos_provision/
├── app.py # explicit Litestar app factory and route registration
├── server.py # CLI + uvicorn + systemd socket activation
├── settings.py # small env/default settings object
├── deps.py # dependency providers for settings, services, state
├── exceptions.py # domain errors -> HTTP responses
├── domain/
│ ├── auth/
│ │ ├── controller.py # nonce/auth-related API routes
│ │ ├── service.py # nonce and SSH signature verification helpers
│ │ └── schemas.py # NonceResponse, auth errors if needed
│ ├── config/
│ │ ├── controller.py # /api/config, /api/validate, future partial APIs
│ │ ├── service.py # import/export/patch orchestration facade
│ │ └── schemas.py # typed request/response DTOs
│ ├── jobs/
│ │ ├── controller.py # /api/jobs/{id}
│ │ ├── service.py # single-flight job manager facade if needed
│ │ └── schemas.py # JobResponse, JobEvent
│ └── system/
│ ├── controller.py # /api/health and system status
│ └── schemas.py
├── provision.py # core candidate/promote/activate orchestration
├── activation.py # activation hook, service status, rollback
├── config.py # config parser and validation
├── config_builder.py # config generation from form/API inputs
├── quadlet.py # render Quadlet desired state
├── quadlet_sync.py # sync rendered Quadlet units
├── bundle.py # config bundle extraction/import/export helpers
└── ui.py # Boot UI routes until HTMX/server components are split out
The target layout should remain intentionally smaller than the Litestar reference application. Domain auto-discovery, SQLAlchemy repositories, SAQ/Redis workers, OAuth, Vite, and email plugins are not part of this foundation.
HTTP Endpoints
| Method | Path | Auth | Response | Description |
|---|---|---|---|---|
| GET | / | none | HTML | Boot UI page |
| GET | /api/nonce | none | JSON | Issue single-use nonce for auth |
| GET | /api/health | none | JSON | Liveness check |
| GET | /api/jobs/{id} | job UUID | JSON | Poll async job status |
| GET | /assets/atomixos.png | none | image | Static logo |
| POST | /api/config | SSH sig (provisioned) / none (first-boot) | JSON | Submit config, returns job ID (async) |
| POST | /api/validate | SSH sig | JSON | Validate config without applying |
| POST | /apply | bootstrap token (first-boot only) | HTML | Form upload → sync apply → result page |
Future dynamic API endpoints should be typed resource operations that reuse the same config service and job pipeline, for example:
| Method | Path | Description |
|---|---|---|
| GET | /api/config/current | Return normalized current desired state |
| GET | /api/config/export | Export current config bundle for backup/clone |
| PATCH | /api/config/users/{name} | Apply a typed user change through candidate promotion |
| PATCH | /api/config/network | Apply typed network changes through candidate promotion |
| PATCH | /api/config/containers/{name} | Apply typed container changes through candidate promotion |
Endpoint Architecture
All endpoints share a common core:
/api/config ─→ parse raw body ─→ jobs.submit(provision.apply) ─→ JSON {job_id}
/apply ─→ parse multipart ─→ provision.apply(sync) ─→ HTML result
- `/api/config` uses the async job manager; returns immediately with job ID.
- `/apply` calls the provision core synchronously for first-boot upload/paste only.
POST /api/config returns 202 Accepted with job_id, initial state,
job_url, and a Location header pointing at /api/jobs/{id}. Clients must
poll the job resource for final success, failure, deployment progress, rollback
status, and forwarding URL.
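A client-side polling loop for the job resource could be sketched as follows. The helper name `wait_for_job`, the injected `fetch_job` callable (standing in for an HTTP `GET /api/jobs/{id}` that returns parsed JSON), and the interval and timeout defaults are all assumptions for illustration.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval: float = 1.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        job = fetch_job()
        if job["state"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)
```

Injecting `fetch_job`, `sleep`, and `clock` keeps the loop independent of any particular HTTP client and makes it testable without real delays.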
Control-Plane Model
The service should have one mutation engine. Full config imports and future partial API calls differ only in how the desired state is produced.
POST /api/config
-> parse bundle/config.toml
-> validate full desired state
-> render/promote/activate/rollback
PATCH /api/config/users/admin
-> load active desired state
-> apply typed patch
-> validate full desired state
-> render/promote/activate/rollback
Do not allow dynamic API calls to directly mutate derived files under /data/config
or runtime systemd/Quadlet state. The rendered files remain derived state, not the
primary API model.
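One way to picture the single mutation engine is a function that takes a "producer" of desired state and runs it through the same validate-then-apply path. This is a sketch under stated assumptions: desired state is modeled as a plain dictionary, and `mutate`, `full_import`, and `patch_user` are hypothetical names; `apply` stands in for the render/promote/activate/rollback pipeline.

```python
from copy import deepcopy
from typing import Callable

DesiredState = dict


def mutate(
    current: DesiredState,
    produce: Callable[[DesiredState], DesiredState],
    validate: Callable[[DesiredState], list],
    apply: Callable[[DesiredState], None],
) -> DesiredState:
    """Single mutation path: produce candidate, validate it in full, then apply."""
    candidate = produce(deepcopy(current))  # never edit the active state in place
    errors = validate(candidate)
    if errors:
        raise ValueError("; ".join(errors))
    apply(candidate)
    return candidate


def full_import(new_state: DesiredState) -> Callable[[DesiredState], DesiredState]:
    """Producer for POST /api/config: the imported config replaces everything."""
    return lambda _current: deepcopy(new_state)


def patch_user(name: str, changes: dict) -> Callable[[DesiredState], DesiredState]:
    """Producer for a typed user patch: merge changes into one user entry."""
    def produce(state: DesiredState) -> DesiredState:
        state.setdefault("users", {}).setdefault(name, {}).update(changes)
        return state
    return produce
```

Because both producers feed the same `mutate` call, a partial API cannot skip full-state validation or the rollback-capable apply step.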
Reconciliation Bookends
The API and config bundle paths must round-trip through the same desired-state model. Any future partial API must include these reconciliation points:
- Import bookend: Convert `config.toml` or a config bundle into normalized desired state before validation and rendering.
- Patch bookend: Apply typed API changes to the normalized desired state, not directly to rendered files.
- Validation bookend: Validate the complete resulting desired state after any import or patch.
- Export bookend: Export the active desired state back to `config.toml` or a config bundle so backups and deployment cloning remain equivalent to API-managed state.
- Drift bookend: Treat files under `/data/config/` as derived from the active desired state. If a future API detects derived-state drift, it should report it and re-render through the normal candidate pipeline rather than patching files in place.
Typed API Schemas
Job and API responses should be explicit typed schemas rather than ad hoc
dict[str, Any] values. At minimum, define typed models for:
- `NonceResponse`
- `SubmitConfigResponse`
- `ValidateConfigResponse`
- `ProvisionResult`
- `JobResponse`
- `JobEvent`
- `ServiceDeployEvent`
- `ServiceStatusEvent`
The current job response shape is:
{
  "id": "...",
  "state": "running | succeeded | failed",
  "current_step": "service-status",
  "events": [
    {
      "step": "service-status",
      "elapsed_seconds": 32.71,
      "message": "caddy-gateway.service (rootful) is running",
      "service": "caddy-gateway.service",
      "mode": "rootful",
      "status": "running"
    }
  ]
}
This response shape should be preserved and formalized with schemas so clients do not parse human-readable strings.
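A typed version of that shape could be sketched with standard-library dataclasses, as below. Treating `service`, `mode`, and `status` as optional fields that are omitted when unset is an assumption of this sketch, as is the `to_dict` helper; the field names come from the response shape shown above.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class JobEvent:
    step: str
    elapsed_seconds: float
    message: str
    # Service fields are only present on service-related events (assumed).
    service: Optional[str] = None
    mode: Optional[str] = None      # e.g. "rootful" or "rootless"
    status: Optional[str] = None    # e.g. "building", "starting", "running"


@dataclass
class JobResponse:
    id: str
    state: str                      # "running", "succeeded", or "failed"
    current_step: str
    events: list[JobEvent] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Serialize to the documented JSON shape, dropping unset event fields."""
        data = asdict(self)
        data["events"] = [
            {key: value for key, value in event.items() if value is not None}
            for event in data["events"]
        ]
        return data
```

With explicit fields, clients can branch on `status` and `service` instead of parsing the human-readable `message`.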
Job Lifecycle
SUBMITTED → RUNNING → SUCCEEDED
                    ↘ FAILED (+ rollback_status: completed | failed | skipped)
- Only one job at a time; concurrent submissions return 409 Conflict.
- Job state persists in memory (lost on restart; acceptable — single-request model).
- Client polls `GET /api/jobs/{id}` for completion.
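The single-flight rule can be illustrated with a non-blocking lock: a second submission fails immediately instead of queueing. This is a minimal sketch using a background thread; the names `SingleFlightJobs` and `JobBusy` are hypothetical, and the real job manager may use a different concurrency primitive.

```python
import threading
import uuid
from typing import Callable


class JobBusy(Exception):
    """Raised when a job is already running; an HTTP layer would map this to 409."""


class SingleFlightJobs:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.jobs: dict[str, dict] = {}  # in-memory only; lost on restart

    def submit(self, fn: Callable[[], None]) -> str:
        # Non-blocking acquire: reject instead of waiting for the running job.
        if not self._lock.acquire(blocking=False):
            raise JobBusy("another apply job is running")
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"id": job_id, "state": "submitted"}
        threading.Thread(target=self._run, args=(job_id, fn), daemon=True).start()
        return job_id

    def _run(self, job_id: str, fn: Callable[[], None]) -> None:
        job = self.jobs[job_id]
        job["state"] = "running"
        try:
            fn()
            job["state"] = "succeeded"
        except Exception as exc:
            job["state"] = "failed"
            job["error"] = str(exc)
        finally:
            self._lock.release()  # allow the next submission
```

Releasing the lock in `finally` ensures a failed job does not block every later submission.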
Structured job events should distinguish provisioning steps from service deployment state:
- `prepare`
- `recover`
- `validate`
- `write-candidate`
- `promote`
- `service-deploy`
- `activate`
- `service-status`
- `health-check`
- `rollback`
- `cleanup`
- `complete`
Service events should include service, mode, and status fields. Status values
currently include building, starting, running, failed, and unknown. True
live pulling status is deferred until the activation path can stream journal,
Podman events, or direct Podman operations.
Settings And Dependencies
Add a small settings layer rather than scattering constants through handlers and services. This should stay simple and environment-backed:
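from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class AppSettings:
    config_root: Path = Path("/data/config")
    host: str = "172.20.30.1"
    port: int = 8080
    app_runtime_user: str = "appsvc"
    max_source_bytes: int = MAX_SOURCE_BYTES  # constant assumed to be defined elsewhere in the package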
Use Litestar dependency providers for settings and service facades once controllers are introduced. This keeps route handlers thin and makes CLI/background paths use the same service code as HTTP paths.
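An environment-backed loader for such a settings object might look like this sketch. The environment variable names (`ATOMIXOS_CONFIG_ROOT`, `ATOMIXOS_HOST`, `ATOMIXOS_PORT`, `ATOMIXOS_RUNTIME_USER`), the `load_settings` function, and the placeholder value for `MAX_SOURCE_BYTES` are assumptions made for the example.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path

MAX_SOURCE_BYTES = 16 * 1024 * 1024  # placeholder value; the real limit is not stated here


@dataclass(frozen=True)
class AppSettings:
    config_root: Path = Path("/data/config")
    host: str = "172.20.30.1"
    port: int = 8080
    app_runtime_user: str = "appsvc"
    max_source_bytes: int = MAX_SOURCE_BYTES


def load_settings(environ: Mapping[str, str] = os.environ) -> AppSettings:
    """Build settings from the environment, falling back to dataclass defaults."""
    defaults = AppSettings()
    return AppSettings(
        config_root=Path(environ.get("ATOMIXOS_CONFIG_ROOT", str(defaults.config_root))),
        host=environ.get("ATOMIXOS_HOST", defaults.host),
        port=int(environ.get("ATOMIXOS_PORT", str(defaults.port))),
        app_runtime_user=environ.get("ATOMIXOS_RUNTIME_USER", defaults.app_runtime_user),
        max_source_bytes=defaults.max_source_bytes,
    )
```

Passing the environment as a parameter lets tests and CLI paths construct settings without mutating `os.environ`.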
Exception Handling
Introduce a small exception module that maps domain errors to consistent HTTP responses:
- `ProvisionError` -> `400 Bad Request`
- auth missing/invalid -> `401 Unauthorized`
- permission denied -> `403 Forbidden` if needed
- busy job -> `409 Conflict`
- unknown job/resource -> `404 Not Found`
- unexpected error -> `500 Internal Server Error`
The goal is consistent JSON error bodies for API clients while preserving useful HTML errors for Boot UI routes.
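Independent of any web framework, the mapping above can be expressed as a small lookup. In this sketch only `ProvisionError` is a name from this document; the other exception classes, the `to_http` helper, and the `{"error", "detail"}` body shape are assumptions for illustration.

```python
class ProvisionError(Exception):
    """Validation or provisioning failure caused by the submitted config."""


class AuthError(Exception):
    """Missing or invalid authentication (hypothetical name)."""


class JobBusyError(Exception):
    """A mutating job is already running (hypothetical name)."""


class NotFoundError(Exception):
    """Unknown job or resource (hypothetical name)."""


# Ordered list so the first matching type wins.
STATUS_BY_ERROR = [
    (AuthError, 401),
    (NotFoundError, 404),
    (JobBusyError, 409),
    (ProvisionError, 400),
]


def to_http(exc: Exception) -> tuple[int, dict]:
    """Translate a domain error into (status_code, JSON body)."""
    for error_type, status in STATUS_BY_ERROR:
        if isinstance(exc, error_type):
            return status, {"error": error_type.__name__, "detail": str(exc)}
    # Do not leak internals of unexpected errors to API clients.
    return 500, {"error": "InternalServerError", "detail": "unexpected error"}
```

Keeping the table in one module means API routes and Boot UI routes can share the classification while rendering JSON or HTML respectively.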
API Schema Hygiene
Keep operation IDs, tags, summaries, and typed response models on controllers so the API contract remains explicit in code and live OpenAPI schema routes can be used by online clients. Suggested tags:
- `System`
- `Auth`
- `Config`
- `Jobs`
- `Provisioning`
This is useful for client generation, API discovery, and tests as the control-plane API grows.
Future Dynamic API Direction
The current API intentionally keeps config bundle import as the only mutation surface. Future typed partial APIs must be designed around normalized desired state, not direct edits to rendered runtime artifacts.
Planned read/export bookends:
- `GET /api/config/current` returns the normalized current desired state loaded from `/data/config/config.toml` plus any API-managed fields once those exist.
- `GET /api/config/export` returns a backup/clone config bundle generated from normalized desired state and managed files, preserving the config bundle as the portable artifact.
Planned partial mutation examples:
- `PATCH /api/config/users/{name}` applies typed user changes.
- `PATCH /api/config/network` applies typed LAN, DNS, NTP, and firewall changes.
- `PATCH /api/config/containers/{name}` applies typed container changes.
Every partial mutation must run the same safety pipeline as full config import:
- Load current normalized desired state.
- Apply the typed patch in memory.
- Validate the full resulting desired state.
- Render candidate state under the candidate config root.
- Promote atomically through the existing F2FS-safe promotion path.
- Activate runtime services and report job progress.
- Roll back on activation or required health-check failure.
Partial APIs must not directly mutate files under /data/config/quadlet/, sync
systemd/Quadlet search paths, or edit runtime systemd state. Import/export
round-trip tests must land before implementing partial mutation endpoints so
API-managed state can always be backed up or cloned as a config bundle. Drift
detection should report differences between normalized desired state and rendered
files under /data/config/, but drift reports are read-only and must not repair
state outside the safe apply pipeline.
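A read-only drift report could compare what the renderer would produce against what is on disk, as in the sketch below. The function name `drift_report`, the three report categories, and the idea of passing expected output as a mapping of relative path to bytes are assumptions; the function only reads files and never repairs them.

```python
from pathlib import Path


def drift_report(config_root: Path, expected: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare rendered-from-desired-state files with files under config_root.

    `expected` maps relative POSIX paths to the bytes the renderer would write.
    """
    missing, modified = [], []
    for rel_path, content in sorted(expected.items()):
        path = config_root / rel_path
        if not path.is_file():
            missing.append(rel_path)
        elif path.read_bytes() != content:
            modified.append(rel_path)
    on_disk = {
        p.relative_to(config_root).as_posix()
        for p in config_root.rglob("*")
        if p.is_file()
    }
    unexpected = sorted(on_disk - set(expected))
    return {"missing": missing, "modified": modified, "unexpected": unexpected}
```

A non-empty report would then be resolved by re-rendering through the normal candidate pipeline, not by editing the listed files.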
Activation Model
Two-phase activation (preserving current behavior):
- Activation script: External script path from `ATOMIXOS_BOOTSTRAP_ACTIVATION` env var, run with 300s timeout.
- Health checks: Read `health-required.json`, check each required service via `systemctl is-active` (rootful) or `runuser -u appsvc -- systemctl --user is-active` (rootless).
- Rollback: On any failure, restore rollback directory → active, re-run activation with old config.
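The health-check step can be sketched as a function that turns the required-service list into the commands named above. The JSON layout assumed here (a list of objects with `service` and `mode` keys) is a guess for illustration, since the actual format of `health-required.json` is not shown in this document; `health_commands` is a hypothetical name.

```python
import json


def health_commands(health_required: str, runtime_user: str = "appsvc") -> list[list[str]]:
    """Build one `is-active` command per required service.

    Assumed input shape: [{"service": "x.service", "mode": "rootful" | "rootless"}, ...]
    """
    commands = []
    for entry in json.loads(health_required):
        service, mode = entry["service"], entry["mode"]
        if mode == "rootful":
            commands.append(["systemctl", "is-active", service])
        elif mode == "rootless":
            # Query the runtime user's systemd instance instead of the system one.
            commands.append(
                ["runuser", "-u", runtime_user, "--",
                 "systemctl", "--user", "is-active", service]
            )
        else:
            raise ValueError(f"unknown mode for {service}: {mode}")
    return commands
```

Each command can then be run with `subprocess.run`; for a single unit, `systemctl is-active` exits with status 0 only when that unit is active.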
Socket Activation
Uvicorn accepts the systemd-passed file descriptor. Current code already parses
LISTEN_FDS/LISTEN_PID and wraps fd 3 into a socket. The new server.py will pass
this fd to uvicorn via --fd 3 or programmatic server configuration.
The socket unit (atomixos-bootstrap.socket) initially listens on
0.0.0.0:8080 for first provisioning. After LAN settings are applied,
lan-gateway-apply.py writes a socket override for the configured
gateway_ip, then schedules a delayed restart of the socket/service. The delay
lets clients poll the original apply job before following the result’s
forwarding_url to the configured LAN endpoint.
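The `LISTEN_FDS`/`LISTEN_PID` handling follows the systemd socket-activation convention, in which inherited descriptors start at fd 3. The sketch below shows that parsing in isolation; the function names are hypothetical and this is not the project's `server.py`.

```python
import os
import socket
from collections.abc import Mapping
from typing import Optional

SD_LISTEN_FDS_START = 3  # first inherited descriptor under systemd socket activation


def inherited_fds(environ: Mapping[str, str] = os.environ,
                  pid: Optional[int] = None) -> list[int]:
    """Return the descriptors systemd passed to this process, if any."""
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID guards against a child process inheriting the variables by accident.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    count = int(environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def inherited_socket(fd: int) -> socket.socket:
    """Wrap an inherited descriptor; family and type are detected from the fd."""
    return socket.socket(fileno=fd)
```

When `inherited_fds()` returns an empty list, a server would fall back to binding its configured host and port itself.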
Dependencies
- New: Litestar
- New: uvicorn (pure Python mode, no uvloop)
- New (dev): pytest, httpx (test client), ruff
- Existing: tomllib (stdlib 3.11+), openssh (ssh-keygen), gzip, zstd, systemd, util-linux (runuser)
Parallelization and execution model:
- Mutating apply jobs remain single-flight per device to protect `/data/config` and runtime activation ordering.
- Read-only operations such as health, nonce issuance, job polling, validation, and future export/status reads may run concurrently.
- Future partial mutation endpoints must submit work through the same job manager or an equivalent single-flight mutation gate.
Explicitly avoid adding these until there is a concrete need:
- SQLAlchemy / database repository stack
- Redis / SAQ
- OAuth/JWT auth stack
- Vite/SPA integration
- domain auto-discovery plugin
Risks and Tradeoffs
- Migration risk: Behavioral regressions from the first-phase rewrite or the controller/service split. Mitigated by pytest covering each module and existing NixOS VM integration tests continuing to pass.
- Closure size: Adding first-ever third-party Python packages. Litestar + uvicorn add runtime dependencies. Must verify after integration that rootfs stays within 1 GB.
- Socket activation with uvicorn: Uvicorn supports `--fd` for inherited sockets. Needs verification on aarch64. Current code already does sd_listen_fds parsing, so the pattern is proven.
- Async complexity: Limited to the job manager path. HTML routes remain synchronous. Core provision logic is synchronous; the job manager wraps it in a background task.
- Third-party dep risk: Moving from zero deps to Litestar creates an upstream dependency. Litestar must be pinned and available in nixpkgs.
- Over-abstracting too early: The Litestar fullstack example includes many layers we do not need. Mitigate by adopting typed controllers/services/settings/errors only, and keeping app assembly explicit.
- Divergent mutation paths: Partial APIs could accidentally bypass the safe import/reapply pipeline. Mitigate by forcing every mutation through the same config service and candidate promotion flow.
- API/bundle drift: API-managed state could stop exporting to the same `config.toml`/bundle contract. Mitigate with import/export round-trip tests and by making normalized desired state the source for both API patches and exports.
Affected Files and Modules
- `scripts/first-boot-provision.py`: compatibility entry point / legacy wrapper behavior aligned with the new package
- `modules/first-boot.nix`: updated to reference new package, add Python deps
- `nix/tests/first-boot-provision.nix`: must continue passing
- `nix/tests/first-boot-source-discovery.nix`: must continue passing
- `docs/src/planned-features.md`: mark feature complete when done
- `docs/src/provisioning.md`: first-boot and runtime provisioning behavior
- `docs/src/data-flow.md`: persisted state and re-apply flow
- `docs/src/runtime-boundaries.md`: API semantics and config/runtime boundary
- `docs/src/reference/project-structure.md`: package layout
- `docs/src/code-reference/scripts.md`: runtime scripts and provisioning CLI notes
- `docs/src/testing.md`: unit, lint, VM, and manual validation commands
Success Criteria
- `scripts/atomixos_provision/` owns the provisioning implementation while the existing `first-boot-provision` command remains available for scripts, tests, and operators.
- `pyproject.toml` defines the package with all dependencies.
- Existing HTTP endpoint paths and authentication semantics are preserved, with documented response-shape changes for the async job API.
- Async job API works: POST `/api/config` returns job ID, GET `/api/jobs/{id}` returns status, concurrent submissions return 409.
- pytest suite passes with >80% coverage on module boundaries.
- Existing NixOS VM integration tests pass unchanged.
- Rootfs closure stays within 1 GB.
- `ruff check` and `ruff format` pass.
- API response shapes for jobs and validation are typed and documented.
- The package has an explicit path for future partial API operations that reuses the full import/reapply safety pipeline.
- Import/export reconciliation is documented so config bundles remain equivalent to API-managed desired state.
- Affected documentation pages are updated in the same unit of work as service API behavior changes.
Validation
- `pytest` on host for unit tests.
- `nix build` to verify closure size.
- Existing NixOS VM tests: `nix/tests/first-boot-provision.nix`, `nix/tests/first-boot-source-discovery.nix`.
- New NixOS VM test scenario: authenticated re-apply with async job polling.
- `ruff check` and `ruff format` pass.
- API schema/serialization tests cover job response shape and service deployment event fields.
- Import/export round-trip tests are added before implementing partial mutation APIs.
- Documentation search confirms no stale references describe `/api/config` as a synchronous success response.
Follow-Up Features
- Boot UI HTMX redesign: Convert Boot UI to HTMX-powered server-rendered partials on the clean Litestar foundation. Convert `/apply` to async with progress indicators.
- Dynamic partial reconfiguration API: Add typed PATCH/PUT endpoints for users, network, containers, and other desired-state resources, all backed by the same candidate promotion and rollback pipeline.
Tasks: provisioning-api-service
Feature Spec And Setup
- Create feature branch and worktree
- Draft and review `design.md`
- Reframe feature from `provision-restructure` to `provisioning-api-service`
- Create `scripts/atomixos_provision/pyproject.toml` with deps and metadata
- Update the design to use Litestar instead of the original Starlette direction
- Compare against Litestar fullstack reference and record applicable patterns
Package Structure
- Create `src/atomixos_provision/` package with `__init__.py`
- Create module files: app, auth, config, config_builder, quadlet, quadlet_sync, activation, jobs, provision, bundle, ui, server
- Create `tests/` directory with `conftest.py`
Config Parsing And Generation
- Move `config.toml` parsing and schema validation to `config.py`
- Preserve `tomllib` usage and validation rules
- Keep config generation logic in `config_builder.py` for tests/future use; no `/generate` route is exposed
- Add tests covering config parsing and config generation behavior
Authentication
- Move SSH signature verification logic to `auth.py`
- Implement nonce issuance, TTL, and single-use consumption
- Implement Litestar guards with first-boot bypass
- Preserve `ssh-keygen -Y verify` subprocess verification
- Require SSH auth after provisioning for `/api/config` and `/api/validate`
- Keep job polling authorized by unguessable job UUID only
- Add tests covering valid signatures, invalid signatures, expired nonces, replay, and unprovisioned bypass
Quadlet Rendering And Sync
- Move container, network, volume, and build rendering to `quadlet.py`
- Move `quadlet-runtime.json` tracking logic
- Move rendered-unit copy logic to `quadlet_sync.py`
- Add tests covering rendering and sync behavior
Bundle Import
- Move tar.gz/tar.zst extraction and file placement to `bundle.py`
- Preserve `${CONFIG_DIR}` and `${FILES_DIR}` token substitution behavior
- Add tests covering bundle import behavior
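The `${CONFIG_DIR}` and `${FILES_DIR}` token behavior can be illustrated with a single-pass substitution. This is a sketch only: the function name `substitute_tokens` is hypothetical, the example paths in the usage are made up, and leaving unknown `${...}` tokens untouched is an assumption about the desired behavior.

```python
import re

# Only the two documented tokens are recognized; anything else is left as-is.
TOKEN_RE = re.compile(r"\$\{(CONFIG_DIR|FILES_DIR)\}")


def substitute_tokens(text: str, config_dir: str, files_dir: str) -> str:
    """Replace bundle tokens in one pass so substituted values are never re-scanned."""
    values = {"CONFIG_DIR": config_dir, "FILES_DIR": files_dir}
    return TOKEN_RE.sub(lambda match: values[match.group(1)], text)
```

For example, `substitute_tokens("Volume=${FILES_DIR}/caddy:/etc/caddy", "/data/config", "/data/config/files")` yields `Volume=/data/config/files/caddy:/etc/caddy`.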
Activation And Rollback
- Move activation script execution to `activation.py`
- Move rootful and rootless service health checks to `activation.py`
- Move candidate, active, and rollback config swap handling to `activation.py`
- Add F2FS-safe parent-directory fsync during promotion
- Add tests covering activation and rollback behavior
Async Job API
- Create `jobs.py` with single-flight job execution
- Define job states: SUBMITTED, RUNNING, SUCCEEDED, FAILED
- Track rollback status in failed jobs
- Implement mutual exclusion for concurrent submissions
- Bound retained job history to avoid unbounded memory growth
- Add tests for concurrent submission, state transitions, and cleanup
Litestar HTTP Application
- Create app.py with Litestar app factory
- Wire API routes: GET /api/nonce, POST /api/config, GET /api/jobs/{id}, GET /api/health, POST /api/validate
- Integrate SSH auth guards with first-boot bypass
- Integrate job manager for POST /api/config
Boot UI Routes
- Create ui.py with HTML form endpoints
- GET / – serve Boot UI HTML
- GET /assets/atomixos.png – serve static logo
- POST /apply – multipart form to sync provision to HTML result
- Do not expose /generate; first-boot UI only uploads or pastes a prepared config
- Escape user-controlled HTML output
Server Entry Point
- Create server.py with click-based CLI
- Implement commands: serve, validate, import, recover, sync-quadlet
- Implement sd_listen_fds socket inheritance from systemd
- Preserve systemd unit compatibility
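For reference, the systemd socket-activation protocol behind sd_listen_fds is small enough to sketch in plain Python: systemd sets LISTEN_PID and LISTEN_FDS, and inherited descriptors start at 3. The helper names below are illustrative assumptions, not the functions in server.py.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor, as defined by systemd


def listen_fds(environ=None, pid=None) -> list:
    """Return the file descriptors passed by systemd, or [] if there are none."""
    env = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        # LISTEN_PID guards against a child process inheriting the variables by accident.
        if int(env.get("LISTEN_PID", "")) != pid:
            return []
        count = int(env.get("LISTEN_FDS", ""))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))


def inherited_socket(environ=None):
    """Wrap the first inherited descriptor in a socket object, if present."""
    fds = listen_fds(environ)
    return socket.socket(fileno=fds[0]) if fds else None
```

When no descriptors are inherited, a server would normally fall back to binding its own socket, which keeps the same entry point usable outside systemd.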
Nix Integration
- Update modules/first-boot.nix to reference the new Python package
- Build Python environment with Litestar, uvicorn, and the new package
- Bind first-boot socket to 0.0.0.0:8080, then rebind to provisioned LAN IP
- Preserve PATH dependencies (openssh, gzip, zstd, systemd, util-linux)
- Update nix/tests/first-boot-provision.nix for the new package
- Update nix/tests/first-boot-source-discovery.nix for the new package
Cleanup And Close
- Move provisioning implementation into scripts/atomixos_provision/ while preserving the first-boot-provision command interface
- Update docs and reference pages for the new package layout
- Update docs/src/planned-features.md to mark feature complete
- Add boot-ui-htmx to planned-features.md as a follow-up feature
- Run full Nix build and VM tests on aarch64-linux builder
Service Foundation Follow-Up
- Add settings.py with a small environment-backed AppSettings object
- Add deps.py with Litestar dependency providers for settings and service facades
- Add exceptions.py for consistent domain-error to HTTP-response mapping
- Add typed schemas for nonce, validation, submit-config, job, job event, and provision result responses
- Convert job response serialization to use the typed schemas
- Split /api/health into a domain/system/controller.py
- Split /api/nonce into a domain/auth/controller.py
- Split /api/jobs/{id} into a domain/jobs/controller.py
- Split /api/config and /api/validate into a domain/config/controller.py
- Add a ConfigService facade for apply and validate operations
- Keep create_app() route wiring explicit; do not add domain auto-discovery yet
- Add OpenAPI operation IDs, summaries, tags, and typed response metadata for API routes
- Add docs updates for docs/src/provisioning.md, docs/src/data-flow.md, docs/src/runtime-boundaries.md, docs/src/reference/project-structure.md, docs/src/code-reference/scripts.md, and docs/src/testing.md when service API behavior changes
- Keep Boot UI routes in ui.py until the boot-ui-htmx follow-up splits server-rendered partials
Future Dynamic API Direction
- Design GET /api/config/current to return normalized current desired state
- Design GET /api/config/export to export a backup/clone config bundle
- Design typed user partial updates such as PATCH /api/config/users/{name}
- Design typed network partial updates such as PATCH /api/config/network
- Design typed container partial updates such as PATCH /api/config/containers/{name}
- Ensure every partial update loads current desired state, applies a typed patch, validates full desired state, renders candidate state, promotes atomically, activates, reports job progress, and rolls back on failure
- Do not allow partial API paths to directly mutate derived files or runtime systemd/Quadlet state
- Define normalized desired-state import and export bookends before implementing partial mutation APIs
- Add import/export round-trip tests so API-managed state can always be backed up or cloned as a config bundle
- Add drift detection/reporting expectations for rendered files under /data/config/
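The partial-update rule above (load, patch, validate the full state, render, promote, activate, roll back on failure) can be sketched as one function with the side-effecting steps injected. Since this API is still a future direction, every name here is hypothetical, including the section/patch shape and the mtu field used in the usage example.

```python
from copy import deepcopy
from typing import Any, Callable, Dict

State = Dict[str, Any]


def apply_partial_update(
    current: State,
    section: str,
    patch: State,
    *,
    validate: Callable[[State], None],
    render: Callable[[State], Any],
    promote: Callable[[Any], None],
    activate: Callable[[], None],
    rollback: Callable[[], None],
) -> State:
    """Apply a patch to one section by going through the full desired-state pipeline."""
    desired = deepcopy(current)                    # never mutate the live state in place
    desired.setdefault(section, {}).update(patch)  # apply the patch
    validate(desired)                              # validate the whole state, not just the patch
    candidate = render(desired)                    # derived files are regenerated, not edited
    promote(candidate)                             # atomic swap of candidate into place
    try:
        activate()
    except Exception:
        rollback()                                 # restore the previous state on failure
        raise
    return desired


# Usage with stand-in callables that only record the order of operations:
calls = []
new_state = apply_partial_update(
    {"network": {"mtu": 1500}},
    "network",
    {"mtu": 9000},
    validate=lambda state: calls.append("validate"),
    render=lambda state: calls.append("render") or "candidate",
    promote=lambda candidate: calls.append("promote"),
    activate=lambda: calls.append("activate"),
    rollback=lambda: calls.append("rollback"),
)
```

Because the patch only ever changes desired state, derived files and runtime units are touched solely by the render and activate steps, matching the restriction on direct mutation.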
Validation And Readiness
- Run uv run --extra dev pytest for the provisioning package
- Run uv run --extra dev ruff check . for the provisioning package
- Run Nix parse checks for touched modules and VM tests
- Run the relevant NixOS VM tests after controller/service refactors
- Verify rootfs closure remains within the 1 GB squashfs budget after dependency changes
- Search docs for stale synchronous /api/config response descriptions after API changes
- T999: Reconcile final implementation, feature specs, and docs before close-out
Explicitly Deferred
- Do not add SQLAlchemy, a database, or repository abstractions without a persistent data model that cannot be represented by config state
- Do not add Redis/SAQ unless jobs must survive service restarts or run independently of the provisioning process
- Do not add OAuth/JWT auth unless SSH-signature administration stops meeting operator needs
- Do not add Vite/SPA integration for the bootstrap UI; prefer server-rendered/HTMX follow-up work
Design: caddy-authcrunch-cockpit-tutorial
Summary
A documentation-only tutorial that provides a fully working local-management
config.toml bundle demonstrating Caddy with the AuthCrunch plugin for
Microsoft Entra OIDC authentication, provider-swap guidance for Google and other
OIDC providers, JWT-based group-to-role mapping, and Cockpit-ws for device
management – all provisioned through AtomixOS’s existing config.toml system.
Goal
An operator can copy the tutorial config, substitute their identity-provider and local DNS values, build a config bundle, and provision an AtomixOS device with a working OIDC-authenticated management stack that does not need public DNS, public ACME validation, or inbound internet exposure. The tutorial exercises every major config.toml feature: containers, networks, volumes, builds, bundle files, and token substitution.
Architecture
Container Topology
graph TD
subgraph device["AtomixOS device (host network)"]
subgraph caddy["caddy-gateway<br/>(AuthCrunch)"]
direction TB
c1["Entra OIDC login"]
c2["JWT issuance"]
c3["role-based authz"]
c4["reverse proxy"]
end
subgraph cockpit["cockpit-ws"]
direction TB
k1["--local-session"]
k2["cockpit-bridge"]
k3["host socket mounts"]
end
caddy -- "reverse_proxy localhost:9090" --> cockpit
socket["/run/podman/podman.sock"]
caddydata["caddy-data volume"]
mgmtnet["management network"]
end
lan((Local LAN browser)) -- "ports 80, 443" --> caddy
cockpit -. "mount" .-> socket
caddy -. "mount" .-> caddydata
Authentication Flow
- User navigates to https://gateway.example.com/cockpit/, where the gateway name resolves locally to the device’s LAN address
- Caddy’s authorization policy checks for a valid JWT cookie
- If no cookie: redirect to /auth/, which initiates Entra OIDC login
- AuthCrunch receives the OIDC ID token, maps Entra groups to local roles:
  - Entra group AtomixOS-Admins -> authp/admin
  - Entra group AtomixOS-Users -> authp/user
- AuthCrunch issues a local JWT cookie with the mapped roles
- Caddy’s authorization policy validates the JWT and allows the request
- Caddy reverse-proxies to cockpit-ws at localhost:9090
- Cockpit-ws runs with --local-session and does not perform a second login
- The local cockpit-bridge session uses mounted host sockets for system and Podman management
Caddy-Gated Local Session
The tutorial uses Caddy + AuthCrunch as the authentication and authorization
boundary. Cockpit-ws runs behind Caddy with --local-session, so Cockpit starts
cockpit-bridge directly and trusts the reverse proxy boundary. The Cockpit
route is admin-only; user-facing applications can use a separate policy that
allows both authp/admin and authp/user.
Cockpit-Podman Integration
The cockpit-podman package communicates with Podman via its REST API through
the Podman socket. This tutorial installs cockpit-podman into the custom
Cockpit container and mounts /run/podman/podman.sock, so administrators can
manage host containers after Caddy authorizes access to /cockpit/*.
Bundle Structure
example/caddy-oidc/
  config.toml
  files/
    caddy/
      Caddyfile
    cockpit/
      Containerfile    # Custom cockpit-ws image (adds management modules)
config.toml Design
Containers
| Container | Image | Privileged | Network | Purpose |
|---|---|---|---|---|
| caddy-gateway | ghcr.io/authcrunch/authcrunch:latest | true | host (forced) | OIDC auth, reverse proxy |
| cockpit-ws | custom build from quay.io/fedora/fedora:latest | true | host (forced) | Device management UI |
The cockpit-ws container uses a custom Containerfile based on Fedora that installs
cockpit-ws, Cockpit bridge, and management modules. The custom image is built
via Quadlet .build support.
Caddy is rootful because it binds privileged ports 80/443. Cockpit-ws is rootful because the example intentionally exposes a local admin session with host D-Bus, systemd, journal, and Podman sockets mounted into the container.
Builds
| Build | Base Image | Additions | Purpose |
|---|---|---|---|
| cockpit-ws | quay.io/fedora/fedora:latest | Cockpit management modules | Admin console runtime |
The cockpit-ws.build Quadlet unit builds the custom cockpit-ws image from a
Containerfile in the bundle. This exercises the new .build config.toml feature.
The build uses Network = "host" so package installation does not depend on
Podman’s build-time netavark/nftables network setup.
The cockpit-ws.container unit requires and starts after
cockpit-ws-build.service, and sets Pull = "never" so Podman uses the local
build output instead of trying to resolve localhost/cockpit-ws:latest as a
registry image.
Networks
| Network | Purpose |
|---|---|
| management | Future use: inter-container communication if containers move off host network |
The management network demonstrates the [network.*] config.toml feature. In the initial
tutorial both containers use host networking, so the network is defined but not actively
used by the containers. This is intentional: it shows operators how to define networks and
provides a foundation for moving to bridge networking later.
Volumes
| Volume | Purpose |
|---|---|
| caddy-data | Persistent Caddy state (certificates, ACME) |
Bundle Files
| File | Mount Target | Purpose |
|---|---|---|
| files/caddy/Caddyfile | /etc/caddy/Caddyfile | AuthCrunch + OIDC configuration |
| files/cockpit/Containerfile | build context | Custom cockpit-ws image definition |
Environment Variables (via Quadlet Environment)
| Variable | Container | Purpose |
|---|---|---|
| AZURE_TENANT_ID | caddy-gateway | Entra directory/tenant ID |
| AZURE_CLIENT_ID | caddy-gateway | Entra app registration client ID |
| AZURE_CLIENT_SECRET | caddy-gateway | Entra app registration client secret |
| ENTRA_ADMIN_GROUP_NAME | caddy-gateway | Entra group promoted to authp/admin |
| GATEWAY_DOMAIN | caddy-gateway, cockpit-ws | Local device DNS name |
| JWT_SHARED_KEY | caddy-gateway | Shared secret for JWT sign/verify |
Caddyfile Design
{
    http_port 80
    https_port 443
    admin off
    order authenticate before respond
    order authorize before basicauth
    security {
        oauth identity provider azure {
            realm azure
            driver azure
            tenant_id {env.AZURE_TENANT_ID}
            client_id {env.AZURE_CLIENT_ID}
            client_secret {env.AZURE_CLIENT_SECRET}
            scopes openid email profile
        }
        authentication portal myportal {
            crypto default token lifetime 3600
            crypto key sign-verify {env.JWT_SHARED_KEY}
            enable identity provider azure
            ui {
                links {
                    "Cockpit" /cockpit/ icon "las la-server"
                }
            }
            transform user {
                match realm azure
                action add role authp/user
            }
            transform user {
                match realm azure
                match roles {$ENTRA_ADMIN_GROUP_NAME}
                action add role authp/admin
            }
        }
        authorization policy user-policy {
            set auth url /auth/
            crypto key verify {env.JWT_SHARED_KEY}
            allow roles authp/admin authp/user
            validate bearer header
            inject headers with claims
        }
        authorization policy admin-policy {
            set auth url /auth/
            crypto key verify {env.JWT_SHARED_KEY}
            allow roles authp/admin
            validate bearer header
            inject headers with claims
        }
    }
}
{$GATEWAY_DOMAIN} {
    tls internal
    redir / /cockpit/ 302
    redir /cockpit /cockpit/ 302
    route /auth* {
        authenticate with myportal
    }
    route /cockpit/* {
        authorize with admin-policy
        reverse_proxy localhost:9090
    }
    # Add user-facing applications here. They can use user-policy to allow
    # both admin and user roles.
    # route /app/* {
    #     authorize with user-policy
    #     reverse_proxy localhost:8080
    # }
}
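To make the effect of the two transform blocks and the two authorization policies easier to follow, here is a plain-Python model of the role logic. It is not AuthCrunch code and does not reproduce its matching semantics in detail; the function names are invented, and the model follows the Caddyfile as written, where every user from the azure realm receives authp/user and members of the admin group additionally receive authp/admin.

```python
ADMIN_POLICY = {"authp/admin"}               # mirrors admin-policy
USER_POLICY = {"authp/admin", "authp/user"}  # mirrors user-policy


def map_roles(idp_roles: set, admin_group: str) -> set:
    """Model of the two `transform user` blocks for users in the azure realm."""
    roles = set(idp_roles) | {"authp/user"}  # first transform: every realm user
    if admin_group in idp_roles:             # second transform: admin group members
        roles.add("authp/admin")
    return roles


def is_allowed(roles: set, policy: set) -> bool:
    """A request passes when the token carries at least one allowed role."""
    return bool(roles & policy)


admin_roles = map_roles({"AtomixOS-Admins"}, admin_group="AtomixOS-Admins")
user_roles = map_roles({"AtomixOS-Users"}, admin_group="AtomixOS-Admins")
```

Under this model an ordinary user can reach routes guarded by user-policy but not /cockpit/*, which is guarded by admin-policy.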
Local TLS Design
The tutorial is meant to work locally, not as an internet-facing deployment.
Caddy uses tls internal so certificate issuance is handled by Caddy’s local CA
instead of Let’s Encrypt. Operators must ensure their management workstation
resolves GATEWAY_DOMAIN to the device’s LAN address and either trusts the
Caddy local root CA or accepts the browser warning during testing.
This intentionally avoids the failure mode where a local gateway name resolves to a public IP, causing ACME HTTP-01/TLS-ALPN-01 validation to time out against a host that is not the AtomixOS device.
Cockpit Local Session Design
The custom Cockpit image runs:
cockpit-ws --no-tls --local-session /usr/bin/cockpit-bridge
This deliberately disables Cockpit’s own login flow. Caddy is the only public
entry point and must authorize /cockpit/* with admin-policy before traffic
reaches cockpit-ws.
The custom image writes /etc/cockpit/cockpit.conf at startup using
GATEWAY_DOMAIN from config.toml, keeping all operator-editable placeholders
in one file.
Azure App Registration Prerequisites
The tutorial must document these Azure portal steps:
- Register a new App Registration in Microsoft Entra ID
- Set redirect URI: https://<gateway-domain>/auth/oauth2/azure/authorization-code-callback
- Create a client secret
- Under “Token Configuration” -> “Add groups claim” -> Select “Security groups”
- Note the Tenant ID, Client ID, and Client Secret
- Create Entra security groups (e.g., AtomixOS-Admins, AtomixOS-Users)
- Assign users to groups
Alternate OIDC Providers
The default bundle uses Entra, but the tutorial also documents the provider-specific values that change for Google and other OIDC providers:
- AuthCrunch oauth identity provider name, realm, and driver
- callback URI path (/auth/oauth2/<provider>/authorization-code-callback)
- environment variables for client ID/client secret
- transform rules used to assign authp/admin and authp/user
Constraints
- Must use only config.toml features that exist today or are added as part of this feature (.build Quadlet support is a new prerequisite)
- Both containers are rootful for this example: Caddy for privileged ports and Cockpit for host management socket access
- Tutorial values (tenant ID, client ID, local DNS name) use obvious <PLACEHOLDER> markers
- Must not require changes to the AtomixOS base image schema beyond .build support
- The tutorial config must pass first-boot-provision validate
Non-Goals
- Internet deployment and production-hardening (public certificates, certificate pinning, secret rotation, HA)
- Native host Cockpit service packaging
- Custom PAM module or Cockpit bearer-token authentication
- SAML providers (tutorial focuses on OIDC)
Success Criteria
- Tutorial config passes first-boot-provision validate
- Existing first-boot provisioning tests cover the config.toml features used by the tutorial bundle (containers, networks, volumes, builds, bundle files)
- Documentation clearly explains the authentication flow end-to-end
- Role mapping is demonstrated with Entra groups and provider-swap guidance is included for Google and other OIDC providers
- Caddy-gated local session eliminates double authentication
- Cockpit-podman container/socket integration is documented honestly
Risks and Tradeoffs
| Risk | Impact | Mitigation |
|---|---|---|
| AuthCrunch Caddyfile syntax changes | Tutorial breaks on version upgrade | Pin image tag in tutorial; note version tested |
| Cockpit has no second login | Caddy misconfiguration could expose an admin session | Keep cockpit bound behind admin-policy and host-local routing |
| Host socket mounts are powerful | Container compromise can manage host services/podman | Document that Cockpit is an admin application, not a user app |
| Entra group claim configuration | Groups may appear as GUIDs not names | Document Azure portal Token Configuration steps |
| Local CA not trusted by browsers | Browser warning on first HTTPS access | Document Caddy internal CA trust requirement |
| JWT_SHARED_KEY in container env | Secret visible in Quadlet file on disk | Document that production should use secret files |
| Cockpit package drift | Container module versions may not match host services | Treat this as an example stack; native host packaging is future |
Dependencies
Existing dependencies are satisfied. One new capability is required:
- Network and volume Quadlet support (completed: 85ec53c)
- Bundle file support with ${FILES_DIR} token substitution (completed)
- Container, network, volume rendering and sync (completed)
- Quadlet .build support (new): schema, rendering, sync, and test updates needed to support [containers.build.*] sections in config.toml that produce .build Quadlet units. This is implemented as a prerequisite task within this feature.
Affected Documentation
- docs/src/SUMMARY.md – add tutorial entry under new Tutorials section
- docs/src/planned-features.md – update status to in-progress
- New: docs/src/features/caddy-authcrunch-cockpit-tutorial/design.md (this file)
- New: docs/src/features/caddy-authcrunch-cockpit-tutorial/tasks.md
- New: tutorial page under docs/src/tutorials/
- New: directly packageable example bundle under example/caddy-oidc/
Open Design Questions
None. All questions from the project plan have been resolved:
- Cockpit-ws auth: Resolved by placing Cockpit behind Caddy/AuthCrunch and running cockpit-ws with --local-session
- Cockpit-podman: Installed in the custom Cockpit container and connected to the mounted host Podman socket
Tasks: caddy-authcrunch-cockpit-tutorial
T000 – Feature spec review
- Review design.md for completeness and accuracy
- Confirm Caddy-gated --local-session approach for Cockpit
- Confirm AuthCrunch Caddyfile syntax against current docs
- Resolve open design questions (Cockpit auth boundary, custom image, .build support)
T00A – Add Quadlet .build support
This is a new infrastructure prerequisite discovered during spec review.
The cockpit-ws container requires a custom Fedora image that installs Cockpit
management modules. Quadlet supports .build units; config.toml needs to support
them the same way it supports .network and .volume.
- Add buildDefinition to schemas/config.schema.json ($defs)
- Add optional build top-level key to schema
- Implement render_builds() in first-boot-provision.py (follow render_networks()/render_volumes() pattern)
- Register .build units in quadlet-runtime.json (mode: rootful)
- Update sync-quadlet to handle .build files
- Update NixOS test to cover .build rendering and sync
- Validate that .build Quadlet units trigger image build on first systemctl daemon-reload + container start
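As a rough picture of what rendering a .build unit involves, the sketch below emits a Quadlet [Build] section using the ImageTag, File, SetWorkingDirectory, and Network keys. It is not the project's render_builds(): the function name render_build_unit, its parameters, and the example paths are assumptions, and the real renderer takes its input from the config.toml schema.

```python
from typing import Optional


def render_build_unit(
    name: str,
    containerfile: str,
    context_dir: str,
    network: Optional[str] = None,
) -> str:
    """Render the text of a hypothetical <name>.build Quadlet unit."""
    lines = [
        "[Build]",
        f"ImageTag=localhost/{name}:latest",     # tag the .container unit refers to
        f"File={containerfile}",                 # path to the Containerfile
        f"SetWorkingDirectory={context_dir}",    # build context directory
    ]
    if network:
        lines.append(f"Network={network}")       # e.g. host networking during the build
    return "\n".join(lines) + "\n"


# Example with made-up paths:
unit_text = render_build_unit(
    "cockpit-ws",
    containerfile="/example/files/cockpit/Containerfile",
    context_dir="/example/files/cockpit",
    network="host",
)
```

Quadlet derives a service named after the unit file with a -build suffix, which is the cockpit-ws-build.service that the design has the container unit depend on.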
T00B – Write cockpit-ws Containerfile
- Create files/cockpit/Containerfile based on quay.io/fedora/fedora:latest
- Add Cockpit bridge and management modules via dnf install --setopt=install_weak_deps=False
- Verify the built image has the required Cockpit modules available
- Keep the Containerfile minimal (single RUN layer)
T001 – Use Caddy-gated local session auth
- Remove custom bearer auth script from the example bundle
- Use Caddy/AuthCrunch as the only public authentication boundary
- Run Cockpit with --local-session behind Caddy
- Restrict /cockpit/* to authp/admin
T002 – Write the Caddyfile
- Configure Entra OIDC identity provider with placeholder values
- Document how to swap the identity provider block for Google or another OIDC provider
- Configure authentication portal with JWT signing
- Configure user transforms for group-to-role mapping
- Configure authorization policies for admin and user routes
- Configure reverse proxy to cockpit-ws at localhost:9090
- Configure /auth* route for authentication portal
- Configure /cockpit/* route with authorization policy
- Configure local-only HTTPS with Caddy tls internal
- Validate Caddyfile syntax against AuthCrunch docs
T003 – Configure Cockpit reverse proxy settings
- Generate /etc/cockpit/cockpit.conf at container startup
- Configure Origins from the GATEWAY_DOMAIN environment variable
- Configure UrlRoot for /cockpit/ path prefix
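The generated file is small; a sketch of its contents is shown below, using the [WebService] section with the Origins and UrlRoot keys from Cockpit's configuration format. The function name render_cockpit_conf is hypothetical, the https:// origin is an assumption based on Caddy terminating TLS, and the example image may well generate the file with a shell entrypoint instead.

```python
def render_cockpit_conf(gateway_domain: str, url_root: str = "/cockpit/") -> str:
    """Build cockpit.conf text for a Cockpit instance served under a path prefix."""
    if not gateway_domain:
        raise ValueError("GATEWAY_DOMAIN must be set")
    # Origins tells cockpit-ws which browser origin is allowed to open sessions;
    # it must match the name the operator types into the browser.
    origin = f"https://{gateway_domain}"
    return (
        "[WebService]\n"
        f"Origins = {origin}\n"
        f"UrlRoot = {url_root}\n"
    )


# Example with the placeholder domain used elsewhere in this design:
conf_text = render_cockpit_conf("gateway.example.com")
```

In a container entrypoint the domain would come from the GATEWAY_DOMAIN environment variable, and the text would be written to /etc/cockpit/cockpit.conf before cockpit-ws starts.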
T004 – Write config.toml
- Define version = 2
- Define users.admin.ssh_key with placeholder public key
- Define network.firewall.inbound.wan with ports 80 and 443 open (TCP)
- Define activation.required listing caddy-gateway.service and cockpit-ws.service
- Define caddy-gateway container (rootful, AuthCrunch image)
- Define cockpit-ws container (rootful, custom build image ref)
- Define cockpit-ws build section referencing Containerfile
- Define management network with subnet
- Define caddy-data volume with local driver
- Configure Environment keys with placeholder values (AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, JWT_SHARED_KEY)
- Configure Volume mounts for Caddyfile and host management sockets using ${FILES_DIR} tokens where appropriate
- Configure Podman socket mount for cockpit-ws container
- Verify all placeholder values are obvious (<AZURE_TENANT_ID>, etc.)
T005 – Validate config.toml
- Run first-boot-provision validate on the tutorial config
- Fix any schema or semantic validation errors
- Verify all rendered Quadlet files have correct content
T006 – Write NixOS VM test
Skipped. The existing first-boot-provision.nix test already covers all
code paths used by the tutorial config (containers, networks, volumes,
builds, bundle files, sync-quadlet). A dedicated tutorial test would
duplicate coverage without exercising new logic.
T007 – Write tutorial documentation page
- Write introduction explaining what the tutorial builds
- Document Azure App Registration prerequisites step by step
- Document the authentication flow with a diagram
- Present the complete config.toml with annotations
- Present the Caddyfile with annotations
- Present the Containerfile with annotations
- Document the bundle directory structure
- Document how to build and apply the bundle
- Document role mapping (authp/admin for Cockpit, authp/user for app routes)
- Document local DNS and Caddy internal TLS requirements
- Document alternate OIDC provider setup for Google and generic providers
- Document cockpit-podman container/socket integration and native-host alternative
- Document security considerations and production hardening notes
- Add placeholders table listing all values that must be substituted
T008 – Update docs/src/SUMMARY.md
- Create a Tutorials section in SUMMARY.md (does not exist yet)
- Add tutorial entry under the new Tutorials section
T009 – Update planned-features.md
- Update caddy-authcrunch-cockpit-tutorial status to in-progress
T999 – Feature close-out
- All tasks T00A-T009 completed
- Tutorial config passes first-boot-provision validate
- NixOS VM test passes (T006 skipped; existing test covers code paths)
- Documentation builds without errors
- design.md and delivered behavior agree
- No unresolved design questions remain
Items deferred to hardware validation
- T00A: Validate .build Quadlet units trigger image build on daemon-reload
- T00B: Verify built image has the required Cockpit modules available
Task Reference
All tasks are run with mise run <task>. Run mise tasks to list them.
Build Tasks
All build:* tasks accept --lima to run inside a Lima VM and --vm <name> to specify which VM (default: default).
| Task | Description |
|---|---|
| check | Verify flake evaluates cleanly (nix flake check) |
| build | Build and retain image artifacts under .gcroots/ |
| build:squashfs | Build squashfs rootfs → result-squashfs/ |
| build:rauc-bundle | Build signed RAUC bundle → result-rauc-bundle/ |
| build:boot-script | Build U-Boot boot script → result-boot-script/ |
build also accepts -o <path> to copy the latest .img to a path.
E2E Test Tasks
| Task | Description |
|---|---|
| e2e | Run the core 9-task E2E suite sequentially |
| e2e:rauc-slots | RAUC slot detection after boot |
| e2e:rauc-update | Bundle install + slot switch A→B |
| e2e:rauc-rollback | Install → mark bad → rollback to previous slot |
| e2e:rauc-confirm | os-verification health check → mark-good (~3 min) |
| e2e:rauc-power-loss | Crash mid-install, verify recovery |
| e2e:rauc-watchdog | Watchdog + boot-count rollback |
| e2e:firewall | WAN/LAN/VPN port allow/deny (2-node VLAN) |
| e2e:network-isolation | DHCP/NTP/WAN isolation (2-node VLAN) |
| e2e:ssh-wan-toggle | SSH-on-WAN flag enable/disable |
| e2e:debug | Interactive QEMU VM for debugging (-t <test>, --keep) |
Provisioning Tasks
| Task | Description |
|---|---|
| flash | Flash image to disk device with dd + progress (macOS/Linux) |
Configuration Tasks
config:lan-range: Update LAN gateway/DHCP range across all config files.
Utility Tasks
| Task | Description |
|---|---|
| gc | Delete old generations and collect unrooted store paths (--lima; --vm <name> when using --lima) |
| serial:capture | Capture serial output (1.5 Mbaud, auto-reconnect). --bg for background |
| serial:shell | Interactive serial shell via minicom (1.5 Mbaud) |
Flake Outputs
The Nix flake (flake.nix) provides the following outputs:
NixOS Configurations
| Output | Description |
|---|---|
| nixosConfigurations.rock64 | Real hardware NixOS system (RK3328, eMMC, all service modules) |
| nixosConfigurations.rock64-qemu | QEMU aarch64-virt testing target (virtio devices, custom RAUC backend) |
Both configurations share modules/base.nix and all service modules. They differ only in hardware-specific
configuration (kernel drivers, device paths, boot method).
Packages
All packages target aarch64-linux. An aarch64-darwin alias is provided so that nix build .#image works directly
from macOS when a linux-builder is available (the alias points to the same aarch64-linux package set):
| Output | Description |
|---|---|
| packages.aarch64-linux.squashfs | Compressed squashfs root filesystem (~300-400 MB) |
| packages.aarch64-linux.rauc-bundle | Signed multi-slot .raucb bundle for OTA updates |
| packages.aarch64-linux.boot-script | Compiled U-Boot boot.scr |
| packages.aarch64-linux.uboot | Custom Rock64 U-Boot package providing the bootloader artifacts |
| packages.aarch64-linux.uboot-env-tools | fw_printenv / fw_setenv binaries used with the Rock64 SPI env |
| packages.aarch64-linux.image | Flashable eMMC disk image (U-Boot + boot-a + rootfs-a, ~1.2 GB) |
Apps
| Output | Description |
|---|---|
| apps.aarch64-linux.rock64-qemu-vm | QEMU VM runner (nix run .#rock64-qemu-vm) |
Checks (Tests)
Tests are available for both Linux and macOS:
| Output | Description |
|---|---|
| checks.aarch64-linux.* | E2E tests running under TCG (software emulation) |
| checks.aarch64-darwin.* | Same tests running natively on macOS via Apple Virtualization Framework |
Available test names: rauc-slots, rauc-update, rauc-rollback, rauc-confirm, rauc-power-loss, rauc-watchdog,
firewall, initrd-fresh-flash-marker, first-boot-provision, first-boot-source-discovery, forensics-podman-log-path,
forensics-rsyslog-path, forensics-rsyslog-buffering, forensics-shutdown-flush, network-isolation, ssh-wan-toggle.
Overlay
The flake includes an embeddedOverlay that strips unnecessary dependencies to reduce closure size:
- crun is built without CRIU support (removes criu + python3, saving ~102 MB)
This overlay is applied to both NixOS configurations via the overlayModule.
Project Structure
flake.nix  Main flake (pinned nixpkgs release, aarch64-linux)
flake.lock  Pinned nixpkgs
mise.toml  Tool versions, build tasks, hooks
modules/
  base.nix  Shared NixOS config (systemd, ssh, auth, closure opts)
  hardware-rock64.nix  RK3328 kernel, DTB, eMMC/watchdog drivers
  hardware-qemu.nix  QEMU aarch64-virt target for testing
  networking.nix  NIC naming (.link files), eth0/eth1 config
  firewall.nix  nftables rules (WAN/LAN/VPN/FORWARD)
  lan-gateway.nix  dnsmasq DHCP, chrony NTP, IP forwarding off
  rauc.nix  RAUC system.conf, slot definitions
  watchdog.nix  systemd watchdog config
  os-verification.nix  Post-update health check service
  os-upgrade.nix  Update polling + reserved hawkBit package path
  first-boot.nix  First-boot provisioning import + slot commit
  logging.nix  journald ingress + buffered rsyslog durability
  boot-storage-debug.nix  Boot-partition mount helpers for debugging
  openvpn.nix  OpenVPN recovery tunnel
nix/
  squashfs.nix  Squashfs image derivation (closureInfo + mksquashfs)
  rauc-bundle.nix  Multi-slot RAUC bundle derivation
  boot-script.nix  U-Boot boot.scr compilation
  image.nix  Flashable eMMC disk image derivation
  tests/  NixOS VM integration tests (nixos-lib.runTest)
    rauc-slots.nix  RAUC slot detection + custom backend
    rauc-update.nix  Bundle install + slot switch
    rauc-rollback.nix  Install -> mark-bad -> rollback
    rauc-confirm.nix  os-verification health check -> mark-good
    rauc-power-loss.nix  Crash mid-install, verify recovery
    rauc-watchdog.nix  Watchdog + boot-count rollback
    firewall.nix  2-node WAN/LAN port allow/deny
    initrd-fresh-flash-marker.nix  Initrd fresh-flash detection
    first-boot-provision.nix  Provisioning import + Quadlet rendering
    first-boot-source-discovery.nix  USB/boot seed discovery rules
    forensics-*.nix  journald/rsyslog durability and log-path tests
    network-isolation.nix  2-node DHCP/NTP/WAN isolation
    ssh-wan-toggle.nix  SSH-on-WAN flag enable/disable
scripts/
  build-squashfs.sh  Squashfs build template (Nix derivation)
  build-rauc-bundle.sh  RAUC bundle build template (Nix derivation)
  build-image.sh  Disk image assembly template (Nix derivation)
  os-verification.sh  Runtime health check script
  os-upgrade.sh  Runtime update polling script
  ssh-wan-toggle.sh  SSH-on-WAN flag check
  ssh-wan-reload.sh  SSH-on-WAN runtime reload
  first-boot.sh  First-boot provisioning import + mark-good
  atomixos_provision/  Provisioning server package (Litestar + uvicorn)
    src/atomixos_provision/domain  Explicit API domains: auth, config, jobs, system
  quadlet-sync.sh  Rootful/rootless Quadlet sync + startup
  watchdog-boot-count.sh  Boot-count decrement and rollback journal logging
  boot.cmd  U-Boot A/B boot script source
  fw_env.config  U-Boot SPI env config
.mise/tasks/
  flash  Flash image to disk device (macOS/Linux)
  serial/
    capture  Serial console capture (1.5 Mbaud, --bg for background)
    shell  Interactive serial console (minicom)
  config/
    lan-range  Update LAN gateway/DHCP range across all configs
  e2e/
    rauc-slots ... ssh-wan-toggle  Individual E2E test runners
    debug  Interactive QEMU debugging
  docs/
    build  Build the documentation site
    serve  Serve docs locally with hot reload
certs/
  dev.ca.cert.pem  Development RAUC CA certificate (public)
  dev.signing.cert.pem  Development RAUC signing certificate (public)
  dev.*.key.pem  Development private keys (committed for dev/test only)
docs/
  book.toml  mdBook configuration
  src/  Documentation source (this site)
_typos.toml  Typos checker config
Code Reference
This section documents the internal interfaces of AtomixOS: the NixOS modules, Nix derivations, and shell scripts that make up the system.
- NixOS Modules – the NixOS configuration modules in modules/
- Nix Derivations – the build derivations in nix/
- Scripts – the shell scripts in scripts/ and .mise/tasks/
NixOS Modules
All NixOS modules live in the modules/ directory. base.nix imports all service modules and is itself imported by the
hardware-specific modules (hardware-rock64.nix, hardware-qemu.nix).
Module Dependency Graph
flowchart TD
KERNEL["kernel-config.nix<br/>shared stripped kernel baseline"]
subgraph HARDWARE["hardware targets"]
direction LR
ROCK64["hardware-rock64.nix"]
QEMU["hardware-qemu.nix"]
end
ROCK64 --> BASE["base.nix"]
QEMU --> BASE
subgraph IMPORTS["base.nix imports"]
direction LR
LOGGING["logging.nix"]
NETWORKING["networking.nix"]
FIREWALL["firewall.nix"]
LAN["lan-gateway.nix"]
OPENVPN["openvpn.nix"]
RAUC["rauc.nix"]
FIRSTBOOT["first-boot.nix"]
VERIFY["os-verification.nix"]
UPGRADE["os-upgrade.nix"]
WATCHDOG["watchdog.nix"]
end
BASE --> NETWORKING
BASE --> LOGGING
BASE --> FIREWALL
BASE --> LAN
BASE --> OPENVPN
BASE --> RAUC
BASE --> FIRSTBOOT
BASE --> VERIFY
BASE --> UPGRADE
BASE --> WATCHDOG
KERNEL -. shared baseline .-> ROCK64
KERNEL -. shared baseline .-> QEMU
base.nix
Purpose: Shared NixOS configuration for both hardware and QEMU targets. Defines the core system layout, filesystem mounts, user accounts, and system packages.
Key configuration:
| Setting | Value | Notes |
|---|---|---|
| system.stateVersion | "25.11" | NixOS release |
| networking.hostName | "gateway" | |
| nix.enable | false | No Nix daemon on read-only rootfs |
| documentation.enable | false | Saves closure space |
| security.sudo.enable | false | Uses run0 instead |
Filesystem layout (OverlayFS root):
The root filesystem uses a single OverlayFS assembled in the initrd from the selected squashfs slot and tmpfs-backed upper/work directories:
| Layer | Mount | Filesystem | Size | Description |
|---|---|---|---|---|
| overlay (combined) | / | overlay | – | Unified writable root presented to userspace |
| lower (read-only) | /run/rootfs-base | squashfs | – | Immutable NixOS system from the selected RAUC rootfs slot |
| upper (writable) | /run/overlay-root/* | tmpfs | runtime | Ephemeral writes, lost on reboot |
| persistent state | /data | f2fs | dynamic | Created on first boot (PARTLABEL=data, nofail, noatime) |
The overlay is assembled in the initrd before switch_root:
- boot.scr passes root=fstab and atomixos.lowerdev=/dev/... for the selected squashfs slot
- initrd-prepare-overlay-lower.service mounts that slot read-only at /run/rootfs-base
- sysroot.mount mounts / as overlay with lowerdir=/run/rootfs-base, upperdir=/run/overlay-root/upper, and workdir=/run/overlay-root/work
- sysroot-run.mount bind-mounts /run into the switched root
This approach replaces the older /sysroot mutation logic and keeps the root mount fstab-driven, which fits systemd’s
initrd model more cleanly.
The lower squashfs is selected by U-Boot/RAUC, while /data remains outside the A/B slots and survives updates.
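As an illustration of how the three overlay directories combine, the following minimal sketch builds the mount option string from the paths listed above. The helper name `overlay_options` and the printed `mount` command (including the `/sysroot` target) are assumptions for illustration; the real mount is performed by the generated systemd units, not by a script like this.

```python
from pathlib import PurePosixPath


def overlay_options(lower: str, overlay_root: str) -> str:
    """Build the overlay option string from a lower dir and a tmpfs-backed root."""
    root = PurePosixPath(overlay_root)
    return ",".join(
        [
            f"lowerdir={lower}",           # read-only squashfs slot
            f"upperdir={root / 'upper'}",  # ephemeral writes
            f"workdir={root / 'work'}",    # overlayfs scratch space
        ]
    )


options = overlay_options("/run/rootfs-base", "/run/overlay-root")
# Illustrative equivalent of what the mount unit expresses (target is assumed).
print(f"mount -t overlay overlay -o {options} /sysroot")
```

Because upperdir and workdir both live on tmpfs, everything written through the overlay disappears on reboot, which is why persistent state is kept on /data instead.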
Sandboxing note: nsncd (the NSS lookup daemon) runs as root due to permission issues on the overlay filesystem.
Network wait: systemd-networkd-wait-online is configured with a 30s timeout and anyInterface=true.
Build ID: The NixOS login banner (/etc/issue) displays the build ID for easy identification.
Data partition: Not included in the flashable image. Initrd systemd-repart creates it from the remaining eMMC space
on first boot.
tmpfiles.d rules (created on boot):
/var/empty, /var/lib, /var/lib/systemd/network, /var/lib/private,
/var/lib/private/systemd/resolve, /var/lib/chrony, /var/lib/dnsmasq,
/var/cache, /var/cache/nscd, /var/log, /var/log/journal, /var/db, /var/run
User accounts:
| User | Groups | Authentication |
|---|---|---|
| root | – | Locked by default; Rock64 serial-root recovery only when _RUT_OH_=1 |
| appsvc | – | Runtime system account for rootless application containers |
Operator users are declared by provisioning config under [users]. Admin users are added to wheel, use SSH keys from
/data/config/ssh-authorized-keys/<user>, and keep password authentication locked.
System packages: nano, htop, curl, jq, f2fs-tools, kmod
logging.nix
Purpose: Configure the runtime logging path: volatile journald as
ingress, buffered rsyslog appends to /data/logs, and a shutdown flush
hook.
Key configuration:
| Setting | Value | Notes |
|---|---|---|
| journald storage | Storage=volatile | Keeps runtime logs in tmpfs-backed journal storage |
| journald cap | RuntimeMaxUse=32M | Bounds memory use for runtime logs |
| rsyslog output | buffered omfile appends to /data/logs/*.log | Uses async buffered writes instead of direct per-line sync |
| Podman log driver | journald | Routes container stdout/stderr into the same journald path |
Services:
| Service | Purpose |
|---|---|
| syslog.service | Runs rsyslogd and drains journald into buffered files |
| logging-shutdown-flush.service | Flushes journald and asks rsyslog to sync buffered output |
This module no longer installs slot-local forensic helpers. Runtime service and
script output is expected to go to stdout/stderr under systemd, which places
it into journald and then through the buffered rsyslog path.
hardware-rock64.nix
Purpose: Rock64 (RK3328) hardware-specific kernel, device tree, and RAUC slot mapping.
Kernel configuration:
| Category | Drivers | Build |
|---|---|---|
| eMMC | MMC_DW, MMC_DW_ROCKCHIP | built-in (=y) |
| Ethernet | STMMAC_ETH, DWMAC_ROCKCHIP | built-in |
| USB | DWC2, USB_XHCI_HCD, USB_EHCI_HCD, USB_OHCI_HCD | built-in |
| Watchdog | DW_WATCHDOG | built-in |
| Filesystems | SQUASHFS, SQUASHFS_ZSTD, F2FS_FS, OVERLAY_FS | built-in |
| USB Ethernet | USB_RTL8152, USB_NET_AX88179_178A, USB_NET_CDCETHER | module (=m) |
| USB Serial | FTDI_SIO, CP210X | module |
| WiFi/BT | WLAN, CFG80211, MAC80211, RFKILL, BT | unsupported |
RAUC slot mapping:
atomixos.rauc.slots = {
boot0 = "/dev/mmcblk1p1"; # boot-a
boot1 = "/dev/mmcblk1p3"; # boot-b
rootfs0 = "/dev/mmcblk1p2"; # rootfs-a
rootfs1 = "/dev/mmcblk1p4"; # rootfs-b
};
Serial console: ttyS2 at 1.5 Mbaud (Rock64 UART2), enabled via serial-getty@ttyS2.service.
kernel-config.nix
Purpose: Shared stripped kernel baseline used by both Rock64 and QEMU so the VM target stays close to the real device kernel.
Contents:
- baseKernelConfig: the common stripped ARM64 gateway kernel baseline
- optionalKernelConfig: isolated optional USB serial support
hardware-qemu.nix imports this file and layers only the minimal aarch64-virt, virtio, and test-harness-specific
requirements on top.
hardware-qemu.nix
Purpose: QEMU aarch64-virt configuration for development and testing.
Differences from hardware-rock64.nix:
| Setting | Rock64 | QEMU |
|---|---|---|
| Boot method | U-Boot boot.scr | extlinux |
| Block devices | /dev/mmcblk1pN | /dev/vdN (virtio) |
| RAUC backend | uboot | custom (file-based) |
| Kernel modules | Hardware-specific | virtio_pci, virtio_blk, etc. |
The QEMU RAUC tests share their slot mapping through nix/tests/rauc-qemu-config.nix:
atomixos.rauc = {
slots = {
boot0 = "/dev/vdb";
boot1 = "/dev/vdc";
rootfs0 = "/dev/vdd";
rootfs1 = "/dev/vde";
};
bootloader = "custom";
};
networking.nix
Purpose: Deterministic NIC naming and systemd-networkd configuration.
Link files:
| Priority | Match | Result |
|---|---|---|
| 10-onboard-eth | Platform platform-ff540000.ethernet | Name = eth0 |
| 20-usb-eth | Drivers r8152, ax88179_178a, cdc_ether | Enabled as modules in Rock64 kernel config |
| WiFi | Unsupported until hardware selection | not part of current Rock64 image |
Network files:
| Priority | Interface | Configuration |
|---|---|---|
| 10-wan | eth0 | DHCP v4, uses DHCP DNS, no NTP from DHCP |
| 20-lan | eth1 | Static 172.20.30.1/24, no DHCP |
Sysctl: net.ipv4.ip_forward = 0, net.ipv6.conf.all.forwarding = 0
firewall.nix
Purpose: nftables firewall with per-interface rules and dynamic SSH-on-WAN toggle.
nftables rules (inet filter):
| Chain | Policy | Rules |
|---|---|---|
| input | drop | lo: accept; established: accept; eth1: accept by default; tun0: TCP 22 |
| forward | drop | (no exceptions) |
| output | accept | – |
Dynamic SSH toggle services:
| Service | When | What |
|---|---|---|
| ssh-wan-toggle | Boot (after nftables) | Reads flag file, adds SSH rule if present |
| ssh-wan-reload | On demand | Removes old rule, re-adds if flag file exists |
Flag file: /data/config/ssh-wan-enabled
Provisioned inbound: /data/config/firewall-inbound.json is applied by provisioned-firewall-inbound.service.
The file may contain wan and lan scopes. wan opens selected TCP/UDP ports on the WAN interface. lan, when
present with any ports, appends those ports to the platform-required LAN ports on the LAN interface.
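The scope handling described above can be sketched as follows. This is a minimal illustration, not the shipped service: the JSON shape (`{"wan": {"tcp": [...], "udp": [...]}, "lan": {...}}`), the `render_rules` helper, and the `REQUIRED_LAN_PORTS` values are all assumptions, since the actual schema and platform-required port list are defined by the module.

```python
import json

PROTOCOLS = ("tcp", "udp")
# Assumed platform-required LAN ports (DNS, DHCP, NTP); hypothetical values.
REQUIRED_LAN_PORTS = {"tcp": [53], "udp": [53, 67, 123]}


def render_rules(config: dict, wan_if: str = "eth0", lan_if: str = "eth1") -> list[str]:
    """Turn a provisioned inbound config into nftables rule expressions."""
    scopes = {wan_if: config.get("wan", {})}
    lan = config.get("lan", {})
    if any(lan.get(proto) for proto in PROTOCOLS):
        # A LAN scope with ports appends them to the platform-required ports.
        scopes[lan_if] = {p: REQUIRED_LAN_PORTS[p] + lan.get(p, []) for p in PROTOCOLS}
    rules = []
    for iface, ports in scopes.items():
        for proto in PROTOCOLS:
            wanted = sorted(set(ports.get(proto, [])))
            if wanted:
                port_set = ", ".join(str(p) for p in wanted)
                rules.append(f'iifname "{iface}" {proto} dport {{ {port_set} }} accept')
    return rules


example = json.loads('{"wan": {"tcp": [443]}, "lan": {"udp": [5353]}}')
rules = render_rules(example)
for rule in rules:
    print(rule)
```

An empty or missing lan scope produces no LAN rules in this sketch, which leaves the default LAN behaviour untouched.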
lan-gateway.nix
Purpose: DHCP and NTP server for isolated LAN devices.
dnsmasq configuration:
| Setting | Value |
|---|---|
| Interface | eth1 (bind-dynamic) |
| DHCP range | provisioned range, fallback 172.20.30.10 – 172.20.30.254, 24h lease |
| Gateway option | provisioned gateway IP, fallback 172.20.30.1 |
| DNS option | provisioned gateway IP (gateway-local DNS only) |
| NTP option | provisioned gateway IP |
| DNS port | 53 (local-only, no upstream forwarding) |
chrony configuration:
| Setting | Value |
|---|---|
| Upstream | time.cloudflare.com |
| Serve to | provisioned LAN subnet, fallback 172.20.30.0/24 |
| Fallback | local stratum 10 |
rauc.nix
Purpose: RAUC A/B update system configuration. Defines project options (atomixos.rauc.*) and maps them onto the
upstream NixOS services.rauc module.
Custom NixOS options (atomixos.rauc.*):
| Option | Type | Default | Description |
|---|---|---|---|
| compatible | string | "rock64" | RAUC compatible string |
| bootloader | enum | "uboot" | Backend (uboot, custom, etc.) |
| statusFile | string | /data/rauc/status.raucs | RAUC status file |
| bundleFormats | list of strings | [-plain, +verity] | Allowed bundle formats |
| keyringCert | path or null | null | Production RAUC CA certificate |
| allowDevelopmentKeyring | bool | true | Allow repository development CA |
| slots.boot0 | string | (required) | Boot slot A device path |
| slots.boot1 | string | (required) | Boot slot B device path |
| slots.rootfs0 | string | (required) | Rootfs slot A device path |
| slots.rootfs1 | string | (required) | Rootfs slot B device path |
Production builds should set keyringCert to the production RAUC CA and set allowDevelopmentKeyring = false. When the
development keyring is used, /etc/issue includes a warning that the image must not be used for production OTA updates.
When bootloader = "custom", a file-based shell script is generated that simulates U-Boot environment management using
files in /var/lib/rauc/.
watchdog.nix
Purpose: systemd hardware watchdog integration plus boot-count and rollback bookkeeping.
systemd.settings.Manager = {
# RuntimeWatchdogSec = "30s";
# RebootWatchdogSec = "10min";
};
The hardware watchdog manager settings remain disabled during development, but
watchdog-boot-count.service is installed so the real boot-count and rollback
path records lifecycle markers to the journal through normal service stdout.
os-verification.nix
Purpose: Post-update health-check service.
| Setting | Value |
|---|---|
| Type | oneshot |
| Condition | ConditionPathExists=/data/.completed_first_boot |
| Timeout | 180s |
| Script | scripts/os-verification.sh |
| PATH | rauc, jq, systemd, iproute2 |
os-upgrade.nix
Purpose: OTA update polling service.
Custom NixOS options (os-upgrade.*):
| Option | Type | Default | Description |
|---|---|---|---|
| useHawkbit | bool | false | Reserve hawkBit path and install package |
| pollingInterval | string | "1h" | Timer interval |
Timer: OnBootSec=5min, OnUnitActiveSec=<pollingInterval>, RandomizedDelaySec=10min
os-upgrade.service reads the provisioned /data/config/os-upgrade.json value and exits successfully without polling
when no provisioned update server is set.
When useHawkbit = true, AtomixOS disables the polling service and installs rauc-hawkbit-updater, but does not
configure an operational hawkBit systemd service in the current image.
first-boot.nix
Purpose: One-time first-boot provisioning and optional slot confirmation, plus a persistent LAN bootstrap console.
| Setting | Value |
|---|---|
| Type | oneshot |
| Condition | ConditionPathExists=!/data/.completed_first_boot |
| Script | scripts/first-boot.sh |
| Effect | provision config, optionally rauc status mark-good, then write sentinel |
Mutually exclusive with os-verification.service via the sentinel file.
atomixos-bootstrap.service runs atomixos-provision serve on the LAN bootstrap endpoint and remains available after
provisioning so operators can recover or reprovision without re-imaging.
openvpn.nix
Purpose: OpenVPN recovery tunnel.
| Setting | Value |
|---|---|
| Config path | /data/config/openvpn/client.conf |
| Auto-start | false |
| Condition | ConditionPathExists=/data/config/openvpn/client.conf |
Nix Derivations
The nix/ directory contains four derivations that produce the build artifacts. Each is called from flake.nix via
pkgs.callPackage.
Build Pipeline
flowchart LR
SQUASHFS["squashfs.nix"] --> ROOTFS["rootfs.squashfs"]
BOOTSCRIPT["boot-script.nix"] --> BOOTSCR["boot.scr"]
ROOTFS --> IMAGE["image.nix"]
BOOTSCR --> IMAGE
IMAGE --> IMGOUT["flashable .img"]
ROOTFS --> RAUCBUNDLE["rauc-bundle.nix"]
BOOTSCR --> RAUCBUNDLE
RAUCBUNDLE --> BUNDLEOUT["signed .raucb for OTA"]
squashfs.nix
Purpose: Builds a read-only squashfs image from the full NixOS system closure.
Function signature:
{ stdenv, squashfsTools, closureInfo, nixosConfig, maxSquashfsSize }:
| Parameter | Source | Description |
|---|---|---|
| nixosConfig | rock64System.config | Evaluated NixOS configuration |
| maxSquashfsSize | flake.nix (1 GB) | Maximum allowed image size |
Delegates to: scripts/build-squashfs.sh
Build steps:
- Compute all Nix store paths from closureInfo of system.build.toplevel
- Copy all store paths into a pseudo-root directory
- Create /init and /sbin/init symlinks to the NixOS init
- Create empty mount-point directories (/proc, /sys, /dev, /run, /etc, /var, /tmp, etc.)
- Run mksquashfs with zstd compression (level 19), 1 MiB block size
- Fail if the image exceeds maxSquashfsSize
Output: $out/rootfs.squashfs
Compression options:
- Algorithm: zstd (level 19)
- Block size: 1 MiB (1048576)
- No xattrs
- All files owned by root
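The compression options and the size guard can be sketched as below. The helper names and the 1 GiB reading of the "1 GB" limit are assumptions; the actual invocation lives in scripts/build-squashfs.sh and may pass additional flags.

```python
def mksquashfs_args(pseudo_root: str, output: str) -> list[str]:
    """Assemble a mksquashfs command line matching the documented options."""
    return [
        "mksquashfs", pseudo_root, output,
        "-comp", "zstd", "-Xcompression-level", "19",  # zstd, level 19
        "-b", "1048576",                               # 1 MiB block size
        "-no-xattrs",                                  # drop extended attributes
        "-all-root",                                   # all files owned by root
    ]


def check_size(size_bytes: int, max_bytes: int) -> None:
    """Fail the build when the image does not fit the rootfs slot."""
    if size_bytes > max_bytes:
        raise ValueError(f"squashfs is {size_bytes} bytes, limit is {max_bytes}")


MAX_SQUASHFS_SIZE = 1024 * 1024 * 1024  # assumed interpretation of the 1 GB limit
print(" ".join(mksquashfs_args("rootfs", "rootfs.squashfs")))
check_size(700 * 1024 * 1024, MAX_SQUASHFS_SIZE)  # passes silently
```

Failing the build at this point keeps an oversized rootfs from ever reaching a slot that cannot hold it.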
rauc-bundle.nix
Purpose: Builds a signed RAUC bundle containing boot (kernel + initrd + DTB + boot.scr) and rootfs (squashfs) images.
Function signature:
{ stdenv, rauc, dosfstools, mtools, squashfsTools,
nixosConfig, squashfsImage, bootScript, signingCert, signingKeyPath, caCert }:
| Parameter | Source | Description |
|---|---|---|
| nixosConfig | rock64System.config | Provides kernel/initrd/DTB paths |
| squashfsImage | packages.squashfs | The squashfs derivation output |
| bootScript | packages.boot-script | Compiled boot.scr |
| signingCert | ./certs/dev.signing.cert.pem | RAUC signing certificate |
| signingKeyPath | ./certs/dev.signing.key.pem | RAUC signing private key |
| caCert | ./certs/dev.ca.cert.pem | CA certificate for verification |
Delegates to: scripts/build-rauc-bundle.sh
Build steps:
- Create a 128 MB vfat image (boot.vfat)
- Copy kernel Image, initrd, DTB, and boot.scr into it using mtools
- Copy rootfs.squashfs into the bundle directory
- Generate manifest.raucm with compatible=rock64 and image definitions
- Sign and package with rauc bundle
Output: $out/rock64.raucb
Manifest structure:
[update]
compatible=rock64
version=<nixosConfig.system.nixos.version>
[image.boot]
filename=boot.vfat
type=raw
[image.rootfs]
filename=rootfs.squashfs
type=raw
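A manifest of this shape is plain INI text, so generating it can be sketched in a few lines. The `render_manifest` helper and the example version string are hypothetical; the real manifest is written by scripts/build-rauc-bundle.sh.

```python
import configparser


def render_manifest(compatible: str, version: str, images: dict[str, str]) -> str:
    """Render a manifest with one [image.<class>] section per slot class."""
    lines = ["[update]", f"compatible={compatible}", f"version={version}"]
    for slot_class, filename in images.items():
        lines += [f"[image.{slot_class}]", f"filename={filename}", "type=raw"]
    return "\n".join(lines) + "\n"


manifest = render_manifest(
    "rock64",
    "25.11-example",  # placeholder; the build uses nixosConfig.system.nixos.version
    {"boot": "boot.vfat", "rootfs": "rootfs.squashfs"},
)

# Parse it back to confirm the text is well-formed INI.
parsed = configparser.ConfigParser()
parsed.read_string(manifest)
print(parsed.sections())
```

The compatible value must match the one configured on the device, otherwise RAUC refuses to install the bundle.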
boot-script.nix
Purpose: Compiles the U-Boot boot script from source (boot.cmd -> boot.scr).
Function signature:
{ stdenv, ubootTools, buildId }:
| Parameter | Source | Description |
|---|---|---|
| buildId | flake.nix | Build identifier echoed during U-Boot boot |
Build step:
mkimage -C none -A arm64 -T script -d boot.cmd boot.scr
Output: $out/boot.scr (compiled) and $out/boot.cmd (source copy)
image.nix
Purpose: Assembles the complete flashable disk image for eMMC provisioning.
Function signature:
{ stdenv, dosfstools, mtools, util-linux,
ubootRock64, nixosConfig, squashfsImage, bootScript }:
| Parameter | Source | Description |
|---|---|---|
| ubootRock64 | nixpkgs | U-Boot package for Rock64 |
| nixosConfig | rock64System.config | Provides kernel, initrd, DTB |
| squashfsImage | packages.squashfs | Squashfs derivation |
| bootScript | packages.boot-script | Compiled boot.scr |
Delegates to: scripts/build-image.sh
Image layout (total ~1170 MiB sparse):
| Offset | Size | Content | Filesystem |
|---|---|---|---|
| 0 | 16 MB | U-Boot raw | – |
| 16 MB | 128 MB | boot-a | vfat |
| 144 MB | 1024 MB | rootfs-a | squashfs |
| 1168 MB | remaining | unallocated | – |
Output: $out/atomixos-<series>.img
The image name is derived from the pinned NixOS release series (e.g., atomixos-25.11.img). The image leaves the
remaining eMMC space unallocated so initrd systemd-repart can create boot-b, rootfs-b, and /data on first
boot.
GPT partition types: Boot partitions use the xbootldr GUID (BC13C2FF-...). Rootfs partitions use the Linux root
aarch64 GUID (B921B045-...), the type the Discoverable Partitions Specification assigns to aarch64 root filesystems.
U-Boot raw writes:
- idbloader.img at sector 64 (32 KB)
- u-boot.itb at sector 16384 (8 MB)
boot-a contents: kernel Image, initrd, DTB (rockchip/rk3328-rock64.dtb), boot.scr
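The offsets in the layout table and the raw U-Boot write positions follow from simple arithmetic, sketched below. The sketch assumes 512-byte sectors and treats the table's MB figures as MiB; the `layout` helper is hypothetical.

```python
MIB = 1024 * 1024
SECTOR = 512  # assumed logical sector size


def layout(regions: list[tuple[str, int]]) -> tuple[list[tuple[str, int, int]], int]:
    """Place regions back to back; return (name, offset_mib, size_mib) rows and the end offset."""
    rows, offset = [], 0
    for name, size_mib in regions:
        rows.append((name, offset, size_mib))
        offset += size_mib
    return rows, offset


rows, end_mib = layout([("u-boot raw", 16), ("boot-a", 128), ("rootfs-a", 1024)])
raw_writes = {"idbloader.img": 64 * SECTOR, "u-boot.itb": 16384 * SECTOR}

for name, offset, size in rows:
    print(f"{name:10} offset={offset:5} MiB size={size:5} MiB")
print(f"first unallocated offset: {end_mib} MiB")
```

Both raw writes land inside the first 16 MiB, ahead of the boot-a partition.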
Scripts
Shell scripts in scripts/ and .mise/tasks/ implement the runtime services and build/provisioning tooling.
Build Scripts (Nix Derivation Templates)
These scripts run inside Nix derivations. Variables like @kernel@ are substituted by Nix at build time.
build-squashfs.sh
Location: scripts/build-squashfs.sh
Builds the squashfs rootfs image from a NixOS closure.
| Input | Description |
|---|---|
| @systemClosure@ | Path to system.build.toplevel |
| @closureInfo@ | Closure info (contains store-paths file) |
| @maxSize@ | Maximum image size in bytes |
Steps: Copy store paths to pseudo-root, create init symlinks and mount-point dirs, run mksquashfs with zstd/19,
check size limit.
build-rauc-bundle.sh
Location: scripts/build-rauc-bundle.sh
Builds a signed RAUC bundle (.raucb).
| Input | Description |
|---|---|
| @kernel@ | Kernel package (contains Image and dtbs/) |
| @initrd@ | Initrd package (contains initrd) |
| @dtbPath@ | Relative DTB path (e.g., rockchip/rk3328-rock64.dtb) |
| @squashfs@ | Squashfs image directory |
| @bootScript@ | Compiled U-Boot script (boot.scr) |
| @signingCert@ / @signingKey@ | RAUC signing credentials |
| @version@ | Bundle version string |
Steps: Create 128 MB vfat with kernel + initrd + DTB + boot.scr (mtools), generate manifest, sign with rauc bundle.
build-image.sh
Location: scripts/build-image.sh
Assembles the flashable disk image.
| Input | Description |
|---|---|
| @kernel@, @initrd@, @dtbPath@ | Kernel artifacts |
| @squashfs@ | Squashfs image |
| @bootScript@ | Compiled boot.scr |
| @uboot@ | U-Boot package |
| @imageName@ | Output filename |
Steps: Create sparse image, write U-Boot at raw offsets, create GPT with slot A partitions (boot-a, rootfs-a),
create the slot A vfat boot partition with mtools, and write squashfs to rootfs-a. Slot B and /data are created by
initrd systemd-repart on first boot.
Runtime Scripts
These scripts run on the device at runtime, invoked by systemd services.
watchdog-boot-count.sh
Location: scripts/watchdog-boot-count.sh
Records watchdog boot-count state and rollback decisions for the configured RAUC bootloader backend.
Responsibilities:
- Detect the active bootloader mode from ATOMIXOS_RAUC_BOOTLOADER
- For the custom backend, decrement /var/lib/rauc/boot-count.<slot> on boot
- Mark the failed slot bad and switch primary when the count is exhausted
- For the uboot backend, read the post-boot BOOT_*_LEFT value via fw_printenv
- Emit journal-visible lifecycle lines through normal stdout
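The custom-backend bookkeeping can be modelled as below. This is a sketch under stated assumptions, not the shell script itself: the `record_boot` helper and the `slot-state.<slot>` marker file are hypothetical, and the choice to roll back on the boot that finds the counter already at zero is one plausible reading of "exhausted". Only the boot-count.<slot> file name comes from the description above.

```python
import tempfile
from pathlib import Path


def record_boot(state_dir: Path, slot: str, fallback: str) -> str:
    """Consume one boot attempt for `slot`; return the slot that should be primary."""
    counter = state_dir / f"boot-count.{slot}"
    remaining = int(counter.read_text().strip())
    if remaining <= 0:
        # Attempts exhausted: mark the slot bad and hand over to the other slot.
        (state_dir / f"slot-state.{slot}").write_text("bad\n")  # hypothetical marker
        print(f"rollback: slot {slot} exhausted, primary -> {fallback}")
        return fallback
    counter.write_text(f"{remaining - 1}\n")
    print(f"boot-count: slot {slot} has {remaining - 1} attempts left")
    return slot


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp)
    (state / "boot-count.A").write_text("3\n")
    # Four unconfirmed boots in a row: three attempts, then the rollback.
    primaries = [record_boot(state, "A", "B") for _ in range(4)]
```

The print calls stand in for the journal-visible lifecycle lines; under systemd, stdout from the service ends up in journald.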
boot.cmd
Location: scripts/boot.cmd
U-Boot boot script loaded after RAUC bootmeth selects the slot and decrements the boot-count. Compiled to boot.scr by
mkimage.
Key logic:
- Echo build ID (squashfs store hash) to console for identification
- If the reset button (Linux gpiochip3 line 4, U-Boot GPIO100) is held low for 5 seconds, run ums 0 mmc 1 so the Rock64 OTG port exposes the full eMMC to a host computer
- Auto-detect boot device number from devnum
- Override ramdisk_addr_r=0x08000000 (avoids kernel overlap)
- Read RAUC bootmeth variables for selected boot/root partitions
- Set rauc.slot and atomixos.lowerdev
- Load kernel/initrd/DTB from the selected boot partition, set root=fstab, and booti
Console: ttyS2,1500000 (Rock64 UART2)
fw_env.config
Location: scripts/fw_env.config
Configuration for fw_setenv / fw_printenv (userspace U-Boot env tools). The installed Rock64 config points to the
single SPI flash environment exposed through /dev/mtd0.
| Entry | Device | Offset | Size | Erase size |
|---|---|---|---|---|
| Primary env | /dev/mtd0 | 0x140000 | 0x2000 | 0x1000 |
The old raw eMMC environment offsets are not used.
os-verification.sh
Location: scripts/os-verification.sh
Post-update health check. Runs after every boot (except first).
Checks performed:
- RAUC slot status – skip if already committed
- dnsmasq.service is active
- chronyd.service is active
- eth0 has a WAN IP
- eth1 has the provisioned gateway IP, falling back to 172.20.30.1
- Provisioned required units from /data/config/health-required.json are active
- Sustained 60s check (every 5s): all service, network, and required-unit checks still pass
- On success: rauc status mark-good
Logging: Emits progress and failure details through normal service output,
which is captured by journald and forwarded to /data/logs by rsyslog.
Dependencies: rauc, jq, systemctl, ip
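The sustained-window check can be sketched as a small loop. The `sustained_ok` helper is hypothetical and takes the health check and the sleep function as parameters so it can be exercised without waiting; whether the real script samples at both ends of the window is an assumption.

```python
import time
from typing import Callable


def sustained_ok(
    check: Callable[[], bool],
    window_s: float = 60.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if check() passes at every sample across the window."""
    elapsed = 0.0
    while True:
        if not check():
            return False  # any single failure aborts the confirmation
        if elapsed >= window_s:
            return True
        sleep(interval_s)
        elapsed += interval_s


samples = []


def healthy() -> bool:
    samples.append(True)
    return True


# A no-op sleep keeps the demonstration instantaneous.
confirmed = sustained_ok(healthy, sleep=lambda _seconds: None)
print(confirmed, len(samples))
```

Only a True result would be followed by rauc status mark-good; a False result leaves the slot unconfirmed so the boot-count path can roll back.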
os-upgrade.sh
Location: scripts/os-upgrade.sh
OTA update polling script. Checks for new RAUC bundles and installs them.
Environment: ATOMIXOS_OS_UPGRADE_CONFIG (provisioned JSON config path)
Steps:
- Get current version from rauc status and compact lowercase 12-hex device ID from eth0 MAC
- Query $URL/api/v1/updates/latest with version and device headers
- If newer version found: download to /data/config/bundles/, rauc install, reboot
- Non-fatal on network errors (timer retries later)
Forensics: Emits Tier 0 install and managed reboot markers, but avoids noisy polling or “no update” chatter in the durable forensic log.
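Deriving the compact device ID from a MAC address can be sketched as follows. The `device_id_from_mac` helper is hypothetical and the sample address is made up; the script itself reads the address of eth0.

```python
import re


def device_id_from_mac(mac: str) -> str:
    """Strip separators and lowercase a MAC address into a 12-hex-digit ID."""
    compact = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(compact) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return compact


device_id = device_id_from_mac("02:AB:cd:00:11:22")  # made-up example address
print(device_id)
```

Rejecting anything other than twelve hex digits keeps a malformed or missing interface address from being sent to the update server as an identifier.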
first-boot.sh
Location: scripts/first-boot.sh
First-boot provisioning/import/bootstrap flow plus boot confirmation.
Steps:
- Check for /data/.completed_first_boot and exit if it already exists
- Discover provisioning input from fresh-flash /boot/config.toml, USB media, or the LAN bootstrap console
- Validate and import the config into /data/config/
- Render and sync rootful and rootless Quadlet units
- Restart Quadlet sync, LAN apply, and provisioned firewall apply services; fail before slot confirmation if LAN or firewall apply fails
- Mark the current RAUC slot good when RAUC is enabled
- Write timestamp to /data/.completed_first_boot
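The sentinel file is what decides which path a given boot takes. The sketch below models that gate; the unit name `first-boot.service`, the helper names, and the ISO timestamp format are assumptions, and a temporary directory stands in for /data.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def unit_for_boot(sentinel: Path) -> str:
    """Pick the boot-confirmation path based on the first-boot sentinel."""
    return "os-verification.service" if sentinel.exists() else "first-boot.service"


def complete_first_boot(sentinel: Path) -> None:
    """Write the sentinel last, after provisioning and slot confirmation."""
    sentinel.write_text(datetime.now(timezone.utc).isoformat() + "\n")


with tempfile.TemporaryDirectory() as tmp:
    sentinel = Path(tmp) / ".completed_first_boot"
    before = unit_for_boot(sentinel)
    complete_first_boot(sentinel)
    after = unit_for_boot(sentinel)

print(before, "->", after)
```

Writing the sentinel as the final step means an interrupted first boot is retried from the start on the next power-up.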
atomixos_provision
Location: scripts/atomixos_provision/
Litestar provisioning service package used by first boot and re-apply flows.
Key modules:
- app.py wires the Litestar application explicitly
- settings.py loads environment-backed service settings
- deps.py provides Litestar dependency providers
- domain/*/controller.py contains API route handlers grouped by domain
- domain/config/service.py exposes the config apply/validate facade
- schemas.py defines typed API response shapes
- exceptions.py maps domain errors to API response bodies
- provision.py, bundle.py, quadlet.py, and activation.py implement the safe apply pipeline
POST /api/config is asynchronous and returns a job URL. The job endpoint
reports provisioning steps, service deployment/status events, final result, and
rollback status.
ssh-wan-toggle.sh
Location: scripts/ssh-wan-toggle.sh
Boot-time SSH-on-WAN rule application.
Logic: If /data/config/ssh-wan-enabled exists, add nftables rule iifname "eth0" tcp dport 22 accept with
comment SSH-WAN-dynamic.
ssh-wan-reload.sh
Location: scripts/ssh-wan-reload.sh
Runtime SSH-on-WAN toggle (remove and re-add rule).
Logic: Find and delete existing SSH-WAN-dynamic rule by handle, then re-add if flag file exists. Idempotent.
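The remove-then-re-add pattern is what makes the reload idempotent. The sketch below models it on an in-memory rule list; the real script operates on nftables rule handles, and the `reload_ssh_wan` helper and dictionary shape are illustrative assumptions.

```python
RULE = 'iifname "eth0" tcp dport 22 accept'
COMMENT = "SSH-WAN-dynamic"


def reload_ssh_wan(rules: list[dict], flag_present: bool) -> list[dict]:
    """Drop any existing dynamic SSH rule, then re-add it if the flag file exists."""
    kept = [rule for rule in rules if rule.get("comment") != COMMENT]
    if flag_present:
        kept.append({"expr": RULE, "comment": COMMENT})
    return kept


base = [{"expr": 'iifname "lo" accept', "comment": "loopback"}]
once = reload_ssh_wan(base, flag_present=True)
twice = reload_ssh_wan(once, flag_present=True)   # no duplicate rule
disabled = reload_ssh_wan(twice, flag_present=False)
print(len(once), len(twice), len(disabled))
```

Matching on the comment rather than on the rule text means only the dynamically managed rule is ever removed.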
mise Task Scripts
These are the .mise/tasks/ scripts invoked via mise run.
flash
Location: .mise/tasks/flash
Cross-platform disk flasher (macOS + Linux).
| Flag | Description |
|---|---|
| <disk> | Target device (e.g., /dev/disk4) |
| -i <path> | Image file (auto-detects if not specified) |
| -y | Skip confirmation |
macOS features: Converts /dev/diskN to /dev/rdiskN for unbuffered I/O; refuses to write to boot disk; ejects
after flash.
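The macOS device-name conversion amounts to a single pattern substitution, sketched here. The `raw_device` helper is hypothetical; the task itself is a shell script and may implement this differently.

```python
import re


def raw_device(disk: str) -> str:
    """Map /dev/diskN to its unbuffered /dev/rdiskN node; leave other paths alone."""
    match = re.fullmatch(r"/dev/disk(\d+)", disk)
    return f"/dev/rdisk{match.group(1)}" if match else disk


print(raw_device("/dev/disk4"))
print(raw_device("/dev/sdb"))  # Linux device names pass through unchanged
```

Using fullmatch means partition nodes such as /dev/disk4s1 are not rewritten, since only whole-disk targets are expected.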
serial:capture
Location: .mise/tasks/serial/capture
Serial console capture wrapper with auto-reconnect.
| Flag | Default | Description |
|---|---|---|
| -p | /dev/cu.usbserial-DM02496T | Serial device |
| -l | /tmp/rock64-serial.log | Log file |
| -t | 0 (infinite) | Capture timeout |
| --bg | (flag) | Run in background |
Launches scripts/serial-capture.py in a nix-shell with pyserial.
serial:shell
Location: .mise/tasks/serial/shell
Interactive serial console via minicom (1.5 Mbaud, no hardware flow control). Uses nix build nixpkgs#minicom to
resolve the minicom binary.
config/lan-range
Location: .mise/tasks/config/lan-range
Updates LAN gateway/DHCP configuration across all files.
| Flag | Default | Description |
|---|---|---|
| --gateway-cidr | 172.20.30.1/24 | Gateway IP and subnet |
| --dhcp-start | 172.20.30.10 | DHCP pool start |
| --dhcp-end | 172.20.30.254 | DHCP pool end |
Modifies: modules/networking.nix, modules/lan-gateway.nix, scripts/os-verification.sh.
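Before rewriting those files it is worth checking that the three values agree with each other. The sketch below shows one way to do that with the standard ipaddress module; the `validate_lan_range` helper and its specific rules are assumptions, not a description of what the task currently enforces.

```python
import ipaddress


def validate_lan_range(gateway_cidr: str, dhcp_start: str, dhcp_end: str):
    """Check that the DHCP pool sits inside the gateway subnet and excludes the gateway."""
    iface = ipaddress.ip_interface(gateway_cidr)
    start = ipaddress.ip_address(dhcp_start)
    end = ipaddress.ip_address(dhcp_end)
    if start not in iface.network or end not in iface.network:
        raise ValueError("DHCP pool must lie inside the gateway subnet")
    if start > end:
        raise ValueError("DHCP pool start must not exceed its end")
    if start <= iface.ip <= end:
        raise ValueError("gateway address must not be inside the DHCP pool")
    return iface.network


network = validate_lan_range("172.20.30.1/24", "172.20.30.10", "172.20.30.254")
print(network)
```

The returned network is the subnet that chrony would be configured to serve in lan-gateway.nix.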