Phase 1: docker-based deploy pipeline #43
Merged
Conversation
Build a syncloud/redirect image and push it from CI. Add a deploy step that exercises the systemd->docker migration on the same test platform that the integration step already provisioned, so the real UAT/prod migration has no surprises. UAT and prod steps are wired like the store pipeline; they will be red until the corresponding deploy_* secrets are configured.
The bookworm tag does not exist for that Go version; the rest of the pipeline already pins to buster.
On a freshly provisioned host where docker.io was just installed, the daemon is not auto-started in containerised systemd contexts (deb-systemd-invoke cannot reach systemd from postinst).
CI uses a privileged platform-bookworm service container in which docker.service refuses to start under systemd. Bring up dockerd in the background with the vfs storage driver so nested execution is not gated on overlay/cgroup features. Real UAT/prod hosts will be served by the systemctl branch above.
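The branching described above can be sketched roughly as follows; the function name and log path are assumptions, and only the systemctl/dockerd split and the vfs flag come from the discussion:

```shell
# Sketch of the daemon startup fallback. Real UAT/prod hosts take the
# systemctl branch; the privileged CI service container cannot, so
# dockerd is launched directly with the vfs storage driver, which does
# not depend on overlayfs/cgroup features being available when nested.
start_docker() {
    if systemctl start docker 2>/dev/null; then
        echo "docker started via systemd"
        return
    fi
    echo "systemd unavailable, starting dockerd directly"
    dockerd --storage-driver vfs >/var/log/dockerd.log 2>&1 &
    # wait until the daemon answers before the deploy step talks to it
    timeout 60 sh -c 'until docker info >/dev/null 2>&1; do sleep 1; done'
}
```

The vfs driver is slow (it copies layers instead of using a union filesystem), which is acceptable for a single deploy rehearsal but not something to carry over to real hosts.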
The CI 'deploy test' step targets the in-pipeline www.syncloud.test service, but the URL the verifier curls (api.syncloud.test) is not resolvable from the runner without an alias. UAT/prod use real DNS so the alias is only added when the URL host fails to resolve.
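A minimal sketch of that conditional alias, assuming a helper of this shape (the function name and the hosts-file parameter are illustrative; getent is the resolvability probe):

```shell
# add_alias_if_unresolvable HOST IP [HOSTS_FILE]
# Only patch the hosts file when DNS cannot already resolve HOST:
# UAT/prod have real DNS and skip the alias, the CI runner does not.
add_alias_if_unresolvable() {
    host=$1; ip=$2; hosts_file=${3:-/etc/hosts}
    if getent hosts "$host" >/dev/null; then
        echo "$host resolves, no alias needed"
    else
        echo "$ip $host" >> "$hosts_file"
        echo "aliased $host -> $ip"
    fi
}

# localhost always resolves; a .test name never does without an alias
add_alias_if_unresolvable localhost 127.0.0.1 /tmp/hosts.test
add_alias_if_unresolvable api.syncloud.test 10.0.0.5 /tmp/hosts.test
```

Making the alias conditional keeps one verify script valid in all three environments instead of forking a CI-only variant.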
The 'build backend' pipeline step already produces static binaries at build/bin/{api,www,cli}; rebuilding inside the image was duplicate work and forced a heavier multi-stage layout. Now the image just copies what the previous step produced, matching how ../store does it.
- Move the Go build and test invocations out of .drone.jsonnet into backend/build.sh and backend/test.sh so the same recipe can be run locally and by CI without copy-paste drift.
- Drop -linkmode external -extldflags -static; with CGO_ENABLED=0 the Go toolchain already produces a fully static binary suitable for distroless/static.
- Add a backend/version package and inject GitSha/BuildNumber/BuildTime via -X ldflags, matching the store layout. Each main.go prints its build info on startup so it lands in journal/docker logs.
- Scope the long-standing 'api' .gitignore entry to the repo root so backend/cmd/api stops being shadowed.
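The ldflags wiring can be sketched like this; the module path and variable names in the version package are assumptions, and the build commands are echoed (dry-run) so the flag construction stays visible without requiring a Go toolchain:

```shell
# Sketch of the backend/build.sh shape: CGO_ENABLED=0 alone yields a
# fully static binary, and -X stamps build metadata into the
# (assumed) backend/version package variables.
GIT_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
BUILD_NUMBER=${DRONE_BUILD_NUMBER:-local}
BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
VERSION_PKG=github.com/syncloud/redirect/version   # assumed module path

LDFLAGS="-X $VERSION_PKG.GitSha=$GIT_SHA \
 -X $VERSION_PKG.BuildNumber=$BUILD_NUMBER \
 -X $VERSION_PKG.BuildTime=$BUILD_TIME"

echo "ldflags: $LDFLAGS"
for cmd in api www cli; do
    # dry-run: print the command each binary would be built with
    echo CGO_ENABLED=0 go build -ldflags "$LDFLAGS" -o "build/bin/$cmd" "./cmd/$cmd"
done
```

Because the variables are plain package-level strings set at link time, each main.go can log them on startup with no extra plumbing.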
Rename 'test-integration' to 'systemd install' and have it run only the install path plus a smoke index check on the systemd-deployed redirect. Add a new 'test-api' step that runs the full API suite from verify.py *after* deploy test migrates the host to docker, so the upgrade path (the critical one for the first UAT) is exercised end to end and every API assertion runs against what is actually deployed.
Root cause of the intermittent 'cannot mount squashfs ... failed to setup loop device' failure in test-integration / systemd install: the verify.py test_start fixture invokes 'snap remove platform' on the in-pipeline www.syncloud.test service, which forces snapd to run its syscheck and try to loop-mount a squashfs probe. Under load on the shared CI host loop devices are scarce and that mount fails about half the time. Switch the test service from syncloud/platform-bookworm-amd64 (which ships the platform snap pre-installed) to syncloud/bootstrap-bookworm-amd64 (same image ../store uses). Bootstrap has no snap, so no probe runs and the loop-device contention is gone. Drop the now-unnecessary 'snap remove platform' line; the rest of test_start only relies on apt + ssh + systemctl which bootstrap provides.
test_start (skipped in this step) was the only place that called add_host_alias to make api.syncloud.test and auth.syncloud.test resolve to the device. Re-add the alias inline before pytest so the post-deploy API suite can reach the deployed host.
test_backup hits /var/www/redirect/current/bin/redirectdb after the docker migration and fails with exit 127. Dump the post-migration state of the redirect dir and check mysqldump availability so the next CI run shows which piece is missing.
test_backup (and any future ssh-based check) shells out to sshpass via device.run_ssh; the test-api runner image had only default-mysql-client so sshpass returned 127. Add both packages to match what 'systemd install' installs. Drop the deploy-verify diagnostic now that the cause is identified.
verify.py was doing two jobs: installing redirect on a fresh systemd host (test_start), and asserting API behaviour against the running service. The pipeline now has two distinct phases (systemd install, then docker deploy), so put each phase in its own file:
- test-systemd.py owns the tarball install plus a smoke index check and the apache/journal log-collection teardown.
- test.py (renamed from verify.py) holds the API suite that runs against the docker-deployed redirect.
Move the add_host_alias setup into an autouse session-scoped fixture in conftest.py so both files get it without any /etc/hosts bash hack in the Drone yaml. Drop the now-unused imports left in test.py and the corresponding --deselect flags from the Drone commands.
deploy-verify.sh now also curls the www host (https://www.<domain>/) to confirm the user-facing UI is up, and POSTs /domain/update with $SMOKE_TOKEN to exercise the critical DB+Route53 path end to end. The smoke step is opt-in: deploy uat/prod read SMOKE_TOKEN from new uat_smoke_token / prod_smoke_token secrets, and the check is skipped when the env is empty (so deploy test still works as before).
/status only proves the api process is alive and apache is routing to it; it does not exercise the DB. POST /domain/update with a known bogus token: the handler must query the DB to know the token does not exist, so a working DB returns success:false with 'unknown domain update token' while a broken DB would 500 or hang. Runs on every deploy (test/uat/prod) — no setup needed.
Bogus-token DB smoke is enough until we set up a dedicated smoke account; the SMOKE_TOKEN branch was dead code that referenced secrets we agreed not to create yet.
Validation runs before the DB query, so a token-only payload short circuits with 'web_protocol Missing' / 'web_local_port Missing' instead of touching the DB. Include both with default values so the handler reaches GetDomainByToken and we can assert 'unknown domain update token' as the DB-alive signal.
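Putting the pieces together, the smoke check looks roughly like this; the response text is the handler's real error message from the discussion, but the payload encoding (form fields vs JSON) and endpoint URL shape are assumptions:

```shell
# db_alive RESPONSE: succeeds only when the handler reached the DB
# lookup. The 'unknown domain update token' message is emitted after
# GetDomainByToken runs, so seeing it proves the DB round trip worked;
# a broken DB would instead 500 or hang.
db_alive() {
    echo "$1" | grep -q 'unknown domain update token'
}

# What deploy-verify.sh would run against a deployed host (sketch):
#   resp=$(curl -sk -X POST "https://api.$DOMAIN/domain/update" \
#       -d token=00000000-0000-0000-0000-000000000000 \
#       -d web_protocol=https \
#       -d web_local_port=443)
#   db_alive "$resp" || { echo "DB smoke check failed: $resp"; exit 1; }
```

The bogus token plus default web_protocol/web_local_port values get the request past validation and into the DB query without needing any pre-provisioned account.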
Summary
- syncloud/redirect docker image built from a root Dockerfile (multi-stage, distroless static; api/www/cli binaries plus emails baked in)
- deploy/deploy.sh migrates a host from the existing systemd units (redirect.api, redirect.www) to two --restart=unless-stopped containers, mounting /var/www/redirect so unix sockets stay where Apache already proxies to
- a deploy test step runs the script against the same www.syncloud.test service the integration step provisioned via systemd, so the systemd-to-docker migration is exercised end to end before any UAT/prod run
- deploy uat and deploy prod are wired up like the store pipeline; they will be red on CI until uat_deploy_*/prod_deploy_* secrets are added to Drone. That gating is deliberate so the first real UAT run is the only thing left untested.
- the legacy tarball + systemd flow is left in place so running prod is undisturbed by this branch; removal will be a follow-up phase
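The migration step can be sketched as below; container names, the image entrypoint paths, and the dry-run switch are assumptions, while the --restart policy and the /var/www/redirect mount come from the summary. DRY_RUN=1 (the default here) prints commands instead of executing them:

```shell
# Sketch of the systemd -> docker migration deploy/deploy.sh performs.
DRY_RUN=${DRY_RUN-1}
run() { if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi; }

migrate() {
    for svc in api www; do
        # stop and disable the legacy systemd unit
        run systemctl disable --now "redirect.$svc"
        # replace any stale container, then start the new one with the
        # same socket directory mounted so Apache keeps proxying as-is
        run docker rm -f "redirect-$svc"
        run docker run -d --name "redirect-$svc" \
            --restart=unless-stopped \
            -v /var/www/redirect:/var/www/redirect \
            syncloud/redirect "/$svc"
    done
}

migrate
```

Mounting the existing /var/www/redirect directory rather than moving the sockets is what lets the Apache configuration stay untouched during the cutover.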
Test plan
- The branch pipeline is green (clone, services, build web/backend, package, test-integration, docker push, deploy test, test-ui-desktop, test-ui-mobile, artifact, all test-api stages); deploy uat is red because secrets aren't set; deploy prod is skipped on this non-stable branch
- Set uat_deploy_host / uat_deploy_user / uat_deploy_key / uat_deploy_url in Drone and push to trigger a real UAT migration rehearsal
- Add prod_deploy_* secrets and merge to stable to migrate production