Initial extraction from monorepo

This commit is contained in:
Viktor Barzin 2026-05-07 17:06:11 +00:00
commit 5c7baa8acc
20 changed files with 1974 additions and 0 deletions

8
.gitignore vendored Normal file
View file

@ -0,0 +1,8 @@
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/
.hypothesis/
*.egg-info/

34
.woodpecker.yml Normal file
View file

@ -0,0 +1,34 @@
when:
event: push
clone:
git:
image: woodpeckerci/plugin-git
settings:
attempts: 5
backoff: 10s
steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
# Dual-push during the Forgejo registry consolidation bake. After
# ≥14 days clean, registry.viktorbarzin.me drops out (Phase 4).
repo:
- registry.viktorbarzin.me/hmrc-sync
- forgejo.viktorbarzin.me/viktor/hmrc-sync
logins:
- registry: registry.viktorbarzin.me
username: viktorbarzin
password:
from_secret: registry-password
- registry: forgejo.viktorbarzin.me
username:
from_secret: forgejo_user
password:
from_secret: forgejo_push_token
dockerfile: Dockerfile
context: .
auto_tag: true
platforms:
- linux/amd64

33
Dockerfile Normal file
View file

@ -0,0 +1,33 @@
FROM python:3.12-slim AS builder
ENV POETRY_VERSION=1.8.4 \
POETRY_VIRTUALENVS_IN_PROJECT=true \
PIP_NO_CACHE_DIR=1
RUN pip install --no-cache-dir "poetry==${POETRY_VERSION}"
WORKDIR /app
COPY pyproject.toml poetry.lock* README.md ./
RUN poetry install --only main --no-root
COPY hmrc_sync ./hmrc_sync
COPY alembic ./alembic
COPY alembic.ini ./alembic.ini
RUN poetry install --only main
FROM python:3.12-slim
WORKDIR /app
RUN useradd --system --uid 10002 --home /app --shell /usr/sbin/nologin hmrc
COPY --from=builder --chown=hmrc:hmrc /app /app
ENV PATH="/app/.venv/bin:${PATH}" \
PYTHONUNBUFFERED=1
EXPOSE 8080
USER hmrc
ENTRYPOINT ["python", "-m", "hmrc_sync"]
CMD ["serve"]

90
README.md Normal file
View file

@ -0,0 +1,90 @@
# hmrc-sync
Pulls annual PAYE/NI figures from **HMRC Individual Tax API v1.1** to
reconcile against the monthly payslip data captured by `payslip-ingest/`.
## Phase 1 — sandbox OAuth smoke test (shipped)
Scripts live at the repo root next to this README:
- `oauth_dance.py` — interactive browser OAuth flow against
`test-api.service.hmrc.gov.uk`, captures the callback on
`localhost:8080/oauth/callback`, exchanges for tokens, hits
`/individual-income/sa/{utr}/annual-summary/{tax_year}`.
- `headless_auth.py` — same flow but driven by Chromium via Playwright.
Useful for CI smoke tests.
See the inline module docstrings for usage.
## Phase 2 — production service (scaffolded, awaiting HMRC approval)
Directory layout matches `payslip-ingest/`:
```
hmrc-sync/
├── hmrc_sync/
│ ├── __init__.py
│ ├── __main__.py # click CLI: serve / sync / migrate
│ ├── app.py # FastAPI (authorize, callback, sync, healthz)
│ ├── client.py # HmrcClient — wraps Individual Tax API v1.1
│ ├── db.py # SQLAlchemy models (tax_year_snapshot, fetch_log)
│ ├── fraud_headers.py # build Gov-Client-/Gov-Vendor- headers
│ └── oauth.py # Vault-backed refresh_token storage
├── alembic/
│ ├── env.py
│ └── versions/0001_initial.py
├── tests/
│ └── test_fraud_headers.py # CI-gated shape tests + sandbox validator smoke
├── Dockerfile
├── alembic.ini
└── pyproject.toml
```
### Critical path to prod
1. **HMRC Dev Hub** (user action, ~10 min):
- Subscribe to *Individual Tax API v1.1*.
- Add prod redirect URI: `https://hmrc-oauth.viktorbarzin.me/callback`.
- Submit Production Access application — 2 questionnaires, frame as
"single-user PAYE reconciliation, not redistributed".
- Review takes ~10 working days.
2. **File HMRC SDST support ticket** up-front asking (a) is MTD ITSA
signup required for Individual Tax API prod access, and (b) can a
PAYE-only individual voluntarily enroll without self-employment
income. Proceed with app submission in parallel.
3. **Fraud-header validator sweep** (local — blocking):
```
HMRC_VALIDATOR=1 pytest tests/test_fraud_headers.py
```
Must be green before prod deploy.
4. **After HMRC approval arrives**:
- Seed Vault keys: `hmrc_prod_client_id`, `hmrc_prod_client_secret`,
`hmrc_sync_webhook_token`, `hmrc_device_id` at `secret/viktor/`.
- Create `infra/stacks/hmrc-sync/` Terraform stack (clone from
`infra/stacks/payslip-ingest/`): Deployment, Service, Ingress via
`ingress_factory` (protected=false for HMRC callback), ESO for
Vault→K8s Secret, Grafana datasource ConfigMap, CronJob at 06:00
UTC daily running `python -m hmrc_sync sync --tax-year current`.
- Deploy stack.
- Visit `https://hmrc-oauth.viktorbarzin.me/authorize` once in a
browser to seed the refresh_token. CronJob takes over thereafter.
### Dashboard Panel 10
`infra/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`
already carries Panel 10 ("HMRC Tax Year Reconciliation — Individual
Tax API"). It queries `hmrc_sync.tax_year_snapshot` which doesn't
exist yet on the monitoring DB — the panel renders empty until
hmrc-sync is deployed and the Alembic migration runs.
### Risks / mitigations
- **MTD pilot gate blocks API** — SDST ticket resolves; fallback is
payslip-ingest P60 reconciliation (already shipped).
- **Prod approval denied on "personal use"** — reframe + appeal; else
permanent P60-only reconciliation.
- **Fraud-header audit fails** — validator API gates deploy.
- **Refresh token expires (18 months)** — alert on `expires_in` < 30
days; manual re-auth via `/authorize`.
Tracked as beads `code-74j`.

37
alembic.ini Normal file
View file

@ -0,0 +1,37 @@
[alembic]
script_location = alembic
sqlalchemy.url = placeholder
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARN
handlers = console
qualname =
[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers =
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S

59
alembic/env.py Normal file
View file

@ -0,0 +1,59 @@
import asyncio
import os
from logging.config import fileConfig
from sqlalchemy.engine import Connection
from sqlalchemy.ext.asyncio import async_engine_from_config
from alembic import context
from hmrc_sync.db import SCHEMA_NAME, Base
config = context.config
if config.config_file_name is not None:
fileConfig(config.config_file_name)
db_url = os.environ.get("DB_CONNECTION_STRING")
if db_url:
config.set_main_option("sqlalchemy.url", db_url)
target_metadata = Base.metadata
def do_run_migrations(connection: Connection) -> None:
connection.exec_driver_sql(f'CREATE SCHEMA IF NOT EXISTS "{SCHEMA_NAME}"')
context.configure(
connection=connection,
target_metadata=target_metadata,
version_table_schema=SCHEMA_NAME,
include_schemas=True,
)
with context.begin_transaction():
context.run_migrations()
async def run_migrations_online() -> None:
configuration = config.get_section(config.config_ini_section, {})
connectable = async_engine_from_config(configuration, prefix="sqlalchemy.")
async with connectable.connect() as connection:
await connection.run_sync(do_run_migrations)
await connection.commit()
await connectable.dispose()
def run_migrations_offline() -> None:
context.configure(
url=config.get_main_option("sqlalchemy.url"),
target_metadata=target_metadata,
literal_binds=True,
version_table_schema=SCHEMA_NAME,
include_schemas=True,
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()
if context.is_offline_mode():
run_migrations_offline()
else:
asyncio.run(run_migrations_online())

26
alembic/script.py.mako Normal file
View file

@ -0,0 +1,26 @@
"""${message}
Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}
"""
from collections.abc import Sequence
from typing import Union
import sqlalchemy as sa
from alembic import op
${imports if imports else ""}
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}
def upgrade() -> None:
${upgrades if upgrades else "pass"}
def downgrade() -> None:
${downgrades if downgrades else "pass"}

View file

@ -0,0 +1,85 @@
"""Create hmrc_sync.tax_year_snapshot + hmrc_sync.fetch_log.
These two tables hold everything hmrc-sync persists. The snapshot table
keeps HMRC's `hmrc-held` PAYE/NI view per (tax_year, employer, day);
fetch_log is the audit trail of every outbound API call (for
fraud-header compliance reviews HMRC may trigger).
"""
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from alembic import op
revision = "0001"
down_revision = None
branch_labels = None
depends_on = None
SCHEMA = "hmrc_sync"
def upgrade() -> None:
op.create_table(
"tax_year_snapshot",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("tax_year", sa.String(), nullable=False),
sa.Column("employer_paye_ref", sa.String(), nullable=False),
sa.Column("snapshot_date", sa.TIMESTAMP(timezone=True), nullable=False),
sa.Column("gross_pay", sa.Numeric(12, 2), nullable=False),
sa.Column("income_tax", sa.Numeric(12, 2), nullable=False),
sa.Column("ni_contributions", sa.Numeric(12, 2), nullable=False),
sa.Column("source", sa.String(), nullable=False, server_default="hmrc-held"),
sa.Column(
"raw_response",
postgresql.JSONB().with_variant(sa.JSON(), "sqlite"),
nullable=False,
),
sa.Column(
"fetched_at",
sa.TIMESTAMP(timezone=True),
nullable=False,
server_default=sa.text("now()"),
),
sa.UniqueConstraint(
"tax_year",
"employer_paye_ref",
"snapshot_date",
name="uq_tax_year_snapshot",
),
schema=SCHEMA,
)
op.create_index(
"ix_tax_year_snapshot_tax_year",
"tax_year_snapshot",
["tax_year"],
schema=SCHEMA,
)
op.create_table(
"fetch_log",
sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
sa.Column("endpoint", sa.String(), nullable=False),
sa.Column("status_code", sa.Integer(), nullable=False),
sa.Column("request_id", sa.String(), nullable=True),
sa.Column("correlation_id", sa.String(), nullable=True),
sa.Column(
"fraud_headers_sent",
postgresql.JSONB().with_variant(sa.JSON(), "sqlite"),
nullable=False,
),
sa.Column("response_snippet", sa.String(), nullable=True),
sa.Column("duration_ms", sa.Integer(), nullable=False),
sa.Column(
"fetched_at",
sa.TIMESTAMP(timezone=True),
nullable=False,
server_default=sa.text("now()"),
),
schema=SCHEMA,
)
def downgrade() -> None:
op.drop_table("fetch_log", schema=SCHEMA)
op.drop_index("ix_tax_year_snapshot_tax_year", table_name="tax_year_snapshot", schema=SCHEMA)
op.drop_table("tax_year_snapshot", schema=SCHEMA)

296
headless_auth.py Normal file
View file

@ -0,0 +1,296 @@
"""Headless HMRC sandbox OAuth — drives Chromium via Playwright.
Logs in as a sandbox test user without needing a human in the loop,
captures the authorization code from the localhost callback (the
callback URL is never actually fetched we abort the navigation and
read the URL), exchanges for tokens, saves them to a cache file, then
optionally calls an API endpoint.
Creds + test user credentials are read from Vault. The token cache
lives at ~/.cache/hmrc-sync/tokens.json and can be reused across runs
until the refresh_token expires (18 months).
Usage:
python3 headless_auth.py login --user-id 228488477217 --password VLAFXYsz4Uqk
python3 headless_auth.py call --utr 2762163393 --tax-year 2015-16
python3 headless_auth.py refresh
"""
from __future__ import annotations
import argparse
import contextlib
import json
import os
import secrets
import subprocess
import sys
import time
import urllib.parse
from dataclasses import dataclass
from pathlib import Path
import httpx
from playwright.sync_api import sync_playwright
SANDBOX_BASE = "https://test-api.service.hmrc.gov.uk"
AUTH_PATH = "/oauth/authorize"
TOKEN_PATH = "/oauth/token"
INCOME_PATH = "/individual-income/sa/{utr}/annual-summary/{tax_year}"
INCOME_ACCEPT = "application/vnd.hmrc.1.2+json"
REDIRECT_URI = "http://localhost:8080/oauth/callback"
SCOPE = "read:individual-income"
CACHE_DIR = Path.home() / ".cache" / "hmrc-sync"
TOKEN_CACHE = CACHE_DIR / "tokens.json"
@dataclass
class Creds:
client_id: str
client_secret: str
def load_creds() -> Creds:
env_id = os.environ.get("HMRC_CLIENT_ID")
env_secret = os.environ.get("HMRC_CLIENT_SECRET")
if env_id and env_secret:
return Creds(env_id, env_secret)
cid = subprocess.check_output(
["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_id", "secret/viktor"],
text=True,
).strip()
csec = subprocess.check_output(
["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_secret", "secret/viktor"],
text=True,
).strip()
return Creds(cid, csec)
def save_tokens(tok: dict) -> None:
CACHE_DIR.mkdir(parents=True, exist_ok=True)
tok_with_meta = dict(tok)
tok_with_meta["_cached_at"] = int(time.time())
TOKEN_CACHE.write_text(json.dumps(tok_with_meta, indent=2))
TOKEN_CACHE.chmod(0o600)
def load_tokens() -> dict | None:
if not TOKEN_CACHE.exists():
return None
return json.loads(TOKEN_CACHE.read_text())
def authorize_url(creds: Creds, state: str) -> str:
return (
f"{SANDBOX_BASE}{AUTH_PATH}?"
+ urllib.parse.urlencode({
"response_type": "code",
"client_id": creds.client_id,
"scope": SCOPE,
"redirect_uri": REDIRECT_URI,
"state": state,
})
)
def headless_get_code(auth_url: str, user_id: str, password: str, state: str) -> str:
"""Drive Chromium through HMRC sandbox login and extract the auth code."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
ctx = browser.new_context()
captured_code: dict[str, str] = {}
# Abort any attempt to hit localhost:8080 and capture the URL that
# triggered it — that's the callback with ?code=...
def _intercept(route):
if "localhost:8080" in route.request.url:
parsed = urllib.parse.urlparse(route.request.url)
qs = urllib.parse.parse_qs(parsed.query)
captured_code["code"] = qs.get("code", [""])[0]
captured_code["state"] = qs.get("state", [""])[0]
route.abort()
else:
route.continue_()
ctx.route("**/*", _intercept)
page = ctx.new_page()
page.set_default_timeout(30000)
page.goto(auth_url)
page.wait_for_load_state("networkidle")
# Step 1 — cookie banner ("Reject additional cookies")
with contextlib.suppress(Exception):
page.get_by_role("button", name="Reject additional cookies").click(timeout=3000)
page.wait_for_load_state("networkidle")
with contextlib.suppress(Exception):
page.get_by_role("button", name="Hide cookie message").click(timeout=2000)
# Step 2 — intro page ("Allow your software to connect with HMRC" → Continue)
with contextlib.suppress(Exception):
page.get_by_role("button", name="Continue").first.click(timeout=5000)
page.wait_for_load_state("networkidle")
# Step 3 — login form
for sel in ["input[name='userId']", "input#userId", "input[name='user_id']", "#user_id"]:
try:
page.fill(sel, user_id, timeout=2000)
break
except Exception:
continue
for sel in ["input[name='password']", "input#password"]:
try:
page.fill(sel, password, timeout=2000)
break
except Exception:
continue
for sel in ["button[type='submit']", "button:has-text('Sign in')", "input[type='submit']"]:
try:
page.click(sel, timeout=2000)
page.wait_for_load_state("networkidle")
break
except Exception:
continue
# Step 4 — consent screen ("Grant authority")
deadline = time.time() + 20
while time.time() < deadline and "code" not in captured_code:
for sel in [
"button:has-text('Grant authority')",
"button:has-text('Continue')",
"button:has-text('Accept and continue')",
"#submit",
]:
try:
page.click(sel, timeout=1500)
break
except Exception:
continue
time.sleep(0.5)
browser.close()
if "code" not in captured_code or not captured_code["code"]:
raise SystemExit(f"Headless login failed to capture code. captured={captured_code}")
if captured_code.get("state") != state:
raise SystemExit(f"CSRF state mismatch: got {captured_code.get('state')!r}, want {state!r}")
return captured_code["code"]
def exchange_code(creds: Creds, code: str) -> dict:
r = httpx.post(
f"{SANDBOX_BASE}{TOKEN_PATH}",
data={
"grant_type": "authorization_code",
"client_id": creds.client_id,
"client_secret": creds.client_secret,
"redirect_uri": REDIRECT_URI,
"code": code,
},
headers={"Accept": "application/vnd.hmrc.1.0+json"},
timeout=30,
)
r.raise_for_status()
return r.json()
def refresh_tokens(creds: Creds, refresh_token: str) -> dict:
r = httpx.post(
f"{SANDBOX_BASE}{TOKEN_PATH}",
data={
"grant_type": "refresh_token",
"client_id": creds.client_id,
"client_secret": creds.client_secret,
"refresh_token": refresh_token,
},
headers={"Accept": "application/vnd.hmrc.1.0+json"},
timeout=30,
)
r.raise_for_status()
return r.json()
def get_access_or_die() -> str:
tok = load_tokens()
if not tok:
raise SystemExit("No cached tokens. Run: headless_auth.py login --user-id ... --password ...")
age = int(time.time()) - tok.get("_cached_at", 0)
if age < tok.get("expires_in", 14400) - 300:
return tok["access_token"]
# refresh
creds = load_creds()
new_tok = refresh_tokens(creds, tok["refresh_token"])
save_tokens(new_tok)
return new_tok["access_token"]
def call_income(utr: str, tax_year: str) -> int:
access = get_access_or_die()
url = f"{SANDBOX_BASE}{INCOME_PATH.format(utr=utr, tax_year=tax_year)}"
r = httpx.get(
url,
headers={"Accept": INCOME_ACCEPT, "Authorization": f"Bearer {access}"},
timeout=30,
)
print(f"GET /individual-income/sa/{utr}/annual-summary/{tax_year} -> HTTP {r.status_code}")
try:
print(json.dumps(r.json(), indent=2))
except Exception:
print(r.text)
return 0 if r.status_code < 400 else 2
def cmd_login(args) -> int:
creds = load_creds()
state = secrets.token_urlsafe(24)
url = authorize_url(creds, state)
print(f"Headless login → {SANDBOX_BASE}{AUTH_PATH} ...")
code = headless_get_code(url, args.user_id, args.password, state)
print(f"Got code: {code[:12]}...")
tok = exchange_code(creds, code)
save_tokens(tok)
print(f"Saved tokens to {TOKEN_CACHE}. expires_in={tok.get('expires_in')}s")
return 0
def cmd_refresh(_args) -> int:
tok = load_tokens()
if not tok:
raise SystemExit("No tokens to refresh.")
creds = load_creds()
new_tok = refresh_tokens(creds, tok["refresh_token"])
save_tokens(new_tok)
print(f"Refreshed. new expires_in={new_tok.get('expires_in')}s")
return 0
def cmd_call(args) -> int:
return call_income(args.utr, args.tax_year)
def main() -> int:
p = argparse.ArgumentParser()
sub = p.add_subparsers(dest="cmd", required=True)
pl = sub.add_parser("login")
pl.add_argument("--user-id", required=True)
pl.add_argument("--password", required=True)
pl.set_defaults(func=cmd_login)
pr = sub.add_parser("refresh")
pr.set_defaults(func=cmd_refresh)
pc = sub.add_parser("call")
pc.add_argument("--utr", required=True)
pc.add_argument("--tax-year", default="2015-16")
pc.set_defaults(func=cmd_call)
args = p.parse_args()
return args.func(args)
if __name__ == "__main__":
sys.exit(main())

0
hmrc_sync/__init__.py Normal file
View file

36
hmrc_sync/__main__.py Normal file
View file

@ -0,0 +1,36 @@
import logging
import os
import subprocess
import sys
import click
import uvicorn
@click.group()
def cli() -> None:
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))
@cli.command()
def serve() -> None:
"""Run the FastAPI server (K8s entrypoint)."""
uvicorn.run("hmrc_sync.app:app", host="0.0.0.0", port=8080)
@cli.command()
@click.option("--tax-year", default="current", help="Tax year to fetch, e.g. 2024-25 or 'current'.")
def sync(tax_year: str) -> None:
"""One-shot sync of HMRC figures — used by the CronJob."""
raise click.ClickException("Sync stub — implement after HMRC prod approval lands")
@cli.command()
def migrate() -> None:
"""Run `alembic upgrade head`."""
result = subprocess.run(["alembic", "upgrade", "head"], check=False)
sys.exit(result.returncode)
if __name__ == "__main__":
cli()

129
hmrc_sync/app.py Normal file
View file

@ -0,0 +1,129 @@
"""FastAPI entrypoint for hmrc-sync.
Endpoints:
- GET /authorize redirect to HMRC OAuth, primes refresh_token
- GET /callback OAuth callback; exchange code, persist token
- POST /callback-metadata browser-side session attributes (fraud headers)
- POST /sync pull latest HMRC figures for a given tax year
- GET /healthz readiness + queue depth
"""
from __future__ import annotations
import logging
import os
import secrets
import urllib.parse
from contextlib import asynccontextmanager
from typing import Any
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import HTMLResponse, RedirectResponse
from prometheus_fastapi_instrumentator import Instrumentator
from hmrc_sync import oauth
from hmrc_sync.fraud_headers import SessionContext
log = logging.getLogger(__name__)
REQUIRED_ENV = [
"HMRC_PROD_CLIENT_ID",
"HMRC_PROD_CLIENT_SECRET",
"HMRC_PROD_REDIRECT_URI",
"DB_CONNECTION_STRING",
]
def _verify_env() -> None:
missing = [k for k in REQUIRED_ENV if not os.environ.get(k)]
if missing:
raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
@asynccontextmanager
async def lifespan(app: FastAPI): # type: ignore[no-untyped-def]
_verify_env()
app.state.session_context = SessionContext(
device_id=os.environ.get("HMRC_DEVICE_ID", ""),
public_ip=os.environ.get("HMRC_VENDOR_PUBLIC_IP", ""),
)
app.state.oauth_states = {} # anti-CSRF state → expires_at
yield
app = FastAPI(title="HMRC Sync", lifespan=lifespan)
Instrumentator().instrument(app).expose(app, endpoint="/metrics")
@app.get("/healthz")
async def healthz() -> dict[str, Any]:
return {"status": "ok"}
@app.get("/authorize")
async def authorize() -> RedirectResponse:
creds = oauth.load_creds_from_env()
state = secrets.token_urlsafe(24)
app.state.oauth_states[state] = True
params = urllib.parse.urlencode({
"response_type": "code",
"client_id": creds.client_id,
"scope": "read:self-assessment",
"redirect_uri": creds.redirect_uri,
"state": state,
})
return RedirectResponse(f"{oauth.PROD_BASE}/oauth/authorize?{params}")
@app.get("/callback", response_class=HTMLResponse)
async def callback(code: str, state: str) -> HTMLResponse:
if state not in app.state.oauth_states:
raise HTTPException(status_code=400, detail="unknown state (CSRF)")
del app.state.oauth_states[state]
creds = oauth.load_creds_from_env()
token = await oauth.exchange_code(creds, code)
oauth.persist_to_vault(token)
# Serve a 1-page form that POSTs browser attributes to /callback-metadata
# so we capture the per-session values HMRC wants in fraud headers.
return HTMLResponse(_metadata_capture_html())
@app.post("/callback-metadata")
async def callback_metadata(request: Request) -> dict[str, str]:
body = await request.json()
session: SessionContext = app.state.session_context
session.user_agent = str(body.get("user_agent", "") or "")
session.screen_width = int(body.get("screen_width", 0) or 0)
session.screen_height = int(body.get("screen_height", 0) or 0)
session.screen_colour_depth = int(body.get("screen_colour_depth", 0) or 0)
session.window_width = int(body.get("window_width", 0) or 0)
session.window_height = int(body.get("window_height", 0) or 0)
session.timezone_offset = int(body.get("timezone_offset", 0) or 0)
return {"status": "captured"}
@app.post("/sync")
async def sync(tax_year: str | None = None) -> dict[str, Any]:
"""Pull latest HMRC figures for `tax_year` (default: current fiscal year)."""
raise HTTPException(status_code=501, detail="Sync not yet implemented — awaiting HMRC prod approval")
def _metadata_capture_html() -> str:
return """<!doctype html>
<html><head><title>hmrc-sync capturing session</title></head><body>
<h2>Capturing session attributes for HMRC fraud headers...</h2>
<script>
fetch('/callback-metadata', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({
user_agent: navigator.userAgent,
screen_width: screen.width,
screen_height: screen.height,
screen_colour_depth: screen.colorDepth,
window_width: window.innerWidth,
window_height: window.innerHeight,
timezone_offset: -new Date().getTimezoneOffset()
})
}).then(() => document.body.innerHTML = '<h2>Done. You can close this tab.</h2>');
</script>
</body></html>"""

82
hmrc_sync/client.py Normal file
View file

@ -0,0 +1,82 @@
"""HMRC Individual Tax API v1.1 wrapper.
One method per endpoint we consume. Every request attaches the full fraud-
prevention header set built by `fraud_headers.build_headers()`.
Individual Tax API v1.1 returns tax-paid + income-breakdown figures per
employment per tax year exactly the ground-truth data we reconcile
against the payslip-ingest monthly aggregate.
"""
from __future__ import annotations
import logging
import time
from dataclasses import dataclass
from typing import Any
import httpx
from hmrc_sync.fraud_headers import SessionContext, build_headers
log = logging.getLogger(__name__)
PROD_BASE = "https://api.service.hmrc.gov.uk"
INDIVIDUAL_TAX_VERSION = "application/vnd.hmrc.1.1+json"
@dataclass
class HmrcResponse:
status_code: int
body: dict[str, Any]
duration_ms: int
request_id: str | None
correlation_id: str | None
fraud_headers_sent: dict[str, str]
class HmrcClient:
def __init__(self,
access_token: str,
session: SessionContext,
connection_method: str = "BATCH_PROCESS_DIRECT",
base_url: str = PROD_BASE):
self._access_token = access_token
self._session = session
self._connection_method = connection_method
self._base_url = base_url.rstrip("/")
async def individual_tax_summary(self, utr: str, tax_year: str) -> HmrcResponse:
"""GET /individuals/tax/sa/{utr}/summary/{taxYear}
`utr` is the 10-digit Self Assessment reference; tax_year format
is `YYYY-YY` (e.g. `2024-25`).
"""
path = f"/individuals/tax/sa/{utr}/summary/{tax_year}"
return await self._get(path)
async def _get(self, path: str) -> HmrcResponse:
fraud = build_headers(self._session, self._connection_method)
headers = {
"Accept": INDIVIDUAL_TAX_VERSION,
"Authorization": f"Bearer {self._access_token}",
}
headers.update(fraud)
started = time.perf_counter()
async with httpx.AsyncClient(timeout=30.0) as client:
resp = await client.get(f"{self._base_url}{path}", headers=headers)
duration_ms = int((time.perf_counter() - started) * 1000)
body: dict[str, Any]
try:
body = resp.json() if resp.content else {}
except ValueError:
body = {"raw": resp.text[:2000]}
log.info("hmrc %s status=%s duration=%dms", path, resp.status_code, duration_ms)
return HmrcResponse(
status_code=resp.status_code,
body=body,
duration_ms=duration_ms,
request_id=resp.headers.get("x-request-id"),
correlation_id=resp.headers.get("x-correlation-id"),
fraud_headers_sent=fraud,
)

70
hmrc_sync/db.py Normal file
View file

@ -0,0 +1,70 @@
import os
from datetime import datetime
from decimal import Decimal
from typing import Any
from sqlalchemy import JSON, TIMESTAMP, Integer, Numeric, String, text
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.ext.asyncio import AsyncEngine, async_sessionmaker, create_async_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
SCHEMA_NAME = "hmrc_sync"
class Base(DeclarativeBase):
pass
JSON_TYPE = JSONB().with_variant(JSON(), "sqlite")
class TaxYearSnapshot(Base):
"""One row per (tax_year, employer_paye_ref, snapshot_date).
HMRC returns the `hmrc-held` view of annual PAYE/NI for a given
employment. Taking a daily snapshot lets us see HMRC's figures evolve
as late RTI filings land, and lets the dashboard always show the
latest value by snapshot_date.
"""
__tablename__ = "tax_year_snapshot"
__table_args__ = {"schema": SCHEMA_NAME} # noqa: RUF012
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
tax_year: Mapped[str] = mapped_column(String, nullable=False, index=True)
employer_paye_ref: Mapped[str] = mapped_column(String, nullable=False)
snapshot_date: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
gross_pay: Mapped[Decimal] = mapped_column(Numeric(12, 2), nullable=False)
income_tax: Mapped[Decimal] = mapped_column(Numeric(12, 2), nullable=False)
ni_contributions: Mapped[Decimal] = mapped_column(Numeric(12, 2), nullable=False)
source: Mapped[str] = mapped_column(String, nullable=False, server_default="hmrc-held")
raw_response: Mapped[dict[str, Any]] = mapped_column(JSON_TYPE, nullable=False)
fetched_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True),
nullable=False,
server_default=text("now()"))
class FetchLog(Base):
"""Audit trail of every HMRC API call — for fraud-header compliance review."""
__tablename__ = "fetch_log"
__table_args__ = {"schema": SCHEMA_NAME} # noqa: RUF012
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
endpoint: Mapped[str] = mapped_column(String, nullable=False)
status_code: Mapped[int] = mapped_column(Integer, nullable=False)
request_id: Mapped[str | None] = mapped_column(String, nullable=True)
correlation_id: Mapped[str | None] = mapped_column(String, nullable=True)
fraud_headers_sent: Mapped[dict[str, Any]] = mapped_column(JSON_TYPE, nullable=False)
response_snippet: Mapped[str | None] = mapped_column(String, nullable=True)
duration_ms: Mapped[int] = mapped_column(Integer, nullable=False)
fetched_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True),
nullable=False,
server_default=text("now()"))
def create_engine_from_env() -> AsyncEngine:
url = os.environ["DB_CONNECTION_STRING"]
return create_async_engine(url, pool_pre_ping=True)
def make_session_factory(engine: AsyncEngine) -> async_sessionmaker[Any]:
return async_sessionmaker(engine, expire_on_commit=False)

341
hmrc_sync/fraud_headers.py Normal file
View file

@ -0,0 +1,341 @@
"""Build HMRC MTD fraud-prevention headers (Gov-Client-* / Gov-Vendor-*).
HMRC's BATCH_PROCESS_DIRECT connection method (what our CronJob uses)
mandates 11 headers on every MTD API call; WEB_APP_VIA_SERVER adds a
handful of browser-derived fields. Shipping without these risks fines
and API-access revocation per the HMRC fraud-prevention guide.
Layout:
- **Static** vendor-constant across runs (product name/version,
hashed license id).
- **Runtime** collected at module load from the pod's own network
stack + OS: MAC addresses, local IPs, OS family/version, device model.
- **Per-request** built at call time (timestamps, request ids).
- **Per-session** captured from the browser on `/callback-metadata`
(screen dimensions, public IP, MFA timestamp). Only WEB_APP_VIA_SERVER.
The public entry point is `build_headers(session, connection_method)`.
Run `tests/test_fraud_headers.py::test_headers_pass_hmrc_validator`
with `HMRC_VALIDATOR=1` to verify against the HMRC sandbox validator.
Spec references:
https://developer.service.hmrc.gov.uk/guides/fraud-prevention/
https://developer.service.hmrc.gov.uk/guides/fraud-prevention/connection-method/batch-process-direct/
https://developer.service.hmrc.gov.uk/api-documentation/docs/api/service/txm-fph-validator-api/1.0
"""
from __future__ import annotations
import getpass
import hashlib
import logging
import os
import platform
import secrets
import socket
import time
import urllib.parse
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
log = logging.getLogger(__name__)
VENDOR_PRODUCT_NAME = "hmrc-sync"
VENDOR_PRODUCT_VERSION = "0.1.0"
# Self-assigned for a personal single-user tool. HMRC permits arbitrary
# vendor strings; the header value is then SHA-256-hashed per spec
# (`Gov-Vendor-License-IDs: <name>=<hashed-value>`).
VENDOR_LICENSE_ID = os.environ.get("HMRC_VENDOR_LICENSE_ID",
"hmrc-sync-private-single-user")
VENDOR_PUBLIC_IP = os.environ.get("HMRC_VENDOR_PUBLIC_IP", "")
# Valid HMRC connection-method enum values.
CONNECTION_METHOD_BATCH = "BATCH_PROCESS_DIRECT"
CONNECTION_METHOD_WEB_APP = "WEB_APP_VIA_SERVER"
CONNECTION_METHOD_MFA = "AUTH_USING_MFA"
_NET_CLASS = Path("/sys/class/net")
_EMPTY_MAC = "00:00:00:00:00:00"
@dataclass
class SessionContext:
"""Browser-side attributes captured on the `/callback-metadata` POST.
Only relevant for WEB_APP_VIA_SERVER flows (browser-initiated OAuth
+ server-side API calls). BATCH_PROCESS_DIRECT flows derive their
context from `RuntimeContext` (see below) without touching these.
"""
user_agent: str = ""
screen_width: int = 0
screen_height: int = 0
screen_colour_depth: int = 0
window_width: int = 0
window_height: int = 0
timezone_offset: int = 0
device_id: str = ""
mfa_timestamp: str = ""
public_ip: str = ""
public_port: int = 0
@dataclass
class RuntimeContext:
"""Pod-side environment values required on every API call.
Collected once at module load (cheap all local syscalls). If any
field is empty, the header emitter falls back to safe defaults so
the call never goes out with an empty mandatory header.
"""
mac_addresses: list[str] = field(default_factory=list)
local_ips: list[str] = field(default_factory=list)
os_family: str = ""
os_version: str = ""
device_manufacturer: str = "Kubernetes"
device_model: str = ""
os_user: str = ""
def _collect_mac_addresses() -> list[str]:
"""Read every non-loopback interface MAC from `/sys/class/net/*/address`.
Colons are kept raw; `_format_mac_list` percent-encodes on output per spec.
"""
out: list[str] = []
if not _NET_CLASS.exists():
return out
for iface in sorted(_NET_CLASS.iterdir()):
if iface.name == "lo":
continue
addr_file = iface / "address"
if not addr_file.exists():
continue
try:
mac = addr_file.read_text().strip()
except OSError:
continue
if mac and mac != _EMPTY_MAC:
out.append(mac)
return out
def _collect_local_ips() -> list[str]:
"""Every IP bound to this host — IPv4 + IPv6, loopback excluded."""
ips: set[str] = set()
try:
hostname = socket.gethostname()
for family, _, _, _, sockaddr in socket.getaddrinfo(hostname, None):
raw = sockaddr[0]
if not isinstance(raw, str):
continue
if family == socket.AF_INET and not raw.startswith("127."):
ips.add(raw)
elif family == socket.AF_INET6 and not raw.startswith("::1"):
ips.add(raw.split("%")[0]) # strip zone id
except (socket.gaierror, OSError):
pass
# Also grab the primary outbound IP — `getaddrinfo(hostname)` can miss
# it inside containers whose hostname has no DNS entry.
try:
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
s.connect(("10.255.255.255", 1))
ips.add(s.getsockname()[0])
except OSError:
pass
return sorted(ips)
def _detect_runtime_context() -> RuntimeContext:
uname = platform.uname()
return RuntimeContext(
mac_addresses=_collect_mac_addresses(),
local_ips=_collect_local_ips(),
os_family=uname.system or "Linux",
os_version=uname.release or "unknown",
device_manufacturer="Kubernetes",
device_model=uname.node or socket.gethostname() or "pod",
os_user=_safe_getuser(),
)
def _safe_getuser() -> str:
try:
return getpass.getuser()
except (KeyError, OSError):
return os.environ.get("USER", "unknown")
RUNTIME_CONTEXT: RuntimeContext = _detect_runtime_context()
def build_headers(session: SessionContext | None = None,
connection_method: str = CONNECTION_METHOD_BATCH,
runtime: RuntimeContext | None = None) -> dict[str, str]:
"""Return the full header dict to attach to every HMRC API call.
Defaults to BATCH_PROCESS_DIRECT the mode the CronJob uses. Pass
a populated `SessionContext` + `connection_method=WEB_APP_VIA_SERVER`
for browser-initiated flows; the browser-only fields layer on top.
"""
session = session or SessionContext()
rt = runtime or RUNTIME_CONTEXT
headers: dict[str, str] = {}
headers.update(_static_headers())
headers.update(_per_request_headers())
headers.update(_mandatory_runtime_headers(rt, session, connection_method))
if connection_method == CONNECTION_METHOD_WEB_APP:
headers.update(_web_app_session_headers(session))
if connection_method == CONNECTION_METHOD_MFA and session.mfa_timestamp:
headers["Gov-Client-MFA-Timestamp"] = session.mfa_timestamp
return headers
def _static_headers() -> dict[str, str]:
"""Vendor-constant identity headers that apply to every connection method.
Product-Name is percent-encoded per spec; License-IDs value is SHA-256-
hashed per spec; Version is a key-value pair of `<software-name>=<version>`.
Gov-Vendor-Public-IP and Gov-Vendor-Forwarded are NOT emitted here the
HMRC validator rejects them for BATCH_PROCESS_DIRECT (where no vendor
server sits between the client and the HMRC API). They're added in
`_web_app_session_headers` for the WEB_APP_VIA_SERVER path only.
"""
license_hash = hashlib.sha256(VENDOR_LICENSE_ID.encode()).hexdigest()
return {
"Gov-Vendor-Product-Name": _pct(VENDOR_PRODUCT_NAME),
"Gov-Vendor-Version": f"{VENDOR_PRODUCT_NAME}={VENDOR_PRODUCT_VERSION}",
"Gov-Vendor-License-IDs": f"{VENDOR_PRODUCT_NAME}={license_hash}",
}
def _per_request_headers() -> dict[str, str]:
"""Per-call trace + timestamp headers. Local-IPs-Timestamp uses HMRC's
exact format `yyyy-MM-ddThh:mm:ss.sssZ` always UTC, always millis."""
now_ms = int(time.time() * 1000)
iso_ms = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now_ms / 1000))
now_iso = f"{iso_ms}.{now_ms % 1000:03d}Z"
return {
"Gov-Client-Timezone": "UTC+00:00",
"Gov-Client-Local-IPs-Timestamp": now_iso,
"x-correlation-id": str(uuid.uuid4()),
"x-request-id": secrets.token_hex(16),
}
def _mandatory_runtime_headers(rt: RuntimeContext, session: SessionContext,
connection_method: str) -> dict[str, str]:
"""The 8 headers mandatory for BATCH_PROCESS_DIRECT that come from the
host Connection-Method, Device-ID, User-IDs, User-Agent, Local-IPs,
MAC-Addresses (+ Timezone and Local-IPs-Timestamp live in
`_per_request_headers`)."""
return {
"Gov-Client-Connection-Method": connection_method,
"Gov-Client-Device-ID": session.device_id or _fallback_device_id(),
"Gov-Client-User-IDs": _user_ids(rt, session),
"Gov-Client-User-Agent": _user_agent(rt, session),
"Gov-Client-Local-IPs": _format_ip_list(rt.local_ips),
"Gov-Client-MAC-Addresses": _format_mac_list(rt.mac_addresses),
}
def _web_app_session_headers(session: SessionContext) -> dict[str, str]:
"""WEB_APP_VIA_SERVER-only headers — browser context + vendor hop trail.
Gov-Vendor-Public-IP and Gov-Vendor-Forwarded describe the vendor server
that sits between the user's browser and HMRC — only meaningful for
WEB_APP_VIA_SERVER. BATCH_PROCESS_DIRECT must omit them (validator
rejects them there).
"""
out: dict[str, str] = {}
if session.screen_width and session.screen_height:
out["Gov-Client-Screens"] = (
f"width={session.screen_width}&height={session.screen_height}"
f"&scaling-factor=1&colour-depth={session.screen_colour_depth}")
if session.window_width and session.window_height:
out["Gov-Client-Window-Size"] = (f"width={session.window_width}&"
f"height={session.window_height}")
if session.public_ip:
out["Gov-Client-Public-IP"] = session.public_ip
if session.public_port:
out["Gov-Client-Public-Port"] = str(session.public_port)
vendor_ip = VENDOR_PUBLIC_IP or (RUNTIME_CONTEXT.local_ips[0] if RUNTIME_CONTEXT.local_ips
else "")
if vendor_ip:
out["Gov-Vendor-Public-IP"] = vendor_ip
out["Gov-Vendor-Forwarded"] = f"by={vendor_ip}&for={vendor_ip}"
return out
def _user_ids(rt: RuntimeContext, session: SessionContext) -> str:
"""Per spec: `os=<device-user>&<app>=<app-user>`. The `os=` field is
mandatory; we additionally tag our application with the OAuth user
so HMRC can correlate activity in a breach investigation.
"""
os_user = rt.os_user or "unknown"
pairs = [f"os={_pct(os_user)}"]
oauth_user = os.environ.get("HMRC_OAUTH_USER", "viktor")
pairs.append(f"hmrc-sync={_pct(oauth_user)}")
_ = session # reserved for future per-session identity extension
return "&".join(pairs)
def _user_agent(rt: RuntimeContext, session: SessionContext) -> str:
"""Per spec: `os-family=…&os-version=…&device-manufacturer=…&device-model=…`.
For WEB_APP_VIA_SERVER with a captured browser UA, the browser string
is encoded under `device-model` with the rest of the fields defaulting
to our pod's values — HMRC's validator accepts this hybrid shape.
"""
model = session.user_agent or rt.device_model or "pod"
pairs = [
f"os-family={_pct(rt.os_family)}",
f"os-version={_pct(rt.os_version)}",
f"device-manufacturer={_pct(rt.device_manufacturer)}",
f"device-model={_pct(model)}",
]
return "&".join(pairs)
def _format_ip_list(ips: list[str]) -> str:
"""IPv6 addresses percent-encoded; IPv4 passes through. Joined with ','.
HMRC's validator accepts an empty header only if the request truly
has no IPs; on a live pod we always have at least one if the list
comes back empty we fall back to the loopback so the header is
syntactically valid.
"""
if not ips:
return "127.0.0.1"
out = []
for ip in ips:
out.append(_pct(ip) if ":" in ip else ip)
return ",".join(out)
def _format_mac_list(macs: list[str]) -> str:
"""Each MAC percent-encoded (colons → %3A), comma-joined.
Empty list single dummy MAC so we never ship a blank header;
HMRC's validator treats blank as a violation.
"""
if not macs:
return _pct("02:00:00:00:00:00")
return ",".join(_pct(m) for m in macs)
def _fallback_device_id() -> str:
"""Deterministic UUID derived from hostname when no Vault-backed
Device-ID is seeded. Stable across restarts on the same node."""
return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"hmrc-sync-{platform.node()}"))
def _pct(s: str) -> str:
return urllib.parse.quote(s, safe="")
def as_validator_payload(headers: dict[str, str]) -> dict[str, Any]:
"""Reshape headers for the HMRC fraud-header validator API body."""
return {"headers": [{"name": k, "value": v} for k, v in headers.items()]}

125
hmrc_sync/oauth.py Normal file
View file

@ -0,0 +1,125 @@
"""HMRC OAuth token persistence — Vault-backed refresh-token store.
The refresh_token is long-lived (HMRC grants 18 months). We keep it in
Vault at `secret/viktor/hmrc_refresh_token` and let ESO sync it to a K8s
Secret the pod mounts as an env var. On every refresh, we write the new
token back to Vault so the next pod restart picks it up.
Writing back requires Vault write access the pod uses a short-lived
K8s-auth Vault token with a narrow policy that only allows writing
`secret/viktor/hmrc_refresh_token`.
"""
from __future__ import annotations
import logging
import os
from dataclasses import dataclass
import httpx
log = logging.getLogger(__name__)
VAULT_KEY = "secret/viktor/hmrc_refresh_token"
PROD_BASE = "https://api.service.hmrc.gov.uk"
TOKEN_PATH = "/oauth/token"
@dataclass(frozen=True)
class OAuthCreds:
client_id: str
client_secret: str
redirect_uri: str
@dataclass
class TokenBundle:
access_token: str
refresh_token: str
expires_in: int
scope: str
@classmethod
def from_json(cls, data: dict[str, object]) -> TokenBundle:
return cls(
access_token=str(data["access_token"]),
refresh_token=str(data["refresh_token"]),
expires_in=int(data["expires_in"]), # type: ignore[arg-type]
scope=str(data.get("scope", "")),
)
def load_creds_from_env() -> OAuthCreds:
return OAuthCreds(
client_id=os.environ["HMRC_PROD_CLIENT_ID"],
client_secret=os.environ["HMRC_PROD_CLIENT_SECRET"],
redirect_uri=os.environ["HMRC_PROD_REDIRECT_URI"],
)
async def exchange_code(creds: OAuthCreds, code: str) -> TokenBundle:
"""Swap a fresh authorization_code for an access+refresh token pair."""
async with httpx.AsyncClient(timeout=30.0) as client:
resp = await client.post(
f"{PROD_BASE}{TOKEN_PATH}",
data={
"grant_type": "authorization_code",
"client_id": creds.client_id,
"client_secret": creds.client_secret,
"redirect_uri": creds.redirect_uri,
"code": code,
},
headers={"Accept": "application/vnd.hmrc.1.0+json"},
)
resp.raise_for_status()
return TokenBundle.from_json(resp.json())
async def refresh(creds: OAuthCreds, refresh_token: str) -> TokenBundle:
"""Exchange an old refresh_token for a fresh access+refresh pair.
HMRC rotates the refresh_token on every refresh the old one becomes
invalid immediately after this call returns. Persist the new one to
Vault atomically; a failure between the refresh and the Vault write
leaves us stranded.
"""
async with httpx.AsyncClient(timeout=30.0) as client:
resp = await client.post(
f"{PROD_BASE}{TOKEN_PATH}",
data={
"grant_type": "refresh_token",
"client_id": creds.client_id,
"client_secret": creds.client_secret,
"refresh_token": refresh_token,
},
headers={"Accept": "application/vnd.hmrc.1.0+json"},
)
resp.raise_for_status()
return TokenBundle.from_json(resp.json())
def persist_to_vault(token: TokenBundle) -> None:
"""Write the new refresh_token back to Vault.
Uses the hvac client with K8s-auth the pod's service-account token
at /var/run/secrets/kubernetes.io/serviceaccount/token logs into
Vault's kubernetes auth method and receives a short-lived Vault token
with write access to `secret/viktor/hmrc_refresh_token` only.
"""
import hvac
addr = os.environ.get("VAULT_ADDR", "https://vault.viktorbarzin.me")
role = os.environ.get("VAULT_K8S_ROLE", "hmrc-sync")
jwt_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
with open(jwt_path, encoding="utf-8") as fh:
jwt = fh.read()
client = hvac.Client(url=addr)
client.auth.kubernetes.login(role=role, jwt=jwt)
client.secrets.kv.v2.create_or_update_secret(
path="viktor/hmrc_refresh_token",
secret={
"refresh_token": token.refresh_token,
"expires_in": token.expires_in,
"scope": token.scope,
},
)
log.info("Rotated HMRC refresh_token persisted to Vault")

178
oauth_dance.py Normal file
View file

@ -0,0 +1,178 @@
"""Phase-1 HMRC MTD OAuth sandbox smoke test.
Runs the authorization_code flow against HMRC's test environment, captures
the callback on localhost:8080, exchanges for tokens, then calls
/individuals/income-received/employments/{nino}/{taxYear} for a test user.
Prerequisites (do once in the HMRC dev hub for the app):
1. Add http://localhost:8080/oauth/callback as a Redirect URI.
2. Subscribe to "Individuals Income Received API" (and accept terms).
3. Create a sandbox test user (Individuals Create Test User) and note
the NINO + Government Gateway user ID + password.
Credentials are read from Vault (secret/viktor/hmrc_mtd_sandbox_client_{id,secret})
with env-var fallback for portability.
Run:
python3 oauth_dance.py --nino NH000000A --tax-year 2025-26
"""
from __future__ import annotations
import argparse
import http.server
import json
import os
import secrets
import socketserver
import sys
import threading
import urllib.parse
import webbrowser
from dataclasses import dataclass
import httpx
SANDBOX_BASE = "https://test-api.service.hmrc.gov.uk"
AUTH_PATH = "/oauth/authorize"
TOKEN_PATH = "/oauth/token"
# Legacy "Individual Income API" v1.2 — annual SA summary. Path uses
# the 10-digit Self-Assessment UTR, NOT the NINO. MTD
# "Individuals Income Received API" would be richer (in-year YTD) but
# isn't available to this app's subscription list.
INCOME_PATH = "/individual-income/sa/{utr}/annual-summary/{tax_year}"
INCOME_ACCEPT = "application/vnd.hmrc.1.2+json"
REDIRECT_URI = "http://localhost:8080/oauth/callback"
CALLBACK_PORT = 8080
SCOPE = "read:individual-income"
@dataclass
class Creds:
client_id: str
client_secret: str
def load_creds() -> Creds:
env_id = os.environ.get("HMRC_CLIENT_ID")
env_secret = os.environ.get("HMRC_CLIENT_SECRET")
if env_id and env_secret:
return Creds(env_id, env_secret)
import subprocess
cid = subprocess.check_output(
["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_id", "secret/viktor"],
text=True,
).strip()
csec = subprocess.check_output(
["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_secret", "secret/viktor"],
text=True,
).strip()
return Creds(cid, csec)
class _CallbackHandler(http.server.BaseHTTPRequestHandler):
captured: dict[str, str] = {}
def do_GET(self) -> None:
parsed = urllib.parse.urlparse(self.path)
if parsed.path != "/oauth/callback":
self.send_response(404)
self.end_headers()
return
qs = urllib.parse.parse_qs(parsed.query)
_CallbackHandler.captured.update({k: v[0] for k, v in qs.items()})
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.end_headers()
body = b"<h2>HMRC auth received. You can close this tab.</h2>"
self.wfile.write(body)
def log_message(self, *_args) -> None: # silence default stderr spam
pass
def run_callback_server_until_code(expected_state: str) -> dict[str, str]:
with socketserver.TCPServer(("127.0.0.1", CALLBACK_PORT), _CallbackHandler) as srv:
t = threading.Thread(target=srv.serve_forever, daemon=True)
t.start()
while "code" not in _CallbackHandler.captured and "error" not in _CallbackHandler.captured:
threading.Event().wait(0.25)
srv.shutdown()
got = dict(_CallbackHandler.captured)
if got.get("state") != expected_state:
raise SystemExit(f"CSRF: state mismatch (got {got.get('state')!r}, want {expected_state!r})")
if "error" in got:
raise SystemExit(f"HMRC returned error: {got}")
return got
def exchange_code(creds: Creds, code: str) -> dict:
r = httpx.post(
f"{SANDBOX_BASE}{TOKEN_PATH}",
data={
"grant_type": "authorization_code",
"client_id": creds.client_id,
"client_secret": creds.client_secret,
"redirect_uri": REDIRECT_URI,
"code": code,
},
headers={"Accept": "application/vnd.hmrc.1.0+json"},
timeout=30,
)
r.raise_for_status()
return r.json()
def call_income_received(access_token: str, utr: str, tax_year: str) -> httpx.Response:
"""tax_year is '2015-16' style (legacy Individual Income API)."""
url = f"{SANDBOX_BASE}{INCOME_PATH.format(utr=utr, tax_year=tax_year)}"
return httpx.get(
url,
headers={
"Accept": INCOME_ACCEPT,
"Authorization": f"Bearer {access_token}",
},
timeout=30,
)
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--utr", required=True, help="Sandbox test-user 10-digit SA UTR, e.g. 2762163393")
parser.add_argument("--tax-year", default="2015-16", help="Format 2015-16. Sandbox may only have canned data for certain years.")
args = parser.parse_args()
creds = load_creds()
state = secrets.token_urlsafe(24)
auth_url = (
f"{SANDBOX_BASE}{AUTH_PATH}?"
+ urllib.parse.urlencode({
"response_type": "code",
"client_id": creds.client_id,
"scope": SCOPE,
"redirect_uri": REDIRECT_URI,
"state": state,
})
)
print(f"Opening browser to HMRC sandbox login...\n {auth_url}\n")
webbrowser.open(auth_url)
captured = run_callback_server_until_code(expected_state=state)
print(f"Got auth code (truncated): {captured['code'][:12]}...")
tokens = exchange_code(creds, captured["code"])
access = tokens["access_token"]
print(f"Got access_token (exp {tokens.get('expires_in')}s), refresh_token present={('refresh_token' in tokens)}")
resp = call_income_received(access, args.utr, args.tax_year)
print(f"\nGET /individual-income/sa/{args.utr}/annual-summary/{args.tax_year} → HTTP {resp.status_code}")
try:
print(json.dumps(resp.json(), indent=2))
except Exception:
print(resp.text)
return 0 if resp.status_code < 400 else 2
if __name__ == "__main__":
sys.exit(main())

53
pyproject.toml Normal file
View file

@ -0,0 +1,53 @@
[tool.poetry]
name = "hmrc-sync"
version = "0.1.0"
description = "Pulls annual PAYE/NI from HMRC Individual Tax API v1.1 to reconcile against payslip-ingest"
authors = ["Viktor Barzin <viktorbarzin@meta.com>"]
readme = "README.md"
packages = [{ include = "hmrc_sync" }]
[tool.poetry.dependencies]
python = ">=3.12,<3.13"
fastapi = "^0.115"
uvicorn = "^0.32"
httpx = "^0.27"
pydantic = "^2.9"
sqlalchemy = { extras = ["asyncio"], version = "^2.0" }
asyncpg = "^0.29"
alembic = "^1.13"
click = "^8.1"
prometheus-fastapi-instrumentator = "^7.0"
hvac = "^2.3"
[tool.poetry.group.dev.dependencies]
pytest = "^8.3"
pytest-asyncio = "^0.23"
mypy = "^1.11"
ruff = "^0.6"
yapf = "^0.43"
respx = "^0.21"
aiosqlite = "^0.20"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
[tool.mypy]
python_version = "3.12"
strict = true
files = ["hmrc_sync", "tests"]
[tool.ruff]
line-length = 100
target-version = "py312"
[tool.ruff.lint]
select = ["E", "F", "W", "I", "UP", "B", "SIM", "RUF"]
[tool.yapf]
based_on_style = "pep8"
column_limit = 100

0
tests/__init__.py Normal file
View file

292
tests/test_fraud_headers.py Normal file
View file

@ -0,0 +1,292 @@
"""Fraud-header compliance checks.
Two layers:
1. **Local shape assertions** pure-python checks that every mandatory
Gov-Client-*/Gov-Vendor-* header is present and shaped per HMRC spec.
Runs in every CI build.
2. **HMRC validator API smoke test** (`test_headers_pass_hmrc_validator`):
POSTs the generated header set to the HMRC sandbox validator and
asserts a clean 200 with no rejected headers. Gated on the
`HMRC_VALIDATOR` env var so `pytest` still runs fine offline.
HMRC audits fraud headers during production-access review a failing
validator smoke test MUST block deploy.
Spec references (primary):
https://developer.service.hmrc.gov.uk/guides/fraud-prevention/connection-method/batch-process-direct/
https://developer.service.hmrc.gov.uk/api-documentation/docs/api/service/txm-fph-validator-api/1.0
"""
from __future__ import annotations
import hashlib
import os
import re
import httpx
import pytest
from hmrc_sync.fraud_headers import (
CONNECTION_METHOD_BATCH,
CONNECTION_METHOD_WEB_APP,
RUNTIME_CONTEXT,
VENDOR_LICENSE_ID,
VENDOR_PRODUCT_NAME,
RuntimeContext,
SessionContext,
as_validator_payload,
build_headers,
)
VALIDATOR_URL = (
"https://test-api.service.hmrc.gov.uk/test/fraud-prevention-headers/validate")
# Per HMRC BATCH_PROCESS_DIRECT spec (11 mandatory headers).
BATCH_MANDATORY = {
"Gov-Client-Connection-Method",
"Gov-Client-Device-ID",
"Gov-Client-Local-IPs",
"Gov-Client-Local-IPs-Timestamp",
"Gov-Client-MAC-Addresses",
"Gov-Client-Timezone",
"Gov-Client-User-Agent",
"Gov-Client-User-IDs",
"Gov-Vendor-License-IDs",
"Gov-Vendor-Product-Name",
"Gov-Vendor-Version",
}
# WEB_APP_VIA_SERVER adds browser-origin context on top of the batch set.
WEB_APP_EXTRAS = {
"Gov-Client-Screens",
"Gov-Client-Window-Size",
"Gov-Client-Public-IP",
"Gov-Client-Public-Port",
}
def _full_session() -> SessionContext:
return SessionContext(
user_agent="Mozilla/5.0 (X11; Linux x86_64) hmrc-sync-test",
screen_width=1920,
screen_height=1080,
screen_colour_depth=24,
window_width=1600,
window_height=900,
timezone_offset=0,
device_id="6c3a9f60-1111-2222-3333-abcdef012345",
public_ip="203.0.113.5",
public_port=443,
)
# --------------------------------------------------------------------------
# BATCH_PROCESS_DIRECT — the CronJob path. All 11 headers must be present.
# --------------------------------------------------------------------------
def test_batch_process_includes_all_11_mandatory_headers() -> None:
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
missing = BATCH_MANDATORY - hdrs.keys()
assert not missing, f"BATCH_PROCESS_DIRECT missing mandatory headers: {missing}"
def test_batch_process_omits_browser_only_headers() -> None:
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
# Screens / Window-Size are browser-origin; Public-IP/Port route via a
# client-facing IP which doesn't apply to a batch job.
for h in ("Gov-Client-Screens", "Gov-Client-Window-Size",
"Gov-Client-Public-IP", "Gov-Client-Public-Port"):
assert h not in hdrs, f"BATCH emitted browser-only header: {h}"
def test_batch_process_connection_method_value() -> None:
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
assert hdrs["Gov-Client-Connection-Method"] == "BATCH_PROCESS_DIRECT"
# --------------------------------------------------------------------------
# Header-value shape assertions (per HMRC spec).
# --------------------------------------------------------------------------
def test_user_ids_starts_with_os_field() -> None:
"""Per spec: `os=<device-user>&<app>=<app-user>`. `os=` is mandatory."""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Client-User-IDs"]
assert value.startswith("os="), f"User-IDs missing os= prefix: {value!r}"
# Key-value pairs separated by & — at least one beyond `os=`.
pairs = value.split("&")
assert len(pairs) >= 2, f"User-IDs should have app identifier too: {value!r}"
def test_user_agent_has_all_four_spec_fields() -> None:
"""Spec: `os-family=…&os-version=…&device-manufacturer=…&device-model=…`."""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Client-User-Agent"]
for key in ("os-family=", "os-version=", "device-manufacturer=", "device-model="):
assert key in value, f"User-Agent missing {key!r}: {value!r}"
def test_mac_addresses_percent_encoded() -> None:
"""Spec: colons in MACs must be percent-encoded (%3A)."""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Client-MAC-Addresses"]
assert value, "MAC-Addresses must never be empty"
assert ":" not in value, f"MAC-Addresses contains raw colons: {value!r}"
assert "%3A" in value, f"MAC-Addresses must use %3A: {value!r}"
def test_local_ips_ipv6_percent_encoded() -> None:
"""IPv6 entries percent-encoded; IPv4 passes through."""
hdrs = build_headers(
connection_method=CONNECTION_METHOD_BATCH,
runtime=_runtime_with_ips(["10.0.0.4", "fe80::1"]),
)
value = hdrs["Gov-Client-Local-IPs"]
assert "10.0.0.4" in value
assert "fe80::1" not in value # raw v6 forbidden
assert "fe80%3A%3A1" in value, f"IPv6 not encoded: {value!r}"
def test_vendor_license_id_is_sha256_hashed() -> None:
"""Spec: `Gov-Vendor-License-IDs: <name>=<hashed-value>`."""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Vendor-License-IDs"]
expected_hash = hashlib.sha256(VENDOR_LICENSE_ID.encode()).hexdigest()
assert value == f"{VENDOR_PRODUCT_NAME}={expected_hash}", value
# Hash must be 64 hex chars — catches accidental plaintext leakage.
assert re.fullmatch(r"[a-z0-9-]+=[0-9a-f]{64}", value), value
def test_vendor_product_name_percent_encoded() -> None:
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
assert hdrs["Gov-Vendor-Product-Name"] == "hmrc-sync" # no reserved chars in name
def test_vendor_version_format() -> None:
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Vendor-Version"]
assert re.fullmatch(r"[a-z0-9-]+=\d+\.\d+\.\d+", value), value
def test_local_ips_timestamp_spec_format() -> None:
"""Spec: `yyyy-MM-ddThh:mm:ss.sssZ` — 24-hour, UTC, 3-digit millis."""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Client-Local-IPs-Timestamp"]
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z", value), value
def test_timezone_utc_offset_format() -> None:
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
assert re.fullmatch(r"UTC[+-]\d{2}:\d{2}", hdrs["Gov-Client-Timezone"])
def test_device_id_is_valid_uuid() -> None:
"""UUID shape check: 8-4-4-4-12 hex — applies to fallback too."""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
value = hdrs["Gov-Client-Device-ID"]
assert re.fullmatch(
r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
value,
), value
# --------------------------------------------------------------------------
# MFA gating + per-call variance.
# --------------------------------------------------------------------------
def test_mfa_timestamp_only_emitted_for_mfa_method() -> None:
"""Gov-Client-MFA-Timestamp is for AUTH_USING_MFA; batch must not emit it."""
batch = build_headers(connection_method=CONNECTION_METHOD_BATCH)
assert "Gov-Client-MFA-Timestamp" not in batch
session = _full_session()
session.mfa_timestamp = "2026-04-19T21:30:00.000Z"
mfa = build_headers(session, connection_method="AUTH_USING_MFA")
assert mfa.get("Gov-Client-MFA-Timestamp") == "2026-04-19T21:30:00.000Z"
def test_correlation_id_differs_per_call() -> None:
a = build_headers(connection_method=CONNECTION_METHOD_BATCH)
b = build_headers(connection_method=CONNECTION_METHOD_BATCH)
assert a["x-correlation-id"] != b["x-correlation-id"]
# --------------------------------------------------------------------------
# WEB_APP_VIA_SERVER — batch set + browser extras.
# --------------------------------------------------------------------------
def test_web_app_includes_batch_mandatory_plus_browser_extras() -> None:
hdrs = build_headers(_full_session(), connection_method=CONNECTION_METHOD_WEB_APP)
missing = (BATCH_MANDATORY | WEB_APP_EXTRAS) - hdrs.keys()
assert not missing, f"WEB_APP missing headers: {missing}"
# --------------------------------------------------------------------------
# Payload reshape (used by the validator smoke test + CI self-tests).
# --------------------------------------------------------------------------
def test_as_validator_payload_reshape() -> None:
hdrs = {"Gov-Client-Connection-Method": "X", "Gov-Vendor-Product-Name": "y"}
payload = as_validator_payload(hdrs)
assert payload["headers"] == [
{"name": "Gov-Client-Connection-Method", "value": "X"},
{"name": "Gov-Vendor-Product-Name", "value": "y"},
]
# --------------------------------------------------------------------------
# HMRC sandbox validator smoke test — set HMRC_VALIDATOR=1 to enable.
# --------------------------------------------------------------------------
@pytest.mark.skipif(
not (os.environ.get("HMRC_VALIDATOR")
and os.environ.get("HMRC_SANDBOX_TOKEN")),
reason=("HMRC sandbox validator smoke test — set HMRC_VALIDATOR=1 AND "
"HMRC_SANDBOX_TOKEN=<app-token>. Dev Hub app must be subscribed "
"to txm-fph-validator-api/1.0 (application-restricted)."),
)
def test_headers_pass_hmrc_validator() -> None:
"""GET /test/fraud-prevention-headers/validate with BATCH headers.
Per the OAS spec the validator is a GET endpoint headers go in the
actual HTTP request, not a JSON body. Auth is application-restricted
(client_credentials bearer). A successful response has code=VALID_HEADERS;
POTENTIALLY_INVALID_HEADERS emits warnings but still passes; only
INVALID_HEADERS is a hard fail.
"""
hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
request_headers = {
**hdrs,
"Accept": "application/vnd.hmrc.1.0+json",
"Authorization": f"Bearer {os.environ['HMRC_SANDBOX_TOKEN']}",
}
resp = httpx.get(VALIDATOR_URL, headers=request_headers, timeout=30.0)
assert resp.status_code == 200, (
f"validator refused: {resp.status_code} {resp.text[:500]}")
body = resp.json()
code = body.get("code")
assert code != "INVALID_HEADERS", f"validator rejected: {body}"
# POTENTIALLY_INVALID_HEADERS is allowed — HMRC surfaces them as warnings;
# log for visibility but don't fail the build on them.
if code == "POTENTIALLY_INVALID_HEADERS":
print(f"validator warnings: {body.get('warnings')}")
def _runtime_with_ips(ips: list[str]) -> RuntimeContext:
"""Build a RuntimeContext override with caller-specified local_ips."""
return RuntimeContext(
mac_addresses=RUNTIME_CONTEXT.mac_addresses,
local_ips=ips,
os_family=RUNTIME_CONTEXT.os_family,
os_version=RUNTIME_CONTEXT.os_version,
device_manufacturer=RUNTIME_CONTEXT.device_manufacturer,
device_model=RUNTIME_CONTEXT.device_model,
os_user=RUNTIME_CONTEXT.os_user,
)