Initial extraction from monorepo
commit 5c7baa8acc
20 changed files with 1974 additions and 0 deletions
.gitignore — vendored, new file (+8)
@@ -0,0 +1,8 @@
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/
.hypothesis/
*.egg-info/
.woodpecker.yml — new file (+34)
@@ -0,0 +1,34 @@
when:
  event: push

clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      attempts: 5
      backoff: 10s

steps:
  - name: build-and-push
    image: woodpeckerci/plugin-docker-buildx
    settings:
      # Dual-push during the Forgejo registry consolidation bake. After
      # ≥14 days clean, registry.viktorbarzin.me drops out (Phase 4).
      repo:
        - registry.viktorbarzin.me/hmrc-sync
        - forgejo.viktorbarzin.me/viktor/hmrc-sync
      logins:
        - registry: registry.viktorbarzin.me
          username: viktorbarzin
          password:
            from_secret: registry-password
        - registry: forgejo.viktorbarzin.me
          username:
            from_secret: forgejo_user
          password:
            from_secret: forgejo_push_token
      dockerfile: Dockerfile
      context: .
      auto_tag: true
      platforms:
        - linux/amd64
Dockerfile — new file (+33)
@@ -0,0 +1,33 @@
FROM python:3.12-slim AS builder

ENV POETRY_VERSION=1.8.4 \
    POETRY_VIRTUALENVS_IN_PROJECT=true \
    PIP_NO_CACHE_DIR=1

RUN pip install --no-cache-dir "poetry==${POETRY_VERSION}"

WORKDIR /app
COPY pyproject.toml poetry.lock* README.md ./
RUN poetry install --only main --no-root

COPY hmrc_sync ./hmrc_sync
COPY alembic ./alembic
COPY alembic.ini ./alembic.ini
RUN poetry install --only main


FROM python:3.12-slim

WORKDIR /app

RUN useradd --system --uid 10002 --home /app --shell /usr/sbin/nologin hmrc

COPY --from=builder --chown=hmrc:hmrc /app /app

ENV PATH="/app/.venv/bin:${PATH}" \
    PYTHONUNBUFFERED=1

EXPOSE 8080
USER hmrc
ENTRYPOINT ["python", "-m", "hmrc_sync"]
CMD ["serve"]
README.md — new file (+90)
@@ -0,0 +1,90 @@
# hmrc-sync

Pulls annual PAYE/NI figures from **HMRC Individual Tax API v1.1** to
reconcile against the monthly payslip data captured by `payslip-ingest/`.

## Phase 1 — sandbox OAuth smoke test (shipped)

Scripts live at the repo root next to this README:

- `oauth_dance.py` — interactive browser OAuth flow against
  `test-api.service.hmrc.gov.uk`, captures the callback on
  `localhost:8080/oauth/callback`, exchanges for tokens, hits
  `/individual-income/sa/{utr}/annual-summary/{tax_year}`.
- `headless_auth.py` — same flow but driven by Chromium via Playwright.
  Useful for CI smoke tests.

See the inline module docstrings for usage.

## Phase 2 — production service (scaffolded, awaiting HMRC approval)

Directory layout matches `payslip-ingest/`:

```
hmrc-sync/
├── hmrc_sync/
│   ├── __init__.py
│   ├── __main__.py         # click CLI: serve / sync / migrate
│   ├── app.py              # FastAPI (authorize, callback, sync, healthz)
│   ├── client.py           # HmrcClient — wraps Individual Tax API v1.1
│   ├── db.py               # SQLAlchemy models (tax_year_snapshot, fetch_log)
│   ├── fraud_headers.py    # build Gov-Client-/Gov-Vendor- headers
│   └── oauth.py            # Vault-backed refresh_token storage
├── alembic/
│   ├── env.py
│   └── versions/0001_initial.py
├── tests/
│   └── test_fraud_headers.py  # CI-gated shape tests + sandbox validator smoke
├── Dockerfile
├── alembic.ini
└── pyproject.toml
```

### Critical path to prod

1. **HMRC Dev Hub** (user action, ~10 min):
   - Subscribe to *Individual Tax API v1.1*.
   - Add prod redirect URI: `https://hmrc-oauth.viktorbarzin.me/callback`.
   - Submit Production Access application — 2 questionnaires, frame as
     "single-user PAYE reconciliation, not redistributed".
   - Review takes ~10 working days.
2. **File HMRC SDST support ticket** up-front asking (a) is MTD ITSA
   signup required for Individual Tax API prod access, and (b) can a
   PAYE-only individual voluntarily enroll without self-employment
   income. Proceed with app submission in parallel.
3. **Fraud-header validator sweep** (local — blocking):
   ```
   HMRC_VALIDATOR=1 pytest tests/test_fraud_headers.py
   ```
   Must be green before prod deploy.
4. **After HMRC approval arrives**:
   - Seed Vault keys: `hmrc_prod_client_id`, `hmrc_prod_client_secret`,
     `hmrc_sync_webhook_token`, `hmrc_device_id` at `secret/viktor/`.
   - Create `infra/stacks/hmrc-sync/` Terraform stack (clone from
     `infra/stacks/payslip-ingest/`): Deployment, Service, Ingress via
     `ingress_factory` (protected=false for HMRC callback), ESO for
     Vault→K8s Secret, Grafana datasource ConfigMap, CronJob at 06:00
     UTC daily running `python -m hmrc_sync sync --tax-year current`.
   - Deploy stack.
   - Visit `https://hmrc-oauth.viktorbarzin.me/authorize` once in a
     browser to seed the refresh_token. CronJob takes over thereafter.

### Dashboard Panel 10

`infra/stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`
already carries Panel 10 ("HMRC Tax Year Reconciliation — Individual
Tax API"). It queries `hmrc_sync.tax_year_snapshot`, which doesn't
exist yet on the monitoring DB — the panel renders empty until
hmrc-sync is deployed and the Alembic migration runs.

### Risks / mitigations

- **MTD pilot gate blocks API** — SDST ticket resolves; fallback is
  payslip-ingest P60 reconciliation (already shipped).
- **Prod approval denied on "personal use"** — reframe + appeal; else
  permanent P60-only reconciliation.
- **Fraud-header audit fails** — validator API gates deploy.
- **Refresh token expires (18 months)** — alert on `expires_in` < 30
  days; manual re-auth via `/authorize`.

Tracked as beads `code-74j`.
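The last risk item above describes alerting when the refresh token is within 30 days of its 18-month lifetime. A minimal sketch of that check, assuming the `_cached_at` timestamp the token cache already stores and an illustrative 548-day lifetime (the function name and constants are hypothetical, not part of this commit):

```python
from datetime import datetime, timedelta

# Assumption: HMRC refresh tokens last ~18 months; 548 days is an approximation.
REFRESH_TOKEN_LIFETIME = timedelta(days=548)
ALERT_THRESHOLD = timedelta(days=30)


def needs_reauth_alert(cached_at: datetime, now: datetime) -> bool:
    """True when the cached refresh token is within 30 days of expiry."""
    expires_at = cached_at + REFRESH_TOKEN_LIFETIME
    return now >= expires_at - ALERT_THRESHOLD


# A token cached ~17.5 months ago is inside the alert window; a fresh one is not.
assert needs_reauth_alert(datetime(2024, 1, 1), datetime(2025, 6, 20))
assert not needs_reauth_alert(datetime(2025, 1, 1), datetime(2025, 6, 20))
```

In the deployed stack this would likely live behind a Prometheus gauge rather than an assert, but the window arithmetic is the same.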
alembic.ini — new file (+37)
@@ -0,0 +1,37 @@
[alembic]
script_location = alembic
sqlalchemy.url = placeholder

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
alembic/env.py — new file (+59)
@@ -0,0 +1,59 @@
import asyncio
import os
from logging.config import fileConfig

from sqlalchemy.engine import Connection
from sqlalchemy.ext.asyncio import async_engine_from_config

from alembic import context
from hmrc_sync.db import SCHEMA_NAME, Base

config = context.config
if config.config_file_name is not None:
    fileConfig(config.config_file_name)

db_url = os.environ.get("DB_CONNECTION_STRING")
if db_url:
    config.set_main_option("sqlalchemy.url", db_url)

target_metadata = Base.metadata


def do_run_migrations(connection: Connection) -> None:
    connection.exec_driver_sql(f'CREATE SCHEMA IF NOT EXISTS "{SCHEMA_NAME}"')
    context.configure(
        connection=connection,
        target_metadata=target_metadata,
        version_table_schema=SCHEMA_NAME,
        include_schemas=True,
    )
    with context.begin_transaction():
        context.run_migrations()


async def run_migrations_online() -> None:
    configuration = config.get_section(config.config_ini_section, {})
    connectable = async_engine_from_config(configuration, prefix="sqlalchemy.")
    async with connectable.connect() as connection:
        await connection.run_sync(do_run_migrations)
        await connection.commit()
    await connectable.dispose()


def run_migrations_offline() -> None:
    context.configure(
        url=config.get_main_option("sqlalchemy.url"),
        target_metadata=target_metadata,
        literal_binds=True,
        version_table_schema=SCHEMA_NAME,
        include_schemas=True,
        dialect_opts={"paramstyle": "named"},
    )
    with context.begin_transaction():
        context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    asyncio.run(run_migrations_online())
alembic/script.py.mako — new file (+26)
@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from collections.abc import Sequence
from typing import Union

import sqlalchemy as sa

from alembic import op
${imports if imports else ""}

revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}
alembic/versions/0001_initial.py — new file (+85)
@@ -0,0 +1,85 @@
"""Create hmrc_sync.tax_year_snapshot + hmrc_sync.fetch_log.

These two tables hold everything hmrc-sync persists. The snapshot table
keeps HMRC's `hmrc-held` PAYE/NI view per (tax_year, employer, day);
fetch_log is the audit trail of every outbound API call (for
fraud-header compliance reviews HMRC may trigger).
"""
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

from alembic import op

revision = "0001"
down_revision = None
branch_labels = None
depends_on = None

SCHEMA = "hmrc_sync"


def upgrade() -> None:
    op.create_table(
        "tax_year_snapshot",
        sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
        sa.Column("tax_year", sa.String(), nullable=False),
        sa.Column("employer_paye_ref", sa.String(), nullable=False),
        sa.Column("snapshot_date", sa.TIMESTAMP(timezone=True), nullable=False),
        sa.Column("gross_pay", sa.Numeric(12, 2), nullable=False),
        sa.Column("income_tax", sa.Numeric(12, 2), nullable=False),
        sa.Column("ni_contributions", sa.Numeric(12, 2), nullable=False),
        sa.Column("source", sa.String(), nullable=False, server_default="hmrc-held"),
        sa.Column(
            "raw_response",
            postgresql.JSONB().with_variant(sa.JSON(), "sqlite"),
            nullable=False,
        ),
        sa.Column(
            "fetched_at",
            sa.TIMESTAMP(timezone=True),
            nullable=False,
            server_default=sa.text("now()"),
        ),
        sa.UniqueConstraint(
            "tax_year",
            "employer_paye_ref",
            "snapshot_date",
            name="uq_tax_year_snapshot",
        ),
        schema=SCHEMA,
    )
    op.create_index(
        "ix_tax_year_snapshot_tax_year",
        "tax_year_snapshot",
        ["tax_year"],
        schema=SCHEMA,
    )

    op.create_table(
        "fetch_log",
        sa.Column("id", sa.Integer(), primary_key=True, autoincrement=True),
        sa.Column("endpoint", sa.String(), nullable=False),
        sa.Column("status_code", sa.Integer(), nullable=False),
        sa.Column("request_id", sa.String(), nullable=True),
        sa.Column("correlation_id", sa.String(), nullable=True),
        sa.Column(
            "fraud_headers_sent",
            postgresql.JSONB().with_variant(sa.JSON(), "sqlite"),
            nullable=False,
        ),
        sa.Column("response_snippet", sa.String(), nullable=True),
        sa.Column("duration_ms", sa.Integer(), nullable=False),
        sa.Column(
            "fetched_at",
            sa.TIMESTAMP(timezone=True),
            nullable=False,
            server_default=sa.text("now()"),
        ),
        schema=SCHEMA,
    )


def downgrade() -> None:
    op.drop_table("fetch_log", schema=SCHEMA)
    op.drop_index("ix_tax_year_snapshot_tax_year", table_name="tax_year_snapshot", schema=SCHEMA)
    op.drop_table("tax_year_snapshot", schema=SCHEMA)
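The `uq_tax_year_snapshot` constraint above is what makes daily CronJob runs idempotent: re-fetching the same (tax_year, employer_paye_ref, snapshot_date) should update in place rather than error. A minimal sketch of that upsert pattern, using stdlib SQLite as a stand-in for Postgres and a trimmed column list (the `upsert` helper is illustrative, not part of this commit):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tax_year_snapshot (
        tax_year TEXT NOT NULL,
        employer_paye_ref TEXT NOT NULL,
        snapshot_date TEXT NOT NULL,
        gross_pay NUMERIC NOT NULL,
        UNIQUE (tax_year, employer_paye_ref, snapshot_date)
    )
""")


def upsert(row: tuple) -> None:
    # ON CONFLICT targets the same key as uq_tax_year_snapshot, so a
    # repeated fetch replaces the figures instead of raising.
    conn.execute(
        """INSERT INTO tax_year_snapshot
               (tax_year, employer_paye_ref, snapshot_date, gross_pay)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (tax_year, employer_paye_ref, snapshot_date)
           DO UPDATE SET gross_pay = excluded.gross_pay""",
        row,
    )


upsert(("2024-25", "123/AB456", "2025-04-05", 50000))
upsert(("2024-25", "123/AB456", "2025-04-05", 51000))  # same key: updates, no duplicate
rows = conn.execute("SELECT gross_pay FROM tax_year_snapshot").fetchall()
assert rows == [(51000,)]
```

Against Postgres the same `INSERT ... ON CONFLICT ... DO UPDATE` works unchanged (SQLAlchemy exposes it via `postgresql.insert(...).on_conflict_do_update`).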
headless_auth.py — new file (+296)
@@ -0,0 +1,296 @@
"""Headless HMRC sandbox OAuth — drives Chromium via Playwright.

Logs in as a sandbox test user without needing a human in the loop,
captures the authorization code from the localhost callback (the
callback URL is never actually fetched — we abort the navigation and
read the URL), exchanges for tokens, saves them to a cache file, then
optionally calls an API endpoint.

Creds + test user credentials are read from Vault. The token cache
lives at ~/.cache/hmrc-sync/tokens.json and can be reused across runs
until the refresh_token expires (18 months).

Usage:
    python3 headless_auth.py login --user-id 228488477217 --password VLAFXYsz4Uqk
    python3 headless_auth.py call --utr 2762163393 --tax-year 2015-16
    python3 headless_auth.py refresh
"""

from __future__ import annotations

import argparse
import contextlib
import json
import os
import secrets
import subprocess
import sys
import time
import urllib.parse
from dataclasses import dataclass
from pathlib import Path

import httpx
from playwright.sync_api import sync_playwright

SANDBOX_BASE = "https://test-api.service.hmrc.gov.uk"
AUTH_PATH = "/oauth/authorize"
TOKEN_PATH = "/oauth/token"
INCOME_PATH = "/individual-income/sa/{utr}/annual-summary/{tax_year}"
INCOME_ACCEPT = "application/vnd.hmrc.1.2+json"

REDIRECT_URI = "http://localhost:8080/oauth/callback"
SCOPE = "read:individual-income"

CACHE_DIR = Path.home() / ".cache" / "hmrc-sync"
TOKEN_CACHE = CACHE_DIR / "tokens.json"


@dataclass
class Creds:
    client_id: str
    client_secret: str


def load_creds() -> Creds:
    env_id = os.environ.get("HMRC_CLIENT_ID")
    env_secret = os.environ.get("HMRC_CLIENT_SECRET")
    if env_id and env_secret:
        return Creds(env_id, env_secret)
    cid = subprocess.check_output(
        ["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_id", "secret/viktor"],
        text=True,
    ).strip()
    csec = subprocess.check_output(
        ["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_secret", "secret/viktor"],
        text=True,
    ).strip()
    return Creds(cid, csec)


def save_tokens(tok: dict) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    tok_with_meta = dict(tok)
    tok_with_meta["_cached_at"] = int(time.time())
    TOKEN_CACHE.write_text(json.dumps(tok_with_meta, indent=2))
    TOKEN_CACHE.chmod(0o600)


def load_tokens() -> dict | None:
    if not TOKEN_CACHE.exists():
        return None
    return json.loads(TOKEN_CACHE.read_text())


def authorize_url(creds: Creds, state: str) -> str:
    return (
        f"{SANDBOX_BASE}{AUTH_PATH}?"
        + urllib.parse.urlencode({
            "response_type": "code",
            "client_id": creds.client_id,
            "scope": SCOPE,
            "redirect_uri": REDIRECT_URI,
            "state": state,
        })
    )


def headless_get_code(auth_url: str, user_id: str, password: str, state: str) -> str:
    """Drive Chromium through HMRC sandbox login and extract the auth code."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        ctx = browser.new_context()

        captured_code: dict[str, str] = {}

        # Abort any attempt to hit localhost:8080 and capture the URL that
        # triggered it — that's the callback with ?code=...
        def _intercept(route):
            if "localhost:8080" in route.request.url:
                parsed = urllib.parse.urlparse(route.request.url)
                qs = urllib.parse.parse_qs(parsed.query)
                captured_code["code"] = qs.get("code", [""])[0]
                captured_code["state"] = qs.get("state", [""])[0]
                route.abort()
            else:
                route.continue_()

        ctx.route("**/*", _intercept)
        page = ctx.new_page()
        page.set_default_timeout(30000)

        page.goto(auth_url)
        page.wait_for_load_state("networkidle")

        # Step 1 — cookie banner ("Reject additional cookies")
        with contextlib.suppress(Exception):
            page.get_by_role("button", name="Reject additional cookies").click(timeout=3000)
            page.wait_for_load_state("networkidle")
        with contextlib.suppress(Exception):
            page.get_by_role("button", name="Hide cookie message").click(timeout=2000)

        # Step 2 — intro page ("Allow your software to connect with HMRC" → Continue)
        with contextlib.suppress(Exception):
            page.get_by_role("button", name="Continue").first.click(timeout=5000)
            page.wait_for_load_state("networkidle")

        # Step 3 — login form
        for sel in ["input[name='userId']", "input#userId", "input[name='user_id']", "#user_id"]:
            try:
                page.fill(sel, user_id, timeout=2000)
                break
            except Exception:
                continue
        for sel in ["input[name='password']", "input#password"]:
            try:
                page.fill(sel, password, timeout=2000)
                break
            except Exception:
                continue
        for sel in ["button[type='submit']", "button:has-text('Sign in')", "input[type='submit']"]:
            try:
                page.click(sel, timeout=2000)
                page.wait_for_load_state("networkidle")
                break
            except Exception:
                continue

        # Step 4 — consent screen ("Grant authority")
        deadline = time.time() + 20
        while time.time() < deadline and "code" not in captured_code:
            for sel in [
                "button:has-text('Grant authority')",
                "button:has-text('Continue')",
                "button:has-text('Accept and continue')",
                "#submit",
            ]:
                try:
                    page.click(sel, timeout=1500)
                    break
                except Exception:
                    continue
            time.sleep(0.5)

        browser.close()

    if "code" not in captured_code or not captured_code["code"]:
        raise SystemExit(f"Headless login failed to capture code. captured={captured_code}")
    if captured_code.get("state") != state:
        raise SystemExit(f"CSRF state mismatch: got {captured_code.get('state')!r}, want {state!r}")
    return captured_code["code"]


def exchange_code(creds: Creds, code: str) -> dict:
    r = httpx.post(
        f"{SANDBOX_BASE}{TOKEN_PATH}",
        data={
            "grant_type": "authorization_code",
            "client_id": creds.client_id,
            "client_secret": creds.client_secret,
            "redirect_uri": REDIRECT_URI,
            "code": code,
        },
        headers={"Accept": "application/vnd.hmrc.1.0+json"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()


def refresh_tokens(creds: Creds, refresh_token: str) -> dict:
    r = httpx.post(
        f"{SANDBOX_BASE}{TOKEN_PATH}",
        data={
            "grant_type": "refresh_token",
            "client_id": creds.client_id,
            "client_secret": creds.client_secret,
            "refresh_token": refresh_token,
        },
        headers={"Accept": "application/vnd.hmrc.1.0+json"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()


def get_access_or_die() -> str:
    tok = load_tokens()
    if not tok:
        raise SystemExit("No cached tokens. Run: headless_auth.py login --user-id ... --password ...")
    age = int(time.time()) - tok.get("_cached_at", 0)
    if age < tok.get("expires_in", 14400) - 300:
        return tok["access_token"]
    # refresh
    creds = load_creds()
    new_tok = refresh_tokens(creds, tok["refresh_token"])
    save_tokens(new_tok)
    return new_tok["access_token"]


def call_income(utr: str, tax_year: str) -> int:
    access = get_access_or_die()
    url = f"{SANDBOX_BASE}{INCOME_PATH.format(utr=utr, tax_year=tax_year)}"
    r = httpx.get(
        url,
        headers={"Accept": INCOME_ACCEPT, "Authorization": f"Bearer {access}"},
        timeout=30,
    )
    print(f"GET /individual-income/sa/{utr}/annual-summary/{tax_year} -> HTTP {r.status_code}")
    try:
        print(json.dumps(r.json(), indent=2))
    except Exception:
        print(r.text)
    return 0 if r.status_code < 400 else 2


def cmd_login(args) -> int:
    creds = load_creds()
    state = secrets.token_urlsafe(24)
    url = authorize_url(creds, state)
    print(f"Headless login → {SANDBOX_BASE}{AUTH_PATH} ...")
    code = headless_get_code(url, args.user_id, args.password, state)
    print(f"Got code: {code[:12]}...")
    tok = exchange_code(creds, code)
    save_tokens(tok)
    print(f"Saved tokens to {TOKEN_CACHE}. expires_in={tok.get('expires_in')}s")
    return 0


def cmd_refresh(_args) -> int:
    tok = load_tokens()
    if not tok:
        raise SystemExit("No tokens to refresh.")
    creds = load_creds()
    new_tok = refresh_tokens(creds, tok["refresh_token"])
    save_tokens(new_tok)
    print(f"Refreshed. new expires_in={new_tok.get('expires_in')}s")
    return 0


def cmd_call(args) -> int:
    return call_income(args.utr, args.tax_year)


def main() -> int:
    p = argparse.ArgumentParser()
    sub = p.add_subparsers(dest="cmd", required=True)

    pl = sub.add_parser("login")
    pl.add_argument("--user-id", required=True)
    pl.add_argument("--password", required=True)
    pl.set_defaults(func=cmd_login)

    pr = sub.add_parser("refresh")
    pr.set_defaults(func=cmd_refresh)

    pc = sub.add_parser("call")
    pc.add_argument("--utr", required=True)
    pc.add_argument("--tax-year", default="2015-16")
    pc.set_defaults(func=cmd_call)

    args = p.parse_args()
    return args.func(args)


if __name__ == "__main__":
    sys.exit(main())
hmrc_sync/__init__.py — new file (empty)
hmrc_sync/__main__.py — new file (+36)
@@ -0,0 +1,36 @@
import logging
import os
import subprocess
import sys

import click
import uvicorn


@click.group()
def cli() -> None:
    logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))


@cli.command()
def serve() -> None:
    """Run the FastAPI server (K8s entrypoint)."""
    uvicorn.run("hmrc_sync.app:app", host="0.0.0.0", port=8080)


@cli.command()
@click.option("--tax-year", default="current", help="Tax year to fetch, e.g. 2024-25 or 'current'.")
def sync(tax_year: str) -> None:
    """One-shot sync of HMRC figures — used by the CronJob."""
    raise click.ClickException("Sync stub — implement after HMRC prod approval lands")


@cli.command()
def migrate() -> None:
    """Run `alembic upgrade head`."""
    result = subprocess.run(["alembic", "upgrade", "head"], check=False)
    sys.exit(result.returncode)


if __name__ == "__main__":
    cli()
hmrc_sync/app.py — new file (+129)
@@ -0,0 +1,129 @@
"""FastAPI entrypoint for hmrc-sync.

Endpoints:
- GET /authorize — redirect to HMRC OAuth, primes refresh_token
- GET /callback — OAuth callback; exchange code, persist token
- POST /callback-metadata — browser-side session attributes (fraud headers)
- POST /sync — pull latest HMRC figures for a given tax year
- GET /healthz — readiness + queue depth
"""
from __future__ import annotations

import logging
import os
import secrets
import urllib.parse
from contextlib import asynccontextmanager
from typing import Any

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import HTMLResponse, RedirectResponse
from prometheus_fastapi_instrumentator import Instrumentator

from hmrc_sync import oauth
from hmrc_sync.fraud_headers import SessionContext

log = logging.getLogger(__name__)

REQUIRED_ENV = [
    "HMRC_PROD_CLIENT_ID",
    "HMRC_PROD_CLIENT_SECRET",
    "HMRC_PROD_REDIRECT_URI",
    "DB_CONNECTION_STRING",
]


def _verify_env() -> None:
    missing = [k for k in REQUIRED_ENV if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")


@asynccontextmanager
async def lifespan(app: FastAPI):  # type: ignore[no-untyped-def]
    _verify_env()
    app.state.session_context = SessionContext(
        device_id=os.environ.get("HMRC_DEVICE_ID", ""),
        public_ip=os.environ.get("HMRC_VENDOR_PUBLIC_IP", ""),
    )
    app.state.oauth_states = {}  # anti-CSRF state → expires_at
    yield


app = FastAPI(title="HMRC Sync", lifespan=lifespan)
Instrumentator().instrument(app).expose(app, endpoint="/metrics")


@app.get("/healthz")
async def healthz() -> dict[str, Any]:
    return {"status": "ok"}


@app.get("/authorize")
async def authorize() -> RedirectResponse:
    creds = oauth.load_creds_from_env()
    state = secrets.token_urlsafe(24)
    app.state.oauth_states[state] = True
    params = urllib.parse.urlencode({
        "response_type": "code",
        "client_id": creds.client_id,
        "scope": "read:self-assessment",
        "redirect_uri": creds.redirect_uri,
        "state": state,
    })
    return RedirectResponse(f"{oauth.PROD_BASE}/oauth/authorize?{params}")


@app.get("/callback", response_class=HTMLResponse)
async def callback(code: str, state: str) -> HTMLResponse:
    if state not in app.state.oauth_states:
        raise HTTPException(status_code=400, detail="unknown state (CSRF)")
    del app.state.oauth_states[state]
    creds = oauth.load_creds_from_env()
    token = await oauth.exchange_code(creds, code)
    oauth.persist_to_vault(token)
    # Serve a 1-page form that POSTs browser attributes to /callback-metadata
    # so we capture the per-session values HMRC wants in fraud headers.
    return HTMLResponse(_metadata_capture_html())


@app.post("/callback-metadata")
async def callback_metadata(request: Request) -> dict[str, str]:
    body = await request.json()
    session: SessionContext = app.state.session_context
    session.user_agent = str(body.get("user_agent", "") or "")
    session.screen_width = int(body.get("screen_width", 0) or 0)
    session.screen_height = int(body.get("screen_height", 0) or 0)
    session.screen_colour_depth = int(body.get("screen_colour_depth", 0) or 0)
    session.window_width = int(body.get("window_width", 0) or 0)
    session.window_height = int(body.get("window_height", 0) or 0)
    session.timezone_offset = int(body.get("timezone_offset", 0) or 0)
    return {"status": "captured"}


@app.post("/sync")
async def sync(tax_year: str | None = None) -> dict[str, Any]:
    """Pull latest HMRC figures for `tax_year` (default: current fiscal year)."""
    raise HTTPException(status_code=501, detail="Sync not yet implemented — awaiting HMRC prod approval")


def _metadata_capture_html() -> str:
    return """<!doctype html>
<html><head><title>hmrc-sync — capturing session</title></head><body>
<h2>Capturing session attributes for HMRC fraud headers...</h2>
<script>
fetch('/callback-metadata', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    user_agent: navigator.userAgent,
    screen_width: screen.width,
    screen_height: screen.height,
    screen_colour_depth: screen.colorDepth,
    window_width: window.innerWidth,
    window_height: window.innerHeight,
    timezone_offset: -new Date().getTimezoneOffset()
  })
}).then(() => document.body.innerHTML = '<h2>Done. You can close this tab.</h2>');
</script>
</body></html>"""
82
hmrc_sync/client.py
Normal file
82
hmrc_sync/client.py
Normal file
|
|
@ -0,0 +1,82 @@
|
|||
"""HMRC Individual Tax API v1.1 wrapper.
|
||||
|
||||
One method per endpoint we consume. Every request attaches the full fraud-
|
||||
prevention header set built by `fraud_headers.build_headers()`.
|
||||
|
||||
Individual Tax API v1.1 returns tax-paid + income-breakdown figures per
|
||||
employment per tax year — exactly the ground-truth data we reconcile
|
||||
against the payslip-ingest monthly aggregate.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Any
|
||||
|
||||
import httpx
|
||||
|
||||
from hmrc_sync.fraud_headers import SessionContext, build_headers
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
PROD_BASE = "https://api.service.hmrc.gov.uk"
|
||||
INDIVIDUAL_TAX_VERSION = "application/vnd.hmrc.1.1+json"
|
||||
|
||||
|
||||
@dataclass
|
||||
class HmrcResponse:
|
||||
status_code: int
|
||||
body: dict[str, Any]
|
||||
duration_ms: int
|
||||
request_id: str | None
|
||||
correlation_id: str | None
|
||||
fraud_headers_sent: dict[str, str]
|
||||
|
||||
|
||||
class HmrcClient:
|
||||
|
||||
def __init__(self,
|
||||
access_token: str,
|
||||
session: SessionContext,
|
||||
connection_method: str = "BATCH_PROCESS_DIRECT",
|
||||
base_url: str = PROD_BASE):
|
||||
self._access_token = access_token
|
||||
self._session = session
|
||||
self._connection_method = connection_method
|
||||
self._base_url = base_url.rstrip("/")
|
||||
|
||||
async def individual_tax_summary(self, utr: str, tax_year: str) -> HmrcResponse:
|
||||
"""GET /individuals/tax/sa/{utr}/summary/{taxYear}
|
||||
|
||||
`utr` is the 10-digit Self Assessment reference; tax_year format
|
||||
is `YYYY-YY` (e.g. `2024-25`).
|
||||
"""
|
||||
path = f"/individuals/tax/sa/{utr}/summary/{tax_year}"
|
||||
return await self._get(path)
|
||||
|
||||
async def _get(self, path: str) -> HmrcResponse:
|
||||
fraud = build_headers(self._session, self._connection_method)
|
||||
headers = {
|
||||
"Accept": INDIVIDUAL_TAX_VERSION,
|
||||
"Authorization": f"Bearer {self._access_token}",
|
||||
}
|
||||
headers.update(fraud)
|
||||
started = time.perf_counter()
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
resp = await client.get(f"{self._base_url}{path}", headers=headers)
|
||||
duration_ms = int((time.perf_counter() - started) * 1000)
|
||||
body: dict[str, Any]
|
||||
try:
|
||||
body = resp.json() if resp.content else {}
|
||||
except ValueError:
|
||||
body = {"raw": resp.text[:2000]}
|
||||
log.info("hmrc %s status=%s duration=%dms", path, resp.status_code, duration_ms)
|
||||
return HmrcResponse(
|
||||
status_code=resp.status_code,
|
||||
body=body,
|
||||
duration_ms=duration_ms,
|
||||
request_id=resp.headers.get("x-request-id"),
|
||||
correlation_id=resp.headers.get("x-correlation-id"),
|
||||
fraud_headers_sent=fraud,
|
||||
)
|
||||
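The defensive body-parse in `HmrcClient._get` (JSON when possible, a bounded raw snippet otherwise) can be checked in isolation. A minimal sketch mirroring that fallback — `parse_body` is a hypothetical standalone helper, not part of the module, and the non-dict guard is an addition the original handles only via its type annotation:

```python
import json

def parse_body(content: bytes, text: str) -> dict:
    # Mirrors the fallback in HmrcClient._get: empty responses become {},
    # valid JSON objects parse normally, and anything else is kept as a
    # bounded "raw" snippet so the audit log row stays small.
    if not content:
        return {}
    try:
        parsed = json.loads(text)
    except ValueError:
        return {"raw": text[:2000]}
    return parsed if isinstance(parsed, dict) else {"raw": text[:2000]}

print(parse_body(b"", ""))                      # → {}
print(parse_body(b'{"tax": 1}', '{"tax": 1}'))  # → {'tax': 1}
print(parse_body(b"<html>", "<html>"))          # → {'raw': '<html>'}
```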
70
hmrc_sync/db.py
Normal file

@@ -0,0 +1,70 @@
import os
from datetime import datetime
from decimal import Decimal
from typing import Any

from sqlalchemy import JSON, TIMESTAMP, Integer, Numeric, String, text
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.ext.asyncio import AsyncEngine, async_sessionmaker, create_async_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

SCHEMA_NAME = "hmrc_sync"


class Base(DeclarativeBase):
    pass


JSON_TYPE = JSONB().with_variant(JSON(), "sqlite")


class TaxYearSnapshot(Base):
    """One row per (tax_year, employer_paye_ref, snapshot_date).

    HMRC returns the `hmrc-held` view of annual PAYE/NI for a given
    employment. Taking a daily snapshot lets us see HMRC's figures evolve
    as late RTI filings land, and lets the dashboard always show the
    latest value by snapshot_date.
    """
    __tablename__ = "tax_year_snapshot"
    __table_args__ = {"schema": SCHEMA_NAME}  # noqa: RUF012

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    tax_year: Mapped[str] = mapped_column(String, nullable=False, index=True)
    employer_paye_ref: Mapped[str] = mapped_column(String, nullable=False)
    snapshot_date: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
    gross_pay: Mapped[Decimal] = mapped_column(Numeric(12, 2), nullable=False)
    income_tax: Mapped[Decimal] = mapped_column(Numeric(12, 2), nullable=False)
    ni_contributions: Mapped[Decimal] = mapped_column(Numeric(12, 2), nullable=False)
    source: Mapped[str] = mapped_column(String, nullable=False, server_default="hmrc-held")
    raw_response: Mapped[dict[str, Any]] = mapped_column(JSON_TYPE, nullable=False)
    fetched_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True),
                                                 nullable=False,
                                                 server_default=text("now()"))


class FetchLog(Base):
    """Audit trail of every HMRC API call — for fraud-header compliance review."""
    __tablename__ = "fetch_log"
    __table_args__ = {"schema": SCHEMA_NAME}  # noqa: RUF012

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    endpoint: Mapped[str] = mapped_column(String, nullable=False)
    status_code: Mapped[int] = mapped_column(Integer, nullable=False)
    request_id: Mapped[str | None] = mapped_column(String, nullable=True)
    correlation_id: Mapped[str | None] = mapped_column(String, nullable=True)
    fraud_headers_sent: Mapped[dict[str, Any]] = mapped_column(JSON_TYPE, nullable=False)
    response_snippet: Mapped[str | None] = mapped_column(String, nullable=True)
    duration_ms: Mapped[int] = mapped_column(Integer, nullable=False)
    fetched_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True),
                                                 nullable=False,
                                                 server_default=text("now()"))


def create_engine_from_env() -> AsyncEngine:
    url = os.environ["DB_CONNECTION_STRING"]
    return create_async_engine(url, pool_pre_ping=True)


def make_session_factory(engine: AsyncEngine) -> async_sessionmaker[Any]:
    return async_sessionmaker(engine, expire_on_commit=False)
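The "latest value by snapshot_date" read the `TaxYearSnapshot` docstring promises the dashboard reduces to a max-per-group. A stdlib sketch over hypothetical in-memory rows (the data values and the `latest_per_employment` helper are illustrative only):

```python
from datetime import datetime

# Hypothetical in-memory rows standing in for tax_year_snapshot records.
rows = [
    {"tax_year": "2024-25", "employer_paye_ref": "123/AB456",
     "snapshot_date": datetime(2025, 4, 1), "income_tax": "9100.00"},
    {"tax_year": "2024-25", "employer_paye_ref": "123/AB456",
     "snapshot_date": datetime(2025, 4, 2), "income_tax": "9250.00"},
]

def latest_per_employment(rows: list[dict]) -> list[dict]:
    # Keep the row with the max snapshot_date per (tax_year, employer_paye_ref)
    # — the same semantics a SQL DISTINCT ON / window query would give.
    best: dict[tuple, dict] = {}
    for r in rows:
        key = (r["tax_year"], r["employer_paye_ref"])
        if key not in best or r["snapshot_date"] > best[key]["snapshot_date"]:
            best[key] = r
    return list(best.values())

print(latest_per_employment(rows)[0]["income_tax"])  # → 9250.00
```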
341
hmrc_sync/fraud_headers.py
Normal file

@@ -0,0 +1,341 @@
"""Build HMRC MTD fraud-prevention headers (Gov-Client-* / Gov-Vendor-*).
|
||||
|
||||
HMRC's BATCH_PROCESS_DIRECT connection method (what our CronJob uses)
|
||||
mandates 11 headers on every MTD API call; WEB_APP_VIA_SERVER adds a
|
||||
handful of browser-derived fields. Shipping without these risks fines
|
||||
and API-access revocation per the HMRC fraud-prevention guide.
|
||||
|
||||
Layout:
|
||||
|
||||
- **Static** — vendor-constant across runs (product name/version,
|
||||
hashed license id).
|
||||
- **Runtime** — collected at module load from the pod's own network
|
||||
stack + OS: MAC addresses, local IPs, OS family/version, device model.
|
||||
- **Per-request** — built at call time (timestamps, request ids).
|
||||
- **Per-session** — captured from the browser on `/callback-metadata`
|
||||
(screen dimensions, public IP, MFA timestamp). Only WEB_APP_VIA_SERVER.
|
||||
|
||||
The public entry point is `build_headers(session, connection_method)`.
|
||||
Run `tests/test_fraud_headers.py::test_headers_pass_hmrc_validator`
|
||||
with `HMRC_VALIDATOR=1` to verify against the HMRC sandbox validator.
|
||||
|
||||
Spec references:
|
||||
https://developer.service.hmrc.gov.uk/guides/fraud-prevention/
|
||||
https://developer.service.hmrc.gov.uk/guides/fraud-prevention/connection-method/batch-process-direct/
|
||||
https://developer.service.hmrc.gov.uk/api-documentation/docs/api/service/txm-fph-validator-api/1.0
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import getpass
|
||||
import hashlib
|
||||
import logging
|
||||
import os
|
||||
import platform
|
||||
import secrets
|
||||
import socket
|
||||
import time
|
||||
import urllib.parse
|
||||
import uuid
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
VENDOR_PRODUCT_NAME = "hmrc-sync"
|
||||
VENDOR_PRODUCT_VERSION = "0.1.0"
|
||||
# Self-assigned for a personal single-user tool. HMRC permits arbitrary
|
||||
# vendor strings; the header value is then SHA-256-hashed per spec
|
||||
# (`Gov-Vendor-License-IDs: <name>=<hashed-value>`).
|
||||
VENDOR_LICENSE_ID = os.environ.get("HMRC_VENDOR_LICENSE_ID",
|
||||
"hmrc-sync-private-single-user")
|
||||
VENDOR_PUBLIC_IP = os.environ.get("HMRC_VENDOR_PUBLIC_IP", "")
|
||||
|
||||
# Valid HMRC connection-method enum values.
|
||||
CONNECTION_METHOD_BATCH = "BATCH_PROCESS_DIRECT"
|
||||
CONNECTION_METHOD_WEB_APP = "WEB_APP_VIA_SERVER"
|
||||
CONNECTION_METHOD_MFA = "AUTH_USING_MFA"
|
||||
|
||||
_NET_CLASS = Path("/sys/class/net")
|
||||
_EMPTY_MAC = "00:00:00:00:00:00"
|
||||
|
||||
|
||||
@dataclass
|
||||
class SessionContext:
|
||||
"""Browser-side attributes captured on the `/callback-metadata` POST.
|
||||
|
||||
Only relevant for WEB_APP_VIA_SERVER flows (browser-initiated OAuth
|
||||
+ server-side API calls). BATCH_PROCESS_DIRECT flows derive their
|
||||
context from `RuntimeContext` (see below) without touching these.
|
||||
"""
|
||||
user_agent: str = ""
|
||||
screen_width: int = 0
|
||||
screen_height: int = 0
|
||||
screen_colour_depth: int = 0
|
||||
window_width: int = 0
|
||||
window_height: int = 0
|
||||
timezone_offset: int = 0
|
||||
device_id: str = ""
|
||||
mfa_timestamp: str = ""
|
||||
public_ip: str = ""
|
||||
public_port: int = 0
|
||||
|
||||
|
||||
@dataclass
|
||||
class RuntimeContext:
|
||||
"""Pod-side environment values required on every API call.
|
||||
|
||||
Collected once at module load (cheap — all local syscalls). If any
|
||||
field is empty, the header emitter falls back to safe defaults so
|
||||
the call never goes out with an empty mandatory header.
|
||||
"""
|
||||
mac_addresses: list[str] = field(default_factory=list)
|
||||
local_ips: list[str] = field(default_factory=list)
|
||||
os_family: str = ""
|
||||
os_version: str = ""
|
||||
device_manufacturer: str = "Kubernetes"
|
||||
device_model: str = ""
|
||||
os_user: str = ""
|
||||
|
||||
|
||||
def _collect_mac_addresses() -> list[str]:
|
||||
"""Read every non-loopback interface MAC from `/sys/class/net/*/address`.
|
||||
|
||||
Colons are kept raw; `_format_mac_list` percent-encodes on output per spec.
|
||||
"""
|
||||
out: list[str] = []
|
||||
if not _NET_CLASS.exists():
|
||||
return out
|
||||
for iface in sorted(_NET_CLASS.iterdir()):
|
||||
if iface.name == "lo":
|
||||
continue
|
||||
addr_file = iface / "address"
|
||||
if not addr_file.exists():
|
||||
continue
|
||||
try:
|
||||
mac = addr_file.read_text().strip()
|
||||
except OSError:
|
||||
continue
|
||||
if mac and mac != _EMPTY_MAC:
|
||||
out.append(mac)
|
||||
return out
|
||||
|
||||
|
||||
def _collect_local_ips() -> list[str]:
|
||||
"""Every IP bound to this host — IPv4 + IPv6, loopback excluded."""
|
||||
ips: set[str] = set()
|
||||
try:
|
||||
hostname = socket.gethostname()
|
||||
for family, _, _, _, sockaddr in socket.getaddrinfo(hostname, None):
|
||||
raw = sockaddr[0]
|
||||
if not isinstance(raw, str):
|
||||
continue
|
||||
if family == socket.AF_INET and not raw.startswith("127."):
|
||||
ips.add(raw)
|
||||
elif family == socket.AF_INET6 and not raw.startswith("::1"):
|
||||
ips.add(raw.split("%")[0]) # strip zone id
|
||||
except (socket.gaierror, OSError):
|
||||
pass
|
||||
# Also grab the primary outbound IP — `getaddrinfo(hostname)` can miss
|
||||
# it inside containers whose hostname has no DNS entry.
|
||||
try:
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
|
||||
s.connect(("10.255.255.255", 1))
|
||||
ips.add(s.getsockname()[0])
|
||||
except OSError:
|
||||
pass
|
||||
return sorted(ips)
|
||||
|
||||
|
||||
def _detect_runtime_context() -> RuntimeContext:
|
||||
uname = platform.uname()
|
||||
return RuntimeContext(
|
||||
mac_addresses=_collect_mac_addresses(),
|
||||
local_ips=_collect_local_ips(),
|
||||
os_family=uname.system or "Linux",
|
||||
os_version=uname.release or "unknown",
|
||||
device_manufacturer="Kubernetes",
|
||||
device_model=uname.node or socket.gethostname() or "pod",
|
||||
os_user=_safe_getuser(),
|
||||
)
|
||||
|
||||
|
||||
def _safe_getuser() -> str:
|
||||
try:
|
||||
return getpass.getuser()
|
||||
except (KeyError, OSError):
|
||||
return os.environ.get("USER", "unknown")
|
||||
|
||||
|
||||
RUNTIME_CONTEXT: RuntimeContext = _detect_runtime_context()
|
||||
|
||||
|
||||
def build_headers(session: SessionContext | None = None,
|
||||
connection_method: str = CONNECTION_METHOD_BATCH,
|
||||
runtime: RuntimeContext | None = None) -> dict[str, str]:
|
||||
"""Return the full header dict to attach to every HMRC API call.
|
||||
|
||||
Defaults to BATCH_PROCESS_DIRECT — the mode the CronJob uses. Pass
|
||||
a populated `SessionContext` + `connection_method=WEB_APP_VIA_SERVER`
|
||||
for browser-initiated flows; the browser-only fields layer on top.
|
||||
"""
|
||||
session = session or SessionContext()
|
||||
rt = runtime or RUNTIME_CONTEXT
|
||||
headers: dict[str, str] = {}
|
||||
headers.update(_static_headers())
|
||||
headers.update(_per_request_headers())
|
||||
headers.update(_mandatory_runtime_headers(rt, session, connection_method))
|
||||
if connection_method == CONNECTION_METHOD_WEB_APP:
|
||||
headers.update(_web_app_session_headers(session))
|
||||
if connection_method == CONNECTION_METHOD_MFA and session.mfa_timestamp:
|
||||
headers["Gov-Client-MFA-Timestamp"] = session.mfa_timestamp
|
||||
return headers
|
||||
|
||||
|
||||
def _static_headers() -> dict[str, str]:
|
||||
"""Vendor-constant identity headers that apply to every connection method.
|
||||
|
||||
Product-Name is percent-encoded per spec; License-IDs value is SHA-256-
|
||||
hashed per spec; Version is a key-value pair of `<software-name>=<version>`.
|
||||
|
||||
Gov-Vendor-Public-IP and Gov-Vendor-Forwarded are NOT emitted here — the
|
||||
HMRC validator rejects them for BATCH_PROCESS_DIRECT (where no vendor
|
||||
server sits between the client and the HMRC API). They're added in
|
||||
`_web_app_session_headers` for the WEB_APP_VIA_SERVER path only.
|
||||
"""
|
||||
license_hash = hashlib.sha256(VENDOR_LICENSE_ID.encode()).hexdigest()
|
||||
return {
|
||||
"Gov-Vendor-Product-Name": _pct(VENDOR_PRODUCT_NAME),
|
||||
"Gov-Vendor-Version": f"{VENDOR_PRODUCT_NAME}={VENDOR_PRODUCT_VERSION}",
|
||||
"Gov-Vendor-License-IDs": f"{VENDOR_PRODUCT_NAME}={license_hash}",
|
||||
}
|
||||
|
||||
|
||||
def _per_request_headers() -> dict[str, str]:
|
||||
"""Per-call trace + timestamp headers. Local-IPs-Timestamp uses HMRC's
|
||||
exact format `yyyy-MM-ddThh:mm:ss.sssZ` — always UTC, always millis."""
|
||||
now_ms = int(time.time() * 1000)
|
||||
iso_ms = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now_ms / 1000))
|
||||
now_iso = f"{iso_ms}.{now_ms % 1000:03d}Z"
|
||||
return {
|
||||
"Gov-Client-Timezone": "UTC+00:00",
|
||||
"Gov-Client-Local-IPs-Timestamp": now_iso,
|
||||
"x-correlation-id": str(uuid.uuid4()),
|
||||
"x-request-id": secrets.token_hex(16),
|
||||
}
|
||||
|
||||
|
||||
def _mandatory_runtime_headers(rt: RuntimeContext, session: SessionContext,
|
||||
connection_method: str) -> dict[str, str]:
|
||||
"""The 8 headers mandatory for BATCH_PROCESS_DIRECT that come from the
|
||||
host — Connection-Method, Device-ID, User-IDs, User-Agent, Local-IPs,
|
||||
MAC-Addresses (+ Timezone and Local-IPs-Timestamp live in
|
||||
`_per_request_headers`)."""
|
||||
return {
|
||||
"Gov-Client-Connection-Method": connection_method,
|
||||
"Gov-Client-Device-ID": session.device_id or _fallback_device_id(),
|
||||
"Gov-Client-User-IDs": _user_ids(rt, session),
|
||||
"Gov-Client-User-Agent": _user_agent(rt, session),
|
||||
"Gov-Client-Local-IPs": _format_ip_list(rt.local_ips),
|
||||
"Gov-Client-MAC-Addresses": _format_mac_list(rt.mac_addresses),
|
||||
}
|
||||
|
||||
|
||||
def _web_app_session_headers(session: SessionContext) -> dict[str, str]:
|
||||
"""WEB_APP_VIA_SERVER-only headers — browser context + vendor hop trail.
|
||||
|
||||
Gov-Vendor-Public-IP and Gov-Vendor-Forwarded describe the vendor server
|
||||
that sits between the user's browser and HMRC — only meaningful for
|
||||
WEB_APP_VIA_SERVER. BATCH_PROCESS_DIRECT must omit them (validator
|
||||
rejects them there).
|
||||
"""
|
||||
out: dict[str, str] = {}
|
||||
if session.screen_width and session.screen_height:
|
||||
out["Gov-Client-Screens"] = (
|
||||
f"width={session.screen_width}&height={session.screen_height}"
|
||||
f"&scaling-factor=1&colour-depth={session.screen_colour_depth}")
|
||||
if session.window_width and session.window_height:
|
||||
out["Gov-Client-Window-Size"] = (f"width={session.window_width}&"
|
||||
f"height={session.window_height}")
|
||||
if session.public_ip:
|
||||
out["Gov-Client-Public-IP"] = session.public_ip
|
||||
if session.public_port:
|
||||
out["Gov-Client-Public-Port"] = str(session.public_port)
|
||||
vendor_ip = VENDOR_PUBLIC_IP or (RUNTIME_CONTEXT.local_ips[0] if RUNTIME_CONTEXT.local_ips
|
||||
else "")
|
||||
if vendor_ip:
|
||||
out["Gov-Vendor-Public-IP"] = vendor_ip
|
||||
out["Gov-Vendor-Forwarded"] = f"by={vendor_ip}&for={vendor_ip}"
|
||||
return out
|
||||
|
||||
|
||||
def _user_ids(rt: RuntimeContext, session: SessionContext) -> str:
|
||||
"""Per spec: `os=<device-user>&<app>=<app-user>`. The `os=` field is
|
||||
mandatory; we additionally tag our application with the OAuth user
|
||||
so HMRC can correlate activity in a breach investigation.
|
||||
"""
|
||||
os_user = rt.os_user or "unknown"
|
||||
pairs = [f"os={_pct(os_user)}"]
|
||||
oauth_user = os.environ.get("HMRC_OAUTH_USER", "viktor")
|
||||
pairs.append(f"hmrc-sync={_pct(oauth_user)}")
|
||||
_ = session # reserved for future per-session identity extension
|
||||
return "&".join(pairs)
|
||||
|
||||
|
||||
def _user_agent(rt: RuntimeContext, session: SessionContext) -> str:
|
||||
"""Per spec: `os-family=…&os-version=…&device-manufacturer=…&device-model=…`.
|
||||
|
||||
For WEB_APP_VIA_SERVER with a captured browser UA, the browser string
|
||||
is encoded under `device-model` with the rest of the fields defaulting
|
||||
to our pod's values — HMRC's validator accepts this hybrid shape.
|
||||
"""
|
||||
model = session.user_agent or rt.device_model or "pod"
|
||||
pairs = [
|
||||
f"os-family={_pct(rt.os_family)}",
|
||||
f"os-version={_pct(rt.os_version)}",
|
||||
f"device-manufacturer={_pct(rt.device_manufacturer)}",
|
||||
f"device-model={_pct(model)}",
|
||||
]
|
||||
return "&".join(pairs)
|
||||
|
||||
|
||||
def _format_ip_list(ips: list[str]) -> str:
|
||||
"""IPv6 addresses percent-encoded; IPv4 passes through. Joined with ','.
|
||||
|
||||
HMRC's validator accepts an empty header only if the request truly
|
||||
has no IPs; on a live pod we always have at least one — if the list
|
||||
comes back empty we fall back to the loopback so the header is
|
||||
syntactically valid.
|
||||
"""
|
||||
if not ips:
|
||||
return "127.0.0.1"
|
||||
out = []
|
||||
for ip in ips:
|
||||
out.append(_pct(ip) if ":" in ip else ip)
|
||||
return ",".join(out)
|
||||
|
||||
|
||||
def _format_mac_list(macs: list[str]) -> str:
|
||||
"""Each MAC percent-encoded (colons → %3A), comma-joined.
|
||||
|
||||
Empty list → single dummy MAC so we never ship a blank header;
|
||||
HMRC's validator treats blank as a violation.
|
||||
"""
|
||||
if not macs:
|
||||
return _pct("02:00:00:00:00:00")
|
||||
return ",".join(_pct(m) for m in macs)
|
||||
|
||||
|
||||
def _fallback_device_id() -> str:
|
||||
"""Deterministic UUID derived from hostname when no Vault-backed
|
||||
Device-ID is seeded. Stable across restarts on the same node."""
|
||||
return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"hmrc-sync-{platform.node()}"))
|
||||
|
||||
|
||||
def _pct(s: str) -> str:
|
||||
return urllib.parse.quote(s, safe="")
|
||||
|
||||
|
||||
def as_validator_payload(headers: dict[str, str]) -> dict[str, Any]:
|
||||
"""Reshape headers for the HMRC fraud-header validator API body."""
|
||||
return {"headers": [{"name": k, "value": v} for k, v in headers.items()]}
|
||||
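The spec-mandated encodings in this module are easy to sanity-check with the stdlib alone. A sketch with standalone re-implementations of the percent-encoding, MAC-list, and timestamp formats (not imports from the module; the helper names here are illustrative):

```python
import re
import time
import urllib.parse

def pct(s: str) -> str:
    # Same shape as _pct above: percent-encode everything, no safe chars.
    return urllib.parse.quote(s, safe="")

def format_mac_list(macs: list[str]) -> str:
    # Colons become %3A; an empty list falls back to a dummy MAC so the
    # header is never blank.
    if not macs:
        return pct("02:00:00:00:00:00")
    return ",".join(pct(m) for m in macs)

def local_ips_timestamp() -> str:
    # HMRC's `yyyy-MM-ddThh:mm:ss.sssZ` — always UTC, always millis.
    now_ms = int(time.time() * 1000)
    base = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now_ms / 1000))
    return f"{base}.{now_ms % 1000:03d}Z"

print(format_mac_list(["aa:bb:cc:dd:ee:ff"]))  # → aa%3Abb%3Acc%3Add%3Aee%3Aff
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z",
                    local_ips_timestamp())
```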
125
hmrc_sync/oauth.py
Normal file

@@ -0,0 +1,125 @@
"""HMRC OAuth token persistence — Vault-backed refresh-token store.
|
||||
|
||||
The refresh_token is long-lived (HMRC grants 18 months). We keep it in
|
||||
Vault at `secret/viktor/hmrc_refresh_token` and let ESO sync it to a K8s
|
||||
Secret the pod mounts as an env var. On every refresh, we write the new
|
||||
token back to Vault so the next pod restart picks it up.
|
||||
|
||||
Writing back requires Vault write access — the pod uses a short-lived
|
||||
K8s-auth Vault token with a narrow policy that only allows writing
|
||||
`secret/viktor/hmrc_refresh_token`.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
from dataclasses import dataclass
|
||||
|
||||
import httpx
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
VAULT_KEY = "secret/viktor/hmrc_refresh_token"
|
||||
PROD_BASE = "https://api.service.hmrc.gov.uk"
|
||||
TOKEN_PATH = "/oauth/token"
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class OAuthCreds:
|
||||
client_id: str
|
||||
client_secret: str
|
||||
redirect_uri: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class TokenBundle:
|
||||
access_token: str
|
||||
refresh_token: str
|
||||
expires_in: int
|
||||
scope: str
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, data: dict[str, object]) -> TokenBundle:
|
||||
return cls(
|
||||
access_token=str(data["access_token"]),
|
||||
refresh_token=str(data["refresh_token"]),
|
||||
expires_in=int(data["expires_in"]), # type: ignore[arg-type]
|
||||
scope=str(data.get("scope", "")),
|
||||
)
|
||||
|
||||
|
||||
def load_creds_from_env() -> OAuthCreds:
|
||||
return OAuthCreds(
|
||||
client_id=os.environ["HMRC_PROD_CLIENT_ID"],
|
||||
client_secret=os.environ["HMRC_PROD_CLIENT_SECRET"],
|
||||
redirect_uri=os.environ["HMRC_PROD_REDIRECT_URI"],
|
||||
)
|
||||
|
||||
|
||||
async def exchange_code(creds: OAuthCreds, code: str) -> TokenBundle:
|
||||
"""Swap a fresh authorization_code for an access+refresh token pair."""
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
resp = await client.post(
|
||||
f"{PROD_BASE}{TOKEN_PATH}",
|
||||
data={
|
||||
"grant_type": "authorization_code",
|
||||
"client_id": creds.client_id,
|
||||
"client_secret": creds.client_secret,
|
||||
"redirect_uri": creds.redirect_uri,
|
||||
"code": code,
|
||||
},
|
||||
headers={"Accept": "application/vnd.hmrc.1.0+json"},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return TokenBundle.from_json(resp.json())
|
||||
|
||||
|
||||
async def refresh(creds: OAuthCreds, refresh_token: str) -> TokenBundle:
|
||||
"""Exchange an old refresh_token for a fresh access+refresh pair.
|
||||
|
||||
HMRC rotates the refresh_token on every refresh — the old one becomes
|
||||
invalid immediately after this call returns. Persist the new one to
|
||||
Vault atomically; a failure between the refresh and the Vault write
|
||||
leaves us stranded.
|
||||
"""
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
resp = await client.post(
|
||||
f"{PROD_BASE}{TOKEN_PATH}",
|
||||
data={
|
||||
"grant_type": "refresh_token",
|
||||
"client_id": creds.client_id,
|
||||
"client_secret": creds.client_secret,
|
||||
"refresh_token": refresh_token,
|
||||
},
|
||||
headers={"Accept": "application/vnd.hmrc.1.0+json"},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return TokenBundle.from_json(resp.json())
|
||||
|
||||
|
||||
def persist_to_vault(token: TokenBundle) -> None:
|
||||
"""Write the new refresh_token back to Vault.
|
||||
|
||||
Uses the hvac client with K8s-auth — the pod's service-account token
|
||||
at /var/run/secrets/kubernetes.io/serviceaccount/token logs into
|
||||
Vault's kubernetes auth method and receives a short-lived Vault token
|
||||
with write access to `secret/viktor/hmrc_refresh_token` only.
|
||||
"""
|
||||
import hvac
|
||||
|
||||
addr = os.environ.get("VAULT_ADDR", "https://vault.viktorbarzin.me")
|
||||
role = os.environ.get("VAULT_K8S_ROLE", "hmrc-sync")
|
||||
jwt_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
|
||||
with open(jwt_path, encoding="utf-8") as fh:
|
||||
jwt = fh.read()
|
||||
client = hvac.Client(url=addr)
|
||||
client.auth.kubernetes.login(role=role, jwt=jwt)
|
||||
client.secrets.kv.v2.create_or_update_secret(
|
||||
path="viktor/hmrc_refresh_token",
|
||||
secret={
|
||||
"refresh_token": token.refresh_token,
|
||||
"expires_in": token.expires_in,
|
||||
"scope": token.scope,
|
||||
},
|
||||
)
|
||||
log.info("Rotated HMRC refresh_token persisted to Vault")
|
||||
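`TokenBundle.from_json` tolerates HMRC's looseness around numeric fields and optional `scope`. A quick standalone check of the same parsing shape — the sample payload is hypothetical, not real HMRC data:

```python
from dataclasses import dataclass

@dataclass
class TokenBundle:
    access_token: str
    refresh_token: str
    expires_in: int
    scope: str

    @classmethod
    def from_json(cls, data: dict[str, object]) -> "TokenBundle":
        # expires_in may arrive as int or numeric string; scope may be
        # absent entirely, in which case it defaults to "".
        return cls(
            access_token=str(data["access_token"]),
            refresh_token=str(data["refresh_token"]),
            expires_in=int(data["expires_in"]),  # type: ignore[arg-type]
            scope=str(data.get("scope", "")),
        )

sample = {"access_token": "at", "refresh_token": "rt", "expires_in": "14400"}
bundle = TokenBundle.from_json(sample)
print(bundle.expires_in)    # → 14400
print(repr(bundle.scope))   # → ''
```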
178
oauth_dance.py
Normal file

@@ -0,0 +1,178 @@
"""Phase-1 HMRC MTD OAuth sandbox smoke test.
|
||||
|
||||
Runs the authorization_code flow against HMRC's test environment, captures
|
||||
the callback on localhost:8080, exchanges for tokens, then calls
|
||||
/individuals/income-received/employments/{nino}/{taxYear} for a test user.
|
||||
|
||||
Prerequisites (do once in the HMRC dev hub for the app):
|
||||
1. Add http://localhost:8080/oauth/callback as a Redirect URI.
|
||||
2. Subscribe to "Individuals Income Received API" (and accept terms).
|
||||
3. Create a sandbox test user (Individuals → Create Test User) and note
|
||||
the NINO + Government Gateway user ID + password.
|
||||
|
||||
Credentials are read from Vault (secret/viktor/hmrc_mtd_sandbox_client_{id,secret})
|
||||
with env-var fallback for portability.
|
||||
|
||||
Run:
|
||||
python3 oauth_dance.py --nino NH000000A --tax-year 2025-26
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import http.server
|
||||
import json
|
||||
import os
|
||||
import secrets
|
||||
import socketserver
|
||||
import sys
|
||||
import threading
|
||||
import urllib.parse
|
||||
import webbrowser
|
||||
from dataclasses import dataclass
|
||||
|
||||
import httpx
|
||||
|
||||
SANDBOX_BASE = "https://test-api.service.hmrc.gov.uk"
|
||||
AUTH_PATH = "/oauth/authorize"
|
||||
TOKEN_PATH = "/oauth/token"
|
||||
# Legacy "Individual Income API" v1.2 — annual SA summary. Path uses
|
||||
# the 10-digit Self-Assessment UTR, NOT the NINO. MTD
|
||||
# "Individuals Income Received API" would be richer (in-year YTD) but
|
||||
# isn't available to this app's subscription list.
|
||||
INCOME_PATH = "/individual-income/sa/{utr}/annual-summary/{tax_year}"
|
||||
INCOME_ACCEPT = "application/vnd.hmrc.1.2+json"
|
||||
|
||||
REDIRECT_URI = "http://localhost:8080/oauth/callback"
|
||||
CALLBACK_PORT = 8080
|
||||
SCOPE = "read:individual-income"
|
||||
|
||||
|
||||
@dataclass
|
||||
class Creds:
|
||||
client_id: str
|
||||
client_secret: str
|
||||
|
||||
|
||||
def load_creds() -> Creds:
|
||||
env_id = os.environ.get("HMRC_CLIENT_ID")
|
||||
env_secret = os.environ.get("HMRC_CLIENT_SECRET")
|
||||
if env_id and env_secret:
|
||||
return Creds(env_id, env_secret)
|
||||
import subprocess
|
||||
cid = subprocess.check_output(
|
||||
["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_id", "secret/viktor"],
|
||||
text=True,
|
||||
).strip()
|
||||
csec = subprocess.check_output(
|
||||
["vault", "kv", "get", "-field=hmrc_mtd_sandbox_client_secret", "secret/viktor"],
|
||||
text=True,
|
||||
).strip()
|
||||
return Creds(cid, csec)
|
||||
|
||||
|
||||
class _CallbackHandler(http.server.BaseHTTPRequestHandler):
|
||||
captured: dict[str, str] = {}
|
||||
|
||||
def do_GET(self) -> None:
|
||||
parsed = urllib.parse.urlparse(self.path)
|
||||
if parsed.path != "/oauth/callback":
|
||||
self.send_response(404)
|
||||
self.end_headers()
|
||||
return
|
||||
qs = urllib.parse.parse_qs(parsed.query)
|
||||
_CallbackHandler.captured.update({k: v[0] for k, v in qs.items()})
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "text/html; charset=utf-8")
|
||||
self.end_headers()
|
||||
body = b"<h2>HMRC auth received. You can close this tab.</h2>"
|
||||
self.wfile.write(body)
|
||||
|
||||
def log_message(self, *_args) -> None: # silence default stderr spam
|
||||
pass
|
||||
|
||||
|
||||
def run_callback_server_until_code(expected_state: str) -> dict[str, str]:
|
||||
with socketserver.TCPServer(("127.0.0.1", CALLBACK_PORT), _CallbackHandler) as srv:
|
||||
t = threading.Thread(target=srv.serve_forever, daemon=True)
|
||||
t.start()
|
||||
while "code" not in _CallbackHandler.captured and "error" not in _CallbackHandler.captured:
|
||||
threading.Event().wait(0.25)
|
||||
srv.shutdown()
|
||||
got = dict(_CallbackHandler.captured)
|
||||
if got.get("state") != expected_state:
|
||||
raise SystemExit(f"CSRF: state mismatch (got {got.get('state')!r}, want {expected_state!r})")
|
||||
if "error" in got:
|
||||
raise SystemExit(f"HMRC returned error: {got}")
|
||||
return got
|
||||
|
||||
|
||||
def exchange_code(creds: Creds, code: str) -> dict:
|
||||
r = httpx.post(
|
||||
f"{SANDBOX_BASE}{TOKEN_PATH}",
|
||||
data={
|
||||
"grant_type": "authorization_code",
|
||||
"client_id": creds.client_id,
|
||||
"client_secret": creds.client_secret,
|
||||
"redirect_uri": REDIRECT_URI,
|
||||
"code": code,
|
||||
},
|
||||
headers={"Accept": "application/vnd.hmrc.1.0+json"},
|
||||
timeout=30,
|
||||
)
|
||||
r.raise_for_status()
|
||||
return r.json()
|
||||
|
||||
|
||||
def call_income_received(access_token: str, utr: str, tax_year: str) -> httpx.Response:
|
||||
"""tax_year is '2015-16' style (legacy Individual Income API)."""
|
||||
url = f"{SANDBOX_BASE}{INCOME_PATH.format(utr=utr, tax_year=tax_year)}"
|
||||
return httpx.get(
|
||||
url,
|
||||
headers={
|
||||
"Accept": INCOME_ACCEPT,
|
||||
"Authorization": f"Bearer {access_token}",
|
||||
},
|
||||
timeout=30,
|
||||
)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--utr", required=True, help="Sandbox test-user 10-digit SA UTR, e.g. 2762163393")
|
||||
parser.add_argument("--tax-year", default="2015-16", help="Format 2015-16. Sandbox may only have canned data for certain years.")
|
||||
args = parser.parse_args()
|
||||
|
||||
creds = load_creds()
|
||||
state = secrets.token_urlsafe(24)
|
||||
auth_url = (
|
||||
f"{SANDBOX_BASE}{AUTH_PATH}?"
|
||||
+ urllib.parse.urlencode({
|
||||
"response_type": "code",
|
||||
"client_id": creds.client_id,
|
||||
"scope": SCOPE,
|
||||
"redirect_uri": REDIRECT_URI,
|
||||
"state": state,
|
||||
})
|
||||
)
|
||||
print(f"Opening browser to HMRC sandbox login...\n {auth_url}\n")
|
||||
webbrowser.open(auth_url)
|
||||
captured = run_callback_server_until_code(expected_state=state)
|
||||
print(f"Got auth code (truncated): {captured['code'][:12]}...")
|
||||
|
||||
tokens = exchange_code(creds, captured["code"])
|
||||
access = tokens["access_token"]
|
||||
print(f"Got access_token (exp {tokens.get('expires_in')}s), refresh_token present={('refresh_token' in tokens)}")
|
||||
|
||||
resp = call_income_received(access, args.utr, args.tax_year)
|
||||
print(f"\nGET /individual-income/sa/{args.utr}/annual-summary/{args.tax_year} → HTTP {resp.status_code}")
|
||||
try:
|
||||
print(json.dumps(resp.json(), indent=2))
|
||||
except Exception:
|
||||
print(resp.text)
|
||||
|
||||
return 0 if resp.status_code < 400 else 2
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
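The `state` round-trip that guards the OAuth callback can be exercised without a browser. A sketch using the same `secrets.token_urlsafe` + exact-match pattern as `run_callback_server_until_code` — the `check_state` helper is hypothetical, extracted here purely for illustration:

```python
import secrets

def check_state(expected: str, got: dict[str, str]) -> dict[str, str]:
    # Same guard as run_callback_server_until_code: reject any callback
    # whose state param does not exactly match the one we generated.
    if got.get("state") != expected:
        raise SystemExit(f"CSRF: state mismatch (got {got.get('state')!r})")
    return got

state = secrets.token_urlsafe(24)
ok = check_state(state, {"code": "abc", "state": state})
print(ok["code"])  # → abc

try:
    check_state(state, {"code": "abc", "state": "forged"})
except SystemExit:
    print("rejected forged state")
```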
53
pyproject.toml
Normal file

@@ -0,0 +1,53 @@
[tool.poetry]
name = "hmrc-sync"
version = "0.1.0"
description = "Pulls annual PAYE/NI from HMRC Individual Tax API v1.1 to reconcile against payslip-ingest"
authors = ["Viktor Barzin <viktorbarzin@meta.com>"]
readme = "README.md"
packages = [{ include = "hmrc_sync" }]

[tool.poetry.dependencies]
python = ">=3.12,<3.13"
fastapi = "^0.115"
uvicorn = "^0.32"
httpx = "^0.27"
pydantic = "^2.9"
sqlalchemy = { extras = ["asyncio"], version = "^2.0" }
asyncpg = "^0.29"
alembic = "^1.13"
click = "^8.1"
prometheus-fastapi-instrumentator = "^7.0"
hvac = "^2.3"

[tool.poetry.group.dev.dependencies]
pytest = "^8.3"
pytest-asyncio = "^0.23"
mypy = "^1.11"
ruff = "^0.6"
yapf = "^0.43"
respx = "^0.21"
aiosqlite = "^0.20"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]

[tool.mypy]
python_version = "3.12"
strict = true
files = ["hmrc_sync", "tests"]

[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "W", "I", "UP", "B", "SIM", "RUF"]

[tool.yapf]
based_on_style = "pep8"
column_limit = 100
0
tests/__init__.py
Normal file
292
tests/test_fraud_headers.py
Normal file

@@ -0,0 +1,292 @@
"""Fraud-header compliance checks.

Two layers:

1. **Local shape assertions** — pure-python checks that every mandatory
   Gov-Client-*/Gov-Vendor-* header is present and shaped per HMRC spec.
   Runs in every CI build.

2. **HMRC validator API smoke test** (`test_headers_pass_hmrc_validator`):
   sends the generated header set to the HMRC sandbox validator and
   asserts a clean 200 with no rejected headers. Gated on the
   `HMRC_VALIDATOR` env var so `pytest` still runs fine offline.

HMRC audits fraud headers during production-access review — a failing
validator smoke test MUST block deploy.

Spec references (primary):
https://developer.service.hmrc.gov.uk/guides/fraud-prevention/connection-method/batch-process-direct/
https://developer.service.hmrc.gov.uk/api-documentation/docs/api/service/txm-fph-validator-api/1.0
"""
from __future__ import annotations

import hashlib
import os
import re

import httpx
import pytest

from hmrc_sync.fraud_headers import (
    CONNECTION_METHOD_BATCH,
    CONNECTION_METHOD_WEB_APP,
    RUNTIME_CONTEXT,
    VENDOR_LICENSE_ID,
    VENDOR_PRODUCT_NAME,
    RuntimeContext,
    SessionContext,
    as_validator_payload,
    build_headers,
)

VALIDATOR_URL = (
    "https://test-api.service.hmrc.gov.uk/test/fraud-prevention-headers/validate")

# Per HMRC BATCH_PROCESS_DIRECT spec (11 mandatory headers).
BATCH_MANDATORY = {
    "Gov-Client-Connection-Method",
    "Gov-Client-Device-ID",
    "Gov-Client-Local-IPs",
    "Gov-Client-Local-IPs-Timestamp",
    "Gov-Client-MAC-Addresses",
    "Gov-Client-Timezone",
    "Gov-Client-User-Agent",
    "Gov-Client-User-IDs",
    "Gov-Vendor-License-IDs",
    "Gov-Vendor-Product-Name",
    "Gov-Vendor-Version",
}

# WEB_APP_VIA_SERVER adds browser-origin context on top of the batch set.
WEB_APP_EXTRAS = {
    "Gov-Client-Screens",
    "Gov-Client-Window-Size",
    "Gov-Client-Public-IP",
    "Gov-Client-Public-Port",
}


def _full_session() -> SessionContext:
    return SessionContext(
        user_agent="Mozilla/5.0 (X11; Linux x86_64) hmrc-sync-test",
        screen_width=1920,
        screen_height=1080,
        screen_colour_depth=24,
        window_width=1600,
        window_height=900,
        timezone_offset=0,
        device_id="6c3a9f60-1111-2222-3333-abcdef012345",
        public_ip="203.0.113.5",
        public_port=443,
    )


# --------------------------------------------------------------------------
# BATCH_PROCESS_DIRECT — the CronJob path. All 11 headers must be present.
# --------------------------------------------------------------------------


def test_batch_process_includes_all_11_mandatory_headers() -> None:
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    missing = BATCH_MANDATORY - hdrs.keys()
    assert not missing, f"BATCH_PROCESS_DIRECT missing mandatory headers: {missing}"


def test_batch_process_omits_browser_only_headers() -> None:
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    # Screens / Window-Size are browser-origin; Public-IP/Port route via a
    # client-facing IP which doesn't apply to a batch job.
    for h in ("Gov-Client-Screens", "Gov-Client-Window-Size",
              "Gov-Client-Public-IP", "Gov-Client-Public-Port"):
        assert h not in hdrs, f"BATCH emitted browser-only header: {h}"


def test_batch_process_connection_method_value() -> None:
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    assert hdrs["Gov-Client-Connection-Method"] == "BATCH_PROCESS_DIRECT"


# --------------------------------------------------------------------------
# Header-value shape assertions (per HMRC spec).
# --------------------------------------------------------------------------


def test_user_ids_starts_with_os_field() -> None:
    """Per spec: `os=<device-user>&<app>=<app-user>`. `os=` is mandatory."""
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Client-User-IDs"]
    assert value.startswith("os="), f"User-IDs missing os= prefix: {value!r}"
    # Key-value pairs separated by & — at least one beyond `os=`.
    pairs = value.split("&")
    assert len(pairs) >= 2, f"User-IDs should have app identifier too: {value!r}"


def test_user_agent_has_all_four_spec_fields() -> None:
    """Spec: `os-family=…&os-version=…&device-manufacturer=…&device-model=…`."""
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Client-User-Agent"]
    for key in ("os-family=", "os-version=", "device-manufacturer=", "device-model="):
        assert key in value, f"User-Agent missing {key!r}: {value!r}"


def test_mac_addresses_percent_encoded() -> None:
    """Spec: colons in MACs must be percent-encoded (%3A)."""
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Client-MAC-Addresses"]
    assert value, "MAC-Addresses must never be empty"
    assert ":" not in value, f"MAC-Addresses contains raw colons: {value!r}"
    assert "%3A" in value, f"MAC-Addresses must use %3A: {value!r}"


def test_local_ips_ipv6_percent_encoded() -> None:
    """IPv6 entries percent-encoded; IPv4 passes through."""
    hdrs = build_headers(
        connection_method=CONNECTION_METHOD_BATCH,
        runtime=_runtime_with_ips(["10.0.0.4", "fe80::1"]),
    )
    value = hdrs["Gov-Client-Local-IPs"]
    assert "10.0.0.4" in value
    assert "fe80::1" not in value  # raw v6 forbidden
    assert "fe80%3A%3A1" in value, f"IPv6 not encoded: {value!r}"


def test_vendor_license_id_is_sha256_hashed() -> None:
    """Spec: `Gov-Vendor-License-IDs: <name>=<hashed-value>`."""
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Vendor-License-IDs"]
    expected_hash = hashlib.sha256(VENDOR_LICENSE_ID.encode()).hexdigest()
    assert value == f"{VENDOR_PRODUCT_NAME}={expected_hash}", value
    # Hash must be 64 hex chars — catches accidental plaintext leakage.
    assert re.fullmatch(r"[a-z0-9-]+=[0-9a-f]{64}", value), value


def test_vendor_product_name_percent_encoded() -> None:
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    assert hdrs["Gov-Vendor-Product-Name"] == "hmrc-sync"  # no reserved chars in name


def test_vendor_version_format() -> None:
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Vendor-Version"]
    assert re.fullmatch(r"[a-z0-9-]+=\d+\.\d+\.\d+", value), value


def test_local_ips_timestamp_spec_format() -> None:
    """Spec: `yyyy-MM-ddThh:mm:ss.sssZ` — 24-hour, UTC, 3-digit millis."""
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Client-Local-IPs-Timestamp"]
    assert re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z", value), value


def test_timezone_utc_offset_format() -> None:
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    assert re.fullmatch(r"UTC[+-]\d{2}:\d{2}", hdrs["Gov-Client-Timezone"])


def test_device_id_is_valid_uuid() -> None:
    """UUID shape check: 8-4-4-4-12 hex — applies to fallback too."""
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    value = hdrs["Gov-Client-Device-ID"]
    assert re.fullmatch(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        value,
    ), value


# --------------------------------------------------------------------------
# MFA gating + per-call variance.
# --------------------------------------------------------------------------


def test_mfa_timestamp_only_emitted_for_mfa_method() -> None:
    """Gov-Client-MFA-Timestamp is for AUTH_USING_MFA; batch must not emit it."""
    batch = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    assert "Gov-Client-MFA-Timestamp" not in batch

    session = _full_session()
    session.mfa_timestamp = "2026-04-19T21:30:00.000Z"
    mfa = build_headers(session, connection_method="AUTH_USING_MFA")
    assert mfa.get("Gov-Client-MFA-Timestamp") == "2026-04-19T21:30:00.000Z"


def test_correlation_id_differs_per_call() -> None:
    a = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    b = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    assert a["x-correlation-id"] != b["x-correlation-id"]


# --------------------------------------------------------------------------
# WEB_APP_VIA_SERVER — batch set + browser extras.
# --------------------------------------------------------------------------


def test_web_app_includes_batch_mandatory_plus_browser_extras() -> None:
    hdrs = build_headers(_full_session(), connection_method=CONNECTION_METHOD_WEB_APP)
    missing = (BATCH_MANDATORY | WEB_APP_EXTRAS) - hdrs.keys()
    assert not missing, f"WEB_APP missing headers: {missing}"


# --------------------------------------------------------------------------
# Payload reshape (used by the validator smoke test + CI self-tests).
# --------------------------------------------------------------------------


def test_as_validator_payload_reshape() -> None:
    hdrs = {"Gov-Client-Connection-Method": "X", "Gov-Vendor-Product-Name": "y"}
    payload = as_validator_payload(hdrs)
    assert payload["headers"] == [
        {"name": "Gov-Client-Connection-Method", "value": "X"},
        {"name": "Gov-Vendor-Product-Name", "value": "y"},
    ]


# --------------------------------------------------------------------------
# HMRC sandbox validator smoke test — set HMRC_VALIDATOR=1 to enable.
# --------------------------------------------------------------------------


@pytest.mark.skipif(
    not (os.environ.get("HMRC_VALIDATOR")
         and os.environ.get("HMRC_SANDBOX_TOKEN")),
    reason=("HMRC sandbox validator smoke test — set HMRC_VALIDATOR=1 AND "
            "HMRC_SANDBOX_TOKEN=<app-token>. Dev Hub app must be subscribed "
            "to txm-fph-validator-api/1.0 (application-restricted)."),
)
def test_headers_pass_hmrc_validator() -> None:
    """GET /test/fraud-prevention-headers/validate with BATCH headers.

    Per the OAS spec the validator is a GET endpoint — headers go in the
    actual HTTP request, not a JSON body. Auth is application-restricted
    (client_credentials bearer). A successful response has code=VALID_HEADERS;
    POTENTIALLY_INVALID_HEADERS emits warnings but still passes; only
    INVALID_HEADERS is a hard fail.
    """
    hdrs = build_headers(connection_method=CONNECTION_METHOD_BATCH)
    request_headers = {
        **hdrs,
        "Accept": "application/vnd.hmrc.1.0+json",
        "Authorization": f"Bearer {os.environ['HMRC_SANDBOX_TOKEN']}",
    }
    resp = httpx.get(VALIDATOR_URL, headers=request_headers, timeout=30.0)
    assert resp.status_code == 200, (
        f"validator refused: {resp.status_code} {resp.text[:500]}")
    body = resp.json()
    code = body.get("code")
    assert code != "INVALID_HEADERS", f"validator rejected: {body}"
    # POTENTIALLY_INVALID_HEADERS is allowed — HMRC surfaces them as warnings;
    # log for visibility but don't fail the build on them.
    if code == "POTENTIALLY_INVALID_HEADERS":
        print(f"validator warnings: {body.get('warnings')}")


def _runtime_with_ips(ips: list[str]) -> RuntimeContext:
    """Build a RuntimeContext override with caller-specified local_ips."""
    return RuntimeContext(
        mac_addresses=RUNTIME_CONTEXT.mac_addresses,
        local_ips=ips,
        os_family=RUNTIME_CONTEXT.os_family,
        os_version=RUNTIME_CONTEXT.os_version,
        device_manufacturer=RUNTIME_CONTEXT.device_manufacturer,
        device_model=RUNTIME_CONTEXT.device_model,
        os_user=RUNTIME_CONTEXT.os_user,
    )