broker-sync/broker_sync/providers/parsers/invest_engine.py

330 lines
11 KiB
Python
Raw Normal View History

Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
"""InvestEngine email parser.
IE mails the user after each trade batch. The body shape varies over
the years IE has sent trade confirmations as plain-text RFC 2822
messages, multipart HTML emails with a summary table, and (for older
statements) CSV attachments. This module tries the three strategies in
order and returns the first that yields at least one Activity.
Every parse strategy produces canonical `Activity` objects with:
- `account_id = "invest-engine-primary"` (sink remaps to Wealthfolio UUID)
- `account_type = AccountType.ISA` (Viktor's IE account is an ISA)
- `currency = "GBP"`
- `external_id = f"invest-engine:{fingerprint}"` where fingerprint hashes
(date, symbol, quantity, unit_price) for deterministic dedup.
"""
from __future__ import annotations
Add CSV attachment fallback for InvestEngine email parser Context: IE has not (yet) sent CSV-attached statements in production, but the upstream parser had _extract_positions_csv as a third fallback for exactly this case. Keeping the fallback preserves behaviour-parity with the legacy parser and makes future statement support one fixture away — the shape is documented by column set, not scraped live. Unlike the upstream which split the body on whitespace and broke on any embedded commas in names, this port walks real MIME attachments using Python's csv.DictReader. A part qualifies as CSV if: - its Content-Type is text/csv / application/csv / application/vnd.ms-excel, OR - its filename ends in .csv (defence against IE mis-labelling the part) Rows missing required columns or containing unparseable numbers/dates are skipped silently — consistent with the "partial match" contract: a half-corrupt CSV yields whatever rows were intact. Required columns: ticker, unit_price, quantity, date (YYYY-MM-DD), currency. Non-GBP rows are filtered because the IE ISA is strictly sterling — flagging this assumption in the review notes. This change: - Adds `_parse_csv_attachment(raw_email)` as the third strategy after text/plain and text/html; it re-parses the raw email bytes so we can inspect Content-Type/filename on each part. - Flags symbols/currencies, filters non-GBP, and runs each row through the shared `_build_activity` so external_id formation matches every other strategy (dedup stays consistent across strategies). - Fixture `csv_attachment.eml` has three rows (VUAG, SWDA, VUSA) in a `text/csv` part with a `.csv` filename — covers both detection paths. Test plan: poetry run pytest tests/providers/parsers/ -q → 6 passed in 0.15s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff → clean (no diff) Manual verification: load csv_attachment.eml, call parse_invest_engine_email, assert 3 activities each with symbol in {VUAG,SWDA,VUSA}, currency=GBP, notes containing "csv".
2026-04-17 22:01:46 +00:00
import csv
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
import email
import hashlib
Add CSV attachment fallback for InvestEngine email parser Context: IE has not (yet) sent CSV-attached statements in production, but the upstream parser had _extract_positions_csv as a third fallback for exactly this case. Keeping the fallback preserves behaviour-parity with the legacy parser and makes future statement support one fixture away — the shape is documented by column set, not scraped live. Unlike the upstream which split the body on whitespace and broke on any embedded commas in names, this port walks real MIME attachments using Python's csv.DictReader. A part qualifies as CSV if: - its Content-Type is text/csv / application/csv / application/vnd.ms-excel, OR - its filename ends in .csv (defence against IE mis-labelling the part) Rows missing required columns or containing unparseable numbers/dates are skipped silently — consistent with the "partial match" contract: a half-corrupt CSV yields whatever rows were intact. Required columns: ticker, unit_price, quantity, date (YYYY-MM-DD), currency. Non-GBP rows are filtered because the IE ISA is strictly sterling — flagging this assumption in the review notes. This change: - Adds `_parse_csv_attachment(raw_email)` as the third strategy after text/plain and text/html; it re-parses the raw email bytes so we can inspect Content-Type/filename on each part. - Flags symbols/currencies, filters non-GBP, and runs each row through the shared `_build_activity` so external_id formation matches every other strategy (dedup stays consistent across strategies). - Fixture `csv_attachment.eml` has three rows (VUAG, SWDA, VUSA) in a `text/csv` part with a `.csv` filename — covers both detection paths. Test plan: poetry run pytest tests/providers/parsers/ -q → 6 passed in 0.15s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff → clean (no diff) Manual verification: load csv_attachment.eml, call parse_invest_engine_email, assert 3 activities each with symbol in {VUAG,SWDA,VUSA}, currency=GBP, notes containing "csv".
2026-04-17 22:01:46 +00:00
import io
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
import re
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
from datetime import datetime
Add CSV attachment fallback for InvestEngine email parser Context: IE has not (yet) sent CSV-attached statements in production, but the upstream parser had _extract_positions_csv as a third fallback for exactly this case. Keeping the fallback preserves behaviour-parity with the legacy parser and makes future statement support one fixture away — the shape is documented by column set, not scraped live. Unlike the upstream which split the body on whitespace and broke on any embedded commas in names, this port walks real MIME attachments using Python's csv.DictReader. A part qualifies as CSV if: - its Content-Type is text/csv / application/csv / application/vnd.ms-excel, OR - its filename ends in .csv (defence against IE mis-labelling the part) Rows missing required columns or containing unparseable numbers/dates are skipped silently — consistent with the "partial match" contract: a half-corrupt CSV yields whatever rows were intact. Required columns: ticker, unit_price, quantity, date (YYYY-MM-DD), currency. Non-GBP rows are filtered because the IE ISA is strictly sterling — flagging this assumption in the review notes. This change: - Adds `_parse_csv_attachment(raw_email)` as the third strategy after text/plain and text/html; it re-parses the raw email bytes so we can inspect Content-Type/filename on each part. - Flags symbols/currencies, filters non-GBP, and runs each row through the shared `_build_activity` so external_id formation matches every other strategy (dedup stays consistent across strategies). - Fixture `csv_attachment.eml` has three rows (VUAG, SWDA, VUSA) in a `text/csv` part with a `.csv` filename — covers both detection paths. Test plan: poetry run pytest tests/providers/parsers/ -q → 6 passed in 0.15s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff → clean (no diff) Manual verification: load csv_attachment.eml, call parse_invest_engine_email, assert 3 activities each with symbol in {VUAG,SWDA,VUSA}, currency=GBP, notes containing "csv".
2026-04-17 22:01:46 +00:00
from decimal import Decimal, InvalidOperation
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
from email.message import Message
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
from bs4 import BeautifulSoup
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
from broker_sync.models import AccountType, Activity, ActivityType
_ACCOUNT_ID = "invest-engine-primary"
_CURRENCY_SIGN = "£"
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
# HTML trade summary rows have the shape "Bought <qty> @ £<price> per share".
_BOUGHT_RE = re.compile(
r"Bought\s+([0-9]+(?:\.[0-9]+)?)\s*@\s*" + re.escape(_CURRENCY_SIGN) + r"([0-9]+(?:\.[0-9]+)?)",
re.IGNORECASE,
)
# Ticker lines look like "Vanguard S&P 500: VUAG" — we want the last
# all-caps token after the colon.
_TICKER_RE = re.compile(r":\s*([A-Z][A-Z0-9]{1,9})\s*$")
# Date rows contain "Date: DD Month YYYY".
_DATE_RE = re.compile(
r"Date:\s*([0-9]{1,2})\s+([A-Za-z]+)\s+([0-9]{4})",
re.IGNORECASE,
)
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
def parse_invest_engine_email(raw_email: bytes) -> list[Activity]:
"""Parse an IE trade confirmation email into Activity records.
Add CSV attachment fallback for InvestEngine email parser Context: IE has not (yet) sent CSV-attached statements in production, but the upstream parser had _extract_positions_csv as a third fallback for exactly this case. Keeping the fallback preserves behaviour-parity with the legacy parser and makes future statement support one fixture away — the shape is documented by column set, not scraped live. Unlike the upstream which split the body on whitespace and broke on any embedded commas in names, this port walks real MIME attachments using Python's csv.DictReader. A part qualifies as CSV if: - its Content-Type is text/csv / application/csv / application/vnd.ms-excel, OR - its filename ends in .csv (defence against IE mis-labelling the part) Rows missing required columns or containing unparseable numbers/dates are skipped silently — consistent with the "partial match" contract: a half-corrupt CSV yields whatever rows were intact. Required columns: ticker, unit_price, quantity, date (YYYY-MM-DD), currency. Non-GBP rows are filtered because the IE ISA is strictly sterling — flagging this assumption in the review notes. This change: - Adds `_parse_csv_attachment(raw_email)` as the third strategy after text/plain and text/html; it re-parses the raw email bytes so we can inspect Content-Type/filename on each part. - Flags symbols/currencies, filters non-GBP, and runs each row through the shared `_build_activity` so external_id formation matches every other strategy (dedup stays consistent across strategies). - Fixture `csv_attachment.eml` has three rows (VUAG, SWDA, VUSA) in a `text/csv` part with a `.csv` filename — covers both detection paths. Test plan: poetry run pytest tests/providers/parsers/ -q → 6 passed in 0.15s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff → clean (no diff) Manual verification: load csv_attachment.eml, call parse_invest_engine_email, assert 3 activities each with symbol in {VUAG,SWDA,VUSA}, currency=GBP, notes containing "csv".
2026-04-17 22:01:46 +00:00
Tries RFC 2822 body lines first, then HTML tables, then a CSV
attachment. Returns an empty list when nothing matches never
raises on malformed input.
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
"""
msg = email.message_from_bytes(raw_email)
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
text_body = _extract_part_body(msg, "text/plain")
if text_body is not None:
activities = _parse_rfc2822_lines(text_body)
if activities:
return activities
html_body = _extract_part_body(msg, "text/html")
if html_body is not None:
activities = _parse_html_tables(html_body)
if activities:
return activities
Add CSV attachment fallback for InvestEngine email parser Context: IE has not (yet) sent CSV-attached statements in production, but the upstream parser had _extract_positions_csv as a third fallback for exactly this case. Keeping the fallback preserves behaviour-parity with the legacy parser and makes future statement support one fixture away — the shape is documented by column set, not scraped live. Unlike the upstream which split the body on whitespace and broke on any embedded commas in names, this port walks real MIME attachments using Python's csv.DictReader. A part qualifies as CSV if: - its Content-Type is text/csv / application/csv / application/vnd.ms-excel, OR - its filename ends in .csv (defence against IE mis-labelling the part) Rows missing required columns or containing unparseable numbers/dates are skipped silently — consistent with the "partial match" contract: a half-corrupt CSV yields whatever rows were intact. Required columns: ticker, unit_price, quantity, date (YYYY-MM-DD), currency. Non-GBP rows are filtered because the IE ISA is strictly sterling — flagging this assumption in the review notes. This change: - Adds `_parse_csv_attachment(raw_email)` as the third strategy after text/plain and text/html; it re-parses the raw email bytes so we can inspect Content-Type/filename on each part. - Flags symbols/currencies, filters non-GBP, and runs each row through the shared `_build_activity` so external_id formation matches every other strategy (dedup stays consistent across strategies). - Fixture `csv_attachment.eml` has three rows (VUAG, SWDA, VUSA) in a `text/csv` part with a `.csv` filename — covers both detection paths. Test plan: poetry run pytest tests/providers/parsers/ -q → 6 passed in 0.15s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff → clean (no diff) Manual verification: load csv_attachment.eml, call parse_invest_engine_email, assert 3 activities each with symbol in {VUAG,SWDA,VUSA}, currency=GBP, notes containing "csv".
2026-04-17 22:01:46 +00:00
csv_activities = _parse_csv_attachment(raw_email)
if csv_activities:
return csv_activities
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
return []
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
def _extract_part_body(msg: Message, content_type: str) -> str | None:
"""Return the first sub-part of the given content type, or None."""
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
if msg.is_multipart():
for part in msg.walk():
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
if part.get_content_type() == content_type:
return _decode_payload(part)
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
return None
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
if msg.get_content_type() == content_type:
return _decode_payload(msg)
return None
def _decode_payload(part: Message) -> str | None:
payload = part.get_payload(decode=True)
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
if isinstance(payload, bytes):
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
return payload.decode(part.get_content_charset() or "utf-8", errors="replace")
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
if isinstance(payload, str):
return payload
return None
def _parse_rfc2822_lines(body: str) -> list[Activity]:
"""Try each line-based body format (v1/v2) and return matches.
Corresponds to `_extract_position_v1` and `_extract_position_v2` in
the upstream parser. Returns a one-element list on success, `[]`
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
otherwise. v3/v4 are not ported no surviving fixtures exist and
the HTML fallback covers newer formats.
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
"""
for parser in (_try_v2, _try_v1):
result = parser(body)
if result is not None:
return [result]
return []
def _try_v2(body: str) -> Activity | None:
"""Parse body with v2 layout: `Date: DD Month` on line 2, year on line 3."""
lines = body.splitlines()
if len(lines) < 6:
return None
try:
day_str, month = lines[2].split()[-2:]
year = lines[3].split()[0]
on_date = datetime.strptime(f"{day_str}-{month}-{year}", "%d-%B-%Y")
symbol = lines[4].split(":")[1].split()[0].strip()
unit_price = Decimal(lines[4].split(_CURRENCY_SIGN)[1].split()[0])
quantity = Decimal(lines[4].split("Bought")[1].split()[0])
except (ValueError, IndexError):
return None
return _build_activity(
on_date=on_date,
symbol=symbol,
quantity=quantity,
unit_price=unit_price,
strategy="rfc2822-v2",
matched=lines[4],
)
def _try_v1(body: str) -> Activity | None:
"""Parse body with v1 layout: `Date: DD` on line 2, `Month YYYY` on line 3."""
lines = body.splitlines()
if len(lines) < 6:
return None
try:
day = int(lines[2].split("Date: ")[1])
month, year = (lines[3].split(" ")[0]).split()
on_date = datetime.strptime(f"{day}-{month}-{year}", "%d-%B-%Y")
symbol = lines[4].split(":")[1].split()[0].strip()
quantity = Decimal(lines[4].split("Bought")[1].split()[0])
price_str = lines[4].split("Bought")[1].split("@")[1].split()[0].split(_CURRENCY_SIGN)[1]
unit_price = Decimal(price_str)
except (ValueError, IndexError):
return None
return _build_activity(
on_date=on_date,
symbol=symbol,
quantity=quantity,
unit_price=unit_price,
strategy="rfc2822-v1",
matched=lines[4],
)
Add HTML table fallback for InvestEngine email parser Context: Plain-text IE emails vanished around 2024-Q2 when IE switched to an HTML-only template with per-order nested summary tables. The RFC 2822 line parser returns [] on those modern emails, so we need a fallback that walks the HTML table structure. Upstream _extract_from_html parsed a fixed DOM path (table[1].tr[10]. table) and only handled ONE order per email. The real IE HTML template nests one summary <table> per ticker inside the second top-level table — multiple orders in a single batched confirmation are common — so this port walks every leaf table (no child <table>) and interprets each one as an independent trade summary. Structural (non-leaf) tables are skipped to avoid double-counting via get_text(). This change: - `_parse_html_tables(body)` extracts the date once from the full text then walks leaf tables looking for "Bought N @ £P" rows. - `_try_html_summary_table` parses one leaf; returns None on structural tables or missing ticker/qty/price — so a partial email yields only its intact orders (the "2 orders, 1 parseable → 1 returned" invariant works by construction without raising). - `parse_invest_engine_email` now falls through text/plain → text/html in the multipart message, picking the first strategy that returns activities. Order matters: text/plain wins when both succeed because the RFC 2822 strategy is the more constrained grammar. - Regexes are module-level constants so they compile once per process. Fixture `html_two_orders.eml` is a minimal-but-realistic multipart email with two nested summary tables (VUAG + SWDA), no personal data beyond tickers/qty/price. Test plan: poetry run pytest tests/providers/parsers/ -q → 5 passed in 0.16s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load html_two_orders.eml, call parse_invest_engine_email, assert len == 2 with both expected tickers (VUAG, SWDA) and numbers, dates set to 2026-04-01.
2026-04-17 21:58:15 +00:00
def _parse_html_tables(body: str) -> list[Activity]:
"""Parse an HTML body with per-order nested summary tables.
Walks every leaf <table> (a table with no child tables); each leaf
carries one trade summary (ticker, bought line, total, ISIN + order
id). Tables that don't contain the expected shape are skipped, so a
partially corrupted email yields only its intact orders.
"""
soup = BeautifulSoup(body, "html.parser")
on_date = _extract_html_date(soup)
if on_date is None:
return []
activities: list[Activity] = []
for table in soup.find_all("table"):
if table.find("table") is not None:
continue
activity = _try_html_summary_table(table, on_date)
if activity is not None:
activities.append(activity)
return activities
def _extract_html_date(soup: BeautifulSoup) -> datetime | None:
match = _DATE_RE.search(soup.get_text(" ", strip=True))
if match is None:
return None
day, month, year = match.groups()
try:
return datetime.strptime(f"{day}-{month}-{year}", "%d-%B-%Y")
except ValueError:
return None
def _try_html_summary_table(nested: object, on_date: datetime) -> Activity | None:
"""Interpret a leaf <table> as a single trade summary.
Returns None if the table is structural (no "Bought N @ £P" row) or
any required field is missing.
"""
get_text = getattr(nested, "get_text", None)
if get_text is None:
return None
text = get_text(" ", strip=True)
bought = _BOUGHT_RE.search(text)
if bought is None:
return None
symbol = _extract_html_symbol(nested)
if symbol is None:
return None
quantity = Decimal(bought.group(1))
unit_price = Decimal(bought.group(2))
return _build_activity(
on_date=on_date,
symbol=symbol,
quantity=quantity,
unit_price=unit_price,
strategy="html",
matched=text[:200],
)
def _extract_html_symbol(nested: object) -> str | None:
find_all = getattr(nested, "find_all", None)
if find_all is None:
return None
for cell in find_all("td"):
cell_text = cell.get_text(" ", strip=True)
m = _TICKER_RE.search(cell_text)
if m is not None:
return m.group(1)
return None
Add CSV attachment fallback for InvestEngine email parser Context: IE has not (yet) sent CSV-attached statements in production, but the upstream parser had _extract_positions_csv as a third fallback for exactly this case. Keeping the fallback preserves behaviour-parity with the legacy parser and makes future statement support one fixture away — the shape is documented by column set, not scraped live. Unlike the upstream which split the body on whitespace and broke on any embedded commas in names, this port walks real MIME attachments using Python's csv.DictReader. A part qualifies as CSV if: - its Content-Type is text/csv / application/csv / application/vnd.ms-excel, OR - its filename ends in .csv (defence against IE mis-labelling the part) Rows missing required columns or containing unparseable numbers/dates are skipped silently — consistent with the "partial match" contract: a half-corrupt CSV yields whatever rows were intact. Required columns: ticker, unit_price, quantity, date (YYYY-MM-DD), currency. Non-GBP rows are filtered because the IE ISA is strictly sterling — flagging this assumption in the review notes. This change: - Adds `_parse_csv_attachment(raw_email)` as the third strategy after text/plain and text/html; it re-parses the raw email bytes so we can inspect Content-Type/filename on each part. - Flags symbols/currencies, filters non-GBP, and runs each row through the shared `_build_activity` so external_id formation matches every other strategy (dedup stays consistent across strategies). - Fixture `csv_attachment.eml` has three rows (VUAG, SWDA, VUSA) in a `text/csv` part with a `.csv` filename — covers both detection paths. Test plan: poetry run pytest tests/providers/parsers/ -q → 6 passed in 0.15s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff → clean (no diff) Manual verification: load csv_attachment.eml, call parse_invest_engine_email, assert 3 activities each with symbol in {VUAG,SWDA,VUSA}, currency=GBP, notes containing "csv".
2026-04-17 22:01:46 +00:00
_CSV_CONTENT_TYPES = {"text/csv", "application/csv", "application/vnd.ms-excel"}
# Required columns for the CSV attachment strategy. IE has not (yet) sent
# CSV-attached statements in production — the column set here mirrors the
# upstream _extract_positions_csv contract (ticker, buy_price, num_shares,
# buy_date, currency) with modern names.
_CSV_COLUMNS = {"ticker", "unit_price", "quantity", "date", "currency"}
def _parse_csv_attachment(raw_email: bytes) -> list[Activity]:
"""Parse a CSV attachment from the email into Activity records.
Walks every MIME part, picks the first one with a CSV-ish content
type OR a `.csv` filename, and iterates its rows. Rows missing a
required column or with an unparseable number/date are skipped.
"""
msg = email.message_from_bytes(raw_email)
csv_text = _extract_csv_attachment_text(msg)
if csv_text is None:
return []
reader = csv.DictReader(io.StringIO(csv_text))
fieldnames = set(reader.fieldnames or [])
if not _CSV_COLUMNS.issubset(fieldnames):
return []
activities: list[Activity] = []
for row in reader:
activity = _csv_row_to_activity(row)
if activity is not None:
activities.append(activity)
return activities
def _extract_csv_attachment_text(msg: Message) -> str | None:
for part in msg.walk():
if not _looks_like_csv_part(part):
continue
payload = part.get_payload(decode=True)
if isinstance(payload, bytes):
return payload.decode(part.get_content_charset() or "utf-8", errors="replace")
if isinstance(payload, str):
return payload
return None
def _looks_like_csv_part(part: Message) -> bool:
if part.get_content_type() in _CSV_CONTENT_TYPES:
return True
filename = part.get_filename()
return isinstance(filename, str) and filename.lower().endswith(".csv")
def _csv_row_to_activity(row: dict[str, str]) -> Activity | None:
try:
on_date = datetime.strptime(row["date"], "%Y-%m-%d")
symbol = row["ticker"].strip()
quantity = Decimal(row["quantity"])
unit_price = Decimal(row["unit_price"])
currency = row["currency"].strip() or "GBP"
except (KeyError, ValueError, InvalidOperation):
return None
if not symbol or currency != "GBP":
return None
return _build_activity(
on_date=on_date,
symbol=symbol,
quantity=quantity,
unit_price=unit_price,
strategy="csv",
matched=f"{symbol},{unit_price},{quantity},{row['date']}",
)
Add InvestEngine email parser — RFC 2822 v1/v2 line format Context: The old finance/ app had a 324-line IE message parser with four line-based variants (v1/v2/v3/v4) plus an HTML strategy and a CSV fallback. Port into broker-sync so we can consume IE trade confirmation emails as a backup to the live HTTP client (Phase 2b) while IE's public API remains Bearer-only. The upstream parser emits storage.model.Position; we emit canonical Activity with the broker-sync invariants: account_id="invest-engine-primary" (sink remaps to Wealthfolio UUID), account_type=ISA, currency=GBP, and external_id="invest-engine:<fingerprint>" where the fingerprint is a SHA-256 of (date|symbol|quantity|unit_price) — deterministic so repeat imports of the same email dedup at the sync-record layer. This change: - Top-level `parse_invest_engine_email(raw_email: bytes) -> list[Activity]` extracts the text/plain body from an RFC 2822 message and dispatches to the line-based parser. - `_parse_rfc2822_lines(body)` tries the v2 layout first (newer IE format where `Date: DD Month` is on line 2 and the year on line 3), then the v1 layout (where the day alone is on line 2 and `Month YYYY` on line 3). v3 and v4 variants are re-added in a follow-up if we find fixtures where they matter — initial fixture coverage hits v2. - Drops the upstream `_ticker_post_processing` VUAG→VUAG.L hack. Wealthfolio's /import/check endpoint resolves exchange suffixes; the Trading212 provider also emits suffix-free tickers (e.g. `VUAG`), so staying consistent avoids double-mapping. - Notes field records the parse-strategy tag ("rfc2822-v2") plus the matched line for debugging. Test plan: poetry run pytest tests/providers/parsers/ -q → 3 passed in 0.03s poetry run mypy broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → Success: no issues found in 2 source files poetry run ruff check broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → All checks passed! poetry run yapf --diff broker_sync/providers/parsers/invest_engine.py tests/providers/parsers/test_invest_engine.py → clean (no diff) Manual verification: load the fixture email, call the parser, inspect the returned Activity has symbol=VUAG, quantity=59.539562, unit_price=60.46, date=2023-01-17, external_id starts with invest-engine:.
2026-04-17 21:49:52 +00:00
def _build_activity(
*,
on_date: datetime,
symbol: str,
quantity: Decimal,
unit_price: Decimal,
strategy: str,
matched: str,
) -> Activity:
fingerprint = _fingerprint(on_date, symbol, quantity, unit_price)
return Activity(
external_id=f"invest-engine:{fingerprint}",
account_id=_ACCOUNT_ID,
account_type=AccountType.ISA,
date=on_date,
activity_type=ActivityType.BUY,
currency="GBP",
symbol=symbol,
quantity=quantity,
unit_price=unit_price,
notes=f"[{strategy}] {matched.strip()}",
)
def _fingerprint(date: datetime, symbol: str, quantity: Decimal, unit_price: Decimal) -> str:
key = f"{date.isoformat()}|{symbol}|{quantity}|{unit_price}"
return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]