🔒 FOIP / PII Scanner (AI Auditor)

Goal: Catch privacy-risk text inside Notes / Comments / Description fields before exports get shared, emailed, archived, or turned into reports. This is a fast pre-screen that produces evidence tables for review (not a legal determination).

Problem
Unstructured text is where privacy leaks survive

Procurement exports can look clean (VendorID, amounts, POs) until the Notes field contains personal emails, names, phone numbers, or confidential context. Structured columns are governed; free-text is not.

FOIP/PII risk · Hidden in Notes · Exportable evidence
Output
Evidence table (exception list)

Each finding becomes a row with InvoiceID, RiskContent (the exact text), and DetectedFlags. This supports review workflows and audit trails.

InvoiceID · RiskContent · DetectedFlags

✅ What the scanner flags

AI
Named Entity Recognition (NER)

Detects likely PERSON names in text (entity group: PER) and applies a confidence threshold to reduce noise.

PER entities · Confidence threshold · Lower noise
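A minimal sketch of the post-filter step, assuming entities shaped like Hugging Face NER pipeline output with `aggregation_strategy="simple"` (the sample dicts, the helper name, and the 0.85 value are illustrative, not the repo's actual code):

```python
THRESHOLD = 0.85  # example policy value; see the "Confidence threshold" section

def filter_person_entities(entities, threshold=THRESHOLD):
    """Keep only likely PERSON guesses at or above the confidence threshold."""
    return [
        e for e in entities
        if e.get("entity_group") == "PER" and e.get("score", 0.0) >= threshold
    ]

# Illustrative entities in the Hugging Face pipeline output shape.
sample = [
    {"entity_group": "PER", "score": 0.97, "word": "Jane Doe"},
    {"entity_group": "PER", "score": 0.41, "word": "Acme"},      # noisy fragment
    {"entity_group": "ORG", "score": 0.99, "word": "Acme Ltd"},  # not a person
]
print(filter_person_entities(sample))  # only the high-confidence PER entity survives
```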
Deterministic
Heuristics for machine-detectable patterns

Flags email-like strings using simple checks (the string contains both @ and .). Additional patterns (phone/postal/keywords) can be added as policy controls.

POSSIBLE_EMAIL · Extensible rules · Policy dial
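The deterministic rule is intentionally crude. A sketch of the check as described above (function names are illustrative):

```python
def looks_like_email(token: str) -> bool:
    """Crude check mirroring the rule above: contains both '@' and '.'."""
    return "@" in token and "." in token

def flag_possible_emails(text: str) -> list[str]:
    """Return tokens that would trigger the POSSIBLE_EMAIL flag."""
    return [t for t in text.split() if looks_like_email(t)]

print(flag_possible_emails("Contact jane.doe@example.com re: PO 4411"))
# -> ['jane.doe@example.com']
```

Because the rule is deterministic, it runs identically with or without the AI model, which is what makes it safe to keep in CI.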

[Diagram: AI + Rules → Scanner flow]

[Figure: threshold_tradeoff]
Evidence schema
What a finding row contains

Output is designed for review: a clear identifier, the exact risky text, and the flags that triggered it.

InvoiceID · RiskContent · DetectedFlags
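A sketch of one finding row in that schema (the helper and flag-joining convention are assumptions, not the repo's exact serialization):

```python
def make_finding(invoice_id: str, risk_content: str, flags: set[str]) -> dict:
    """One evidence row: identifier, the exact risky text, and triggering flags."""
    return {
        "InvoiceID": invoice_id,
        "RiskContent": risk_content,
        "DetectedFlags": ";".join(sorted(flags)),  # stable order for audit diffs
    }

row = make_finding("INV-1042", "call Jane re: jane@x.com", {"POSSIBLE_EMAIL", "PER"})
print(row["DetectedFlags"])  # -> PER;POSSIBLE_EMAIL
```

Sorting the flags before joining keeps rows byte-identical across runs, which matters when evidence files are diffed or hashed for audit trails.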
Design rule
Evidence first, decision later

The scanner produces an exception list for human review. It does not attempt to adjudicate compliance outcomes.

Review workflow · Audit trail

Confidence threshold (policy dial)

Default
Why a threshold exists

NER output is inherently probabilistic: the model always returns guesses with confidence scores. A threshold keeps low-confidence fragments from becoming noise in audit evidence.

Example: 0.85 · Noise control · Explainable behavior
Mode
Tune to workflow risk appetite

High-risk review may prefer more findings (lower threshold). High-volume operations may prefer less noise (higher threshold). Treat this as a versioned control like any other audit rule.

False positives · False negatives · Versioned policy
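"Versioned control" can be as simple as storing the dial next to a version stamp so every evidence file is traceable to the policy that produced it (the record shape and names below are illustrative):

```python
# Hypothetical versioned policy record for the threshold dial.
POLICY = {
    "policy_version": "2026-01",
    "ner_confidence_threshold": 0.85,  # lower -> more findings (and more noise)
}

def passes_policy(score: float, policy: dict = POLICY) -> bool:
    """Decide whether an NER guess clears the current policy threshold."""
    return score >= policy["ner_confidence_threshold"]

print(passes_policy(0.91), passes_policy(0.52))  # -> True False
```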

False positives vs false negatives

False positives
Flags that are not truly privacy issues

Common causes include vendor names that resemble people, short token fragments, or context-free name detection.

Mitigate via threshold · Allow/deny lists (future) · Multi-signal rules (future)
False negatives
Missed privacy risks

Caused by initials, unusual formatting, non-English names, or PII types not covered by the model (phones/addresses).

Add phone/postal patterns · Keyword rules · Model swap option

Performance & stability

Runtime
Load once, reuse in Streamlit

The model is cached in the Streamlit app so it loads once per session. Subsequent runs reuse memory and stay responsive.

Cached model · Fast re-runs
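In Streamlit this is typically done with `@st.cache_resource` on the model loader. A dependency-free stand-in using `functools.lru_cache` shows the same load-once behavior (the loader below is a placeholder, not the real pipeline):

```python
from functools import lru_cache

@lru_cache(maxsize=1)  # in the real app: @st.cache_resource on the loader
def load_model():
    """Stand-in for the heavy transformers pipeline load; body runs only once."""
    print("loading heavy NER pipeline...")
    return object()

m1 = load_model()
m2 = load_model()
print(m1 is m2)  # -> True: subsequent calls reuse the same object
```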
CI
AI optional for stable builds

Transformers dependencies can be heavy and volatile in CI. The pipeline supports SKIP_AI=1 so deterministic checks and tests stay reliable.

Stable GitHub Actions · SKIP_AI=1
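A sketch of the gate (the helper name is an assumption; the SKIP_AI=1 convention is from this doc):

```python
import os

def ai_enabled() -> bool:
    """Deterministic checks always run; AI runs only when SKIP_AI is not '1'."""
    return os.environ.get("SKIP_AI", "0") != "1"

os.environ["SKIP_AI"] = "1"  # e.g. set in the GitHub Actions workflow env
print(ai_enabled())          # -> False: CI skips the transformers code path
```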

What “good” looks like (expected outputs)

CLI
Terminal findings summary

Console output shows total findings and a compact table of flagged rows.

Finding count · Flag types
Exports + UI
CSV evidence + Streamlit tab

A timestamped findings CSV appears under data/audit_reports/, and Streamlit shows the Findings tab with a download button.

Timestamped CSV · Dashboard table · Download
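The timestamped naming can be sketched with the standard library (the writer below follows the `foip_ai_findings_YYYYMMDD_HHMMSS.csv` pattern seen in this doc; the function itself is illustrative):

```python
import csv
import datetime
import pathlib

def write_findings_csv(rows, out_dir="data/audit_reports"):
    """Write findings to a timestamped CSV under the audit reports folder."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    path = pathlib.Path(out_dir) / f"foip_ai_findings_{stamp}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["InvoiceID", "RiskContent", "DetectedFlags"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return path
```

The timestamp makes every run's evidence a distinct, immutable artifact, which is what lets the Streamlit download button and the repo's archived CSVs coexist without overwrites.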

Evidence (media)

Clip C
AI FOIP/PII scan demo

Command: python src/ai_auditor.py (show findings printed + CSV created)

20–40s video · Terminal output · CSV evidence
Streamlit
Findings tab + download

Show the “FOIP/PII Findings” summary card, the Findings table, and the export/download area.

Summary card · Findings table · Download button
Clip C — AI scan (screenshot)
FOIP/PII findings evidence (CSV)

CSV file in repo: data/audit_reports/foip_ai_findings_20260131_224156.csv

Open findings CSV (GitHub)

Dashboard home (video)