🔒 FOIP / PII Scanner (AI Auditor)

Goal: Catch privacy-risk text inside Notes / Comments / Description fields before exports get shared, emailed, archived, or turned into reports. This is a fast pre-screen that produces evidence tables for review (not a legal determination).

Problem
Unstructured text is where privacy leaks survive

Procurement exports can look clean (VendorID, amounts, POs) until the Notes field contains personal emails, names, phone numbers, or confidential context. Structured columns are governed; free-text is not.

FOIP/PII risk · Hidden in Notes · Exportable evidence
Output
Evidence table (exception list)

Each finding becomes a row with InvoiceID, RiskContent (the exact text), and DetectedFlags. This supports review workflows and audit trails.

InvoiceID · RiskContent · DetectedFlags

✅ What the scanner flags

AI
Named Entity Recognition (NER)

Detects likely PERSON names in text (entity group: PER) and applies a confidence threshold to reduce noise.

PER entities · Confidence threshold · Lower noise
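A minimal sketch of the post-filter step, assuming entities shaped like Hugging Face NER pipeline output with `aggregation_strategy="simple"` (the sample dicts, the helper name, and the 0.85 value are illustrative, not the repo's actual code):

```python
THRESHOLD = 0.85  # example policy value; see the "Confidence threshold" section

def filter_person_entities(entities, threshold=THRESHOLD):
    """Keep only likely PERSON guesses at or above the confidence threshold."""
    return [
        e for e in entities
        if e.get("entity_group") == "PER" and e.get("score", 0.0) >= threshold
    ]

# Illustrative entities in the Hugging Face pipeline output shape.
sample = [
    {"entity_group": "PER", "score": 0.97, "word": "Jane Doe"},
    {"entity_group": "PER", "score": 0.41, "word": "Acme"},      # noisy fragment
    {"entity_group": "ORG", "score": 0.99, "word": "Acme Ltd"},  # not a person
]
print(filter_person_entities(sample))  # only the high-confidence PER entity survives
```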
Deterministic
Heuristics for machine-detectable patterns

Flags email-like strings using simple checks (the string contains both @ and .). Additional patterns (phone/postal/keywords) can be added as policy controls.

POSSIBLE_EMAIL · Extensible rules · Policy dial
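The deterministic rule is intentionally crude. A sketch of the check as described above (function names are illustrative):

```python
def looks_like_email(token: str) -> bool:
    """Crude check mirroring the rule above: contains both '@' and '.'."""
    return "@" in token and "." in token

def flag_possible_emails(text: str) -> list[str]:
    """Return tokens that would trigger the POSSIBLE_EMAIL flag."""
    return [t for t in text.split() if looks_like_email(t)]

print(flag_possible_emails("Contact jane.doe@example.com re: PO 4411"))
# -> ['jane.doe@example.com']
```

Because the rule is deterministic, it runs identically with or without the AI model, which is what makes it safe to keep in CI.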

[Diagram: AI + Rules → Scanner flow]

[Figure: threshold_tradeoff]
Evidence schema
What a finding row contains

Output is designed for review: a clear identifier, the exact risky text, and the flags that triggered it.

InvoiceID · RiskContent · DetectedFlags
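A sketch of one finding row in that schema (the helper and flag-joining convention are assumptions, not the repo's exact serialization):

```python
def make_finding(invoice_id: str, risk_content: str, flags: set[str]) -> dict:
    """One evidence row: identifier, the exact risky text, and triggering flags."""
    return {
        "InvoiceID": invoice_id,
        "RiskContent": risk_content,
        "DetectedFlags": ";".join(sorted(flags)),  # stable order for audit diffs
    }

row = make_finding("INV-1042", "call Jane re: jane@x.com", {"POSSIBLE_EMAIL", "PER"})
print(row["DetectedFlags"])  # -> PER;POSSIBLE_EMAIL
```

Sorting the flags before joining keeps rows byte-identical across runs, which matters when evidence files are diffed or hashed for audit trails.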
Design rule
Evidence first, decision later

The scanner produces an exception list for human review. It does not attempt to adjudicate compliance outcomes.

Review workflow · Audit trail

Confidence threshold (policy dial)

Default
Why a threshold exists

NER output is inherently probabilistic: the model always returns guesses with confidence scores. A threshold keeps low-confidence fragments from becoming noise in audit evidence.

Example: 0.85 · Noise control · Explainable behavior
Mode
Tune to workflow risk appetite

High-risk review may prefer more findings (lower threshold). High-volume operations may prefer less noise (higher threshold). Treat this as a versioned control like any other audit rule.

False positives · False negatives · Versioned policy
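"Versioned control" can be as simple as storing the dial next to a version stamp so every evidence file is traceable to the policy that produced it (the record shape and names below are illustrative):

```python
# Hypothetical versioned policy record for the threshold dial.
POLICY = {
    "policy_version": "2026-01",
    "ner_confidence_threshold": 0.85,  # lower -> more findings (and more noise)
}

def passes_policy(score: float, policy: dict = POLICY) -> bool:
    """Decide whether an NER guess clears the current policy threshold."""
    return score >= policy["ner_confidence_threshold"]

print(passes_policy(0.91), passes_policy(0.52))  # -> True False
```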

False positives vs false negatives

False positives
Flags that are not truly privacy issues

Common causes include vendor names that resemble people, short token fragments, or context-free name detection.

Mitigate via threshold · Allow/deny lists (future) · Multi-signal rules (future)
False negatives
Missed privacy risks

Caused by initials, unusual formatting, non-English names, or PII types not covered by the model (phones/addresses).

Add phone/postal patterns · Keyword rules · Model swap option

Performance & stability

Runtime
Load once, reuse in Streamlit

The model is cached in the Streamlit app so it loads once per session. Subsequent runs reuse memory and stay responsive.

Cached model · Fast re-runs
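In Streamlit this is typically done with `@st.cache_resource` on the model loader. A dependency-free stand-in using `functools.lru_cache` shows the same load-once behavior (the loader below is a placeholder, not the real pipeline):

```python
from functools import lru_cache

@lru_cache(maxsize=1)  # in the real app: @st.cache_resource on the loader
def load_model():
    """Stand-in for the heavy transformers pipeline load; body runs only once."""
    print("loading heavy NER pipeline...")
    return object()

m1 = load_model()
m2 = load_model()
print(m1 is m2)  # -> True: subsequent calls reuse the same object
```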
CI
AI optional for stable builds

Transformers dependencies can be heavy and volatile in CI. The pipeline supports SKIP_AI=1 so deterministic checks and tests stay reliable.

Stable GitHub Actions · SKIP_AI=1
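A sketch of the gate (the helper name is an assumption; the SKIP_AI=1 convention is from this doc):

```python
import os

def ai_enabled() -> bool:
    """Deterministic checks always run; AI runs only when SKIP_AI is not '1'."""
    return os.environ.get("SKIP_AI", "0") != "1"

os.environ["SKIP_AI"] = "1"  # e.g. set in the GitHub Actions workflow env
print(ai_enabled())          # -> False: CI skips the transformers code path
```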

What “good” looks like (expected outputs)

CLI
Terminal findings summary

Console output shows total findings and a compact table of flagged rows.

Finding count · Flag types
Exports + UI
CSV evidence + Streamlit tab

A timestamped findings CSV appears under data/audit_reports/, and Streamlit shows the Findings tab with a download button.

Timestamped CSV · Dashboard table · Download
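The timestamped naming can be sketched with the standard library (the writer below follows the `foip_ai_findings_YYYYMMDD_HHMMSS.csv` pattern seen in this doc; the function itself is illustrative):

```python
import csv
import datetime
import pathlib

def write_findings_csv(rows, out_dir="data/audit_reports"):
    """Write findings to a timestamped CSV under the audit reports folder."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    path = pathlib.Path(out_dir) / f"foip_ai_findings_{stamp}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["InvoiceID", "RiskContent", "DetectedFlags"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return path
```

The timestamp makes every run's evidence a distinct, immutable artifact, which is what lets the Streamlit download button and the repo's archived CSVs coexist without overwrites.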

Evidence (media)

Clip C
AI FOIP/PII scan demo

Command: python src/ai_auditor.py (show findings printed + CSV created)

20–40s video · Terminal output · CSV evidence
Streamlit
Findings tab + download

Show the “FOIP/PII Findings” summary card, the Findings table, and the export/download area.

Summary card · Findings table · Download button
Clip C — AI scan (screenshot)
FOIP/PII findings evidence (CSV)

CSV file in repo: data/audit_reports/foip_ai_findings_20260131_224156.csv

Open findings CSV (GitHub)

Dashboard home (video)