Data Generator — Synthetic ERP Exports (Dirty-by-Design)

This project includes a data generator that creates realistic-looking procurement exports without using any real organizational data.

It produces two files that simulate what an auditor typically receives from an ERP system: a vendor master (vendor_master.csv) and an invoices export (invoices.xlsx).

The generator is intentionally dirty-by-design so the audit engine has something meaningful to detect.
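To make "dirty-by-design" concrete, here is a minimal sketch of how a generator could seed one defect at a controlled rate. It is not the code in src/data_generator.py. The function name, the VendorID, VendorName, and InvoiceID columns, the amount range, and the 10% ghost rate are all assumptions chosen for illustration. Only VENDOR-999, InvoiceAmount, and PO_Amount come from this page.

```python
import random


def generate_dataset(n_vendors=5, n_invoices=20, ghost_rate=0.1, seed=42):
    """Return (vendors, invoices); some invoices reference an unknown vendor.

    Hypothetical sketch: names, rates, and ranges are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed keeps demo runs reproducible

    vendors = [
        {"VendorID": f"VENDOR-{i:03d}", "VendorName": f"Supplier {i}"}
        for i in range(1, n_vendors + 1)
    ]

    invoices = []
    for n in range(1, n_invoices + 1):
        if rng.random() < ghost_rate:
            vendor_id = "VENDOR-999"  # seeded defect: absent from the master
        else:
            vendor_id = rng.choice(vendors)["VendorID"]
        po_amount = round(rng.uniform(500, 20000), 2)
        invoices.append(
            {
                "InvoiceID": f"INV-{n:04d}",
                "VendorID": vendor_id,
                "PO_Amount": po_amount,
                "InvoiceAmount": po_amount,  # clean row; no variance seeded here
            }
        )
    return vendors, invoices


vendors, invoices = generate_dataset()
print(len(vendors), "vendors,", len(invoices), "invoices")
```

Seeding each defect behind its own rate parameter would let a demo dial findings up or down without touching the detection logic.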


What gets created

Outputs

File             Path                                  What it represents
Vendor Master    data/raw_erp_dump/vendor_master.csv   Approved vendors (trusted reference)
Invoices Export  data/raw_erp_dump/invoices.xlsx       Operational invoices dump (untrusted input)

Open the generated files (your published docs copies)

These links assume the files are available under your published docs (recommended).


How to run it

python src/data_generator.py

Expected terminal output:


Why this exists (real-world reason)

Audit tooling fails in demos when the dataset is too clean. In real procurement exports:

This generator makes sure the dashboard and evidence exports always have realistic findings to show.


What issues are intentionally seeded

1) Ghost Vendors (anti-join violations)

Invoices include vendor IDs that do not exist in the Vendor Master.
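An anti-join keeps the rows on one side that have no match on the other. The sketch below shows the idea with plain Python sets. It is an illustration and not the audit engine's actual implementation. The function name and the VendorID and InvoiceID column names are assumptions.

```python
def find_ghost_vendor_invoices(invoices, vendor_master):
    """Return invoices whose VendorID is missing from the vendor master."""
    known_ids = {v["VendorID"] for v in vendor_master}  # trusted reference
    return [inv for inv in invoices if inv["VendorID"] not in known_ids]


# Tiny illustrative dataset (made up for this example).
vendor_master = [{"VendorID": "VENDOR-001"}, {"VendorID": "VENDOR-002"}]
invoices = [
    {"InvoiceID": "INV-0001", "VendorID": "VENDOR-001"},
    {"InvoiceID": "INV-0002", "VendorID": "VENDOR-999"},  # ghost vendor
]

ghosts = find_ghost_vendor_invoices(invoices, vendor_master)
print([g["InvoiceID"] for g in ghosts])  # ['INV-0002']
```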

Why this matters:

What the audit engine should detect:

Evidence output produced later:

Published evidence copy (recommended):

Screenshot proof:


2) PO Variance Breaches (variance math)

Some invoices are generated with invoice totals that differ materially from the PO amount.

Variance formula used by the engine:

variance = abs(invoice_amount - po_amount) / po_amount

Why this matters:

Evidence output produced later:

Published evidence copy (recommended):

Screenshot proof:


3) High-Value Invoices (threshold-based monitoring)

A portion of the invoices is generated with amounts above a configurable high-value threshold.
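A threshold check of this kind can be sketched in a few lines. The 50,000 default and the function name are placeholders, not the project's configured value or API. "Above" is read here as strictly greater than the threshold.

```python
HIGH_VALUE_THRESHOLD = 50_000.00  # hypothetical default; configurable in practice


def flag_high_value(invoices, threshold=HIGH_VALUE_THRESHOLD):
    """Return invoices whose InvoiceAmount is strictly above the threshold."""
    return [inv for inv in invoices if inv["InvoiceAmount"] > threshold]


# Made-up rows for illustration.
sample = [
    {"InvoiceID": "INV-0001", "InvoiceAmount": 1_250.00},
    {"InvoiceID": "INV-0002", "InvoiceAmount": 50_000.00},  # at threshold: not flagged
    {"InvoiceID": "INV-0003", "InvoiceAmount": 87_400.00},
]

print([inv["InvoiceID"] for inv in flag_high_value(sample)])  # ['INV-0003']
```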

Why this matters:

Evidence output produced later:

Published evidence copy (recommended):

Screenshot proof:


4) FOIP/PII risks embedded in Notes (unstructured text)

The Notes field is where “human messiness” lives:
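As an illustration of how PII-like content in free text might be flagged, the sketch below scans a note with a few regular expressions. The pattern set (email, North American phone number, nine-digit ID written as three groups of three) is an assumption for this example and not the engine's actual rule set. Real detection would need broader patterns and human review.

```python
import re

# Hypothetical patterns; deliberately simple and far from exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "sin_like": re.compile(r"\b\d{3}[-\s]\d{3}[-\s]\d{3}\b"),
}


def scan_notes(text):
    """Return the sorted names of PII-like patterns found in a note."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))


print(scan_notes("Call Jane at 780-555-0142 or jane.doe@example.com"))
# ['email', 'phone']
print(scan_notes("Delivered to dock 4"))
# []
```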

Why this matters:

Evidence output produced later:

Published evidence copy (recommended):

FOIP evidence capture (video):


Video capture checklist

  1. Run:

    python src/data_generator.py
    
  2. Open:

    • data/raw_erp_dump/vendor_master.csv
    • data/raw_erp_dump/invoices.xlsx

  3. Briefly scroll to show:

    • Vendor IDs in master
    • Invoice rows containing:

      • a ghost vendor ID (e.g., VENDOR-999)
      • a material variance between InvoiceAmount and PO_Amount
      • a Notes row with PII-like content

Clip A (your capture):