Skip to content

Download Manifest DB Quality Checks

This page defines a repeatable quality audit for backend/exports/normattiva/ita/download_manifest.db.

Scope

Checks cover:

  1. Referential integrity (dm_act_alias, dm_act_file, dm_download_event).
  2. Alias uniqueness (kind, value).
  3. Catalog ownership drift (non-owner residual rows).
  4. Ambiguity backlog (dm_download_ambiguity).
  5. Coverage of global-owner acts (source='avanzata').
  6. Logical repetition in dm_act_file.
  7. Runtime strict-ownership misses (reason='unmapped_owner_alias').

Runbook

Run from backend/:

uv run python - <<'PY'
import json
import sqlite3
from pathlib import Path

db = Path("exports/normattiva/ita/download_manifest.db")
conn = sqlite3.connect(db)
conn.row_factory = sqlite3.Row
c = conn.cursor()

out = {}
out["fk_violations"] = len(c.execute("PRAGMA foreign_key_check").fetchall())
out["rows_dm_act_catalog"] = c.execute("SELECT COUNT(*) FROM dm_act_catalog").fetchone()[0]
out["rows_dm_act_alias"] = c.execute("SELECT COUNT(*) FROM dm_act_alias").fetchone()[0]
out["rows_dm_act_file"] = c.execute("SELECT COUNT(*) FROM dm_act_file").fetchone()[0]
out["rows_dm_download_event"] = c.execute("SELECT COUNT(*) FROM dm_download_event").fetchone()[0]
out["rows_dm_download_ambiguity"] = c.execute(
    "SELECT COUNT(*) FROM dm_download_ambiguity"
).fetchone()[0]
out["unresolved_ambiguity"] = c.execute(
    "SELECT COUNT(*) FROM dm_download_ambiguity WHERE resolved = 0"
).fetchone()[0]
out["unmapped_owner_alias_unresolved"] = c.execute(
    """
    SELECT COUNT(*)
    FROM dm_download_ambiguity
    WHERE resolved = 0
      AND reason = 'unmapped_owner_alias'
    """
).fetchone()[0]
out["duplicate_alias_kind_value"] = c.execute(
    """
    SELECT COUNT(*)
    FROM (
      SELECT kind, value, COUNT(*) AS c
      FROM dm_act_alias
      GROUP BY kind, value
      HAVING c > 1
    )
    """
).fetchone()[0]
out["aliases_with_missing_act"] = c.execute(
    """
    SELECT COUNT(*)
    FROM dm_act_alias a
    LEFT JOIN dm_act_catalog c ON c.act_pk = a.act_pk
    WHERE c.act_pk IS NULL
    """
).fetchone()[0]
out["files_with_missing_act"] = c.execute(
    """
    SELECT COUNT(*)
    FROM dm_act_file f
    LEFT JOIN dm_act_catalog c ON c.act_pk = f.act_pk
    WHERE c.act_pk IS NULL
    """
).fetchone()[0]
out["events_with_missing_act"] = c.execute(
    """
    SELECT COUNT(*)
    FROM dm_download_event e
    LEFT JOIN dm_act_catalog c ON c.act_pk = e.act_pk
    WHERE e.act_pk IS NOT NULL AND c.act_pk IS NULL
    """
).fetchone()[0]
out["owner_total_acts"] = c.execute(
    "SELECT COUNT(*) FROM dm_act_catalog WHERE source='avanzata'"
).fetchone()[0]
out["owner_acts_with_any_file"] = c.execute(
    """
    SELECT COUNT(DISTINCT a.act_pk)
    FROM dm_act_catalog a
    JOIN dm_act_file f ON f.act_pk = a.act_pk
    WHERE a.source='avanzata'
    """
).fetchone()[0]
out["catalog_non_owner_rows"] = c.execute(
    "SELECT COUNT(*) FROM dm_act_catalog WHERE COALESCE(source,'') <> 'avanzata'"
).fetchone()[0]
out["non_owner_rows_referenced"] = c.execute(
    """
    WITH non_owner AS (
      SELECT act_pk FROM dm_act_catalog WHERE COALESCE(source,'') <> 'avanzata'
    ),
    refs AS (
      SELECT act_pk FROM dm_act_file
      UNION
      SELECT act_pk FROM dm_download_event WHERE act_pk IS NOT NULL
    )
    SELECT COUNT(*) FROM non_owner n JOIN refs r ON r.act_pk = n.act_pk
    """
).fetchone()[0]
out["act_file_distinct_paths"] = c.execute(
    "SELECT COUNT(DISTINCT path) FROM dm_act_file"
).fetchone()[0]
out["artifact_unique_bytes"] = c.execute(
    "SELECT COALESCE(SUM(size), 0) FROM dm_artifact_blob"
).fetchone()[0]
out["artifact_logical_bytes"] = c.execute(
    """
    SELECT COALESCE(SUM(b.size), 0)
    FROM dm_act_file f
    JOIN dm_artifact_blob b ON b.blob_id = f.blob_id
    """
).fetchone()[0]
print(json.dumps(out, ensure_ascii=False, indent=2))
conn.close()
PY

Baseline (2026-02-11)

Observed on backend/exports/normattiva/ita/download_manifest.db:

  1. Integrity:
  2. fk_violations: 0
  3. duplicate_alias_kind_value: 0
  4. aliases_with_missing_act: 0
  5. files_with_missing_act: 0
  6. events_with_missing_act: 0
  7. Catalog:
  8. owner_total_acts: 199946
  9. owner_acts_with_any_file: 180153 (90.10%)
  10. catalog_non_owner_rows: 6
  11. non_owner_rows_referenced: 6
  12. Ambiguity:
  13. rows_dm_download_ambiguity: 0
  14. unresolved_ambiguity: 0
  15. Repetition/dedup:
  16. rows_dm_act_file: 1450073
  17. act_file_distinct_paths: 1171222
  18. repeated path rows: 278851
  19. artifact_unique_bytes: 82203117877
  20. artifact_logical_bytes: 97088725051

Interpretation

  1. Data consistency is currently strong (all core integrity checks clean).
  2. Coverage of owner acts is high but incomplete (about 10% still missing files).
  3. Logical repetition in dm_act_file is expected because repeated observations are kept for auditability; storage uses blob dedup to avoid proportional disk growth.
  4. Remaining non-owner acts are a known residual set and are still referenced.

Follow-up Checks

After each full download run:

  1. Re-run the runbook query and compare against this baseline.
  2. Confirm catalog_non_owner_rows does not increase.
  3. Confirm unresolved_ambiguity remains 0 (or declines if non-zero).
  4. Track owner coverage trend (owner_acts_with_any_file / owner_total_acts).