Download Manifest DB Quality Checks
This page defines a repeatable quality audit for
backend/exports/normattiva/ita/download_manifest.db.
Scope
Checks cover:
- Referential integrity (
dm_act_alias,dm_act_file,dm_download_event). - Alias uniqueness (
kind,value). - Catalog ownership drift (non-owner residual rows).
- Ambiguity backlog (
dm_download_ambiguity). - Coverage of global-owner acts (
source='avanzata'). - Logical repetition in
dm_act_file. - Runtime strict-ownership misses (
reason='unmapped_owner_alias').
Runbook
Run from backend/:
uv run python - <<'PY'
import json
import sqlite3
from pathlib import Path
db = Path("exports/normattiva/ita/download_manifest.db")
conn = sqlite3.connect(db)
conn.row_factory = sqlite3.Row
c = conn.cursor()
out = {}
out["fk_violations"] = len(c.execute("PRAGMA foreign_key_check").fetchall())
out["rows_dm_act_catalog"] = c.execute("SELECT COUNT(*) FROM dm_act_catalog").fetchone()[0]
out["rows_dm_act_alias"] = c.execute("SELECT COUNT(*) FROM dm_act_alias").fetchone()[0]
out["rows_dm_act_file"] = c.execute("SELECT COUNT(*) FROM dm_act_file").fetchone()[0]
out["rows_dm_download_event"] = c.execute("SELECT COUNT(*) FROM dm_download_event").fetchone()[0]
out["rows_dm_download_ambiguity"] = c.execute(
"SELECT COUNT(*) FROM dm_download_ambiguity"
).fetchone()[0]
out["unresolved_ambiguity"] = c.execute(
"SELECT COUNT(*) FROM dm_download_ambiguity WHERE resolved = 0"
).fetchone()[0]
out["unmapped_owner_alias_unresolved"] = c.execute(
"""
SELECT COUNT(*)
FROM dm_download_ambiguity
WHERE resolved = 0
AND reason = 'unmapped_owner_alias'
"""
).fetchone()[0]
out["duplicate_alias_kind_value"] = c.execute(
"""
SELECT COUNT(*)
FROM (
SELECT kind, value, COUNT(*) AS c
FROM dm_act_alias
GROUP BY kind, value
HAVING c > 1
)
"""
).fetchone()[0]
out["aliases_with_missing_act"] = c.execute(
"""
SELECT COUNT(*)
FROM dm_act_alias a
LEFT JOIN dm_act_catalog c ON c.act_pk = a.act_pk
WHERE c.act_pk IS NULL
"""
).fetchone()[0]
out["files_with_missing_act"] = c.execute(
"""
SELECT COUNT(*)
FROM dm_act_file f
LEFT JOIN dm_act_catalog c ON c.act_pk = f.act_pk
WHERE c.act_pk IS NULL
"""
).fetchone()[0]
out["events_with_missing_act"] = c.execute(
"""
SELECT COUNT(*)
FROM dm_download_event e
LEFT JOIN dm_act_catalog c ON c.act_pk = e.act_pk
WHERE e.act_pk IS NOT NULL AND c.act_pk IS NULL
"""
).fetchone()[0]
out["owner_total_acts"] = c.execute(
"SELECT COUNT(*) FROM dm_act_catalog WHERE source='avanzata'"
).fetchone()[0]
out["owner_acts_with_any_file"] = c.execute(
"""
SELECT COUNT(DISTINCT a.act_pk)
FROM dm_act_catalog a
JOIN dm_act_file f ON f.act_pk = a.act_pk
WHERE a.source='avanzata'
"""
).fetchone()[0]
out["catalog_non_owner_rows"] = c.execute(
"SELECT COUNT(*) FROM dm_act_catalog WHERE COALESCE(source,'') <> 'avanzata'"
).fetchone()[0]
out["non_owner_rows_referenced"] = c.execute(
"""
WITH non_owner AS (
SELECT act_pk FROM dm_act_catalog WHERE COALESCE(source,'') <> 'avanzata'
),
refs AS (
SELECT act_pk FROM dm_act_file
UNION
SELECT act_pk FROM dm_download_event WHERE act_pk IS NOT NULL
)
SELECT COUNT(*) FROM non_owner n JOIN refs r ON r.act_pk = n.act_pk
"""
).fetchone()[0]
out["act_file_distinct_paths"] = c.execute(
"SELECT COUNT(DISTINCT path) FROM dm_act_file"
).fetchone()[0]
out["artifact_unique_bytes"] = c.execute(
"SELECT COALESCE(SUM(size), 0) FROM dm_artifact_blob"
).fetchone()[0]
out["artifact_logical_bytes"] = c.execute(
"""
SELECT COALESCE(SUM(b.size), 0)
FROM dm_act_file f
JOIN dm_artifact_blob b ON b.blob_id = f.blob_id
"""
).fetchone()[0]
print(json.dumps(out, ensure_ascii=False, indent=2))
conn.close()
PY
Baseline (2026-02-11)
Observed on backend/exports/normattiva/ita/download_manifest.db:
- Integrity:
fk_violations:0duplicate_alias_kind_value:0aliases_with_missing_act:0files_with_missing_act:0events_with_missing_act:0- Catalog:
owner_total_acts:199946owner_acts_with_any_file:180153(90.10%)catalog_non_owner_rows:6non_owner_rows_referenced:6- Ambiguity:
rows_dm_download_ambiguity:0unresolved_ambiguity:0- Repetition/dedup:
rows_dm_act_file:1450073act_file_distinct_paths:1171222- repeated path rows:
278851 artifact_unique_bytes:82203117877artifact_logical_bytes:97088725051
Interpretation
- Data consistency is currently strong (all core integrity checks clean).
- Coverage of owner acts is high but incomplete (about 10% still missing files).
- Logical repetition in
dm_act_fileis expected because repeated observations are kept for auditability; storage uses blob dedup to avoid proportional disk growth. - Remaining non-owner acts are a known residual set and are still referenced.
Follow-up Checks
After each full download run:
- Re-run the runbook query and compare against this baseline.
- Confirm
catalog_non_owner_rowsdoes not increase. - Confirm
unresolved_ambiguityremains0(or declines if non-zero). - Track owner coverage trend (
owner_acts_with_any_file / owner_total_acts).