Normattiva async export – investigation notes
This document summarizes the experiments and findings around the Normattiva
asynchronous export API, as used by ItalyDownloader and ad‑hoc scripts.
See also:
High‑level goals
- Replicate the full Normattiva dataset for Italy:
  - Multivigente (richiestaExport="M") – last N years with full temporal history.
  - Vigente (richiestaExport="V") – acts in force at the query date.
  - Originario (richiestaExport="O") – original versions.
- Start from the documentation claim:
  At present, data can be queried and downloaded in the following modes:
  - All normative acts published on the Normattiva portal in their “original” version
  - All normative acts published on the Normattiva portal in the version in force at the query date
  - Normative acts published in the last 5 years in “multivigenza” version
Downloader implementation (summary)
ItalyDownloader (backend/src/law_graph/pipelines/normattiva/download/italy.py):
- Uses the async export endpoints:
  - POST /api/v1/ricerca-asincrona/nuova-ricerca
  - PUT /api/v1/ricerca-asincrona/conferma-ricerca
  - GET /api/v1/ricerca-asincrona/check-status/{token}
  - GET /api/v1/collections/download/collection-asincrona/{token}
- Always sends tipoRicerca="A" (advanced search) with a RicercaAvanzataFilterDto payload (wrapped as AdvancedSearchFilter).
- Modes:
  - History window: from history_start_year=1861 up to a cutoff around 2020 (configurable), for V and O.
  - Recent / rolling window: from the cutoff to “now” for M.
- Writes files under: exports/normattiva/ita/{law_id}/{format_key}/{modalita}/{filename}
- Tracks runs in exports/normattiva/ita/download_manifest.db with:
  - format_key ("akn" / "json")
  - modalita (currently "R", responsive)
  - richiesta_export ("M", "V", "O")
  - files_written, bytes_written, window timestamps.
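The four-step endpoint sequence above can be sketched as follows. This is a minimal illustration, not the ItalyDownloader code: the send callable, the "token" and "status" response fields, the "DONE" value, and the body of the confirmation request are all hypothetical placeholders, because these notes do not record the actual request and response schema. Only the endpoint paths come from the list above.

```python
import time
from typing import Any, Callable, Optional

# Endpoint paths as listed in these notes.
NEW_SEARCH = "/api/v1/ricerca-asincrona/nuova-ricerca"
CONFIRM = "/api/v1/ricerca-asincrona/conferma-ricerca"
CHECK_STATUS = "/api/v1/ricerca-asincrona/check-status/{token}"
DOWNLOAD = "/api/v1/collections/download/collection-asincrona/{token}"

# Hypothetical transport: send(method, path, json_body) -> decoded response.
Send = Callable[[str, str, Optional[dict]], Any]


def run_async_export(send: Send, payload: dict, poll_interval: float = 5.0,
                     max_polls: int = 60) -> bytes:
    """Submit, confirm, poll, then download one async export."""
    # ASSUMPTION: the submit response carries the token under "token".
    token = send("POST", NEW_SEARCH, payload)["token"]
    # ASSUMPTION: confirmation only needs the token echoed back.
    send("PUT", CONFIRM, {"token": token})
    for _ in range(max_polls):
        status = send("GET", CHECK_STATUS.format(token=token), None)
        # ASSUMPTION: the field name and terminal value are placeholders.
        if status.get("status") == "DONE":
            # The download step is expected to return the ZIP archive bytes.
            return send("GET", DOWNLOAD.format(token=token), None)
        time.sleep(poll_interval)
    raise TimeoutError(f"export {token} not ready after {max_polls} polls")
```

Injecting the transport keeps the sketch independent of any particular HTTP client and lets the sequence be exercised against a fake in tests.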
Initial behavior and issues
- First full run after implementing the downloader:
  - Multivigente (M) windows behaved as expected:
    - Non‑zero files_written for "akn" and "json".
    - Earliest publication date on disk: 2020‑11‑16.
  - History windows for V and O:
    - Log messages showed expected windows: Window 1861-01-01 -> 2020-11-11 formato=akn richiesta_export=V modalita=R
    - Same for JSON and later for O.
    - However, _download_window_chunk logged: No entries returned for formato=… richiesta_export=V/O …
    - Result: files_written=0 for all history runs in the manifest.
- Subsequent runs:
  - Because the manifest recorded history runs with files_written=0 and timestamps at the end_cap, _build_filter treated V/O as already “synced through” the cutoff and skipped them entirely.
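One possible guard against the skip described above is to advance the “synced through” marker only for runs that actually wrote files. The sketch below assumes a hypothetical in-memory RunRecord type; the real manifest is the SQLite file download_manifest.db, whose schema these notes do not give, and this is not necessarily how _build_filter was or will be changed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical record; field names mirror the manifest columns named above.
    richiesta_export: str
    format_key: str
    window_end: date
    files_written: int


def last_synced_through(runs: Iterable[RunRecord], richiesta_export: str,
                        format_key: str) -> Optional[date]:
    """Latest window_end among matching runs that wrote at least one file."""
    ends = [
        r.window_end
        for r in runs
        if r.richiesta_export == richiesta_export
        and r.format_key == format_key
        and r.files_written > 0  # empty runs do not count as "synced"
    ]
    return max(ends) if ends else None
```

The trade-off is that a window that is legitimately empty would be retried on every run, so a real implementation might also want a retry limit or an explicit “confirmed empty” flag.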
Overlap / date overflow bug
- While experimenting with very large --overlap-days values, we hit: OverflowError: date value out of range in _build_filter.
- Root cause:
  - We computed start_date = (last_run - timedelta(days=overlap)).date() without clamping overlap, so a huge overlap could push the date before date.min.
- Fix:
  - Clamp overlap against the distance from date.min.
  - Added test_build_filter_does_not_overflow to assert we no longer underflow and that start_date is at least the mode’s start_floor.
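The clamp described above can be sketched as below. This is a minimal illustration rather than the actual _build_filter code: the function name clamp_start_date and its parameters are hypothetical, and only date.min, timedelta, and the start_floor concept come from these notes.

```python
from datetime import date, datetime, timedelta


def clamp_start_date(last_run: datetime, overlap_days: int,
                     start_floor: date) -> date:
    """Subtract the overlap from last_run without going below date.min."""
    last_day = last_run.date()
    # Largest overlap that still yields a representable date.
    max_overlap = (last_day - date.min).days
    # Clamp before building the timedelta, so huge values cannot overflow.
    overlap = max(0, min(overlap_days, max_overlap))
    start = last_day - timedelta(days=overlap)
    # Never start earlier than the floor configured for the mode.
    return max(start, start_floor)
```

Clamping the integer first matters because both timedelta construction and date subtraction can raise OverflowError when given an out-of-range number of days.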
Experiments with history windows
To understand why V/O exports were empty, we added and used
scripts/normattiva_experiment.py. This script:
- Constructs AdvancedSearchFilter payloads for various combinations.
- Calls the same async endpoints as the downloader.
- Prints the number of entries in the resulting ZIP archive for each run.
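The last step, counting entries in the downloaded archive, needs only the standard library. A minimal sketch, assuming the download is held in memory as bytes (the helper name count_zip_entries is hypothetical and may not match what the script actually does):

```python
import io
import zipfile


def count_zip_entries(archive_bytes: bytes) -> int:
    """Return the number of file entries (directories excluded) in a ZIP."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())
```

A structurally valid archive with no entries yields 0, which is the “empty ZIP” outcome reported in the experiments below.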
Key experiments (all formato="AKN", modalita="R", tipoRicerca="A" unless noted):
Broad 1950 windows (publication date)
- V-1950 / O-1950:
  - dataInizioPubblicazione="1950-01-01"
  - dataFinePubblicazione="1950-12-31"
  - Result: 0 files in the downloaded archive.
- V-1950-lim / O-1950-lim:
  - Same as above, but narrower window: dataFinePubblicazione="1950-03-31".
  - Result: 0 files.
- V-1950-vig / O-1950-vig:
  - Publication + vigenza filters:
    - dataInizioPubblicazione="1950-01-01"
    - dataFinePubblicazione="1950-12-31"
    - limitaAnniVigenza=True
    - dataInizioVigenza="1950-01-01"
    - dataFineVigenza="1950-12-31"
  - Result: 0 files.
Conclusion: even reasonably small, well‑formed 1950 date windows for V or O yield empty ZIPs via the async export API.
Emanazione windows
- V-1950-emanazione / O-1950-emanazione:
  - dataInizioEmanazione="1950-01-01"
  - dataFineEmanazione="1950-12-31"
  - Result: 0 files.
Provvedimento year only
- V-anno-1950 / O-anno-1950:
  - annoProvvedimento=1950
  - orderType="DESC"
  - Result: 0 files.
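For reference, the date-window and year experiments above can be written down as plain filter dictionaries. This is a sketch only: the field names and values are the ones listed in these notes, but the helper names (publication_window, experiment_runs) are hypothetical, and how the filter is wrapped together with richiestaExport, formato, modalita and tipoRicerca in the real request is not recorded here.

```python
from typing import Dict, Iterator, Tuple


def publication_window(start: str, end: str) -> dict:
    """Filter fragment restricting by publication date."""
    return {"dataInizioPubblicazione": start, "dataFinePubblicazione": end}


# Filter fields per experiment, keyed by the suffix used in the labels above.
EXPERIMENT_FILTERS: Dict[str, dict] = {
    "1950": publication_window("1950-01-01", "1950-12-31"),
    "1950-lim": publication_window("1950-01-01", "1950-03-31"),
    "1950-vig": {
        **publication_window("1950-01-01", "1950-12-31"),
        "limitaAnniVigenza": True,
        "dataInizioVigenza": "1950-01-01",
        "dataFineVigenza": "1950-12-31",
    },
    "1950-emanazione": {
        "dataInizioEmanazione": "1950-01-01",
        "dataFineEmanazione": "1950-12-31",
    },
    "anno-1950": {"annoProvvedimento": 1950, "orderType": "DESC"},
}


def experiment_runs() -> Iterator[Tuple[str, str, dict]]:
    """Yield (label, richiesta_export, filter) for every V/O combination."""
    for mode in ("V", "O"):
        for name, flt in EXPERIMENT_FILTERS.items():
            # Copy so callers can mutate a filter without affecting the table.
            yield f"{mode}-{name}", mode, dict(flt)
```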
Simple search (tipoRicerca="S")
- V-simple / O-simple (not currently in the script, but tested separately):
  - tipoRicerca="S" with a simple payload:
    - denominazioneAtto="LEGGE"
    - testoRicerca="decreto"
  - Response:
    - HTTP 400 with code="1502": "Numero di atti superiore al limite consentito di 7000, raffinare la ricerca" (in English: “Number of acts exceeds the allowed limit of 7000, refine the search”).
Conclusion: simple search is available but heavily constrained; broad text filters without strong narrowing are rejected at request time, and we did not find a combination that both passes the limit and returns historical originario or vigente acts in bulk.
Positive result: per‑act exports via codice redazionale
We eventually found a combination that reliably returns historical data:
- Experiments V-codice-1942 / O-codice-1942:
  - richiestaExport="V" or "O".
  - tipoRicerca="A".
  - Filter payload: restricted to a single codiceRedazionale.
  - This corresponds to the “codice della crisi d’impresa” base act: regio decreto 16 marzo 1942, n. 267 (Disciplina del fallimento…).
- Results:
  - V-codice-1942: 1126 files in the archive.
  - O-codice-1942: 1126 files in the archive.
This confirms:
- The async export API does have access to:
  - Originario content for very old acts.
  - Vigente content for the same acts.
- When we narrow the search to a specific codiceRedazionale, the service returns a large, non‑empty archive for both V and O.
In other words: the data exists, but bulk historical queries (based solely on publication/vigenza/emanazione date ranges) are not producing any results via the async export API.
Interpretation vs documentation
- Normattiva’s documentation says:
  - You can retrieve:
    - All originario acts.
    - All vigente acts.
    - Last 5 years in multivigente form.
- What we actually observe via the async export API:
  - Multivigente (richiestaExport="M"):
    - Works as advertised for recent/restricted windows.
  - Originario/Vigente ("O", "V"):
    - Broad historical windows (even per year) return empty ZIPs.
    - Highly targeted searches by codiceRedazionale for individual acts do return the expected files.
- Hypotheses:
  - The async bulk export might have hidden limits or require additional filters for historical corpora (e.g. per‑class, per‑codice redazionale buckets).
  - Documentation may be describing the conceptual capabilities of the system, while the public async export implementation is optimized for:
    - “Last N years” in multivigente, and
    - Per‑act exports in originario/vigente.
Practical conclusions for law_graph
- Multivigente is reliable for automation.
  - ItalyDownloader can safely rely on richiestaExport="M" for regular synchronization, and the manifest logic for M behaves correctly.
- Bulk V/O history via async export appears infeasible (for now).
  - The downloader’s history windows for V/O over 1861–2020 do not return any files, even when narrowed to a single year or quarter.
  - We treat these runs as “best effort” but do not rely on them to populate the historical corpus.
- Per‑act backfill is possible if we know the codes.
  - Given a list of codiceRedazionale identifiers, we can:
    - Issue async V/O exports per code.
    - Store them using the same datastore/manifest structure.
  - There is, however, no obvious API call to enumerate all codiceRedazionale values; we would need an external index/dump or a separate crawling strategy.
- Recommended next steps (future work).
  - Document clearly that:
    - Multivigente sync is supported and tested.
    - Full originario/vigente history sync is not currently guaranteed due to Normattiva API behavior.
  - Optionally:
    - Add a separate, opt‑in “backfill by codice redazionale” tool that consumes an external list of codes and uses the working V/O per‑act flow.
    - Contact Normattiva to clarify:
      - Whether there is a supported way to perform bulk historical originario/vigente exports via async APIs.
      - Whether any additional filters are required to avoid empty archives for historical windows.
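As a starting point for the opt‑in backfill tool, the per‑act loop could look like the sketch below. Everything here is hypothetical: export_one stands in for whatever function performs a single async V/O export for one codiceRedazionale and returns the ZIP bytes, and no such tool exists yet in the code base described by these notes.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical callable: export_one(codice_redazionale, richiesta_export) -> ZIP bytes.
ExportOne = Callable[[str, str], bytes]


def backfill_by_codice(
    codes: Iterable[str],
    export_one: ExportOne,
    modes: Tuple[str, ...] = ("V", "O"),
) -> Tuple[Dict[Tuple[str, str], bytes], List[Tuple[str, str, str]]]:
    """Export every (code, mode) pair, collecting archives and failures."""
    archives: Dict[Tuple[str, str], bytes] = {}
    failures: List[Tuple[str, str, str]] = []
    for code in codes:
        for mode in modes:
            try:
                archives[(code, mode)] = export_one(code, mode)
            except Exception as exc:
                # Record and continue: one bad act should not stop the batch.
                failures.append((code, mode, str(exc)))
    return archives, failures
```

Returning failures separately makes it easy to retry only the pairs that did not complete, which matters when the input list comes from an external index of unknown quality.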