Normattiva async export – investigation notes

This document summarizes the experiments and findings around the Normattiva asynchronous export API, as used by ItalyDownloader and ad‑hoc scripts.

High‑level goals

  • Replicate the full Normattiva dataset for Italy:
    • Multivigente (richiestaExport="M") – last N years with full temporal history.
    • Vigente (richiestaExport="V") – acts in force at the query date.
    • Originario (richiestaExport="O") – original versions.
  • Start from the documentation claim (translated from the Italian):

At present, data can be queried and downloaded in the following ways:

  • All normative acts published on the Normattiva portal in their “original” version
  • All normative acts published on the Normattiva portal in the version in force at the query date
  • The normative acts published in the last 5 years in their “multivigenza” (multi-version) form

Downloader implementation (summary)

  • ItalyDownloader (backend/src/law_graph/pipelines/normattiva/download/italy.py):
    • Uses the async export endpoints:
      • POST /api/v1/ricerca-asincrona/nuova-ricerca
      • PUT /api/v1/ricerca-asincrona/conferma-ricerca
      • GET /api/v1/ricerca-asincrona/check-status/{token}
      • GET /api/v1/collections/download/collection-asincrona/{token}
    • Always sends tipoRicerca="A" (advanced search) with a RicercaAvanzataFilterDto payload (wrapped as AdvancedSearchFilter).
    • Modes:
      • History window: from history_start_year=1861 up to a configurable cutoff (around 2020), for V and O.
      • Recent / rolling window: from the cutoff to “now”, for M.
    • Writes files under:
      • exports/normattiva/ita/{law_id}/{format_key}/{modalita}/{filename}
    • Tracks runs in exports/normattiva/ita/download_manifest.db, recording:
      • format_key ("akn" / "json")
      • modalita (currently "R", responsive)
      • richiesta_export ("M", "V", "O")
      • files_written, bytes_written, and window timestamps.
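
The four endpoints above form a submit / confirm / poll / download sequence. The following is a minimal sketch of that sequence, not the ItalyDownloader code itself. The `send` transport, the `token` and `status` response fields, the PUT body, and the `"DONE"` status value are all assumptions made for illustration; only the endpoint paths come from these notes.

```python
import time
from typing import Any, Callable, Optional

# Endpoint paths as listed in these notes.
NEW_SEARCH = "/api/v1/ricerca-asincrona/nuova-ricerca"
CONFIRM = "/api/v1/ricerca-asincrona/conferma-ricerca"
STATUS = "/api/v1/ricerca-asincrona/check-status/{token}"
DOWNLOAD = "/api/v1/collections/download/collection-asincrona/{token}"

# Hypothetical transport: (method, path, json_body) -> decoded response.
Send = Callable[[str, str, Optional[dict]], Any]


def run_async_export(send: Send, payload: dict,
                     poll_seconds: float = 5.0, max_polls: int = 120) -> bytes:
    """Drive one export through the four endpoints and return the ZIP bytes.

    ASSUMPTIONS (not confirmed by the notes): the POST response carries a
    "token" field, the PUT body is {"token": ...}, and check-status returns
    a "status" field that becomes "DONE" once the archive is ready.
    """
    token = send("POST", NEW_SEARCH, payload)["token"]
    send("PUT", CONFIRM, {"token": token})
    for _ in range(max_polls):
        status = send("GET", STATUS.format(token=token), None)
        if status.get("status") == "DONE":
            return send("GET", DOWNLOAD.format(token=token), None)
        time.sleep(poll_seconds)
    raise TimeoutError(f"export {token} not ready after {max_polls} polls")
```

Passing the transport in as a callable keeps the sequencing logic testable without network access; a real caller would wrap an HTTP client behind `send`.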

Initial behavior and issues

  • First full run after implementing the downloader:
    • Multivigente (M) windows behaved as expected:
      • Non-zero files_written for "akn" and "json".
      • Earliest publication date on disk: 2020-11-16.
    • History windows for V and O:
      • Log messages showed the expected windows:
        • Window 1861-01-01 -> 2020-11-11 formato=akn richiesta_export=V modalita=R
        • Same for JSON and later for O.
      • However, _download_window_chunk logged:
        • No entries returned for formato=… richiesta_export=V/O …
      • Result: files_written=0 for all history runs in the manifest.
  • Subsequent runs:
    • Because the manifest recorded the history runs with files_written=0 and timestamps at the end_cap, _build_filter treated V/O as already “synced through” the cutoff and skipped them entirely.
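
One way to keep an empty export from marking a window as synced is to ignore zero-file runs when looking up the last synced timestamp. The sketch below illustrates the idea with sqlite3. The `runs` table, the `window_start` / `window_end` column names, and the `last_synced_end` helper are hypothetical; the real schema of download_manifest.db may differ.

```python
import sqlite3
from typing import Optional

# Hypothetical manifest schema; the real table and column names in
# download_manifest.db may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    format_key TEXT, modalita TEXT, richiesta_export TEXT,
    files_written INTEGER, bytes_written INTEGER,
    window_start TEXT, window_end TEXT
)
"""


def last_synced_end(conn: sqlite3.Connection, format_key: str,
                    richiesta_export: str) -> Optional[str]:
    """Return the latest window_end among runs that actually wrote files.

    Runs with files_written = 0 are ignored, so an empty export does not
    count as having synced its window. Returns None if no such run exists.
    """
    row = conn.execute(
        "SELECT MAX(window_end) FROM runs "
        "WHERE format_key = ? AND richiesta_export = ? AND files_written > 0",
        (format_key, richiesta_export),
    ).fetchone()
    return row[0]
```

The trade-off is that a window which is genuinely empty would be retried on every run, so a real implementation might also want a retry limit or an explicit "confirmed empty" flag.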

Overlap / date overflow bug

  • While experimenting with very large --overlap-days values, we hit:
    • OverflowError: date value out of range in _build_filter.
  • Root cause:
    • We computed start_date = (last_run - timedelta(days=overlap)).date() without clamping overlap, so a huge overlap could push the date before date.min.
  • Fix:
    • Clamp overlap against the distance from date.min:

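    from datetime import date, timedelta

    # Largest overlap (in days) that still lands on or after date.min.
    max_overlap = (last_run.date() - date.min).days
    safe_overlap = min(overlap, max_overlap)
    start_date = (last_run - timedelta(days=safe_overlap)).date()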
    
  • Added test_build_filter_does_not_overflow to assert we no longer underflow and that start_date is at least the mode’s start_floor.
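
The clamp and the floor check described above can be exercised in isolation. The following is a self-contained sketch; `clamped_start_date` is a hypothetical helper written for illustration, not the actual _build_filter, and the explicit `start_floor` clamp is an assumption about how the floor is enforced.

```python
from datetime import date, datetime, timedelta


def clamped_start_date(last_run: datetime, overlap: int, start_floor: date) -> date:
    """Step back `overlap` days from last_run without leaving the valid date range."""
    # Largest overlap (in days) that still lands on or after date.min.
    max_overlap = (last_run.date() - date.min).days
    safe_overlap = min(overlap, max_overlap)
    start = (last_run - timedelta(days=safe_overlap)).date()
    # Assumed behavior: never start before the mode's floor
    # (for example 1861-01-01 for the history windows).
    return max(start, start_floor)


if __name__ == "__main__":
    last_run = datetime(2020, 11, 11, 12, 0)
    floor = date(1861, 1, 1)
    print(clamped_start_date(last_run, 10**9, floor))  # huge overlap, no exception
    print(clamped_start_date(last_run, 7, floor))
```

Note that the timedelta is only ever built from the clamped value, so an overlap larger than timedelta itself can represent does not raise either.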

Experiments with history windows

To understand why V/O exports were empty, we added and used scripts/normattiva_experiment.py. This script:

  • Constructs AdvancedSearchFilter payloads for various combinations.
  • Calls the same async endpoints as the downloader.
  • Prints the number of entries in the resulting ZIP archive for each run.
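
Counting the entries in a downloaded archive needs only the standard library. This is a minimal sketch, assuming the download response is held in memory as bytes; `count_archive_entries` is a hypothetical name, and skipping directory entries is a choice made here rather than something the script is known to do.

```python
import io
import zipfile


def count_archive_entries(zip_bytes: bytes) -> int:
    """Count regular files in an in-memory ZIP archive (directories excluded)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())
```

A structurally valid but empty archive yields 0, which is the outcome reported for the V/O history windows below.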

Key experiments (all formato="AKN", modalita="R", tipoRicerca="A" unless noted):

Broad 1950 windows (publication date)

  • V-1950 / O-1950:
    • dataInizioPubblicazione="1950-01-01"
    • dataFinePubblicazione="1950-12-31"
    • Result: 0 files in the downloaded archive.
  • V-1950-lim / O-1950-lim:
    • Same as above, but with a narrower window:
      • dataFinePubblicazione="1950-03-31".
    • Result: 0 files.
  • V-1950-vig / O-1950-vig:
    • Publication + vigenza filters:
      • dataInizioPubblicazione="1950-01-01"
      • dataFinePubblicazione="1950-12-31"
      • limitaAnniVigenza=True
      • dataInizioVigenza="1950-01-01"
      • dataFineVigenza="1950-12-31"
    • Result: 0 files.

Conclusion: even reasonably small, well‑formed 1950 date windows for V or O yield empty ZIPs via the async export API.
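
For reference, the publication-window filters used in these experiments can be built as plain dictionaries. This is a sketch only: the field names are the ones listed above, while `publication_window_filter` is a hypothetical helper and the flat dictionary shape is an assumption about how AdvancedSearchFilter serializes.

```python
from datetime import date
from typing import Optional, Tuple


def publication_window_filter(
    start: date, end: date,
    vigenza: Optional[Tuple[date, date]] = None,
) -> dict:
    """Build a filter for a publication-date window, optionally with vigenza bounds."""
    payload = {
        "dataInizioPubblicazione": start.isoformat(),
        "dataFinePubblicazione": end.isoformat(),
    }
    if vigenza is not None:
        vig_start, vig_end = vigenza
        payload.update({
            "limitaAnniVigenza": True,
            "dataInizioVigenza": vig_start.isoformat(),
            "dataFineVigenza": vig_end.isoformat(),
        })
    return payload


if __name__ == "__main__":
    # Filter for the V-1950-lim / O-1950-lim experiments (first quarter of 1950).
    print(publication_window_filter(date(1950, 1, 1), date(1950, 3, 31)))
```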

Emanazione windows

  • V-1950-emanazione / O-1950-emanazione:
    • dataInizioEmanazione="1950-01-01"
    • dataFineEmanazione="1950-12-31"
    • Result: 0 files.

Provvedimento year only

  • V-anno-1950 / O-anno-1950:
    • annoProvvedimento=1950
    • orderType="DESC"
    • Result: 0 files.

Simple search (tipoRicerca="S")

  • V-simple / O-simple (not currently in the script, but tested separately):
    • tipoRicerca="S" with a simple payload:
      • denominazioneAtto="LEGGE"
      • testoRicerca="decreto"
    • Response:
      • HTTP 400 with code="1502":
        • "Numero di atti superiore al limite consentito di 7000, raffinare la ricerca" (in English: “Number of acts exceeds the allowed limit of 7000; refine the search”).

Conclusion: simple search is available but heavily constrained; broad text filters without strong narrowing are rejected at request time, and we did not find a combination that both passes the limit and returns historical originario or vigente acts in bulk.
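
Because the limit rejection arrives as an ordinary HTTP 400, it helps to distinguish it from other bad-request errors before deciding whether to narrow the query. The sketch below is illustrative: `is_result_limit_error` is a hypothetical helper, and it assumes the decoded JSON body exposes the code under a top-level `code` key, which these notes do not confirm.

```python
from typing import Any, Mapping

# Error code quoted in the 400 response observed during the experiments.
RESULT_LIMIT_CODE = "1502"


def is_result_limit_error(http_status: int, body: Mapping[str, Any]) -> bool:
    """True when a response looks like the "more than 7000 acts" rejection.

    ASSUMPTION: the decoded JSON body has a top-level "code" key; the notes
    record the value of the code, not the exact shape of the body.
    """
    # str() tolerates the code arriving as either a string or a number.
    return http_status == 400 and str(body.get("code")) == RESULT_LIMIT_CODE
```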

Positive result: per‑act exports via codice redazionale

We eventually found a combination that reliably returns historical data:

  • Experiments V-codice-1942 / O-codice-1942:
    • richiestaExport="V" or "O".
    • tipoRicerca="A".
    • Filter payload:

    {
      "codiceRedazionale": "1942-03-16-267",
      "orderType": "DESC"
    }
    
  • This corresponds to the legge fallimentare, the bankruptcy law that preceded the “codice della crisi d’impresa”:

    • regio decreto 16 marzo 1942, n. 267 (Disciplina del fallimento…).
  • Results:
    • V-codice-1942: 1126 files in the archive.
    • O-codice-1942: 1126 files in the archive.

This confirms:

  • The async export API does have access to:
    • Originario content for very old acts.
    • Vigente content for the same acts.
  • When we narrow the search to a specific codiceRedazionale, the service returns a large, non‑empty archive for both V and O.

In other words: the data exists, but bulk historical queries (based solely on publication/vigenza/emanazione date ranges) are not producing any results via the async export API.

Interpretation vs documentation

  • Normattiva’s documentation says you can retrieve:
    • All originario acts.
    • All vigente acts.
    • The last 5 years in multivigente form.
  • What we actually observe via the async export API:
    • Multivigente (richiestaExport="M"):
      • Works as advertised for recent/restricted windows.
    • Originario/Vigente ("O", "V"):
      • Broad historical windows (even a single year) return empty ZIPs.
      • Highly targeted searches by codiceRedazionale for individual acts do return the expected files.
  • Hypotheses:
    • The async bulk export might have hidden limits or require additional filters for historical corpora (e.g. per-class, per-codice redazionale buckets).
    • The documentation may be describing the conceptual capabilities of the system, while the public async export implementation is optimized for:
      • “Last N years” in multivigente, and
      • Per-act exports in originario/vigente.

Practical conclusions for law_graph

  1. Multivigente is reliable for automation.
    • ItalyDownloader can safely rely on richiestaExport="M" for regular synchronization, and the manifest logic for M behaves correctly.

  2. Bulk V/O history via async export appears infeasible (for now).
    • The downloader’s history windows for V/O over 1861–2020 do not return any files, even when narrowed to a single year or quarter.
    • We treat these runs as “best effort” but do not rely on them to populate the historical corpus.

  3. Per-act backfill is possible if we know the codes.
    • Given a list of codiceRedazionale identifiers, we can:
      • Issue async V/O exports per code.
      • Store them using the same datastore/manifest structure.
    • There is, however, no obvious API call to enumerate all codiceRedazionale values; we would need an external index/dump or a separate crawling strategy.

  4. Recommended next steps (future work).
    • Document clearly that:
      • Multivigente sync is supported and tested.
      • Full originario/vigente history sync is not currently guaranteed due to Normattiva API behavior.
    • Optionally:
      • Add a separate, opt-in “backfill by codice redazionale” tool that consumes an external list of codes and uses the working V/O per-act flow.
      • Contact Normattiva to clarify:
        • Whether there is a supported way to perform bulk historical originario/vigente exports via async APIs.
        • Whether any additional filters are required to avoid empty archives for historical windows.
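
As a starting point for the opt-in backfill tool, the per-act flow can be reduced to generating one job per code and export mode. This is a sketch under stated assumptions: `backfill_jobs` is a hypothetical helper, the input is assumed to be one code per line with `#` comment lines allowed, and only the filter payload shape (codiceRedazionale plus orderType) comes from the experiments above.

```python
from typing import Iterable, Iterator, Tuple


def backfill_jobs(
    codes: Iterable[str], modes: Tuple[str, ...] = ("V", "O"),
) -> Iterator[Tuple[str, str, dict]]:
    """Yield (code, richiesta_export, filter_payload) for each code and mode.

    Blank lines, lines starting with "#", and duplicate codes are skipped.
    """
    seen = set()
    for raw in codes:
        code = raw.strip()
        if not code or code.startswith("#") or code in seen:
            continue
        seen.add(code)
        for mode in modes:
            # Same filter shape as the V-codice-1942 / O-codice-1942 experiments.
            yield code, mode, {"codiceRedazionale": code, "orderType": "DESC"}


if __name__ == "__main__":
    for job in backfill_jobs(["1942-03-16-267", "", "1942-03-16-267"]):
        print(job)
```

Each yielded job would then go through the same async export sequence and manifest bookkeeping as the regular windows.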