Broken File in Veraset Visit Dataset

There seems to be a broken Parquet partition file in the Veraset Visits dataset: the download below fails with an HTTP 500 on one of the file URLs.


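import duckdb
# set_api_key and get_dataset_files are assumed to come from the Dewey Python client
# (their import was not included in the original post).

api_key = "akv1_xVXxxxxxxxxxxxxxL"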
data_id = "prj_6rg3whup__fldr_8zme9bwbekydvezq"   # Veraset Visits dataset
# data_id = "fldr_d7cqgtcj3nyi4usp"   # Veraset home visits dataset
# data_id = "fldr_gfv4qahxiwsd4dwy"   # Veraset work visits dataset
start_date = "2024-01-01"
end_date = "2025-04-30"
output_path = "/data_jbod/personal/albert/dewey/NYTX2025/visits.parquet"

set_api_key(api_key)

urls = get_dataset_files(
    data_id,
    partition_key_after=start_date,
    partition_key_before=end_date,
    to_list=True
)
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
    COPY (
        SELECT *
        FROM read_parquet($url)
        WHERE lower(state) IN ('new york', 'texas')
        AND lower(city) IN ('dallas', 'new york city')
    )
    TO $output_path
    (FORMAT PARQUET, COMPRESSION ZSTD);
    """, {"url": urls, "output_path": output_path})
print(f"✅ Downloaded filtered dataset to {output_path}")

Running this fails with the following error:

---------------------------------------------------------------------------
HTTPException                             Traceback (most recent call last)
Cell In[2], line 19
     17 con = duckdb.connect()
     18 con.execute("INSTALL httpfs; LOAD httpfs;")
---> 19 con.execute("""
     20     COPY (
     21         SELECT *
     22         FROM read_parquet($url)
     23         WHERE lower(state) IN ('new york', 'texas')
     24         AND lower(city) IN ('dallas', 'new york city')
     25     )
     26     TO $output_path
     27     (FORMAT PARQUET, COMPRESSION ZSTD);
     28     """, {"url": urls, "output_path": output_path})
     29 print(f"✅ Downloaded filtered dataset to {output_path}")

HTTPException: HTTP Error: HTTP GET error on 'https://api.deweydata.io/api/v2/downloads/019b5981-c2b5-7782-9a23-88b188b59122.snappy.parquet?secret=Cesl5k52zEgytTR' (HTTP 500)
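
To narrow down which partition file is affected, one option is to probe each URL individually before handing the list to DuckDB. The following is only a rough sketch, not part of the Dewey client: `http_status` and `find_failing_files` are hypothetical helper names, and it assumes the download endpoint answers a small ranged GET with the same status it would return for a full download.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status for a GET that asks only for the first byte of `url`.

    Assumption: the server either honors the Range header (206) or ignores it (200).
    Network-level failures (urllib.error.URLError) are not caught here.
    """
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their status code.
        return err.code


def find_failing_files(
    urls: Iterable[str],
    probe: Callable[[str], int] = http_status,
) -> List[Tuple[str, int]]:
    """Return (url, status) pairs for every URL whose status is 400 or above."""
    failures = []
    for url in urls:
        status = probe(url)
        if status >= 400:
            failures.append((url, status))
    return failures
```

Calling `find_failing_files(urls)` on the list returned by `get_dataset_files` should report the file that returns HTTP 500. Excluding the reported URLs from `urls` before calling `read_parquet` may let the rest of the download proceed while the broken file is investigated.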