Veraset data download speeds
13 hours ago by Kit McLean
I'm downloading Veraset Home, Work, and Visits data on a university HPC cluster using DuckDB (httpfs) and am running into slow download speeds and occasional HTTP timeouts.
Current workflow:
- Query API for shard URLs by month
- Create a manifest of URLs
- Process the manifest using a SLURM array
- Each task reads ~250 parquet shards via DuckDB, filters to a Michigan/7-county study region, and writes filtered parquet outputs to local storage
- DuckDB settings: threads=4, http_timeout=600, http_retries=5
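A minimal sketch of the per-task step described above, for reference. The function names, the `where_clause` argument, and the output path are hypothetical placeholders, not the actual code from this workflow; the real filter depends on the Veraset column names, which are not shown here. It assumes the manifest has already been loaded into a Python list of URLs.

```python
def batch_for_task(urls, task_id, batch_size=250):
    """Return the slice of the manifest that one SLURM array task handles."""
    start = task_id * batch_size
    return urls[start:start + batch_size]


def build_copy_sql(urls, out_path, where_clause):
    """Build one COPY statement that reads a batch of shards and writes a filtered file.

    where_clause is a hypothetical placeholder for the study-region filter.
    """
    # Double any single quotes so URLs and paths are valid SQL string literals.
    quoted = ", ".join("'" + u.replace("'", "''") + "'" for u in urls)
    out = out_path.replace("'", "''")
    return (
        f"COPY (SELECT * FROM read_parquet([{quoted}]) WHERE {where_clause}) "
        f"TO '{out}' (FORMAT PARQUET)"
    )


def run_batch(urls, out_path, where_clause):
    """Apply the settings listed in the post, then run the filtered copy."""
    # Imported here so the helpers above can be used without DuckDB installed.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET threads = 4")
    # The unit of http_timeout may differ between DuckDB versions;
    # check the configuration docs for the installed version.
    con.execute("SET http_timeout = 600")
    con.execute("SET http_retries = 5")
    con.execute(build_copy_sql(urls, out_path, where_clause))
    con.close()
```

In a SLURM array job, `task_id` would typically come from the `SLURM_ARRAY_TASK_ID` environment variable. At 250 shards per task, a month with 8,192 shards needs 33 array tasks (indices 0 to 32), with the last task handling the remaining 192 shards.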
Each month contains thousands of shards (e.g., Work Visits 2023-12 has 8,192 shards), and data retrieval appears to be the primary bottleneck.
My questions:
- Is this manifest + SLURM array + DuckDB workflow the recommended approach for accessing Veraset data at this scale?
- Are there DuckDB settings, batching strategies, or concurrency levels you would recommend to improve throughput?
- Are there known rate limits or performance considerations when accessing large numbers of parquet shards?
- For a project requiring multiple years of Home, Work, and Visits data for a multi-county region, is there a more efficient workflow you would recommend (e.g., Spark, bulk export, different partitioning strategy, etc.)?
Any guidance would be greatly appreciated.