Veraset data download speeds

I'm downloading Veraset Home, Work, and Visits data on a university HPC cluster using DuckDB (httpfs) and am running into slow download speeds and occasional HTTP timeouts.

Current workflow:

  • Query API for shard URLs by month
  • Create a manifest of URLs
  • Process the manifest using a SLURM array
  • Each task reads ~250 parquet shards via DuckDB, filters to a Michigan/7-county study region, and writes filtered parquet outputs to local storage
  • DuckDB settings: threads=4, http_timeout=600, http_retries=5

Each month contains thousands of shards (e.g., Work Visits 2023-12 has 8,192 shards), and data retrieval appears to be the primary bottleneck.
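For concreteness, the per-task step described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual job script: the manifest is assumed to be a plain text file with one shard URL per line, array task IDs are assumed to start at 0, and the filter column `county_fips`, the region values, and the output file naming are hypothetical placeholders to be replaced with the real schema.

```python
import math
import os


def batch_for_task(urls, task_id, batch_size=250):
    """Return the slice of the manifest handled by one SLURM array task.

    Assumes 0-based task IDs (e.g. --array=0-32).
    """
    start = task_id * batch_size
    return urls[start:start + batch_size]


def n_tasks(n_urls, batch_size=250):
    """Number of array tasks needed to cover the whole manifest."""
    return math.ceil(n_urls / batch_size)


def build_copy_sql(urls, out_path, region_values, region_column="county_fips"):
    """Build one COPY statement: read the shards, filter, write parquet.

    region_column and region_values are hypothetical; use the real schema.
    """
    url_list = ", ".join("'" + u.replace("'", "''") + "'" for u in urls)
    value_list = ", ".join("'" + v.replace("'", "''") + "'" for v in region_values)
    return (
        f"COPY (SELECT * FROM read_parquet([{url_list}]) "
        f"WHERE {region_column} IN ({value_list})) "
        f"TO '{out_path}' (FORMAT PARQUET)"
    )


def run_task(manifest_path, out_dir, region_values):
    """Process this task's batch of shards with the settings listed above."""
    import duckdb  # imported here so the helpers above work without DuckDB

    with open(manifest_path) as f:
        urls = [line.strip() for line in f if line.strip()]

    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    batch = batch_for_task(urls, task_id)
    if not batch:
        return  # task ID is past the end of the manifest

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET threads=4")
    con.execute("SET http_timeout=600")
    con.execute("SET http_retries=5")

    out_path = os.path.join(out_dir, f"filtered_{task_id:05d}.parquet")
    con.execute(build_copy_sql(batch, out_path, region_values))
```

With 8,192 shards and 250 shards per task, this layout needs 33 array tasks, the last of which handles the remaining 192 shards.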

My questions:

  1. Is this manifest + SLURM array + DuckDB workflow the recommended approach for accessing Veraset data at this scale?
  2. Are there DuckDB settings, batching strategies, or concurrency levels you would recommend to improve throughput?
  3. Are there known rate limits or performance considerations when accessing large numbers of parquet shards?
  4. For a project requiring multiple years of Home, Work, and Visits data for a multi-county region, is there a more efficient workflow you would recommend (e.g., Spark, bulk export, different partitioning strategy, etc.)?

Any guidance would be greatly appreciated.