Using DuckDB with Dewey

One of the best ways to get just the data you need is with DuckDB. Its SQL interface lets you specify exactly the data you want before downloading anything. All you need to provide is the list of URLs from your file list request and a SELECT statement.

📘

To use DuckDB for this use case, the dataset must be hosted by Dewey in Parquet format. Most file-only datasets on Dewey are Parquet (Veraset, GWS, etc.); however, some are CSV.
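For CSV datasets, DuckDB can still query the remote files directly via read_csv_auto, though without Parquet's column-pruning benefits. A minimal sketch, assuming a list of signed links like the `urls` list built in the steps below:
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# urls is the list of signed links built in the Getting Started steps below;
# read_csv_auto infers the schema from the file itself
con.execute("SELECT * FROM read_csv_auto($url) LIMIT 5", {"url": urls[0]}).df()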

Getting Started

  1. Install and import necessary libraries
# pip install duckdb deweydatapy polars
import duckdb
import deweydatapy as ddp
import polars as pl
  2. Set your API key and dataset URL
api_key = "<KEY>"
data_url = "<URL>"
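Hard-coding the key is fine for a quick notebook, but for shared code you may prefer to read it from an environment variable. A small sketch; the DEWEY_API_KEY name is just an example:
import os

# Hypothetical variable name; set it however your environment manages secrets
api_key = os.environ["DEWEY_API_KEY"]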
  3. Get your file list, define the date partition range, and store the response in a dataframe. Many datasets on Dewey are partitioned by date, so this is your first opportunity to reduce the amount of data you access. Note that the date range filters the API response to matching files; it does not reduce the underlying data available at the source. You can inspect the dataframe if needed.
files_df = ddp.get_file_list(api_key, data_url,
                             start_date = '<YYYY-MM-DD>',
                             end_date = '<YYYY-MM-DD>',
                             print_info = True)

files_df.head()
  4. Store your signed URLs as a list. We will use this list as the access point for DuckDB.
# Each link is a signed URL pointing at one Parquet file
urls = files_df['link'].tolist()
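If your date range matches many files, it can be worth prototyping the query against a small subset of the list first. A sketch:
# Iterate on the query with a few files, then switch back to the full list
sample_urls = urls[:3]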

Accessing Data with DuckDB

  1. Set up your DuckDB connection. You will need the httpfs extension to perform the required HTTP requests.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
  2. With Parquet files, you can query only the data you need instead of downloading entire datasets. Because Parquet is columnar, DuckDB reads just the columns you select, and because httpfs uses HTTP range requests, standard SQL predicates can limit the rows fetched to the relevant subset.

    In the example below, we’ll demonstrate this approach using Veraset visits data. Make sure to set your output directory.

con.execute("""
  COPY (
    SELECT
      local_timestamp,
      location_name,
      state,
      city,
      naics_code
    FROM read_parquet($urls)
    WHERE state = 'tennessee' AND naics_code = 712190
  )
  TO '<YOUR_DIRECTORY>'
  (FORMAT PARQUET,
   PARTITION_BY (city),
   ROW_GROUP_SIZE 256000,
   COMPRESSION ZSTD);
""", {"urls": urls})
  3. Once the data is written out as Parquet, you can lazily load it into an analysis tool such as Polars (or any engine that supports Parquet). Lazy loading means queries are optimized and executed only when needed, making exploration efficient even on large datasets.
# Lazily scan the partitioned Parquet files with Hive-style directory layout
# (PARTITION_BY (city) writes one directory level, e.g. city=<value>/*.parquet)
lf = pl.scan_parquet('<YOUR_DIRECTORY>/**/*.parquet', hive_partitioning=True)

# Count records per city and return the top results
lf.group_by("city").len().sort("len", descending=True).collect()
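Because the output is Hive-partitioned by city, a lazy filter on that column only reads the matching directory. A sketch; 'nashville' is just an example value, substitute a city present in your extract:
# Filter on the partition column so only matching files are scanned
(
    lf.filter(pl.col("city") == "nashville")
      .select(["local_timestamp", "location_name", "naics_code"])
      .head(10)
      .collect()
)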