Using DuckDB with Dewey

One of the best ways to get just the data you need is with DuckDB. Its SQL interface lets you specify exactly the data you want before downloading anything. All you need to provide is the list of URLs from your file list request and a SELECT statement.

📘

To use DuckDB for this use case, the dataset must be hosted by Dewey in Parquet format. Most file-only datasets on Dewey are Parquet (Veraset, GWS, etc.); however, some are CSV.
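For CSV datasets, DuckDB can still query the remote files directly via read_csv_auto, though without Parquet's column-pruning benefits. A minimal sketch, assuming a list of signed links like the `urls` list built in the steps below:
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# urls is the list of signed links built in the Getting Started steps below;
# read_csv_auto infers the schema from the file itself
con.execute("SELECT * FROM read_csv_auto($url) LIMIT 5", {"url": urls[0]}).df()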

Getting Started

  1. Install and import necessary libraries
# pip install duckdb deweydatapy polars
import duckdb
import deweydatapy as ddp
import polars as pl
  2. Set your API key and dataset URL
api_key = "<KEY>"
data_url = "<URL>"
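Hard-coding the key is fine for a quick notebook, but for shared code you may prefer to read it from an environment variable. A small sketch; the DEWEY_API_KEY name is just an example:
import os

# Hypothetical variable name; set it however your environment manages secrets
api_key = os.environ["DEWEY_API_KEY"]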
  3. Get your file list, define the date partition range, and store the response in a dataframe. Many datasets on Dewey are partitioned by date, so this is your first opportunity to reduce the amount of data you access. Note that the date range filters the API response to matching files; it does not reduce the underlying data available at the source. You can inspect the dataframe if needed.
files_df = ddp.get_file_list(api_key, data_url,
                             start_date = '<YYYY-MM-DD>',
                             end_date = '<YYYY-MM-DD>',
                             print_info = True)

files_df.head()
  4. Store your signed URLs as a list. We will use this list as the access point for DuckDB.
# Each link is a signed URL pointing at one Parquet file
urls = files_df['link'].tolist()
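If your date range matches many files, it can be worth prototyping the query against a small subset of the list first. A sketch:
# Iterate on the query with a few files, then switch back to the full list
sample_urls = urls[:3]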

Accessing Data with DuckDB

  1. Set up your DuckDB connection. You will need the httpfs extension to perform the required HTTP requests.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
  2. With Parquet files, you can query only the data you need instead of downloading entire datasets. Because Parquet is columnar, DuckDB reads just the columns you select, and because httpfs uses HTTP range requests, standard SQL predicates can limit the rows fetched to the relevant subset.

    In the example below, we’ll demonstrate this approach using Veraset visits data. Make sure to set your output directory.

con.execute("""
  COPY (
    SELECT
      local_timestamp,
      location_name,
      state,
      city,
      naics_code
    FROM read_parquet($urls)
    WHERE state = 'tennessee' AND naics_code = 712190
  )
  TO '<YOUR_DIRECTORY>'
  (FORMAT PARQUET,
   PARTITION_BY (city),
   ROW_GROUP_SIZE 256000,
   COMPRESSION ZSTD);
""", {"urls": urls})
  3. Once the data is written out as Parquet, you can lazily load it into an analysis tool such as Polars (or any engine that supports Parquet). Lazy loading means queries are optimized and executed only when needed, making exploration efficient even on large datasets.
# Lazily scan the partitioned Parquet files with Hive-style directory layout
# (PARTITION_BY (city) writes one directory level, e.g. city=<value>/*.parquet)
lf = pl.scan_parquet('<YOUR_DIRECTORY>/**/*.parquet', hive_partitioning=True)

# Count records per city and return the top results
lf.group_by("city").len().sort("len", descending=True).collect()
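Because the output is Hive-partitioned by city, a lazy filter on that column only reads the matching directory. A sketch; 'nashville' is just an example value, substitute a city present in your extract:
# Filter on the partition column so only matching files are scanned
(
    lf.filter(pl.col("city") == "nashville")
      .select(["local_timestamp", "location_name", "naics_code"])
      .head(10)
      .collect()
)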