Working with R
This page provides a collection of practical, ready-to-use R examples to help you load, explore, filter, and analyze data on the Dewey platform. Whether you're new to R or looking for quick reference patterns, you’ll find clear snippets, common workflows, and best-practice tips to support fast, reproducible research.
Overview
Dewey makes it easy to work with large datasets in R using familiar, flexible tools like tidyverse, arrow, and DuckDB. These examples show how to load data efficiently, inspect schemas, filter rows before download, and quickly summarize large files without requiring heavy local setup.
With just a few lines of code, you can connect to your files, run performant queries, and start analyzing immediately — no complex configuration required.
Downloading Data
At this time, no native R package exists that can authenticate with the Dewey API or download datasets directly from within R. Unlike the deweypy client in Python, which provides built-in authentication, folder browsing, bulk filtering, and direct download workflows, the R ecosystem does not currently include an equivalent client library.
We recommend using the Quickstart: Dewey Client to download data to your local machine, and then loading the downloaded files into R.
Below, you’ll find example code snippets that demonstrate how to:
- Use the arrow package to load Parquet (.parquet) files into R and combine multiple files into a single dataset.
- Use the vroom package to load compressed CSV files (.csv.gz) into R.
If your data are in Parquet format, you can also use the duckdb package to query and filter the data directly on disk before materializing the results into R as a single dataset. For compressed CSV files (.csv.gz), duckdb can read the files, optionally convert them to .parquet, and then allow you to run efficient filtered queries against the converted data.
Loading Data into R(Studio)
Handling Different Data Types
Datasets downloaded from Dewey may come in different file formats depending on storage requirements or download preferences. This section provides guidance on how to load your data into R, whether the files are in .parquet or .csv.gz format.
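If you are not sure which format a given download contains, a quick base-R check can split the folder contents by extension before you pick a loader. The sketch below builds a throwaway folder with placeholder file names purely for illustration; in practice, point path at your Dewey download folder instead.

```r
# Illustrative only: create a temporary folder with placeholder file names.
# In practice, point `path` at your Dewey download folder instead.
path <- file.path(tempdir(), "dewey-demo")
dir.create(path, showWarnings = FALSE)
file.create(file.path(path, c("part-0.parquet", "part-1.parquet", "data.csv.gz")))

files <- list.files(path)

# Split the folder contents by extension to decide which loader to use
parquet_files <- grep("\\.parquet$", files, value = TRUE)
csv_gz_files  <- grep("\\.csv\\.gz$",  files, value = TRUE)

length(parquet_files)  # number of .parquet files found
length(csv_gz_files)   # number of .csv.gz files found
```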
Parquet files
Many of Dewey’s datasets are provided as .parquet files due to their efficient storage and query performance. To load .parquet files into R, you can use the arrow or duckdb packages, depending on whether you need to filter the data before bringing it into R.
Arrow
arrow provides a fast and memory-efficient way to read .parquet files into R. It supports loading single files or entire directories of .parquet files and returns the result as a tidy, in-memory dataset. arrow is ideal when you want to load the full dataset directly without applying filters first.
# ----------------------------------------------------------------------------------
# Optional: Install packages
# Remove the "#" on the line below to install the arrow package (only needed once).
# ----------------------------------------------------------------------------------
# install.packages("arrow")
# -------------------------
# Load required libraries
# -------------------------
library(arrow) # For working with Parquet datasets efficiently (no full in-memory load required)
# -------------------------------------------------------------
# Point to the local folder that contains Dewey Parquet files
# -------------------------------------------------------------
# This folder should contain one or more .parquet files downloaded from Dewey.
path <- "YOUR FILEPATH"
# Example:
# path <- "C:/Users/user1/Documents/dewey-downloads/mydata"
# ------------------------------------------------
# Create an Arrow Dataset from the Parquet files
# ------------------------------------------------
# open_dataset() creates a lazy Arrow Dataset that can be queried without immediately loading everything into memory.
lazy_data <- open_dataset(path, format = "parquet")
# --------------------------------------------------------------
# Materialize the full dataset into R as a data.frame / tibble
# --------------------------------------------------------------
# collect() pulls the data from disk (or remote storage) into R memory.
# For very large Dewey datasets, consider filtering or selecting columns before calling collect().
data <- collect(lazy_data)
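# ----------------------------------------------------------------------
# Optional sketch: filter and select BEFORE collect() so that only the
# rows and columns you need are ever read into R memory. The column
# names below are placeholders -- replace them with columns from your
# own dataset, then remove the "#" to run.
# ----------------------------------------------------------------------
# library(dplyr)
# filtered <- lazy_data %>%
#   filter(state == "WA") %>%
#   select(state, naics_code) %>%
#   collect()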
# View the first six rows of your dataset
head(data)
CSV Files
Some of Dewey’s datasets are delivered in CSV format and are provided as compressed CSV files (.csv.gz) to reduce file size and improve download performance. These files can be loaded directly into R using the vroom package, which efficiently reads and combines multiple compressed CSVs into a single dataset.
duckdb can also read .csv.gz files directly, as shown in the Filter Data section below. For repeated filtered queries, however, it is often faster to convert the files to .parquet first; this conversion is a simple workflow within R.
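The conversion can be sketched in a few lines with DuckDB's COPY statement. The example below builds a small throwaway .csv.gz file so it runs end to end; in practice, replace the temporary paths with your own download folder and output location, and the state/visits columns with your dataset's own.

```r
library(DBI)
library(duckdb)

# Build a small example .csv.gz file (stands in for a downloaded Dewey file)
csv_gz <- tempfile(fileext = ".csv.gz")
gz <- gzfile(csv_gz, "w")
write.csv(data.frame(state = c("WA", "OR"), visits = c(10, 20)),
          gz, row.names = FALSE)
close(gz)

con <- dbConnect(duckdb())

# Convert the compressed CSV to Parquet with a single COPY statement
parquet_out <- tempfile(fileext = ".parquet")
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv_auto('%s')) TO '%s' (FORMAT PARQUET)",
  csv_gz, parquet_out
))

# The converted file can now be queried efficiently
converted <- dbGetQuery(con, sprintf(
  "SELECT * FROM read_parquet('%s')", parquet_out
))
dbDisconnect(con, shutdown = TRUE)
converted
```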
# ----------------------------------------------------------------------------------
# Optional: Install packages
# Remove the "#" below to install vroom (only needed once)
# ----------------------------------------------------------------------------------
# install.packages("vroom")
# -------------------------
# Load required libraries
# -------------------------
library(vroom) # For fast, tidy reading of CSV and CSV.GZ files
# -----------------------------------------------------------------------
# Point to the local folder that contains Dewey compressed CSV (.csv.gz)
# -----------------------------------------------------------------------
path <- "YOUR FILEPATH"
# Example:
# path <- "C:/Users/user1/Documents/dewey-downloads/mydata"
# ----------------------------------------------------------
# Load all .csv.gz files in the folder into a single dataset
# ----------------------------------------------------------
# vroom() automatically decompresses .gz files and row-binds multiple files.
files <- list.files(path, pattern = "\\.csv\\.gz$", full.names = TRUE)
data <- vroom(files)
# --------------------------------------------------------------
# View the first rows of your dataset to inspect the content
# --------------------------------------------------------------
head(data)
Filter Data
DuckDB
If you want to filter Dewey datasets in R, you can use DuckDB, but only after the files have been downloaded to your local machine. At this time, there is no R package equivalent to deweypy that can fetch file URLs and enable DuckDB to pre-scan or pre-filter remote data prior to download.
The best workflow is:
- Download the dataset locally first.
- Then use DuckDB in R to filter, reshape, and query the data before loading it into memory.
This gives you the full power of DuckDB, just after download rather than before.
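As a side note on building queries: instead of pasting values into the SQL string with paste0(), DBI can bind filter values as parameters, which avoids quoting mistakes. A minimal self-contained sketch using an in-memory table (the table and column names are illustrative stand-ins for a Dewey dataset):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Illustrative in-memory table standing in for a Dewey dataset
dbWriteTable(con, "stores",
             data.frame(state = c("WA", "OR", "WA"),
                        visits = c(10, 20, 30)))

# Bind the filter value as a parameter instead of pasting it into the SQL
res <- dbGetQuery(con,
                  "SELECT state, SUM(visits) AS total
                   FROM stores WHERE state = ? GROUP BY state",
                  params = list("WA"))
dbDisconnect(con, shutdown = TRUE)
res
```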
# Workflow for .parquet files
# ----------------------------------------------------------------------------------
# Optional: Install packages
# Remove the "#" on the line below to install the duckdb package (only needed once).
# ----------------------------------------------------------------------------------
# install.packages("duckdb")
# install.packages("DBI")
# -------------------------
# Load required libraries
# -------------------------
library(DBI) # For database connections
library(duckdb) # For querying Parquet efficiently using SQL (filter before load)
# -------------------------------------------------------------
# Point to the local folder that contains Dewey Parquet files
# -------------------------------------------------------------
# This folder should contain one or more .parquet files downloaded from Dewey.
path <- "YOUR FILEPATH"
# Example:
# path <- "C:/Users/user1/Documents/dewey-downloads/mydata"
# -----------------------------------------------
# Create a DuckDB connection (in-memory database)
# -----------------------------------------------
con <- dbConnect(duckdb(), dbdir = ":memory:")
#---------------------------------------------------------------------------------
# Preview five rows from the Parquet files
# This helps view the data and see a sample of the column names and table values
#---------------------------------------------------------------------------------
sample_query <- paste0("
SELECT *
FROM read_parquet('", path, "/*.parquet')
LIMIT 5
")
sample_preview <- dbGetQuery(con, sample_query)
head(sample_preview)
# -------------------------------------------------------------------------
# Query and FILTER the Parquet files BEFORE loading them into R
# -------------------------------------------------------------------------
# Replace the WHERE clause with your desired filters.
# DuckDB reads only the necessary row groups and columns from disk.
query <- paste0("
SELECT *
FROM read_parquet('", path, "/*.parquet')
-- Example filters (remove the -- from the lines below to activate filters):
-- WHERE state = 'WA'
-- AND naics_code = '448120'
")
# --------------------------------------------------------------
# Materialize the filtered data into R as a data.frame / tibble
# --------------------------------------------------------------
# dbGetQuery() runs the SQL query and returns only the filtered rows.
data <- dbGetQuery(con, query)
# View the first six rows of your filtered dataset
head(data)
# -------------------------------
# Disconnect DuckDB when finished
# -------------------------------
dbDisconnect(con, shutdown = TRUE)
# Workflow for .csv.gz files
# ----------------------------------------------------------------------------------
# Optional: Install packages
# Remove the "#" on the line below to install the duckdb package (only needed once).
# ----------------------------------------------------------------------------------
# install.packages("duckdb")
# install.packages("DBI")
# -------------------------
# Load required libraries
# -------------------------
library(DBI) # For database connections
library(duckdb) # For querying data efficiently using SQL (filter before load)
# ----------------------------------------------------------------------
# Point to the local folder that contains Dewey compressed CSV (.csv.gz)
# ----------------------------------------------------------------------
# This folder should contain one or more .csv.gz files downloaded from Dewey.
path <- "YOUR FILEPATH"
# Example:
# path <- "C:/Users/user1/Documents/dewey-downloads/mydata-csv"
# -----------------------------------------------
# Create a DuckDB connection (in-memory database)
# -----------------------------------------------
con <- dbConnect(duckdb(), dbdir = ":memory:")
#---------------------------------------------------------------------------------
# Preview five rows from the CSV files
# This helps view the data and see a sample of the column names and table values
#---------------------------------------------------------------------------------
sample_query <- paste0("
SELECT *
FROM read_csv_auto('", path, "/*.csv.gz')
LIMIT 5
")
sample_preview <- dbGetQuery(con, sample_query)
head(sample_preview)
# -------------------------------------------------------------------------
# Query and FILTER the CSV files BEFORE loading them into R
# -------------------------------------------------------------------------
# Replace the WHERE clause with your desired filters.
# DuckDB reads only the necessary columns and rows from the CSV files.
query <- paste0("
SELECT *
FROM read_csv_auto('", path, "/*.csv.gz')
-- Example filters (remove the -- from the lines below to activate filters):
-- WHERE state = 'WA'
-- AND naics_code = '448120'
")
# --------------------------------------------------------------
# Materialize the filtered data into R as a data.frame / tibble
# --------------------------------------------------------------
# dbGetQuery() runs the SQL query and returns only the filtered rows.
data <- dbGetQuery(con, query)
# View the first six rows of your filtered dataset
head(data)
# -------------------------------
# Disconnect DuckDB when finished
# -------------------------------
dbDisconnect(con, shutdown = TRUE)
Data Exploration & Visualization
The following section provides a set of quick, practical Exploratory Data Analysis (EDA) tools you can run immediately after loading a Dewey dataset into R. These commands help you validate the structure of the dataset, check for missing values, understand column types, and identify potential issues before running deeper analysis. You’ll generate summary statistics, inspect unique values, measure correlations between numeric fields, and visualize distributions across variables. This workflow is designed to give you a fast, high-level understanding of your dataset's shape, quality, and behavior so you can confidently move into more advanced filtering, modeling, or visualization steps.
# -------------------------
# Load required libraries
# -------------------------
library(dplyr)
library(ggplot2)
library(reshape2)
library(tidyr)
# ------------------------------------------------
# Check dataset dimensions (rows x columns)
# ------------------------------------------------
dim(data)
# ------------------------------------------------
# View structure, column types, and sample values
# ------------------------------------------------
str(data)
# ------------------------------------------------
# Get summary statistics for each column
# ------------------------------------------------
summary(data)
# ------------------------------------------------
# Count missing values in each column
# ------------------------------------------------
colSums(is.na(data))
# ------------------------------------------------
# Count unique values per column
# ------------------------------------------------
sapply(data, function(x) length(unique(x)))
# ------------------------------------------------
# Select numeric columns for correlation analysis
# ------------------------------------------------
num <- data %>% select(where(is.numeric))
# ------------------------------------------------
# Compute correlation matrix
# ------------------------------------------------
corr <- cor(num, use = "pairwise.complete.obs")
# ------------------------------------------------
# Visualize correlation matrix as a heatmap
# ------------------------------------------------
ggplot(melt(corr), aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "blue", mid = "white") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# ------------------------------------------------
# Plot histograms for all numeric variables
# ------------------------------------------------
data %>%
select(where(is.numeric)) %>%
pivot_longer(everything()) %>%
ggplot(aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~ name, scales = "free") +
theme_minimal()