Working with Python

This page provides a collection of practical, ready-to-use Python examples to help you load, explore, filter, and analyze data on the Dewey platform. Whether you're new to Python or looking for quick reference patterns, you’ll find clear snippets, common workflows, and best-practice tips to accelerate your research.

Overview

Dewey makes it easy to work with large datasets in Python using DuckDB, Polars, and other lightweight tools. These examples demonstrate how to load data efficiently, query Parquet files at scale, filter before downloading, and run fast analyses with minimal setup.

You can get started in seconds with simple, copy-and-paste snippets designed to help you explore, aggregate, and visualize your data reliably.

Downloading Data

We recommend using the Quickstart: Dewey Client to download data to your local machine, and then loading the downloaded files into Python.

Below, you’ll find example code snippets that demonstrate how to:

  • Use Python to load Parquet (.parquet) or compressed CSV (.csv.gz) files into a DataFrame.
  • Combine multiple files into a single dataset and begin exploring the data.

With deweypy, you can use duckdb to query and filter Dewey datasets before downloading any files, as long as the data are in .parquet format. DuckDB can read the dataset metadata, apply your filters, and download only the required partitions or rows. For datasets provided in .csv or .csv.gz format, however, you will need to download the files locally first and then use duckdb to filter them when loading the data into Python.

View the DuckDB tutorial

Loading Data into Python

The code snippets below assume that the data have already been downloaded to your local machine, either via the Quickstart: Dewey Client or directly from the Dewey platform.

Datasets downloaded from Dewey may come in different file formats depending on storage requirements or download preferences. The subsections below provide guidance on how to load your data into Python, whether the files are in .parquet or .csv.gz format.

Parquet Files

duckdb is the recommended method for loading data into a Python environment once it has been downloaded locally. The example below demonstrates how DuckDB can read multiple .parquet files, combine them into a single dataset, and convert the result into a pandas DataFrame that is ready for analysis. View the "unfiltered" tab in the code box below for an example of how to load the entire dataset.

duckdb is a versatile query engine that allows you to load and filter your datasets efficiently. You can also use DuckDB to query and filter data before downloading it to your local machine by following the DuckDB tutorial. To use duckdb to filter your data when loading from a local copy, view the tab below titled "filtered."

import duckdb
import pandas as pd

folder_path = r"YOUR FILE PATH" 
# example r"C:/Users/Documents/Dewey/dewey-downloads/rental-data-united-states"

con = duckdb.connect()

# Run SQL → DuckDB Relation → Convert to pandas DataFrame
df = con.execute(
    f"SELECT * FROM read_parquet('{folder_path}/*.parquet')"
).df()

print(df.shape)
df.head()

import duckdb
import pandas as pd

folder_path = r"YOUR FILE PATH" # example r"C:/Users/Documents/Dewey/dewey-downloads/rental-data-united-states"

con = duckdb.connect()

# Run SQL → DuckDB Relation → Convert to pandas DataFrame
df = con.execute(
    f"""
    -- SELECT ONLY THE FIELDS YOU NEED
    SELECT 
        ADDRESS,
        AVAILABLE_AT,
        BEDS,
        CITY,
        RENT_PRICE,
        SQFT
    FROM read_parquet('{folder_path}/*.parquet')
    -- APPLY ANY FILTERS YOU NEED
    WHERE CITY = 'Encinitas'
    -- AND SQFT >= 900
    """
).df()


print(df.shape)
df.head()

CSV Files

If your dataset was delivered in .csv.gz format, DuckDB's read_csv_auto function can load and combine the files in the same way. As with Parquet, the "unfiltered" tab loads the entire dataset, while the "filtered" tab selects only the fields you need and applies filters while loading.

import duckdb
import pandas as pd

folder_path = r"YOUR FILE PATH"  
# example: r"C:/Users/Documents/Dewey/dewey-downloads/rental-data-united-states"

con = duckdb.connect()

df = con.execute(
    f"SELECT * FROM read_csv_auto('{folder_path}/*.csv.gz')"
).df()

print(df.shape)
df.head()

import duckdb
import pandas as pd

folder_path = r"YOUR FILE PATH"  
# example: r"C:/Users/Documents/Dewey/dewey-downloads/rental-data-united-states"

con = duckdb.connect()

df = con.execute(
    f"""
    -- SELECT ONLY THE FIELDS YOU NEED
    SELECT 
        ADDRESS,
        AVAILABLE_AT,
        BEDS,
        CITY,
        RENT_PRICE,
        SQFT
    FROM read_csv_auto('{folder_path}/*.csv.gz')

    -- APPLY ANY FILTERS YOU NEED
    WHERE CITY = 'Encinitas'
    -- AND SQFT >= 900
    """
).df()

print(df.shape)
df.head()

Data Exploration & Visualization

The following section provides a set of quick, practical Exploratory Data Analysis (EDA) tools you can run immediately after loading a Dewey dataset into Python. These commands help you validate the structure of the dataset, inspect data types, check for missing values, and identify potential data quality issues before performing deeper analysis. You’ll generate summary statistics, review unique values, measure correlations between numeric fields, and visualize distributions across variables. This workflow is designed to give you a fast, high-level understanding of your dataset’s shape, quality, and behavior so you can confidently move into more advanced filtering, modeling, or visualization steps.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# -------------------------
# DIMENSIONS
# -------------------------
print("Shape:", df.shape)

# -------------------------
# STRUCTURE & SAMPLE
# -------------------------
df.info()
print(df.head())

# -------------------------
# SUMMARY STATS
# -------------------------
print(df.describe(include="all"))

# -------------------------
# MISSING VALUES
# -------------------------
print(df.isna().sum())

# -------------------------
# UNIQUE VALUES
# -------------------------
print(df.nunique())

# -------------------------
# NUMERIC SUBSET
# -------------------------
num = df.select_dtypes(include=[np.number])

# -------------------------
# CORRELATION
# -------------------------
corr = num.corr()
print(corr)

# -------------------------
# CORRELATION HEATMAP
# -------------------------
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="RdBu_r", center=0, linewidths=.5)
plt.title("Correlation Heatmap")
plt.show()

# -------------------------
# NUMERIC HISTOGRAM FACETS
# -------------------------
num_melted = num.melt(var_name="variable", value_name="value")

g = sns.FacetGrid(num_melted, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(plt.hist, "value", bins=30, edgecolor="black")
plt.show()
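Beyond the column-level summaries above, a quick group-level aggregation is often a useful next step, for example the median of one numeric field within each level of a categorical field. A sketch using made-up rental columns (swap in your own dataset's field names; in practice you would group the df loaded earlier rather than this inline sample):

```python
import pandas as pd

# Small made-up sample; in practice, use the df loaded above
df = pd.DataFrame(
    {
        "CITY": ["Encinitas", "Encinitas", "Carlsbad", "Carlsbad"],
        "BEDS": [2, 3, 2, 3],
        "RENT_PRICE": [3200, 4100, 2900, 3800],
    }
)

# Median rent within each bedroom count
summary = df.groupby("BEDS")["RENT_PRICE"].median()
print(summary)
```

The same pattern extends to multiple statistics at once via .agg(), e.g. df.groupby("BEDS")["RENT_PRICE"].agg(["median", "mean", "count"]).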