Documentation

Pipeline Architecture

Smoothcomp Stats is built as an end-to-end analytics pipeline. The project starts with saved web pages, turns them into structured event and match records, models those records with SQL and dbt, and serves frontend-ready JSON files through an Astro site.

End-to-End Flow

From Smoothcomp pages to explorer dashboards.

Collect

Python scraping jobs collect event and match pages from Smoothcomp and save raw HTML for later reprocessing.

Parse

BeautifulSoup-based parsers extract event and match records from saved HTML files.

Store

Raw HTML and curated Parquet datasets are stored in S3 so they can be queried and rebuilt when needed.

Model

Athena, Glue, DuckDB, and dbt transform raw records into analytics-ready tables.

Serve

Python export jobs write summary JSON files to S3, and Astro pages fetch those files to power the explorer.

Architecture Diagram

High-level system architecture

The main pipeline separates collection, parsing, modeling, and frontend delivery. Raw HTML is preserved so parsing logic can be improved without needing to re-scrape every page.


flowchart TB
  Smoothcomp["Smoothcomp.com"]
  Scraper["Python / Playwright Scraper"]
  RawHTML["Raw HTML in S3"]
  Parser["Python / BeautifulSoup Parser"]
  Parquet["Events + Matches Parquet in S3"]
  Athena["Athena + Glue Tables"]
  DBT["dbt Models"]
  JSON["Summary JSON in S3"]
  Astro["Astro Frontend"]

  Smoothcomp --> Scraper
  Scraper --> RawHTML
  RawHTML --> Parser
  Parser --> Parquet
  Parquet --> Athena
  Athena --> DBT
  DBT --> JSON
  JSON --> Astro

Data Collection

Scraping event and match result pages.

The data collection process focuses on past Smoothcomp events and match result pages. Historical data is collected in bulk, while newer events can be processed on a recurring cadence as the dataset grows.

The scraper is containerized for two reasons: a collection run can outlast the execution limit of a serverless function (a single AWS Lambda invocation is capped at 15 minutes), and browser automation needs more dependencies than a lightweight Python runtime provides. AWS Fargate fits this workload because it runs the container without requiring a server to be managed directly.

Raw HTML is saved before parsing. This makes the pipeline more resilient: if parsing logic changes, the raw page can be reprocessed without collecting the same page again.
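The save-before-parse idea can be sketched in a few lines. This is a minimal illustration rather than the project's code: a local directory stands in for the S3 bucket, and `raw_path`, `collect_page`, the `fetch` callable, and the `raw/` folder name are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def raw_path(root: Path, page_id: str) -> Path:
    """Location of the saved copy of one page (hypothetical layout)."""
    return root / "raw" / f"{page_id}.html"


def collect_page(root: Path, page_id: str, fetch: Callable[[str], str]) -> str:
    """Return raw HTML for a page, fetching it only if no saved copy exists."""
    path = raw_path(root, page_id)
    if path.exists():
        # Reprocessing path: reuse the preserved page, no new request.
        return path.read_text(encoding="utf-8")
    html = fetch(page_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Save the page exactly as collected, before any parsing happens.
    path.write_text(html, encoding="utf-8")
    return html
```

Because parsing reads from the saved copy, a parser fix only needs a second pass over stored files, not a second visit to the source site.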

Parsed Match Data

What gets extracted from each match.

Event name and Smoothcomp event ID
Event location and date
Match ID and event ID
Athlete names and athlete IDs
Club/team names
Winner and loser indicators
Submission vs decision outcomes
Style, age, gender, and skill labels where available
Raw bracket tags used for downstream corrections and bracket counts
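One way to picture the fields listed above is as a single record type. The sketch below is an assumption about shape only: the class name, the field names, and the choice to model winner and loser as separate columns are hypothetical and not taken from the project's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchRecord:
    # Identifiers
    match_id: int
    event_id: int
    # Event context
    event_name: str
    event_location: Optional[str]
    event_date: Optional[str]  # ISO date string
    # Participants, split by result
    winner_athlete_id: int
    winner_name: str
    winner_club: Optional[str]
    loser_athlete_id: int
    loser_name: str
    loser_club: Optional[str]
    # Outcome type, for example "submission" or "decision"
    outcome: str
    # Labels that some pages omit
    style: Optional[str] = None
    age: Optional[str] = None
    gender: Optional[str] = None
    skill: Optional[str] = None
    # Raw bracket tags kept verbatim for downstream corrections
    bracket_tags: list[str] = field(default_factory=list)

    def missing_labels(self) -> list[str]:
        """Names of optional labels the page did not provide."""
        names = ("style", "age", "gender", "skill")
        return [n for n in names if getattr(self, n) is None]
```

Keeping the optional labels nullable, and keeping the raw bracket tags alongside them, leaves room for later corrections without touching the parser.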

Storage Layer

S3 stores both raw and curated data.

The pipeline stores raw HTML separately from parsed Parquet outputs. Parsed event and match records are written to structured folders in S3, which makes them queryable through Athena and Glue. The same S3 bucket also stores generated JSON summaries used by the frontend explorer pages.

Raw HTML

Preserved source pages that can be reprocessed when parsing logic improves.

Parquet Tables

Curated event and match records optimized for analytical querying.

Summary JSON

Frontend-ready files for recent events, event summaries, club summaries, athlete pages, and homepage stats.
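The three kinds of stored data could be kept apart with key prefixes, as in the sketch below. The prefix names, file names, and partition column are assumptions made for illustration; the only external fact relied on is that Athena and Glue recognize Hive-style `key=value` folders as partitions.

```python
# Hypothetical prefixes for the three kinds of objects in the bucket.
RAW_PREFIX = "raw"
CURATED_PREFIX = "curated"
PUBLIC_PREFIX = "public"


def raw_match_key(event_id: int, match_id: int) -> str:
    """Key for one preserved match page."""
    return f"{RAW_PREFIX}/matches/event_id={event_id}/{match_id}.html"


def parquet_key(table: str, event_id: int) -> str:
    """Key for one Parquet file inside a Hive-style partition folder."""
    return f"{CURATED_PREFIX}/{table}/event_id={event_id}/part-0.parquet"


def summary_json_key(name: str) -> str:
    """Key for one frontend-ready summary file."""
    return f"{PUBLIC_PREFIX}/{name}.json"
```

Separate prefixes also make it straightforward to apply different access rules, for example public reads on the JSON prefix only.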

Transformation Layer

dbt turns raw records into analytics models.

Once raw event and match records are available in Athena tables, dbt is used to build structured models for analysis and frontend delivery. The dbt project organizes logic into reusable layers rather than putting every transformation into one large query.

DuckDB is used for local development where practical, while Athena and Glue support the cloud production workflow over S3-backed Parquet files.

Model Layers

Staging

Lightly cleaned source tables and raw dimensions that standardize incoming data.

Intermediate

Reusable transformations that join, enrich, deduplicate, and normalize the raw match/event records.

Marts

Final summary models used by the site, including event summaries, club summaries, athlete summaries, homepage stats, and future rankings.
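The layering can be illustrated with a small Python analogy. The real models are dbt SQL; the functions, column names, and sample rows below are hypothetical and only show how each layer builds on the previous one.

```python
from collections import Counter


def stage(rows):
    """Staging: cast types and standardize text; no business logic."""
    return [
        {
            "match_id": int(r["match_id"]),
            "winner_club": r["winner_club"].strip(),
            "outcome": r["outcome"].strip().lower(),
        }
        for r in rows
    ]


def deduplicate(rows):
    """Intermediate: keep the first record seen for each match_id."""
    seen = {}
    for r in rows:
        seen.setdefault(r["match_id"], r)
    return list(seen.values())


def club_wins(rows):
    """Mart: one summary row per club, ordered by win count."""
    counts = Counter(r["winner_club"] for r in rows)
    return [{"club": c, "wins": n} for c, n in counts.most_common()]


raw = [
    {"match_id": "1", "winner_club": " Alpha BJJ ", "outcome": "Submission"},
    {"match_id": "1", "winner_club": "Alpha BJJ", "outcome": "submission"},
    {"match_id": "2", "winner_club": "Beta Grappling", "outcome": "Decision "},
    {"match_id": "3", "winner_club": "Alpha BJJ", "outcome": "decision"},
]
summary = club_wins(deduplicate(stage(raw)))
```

Each function consumes only the output of the layer before it, which is the property that lets a mart be rebuilt or replaced without touching the cleaning logic.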

dbt Flow

From source records to explorer-ready marts.

flowchart TB
  Staging["Staging Models"] --> Intermediate["Intermediate Models"]
  Intermediate --> Marts["Analytics Marts"]
  Marts --> ExportJobs["Python JSON Export Jobs"]
  ExportJobs --> PublicJSON["Public S3 JSON Files"]
  PublicJSON --> ExplorerPages["Astro Explorer Pages"]

Serving Layer

Astro pages fetch JSON from S3.

Instead of querying Athena from the website, summary tables are exported into JSON files and hosted in S3. This keeps page loads simple and avoids running live database queries for every visitor.

Explorer pages fetch the JSON they need: recent events, event summaries, club summaries, athlete summaries, and homepage stats. This keeps the frontend static while still allowing the data to refresh when export jobs run.
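An export job of this kind might look like the sketch below. It is a simplified stand-in: rows are passed in directly instead of being queried from a mart, the file is written to a local directory where the real job would upload to S3, and `export_summary` and the payload shape are hypothetical.

```python
import json
from pathlib import Path


def export_summary(rows: list[dict], out_dir: Path, name: str) -> Path:
    """Write mart rows to <out_dir>/<name>.json for the frontend to fetch."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    # A small envelope lets the page show a count without scanning rows.
    payload = {"count": len(rows), "rows": rows}
    path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path
```

Since each file is regenerated in full on every run, a page always reads one complete snapshot and never needs to query the warehouse.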

Data Quality

Overrides handle inconsistent labels.

Competition data is not always labeled consistently. Some federations omit style, some events use different bracket naming conventions, and some pages include incomplete information.

The project uses override logic in the modeling layer so federation-specific and event-specific corrections can be applied without rewriting the parser every time a new labeling issue appears.
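As an illustration of the idea, the sketch below resolves a match's style from override tables. The table contents, the function name, and the precedence order (event override, then parsed label, then federation default) are assumptions; in the project this logic lives in the modeling layer, not in Python.

```python
from typing import Optional

# Hypothetical override tables; keys and values are illustrative only.
FEDERATION_STYLE = {"example-nogi-federation": "no-gi"}
EVENT_STYLE = {1234: "gi"}


def resolve_style(
    parsed_style: Optional[str], federation: str, event_id: int
) -> Optional[str]:
    """Pick a style: event override, then parsed label, then federation default."""
    if event_id in EVENT_STYLE:
        return EVENT_STYLE[event_id]
    if parsed_style:
        return parsed_style
    # Returns None when nothing is known, so the gap stays visible downstream.
    return FEDERATION_STYLE.get(federation)
```

Because corrections are data in a lookup table, a new labeling issue means adding a row, not changing and redeploying the parser.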

Known Limitations

Important caveats in the dataset.

Some events do not clearly label winners, so those matches are excluded.
Some match pages omit style, age, or skill labels.
Division and bracket naming can vary across federations.
Weight classes are not used yet because labels are inconsistent across events.
Some federation-specific or event-specific overrides are needed to correctly label matches.

Technology Stack

Main tools used in the project.

Python

Used for scraping, HTML parsing, data exports, S3 writes, and batch processing jobs.

Playwright

Used by the scraper for browser automation during long-running collection jobs.

BeautifulSoup

Used to parse saved HTML and extract structured event and match records.

Amazon S3

Stores raw HTML, curated Parquet files, and public JSON files consumed by the frontend.

AWS Fargate

Runs containerized scraping and processing jobs that are too long-running or dependency-heavy for Lambda.

Athena / Glue

Provides queryable external tables over Parquet data stored in S3.

dbt

Transforms raw and intermediate tables into analytics marts used by the explorer pages.

DuckDB

Supports local development and testing of dbt models without relying on Athena for every change.

Astro

Serves the frontend explorer pages and fetches summary JSON files from S3.

Next Steps

Where the architecture is heading.

The current architecture is built around batch processing and static JSON delivery. That works well for a portfolio-scale analytics site because the data can be refreshed on a schedule without needing a live application backend.

Future improvements could include orchestration with Airflow or Dagster, richer public ranking models, opponent-strength adjusted metrics, and more reusable visualization components.