Documentation

How Smoothcomp Stats is built.

This documentation explains the data pipeline, modeling decisions, metrics, and frontend delivery system behind Smoothcomp Stats. The goal is to make the project understandable both as a jiu-jitsu analytics site and as a portfolio-quality data engineering project.

Pipeline Summary

From scraped web pages to explorer dashboards.

Collect

Python scraping jobs collect event and match pages and preserve raw HTML for reprocessing.
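As an illustration of what preserving raw HTML for reprocessing can look like, here is a minimal sketch that stores each page body unchanged under a file name derived from its URL. The function names, the SHA-256 naming scheme, and the local directory layout are assumptions made for this example; they are not taken from the project's scraping jobs.

```python
import hashlib
from pathlib import Path


def raw_html_path(base_dir: Path, url: str) -> Path:
    """Return a stable file path for a URL, derived from its SHA-256 hash.

    Hypothetical naming scheme, chosen only so the same URL always maps
    to the same file.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return base_dir / f"{digest}.html"


def save_raw_html(base_dir: Path, url: str, html: str) -> Path:
    """Write the page body to disk unchanged so it can be re-parsed later."""
    path = raw_html_path(base_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

Keeping the untouched HTML means a parser bug can be fixed and the page re-parsed without fetching it again.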

Parse

HTML is parsed into structured event and match records, then stored as Parquet files.
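The parse step can be sketched with the standard library alone. The example below reads an event name from the page's `<title>` element and returns a typed record. The `EventRecord` fields and the idea that the event name lives in `<title>` are assumptions for illustration; the real page structure may differ. Writing the records to Parquet is left out because it needs a third-party library.

```python
from dataclasses import dataclass, asdict
from html.parser import HTMLParser


@dataclass
class EventRecord:
    # Hypothetical fields; the project's real schema may differ.
    event_id: str
    name: str


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self.parts.append(data)


def parse_event(event_id: str, html: str) -> EventRecord:
    """Turn one raw HTML page into a structured event record."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return EventRecord(event_id=event_id, name="".join(parser.parts).strip())
```

Calling `asdict(parse_event(...))` yields a plain dictionary, which is a convenient shape for a columnar writer.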

Model

Athena, Glue, DuckDB, and dbt transform raw records into analytics-ready tables.

Export

Python jobs export summarized data into JSON files stored in S3.
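A minimal sketch of the export step, assuming summary rows are already available as a list of dictionaries. The key layout (`<entity>/<id>.json`) and both function names are hypothetical. The upload to S3 is omitted so the example stays runnable without credentials or third-party packages.

```python
import json


def summary_key(entity: str, entity_id: str) -> str:
    """Build an object key such as 'clubs/42.json' (hypothetical layout)."""
    return f"{entity}/{entity_id}.json"


def to_json_payload(rows: list[dict]) -> str:
    """Serialize summary rows to compact JSON with a stable key order."""
    return json.dumps(
        rows,
        sort_keys=True,          # stable output makes diffs between exports readable
        separators=(",", ":"),   # drop whitespace to keep files small
        ensure_ascii=False,      # keep athlete and club names readable
    )
```

Sorting keys keeps the output deterministic, so an unchanged summary produces a byte-identical file.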

Explore

Astro pages fetch JSON data and render event, club, and athlete explorer pages.

Documentation Sections

Start with the architecture, then drill into models and metrics.

Pipeline Architecture

Data Model

Planned

Definitions for the core tables used across the project, including events, matches, athletes, clubs, event summaries, club summaries, and athlete summaries.

Open section →

Metrics Definitions

Planned

Definitions for win rate, submission win rate, submission loss rate, brackets, event index, style mix, rivalries, and future Elo-based metrics.

Open section →
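Until the metrics section is written, the sketch below shows one plausible reading of the three rate metrics. It assumes each rate is divided by an athlete's total matches, so submission win rate is the share of all matches won by submission. These formulas and the `MatchResult` shape are assumptions for illustration, not the project's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchResult:
    # Hypothetical per-athlete view of a single match.
    won: bool
    by_submission: bool


def athlete_rates(matches: list[MatchResult]) -> Optional[dict]:
    """Return rate metrics, or None when the athlete has no matches."""
    total = len(matches)
    if total == 0:
        return None  # avoid dividing by zero; callers decide how to display this
    wins = sum(m.won for m in matches)
    sub_wins = sum(m.won and m.by_submission for m in matches)
    sub_losses = sum((not m.won) and m.by_submission for m in matches)
    return {
        "win_rate": wins / total,
        "submission_win_rate": sub_wins / total,
        "submission_loss_rate": sub_losses / total,
    }
```

An alternative convention divides submission wins by total wins rather than total matches; whichever is chosen should be stated next to the metric.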

Frontend JSON Delivery

Planned

How dbt summary tables are exported into S3 JSON files and consumed by Astro explorer pages.

Open section →

Known Limitations

Planned

Important caveats around scraped data quality, missing labels, inconsistent divisions, duplicate names, style overrides, and incomplete match attributes.

Open section →

Future Work

Planned

Potential improvements such as Elo ratings, opponent-strength adjustments, rankings, better visualizations, and additional competitor modeling.

Open section →
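For context on the planned rating work, here is a sketch of the standard Elo update for a single match. The K-factor of 32 and the 400-point scale are conventional defaults used as assumptions here; the project has not committed to specific values.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score for A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (A, B) ratings.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

With equal ratings the expected score is 0.5, so a win moves each athlete by half the K-factor in opposite directions.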