Documentation
This documentation explains the data pipeline, modeling decisions, metrics, and frontend delivery system behind Smoothcomp Stats. The goal is to make the project understandable both as a jiu-jitsu analytics site and as a portfolio-quality data engineering project.
Pipeline Summary
Python scraping jobs collect event and match pages and preserve raw HTML for reprocessing.
HTML is parsed into structured event and match records, then stored as Parquet files.
Athena, Glue, DuckDB, and dbt transform raw records into analytics-ready tables.
Python jobs export summarized data into JSON files stored in S3.
Astro pages fetch JSON data and render event, club, and athlete explorer pages.
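The parse, summarize, and export steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the project's actual code: the `<tr class="match">` markup with `data-winner`, `data-loser`, and `data-method` attributes is hypothetical (real Smoothcomp pages will differ), the names `MatchRowParser`, `parse_matches`, and `summarize_wins` are invented for the example, and plain Python lists stand in for the Parquet, Athena, and dbt layers.

```python
import json
from html.parser import HTMLParser


class MatchRowParser(HTMLParser):
    """Collects one record per <tr class="match"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "tr" and attr.get("class") == "match":
            self.records.append({
                "winner": attr.get("data-winner"),
                "loser": attr.get("data-loser"),
                "method": attr.get("data-method"),
            })


def parse_matches(raw_html):
    """Parse preserved raw HTML into structured match records."""
    parser = MatchRowParser()
    parser.feed(raw_html)
    parser.close()
    return parser.records


def summarize_wins(records):
    """Count wins per athlete; stands in for a summary table."""
    wins = {}
    for record in records:
        wins[record["winner"]] = wins.get(record["winner"], 0) + 1
    return wins


# Raw HTML is kept as-is so it can be re-parsed later if the parser changes.
RAW_HTML = """
<table>
  <tr class="match" data-winner="Athlete A" data-loser="Athlete B" data-method="submission"></tr>
  <tr class="match" data-winner="Athlete A" data-loser="Athlete C" data-method="points"></tr>
  <tr class="match" data-winner="Athlete C" data-loser="Athlete B" data-method="submission"></tr>
</table>
"""

records = parse_matches(RAW_HTML)
summary_json = json.dumps(summarize_wins(records), sort_keys=True)
print(summary_json)  # {"Athlete A": 2, "Athlete C": 1}
```

Keeping the raw HTML separate from the parsing function is what makes reprocessing cheap: a parser fix only requires re-running `parse_matches` over stored pages, not scraping again.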
Documentation Sections
How data moves from Smoothcomp pages into raw HTML, parsed Parquet files, Athena tables, dbt models, S3 JSON, and the Astro frontend.
Definitions for the core tables used across the project, including events, matches, athletes, clubs, event summaries, club summaries, and athlete summaries.
Definitions for win rate, submission win rate, submission loss rate, brackets, event index, style mix, rivalries, and future Elo-based metrics.
How dbt summary tables are exported into S3 JSON files and consumed by Astro explorer pages.
Important caveats around scraped data quality, missing labels, inconsistent divisions, duplicate names, style overrides, and incomplete match attributes.
Potential improvements such as Elo ratings, opponent-strength adjustments, rankings, better visualizations, and additional competitor modeling.
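As a rough illustration of the rate metrics named in the sections above, the sketch below computes win rate, submission win rate, and submission loss rate for one athlete. The denominators are assumptions rather than the project's confirmed definitions: here win rate is wins over all matches, submission win rate is the share of wins that ended by submission, and submission loss rate is the share of losses that ended by submission. The record fields (`winner`, `loser`, `method`) and the function names are hypothetical.

```python
def rate(numerator, denominator):
    """Return a ratio, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def athlete_rates(matches, athlete):
    """Compute basic rate metrics for one athlete from match records."""
    wins = [m for m in matches if m["winner"] == athlete]
    losses = [m for m in matches if m["loser"] == athlete]
    sub_wins = sum(1 for m in wins if m["method"] == "submission")
    sub_losses = sum(1 for m in losses if m["method"] == "submission")
    total = len(wins) + len(losses)
    return {
        "matches": total,
        "win_rate": rate(len(wins), total),
        # Assumed definition: submissions as a share of wins, not of all matches.
        "submission_win_rate": rate(sub_wins, len(wins)),
        # Assumed definition: submissions as a share of losses.
        "submission_loss_rate": rate(sub_losses, len(losses)),
    }


# Illustrative records only; not real competition data.
matches = [
    {"winner": "Athlete A", "loser": "Athlete B", "method": "submission"},
    {"winner": "Athlete A", "loser": "Athlete C", "method": "points"},
    {"winner": "Athlete C", "loser": "Athlete A", "method": "submission"},
    {"winner": "Athlete A", "loser": "Athlete B", "method": "submission"},
]

stats = athlete_rates(matches, "Athlete A")
print(stats["win_rate"])
```

Returning `None` for an empty denominator keeps athletes with no wins or no losses from showing a misleading 0% rate. If the project instead divides submission wins by all matches, only the denominator in `submission_win_rate` would change.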