Documentation
Smoothcomp Stats is built as an end-to-end analytics pipeline. The project starts with saved web pages, turns them into structured event and match records, models those records with SQL and dbt, and serves frontend-ready JSON files through an Astro site.
End-to-End Flow
Python scraping jobs collect event and match pages from Smoothcomp and save raw HTML for later reprocessing.
BeautifulSoup-based parsers extract event and match records from saved HTML files.
Raw HTML and curated Parquet datasets are stored in S3 so they can be queried and rebuilt when needed.
Athena, Glue, DuckDB, and dbt transform raw records into analytics-ready tables.
Python export jobs write summary JSON files to S3, and Astro pages fetch those files to power the explorer.
Architecture Diagram
The main pipeline separates collection, parsing, modeling, and frontend delivery. Raw HTML is preserved so parsing logic can be improved without needing to re-scrape every page.
flowchart TB
Smoothcomp["Smoothcomp.com"]
Scraper["Python / Playwright Scraper"]
RawHTML["Raw HTML in S3"]
Parser["Python / BeautifulSoup Parser"]
Parquet["Events + Matches Parquet in S3"]
Athena["Athena + Glue Tables"]
DBT["dbt Models"]
JSON["Summary JSON in S3"]
Astro["Astro Frontend"]
Smoothcomp --> Scraper
Scraper --> RawHTML
RawHTML --> Parser
Parser --> Parquet
Parquet --> Athena
Athena --> DBT
DBT --> JSON
JSON --> Astro
Data Collection
The data collection process focuses on past Smoothcomp events and match result pages. Historical data is collected in bulk, while newer events can be processed on a recurring cadence as the dataset grows.
The scraper is containerized for two reasons: a collection run can outlast the execution limit of a serverless function (AWS Lambda caps a single invocation at 15 minutes), and browser automation needs more dependencies than a lightweight Python runtime provides. AWS Fargate fits this workload because it runs the container on demand, with no server to provision or maintain.
Raw HTML is saved before parsing. This makes the pipeline more resilient: if parsing logic changes, the raw page can be reprocessed without collecting the same page again.
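The save-then-reprocess idea can be sketched as follows. This is a minimal illustration, not the project's code: the function names and the `raw/<page_type>/<date>/<id>.html` layout are assumptions, and a local directory stands in for the S3 bucket.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path
from typing import Callable, Iterator


def raw_html_path(root: Path, page_type: str, page_id: str, fetched: date) -> Path:
    """Build a deterministic location for one saved page (hypothetical layout)."""
    return root / "raw" / page_type / fetched.isoformat() / f"{page_id}.html"


def save_raw_html(root: Path, page_type: str, page_id: str, fetched: date, html: str) -> Path:
    """Persist the page exactly as fetched, before any parsing happens."""
    path = raw_html_path(root, page_type, page_id, fetched)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def reprocess(root: Path, page_type: str, parse: Callable[[str], dict]) -> Iterator[dict]:
    """Re-run a (possibly improved) parser over every saved page of one type."""
    for path in sorted((root / "raw" / page_type).rglob("*.html")):
        yield parse(path.read_text(encoding="utf-8"))
```

Because `reprocess` only reads saved files, swapping in a new `parse` function never triggers another request to the source site.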
Parsed Match Data
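A minimal sketch of how a match record might be pulled from a saved page. The markup, class names, and field names (`match_id`, `winner`, `loser`, `method`) are hypothetical and do not describe real Smoothcomp pages. The project parses with BeautifulSoup; the standard-library `html.parser` is swapped in here only so the example runs without third-party packages.

```python
from __future__ import annotations

from html.parser import HTMLParser


class MatchParser(HTMLParser):
    """Collect one dict per <div class="match"> block (hypothetical markup)."""

    FIELDS = {"winner", "loser", "method"}

    def __init__(self) -> None:
        super().__init__()
        self.matches: list[dict] = []
        self._field: str | None = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "div" and "match" in classes:
            # A new match block starts: open a fresh record.
            self.matches.append({"match_id": attr.get("data-match-id")})
        elif tag == "span" and self.matches:
            # Remember which known field this span carries, if any.
            self._field = next((c for c in classes if c in self.FIELDS), None)

    def handle_data(self, data):
        if self._field and data.strip():
            self.matches[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_matches(html: str) -> list[dict]:
    """Return every match record found in one saved HTML page."""
    parser = MatchParser()
    parser.feed(html)
    parser.close()
    return parser.matches
```

Each returned dict corresponds to one row that could later be written to the matches Parquet dataset.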
Storage Layer
The pipeline stores raw HTML separately from parsed Parquet outputs. Parsed event and match records are written to structured folders in S3, which makes them queryable through Athena and Glue. The same S3 bucket also stores generated JSON summaries used by the frontend explorer pages.
Raw HTML: Preserved source pages that can be reprocessed when parsing logic improves.
Parquet datasets: Curated event and match records optimized for analytical querying.
JSON summaries: Frontend-ready files for recent events, event summaries, club summaries, athlete pages, and homepage stats.
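One way to keep the three kinds of objects apart in a single bucket is a small set of key builders. The prefixes, file names, and partition columns below are assumptions for illustration, since the actual bucket layout is not specified here. The `year=/month=` folders follow the Hive-style partition naming that Athena and Glue recognize.

```python
from __future__ import annotations

from datetime import date

# Hypothetical prefixes; the real bucket layout may differ.
RAW_PREFIX = "raw"
CURATED_PREFIX = "curated"
PUBLIC_PREFIX = "public"


def raw_key(page_type: str, page_id: str, fetched: date) -> str:
    """Key for a preserved source page."""
    return f"{RAW_PREFIX}/{page_type}/{fetched.isoformat()}/{page_id}.html"


def parquet_key(table: str, event_date: date, part: int = 0) -> str:
    """Key for a curated Parquet file, partitioned by event year and month."""
    return (
        f"{CURATED_PREFIX}/{table}/"
        f"year={event_date.year}/month={event_date.month:02d}/"
        f"part-{part:05d}.parquet"
    )


def json_key(kind: str, slug: str | None = None) -> str:
    """Key for a frontend-ready summary; site-wide files have no slug."""
    if slug is None:
        return f"{PUBLIC_PREFIX}/{kind}.json"
    return f"{PUBLIC_PREFIX}/{kind}/{slug}.json"
```

Keeping key construction in one place means the scraper, parser, and export jobs cannot drift apart on where files live.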
Transformation Layer
Once raw event and match records are available in Athena tables, dbt is used to build structured models for analysis and frontend delivery. The dbt project organizes logic into reusable layers rather than putting every transformation into one large query.
DuckDB is used for local development where practical, while Athena and Glue support the cloud production workflow over S3-backed Parquet files.
Model Layers
Staging: Lightly cleaned source tables and raw dimensions that standardize incoming data.
Intermediate: Reusable transformations that join, enrich, deduplicate, and normalize the raw match/event records.
Marts: Final summary models used by the site, including event summaries, club summaries, athlete summaries, homepage stats, and future rankings.
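The layering can be illustrated with three stacked SQL views. This is a sketch under stated assumptions: the table, column, and view names are invented, and Python's built-in `sqlite3` stands in for DuckDB or Athena so the example runs anywhere. In dbt, each view would be its own model file selecting from the previous layer.

```python
import sqlite3

# Each view depends only on the layer directly beneath it.
MODELS = """
-- Staging: light cleanup only (trim whitespace, normalize case).
CREATE VIEW stg_matches AS
SELECT match_id, event_id, lower(trim(winner_club)) AS winner_club
FROM raw_matches;

-- Intermediate: collapse duplicate rows so each match counts once.
CREATE VIEW int_matches_deduped AS
SELECT DISTINCT match_id, event_id, winner_club
FROM stg_matches;

-- Mart: the summary shape a frontend page would consume.
CREATE VIEW club_summary AS
SELECT winner_club AS club, count(*) AS wins
FROM int_matches_deduped
GROUP BY winner_club;
"""


def club_summary(raw_rows):
    """Load raw rows, build the three layers, and return (club, wins) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE raw_matches (match_id TEXT, event_id TEXT, winner_club TEXT)"
    )
    con.executemany("INSERT INTO raw_matches VALUES (?, ?, ?)", raw_rows)
    con.executescript(MODELS)
    rows = con.execute(
        "SELECT club, wins FROM club_summary ORDER BY wins DESC, club"
    ).fetchall()
    con.close()
    return rows
```

Splitting cleanup, deduplication, and aggregation this way lets a fix in one layer flow into every summary built on top of it.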
dbt Flow
flowchart TB
Staging["Staging Models"] --> Intermediate["Intermediate Models"]
Intermediate --> Marts["Analytics Marts"]
Marts --> ExportJobs["Python JSON Export Jobs"]
ExportJobs --> PublicJSON["Public S3 JSON Files"]
PublicJSON --> ExplorerPages["Astro Explorer Pages"]
Serving Layer
Instead of querying Athena from the website, summary tables are exported into JSON files and hosted in S3. This keeps page loads simple and avoids running live database queries for every visitor.
Explorer pages fetch the JSON they need: recent events, event summaries, club summaries, athlete summaries, and homepage stats. This keeps the frontend static while still allowing the data to refresh when export jobs run.
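An export job of this kind could look roughly like the sketch below. The function name, the `event_id` field, and the one-file-per-event layout are assumptions, and a local directory stands in for the public S3 location.

```python
from __future__ import annotations

import json
from pathlib import Path


def export_event_summaries(rows: list[dict], out_dir: Path) -> list[Path]:
    """Write one JSON file per event plus an index file listing the event ids."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for row in rows:
        path = out_dir / f"{row['event_id']}.json"
        # sort_keys keeps output stable, so unchanged data produces identical files.
        path.write_text(json.dumps(row, sort_keys=True), encoding="utf-8")
        written.append(path)
    index = out_dir / "index.json"
    index.write_text(json.dumps(sorted(r["event_id"] for r in rows)), encoding="utf-8")
    written.append(index)
    return written
```

A page then needs only a plain HTTP fetch of `index.json` or a single event file, with no query engine involved at request time.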
Data Quality
Competition data is not always labeled consistently. Some federations omit style, some events use different bracket naming conventions, and some pages include incomplete information.
The project uses override logic in the modeling layer so federation-specific and event-specific corrections can be applied without rewriting the parser every time a new labeling issue appears.
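A minimal sketch of such override logic, written in Python for illustration even though the project applies it in the modeling layer. The lookup tables, identifiers, style values, and the precedence order (event override, then federation override, then the parsed value) are all assumptions.

```python
from __future__ import annotations

# Hypothetical override tables; keys and values are invented for illustration.
FEDERATION_STYLE = {"fed_nogi_only": "no-gi"}
EVENT_STYLE = {"event_42": "gi"}


def resolve_style(record: dict) -> str | None:
    """Pick a style: event override first, then federation override, then the parsed value."""
    if record.get("event_id") in EVENT_STYLE:
        return EVENT_STYLE[record["event_id"]]
    if record.get("federation") in FEDERATION_STYLE:
        return FEDERATION_STYLE[record["federation"]]
    # No correction applies: keep whatever the parser extracted (possibly None).
    return record.get("style")
```

With this shape, handling a newly discovered labeling problem means adding one entry to a lookup table rather than changing parser code.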
Known Limitations
Technology Stack
Python: Used for scraping, HTML parsing, data exports, S3 writes, and batch processing jobs.
Playwright: Used by the scraper when browser automation is needed for long-running collection jobs.
BeautifulSoup: Used to parse saved HTML and extract structured event and match records.
S3: Stores raw HTML, curated Parquet files, and public JSON files consumed by the frontend.
AWS Fargate: Runs containerized scraping and processing jobs that are too long-running or dependency-heavy for Lambda.
Athena and Glue: Provide queryable external tables over Parquet data stored in S3.
dbt: Transforms raw and intermediate tables into analytics marts used by the explorer pages.
DuckDB: Supports local development and testing of dbt models without relying on Athena for every change.
Astro: Serves the frontend explorer pages and fetches summary JSON files from S3.
Next Steps
The current architecture is built around batch processing and static JSON delivery. That works well for a portfolio-scale analytics site because the data can be refreshed on a schedule without needing a live application backend.
Future improvements could include orchestration with Airflow or Dagster, richer ranking models, Elo-style ratings, opponent-strength-adjusted metrics, and more reusable visualization components.