💼 Data Engineering Portfolio
Designing scalable, cloud-native data pipelines that power decision-making across healthcare, retail, and public services.
I'm a Data Engineer based in Vancouver with 5+ years of experience across data engineering, business intelligence, and analytics. I design cloud-native pipelines and automate workflows that turn raw data into actionable insights. My work spans healthcare, retail, and public-sector projects. With 3+ years building cloud pipelines and 2+ years as a BI/ETL Developer, I'm skilled in Python, SQL, Apache Airflow, and AWS (S3, Lambda, Redshift). Currently expanding my expertise in Azure data services and Databricks, alongside modern data stack practices, to build next-generation data platforms that support real-time insights and scalability.
Scenario: Simulated a real-time clickstream pipeline where user interaction events (e.g., click, view, signup) are sent to AWS Kinesis, processed using AWS Glue, and queried using AWS Athena.
🧰 Stack: Python • AWS Kinesis • AWS Glue • AWS Athena • S3 • boto3 • .env • Shell
🚀 Potential Impact: Scalable, real-time pipelines for product analytics, marketing, gaming, and clickstream use cases.
🧪 Tested On: GitHub Codespaces & AWS Console (Kinesis, Glue, S3, Athena)
🔗 View GitHub Repo
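A minimal sketch of the producer side, assuming a stream named clickstream-demo and a simple illustrative event schema (the repo's actual names and config may differ):

```python
import json
import random
import time
import uuid

import boto3

# Assumed region and stream name; not the repo's actual config.
kinesis = boto3.client("kinesis", region_name="us-east-1")

EVENT_TYPES = ["click", "view", "signup"]

def make_event() -> dict:
    """Build one synthetic user-interaction event."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": random.choice(EVENT_TYPES),
        "user_id": f"user-{random.randint(1, 500)}",
        "timestamp": time.time(),
    }

def stream_events(stream_name: str = "clickstream-demo", n: int = 100) -> None:
    for _ in range(n):
        event = make_event()
        # Partition by user_id so one user's events stay on the same shard.
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=event["user_id"],
        )
        time.sleep(0.1)  # throttle the simulation

if __name__ == "__main__":
    stream_events()
```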
Scenario: Gaming companies need real-time analytics on player activity to optimize engagement, matchmaking, and monetization.
🔗 View GitHub Repo
Solution: Simulated streaming of player events into a data lake with transformations and aggregations to deliver analytics-ready datasets for dashboards and retention analysis.
✅ Impact: Reduced reporting lag from hours to seconds, enabling near real-time insights for live-ops decisions.
🧰 Stack: Apache Kafka (or AWS Kinesis), AWS S3, DynamoDB, Apache Airflow, Spark
🧪 Tested On: Local Kafka, AWS LocalStack, GitHub Codespaces
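A minimal sketch of the event producer, assuming a local Kafka broker and a player-events topic; the topic name, broker address, and event fields are illustrative assumptions:

```python
import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

ACTIONS = ["match_start", "match_end", "purchase", "level_up"]

for _ in range(1000):
    event = {
        "player_id": f"p-{random.randint(1, 200)}",
        "action": random.choice(ACTIONS),
        "ts": int(time.time() * 1000),
    }
    # Key by player so per-player ordering is preserved within a partition.
    producer.send("player-events", key=event["player_id"].encode(), value=event)
    time.sleep(0.05)

producer.flush()
```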
Scenario: Legacy Windows Task Scheduler jobs needed modernization for reliability and observability.
🔗 View GitHub Repo
Solution: Migrated jobs into modular Airflow DAGs containerized with Docker, storing artifacts in S3 and standardizing logging/retries.
✅ Impact: Up to 50% reduction in manual errors and improved job monitoring/alerting.
🧰 Stack: Python, Apache Airflow, Docker, AWS S3
🧪 Tested On: Local Docker, GitHub Codespaces
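A minimal sketch of one migrated job as an Airflow 2.x DAG, assuming a nightly cron schedule and a placeholder task; the DAG and task names are illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_upload():
    """Placeholder for the legacy job's logic; artifacts would be written to S3 here."""
    ...

default_args = {
    "retries": 3,                       # standardized retries replace ad-hoc reruns
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="legacy_job_nightly",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",      # the old Task Scheduler trigger, as cron
    catchup=False,
    default_args=default_args,
):
    PythonOperator(task_id="extract_and_upload", python_callable=extract_and_upload)
```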
Scenario: Legacy workflows lacked observability, scalability, and centralized monitoring.
🔗 View GitHub Repo
Solution: Built scalable ETL from APIs to Redshift with Airflow orchestration and CloudWatch alerting; standardized schemas and error handling.
✅ Impact: ~30% faster troubleshooting via unified logging/metrics; more consistent SLAs.
🧰 Stack: Apache Airflow, AWS Redshift, CloudWatch
🧪 Tested On: AWS Free Tier, Docker
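A minimal sketch of the alerting hook, assuming a custom CloudWatch namespace and metric names (illustrative, not the repo's actual ones) that a CloudWatch alarm could then watch:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

def report_load_metrics(rows_loaded: int, failed: bool) -> None:
    """Publish per-run metrics; an alarm on LoadFailed can page on repeated failures."""
    cloudwatch.put_metric_data(
        Namespace="ETL/Pipelines",  # assumed custom namespace
        MetricData=[
            {"MetricName": "RowsLoaded", "Value": float(rows_loaded), "Unit": "Count"},
            {"MetricName": "LoadFailed", "Value": float(failed), "Unit": "Count"},
        ],
    )
```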
Scenario: Marketing teams need faster feedback loops from ad campaigns to optimize spend and performance.
🔗 View GitHub Repo
Solution: Simulated real-time ingestion of campaign data with PySpark + Delta patterns for incremental insights.
✅ Impact: Reduced reporting lag from 24h to ~1h, enabling quicker optimization cycles.
🧰 Stack: PySpark, Databricks, GitHub Actions, AWS S3
🧪 Tested On: Databricks Community Edition, GitHub CI/CD
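A minimal sketch of the incremental pattern, assuming a Delta MERGE keyed on campaign_id; the paths, column names, and session config are illustrative (on Databricks the session comes preconfigured):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# These configs matter only on a plain local Spark session;
# Databricks ships with Delta already enabled.
spark = (
    SparkSession.builder.appName("campaign-incremental")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Assumed input path: the latest micro-batch of campaign events.
updates = spark.read.json("s3://my-bucket/campaign-events/latest/")

# Upsert into the target Delta table so each batch lands incrementally.
target = DeltaTable.forPath(spark, "s3://my-bucket/delta/campaign_metrics")
(
    target.alias("t")
    .merge(updates.alias("u"), "t.campaign_id = u.campaign_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```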
Scenario: Healthcare projects using FHIR require clean, analytics-ready datasets while preserving clinical context.
🔗 View GitHub Repo
Solution: Processed synthetic Synthea FHIR JSON into relational models for downstream analytics and ML.
✅ Impact: Cut preprocessing time by ~60%; improved data quality and analysis readiness.
🧰 Stack: Python, Pandas, Synthea, SQLite, Streamlit
🧪 Tested On: Local + Streamlit (BigQuery-compatible outputs)
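A minimal sketch of the flattening step, assuming Synthea's per-patient FHIR bundles; the file path and chosen columns are illustrative:

```python
import json
from pathlib import Path

import pandas as pd

def extract_patients(bundle_path: str) -> pd.DataFrame:
    """Pull a few analytics-friendly columns out of Patient resources."""
    bundle = json.loads(Path(bundle_path).read_text())
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue
        name = (resource.get("name") or [{}])[0]
        rows.append(
            {
                "patient_id": resource.get("id"),
                "family_name": name.get("family"),
                "gender": resource.get("gender"),
                "birth_date": resource.get("birthDate"),
            }
        )
    return pd.DataFrame(rows)

# df = extract_patients("output/fhir/patient_bundle.json")  # assumed path
```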
Scenario: Enterprises need scalable ETL for large sales datasets to drive timely BI and planning.
🔗 View GitHub Repo
Solution: Production-style PySpark ETL to ingest/transform into Delta Lake with partitioning and optimization.
✅ Impact: ~40% faster transformations and improved reporting accuracy with Delta optimizations.
🧰 Stack: PySpark, Delta Lake, AWS S3
🧪 Tested On: Local Databricks + S3
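A minimal sketch of the batch transform, assuming CSV input and a monthly partition key; paths and column names are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-etl").getOrCreate()

# Assumed raw input location and schema.
raw = spark.read.option("header", True).csv("s3://my-bucket/raw/sales/")

clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
    .withColumn("sale_date", F.to_date("sale_date"))
    .withColumn("sale_month", F.date_format("sale_date", "yyyy-MM"))
    .dropDuplicates(["order_id"])
)

# Partition by month so downstream reads can prune most files.
(
    clean.write.format("delta")
    .mode("overwrite")
    .partitionBy("sale_month")
    .save("s3://my-bucket/delta/sales")
)
```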
Scenario: Manual job tracking is slow and error-prone for candidates and recruiters.
🔗 View GitHub Repo
Solution: Serverless scraping with scheduled invocations and structured S3 outputs for analysis.
✅ Impact: Automated lead sourcing and job-search analytics with minimal maintenance overhead.
🧰 Stack: AWS Lambda, EventBridge, BeautifulSoup, S3, CloudWatch
🧪 Tested On: AWS Free Tier
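A minimal sketch of the Lambda handler, assuming an example URL, CSS selector, and output bucket (all illustrative); requests and beautifulsoup4 would be packaged as a layer or bundled dependency since they are not in the default runtime:

```python
import json
from datetime import datetime, timezone

import boto3
import requests
from bs4 import BeautifulSoup

s3 = boto3.client("s3")

def handler(event, context):
    # Assumed target page; EventBridge invokes this on a schedule.
    resp = requests.get("https://example.com/jobs", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    jobs = [
        {"title": card.get_text(strip=True)}
        for card in soup.select(".job-title")  # assumed CSS selector
    ]
    key = f"jobs/{datetime.now(timezone.utc):%Y-%m-%d}.json"
    s3.put_object(
        Bucket="job-scraper-output",           # assumed bucket
        Key=key,
        Body=json.dumps(jobs).encode("utf-8"),
    )
    return {"count": len(jobs), "key": key}
```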