💼 Data Engineering Portfolio
Designing scalable, cloud-native data pipelines that power decision-making across healthcare, retail, and public services.
I'm a Data Engineer based in Vancouver with over 5 years of experience spanning data engineering, business intelligence, and analytics. I specialize in designing cloud-native ETL/ELT pipelines and automating data workflows that transform raw data into actionable insights.
My background includes work across healthcare, retail, and public-sector environments, where I've delivered scalable and reliable data solutions. With 3+ years building cloud data pipelines and 2+ years as a BI/ETL Developer, I bring strong expertise in Python, SQL, Apache Airflow, and AWS (S3, Lambda, Redshift).
I'm currently expanding my skills in Azure and Databricks, focusing on modern data stack architectures, including Delta Lake, Medallion design, and real-time streaming, to build next-generation data platforms that drive performance, reliability, and business value.
Scenario: Retail organizations needed an automated cloud data pipeline to consolidate and analyze sales data from multiple regions.
🔗 View GitHub Repo
Solution: Developed a cloud-native ETL pipeline using Azure Data Factory that ingests, transforms, and loads retail sales data from on-prem SQL Server to Azure Data Lake and Azure SQL Database. Implemented parameterized pipelines, incremental data loads, and monitoring through ADF logs.
Impact: Improved reporting efficiency by 45%, automated data refresh cycles, and reduced manual dependencies.
🧰 Stack: Azure Data Factory · Azure SQL Database · Blob Storage · Power BI
🧪 Tested On: Azure Free Tier + GitHub Codespaces
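A minimal Python sketch of the high-watermark incremental-load logic this pipeline implements (in the project itself this is driven by ADF Lookup and Copy activities; the table, column, and connection names below are illustrative assumptions):

```python
# Hypothetical sketch of the watermark pattern; names are assumptions.
import pyodbc

SRC_CONN = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=onprem-sql;DATABASE=retail;Trusted_Connection=yes"
)

def load_new_rows(last_watermark: str):
    """Pull only rows modified since the previous pipeline run."""
    query = """
        SELECT SaleId, Region, Amount, ModifiedDate
        FROM dbo.Sales
        WHERE ModifiedDate > ?
        ORDER BY ModifiedDate
    """
    with pyodbc.connect(SRC_CONN) as conn:
        return conn.cursor().execute(query, last_watermark).fetchall()

def next_watermark(rows, previous: str) -> str:
    """Advance the watermark to the latest ModifiedDate seen; keep the old one if nothing arrived."""
    return max((str(r.ModifiedDate) for r in rows), default=previous)
```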
Scenario: Analytics teams needed a modern data lakehouse for analytics, built end to end in Azure Databricks on the Medallion Architecture (Bronze → Silver → Gold).
🔗 View GitHub Repo
Solution: Developed a multi-layer Delta Lake pipeline to ingest, cleanse, and aggregate retail data using PySpark and SQL within Databricks notebooks. Implemented data quality rules, incremental MERGE operations, and created analytical views for dashboards.
Impact: Improved data reliability and reduced transformation latency by enabling efficient, governed, and automated data processing in the Databricks ecosystem.
🧰 Stack: Azure Databricks, Delta Lake, PySpark, Spark SQL, Unity Catalog, Power BI Cloud
🧪 Tested On: Databricks Community Edition + GitHub Codespaces
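A minimal PySpark sketch of the Silver-layer incremental MERGE pattern used here (paths and column names are illustrative assumptions):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# New Bronze arrivals since the last run (illustrative filter).
bronze_updates = (
    spark.read.format("delta").load("/mnt/lake/bronze/retail_sales")
    .where(F.col("_ingest_date") == F.current_date())
)

silver = DeltaTable.forPath(spark, "/mnt/lake/silver/retail_sales")

(silver.alias("tgt")
 .merge(bronze_updates.alias("src"), "tgt.sale_id = src.sale_id")
 .whenMatchedUpdateAll()       # refresh changed records
 .whenNotMatchedInsertAll()    # insert brand-new records
 .execute())
```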
Scenario: Legacy workflows lacked observability, scalability, and centralized monitoring.
🔗 View GitHub Repo
Solution: Built scalable ETL from APIs to Redshift with Airflow orchestration and CloudWatch alerting; standardized schemas and error handling.
Impact: ~30% faster troubleshooting via unified logging/metrics; more consistent SLAs.
🧰 Stack: Apache Airflow, AWS Redshift, CloudWatch
🧪 Tested On: AWS Free Tier, Docker
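A minimal Airflow sketch of the pattern: an API extract task with retries plus an on-failure callback that pushes a CloudWatch metric for alerting (DAG id, bucket, endpoint, and namespace are illustrative assumptions; the Redshift COPY step is omitted for brevity):

```python
from datetime import datetime, timedelta

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_s3(**_):
    # Land the raw API payload in S3 before loading it into Redshift.
    payload = requests.get("https://api.example.com/orders", timeout=30).text
    boto3.client("s3").put_object(
        Bucket="etl-landing-zone", Key="orders/latest.json", Body=payload
    )

def emit_failure_metric(context):
    # Surface failed task runs as a CloudWatch metric that an alarm can watch.
    boto3.client("cloudwatch").put_metric_data(
        Namespace="ETL/Pipelines",
        MetricData=[{"MetricName": "TaskFailed", "Value": 1, "Unit": "Count"}],
    )

with DAG(
    dag_id="api_to_redshift",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": emit_failure_metric,
    },
) as dag:
    PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
```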
Scenario: Legacy Windows Task Scheduler jobs needed modernization for reliability and observability.
🔗 View GitHub Repo
Solution: Migrated jobs into modular Airflow DAGs containerized with Docker, storing artifacts in S3 and standardizing logging/retries.
Impact: Up to 50% reduction in manual errors and improved job monitoring/alerting.
🧰 Stack: Python, Apache Airflow, Docker, AWS S3
🧪 Tested On: Local Docker, GitHub Codespaces
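A minimal sketch of how one legacy scheduled job looks after migration: the script body becomes an Airflow task with standardized retries and logging, and its output artifact lands in S3 (DAG id, bucket, and paths are illustrative assumptions):

```python
import logging
from datetime import datetime, timedelta

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)

def run_legacy_report(**_):
    # Stand-in for the logic that previously ran under Windows Task Scheduler.
    report_path = "/tmp/daily_report.csv"
    with open(report_path, "w") as f:
        f.write("date,total\n")
    boto3.client("s3").upload_file(report_path, "etl-artifacts", "reports/daily_report.csv")
    log.info("Report uploaded to s3://etl-artifacts/reports/daily_report.csv")

with DAG(
    dag_id="legacy_report_migration",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",   # same cadence as the old scheduled task
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    PythonOperator(task_id="run_legacy_report", python_callable=run_legacy_report)
```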
Scenario: Marketing teams need faster feedback loops from ad campaigns to optimize spend and performance.
🔗 View GitHub Repo
Solution: Simulated real-time ingestion of campaign data with PySpark + Delta patterns for incremental insights.
Impact: Reduced reporting lag from 24h → ~1h, enabling quicker optimization cycles.
🧰 Stack: PySpark, Databricks, GitHub Actions, AWS S3
🧪 Tested On: Databricks Community Edition, GitHub CI/CD
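A minimal Structured Streaming sketch of the simulated ingestion: JSON files dropped into a landing folder stand in for the live stream and are appended incrementally to a Bronze Delta table (paths and schema are illustrative assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("campaign_id", StringType()),
    StructField("spend", DoubleType()),
    StructField("clicks", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Files arriving in this folder simulate the real-time campaign feed.
events = (spark.readStream
          .schema(schema)
          .json("/mnt/landing/campaign_events"))

(events.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/campaign_events")
 .outputMode("append")
 .trigger(availableNow=True)     # process whatever has arrived, then stop (Spark 3.3+)
 .start("/mnt/delta/bronze/campaign_events"))
```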
Scenario: Gaming companies need real-time analytics on player activity to optimize engagement and retention.
🔗 View GitHub Repo
Impact: Reduced reporting lag from hours → seconds for live ops insights.
🧰 Stack: Kafka / AWS Kinesis, Airflow, S3, Spark
Scenario: Enterprises need scalable ETL for large sales datasets to drive timely BI and planning.
🔗 View GitHub Repo
Solution: Built a production-style PySpark ETL pipeline that ingests and transforms large sales datasets into Delta Lake with partitioning and file optimization.
Impact: ~40% faster transformations and improved reporting accuracy with Delta optimizations.
🧰 Stack: PySpark, Delta Lake, AWS S3
🧪 Tested On: Local Databricks + S3
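A minimal sketch of the partitioned Delta write and compaction pattern (paths, the partition column, and the OPTIMIZE step are illustrative assumptions; OPTIMIZE requires Delta Lake 2.0+ or Databricks):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = (spark.read.parquet("s3://raw-sales/2024/")
         .withColumn("sale_date", F.to_date("sale_ts"))
         .dropDuplicates(["order_id"]))

(sales.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("sale_date")            # lets downstream queries prune by date
 .save("s3://lakehouse/silver/sales"))

# Compact small files produced by the write.
spark.sql("OPTIMIZE delta.`s3://lakehouse/silver/sales`")
```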
Scenario: Healthcare projects using FHIR require clean, analytics-ready datasets while preserving clinical context.
🔗 View GitHub Repo
Impact: Cut preprocessing time by ~60%; improved data quality.
🧰 Stack: Python, Pandas, SQLite, Streamlit
Scenario: Simulated a real-time clickstream pipeline where user interaction events are sent to AWS Kinesis, processed with Glue, and queried in Athena.
Impact: Built a reusable pattern for clickstream and analytics pipelines.
🧰 Stack: Python • AWS Kinesis • AWS Glue • AWS Athena • S3
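A minimal sketch of the producer side: simulated click events pushed to Kinesis with boto3 (the stream name and event shape are illustrative assumptions; Glue and Athena pick the records up downstream):

```python
import json
import time
import uuid

import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

def send_click(user_id: str, page: str) -> None:
    """Publish one simulated user interaction event to the stream."""
    event = {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "page": page,
        "ts": int(time.time() * 1000),
    }
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps(event),
        PartitionKey=user_id,    # keeps each user's events ordered within a shard
    )

send_click("user-123", "/checkout")
```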
Scenario: Manual job tracking is slow and error-prone for candidates and recruiters.
🔗 View GitHub Repo
Impact: Automated lead sourcing and job search analytics.
🧰 Stack: AWS Lambda, EventBridge, BeautifulSoup, S3