data-engineering-portfolio

Bita Ashoori

💼 Data Engineering Portfolio

Designing scalable, cloud-native data pipelines that power decision-making across healthcare, retail, and public services.

About Me

I’m a Data Engineer based in Vancouver with 5+ years of experience across data engineering, business intelligence, and analytics. I design cloud-native pipelines and automate workflows that turn raw data into actionable insights, with work spanning healthcare, retail, and public-sector projects. With 3+ years building cloud pipelines and 2+ years as a BI/ETL Developer, I’m skilled in Python, SQL, Apache Airflow, and AWS (S3, Lambda, Redshift). I’m currently expanding my expertise in Azure data services and Databricks, alongside modern data stack practices, to build data platforms that support real-time insights and scale.

Contact Me

GitHub   LinkedIn   Resume


Project Highlights

🚀 Real-Time Event Processing with AWS Kinesis, Glue & Athena

Scenario: Simulated a real-time clickstream pipeline where user interaction events (e.g., click, view, signup) are sent to AWS Kinesis, processed using AWS Glue, and queried using AWS Athena.

🧰 Stack: Python • AWS Kinesis • AWS Glue • AWS Athena • S3 • boto3 • .env • Shell

📊 Potential Impact: Scalable, real-time pipelines for product analytics, marketing, gaming, and clickstream use cases.

🧪 Tested On: GitHub Codespaces & AWS Console (Kinesis, Glue, S3, Athena)

Real-Time Pipeline Diagram

🔗 View GitHub Repo
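
The producer side of this scenario could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's code: the event fields, the `clickstream-demo` stream name, and the helper names `make_event`, `to_kinesis_record`, and `send_event` are all hypothetical. The first two helpers run without AWS; `send_event` needs boto3 and valid credentials.

```python
import json
import random
import uuid
from datetime import datetime, timezone

EVENT_TYPES = ("click", "view", "signup")  # event types named in the scenario


def make_event(user_id, event_type=None, rng=random):
    """Build one simulated clickstream event (field names are hypothetical)."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event_type": event_type or rng.choice(EVENT_TYPES),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def to_kinesis_record(event):
    """Shape an event as keyword arguments for the Kinesis put_record call.

    Using user_id as the partition key routes a user's events to one shard,
    which keeps them in order relative to each other.
    """
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["user_id"]),
    }


def send_event(event, stream_name="clickstream-demo"):
    """Send one event to Kinesis. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without AWS

    client = boto3.client("kinesis")
    return client.put_record(StreamName=stream_name, **to_kinesis_record(event))
```

Keeping the record-shaping step separate from the network call means the payload format can be checked locally before any stream exists.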


🎮 Real-Time Player Pipeline

Scenario: Gaming companies need real-time analytics on player activity to optimize engagement, matchmaking, and monetization.
📎 View GitHub Repo
Solution: Simulated streaming of player events into a data lake with transformations and aggregations to deliver analytics-ready datasets for dashboards and retention analysis.
✅ Impact: Reduced reporting lag from hours to seconds, enabling near real-time insights for live ops decisions.
🧰 Stack: Apache Kafka (or AWS Kinesis), AWS S3, DynamoDB, Apache Airflow, Spark
🧪 Tested On: Local Kafka, AWS Localstack, GitHub Codespaces

Real-Time Player Pipeline Diagram
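
The aggregation step can be illustrated with a tumbling-window count. The sketch below is plain Python standing in for what a Spark job would do in the actual stack; the event keys `player_id`, `event_type`, and `ts`, and the function names, are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def window_start(ts, window_seconds=60):
    """Floor an epoch timestamp (seconds) to the start of its tumbling window."""
    return int(ts) - int(ts) % window_seconds


def aggregate_events(events, window_seconds=60):
    """Count events and distinct players per (window, event_type).

    `events` is an iterable of dicts with the hypothetical keys
    `player_id`, `event_type`, and `ts` (epoch seconds).
    """
    counts = defaultdict(int)
    players = defaultdict(set)
    for event in events:
        key = (window_start(event["ts"], window_seconds), event["event_type"])
        counts[key] += 1
        players[key].add(event["player_id"])
    # Emit one analytics-ready row per window and event type, in time order.
    return [
        {
            "window_start": datetime.fromtimestamp(start, tz=timezone.utc).isoformat(),
            "event_type": event_type,
            "events": counts[(start, event_type)],
            "unique_players": len(players[(start, event_type)]),
        }
        for start, event_type in sorted(counts)
    ]
```

Rows of this shape are the kind of output a dashboard or retention query could read directly.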


🛠️ Airflow AWS Modernization

Scenario: Legacy Windows Task Scheduler jobs needed modernization for reliability and observability.
📎 View GitHub Repo
Solution: Migrated jobs into modular Airflow DAGs containerized with Docker, storing artifacts in S3 and standardizing logging/retries.
✅ Impact: Up to 50% reduction in manual errors and improved job monitoring/alerting.
🧰 Stack: Python, Apache Airflow, Docker, AWS S3
🧪 Tested On: Local Docker, GitHub Codespaces

Airflow AWS Diagram
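
In Airflow, retries are configured per task through arguments such as `retries` and `retry_delay`. The sketch below is not Airflow code. It is a plain-Python illustration of the behavior being standardized (every job retried and logged the same way), and the `with_retries` decorator and the `jobs` logger name are hypothetical.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("jobs")


def with_retries(retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Retry a job on any exception, logging every attempt the same way."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # One initial attempt plus up to `retries` further attempts.
            for attempt in range(1, retries + 2):
                try:
                    result = func(*args, **kwargs)
                    logger.info("%s succeeded on attempt %d", func.__name__, attempt)
                    return result
                except Exception:
                    logger.exception("%s failed on attempt %d", func.__name__, attempt)
                    if attempt > retries:
                        raise  # retries exhausted: surface the last error
                    sleep(delay_seconds)
        return wrapper
    return decorator
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without waiting.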


☁️ Cloud ETL Modernization

Scenario: Legacy workflows lacked observability, scalability, and centralized monitoring.
📎 View GitHub Repo
Solution: Built scalable ETL from APIs to Redshift with Airflow orchestration and CloudWatch alerting; standardized schemas and error handling.
✅ Impact: ~30% faster troubleshooting via unified logging/metrics; more consistent SLAs.
🧰 Stack: Apache Airflow, AWS Redshift, CloudWatch
🧪 Tested On: AWS Free Tier, Docker

Cloud ETL Diagram


⚡ Real-Time Marketing Pipeline

Scenario: Marketing teams need faster feedback loops from ad campaigns to optimize spend and performance.
📎 View GitHub Repo
Solution: Simulated real-time ingestion of campaign data with PySpark + Delta patterns for incremental insights.
✅ Impact: Reduced reporting lag from 24h → ~1h, enabling quicker optimization cycles.
🧰 Stack: PySpark, Databricks, GitHub Actions, AWS S3
🧪 Tested On: Databricks Community Edition, GitHub CI/CD

Real-Time Marketing Pipeline Diagram


🏥 FHIR Healthcare Pipeline

Scenario: Healthcare projects using FHIR require clean, analytics-ready datasets while preserving clinical context.
📎 View GitHub Repo
Solution: Processed synthetic Synthea FHIR JSON into relational models for downstream analytics and ML.
✅ Impact: Cut preprocessing time by ~60%; improved data quality and analysis readiness.
🧰 Stack: Python, Pandas, Synthea, SQLite, Streamlit
🧪 Tested On: Local + Streamlit + BigQuery-compatible

FHIR Pipeline Diagram
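
Flattening FHIR JSON into relational tables could look roughly like this. The sketch uses only the standard library (`json`, `sqlite3`) rather than Pandas, covers just the Patient and Observation resource types, and the table layout and function names are assumptions for illustration, not the project's actual schema.

```python
import json
import sqlite3


def flatten_patient(resource):
    """Map a FHIR Patient resource to one flat row."""
    name = (resource.get("name") or [{}])[0]
    return (
        resource["id"],
        name.get("family"),
        " ".join(name.get("given", [])),
        resource.get("gender"),
        resource.get("birthDate"),
    )


def flatten_observation(resource):
    """Map a FHIR Observation resource to one flat row."""
    coding = (resource.get("code", {}).get("coding") or [{}])[0]
    quantity = resource.get("valueQuantity", {})
    # References look like "urn:uuid:<id>" or "Patient/<id>"; keep only the id.
    patient_ref = resource.get("subject", {}).get("reference", "")
    return (
        resource["id"],
        patient_ref.split(":")[-1].split("/")[-1],
        coding.get("code"),
        coding.get("display"),
        quantity.get("value"),
        quantity.get("unit"),
        resource.get("effectiveDateTime"),
    )


def load_bundle(bundle, conn):
    """Insert the Patient and Observation entries of a FHIR Bundle into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patient "
        "(id TEXT PRIMARY KEY, family TEXT, given TEXT, gender TEXT, birth_date TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation "
        "(id TEXT PRIMARY KEY, patient_id TEXT, code TEXT, display TEXT, "
        "value REAL, unit TEXT, effective TEXT)"
    )
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        kind = resource.get("resourceType")
        if kind == "Patient":
            conn.execute(
                "INSERT OR REPLACE INTO patient VALUES (?, ?, ?, ?, ?)",
                flatten_patient(resource),
            )
        elif kind == "Observation":
            conn.execute(
                "INSERT OR REPLACE INTO observation VALUES (?, ?, ?, ?, ?, ?, ?)",
                flatten_observation(resource),
            )
    conn.commit()


def load_bundle_file(path, conn):
    """Read one FHIR Bundle JSON file from disk and load it."""
    with open(path, encoding="utf-8") as handle:
        load_bundle(json.load(handle), conn)
```

Keeping `patient_id` on each observation row preserves the link back to the patient, so clinical context survives the move to flat tables.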


📈 PySpark Sales Pipeline

Scenario: Enterprises need scalable ETL for large sales datasets to drive timely BI and planning.
📎 View GitHub Repo
Solution: Production-style PySpark ETL to ingest/transform into Delta Lake with partitioning and optimization.
✅ Impact: ~40% faster transformations and improved reporting accuracy with Delta optimizations.
🧰 Stack: PySpark, Delta Lake, AWS S3
🧪 Tested On: Local Databricks + S3

PySpark Sales Pipeline Diagram


🔍 LinkedIn Scraper (Lambda)

Scenario: Manual job tracking is slow and error-prone for candidates and recruiters.
📎 View GitHub Repo
Solution: Serverless scraping with scheduled invocations and structured S3 outputs for analysis.
✅ Impact: Automated lead sourcing and job search analytics with minimal maintenance overhead.
🧰 Stack: AWS Lambda, EventBridge, BeautifulSoup, S3, CloudWatch
🧪 Tested On: AWS Free Tier

LinkedIn Scraper Diagram