data-engineering-portfolio

Bita Ashoori

💼 Data Engineering Portfolio

About Me

I'm a Data Engineer based in Vancouver with over 5 years of experience across data engineering, business intelligence, and analytics: 3+ years building and maintaining cloud-based pipelines and 2+ years as a BI/ETL Developer. I specialize in clean, cloud-native data pipelines and in automating workflows that help organizations turn raw data into smart decisions. I'm skilled in Python, SQL, Apache Airflow, AWS (S3, Lambda, Redshift), and modern orchestration techniques.

Contact Me

💻 Explore my work on GitHub
🔗 Connect with me on LinkedIn
📧 Contact me at bitaashoori20@gmail.com
📄 Download My Resume


Project Highlights

🛠️ Airflow AWS Modernization

Scenario: Legacy jobs ran on Windows Task Scheduler with no modularity, version control, or centralized monitoring. 📎 GitHub Repo
Solution: Migrated the legacy jobs into modular Airflow DAGs with Docker and AWS S3 (see the DAG sketch below).
✅ Potential Impact: Could reduce manual errors by up to 50% and improve job monitoring and reliability in real-world environments.

🧰 Stack: Python, Apache Airflow, Docker, AWS S3
🧪 Tested On: Local Docker, GitHub Codespaces

Airflow AWS Diagram
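
To make the migration pattern concrete, here is a minimal sketch of a Task Scheduler job rebuilt as an Airflow DAG. The DAG id, bucket, and file path are hypothetical (not the repo's actual code), and it assumes Airflow 2.4+ with boto3 available:

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_upload():
    """Stand-in for the legacy job: produce a report file and land it in S3."""
    s3 = boto3.client("s3")
    s3.upload_file("/tmp/report.csv", "example-bucket", "reports/report.csv")


with DAG(
    dag_id="legacy_job_migration",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # replaces the Task Scheduler trigger
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_upload",
        python_callable=extract_and_upload,
    )
```

Unlike a Task Scheduler entry, the DAG gets retries, logging, and run history from Airflow for free.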


⚡ Real-Time Marketing Pipeline

Scenario: Businesses needed faster feedback loops from ad campaigns to optimize performance and engagement. 📎 GitHub Repo
Solution: Simulates real-time ingestion of campaign data, transforming and storing insights with PySpark and Delta Lake (see the streaming sketch below).
✅ Potential Impact: May reduce reporting lag from 24 hours to 1 hour, enabling faster marketing insights and campaign optimization.

🧰 Stack: PySpark, Databricks, GitHub Actions, AWS S3
🧪 Tested On: Databricks Community Edition, GitHub CI/CD

Real-Time Pipeline Diagram
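
A minimal sketch of the streaming pattern, assuming a Spark session with the Delta Lake package available (as on Databricks); the paths, schema, and aggregation are illustrative rather than the repo's exact logic:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("campaign-stream").getOrCreate()

# Explicit schema: streaming JSON sources cannot infer one at runtime.
schema = StructType([
    StructField("campaign_id", StringType()),
    StructField("clicks", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Simulated real-time feed: files landing in S3 are picked up as a stream.
events = spark.readStream.schema(schema).json("s3a://example-bucket/campaign-events/")

# Hourly engagement rollup; the watermark bounds state for late events.
hourly = (
    events.withWatermark("event_time", "1 hour")
    .groupBy(F.window("event_time", "1 hour"), "campaign_id")
    .agg(F.sum("clicks").alias("clicks"))
)

(
    hourly.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/campaigns/")
    .start("s3a://example-bucket/delta/campaign_hourly/")
)
```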


☁️ Cloud ETL Modernization

Scenario: Legacy workflows lacked the observability, scalability, and centralized monitoring that modern data teams depend on. 📎 GitHub Repo
Solution: Built a scalable, maintainable ETL pipeline that moves structured data from APIs to Redshift, with alerting via CloudWatch (see the sketch below).
✅ Potential Impact: Should improve troubleshooting efficiency by ~30% through enhanced logging and monitoring practices.

🧰 Stack: Apache Airflow, AWS Redshift, CloudWatch
🧪 Tested On: AWS Free Tier, Docker

Cloud ETL Diagram
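
The core extract-load-alert step, sketched with hypothetical endpoints, names, and credentials; it assumes the API returns JSON rows and that a CloudWatch alarm watches the custom metric:

```python
import json

import boto3
import psycopg2
import requests

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

# 1. Extract from the source API and stage the raw payload in S3.
records = requests.get("https://api.example.com/orders", timeout=30).json()
s3.put_object(
    Bucket="example-etl-bucket",
    Key="staging/orders.json",
    Body=json.dumps(records),
)

# 2. Bulk-load into Redshift with COPY, the idiomatic path for S3 data.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    dbname="analytics", user="etl_user", password="...", port=5439,
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY staging.orders
        FROM 's3://example-etl-bucket/staging/orders.json'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto';
    """)

# 3. Publish a custom metric; an alarm on RowsLoaded flags failed
#    or suspiciously small loads.
cloudwatch.put_metric_data(
    Namespace="ETL/Orders",
    MetricData=[{"MetricName": "RowsLoaded",
                 "Value": float(len(records)),
                 "Unit": "Count"}],
)
```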


πŸ₯ FHIR Healthcare Pipeline

Scenario: Healthcare projects using FHIR data require a clean, structured pipeline to support downstream analytics and ML. 📎 GitHub Repo
Solution: Processes synthetic healthcare records in FHIR JSON format and converts them into clean, queryable relational tables (see the sketch below).
✅ Potential Impact: Designed to reduce preprocessing time by 60% and prepare healthcare data for analytics and ML workloads.

🧰 Stack: Python, Pandas, Synthea, SQLite, Streamlit
🧪 Tested On: Local + Streamlit + BigQuery-compatible

FHIR Pipeline Diagram
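
A minimal sketch of the flattening step for Patient resources in a Synthea-style FHIR bundle; the file name, columns, and table are illustrative:

```python
import json
import sqlite3

import pandas as pd

# A Synthea-style FHIR bundle: one JSON document holding many resources.
with open("fhir_bundle.json") as f:
    bundle = json.load(f)

# Keep only Patient resources and flatten the fields we care about.
patients = [
    {
        "patient_id": entry["resource"]["id"],
        "gender": entry["resource"].get("gender"),
        "birth_date": entry["resource"].get("birthDate"),
    }
    for entry in bundle.get("entry", [])
    if entry["resource"]["resourceType"] == "Patient"
]

df = pd.DataFrame(patients)

# Persist as a queryable relational table for downstream analytics/ML.
with sqlite3.connect("fhir.db") as conn:
    df.to_sql("patients", conn, if_exists="replace", index=False)
```

The same flattening applies per resource type (Encounter, Observation, ...), yielding one relational table each.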


πŸ” LinkedIn Scraper (Lambda)

Scenario: Manual job tracking and lead sourcing are time-consuming and hard to scale. 📎 GitHub Repo
Solution: Automates job scraping from LinkedIn with a serverless AWS Lambda function and stores structured output in S3 (see the handler sketch below).
✅ Potential Impact: Can automate the scraping workflow and enable structured job-search analysis without manual effort.

🧰 Stack: AWS Lambda, EventBridge, BeautifulSoup, S3, CloudWatch
🧪 Tested On: AWS Free Tier

LinkedIn Scraper Diagram
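
A sketch of the handler shape only: the CSS selector and bucket are hypothetical, the search URL is assumed to arrive in the EventBridge event payload, and a real scraper must handle authentication, rate limits, and the site's terms of service, which this glosses over:

```python
import json
from datetime import datetime, timezone

import boto3
import requests
from bs4 import BeautifulSoup

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # EventBridge rule supplies the search URL on its schedule (assumption).
    resp = requests.get(event["search_url"], timeout=20)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Hypothetical selector; the real markup differs and changes often.
    jobs = [
        {"title": card.get_text(strip=True)}
        for card in soup.select("h3.job-title")
    ]

    # Partition output by date so downstream analysis can diff runs.
    key = f"jobs/{datetime.now(timezone.utc):%Y-%m-%d}.json"
    s3.put_object(Bucket="example-jobs-bucket", Key=key, Body=json.dumps(jobs))
    return {"count": len(jobs), "key": key}
```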


📈 PySpark Sales Pipeline

Scenario: Businesses need scalable ETL systems to process large sales datasets for timely business intelligence reporting. 📎 GitHub Repo
Solution: A production-ready PySpark ETL that ingests and transforms high-volume sales data into Delta Lake for BI (see the sketch below).
✅ Potential Impact: Built to cut transformation runtimes by 40% and improve sales reporting accuracy through Delta Lake optimization.

🧰 Stack: PySpark, Delta Lake, AWS S3
🧪 Tested On: Local Databricks + S3

PySpark Pipeline Diagram
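
A minimal sketch of the batch ingest-and-transform pattern, assuming a Delta-enabled Spark session; paths and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-etl").getOrCreate()

raw = spark.read.option("header", True).csv("s3a://example-bucket/raw/sales/")

# Typed, deduplicated, validated rows ready for BI consumption.
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("sale_date", F.to_date("sale_date"))
       .dropDuplicates(["order_id"])
       .filter(F.col("amount") > 0)
)

# Partitioning by date keeps downstream BI queries pruned and fast.
(
    clean.write.format("delta")
    .mode("overwrite")
    .partitionBy("sale_date")
    .save("s3a://example-bucket/delta/sales/")
)

# On Databricks, a compaction pass typically follows, e.g.:
# spark.sql("OPTIMIZE delta.`s3a://example-bucket/delta/sales/`")
```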