About

Headshot of Ali Abouelazm

Junior Data Engineer shaping end-to-end data ecosystems: sourcing, cleansing, modeling, and delivering insights that accelerate experimentation and product decisions.

Blend software engineering, statistics, and ML/AI to transform messy datasets into reliable products: feature stores, experimentation platforms, forecasting services, and data apps used daily by partners.

Collaborate with product managers, data scientists, analytics engineers, and operations teams to frame ambiguous questions, design measurable roadmaps, and ship scalable solutions.

Obsessed with hardening data quality, documenting domain knowledge, and communicating the "so what" behind every model to drive action for both technical and business leaders.

Currently automating ingestion pipelines and evaluation dashboards for ML/AI teams.

  • Based in Sugar Land, TX
  • Open to internships in Data Science & ML/AI

Projects

clinix.ai

Python | pandas | NumPy | scikit-learn | SQLAlchemy | FastAPI | Streamlit | OpenAI/Anthropic | SQLite | matplotlib

A medical triage and symptom-to-risk assessment system that combines LLM semantic interpretation with classical machine learning for healthcare decision support. The system accepts free-text symptom descriptions and uses OpenAI GPT and Anthropic Claude models to parse them into structured medical features: symptom severity, duration, associated symptoms, and patient demographics. Trained Logistic Regression and Random Forest models turn those features into risk scores and assign a triage category (low, medium, high, or critical) based on clinical guidelines. A SQLAlchemy ORM layer over SQLite stores patient records, symptom data, model predictions, and outcome tracking. A FastAPI backend exposes REST endpoints for symptom submission, risk assessment, and patient history retrieval. A Streamlit dashboard provides interactive triage visualization, real-time analytics on patient flow and risk distribution, model performance metrics, and administrative tools for healthcare providers. The project demonstrates end-to-end ML/AI pipeline development with an emphasis on interpretability and healthcare workflow integration.
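The scoring step described above (structured features in, risk score and triage category out) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `ParsedSymptoms` schema, the weights, the bias, and the category cut-offs are all hypothetical placeholders, and none of them is clinical guidance. A real model would learn its coefficients from labeled data.

```python
import math
from dataclasses import dataclass


@dataclass
class ParsedSymptoms:
    """Structured features an LLM parser might return (hypothetical schema)."""
    severity: int           # 0 (none) to 10 (worst), hypothetical scale
    duration_hours: float
    age: int
    associated_count: int   # number of associated symptoms


# Hypothetical coefficients; a trained Logistic Regression would supply these.
WEIGHTS = {
    "severity": 0.45,
    "duration_hours": 0.01,
    "age": 0.02,
    "associated_count": 0.30,
}
BIAS = -4.0

# Hypothetical cut-offs, checked from highest to lowest. Not clinical guidance.
THRESHOLDS = [(0.75, "critical"), (0.50, "high"), (0.25, "medium")]


def risk_score(p: ParsedSymptoms) -> float:
    """Logistic-regression-style score in (0, 1)."""
    z = BIAS + sum(w * getattr(p, name) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def triage_category(score: float) -> str:
    """Map a risk score to one of the four triage categories."""
    for cutoff, label in THRESHOLDS:
        if score >= cutoff:
            return label
    return "low"


mild = ParsedSymptoms(severity=1, duration_hours=2, age=25, associated_count=0)
severe = ParsedSymptoms(severity=10, duration_hours=48, age=70, associated_count=4)
print(triage_category(risk_score(mild)), triage_category(risk_score(severe)))
```

Keeping the score and the category mapping in separate functions makes it possible to retune the cut-offs without retraining the model.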

PL Predictor

Python | pandas | NumPy | scikit-learn | XGBoost | BeautifulSoup | Selenium | Streamlit

An English Premier League match outcome predictor that classifies each match as Home Win, Draw, or Away Win using XGBoost. Data comes from public historical datasets and from web scraping with BeautifulSoup and Selenium for real-time updates. Cleaning and preprocessing pipelines handle missing values, type conversions, and standardization across sources. Engineered features include rolling team performance metrics (goals scored and conceded, points per match), form features for home and away contexts, and difference features comparing team strengths. The XGBoost classifier is trained with time-based train/validation splits and evaluated with comprehensive metrics. The project also includes feature importance visualization and analysis, post-processing utilities for probability conversion and label mapping, and an interactive Streamlit demo app that predicts match outcomes alongside real-time feature importance displays.
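One way to build the rolling points-per-match and difference features without leaking the result of the match being predicted is sketched below. It is a plain-Python stand-in for what the project likely does with pandas. The dict keys, the function names, and the five-match window are assumptions made for illustration.

```python
from collections import defaultdict, deque


def points(goals_for: int, goals_against: int) -> int:
    """League points earned from one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    return 1 if goals_for == goals_against else 0


def add_form_features(matches, window=5):
    """Attach rolling points-per-match features to date-sorted matches.

    matches: list of dicts with keys home, away, home_goals, away_goals
    (hypothetical schema), sorted oldest first.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for m in matches:
        home_hist, away_hist = history[m["home"]], history[m["away"]]
        # Features use only matches played before this one; 0.0 when no history.
        home_ppm = sum(home_hist) / len(home_hist) if home_hist else 0.0
        away_ppm = sum(away_hist) / len(away_hist) if away_hist else 0.0
        rows.append({**m, "home_ppm": home_ppm, "away_ppm": away_ppm,
                     "ppm_diff": home_ppm - away_ppm})
        # Update histories only after recording features, to avoid leakage.
        home_hist.append(points(m["home_goals"], m["away_goals"]))
        away_hist.append(points(m["away_goals"], m["home_goals"]))
    return rows


fixtures = [
    {"home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"home": "B", "away": "A", "home_goals": 1, "away_goals": 1},
    {"home": "A", "away": "C", "home_goals": 0, "away_goals": 1},
]
for row in add_form_features(fixtures):
    print(row["home"], row["away"], row["home_ppm"], row["away_ppm"], row["ppm_diff"])
```

The same ordering discipline applies to the time-based split: every validation match should be later than every training match.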

Localytics

Python | R | SQL | scikit-learn | GeoPandas | Tableau

A market segmentation and geospatial analytics project combining demographic and behavioral data. Clustering algorithms (K-means and hierarchical clustering) and regression models from scikit-learn identify distinct customer segments and predict market trends. GeoPandas handles the spatial analysis, processing geospatial datasets to uncover location-based patterns and correlations. Features are engineered from demographic data, including income levels, age distributions, and population density metrics. Interactive Tableau dashboards visualize patterns, segment distributions, and geographic heatmaps to communicate findings to stakeholders. Preprocessing pipelines clean and standardize multi-source datasets so that data stays consistent across demographic and geographic dimensions.
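For readers unfamiliar with K-means, the segmentation step can be illustrated with a small pure-Python version of Lloyd's algorithm. This is a teaching sketch that stands in for scikit-learn's `KMeans`, which the project actually uses. The sample points and starting centroids are invented, and the inputs are assumed to be already-standardized numeric features.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Invented two-feature observations forming two obvious groups.
sample = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
labels, centers = kmeans(sample, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

In practice the library version adds smarter initialization and multiple restarts, which matter when segments are not this cleanly separated.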

Stockly

Python | pandas | NumPy | scikit-learn | TensorFlow/Keras | SQLite | matplotlib | Streamlit

A production-quality stock market prediction and backtesting system with SQLite-based data storage, comprehensive feature engineering, and multiple ML models. The pipeline runs from data acquisition (Alpha Vantage API and CSV ingestion) through SQL schema design for prices, features, targets, and predictions. Engineered technical indicators include RSI, MACD, moving averages, and volatility measures. Baseline classical models (Logistic Regression, Random Forest) and deep learning sequence models (LSTM/GRU) are trained for next-day direction classification and return prediction. Time-series-aware backtesting uses rolling window evaluation, performance metrics, and strategy comparison against buy-and-hold. Results are shown in a deliberately retro, blocky pixel-style Streamlit visualizer with square markers, step plots, and a bold color palette for price charts, prediction signals, and cumulative returns.
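The comparison against buy-and-hold can be sketched as below, under simplifying assumptions: a long-or-flat strategy, one signal per trading day, and no transaction costs or slippage. The function name and the price series are hypothetical and are not taken from the project.

```python
def backtest(prices, signals):
    """Compare a long-or-flat strategy with buy-and-hold.

    prices: daily closing prices, oldest first.
    signals[t]: position (1 = long, 0 = flat) chosen at the close of day t
    and held over the move from day t to day t + 1.
    Returns (strategy_return, buy_and_hold_return) as cumulative fractions.
    """
    if len(signals) != len(prices) - 1:
        raise ValueError("need exactly one signal per daily return")
    strategy, buy_hold = 1.0, 1.0
    for t, position in enumerate(signals):
        daily_return = prices[t + 1] / prices[t] - 1.0
        buy_hold *= 1.0 + daily_return
        # The strategy earns the day's return only when it holds the asset.
        strategy *= 1.0 + position * daily_return
    return strategy - 1.0, buy_hold - 1.0


# Invented prices: up 10%, down 10%, up 10%. The signals skip the down day.
strategy_ret, hold_ret = backtest([100.0, 110.0, 99.0, 108.9], [1, 0, 1])
print(f"strategy {strategy_ret:.3f} vs buy-and-hold {hold_ret:.3f}")
```

Because each signal is fixed before the return it is applied to, the loop cannot peek at the next day's price, which is the property a time-series-aware backtest needs to preserve.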

SwimMatch

HTML | CSS | Web Development | Business Operations | Marketing

A private swim lesson platform connecting students with experienced instructors for personalized 1-on-1 coaching in backyard pools. Developed and maintained the SwimMatch website using HTML and CSS, improving user experience and reducing coach-student match time. Coached 30+ students, improving their swimming skills and confidence. Reached $11,000 in monthly revenue within the first month of operation. Implemented online booking and scheduling systems that made coach-student sessions easier to manage. Coordinated a team of swim coaches and ran social media marketing campaigns, expanding the user base by 15% in the first two months.

Experience & Leadership

    Skills/Interests

    Languages

    Python | R | SQL | Java | JavaScript | TypeScript | C/C++ | HTML/CSS

    AI/ML

    scikit-learn | XGBoost | CatBoost | LightGBM | TensorFlow | PyTorch | Keras | Transformers

    Data/Viz

    pandas | NumPy | SciPy | Dask | GeoPandas | Statsmodels | Matplotlib | Seaborn | Plotly | Tableau

    Interests

    Traveling | Soccer | Swimming | Philosophy | Family | Food | Gym

    Resume

    My Life in Data

    GitHub

    Projects: 8 completed projects

    Technologies: 25+ tools & languages

    Experience: 2+ years in data science

    Daily Routine

    Learning Progress

    Tech Stack Usage

    My Typing Rhythm