Maneesh Ujji

Applied AI / AI Systems Engineer
Building AI systems with retrieval, evaluation, and human oversight.

How a ticket moves through the system
New ticket ("Can't access meal plan")
→ Classify + retrieve: sentence-transformers → ChromaDB
→ Draft response: LLM + retrieved docs, with citations
→ Confidence scoring · 0.87, then route by confidence score:
   score > 0.7 (green · human decides): Human review → Send; approve / edit / reject · with citations
   0.4 to 0.7 (amber · system flags uncertainty): Flag for review; draft + uncertainty flag
   score < 0.4 (red · no AI response attempted): Escalate to human agent, no AI response
Knowledge base: product docs · past resolutions · policies; embedded vectors (ChromaDB); retrieval feeds classifier
// why i built it this way

Three confidence tiers, not two. Most systems do "confident" vs "not confident." I added a middle tier: the system still drafts, but explicitly flags its uncertainty so the reviewer knows to look closer.

Citations aren't optional. Every draft includes the source docs it pulled from. If the retrieval was weak or irrelevant, that's visible, so the reviewer can see the system's reasoning and not just its output.

Below 0.4, no draft at all. The system doesn't guess. It escalates directly to a human agent with the raw ticket. Some problems shouldn't have an AI answer.
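The three tiers can be sketched as a small routing function. This is a minimal illustration, not the production code: the 0.7 and 0.4 thresholds come from the diagram, while the `Routing` type, the `route_ticket` name, and the tier labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds from the diagram; all names below are illustrative assumptions.
HIGH_CONFIDENCE = 0.7
LOW_CONFIDENCE = 0.4


@dataclass
class Routing:
    tier: str                # "review", "flag", or "escalate"
    draft: Optional[str]     # None when no AI response is attempted
    citations: List[str]     # source docs shown to the reviewer
    uncertainty_flag: bool   # True only for the middle tier


def route_ticket(score: float, draft: str, citations: List[str]) -> Routing:
    """Pick a tier for a drafted response based on its confidence score."""
    if score > HIGH_CONFIDENCE:
        # Green: a human still approves, edits, or rejects before sending.
        return Routing("review", draft, citations, uncertainty_flag=False)
    if score >= LOW_CONFIDENCE:
        # Amber: keep the draft, but tell the reviewer to look closer.
        return Routing("flag", draft, citations, uncertainty_flag=True)
    # Red: discard the draft and hand the raw ticket to a human agent.
    return Routing("escalate", None, [], uncertainty_flag=False)
```

A score of exactly 0.7 or 0.4 lands in the middle tier here, which matches the "0.4 to 0.7" band in the diagram.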

RAG · ChromaDB · sentence-transformers · Python · Human-in-the-Loop
How four agents handle an email
Incoming email ("Brand collab offer")
→ Classifier agent: collab / support / personal / spam
   spam → Archive, no further processing
→ Retrieval agent: semantic search from KB
→ Drafting agent: Gemini + retrieved context
→ Logger agent: follow-up + due date
EVALUATION PIPELINE
Accuracy per category: collab · support · personal · spam; tracks drift over time
F1 Score per class: precision + recall balanced; catches rare-class regressions
Confusion matrix vs baseline: per-cell misclassification deltas; shows where the model breaks
// why i built it this way

Four specialized agents, not one general one. Each agent has a single job. The classifier doesn't draft. The drafter doesn't classify. This makes each component testable and replaceable independently.

Spam gets killed early. The classifier's first job is to stop spam from consuming any downstream compute. If it's spam, archive it and move on. No retrieval, no drafting, no logging.
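The early exit could look roughly like the sketch below. The four agents are passed in as plain callables so each one stays swappable; the function name, the callable signatures, and the returned dictionary shape are assumptions made for illustration.

```python
from typing import Callable, List


def handle_email(
    email: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    draft: Callable[[str, List[str]], str],
    log: Callable[[str, str], None],
) -> dict:
    """Run the four agents in order, stopping immediately on spam."""
    category = classify(email)
    if category == "spam":
        # Archive and return: no retrieval, drafting, or logging happens.
        return {"category": "spam", "action": "archive", "draft": None}

    context = retrieve(email)      # semantic search over the knowledge base
    reply = draft(email, context)  # LLM draft grounded in retrieved context
    log(category, reply)           # follow-up + due date bookkeeping
    return {"category": category, "action": "draft", "draft": reply}
```

Because the agents are injected, each one can be replaced with a stub in tests, which is the property the design above relies on.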

Evaluation isn't an afterthought. I built a full evaluation pipeline: accuracy per category, F1 scores, and confusion matrices against a baseline. If the classifier degrades, you know exactly where and by how much.
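As a rough sketch of those metrics, the snippet below computes per-class F1 and per-cell confusion deltas using only the standard library. The function names are hypothetical, and the real pipeline may well use a library such as scikit-learn instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs: a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))


def per_class_f1(y_true, y_pred, labels):
    """F1 per label, so a rare-class regression is not averaged away."""
    counts = confusion_counts(y_true, y_pred)
    scores = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def confusion_delta(baseline, current):
    """Cells whose counts changed relative to the baseline confusion matrix."""
    cells = set(baseline) | set(current)
    return {c: current[c] - baseline[c] for c in cells if current[c] != baseline[c]}
```

A positive delta in an off-diagonal cell such as `("spam", "collab")` points at exactly which confusion got worse.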

Multi-Agent · Gemini · Semantic Search · Python · Evaluation Pipeline
How the agent tracks and decides
Scheduled trigger: Windows Task Scheduler, runs daily
→ Phase 1: Fetch. Amadeus API · route + date prices
→ Phase 2: Analyze. Gemini LLM · trend + seasonality
→ Phase 3: Notify. SMTP email · recommendation out
JSON Memory Store: historical prices · trends · deltas; persistent state, no database (reads ↓ ↑ writes)
DECISION OUTCOMES
BUY NOW: price dropped significantly
WAIT: price stable or rising
TRACKING: insufficient data
Fully automated, with no manual intervention. No news = no action needed. Memory persists across runs so the agent sees the trend, not a snapshot.
// why i built it this way

JSON memory, not a database. For a personal tool tracking a handful of routes, a lightweight JSON file beats setting up a database. The memory stores price snapshots over time: simple, portable, and zero infrastructure.
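A minimal sketch of such a memory file is shown below. The file layout (route key mapping to a list of dated snapshots) and the helper names are assumptions, not the agent's actual schema.

```python
import json
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Return the stored snapshots, or an empty store on the first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record_price(path: Path, route: str, date: str, price: float) -> dict:
    """Append one price snapshot for a route and persist the whole store."""
    memory = load_memory(path)
    memory.setdefault(route, []).append({"date": date, "price": price})

    # Write to a temporary file first, then swap it in, so an interrupted
    # run does not leave a half-written memory file behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(memory, indent=2), encoding="utf-8")
    tmp.replace(path)
    return memory
```

Each daily run would call `record_price` once per tracked route and then pass the accumulated list to the analysis phase.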

LLM for analysis, not just comparison. A simple "is today cheaper than yesterday?" check is trivial. I use Gemini to analyze trends across multiple snapshots: seasonality patterns, rate of change, and whether a drop is noise or a real deal.

Fully automated. Windows Task Scheduler triggers the agent daily. It fetches, analyzes, and emails, with no manual intervention. I wake up and either get a recommendation or nothing (no news = no action needed).

LLM Agent · Gemini · Amadeus API · SMTP · Python · Task Automation
How JobPulse works in your browser
BROWSER (CHROME)
JOB LISTING PAGE
Company: Acme Corp
Role: ML Engineer
Requirements: 3+ yrs Python · PyTorch / TensorFlow · Kubernetes in production · MLOps experience
Responsibilities: Model deployment · Data pipeline ownership · ...
Location: Remote · Salary: $$ · Posted 2 days ago
JOBPULSE EXTENSION
→ Content script: scrapes job details from the DOM
→ LLM parsing: structured extraction of skills, experience, stack
→ Resume comparison: skill matching + gap analysis + scoring
Fit Score: 78%
Strong: Python · PyTorch · model deployment
Gap: Kubernetes (mentioned 2×)
Action: study K8s basics before applying
// why i built it this way

In the browser, not a separate app. Job seekers have 20 tabs open. Making them copy-paste into a separate tool breaks the flow. JobPulse runs right on the job listing page, with zero context switching.

Structured extraction, not raw text comparison. The LLM doesn't just compare blobs of text. It extracts structured fields (required skills, years of experience, tech stack) from the listing, then matches each against your resume's parsed skills.

Gaps matter more than matches. The most useful output isn't "you match 78%." It's "you're missing Kubernetes and they asked for it twice." That tells you whether to apply or what to study.
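The matching step might be sketched as below, in Python for consistency with the other examples even though the extension itself is JavaScript. The mention-weighted score is an assumed formula chosen for illustration; it is not claimed to be how JobPulse computes its fit score.

```python
from typing import Dict, Set


def fit_report(required: Dict[str, int], resume_skills: Set[str]) -> dict:
    """Compare extracted requirements against resume skills.

    `required` maps each skill to how often the listing mentions it.
    """
    have = {skill.lower() for skill in resume_skills}
    strong = [s for s in required if s.lower() in have]
    missing = [s for s in required if s.lower() not in have]
    # Most-mentioned gaps first: those are the ones the employer stressed.
    missing.sort(key=lambda s: -required[s])

    total = sum(required.values())
    matched = sum(required[s] for s in strong)
    score = round(100 * matched / total) if total else 0
    return {
        "fit_score": score,
        "strong": strong,
        "gaps": [(s, required[s]) for s in missing],
    }
```

Returning gaps with their mention counts keeps the "they asked for it twice" signal available to the UI instead of collapsing everything into one percentage.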

Vue · Vite · Chrome Extension · LLM · JavaScript

From raw launch data to stakeholder dashboard
Data sources: SpaceX REST API · Wikipedia scraping · ~90 launches · multi-source join
→ Wrangling: Landing outcome · One-hot encoding · Null handling · Type coercion
→ SQL analysis: Launch site rates · Booster trends · Payload ranges · Orbit correlations
→ EDA: Success over time · Site vs orbit · Visual feature analysis · Feeds model selection
Model comparison: same data, four approaches
Logistic regression: 83.3% (linear baseline)
SVM: 83.3% (kernel-based)
Decision tree: 88.9% (captures feature splits; best performer)
KNN: 83.3% (instance-based)
What the data revealed
KSC LC-39A: highest success-rate launch site
2013 →: success rate climbs sharply after 2013
FT + B5: booster versions with best odds
// why i built it this way

Four models, not one. The point wasn't to pick the best model; it was to show stakeholders that different approaches converge on similar accuracy, which builds confidence in the prediction itself.

Dashboard over notebook. The audience was non-technical. A Jupyter notebook is useless to them. The Plotly Dash dashboard lets them filter by launch site, orbit type, and booster version and see predictions update live.

SQL before modeling. Running SQL queries on the raw data first surfaced the patterns that mattered: launch site and booster version were the strongest predictors. The models confirmed what the data already showed.
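A query of that kind might look like the sketch below, run against an in-memory SQLite table. The table name, column names, and any rows passed in are placeholders for illustration; they are not the project's schema or data.

```python
import sqlite3


def success_rate_by_site(rows):
    """Aggregate landing success per launch site with plain SQL.

    `rows` is an iterable of (site, booster, landed) tuples, landed as 0 or 1.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE launches (site TEXT, booster TEXT, landed INTEGER)")
    conn.executemany("INSERT INTO launches VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT site,
               COUNT(*)              AS launches,
               ROUND(AVG(landed), 2) AS success_rate
        FROM launches
        GROUP BY site
        ORDER BY success_rate DESC, site
        """
    ).fetchall()
    conn.close()
    return result
```

Swapping `site` for `booster` in the `SELECT` and `GROUP BY` gives the equivalent booster-version view.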

Python · Scikit-learn · SQL · Plotly Dash · Pandas · Web Scraping
From raw tweets to emotion labels
Raw text input: social media tweets, posts; slang · emojis · hashtags
→ Preprocessing: Tokenization · Stopword removal · Lowercasing · Noise removal
→ Feature engineering: TF-IDF · Vectorization · Feature selection · n-grams
→ Classification: Supervised models · Model comparison · Hyperparameter tuning
→ Predictions: joy · anger · sadness · fear · surprise · love
Prediction strength across categories
joy: strong
anger: strong
sadness: moderate
fear: moderate
surprise: weaker (fewer samples)
love: weaker (overlaps with joy)
// why i built it this way

Multi-category, not binary sentiment. "Positive vs negative" is a solved problem. Distinguishing joy from love, or anger from fear, in short informal text is where the real challenge is, and where NLP gets interesting.

Pipeline quality matters more than model choice. The preprocessing step (handling slang, emojis, hashtags, abbreviations) had more impact on accuracy than switching between models. Garbage in, garbage out.
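A simplified version of that kind of preprocessing is sketched below. The slang table, the regular expressions, and the choice to keep emojis as tokens are all illustrative assumptions, and stopword removal is left out for brevity.

```python
import re

# Tiny illustrative slang table; a real one would be much larger.
SLANG = {"u": "you", "gr8": "great", "thx": "thanks"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # 3+ repeats of one character
# Words (with an optional apostrophe suffix) or single non-ASCII symbols
# such as emojis; ASCII punctuation matches neither and is dropped.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?|[^\x00-\x7F\w\s]")


def preprocess(text: str) -> list:
    """Normalize an informal post into a list of tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # links carry no emotion signal
    text = MENTION_RE.sub(" ", text)     # drop @user handles
    text = HASHTAG_RE.sub(r"\1", text)   # "#blessed" -> "blessed"
    text = REPEAT_RE.sub(r"\1\1", text)  # "sooooo" -> "soo"
    tokens = TOKEN_RE.findall(text)
    return [SLANG.get(token, token) for token in tokens]
```

Emojis are kept as tokens here because they often carry the emotion label more directly than the surrounding words.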

Per-category evaluation, not just aggregate accuracy. An overall accuracy of 85% can hide a model that is terrible at detecting surprise (a rare class) while great at joy (a common class). I evaluated each emotion independently.

NLP · Python · Scikit-learn · Text Classification
From survey responses to severity prediction
PHQ-9 Survey: 9 questions · Likert 0 to 3 · validated instrument
→ Data processing: Scoring (0 to 27) · Band assignment · demographic joins
→ Analysis: Statistical models · Correlation visualization
SEVERITY BANDS
0 to 4: minimal · no clinical concern
5 to 9: mild · watchful waiting
10 to 14: moderate · treatment plan
15 to 19: mod-severe · active treatment
20 to 27: severe · immediate referral
FINDINGS
Age-based trends: Younger respondents show different severity patterns; earlier onset · different band distribution than older cohorts
Correlated factors: Sleep disruption, fatigue, and concentration loss cluster together; (somatic) symptoms move as a unit
Research tool, not diagnostic. Designed to communicate patterns to researchers and public health audiences.
// why i built it this way

Clinical scoring, not arbitrary labels. PHQ-9 has established clinical cutoffs. I used the validated severity bands (minimal/mild/moderate/moderately severe/severe) rather than inventing my own thresholds.
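The scoring and band assignment can be sketched as follows. The band boundaries are the ones listed above; the function names and validation rules are illustrative assumptions.

```python
# Severity bands as listed above: (low, high, label), inclusive on both ends.
BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_score(answers):
    """Sum nine Likert answers (each 0 to 3) into a total from 0 to 27."""
    if len(answers) != 9:
        raise ValueError("PHQ-9 requires exactly 9 answers")
    if any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("each answer must be 0, 1, 2, or 3")
    return sum(answers)


def severity_band(total):
    """Map a total score to its severity band label."""
    for low, high, label in BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total must be between 0 and 27")
```

For example, `severity_band(phq9_score([1] * 9))` gives a total of 9 and the label "mild".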

Age as a lens, not a feature. Instead of just throwing age into the model, I used it to segment the analysis, revealing that severity patterns differ meaningfully across age groups, which has clinical implications.

Visualization for awareness, not diagnosis. This is a research tool, not a clinical one. The outputs are designed to communicate patterns to researchers and public health audiences, not to diagnose individuals.

Python · ML · Data Visualization · Healthcare Analytics
The question
U.S. state-level diabetes prevalence → COVID-19 mortality and hospitalization rates
The analysis pipeline
State health data: CDC, public datasets · 50 states + territories · diabetes + COVID joined
→ Cleaning: Missing values · Standardization · per-capita rates
→ Regression: Linear + multivariate · confounder controls
→ Visualize: For non-technical stakeholders · clarity over complexity
What the data showed
Positive correlation: Higher diabetes prevalence → higher COVID mortality, even controlling for baseline factors.
Regional disparities: Southern states cluster with the highest prevalence and the highest outcome severity.
Confounders matter: Income, healthcare access, and obesity overlap with diabetes rates. Flagged, not ignored.
// why i built it this way

Research question first, not technique first. I didn't start with "let me try regression." I started with "does diabetes prevalence predict COVID outcomes?" The method follows the question.

Confounders acknowledged, not ignored. The correlation between diabetes and COVID mortality is real, but income, obesity, and healthcare access travel with it. The analysis flags this instead of overstating the finding.
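To illustrate why the confounder check matters, the sketch below uses partial correlation, a simpler stand-in for the multivariate regression used in the analysis. The numbers are synthetic and deliberately constructed so that a strong raw correlation disappears once the third variable is accounted for; they are not the study's data, and they do not reproduce its result.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic values only: both series are the confounder plus unrelated noise.
confounder = [1, 2, 3, 4, 5, 6, 7, 8]
exposure = [2, 1, 2, 5, 6, 5, 6, 9]
outcome = [2, 1, 4, 3, 4, 7, 6, 9]

raw = pearson(exposure, outcome)
adjusted = partial_correlation(exposure, outcome, confounder)
```

Here `raw` is 0.84 while `adjusted` is effectively zero, which is the extreme case the analysis guards against by controlling for confounders before reporting a relationship.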

Audience-aware visualization. The outputs were designed for people who don't read scatter plots. Clear labels, annotated axes, and takeaway-first summaries rather than raw statistical output.

Python · Pandas · Regression · Data Visualization · Public Health

Jan 2026 to Present
Marketing Data Analyst at Aramark
Cleveland, OH
Production data systems for university dining. Backend debugging, student account fixes, operational website maintenance. Exploring AI-assisted email triage.
Oct 2024 to Dec 2025
IT Analyst at Aramark
Cleveland, OH
Backend troubleshooting for student meal plan systems. Built and deployed the VikingFoodCo website used campus-wide.
May 2024 to Jan 2025
Graduate Research Assistant at Cleveland State University
Cleveland, OH
Thermal imagery object detection. Simulation-based transfer learning. Visual performance reporting.

Systems should surface uncertainty, not hide it.

Human review before automated action.

Evaluate against failure modes, not just accuracy.

Build for the case where the model is wrong.


M.S. Computer Science from Cleveland State University
Dec 2025
B.Tech Computer Science Engineering from Avanthi Institute of Engineering and Technology
Nov 2022
OpenAI Prompt Engineering · Google/Kaggle AI Agents · DeepLearning.AI Supervised ML