Abhay Gupta

High School Researcher

Research Intern @ Stanford AI Lab

Contact: abhaygupta1266@gmail.com

I am a high school researcher with a deep passion for natural language processing, large language models (LLMs), and AI safety. Currently, I am a research intern at the Stanford Artificial Intelligence Laboratory (SAIL) working directly under Prof. Yejin Choi and Liwei Jiang.

My research focuses on evaluating biases in LLMs, improving fairness, and diagnosing reasoning failures in complex contexts. My work has been published in top venues including EMNLP, NeurIPS, and AACL.

Large Language Models AI Fairness & Alignment Natural Language Processing

Selected Publications (* indicates equal contribution)

NovelHopQA Benchmark

NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

In collaboration with Meta and UC Berkeley

A Gupta*, K Zhu, V Sharma, S O'Brien, M Lu

Proceedings of the Association for Computational Linguistics: EMNLP 2025

Current LLMs struggle to answer questions whose supporting evidence spans tens of thousands of tokens. We introduce NovelHopQA, a benchmark for evaluating 1–4-hop QA over 64k–128k-token excerpts from 83 full-length public-domain novels. Evaluating six state-of-the-art models, we find that scale alone does not guarantee robust multi-hop reasoning.

EnDive Benchmark

EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

A Gupta*, J Cheung, P Meng, S Sayyed, K Zhu, A Liao, S O'Brien

Findings of the Association for Computational Linguistics: EMNLP 2025

EnDive (English Diversity) addresses the lack of intra-language evaluation in standard benchmarks. By translating Standard American English (SAE) datasets into five underrepresented dialects via few-shot prompting, we create a challenging diagnostic that uncovers persistent model biases against speakers of non-standard dialects across reasoning, logic, and math tasks.

AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

A Gupta*, E Yurtseven, P Meng, K Zhu

Proceedings of the Third Workshop on NLP for Positive Impact, 2024

To support more inclusive NLP systems, we introduce AAVENUE, a benchmark for evaluating LLMs on NLU tasks in African American Vernacular English (AAVE). The benchmark uses human-verified LLM translation to reliably extend GLUE and SuperGLUE tasks to AAVE.

MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning

R Ghosh*, A Gupta*, H McBride, A J Vaidya, F Mahmood

Proceedings of SciProdLLM, IJCNLP-AACL 2025

MedEqualQA evaluates how demographic cues influence clinical reasoning in frontier LLMs by holding critical symptoms constant while perturbing patient pronouns across 69,000 parallel test items, exposing localized divergences in downstream medical rationales.

CivicParse: A Benchmark and Pipeline for Structured Online Deliberation

A Gupta, M Klein

Accepted @ NeurIPS 2025 LLM Evaluation Workshop

CivicParse introduces a two-stage NLP extraction and classification pipeline that structures raw, free-form online discussions into a coherent deliberation map of core issues, barriers, and solutions. It outperforms prompt-only baselines and establishes a standardized task for modeling collective civic intelligence.

Experience

Stanford Artificial Intelligence Laboratory (SAIL)

Research Intern

Aug 2025 - Present · Remote

  • Working directly under Prof. Yejin Choi and Liwei Jiang within the Stanford NLP Lab.
  • Researching pluralistic alignment: ensuring generative AI models are aligned with diverse sets of human values and perspectives.

Harvard-MIT Health Sciences and Technology (HST)

Research Intern

Mar 2025 - Jan 2026 · Remote

  • Led research within Prof. Faisal Mahmood's AI for Pathology Lab, investigating biases in frontier models deployed in clinical settings.
  • Lead Author for MedEqualQA, a framework introducing a counterfactual benchmark to evaluate pronoun-driven clinical reasoning drift in LLMs.

Massachusetts Institute of Technology (MIT CCI Lab)

Research Intern

Nov 2024 - Dec 2025 · Remote

  • Collaborating with Prof. Mark Klein at the MIT Center for Collective Intelligence.
  • First author on CivicParse, an NLP extraction and classification pipeline bridging argumentation theory and AI to map large-scale deliberative discussions.

Cluely AI

Machine Learning Intern

Sep 2025 - Nov 2025 · Remote

  • Worked on core infrastructure for a $120M startup backed by Andreessen Horowitz (a16z) that develops an undetectable real-time AI assistant for desktop.
  • Reworked Cluely's internal evaluation pipeline to monitor inference speed and context tracking during real-time screen and audio processing.
  • Redesigned production system prompts to deliver accurate, unobtrusive suggestions for meetings, calls, and brainstorming sessions while minimizing generic generations.

Algoverse AI Research

LLM Researcher

Jan 2024 - Sep 2025 · Remote

  • Conducted rigorous, independent NLP research focused on multi-hop reasoning failures and evaluating dialectal fairness in large language models.
  • Led collaborative lab teams as primary author to conceptualize, evaluate, and publish benchmarks, culminating in acceptances at EMNLP Main '25 and Findings of EMNLP '25.
  • Honored with the 2025 Davidson Fellow Scholarship ($25,000), recognizing excellence and impact in AI research.

Mathos AI (YC W24)

Machine Learning Intern

Jun 2025 - Aug 2025 · Remote

  • Built adaptive flashcard and quiz generation features that integrate natively with Mathos's step-by-step math LLM engine.
  • Features deployed at scale, assisting over 1M active student users globally.

Grants & Fellowships

AI2050 Compute Grant (100K hours of NVIDIA H100 compute)

Schmidt Sciences (Jan 2026)

Davidson Fellow Scholarship ($25,000)

Davidson Institute (Jul 2025)

Personal

My journey in AI began with a simple curiosity about how machines process human language, but it was Kevin Zhu who truly opened the door to the world of research for me. He was the very first person to guide my early steps, patiently teaching me the fundamentals of research and becoming my most impactful mentor. That initial curiosity quickly evolved into a passion for ensuring these systems are fair, inclusive, and safe for everyone. As I continue to grow, I am also incredibly grateful to Prof. Yejin Choi and Liwei Jiang for continuously inspiring me to tackle the difficult sociotechnical problems in AI alignment.

"Growth happens the moment you decide to step into the unknown. The challenges we face are not roadblocks, but the very stepping stones that build our resilience and character."

Outside of running evaluations and writing papers, I enjoy exploring the intersection of technology and linguistics, keeping up with the rapid pace of open-source AI, and finding new ways to make complex machine learning concepts accessible to my peers.