Skip to content
View yullieyang's full-sized avatar

Block or report yullieyang

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don’t include any personal information such as legal names or email addresses. Markdown is supported. This note will only be visible to you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
yullieyang/README.md

Hi, I'm Yullie πŸ‘‹

Data scientist working on experimentation, machine learning, and AI / LLM systems β€” defining metrics, designing and analyzing A/B tests, and evaluating whether models (statistical or LLM-based) are reliable enough to ship. Background in regulated, high-stakes settings including the Federal Reserve Board, with a focus on rigor, calibration, and reproducibility.

  • πŸ’Ό Quantitative Analyst, CoStar Group β€” Boston, MA (2025 – Present)
  • πŸ§ͺ Data Scientist / Consultant, Guidehouse β€” McLean, VA (2024 – 2025)
  • πŸ› Graduate Intern (Summer 2023), Board of Governors of the Federal Reserve System β€” Division of International Finance
  • πŸŽ“ Master of Applied Science in Computer Science, University of Pennsylvania (in progress)
  • πŸŽ“ M.S. Business Analytics, University of Maryland, College Park (December 2023)
  • 🌏 B.A. Global Political Economy, Waseda University, Tokyo (September 2022)
  • πŸ“ San Francisco Bay Area

What I'm focused on

Experimentation, evaluation, and applied AI: designing and analyzing A/B tests (North Star metrics, power / MDE, CUPED variance reduction, guardrails); building and evaluating LLM and agentic AI systems for reliability, calibration, and evidence grounding with human-in-the-loop review; and reproducible Python / SQL analysis that is auditable end to end. The question I care about most β€” whether a model, statistical or LLM-based, actually does what it claims.

Featured work

Repo What it is
agentic-ai-evaluation-platform Reproducible study of LLM-based QA agents on synthetic model-monitoring anomaly review: 15 labelled scenario types, a deterministic rule baseline, a schema-constrained agent + reviewer agent, four prompt conditions, and evaluation spanning precision/recall, confidence calibration (ECE, Brier), unsupported-claim analysis, and paired statistical testing (McNemar, permutation, Benjamini–Hochberg). Streamlit research dashboard; runs fully offline via a deterministic mock provider; 54 passing tests. Synthetic data.
product-ab-experiment End-to-end A/B test on 100K simulated users: North Star metric + guardrails, sample-ratio-mismatch trust check, two-proportion z-test, power/MDE, CUPED variance reduction, Bonferroni-corrected segmentation, ship/no-ship decision memo, and an ARIMA monitoring forecast. Reproducible synthetic data with an embedded ground-truth effect.
r-macro-trade-commodity-forecast Reproducible R pipeline: 13 FRED macro / trade / commodity series β†’ quarterly panel with implicit trade deflators and terms of trade β†’ 8-quarter auto.arima forecasts for net exports, real GDP, and WTI β†’ distributed-lag FX pass-through regression on U.S. trade prices. CI + Quarto Pages dashboard.
cre_stress_test Portfolio-demo stress-testing workflow on 100% public data (FRED + Google Mobility + Boston Zoning). Python package + R/auto.arima companion + SQLAlchemy persistence + Streamlit dashboard. pytest + CI. Personal project; does not reflect any employer's internal data or models.
llm-research-workflow-assistant Prompt templates, sample outputs, and a human-in-the-loop checklist for using AI coding tools responsibly in recurring research workflows (data QA, code review, brief review, documentation drafting).
model-output-qa-dashboard Streamlit dashboard for quarterly model-release QA: compares prior vs. current model-output extracts, runs fixed checks (missing values, duplicate keys, schema drift, out-of-range, scenario coverage), computes row-level deltas, and exports a Markdown review report with a human-reviewer checklist. Synthetic data.
yullieyang.github.io Personal portfolio site (GitHub Pages) β€” bio, background, and featured projects.

Toolbox

  • Languages: Python, R, SQL
  • Data & infra: SQLite / PostgreSQL / SQLAlchemy, AWS, Azure, Power BI, Tableau, FRED
  • Engineering: Git, GitHub Actions, Streamlit, make-driven pipelines, pytest, R Markdown / Quarto
  • Experimentation: A/B testing, North Star / guardrail metric design, power / MDE analysis, CUPED variance reduction, confidence calibration (ECE, Brier)
  • Modeling: classification under class imbalance, SHAP explainability, ARIMA / auto.arima, distributed-lag regression
  • AI / LLM: LLM & RAG evaluation, agentic workflows, prompt experimentation, human-in-the-loop systems, retrieval-quality analysis, LangChain, LangGraph, Claude API, Claude Code

How I use AI tools

I treat AI coding tools as support for scaffolding, refactoring, and documentation review β€” not as substitutes for analyst judgment. Analytical logic, assumptions, validation checks, and final outputs are independently verified. Every repo that uses an LLM in its workflow includes a CLAUDE.md defining how the model should behave, a clear human-review step before outputs are shared, and explicit limitations of what the AI-assisted output represents.

Contact

πŸ“¬ yullieyang@gmail.com Β· 🌐 yullieyang.github.io Β· πŸ’Ό LinkedIn

Pinned Loading

  1. r-macro-trade-commodity-forecast r-macro-trade-commodity-forecast Public

    Reproducible R workflow: FRED macro/trade/commodity panel, auto.arima forecasts for net exports, real GDP, and WTI, FX pass-through regression.

    R

  2. llm-research-workflow-assistant llm-research-workflow-assistant Public

    Responsible AI workflow prototype for research QA, documentation, and human-in-the-loop review.

    Python

  3. cre_stress_test cre_stress_test Public

    Production-style CRE credit-risk modeling pipeline β€” Python package + R/auto.arima companion + SQLAlchemy persistence + Streamlit dashboard. FRED, Google Mobility, Boston Zoning. pytest + CI.

    HTML