Yue Yu

Taken in Anchorage, Alaska

Room E1317, CODA Building

756 W Peachtree St NW, Atlanta, GA 30308

Hello! I am a final-year PhD student at the School of Computational Science and Engineering, Georgia Institute of Technology. I mainly work at the intersection of Large Language Models and Data-centric AI.

Before joining Georgia Tech, I obtained my bachelor’s degree (with honors) from the Department of Electronic Engineering, Tsinghua University in 2019, where I also worked on spatio-temporal data mining under the supervision of Dr. Yong Li.

Feel free to drop me an email (yueyu at gatech dot edu) if you have any questions about my research or would like to discuss NLP in general.

Education

Georgia Institute of Technology (2019 - Present)

Ph.D. in Computational Science and Engineering

GPA: 4.00/4.00

Thesis Topic: Towards Efficiently and Effectively Harnessing Large Pre-trained Models via Data-centric Lens.

Advisor: Prof. Chao Zhang

Tsinghua University (2015 - 2019)

B.Eng. in Electronic Engineering

GPA: 3.87/4.00 (Outstanding Graduate)

Research Focus: Spatio-temporal Data Mining [WWW 2019, UbiComp 2020], Recommender Systems [UbiComp 2019].

Advisor: Prof. Yong Li

Industrial Experience

Meta (May 2024 - Aug 2024): Research Intern, GenAI (Llama Post-training Team); Host: Rui Hou, Manager: Melanie Kambadur; Topic: Self-Critiquing Reward Models [Preprint].
NVIDIA (Jan 2024 - May 2024): Research Intern, Applied Deep Learning Research Group; Host: Wei Ping, Manager: Mohammad Shoeybi; Topic: LLM Instruction Fine-tuning for Zero-shot Retrieval-Augmented Generation [NeurIPS 2024].
Google Research (May 2023 - Aug 2023): Research Intern, News Understanding Group; Host: Jiaming Shen, Manager: Jialu Liu; Topic: LLM In-context Learning with Rationales [ACL 2024].
Microsoft Research (May 2021 - Aug 2021): Research Intern, Productivity and Intelligence Group; Mentor: Chenyan Xiong, Manager: Arnold Overwijk; Topic: Zero-shot Dense Text Retrieval [EMNLP 2022].
IQVIA (May 2020 - Aug 2020): Research Intern, Analytics Center of Excellence; Mentor: Cao (Danica) Xiao; Topic: Knowledge-enhanced Drug Interaction Prediction [Bioinformatics 2021].

News

Sep 25, 2024	Two papers are accepted to NeurIPS 2024 and three papers are accepted to EMNLP 2024. Congratulations!
May 16, 2024	6 papers are accepted to ACL 2024 (4 Main Conf, 2 Findings).
Oct 25, 2023	Honored to receive the NeurIPS 2023 Scholar award!
Sep 22, 2023	3 papers are accepted to NeurIPS 2023. Thanks to my collaborators!
May 16, 2023	Check out the recent publications: 2 first-author papers are accepted to ACL 2023 (1 Main Conf, 1 Findings), and 3 coauthored papers are accepted to KDD 2023. Thanks and congratulations to my collaborators!

Selected Publications

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro

Proceedings of NeurIPS, 2024.

arXiv PDF
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Yue Yu*, Yuchen Zhuang*, Jieyu Zhang*, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang

Proceedings of NeurIPS (D&B Track), 2023.

arXiv PDF Code
Cold-Start Data Selection for Better Few-shot Language Model Fine-tuning: A Prompt-based Uncertainty Propagation Approach

Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, and Chao Zhang

Proceedings of ACL, 2023.

Abs PDF Code

We present PATRON, a prompt-based data selection method for pre-trained language model fine-tuning under cold-start scenarios, i.e., no initial labeled data are available. In PATRON, we design (1) a prompt-based uncertainty propagation approach to estimate the importance of data points and (2) a partition-then-rewrite (PTR) strategy to promote sample diversity when querying for annotations. Experiments on six text classification datasets show that PATRON outperforms the strongest cold-start data selection baselines by up to 6.9%. Besides, with 128 labels only, PATRON achieves 91.0% and 92.1% of the fully supervised performance based on vanilla fine-tuning and prompt-based learning respectively. Our implementation of PATRON will be published upon acceptance.
COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning

Yue Yu, Chenyan Xiong, Si Sun, Chao Zhang, and Arnold Overwijk

Proceedings of EMNLP, 2022. (Oral)

Abs PDF Code

We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt the model to target distributions via COntinuous COntrastive learning. To prepare for unseen target queries, COCO-DR leverages implicit Distributionally Robust Optimization (iDRO) to reweight samples from different source query clusters for improving model robustness over rare queries during fine-tuning. COCO-DR achieves superior average performance on BEIR, the zero-shot retrieval benchmark. At BERT_Base scale, COCO-DR Base outperforms other ZeroDR models with 60x larger size. At BERT_Large scale, COCO-DR Large outperforms the giant GPT-3 embedding model which has 500x more parameters. Our analysis shows the correlation between COCO-DR’s effectiveness in combating distribution shifts and improving zero-shot accuracy. Our code and model can be found at https://github.com/OpenMatch/COCO-DR.
AcTune: Uncertainty-Based Active Self-Training for Active Fine-Tuning of Pretrained Language Models

Yue Yu, Lingkai Kong, Jieyu Zhang, Rongzhi Zhang, and Chao Zhang

Proceedings of NAACL, 2022. (Oral)

Abs PDF Code

Although fine-tuning pre-trained language models (PLMs) renders strong performance in many NLP tasks, it relies on excessive labeled data. Recently, researchers have resorted to active fine-tuning for enhancing the label efficiency of PLM fine-tuning, but existing methods of this type usually ignore the potential of unlabeled data. We develop AcTune, a new framework that improves the label efficiency of active PLM fine-tuning by unleashing the power of unlabeled data via self-training. AcTune switches between data annotation and model self-training based on uncertainty: the unlabeled samples of high-uncertainty are selected for annotation, while the ones from low-uncertainty regions are used for model self-training. Additionally, we design (1) a region-aware sampling strategy to avoid redundant samples when querying annotations and (2) a momentum-based memory bank to dynamically aggregate the model’s pseudo labels to suppress label noise in self-training. Experiments on 6 text classification datasets show that AcTune outperforms the strongest active learning and self-training baselines and improves the label efficiency of PLM fine-tuning by 56.2% on average. Our implementation is available at https://github.com/yueyu1030/actune.