Yue Yu | Publications

2024

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro

Proceedings of NeurIPS, 2024.

arXiv PDF
Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning

Yue Yu, Jiaming Shen, Tianqi Liu, Zhen Qin, Jing Nathan Yan, Jialu Liu, Chao Zhang, and Michael Bendersky

Proceedings of ACL, 2024.

arXiv PDF
ARL2: Aligning Retrievers with Black-box Large Language Models via Self-guided Adaptive Relevance Labeling

Lingxi Zhang, Yue Yu, Kuan Wang, and Chao Zhang

Proceedings of ACL, 2024.

arXiv PDF
RAM-EHR: Retrieval Augmentation Meets Clinical Predictions on Electronic Health Records

Ran Xu*, Wenqi Shi*, Yue Yu, Yuchen Zhuang, Bowen Jin, May D. Wang, Joyce C. Ho, and Carl Yang

Proceedings of ACL, 2024. (Oral)

arXiv PDF Code

2023

Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Yue Yu*, Yuchen Zhuang*, Jieyu Zhang*, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang

Proceedings of NeurIPS (D&B Track), 2023.

arXiv PDF Code
ToolQA: A Dataset for LLM Question Answering with External Tools

Yuchen Zhuang*, Yue Yu*, Kuan Wang*, Haotian Sun, and Chao Zhang

Proceedings of NeurIPS (D&B Track), 2023.

arXiv PDF Code
Cold-Start Data Selection for Better Few-shot Language Model Fine-tuning: A Prompt-based Uncertainty Propagation Approach

Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, and Chao Zhang

Proceedings of ACL, 2023.

Abs PDF Code

We present PATRON, a prompt-based data selection method for pre-trained language model fine-tuning under cold-start scenarios, i.e., when no initial labeled data are available. In PATRON, we design (1) a prompt-based uncertainty propagation approach to estimate the importance of data points and (2) a partition-then-rewrite (PTR) strategy to promote sample diversity when querying for annotations. Experiments on six text classification datasets show that PATRON outperforms the strongest cold-start data selection baselines by up to 6.9%. Moreover, with only 128 labels, PATRON achieves 91.0% and 92.1% of the fully supervised performance based on vanilla fine-tuning and prompt-based learning, respectively. Our implementation of PATRON will be published upon acceptance.
ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval

Yue Yu, Yuchen Zhuang, Rongzhi Zhang, Yu Meng, Jiaming Shen, and Chao Zhang

Proceedings of ACL Findings, 2023.

Abs PDF Code

With the development of large language models (LLMs), zero-shot learning has attracted much attention for various NLP tasks. Unlike prior work that generates training data with billion-scale natural language generation (NLG) models, we propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus. To realize this, we first conduct contrastive pretraining to learn an unsupervised dense retriever for extracting the most relevant documents using class-descriptive verbalizers. We then propose two simple strategies, namely Verbalizer Augmentation with Demonstrations and Self-consistency Guided Filtering, to improve the topic coverage of the dataset while removing noisy examples. Experiments on nine datasets demonstrate that ReGen achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared with baselines using large NLG models. In addition, ReGen can be naturally integrated with recently proposed large language models to boost performance.

2022

COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning

Yue Yu, Chenyan Xiong, Si Sun, Chao Zhang, and Arnold Overwijk

Proceedings of EMNLP, 2022. (Oral)

Abs PDF Code

We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt the model to target distributions via COntinuous COntrastive learning. To prepare for unseen target queries, COCO-DR leverages implicit Distributionally Robust Optimization (iDRO) to reweight samples from different source query clusters, improving model robustness over rare queries during fine-tuning. COCO-DR achieves superior average performance on BEIR, the zero-shot retrieval benchmark. At BERT_Base scale, COCO-DR Base outperforms other ZeroDR models that are 60x larger. At BERT_Large scale, COCO-DR Large outperforms the giant GPT-3 embedding model, which has 500x more parameters. Our analysis shows the correlation between COCO-DR’s effectiveness in combating distribution shifts and improving zero-shot accuracy. Our code and model can be found at https://github.com/OpenMatch/COCO-DR.
AcTune: Uncertainty-Based Active Self-Training for Active Fine-Tuning of Pretrained Language Models

Yue Yu, Lingkai Kong, Jieyu Zhang, Rongzhi Zhang, and Chao Zhang

Proceedings of NAACL, 2022. (Oral)

Abs PDF Code

Although fine-tuning pre-trained language models (PLMs) yields strong performance on many NLP tasks, it relies on large amounts of labeled data. Recently, researchers have turned to active fine-tuning to improve the label efficiency of PLM fine-tuning, but existing methods of this type usually ignore the potential of unlabeled data. We develop AcTune, a new framework that improves the label efficiency of active PLM fine-tuning by unleashing the power of unlabeled data via self-training. AcTune switches between data annotation and model self-training based on uncertainty: high-uncertainty unlabeled samples are selected for annotation, while samples from low-uncertainty regions are used for model self-training. Additionally, we design (1) a region-aware sampling strategy to avoid redundant samples when querying annotations and (2) a momentum-based memory bank to dynamically aggregate the model’s pseudo labels to suppress label noise in self-training. Experiments on 6 text classification datasets show that AcTune outperforms the strongest active learning and self-training baselines and improves the label efficiency of PLM fine-tuning by 56.2% on average. Our implementation is available at https://github.com/yueyu1030/actune.
Counterfactual and Factual Reasoning over Hypergraphs for Interpretable Clinical Predictions on EHR

Ran Xu, Yue Yu, Chao Zhang, Mohammed K Ali, Joyce C Ho, and Carl Yang

Proceedings of ML4H, 2022. (Best Paper Award)

Abs PDF Code

Electronic Health Record (EHR) modeling is crucial for digital medicine. However, existing models ignore higher-order interactions among medical codes and their causal relations to downstream clinical predictions. To address these limitations, we propose a novel framework, CACHE, to provide effective and insightful clinical predictions based on hypergraph representation learning and counterfactual and factual reasoning techniques. Experiments on two real EHR datasets show the superior performance of CACHE. Case studies with a domain expert illustrate CACHE's ability to generate clinically meaningful interpretations of its correct predictions.

2021

Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach

Yue Yu*, Simiao Zuo*, Haoming Jiang, Wendi Ren, Tuo Zhao, and Chao Zhang

Proceedings of NAACL, 2021. (Oral)

arXiv PDF Code
SumGNN: multi-typed drug interaction prediction via efficient knowledge graph summarization

Yue Yu*, Kexin Huang*, Chao Zhang, Lucas M Glass, Jimeng Sun, and Cao Xiao

Bioinformatics, 2021.

arXiv PDF Code
WRENCH: A Comprehensive Benchmark for Weak Supervision

Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner

Proceedings of NeurIPS (D&B Track), 2021. (Oral)

arXiv PDF Code

2020

STEAM: Self-supervised taxonomy expansion with mini-paths

Yue Yu, Yinghao Li, Jiaming Shen, Hao Feng, Jimeng Sun, and Chao Zhang

Proceedings of KDD, 2020. (Oral)

arXiv PDF Code
BOND: BERT-assisted open-domain named entity recognition with distant supervision

Chen Liang*, Yue Yu*, Haoming Jiang*, Siawpeng Er, Ruijia Wang, Tuo Zhao, and Chao Zhang

Proceedings of KDD, 2020. (Oral)

arXiv PDF Code

2019

Understanding Urban Dynamics via State-sharing Hidden Markov Model

Tong Xia*, Yue Yu*, Fengli Xu, Funing Sun, Diansheng Guo, Depeng Jin, and Yong Li

Proceedings of WWW, 2019.

PDF
Privacy-preserving cross-domain location recommendation

Chen Gao, Chao Huang, Yue Yu, Huandong Wang, Yong Li, and Depeng Jin

Proceedings of IMWUT/UbiComp, 2019.

PDF