| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| AI alignment: A comprehensive survey | J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang, Y Duan, Z He, J Zhou, ... | arXiv preprint arXiv:2310.19852 | 133 | 2023 |
| Safety Gymnasium: A unified safe reinforcement learning benchmark | J Ji, B Zhang, J Zhou, X Pan, W Huang, R Sun, Y Geng, Y Zhong, J Dai, ... | Advances in Neural Information Processing Systems 36 | 47* | 2023 |
| OmniSafe: An infrastructure for accelerating safe reinforcement learning research | J Ji, J Zhou, B Zhang, J Dai, X Pan, R Sun, W Huang, Y Geng, M Liu, ... | arXiv preprint arXiv:2305.09304 | 27 | 2023 |
| Rethinking information structures in RLHF: Reward generalization from a graph theory perspective | T Qiu, F Zeng, J Ji, D Yan, K Wang, J Zhou, H Yang, J Dai, X Pan, Y Yang | arXiv preprint arXiv:2402.10184 | 4 | 2024 |
| Sequence to sequence reward modeling: Improving RLHF by language feedback | J Zhou, J Ji, J Dai, Y Yang | arXiv preprint arXiv:2409.00162 | | 2024 |
| Language models resist alignment | J Ji, K Wang, T Qiu, B Chen, J Zhou, C Li, H Lou, Y Yang | arXiv preprint arXiv:2406.06144 | | 2024 |