Detailed Results of the Foundation Benchmark

Table 5 reports the accuracy of each model on every task in the foundation benchmark. Speaker Gender Recognition and Synthesized Voice Detection are binary-choice tasks; all other tasks require selecting one of four options. Random guessing therefore yields an expected accuracy of 50% on the two binary tasks and 25% on the remaining tasks. Accuracy close to these random baselines indicates that a model shows no discernible ability on the corresponding task.
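The random baselines follow directly from the number of answer options, and the arithmetic can be sketched in a few lines of Python. This is a minimal illustration and is not taken from the paper: the function names are hypothetical, and the exact binomial tail shown here is one common way to judge whether an observed accuracy is distinguishable from chance, not a procedure the authors describe.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def p_value_above_chance(correct: int, total: int, num_options: int) -> float:
    """One-sided exact binomial tail: P(X >= correct) under random guessing.

    Assumption for illustration only: every question in the task has the
    same number of options and guesses are independent.
    """
    p = chance_accuracy(num_options)
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Binary-choice tasks (e.g. Speaker Gender Recognition): 50% baseline.
    print(chance_accuracy(2))
    # Four-option tasks: 25% baseline.
    print(chance_accuracy(4))
    # Hypothetical counts, not results from the paper.
    print(p_value_above_chance(correct=30, total=100, num_options=4))
```

A large tail probability for a model's score means the score is consistent with guessing, which matches the reading of near-baseline accuracy given above.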

Table 5: The accuracy of each model across all tasks in the foundation benchmark.

This paper is available on arXiv under the CC BY 4.0 license.
