Abstract

Large language models (LMs) have been shown to capture large amounts of relational knowledge from the pre-training corpus. These models can be probed for this factual knowledge using cloze-style prompts, as demonstrated on the LAMA benchmark. However, recent studies have found that models perform well on such probes only because they are good at making educated guesses or recalling facts from the training data. We present a novel Wikidata-based benchmark dataset, KAMEL, for probing relational knowledge in LMs. In contrast to previous datasets, it covers a broader range of knowledge, probes for single- and multi-token entities, and contains facts with literal values. Furthermore, the evaluation procedure is more accurate, since the dataset contains alternative entity labels and deals with higher-cardinality relations. Instead of performing the evaluation on masked language models, we present results for a variety of recent causal LMs in a few-shot setting. We show that recent models indeed perform very well on LAMA, achieving a promising F1-score of 52.90%, while only achieving 17.62% on KAMEL. Our analysis shows that even large language models are far from being able to memorize all varieties of relational knowledge that is usually stored in knowledge graphs.
