HyperAI

ProCQA Community-based Programming Question Answering Dataset

Date

a year ago

Size

2.34 GB

Organization

Beijing University of Aeronautics and Astronautics

Publish URL

github.com

ProCQA is a large-scale programming question-answering dataset created by Beihang University. It contains about 5 million question-answer pairs covering 11 programming languages, including Python, Java, and JavaScript. The questions and answers span multiple knowledge areas, such as algorithms, frameworks, and library usage. The data comes from the StackOverflow community. The researchers collected it with a web crawler and applied strict rule-based filtering, discarding questions that are too short or too long and retaining only answers accepted by the questioner, to ensure the quality and fairness of the data. The question-answer pairs in ProCQA are naturally mixed-modal: text and code are interleaved within the question and answer fields, which provides a natural supervision signal for aligning the two modalities. The dataset can serve as both an evaluation benchmark and a pre-training corpus, making it a useful resource for code retrieval and question-answering tasks.
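The rule-based filtering described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `QAPair` record, its field names, and the length thresholds are all hypothetical, since this page does not state the actual schema or cutoffs.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical length bounds in characters; the real thresholds are not given here.
MIN_QUESTION_CHARS = 20
MAX_QUESTION_CHARS = 2000


@dataclass
class QAPair:
    """Hypothetical record for one crawled question-answer pair."""
    question: str
    answer: str
    accepted: bool  # True if the questioner accepted this answer


def filter_pairs(
    pairs: Iterable[QAPair],
    min_chars: int = MIN_QUESTION_CHARS,
    max_chars: int = MAX_QUESTION_CHARS,
) -> Iterator[QAPair]:
    """Yield only pairs that pass both rules described in the text."""
    for pair in pairs:
        length = len(pair.question.strip())
        # Rule 1: drop questions that are too short or too long.
        if length < min_chars or length > max_chars:
            continue
        # Rule 2: keep only answers accepted by the questioner.
        if not pair.accepted:
            continue
        yield pair


if __name__ == "__main__":
    sample = [
        QAPair("How do I reverse a list in Python?", "Use `xs[::-1]`.", True),
        QAPair("Help?", "Please add details.", True),
        QAPair("How do I reverse a list in Python?", "Try a loop.", False),
    ]
    for kept in filter_pairs(sample):
        print(kept.question, "->", kept.answer)
```

Writing the filter as a generator keeps memory use flat, which matters for a corpus of several million pairs; the bounds are passed as parameters so they can be tuned per language.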

ProCQA.torrent
Seeding: 1 | Downloading: 0 | Completed: 96 | Total downloads: 195
  • ProCQA/
    • README.md
      1.56 KB
    • README.txt
      3.13 KB
    • data/
      • procqa.zip
        2.34 GB