Unable to reproduce the GPT-4 performance #2

I generated code with GPT-4 using the prompts in the dataset and evaluated it in the `codegeex/codegeex:0.1.23` image.

The result is `'pass@1': 47.1`, which does not match the `'pass@1': 64.3` reported in the paper.

<img width="1821" alt="image" src="https://github.com/THUDM/NaturalCodeBench/assets/8615337/d304f4fd-cabc-4a4e-8a79-560cde219bd9">

The GPT-4 output is attached here:
[gpt4_ncb_java_zh.jsonl.zip](https://github.com/user-attachments/files/15884262/gpt4_ncb_java_zh.jsonl.zip)
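
For reference when comparing the two numbers, here is a minimal sketch of how a pass@1 percentage can be computed with the commonly used unbiased pass@k estimator. It is an assumption that the benchmark's evaluation follows this formula. The function names and the `(n, c)` input format are hypothetical and are not taken from the NaturalCodeBench scripts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of samples that pass all tests.
    """
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k=1):
    """Average pass@k over problems, as a percentage.

    results: list of (n, c) pairs, one per problem (hypothetical format).
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative input only: two problems, one sample each, one passing.
    print(mean_pass_at_k([(1, 1), (1, 0)], k=1))
```

With a single sample per problem, this reduces to the fraction of problems whose generated code passes all tests, so a gap like 47.1 versus 64.3 would correspond to a different number of solved problems rather than a difference in the formula.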

