Evaluate LLM-generated COBOL

Overview

COBOLEval: LLM Evaluation for COBOL

COBOLEval is a dataset for evaluating the code generation abilities of Large Language Models on the COBOL programming language. It is a transpilation of the widely used HumanEval benchmark from Python into COBOL. This repo contains both the Python-to-COBOL transpiler and an evaluation harness for the dataset.
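
For a quick look at the dataset, the repo's data module exposes a read_problems() helper (it also appears in the Hugging Face example further down). A minimal sketch; the keys follow the HumanEval task-id convention and only the "prompt" field is assumed here:

import data

# Problems are keyed by HumanEval-style task ids; each value carries
# the COBOL prompt the model is asked to complete.
problems = data.read_problems()
print(f"{len(problems)} problems loaded")
print(problems["HumanEval/1"]["prompt"])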

Installation

COBOLEval uses GnuCOBOL to compile the generated COBOL solutions. Download version 3.2.0 from https://sourceforge.net/projects/gnucobol/files/ and follow the installation instructions.

Check that the installation was successful with:

>>> cobc -v
cobc (GnuCOBOL) 3.2.0

Using Python 3.10 or later:

python -m venv coboleval
source coboleval/bin/activate
pip install -r requirements.txt

To run the Python-to-COBOL transpiler, you'll also need to install Rust.

Usage

This program runs untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Following HumanEval, the execution call in evaluation.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner.
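
If you do choose to re-enable execution, keep the compile-and-run step tightly constrained. The sketch below is illustrative only, not the repo's actual harness: it writes a hypothetical solution to a temporary directory, compiles it with cobc, and runs the binary under a timeout.

import os
import subprocess
import tempfile

def run_cobol(source: str, timeout: float = 10.0) -> str:
    """Compile a COBOL source string with GnuCOBOL and run it with a timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "solution.cbl")
        exe = os.path.join(tmp, "solution")
        with open(src, "w") as f:
            f.write(source)
        # -x builds a standalone executable; -o names the output binary.
        subprocess.run(["cobc", "-x", "-o", exe, src], check=True, timeout=timeout)
        result = subprocess.run([exe], capture_output=True, text=True, timeout=timeout)
        return result.stdout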

Generate completions

Configure the model and the number of samples per problem in scripts/generate.py, then run:

if __name__ == "__main__":
    model = Model(name="gpt-4", samples_per_task=1)
    runner = OpenAIChat(model)
    runner.eval()

This will create a samples.jsonl file in preds/gpt-4 containing the generated COBOL solutions.
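
To spot-check the generations before scoring, you can read the file line by line. This assumes the usual HumanEval-style layout of one JSON object per line with task_id and completion fields:

import json

with open("preds/gpt-4/samples.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        # Show the task id and the first few lines of the generated COBOL.
        print(sample["task_id"])
        print("\n".join(sample["completion"].splitlines()[:5]))
        print("-" * 40)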

Calculate Pass@k

Configure the model and the number of samples in the entrypoint() function in scripts/evaluate_functional_correctness.py:

def entrypoint():
    all_results = []
    run_folders = ["gpt-4"]  # edit
    for folder in run_folders:
        all_results.append(eval(f"preds/{folder}", "1"))

    for res, folder in zip(all_results, run_folders):
        print(f"{folder}: {res}")

Outputs are written to preds/gpt-4/samples_results.jsonl and Pass@k is printed:

gpt-4: {'pass@1': 0.10273972602739725}
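
For reference, pass@k is conventionally computed with HumanEval's unbiased estimator: with n samples per problem and c of them passing, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A self-contained version of that estimator:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With samples_per_task=1, pass@1 is simply the fraction of solved problems.
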
Comments
  • Add topic tags

    I suggest adding the topics cobol, llm, humaneval in the About section, as explained at https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/classifying-your-repository-with-topics

    opened by Beliavsky 1
  • Support for FIM based prompting

    As mentioned in this blog post, there is a FIM-based way to prompt the models, followed by rearranging the prompt + output so that the Working Storage Section and Linkage Section are in the correct order. Do you plan to add support for this style of prompting in this repo?

    I am assuming that the Pass@1 and %Compile scores mentioned in the blog are without using FIM-based prompting?

    opened by varadhbhatnagar 1
  • Support for benchmarking HuggingFace models

    Hi @ggordonhall

    I can see in this blog post that CodeLlama and mAInframer have been benchmarked on COBOLEval. Is there any support in this repo for working directly with Huggingface model checkpoints?

    opened by varadhbhatnagar 1
  • Add BOS token in HF completion

    #2 finds different pass@1 results for mAInframer-7b compared to the one in the model card at https://huggingface.co/bloopai/mAInframer-7b (~4% vs the original ~6%).

    This PR fixes hf_complete to include the BOS token so the result can be reproduced.

    opened by rmuller-ml 0
  • Adding support for HF models

    Add support for huggingface models.

    Uses "cuda" as device. To test the 7b/13b/34b BloopAI models such as bloopai/mAInframer-7b, do:

        import data
        from generate import HuggingfaceInfill, HuggingfaceComplete
        from utils import Model
    
        problems = data.read_problems()
        prompt = problems["HumanEval/1"]["prompt"]
        print(f"Prompt:\n\n\n{prompt}\n\n\n")
    
        model = Model(name="bloopai/mAInframer-7b", tokenizer="codellama/CodeLlama-7b-hf", prefix_token="<PRE>", suffix_token="<SUF>", middle_token="<MID>", eos_token="</s>")
    
        infiller = HuggingfaceInfill(model)
        completion = infiller.solve({"prompt": prompt})
    
        print(f"Completion:\n\n\n{completion}\n\n\n")
    

    Prompt:

           IDENTIFICATION DIVISION.
           PROGRAM-ID. SEPARATE-PAREN-GROUPS.
    
           ENVIRONMENT DIVISION.
           
           INPUT-OUTPUT SECTION.
    
           DATA DIVISION.
    
           LINKAGE SECTION.
    
           01 LINKED-ITEMS.
               05 L-PAREN-STRING PIC X(100).
               05 RESULT OCCURS 100 TIMES INDEXED BY NI PIC X(100).
    
          * Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
          * separate those group into separate strings and return the list of those.
          * Separate groups are balanced (each open brace is properly closed) and not nested within each other
          * Ignore any spaces in the input string.
          * >>> separate_paren_groups('( ) (( )) (( )( ))')
          * ['()', '(())', '(()())']
          * 
    
          * Complete the WORKING-STORAGE SECTION and the PROCEDURE DIVISION
          * Store the result in the RESULT variable and mark the end of your program with END PROGRAM
    
           WORKING-STORAGE SECTION.
    

    Completion:

           IDENTIFICATION DIVISION.
           PROGRAM-ID. SEPARATE-PAREN-GROUPS.
    
           ENVIRONMENT DIVISION.
           
           INPUT-OUTPUT SECTION.
    
           DATA DIVISION.
           WORKING-STORAGE SECTION.
    
           01 WS-PAREN-STRING PIC X(100).
           01 WS-INDEX PIC 9(3) VALUE 1.
           01 WS-START-POSITION PIC 9(3) VALUE 1.
           01 WS-OPEN-COUNT PIC 9(3) VALUE 0.
           01 WS-RESULT PIC X(100).
    
           LINKAGE SECTION.
    
           01 LINKED-ITEMS.
               05 L-PAREN-STRING PIC X(100).
               05 RESULT OCCURS 100 TIMES INDEXED BY NI PIC X(100).
    
          * Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
          * separate those group into separate strings and return the list of those.
          * Separate groups are balanced (each open brace is properly closed) and not nested within each other
          * Ignore any spaces in the input string.
          * >>> separate_paren_groups('( ) (( )) (( )( ))')
          * ['()', '(())', '(()())']
          * 
    
          * Store the result in the RESULT variable and mark the end of your program with END PROGRAM
    
           PROCEDURE DIVISION USING LINKED-ITEMS.
               MOVE L-PAREN-STRING TO WS-PAREN-STRING.
               PERFORM UNTIL WS-INDEX > LENGTH OF WS-PAREN-STRING
                   IF WS-PAREN-STRING(WS-INDEX:1) = '(' THEN
                       ADD 1 TO WS-OPEN-COUNT
                       STRING WS-PAREN-STRING(WS-START-POSITION:WS-INDEX - WS-START-POSITION + 1) DELIMITED BY SIZE INTO WS-RESULT
                       MOVE WS-INDEX TO WS-START-POSITION
                       MOVE WS-RESULT TO RESULT(NI)
                       ADD 1 TO NI
                   ELSE IF WS-PAREN-STRING(WS-INDEX:1) = ')' THEN
                       SUBTRACT 1 FROM WS-OPEN-COUNT
                       IF WS-OPEN-COUNT < 0 THEN
                           DISPLAY 'ERROR: Unbalanced Parentheses'
                           GOBACK
                       END-IF
                   END-IF
                   ADD 1 TO WS-INDEX
               END-PERFORM.
               GOBACK.
    
           END PROGRAM SEPARATE-PAREN-GROUPS.
    

    And the intermediate completion (before splitting it into the WORKING-STORAGE SECTION and PROCEDURE DIVISION):

           PROCEDURE DIVISION USING LINKED-ITEMS.
               MOVE L-PAREN-STRING TO WS-PAREN-STRING.
               PERFORM UNTIL WS-INDEX > LENGTH OF WS-PAREN-STRING
                   IF WS-PAREN-STRING(WS-INDEX:1) = '(' THEN
                       ADD 1 TO WS-OPEN-COUNT
                       STRING WS-PAREN-STRING(WS-START-POSITION:WS-INDEX - WS-START-POSITION + 1) DELIMITED BY SIZE INTO WS-RESULT
                       MOVE WS-INDEX TO WS-START-POSITION
                       MOVE WS-RESULT TO RESULT(NI)
                       ADD 1 TO NI
                   ELSE IF WS-PAREN-STRING(WS-INDEX:1) = ')' THEN
                       SUBTRACT 1 FROM WS-OPEN-COUNT
                       IF WS-OPEN-COUNT < 0 THEN
                           DISPLAY 'ERROR: Unbalanced Parentheses'
                           GOBACK
                       END-IF
                   END-IF
                   ADD 1 TO WS-INDEX
               END-PERFORM.
               GOBACK.
    
           END PROGRAM SEPARATE-PAREN-GROUPS.
    
    <MID>       WORKING-STORAGE SECTION.
    
           01 WS-PAREN-STRING PIC X(100).
           01 WS-INDEX PIC 9(3) VALUE 1.
           01 WS-START-POSITION PIC 9(3) VALUE 1.
           01 WS-OPEN-COUNT PIC 9(3) VALUE 0.
           01 WS-RESULT PIC X(100).
    
    opened by rmuller-ml 0