Evaluate LLM-generated COBOL

Overview

COBOLEval: LLM Evaluation for COBOL

COBOLEval is a dataset for evaluating the code generation abilities of Large Language Models on the COBOL programming language. It is a transpilation of the widely used HumanEval benchmark from Python into COBOL. This repo contains both the Python-to-COBOL transpiler and an evaluation harness for the dataset.
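
For a quick look at the dataset, the repo's data module exposes a read_problems() helper (it also appears in the Hugging Face example further down). A minimal sketch; the keys follow the HumanEval task-id convention and only the "prompt" field is assumed here:

import data

# Problems are keyed by HumanEval-style task ids; each value carries
# the COBOL prompt the model is asked to complete.
problems = data.read_problems()
print(f"{len(problems)} problems loaded")
print(problems["HumanEval/1"]["prompt"])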

Installation

COBOLEval uses GnuCOBOL to compile the generated COBOL solutions. Download version 3.2.0 from https://sourceforge.net/projects/gnucobol/files/ and follow the installation instructions.

Check that the installation was successful with:

>>> cobc -v
cobc (GnuCOBOL) 3.2.0

Using Python 3.10 or later:

python -m venv coboleval
source coboleval/bin/activate
pip install -r requirements.txt

To run the Python-to-COBOL transpiler, you'll also need to install Rust.

Usage

This program runs untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Following HumanEval, the execution call in evaluation.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner.
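
If you do choose to re-enable execution, keep the compile-and-run step tightly constrained. The sketch below is illustrative only, not the repo's actual harness: it writes a hypothetical solution to a temporary directory, compiles it with cobc, and runs the binary under a timeout.

import os
import subprocess
import tempfile

def run_cobol(source: str, timeout: float = 10.0) -> str:
    """Compile a COBOL source string with GnuCOBOL and run it with a timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "solution.cbl")
        exe = os.path.join(tmp, "solution")
        with open(src, "w") as f:
            f.write(source)
        # -x builds a standalone executable; -o names the output binary.
        subprocess.run(["cobc", "-x", "-o", exe, src], check=True, timeout=timeout)
        result = subprocess.run([exe], capture_output=True, text=True, timeout=timeout)
        return result.stdout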

Generate completions

Configure the model and the number of samples per problem in scripts/generate.py, then run:

if __name__ == "__main__":
    model = Model(name="gpt-4", samples_per_task=1)
    runner = OpenAIChat(model)
    runner.eval()

This will create a samples.jsonl file in preds/gpt-4 containing the generated COBOL solutions.
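
To spot-check the generations before scoring, you can read the file line by line. This assumes the usual HumanEval-style layout of one JSON object per line with task_id and completion fields:

import json

with open("preds/gpt-4/samples.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        # Show the task id and the first few lines of the generated COBOL.
        print(sample["task_id"])
        print("\n".join(sample["completion"].splitlines()[:5]))
        print("-" * 40)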

Calculate Pass@k

Configure the model and the number of samples in the entrypoint() function in scripts/evaluate_functional_correctness.py:

def entrypoint():
    all_results = []
    run_folders = ["gpt-4"]  # edit
    for folder in run_folders:
        all_results.append(eval(f"preds/{folder}", "1"))

    for res, folder in zip(all_results, run_folders):
        print(f"{folder}: {res}")

Outputs are written to preds/gpt-4/samples_results.jsonl and Pass@k is printed:

gpt-4: {'pass@1': 0.10273972602739725}
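
For reference, pass@k is conventionally computed with HumanEval's unbiased estimator: with n samples per problem and c of them passing, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A self-contained version of that estimator:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With samples_per_task=1, pass@1 is simply the fraction of solved problems.
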
Comments
  • Add topic tags

    I suggest adding the topics cobol, llm, humaneval in the About section, as explained at https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/classifying-your-repository-with-topics

    opened by Beliavsky 1
  • Support for FIM based prompting

    As mentioned in this blog post, there is a FIM-based way to prompt the models, followed by rearranging the prompt + output so that the Working Storage Section and Linkage Section are in the correct order. Do you plan to add support for this style of prompting in this repo?

    I am assuming that the Pass@1 and %Compile scores mentioned in the blog are without using FIM-based prompting?

    opened by varadhbhatnagar 1
  • Support for benchmarking HuggingFace models

    Hi @ggordonhall

    I can see in this blog post that CodeLlama and mAInframer have been benchmarked on COBOLEval. Is there any support in this repo for working directly with Huggingface model checkpoints?

    opened by varadhbhatnagar 1
  • Add BOS token in HF completion

    #2 finds different pass@1 results for mAInframer-7b compared to the one in the model card at https://huggingface.co/bloopai/mAInframer-7b (~4% vs the original ~6%).

    This PR fixes hf_complete to include the BOS token so the result can be reproduced.

    opened by rmuller-ml 0
  • Adding support for HF models

    Add support for huggingface models.

    Uses "cuda" as device. To test the 7b/13b/34b BloopAI models such as bloopai/mAInframer-7b, do:

        import data
        from generate import HuggingfaceInfill, HuggingfaceComplete
        from utils import Model
    
        problems = data.read_problems()
        prompt = problems["HumanEval/1"]["prompt"]
        print(f"Prompt:\n\n\n{prompt}\n\n\n")
    
        model = Model(name="bloopai/mAInframer-7b", tokenizer="codellama/CodeLlama-7b-hf", prefix_token="<PRE>", suffix_token="<SUF>", middle_token="<MID>", eos_token="</s>")
    
        infiller = HuggingfaceInfill(model)
        completion = infiller.solve({"prompt": prompt})
    
        print(f"Completion:\n\n\n{completion}\n\n\n")
    

    Prompt:

           IDENTIFICATION DIVISION.
           PROGRAM-ID. SEPARATE-PAREN-GROUPS.
    
           ENVIRONMENT DIVISION.
           
           INPUT-OUTPUT SECTION.
    
           DATA DIVISION.
    
           LINKAGE SECTION.
    
           01 LINKED-ITEMS.
               05 L-PAREN-STRING PIC X(100).
               05 RESULT OCCURS 100 TIMES INDEXED BY NI PIC X(100).
    
          * Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
          * separate those group into separate strings and return the list of those.
          * Separate groups are balanced (each open brace is properly closed) and not nested within each other
          * Ignore any spaces in the input string.
          * >>> separate_paren_groups('( ) (( )) (( )( ))')
          * ['()', '(())', '(()())']
          * 
    
          * Complete the WORKING-STORAGE SECTION and the PROCEDURE DIVISION
          * Store the result in the RESULT variable and mark the end of your program with END PROGRAM
    
           WORKING-STORAGE SECTION.
    

    Completion:

           IDENTIFICATION DIVISION.
           PROGRAM-ID. SEPARATE-PAREN-GROUPS.
    
           ENVIRONMENT DIVISION.
           
           INPUT-OUTPUT SECTION.
    
           DATA DIVISION.
           WORKING-STORAGE SECTION.
    
           01 WS-PAREN-STRING PIC X(100).
           01 WS-INDEX PIC 9(3) VALUE 1.
           01 WS-START-POSITION PIC 9(3) VALUE 1.
           01 WS-OPEN-COUNT PIC 9(3) VALUE 0.
           01 WS-RESULT PIC X(100).
    
           LINKAGE SECTION.
    
           01 LINKED-ITEMS.
               05 L-PAREN-STRING PIC X(100).
               05 RESULT OCCURS 100 TIMES INDEXED BY NI PIC X(100).
    
          * Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
          * separate those group into separate strings and return the list of those.
          * Separate groups are balanced (each open brace is properly closed) and not nested within each other
          * Ignore any spaces in the input string.
          * >>> separate_paren_groups('( ) (( )) (( )( ))')
          * ['()', '(())', '(()())']
          * 
    
          * Store the result in the RESULT variable and mark the end of your program with END PROGRAM
    
           PROCEDURE DIVISION USING LINKED-ITEMS.
               MOVE L-PAREN-STRING TO WS-PAREN-STRING.
               PERFORM UNTIL WS-INDEX > LENGTH OF WS-PAREN-STRING
                   IF WS-PAREN-STRING(WS-INDEX:1) = '(' THEN
                       ADD 1 TO WS-OPEN-COUNT
                       STRING WS-PAREN-STRING(WS-START-POSITION:WS-INDEX - WS-START-POSITION + 1) DELIMITED BY SIZE INTO WS-RESULT
                       MOVE WS-INDEX TO WS-START-POSITION
                       MOVE WS-RESULT TO RESULT(NI)
                       ADD 1 TO NI
                   ELSE IF WS-PAREN-STRING(WS-INDEX:1) = ')' THEN
                       SUBTRACT 1 FROM WS-OPEN-COUNT
                       IF WS-OPEN-COUNT < 0 THEN
                           DISPLAY 'ERROR: Unbalanced Parentheses'
                           GOBACK
                       END-IF
                   END-IF
                   ADD 1 TO WS-INDEX
               END-PERFORM.
               GOBACK.
    
           END PROGRAM SEPARATE-PAREN-GROUPS.
    

    And the intermediate completion (before splitting it into the WORKING-STORAGE SECTION and PROCEDURE DIVISION):

           PROCEDURE DIVISION USING LINKED-ITEMS.
               MOVE L-PAREN-STRING TO WS-PAREN-STRING.
               PERFORM UNTIL WS-INDEX > LENGTH OF WS-PAREN-STRING
                   IF WS-PAREN-STRING(WS-INDEX:1) = '(' THEN
                       ADD 1 TO WS-OPEN-COUNT
                       STRING WS-PAREN-STRING(WS-START-POSITION:WS-INDEX - WS-START-POSITION + 1) DELIMITED BY SIZE INTO WS-RESULT
                       MOVE WS-INDEX TO WS-START-POSITION
                       MOVE WS-RESULT TO RESULT(NI)
                       ADD 1 TO NI
                   ELSE IF WS-PAREN-STRING(WS-INDEX:1) = ')' THEN
                       SUBTRACT 1 FROM WS-OPEN-COUNT
                       IF WS-OPEN-COUNT < 0 THEN
                           DISPLAY 'ERROR: Unbalanced Parentheses'
                           GOBACK
                       END-IF
                   END-IF
                   ADD 1 TO WS-INDEX
               END-PERFORM.
               GOBACK.
    
           END PROGRAM SEPARATE-PAREN-GROUPS.
    
    <MID>       WORKING-STORAGE SECTION.
    
           01 WS-PAREN-STRING PIC X(100).
           01 WS-INDEX PIC 9(3) VALUE 1.
           01 WS-START-POSITION PIC 9(3) VALUE 1.
           01 WS-OPEN-COUNT PIC 9(3) VALUE 0.
           01 WS-RESULT PIC X(100).
    
    opened by rmuller-ml 0