
Thinking Cobra with Scratchpad and CoT #32

Open

AndrasFerenczy wants to merge 42 commits into OpenHelix-Team:main from AndrasFerenczy:main

Conversation

@AndrasFerenczy

No description provided.

andrasferenczy and others added 30 commits November 8, 2025 00:12
…nd save BLEU scores with timestamps. Add new entries to .gitignore for output files.
…ctionality for clearing GPU memory and saving BLEU scores has been integrated into the main workflow.
Update notebook: Clear GPU RAM and save image of results
This commit introduces a comprehensive set of files and scripts for fine-tuning the Cobra VLM on the LLaVA-CoT-100k dataset. Key additions include:

- **Dataset Preparation**: A script to download, validate, and prepare the dataset for training.
- **Custom Dataset Loader**: A new loader that supports JSONL format and integrates with existing training infrastructure.
- **Fine-Tuning Script**: A dedicated script for fine-tuning the model using the prepared dataset.
- **Documentation**: Detailed guides and summaries for setup and usage.

These changes enhance the model's reasoning capabilities by leveraging structured reasoning annotations in the dataset.
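A custom JSONL loader along the lines described above can be sketched as follows. This is a minimal illustration, not the PR's actual loader: `load_jsonl_dataset` is a hypothetical name, and the real implementation additionally validates fields and plugs into the existing training infrastructure.

```python
import json
from pathlib import Path

def load_jsonl_dataset(path):
    """Load a JSONL file where each non-blank line is one training example."""
    examples = []
    with Path(path).open() as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                examples.append(json.loads(line))
            except json.JSONDecodeError as e:
                raise ValueError(f"Malformed JSON on line {line_no}: {e}") from e
    return examples
```

Failing loudly with the offending line number makes dataset-preparation bugs much easier to track down than silently skipping bad records.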
This commit introduces the foundational structure for the Cobra Evaluation System, including:

- **Main Module**: The entry point for running evaluations with command-line argument parsing.
- **Configuration Management**: A dedicated module for handling CLI arguments and settings.
- **Registry System**: A registry for managing generators and metrics, allowing for extensibility.
- **Generators**: Implementations for baseline, scratchpad, and external generation methods.
- **Metrics**: Initial implementations for BLEU and BERTScore metrics.
- **Utilities**: Functions for GPU management, JSON I/O, and visualization of results.

These changes establish a modular framework for evaluating visual language models with various inference strategies and metrics, enhancing the system's extensibility and usability.
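A minimal version of such a registry might look like the sketch below; the names `Registry`, `GENERATORS`, and `BaselineGenerator` are illustrative, not the actual identifiers in the PR.

```python
class Registry:
    """Name-to-class registry so new generators/metrics plug in by decoration."""

    def __init__(self):
        self._entries = {}

    def register(self, name):
        def decorator(cls):
            if name in self._entries:
                raise KeyError(f"'{name}' is already registered")
            self._entries[name] = cls
            return cls
        return decorator

    def build(self, name, **kwargs):
        if name not in self._entries:
            raise KeyError(f"Unknown entry '{name}'; choices: {sorted(self._entries)}")
        return self._entries[name](**kwargs)

GENERATORS = Registry()

@GENERATORS.register("baseline")
class BaselineGenerator:
    def __init__(self, max_tokens=128):
        self.max_tokens = max_tokens
```

With this pattern, adding a new inference method is a matter of defining a class and registering it under a CLI-selectable name.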
This commit introduces significant improvements to the evaluation process, including:

- **Method Comparison**: Added functionality to run and compare results from multiple methods (baseline and scratchpad) within the same evaluation session.
- **Visualization Enhancements**: Implemented a new comparison visualization that displays results side-by-side for easier analysis of generated captions and metrics.
- **BERTScore Metric Updates**: Enhanced the BERTScore metric to store per-sample scores, allowing for detailed performance analysis.
- **Code Refactoring**: Cleaned up the main evaluation logic for better readability and maintainability.

These changes improve the usability and analytical capabilities of the Cobra Evaluation System, facilitating more comprehensive evaluations of visual language models.
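Keeping per-sample scores alongside the aggregate, as the BERTScore update does, can be done with a small wrapper. This sketch substitutes a generic `score_fn` for the real BERTScore call, so the class name and interface are assumptions, not the PR's code.

```python
class PerSampleMetric:
    """Wraps a pairwise scoring function and retains every per-sample score."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.per_sample = []

    def update(self, prediction, reference):
        score = self.score_fn(prediction, reference)
        self.per_sample.append(score)
        return score

    def summary(self):
        n = len(self.per_sample)
        return {
            "mean": sum(self.per_sample) / n if n else 0.0,
            "n": n,
            "per_sample": list(self.per_sample),  # enables per-example analysis
        }
```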
This commit introduces several improvements to the evaluation process, including:

- **Dynamic Output Directories**: Results are now saved in timestamped directories for better organization, allowing users to easily manage multiple runs.
- **Comparison Statistics**: Added functionality to compute and save comparison statistics between baseline and scratchpad methods, including win rates and metric differences.
- **Visualization Updates**: Enhanced the comparison visualization to include detailed metrics and reasoning traces, improving the clarity of results.

These changes make individual runs easier to organize, track, and compare across methods.
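The comparison statistics described above (win rates and metric differences) reduce to a per-sample paired comparison. A minimal sketch, assuming two equal-length lists of per-sample scores:

```python
def compare_methods(baseline_scores, scratchpad_scores):
    """Paired per-sample comparison of two methods on the same examples."""
    assert len(baseline_scores) == len(scratchpad_scores), "need paired samples"
    diffs = [s - b for b, s in zip(baseline_scores, scratchpad_scores)]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)   # scratchpad strictly better
    ties = sum(d == 0 for d in diffs)
    return {
        "win_rate": wins / n,
        "tie_rate": ties / n,
        "mean_diff": sum(diffs) / n,   # average metric improvement
    }
```

Pairing on the same examples matters: it removes per-example difficulty as a confound when comparing the two methods.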
This commit modifies the .gitignore file to ensure that shell scripts are ignored and removes the output.png file, which is no longer needed. These changes help streamline the project by keeping unnecessary files out of version control.
This commit modifies the .gitignore file to include the __pycache__ directory, ensuring that Python bytecode files are not tracked in version control. This helps maintain a cleaner project structure by excluding unnecessary files.
This commit introduces several new files and enhancements to the Cobra Evaluation System, including:

- **New Scripts**: Added `analyze_significance.py`, `compare_scratchpad_passes.py`, and `visualize_scratchpad_passes.py` for analyzing and visualizing scratchpad performance across multiple passes.
- **Checkpointing Guide**: Introduced `CHECKPOINTING_GUIDE.md` to document the new automatic checkpointing feature for long-running evaluations.
- **Improved Documentation**: Added `SCRATCHPAD_COMPARE_MODE.md`, `SCRATCHPAD_DEGRADATION_ANALYSIS.md`, and `SCRATCHPAD_IMPROVEMENTS.md` to provide insights into scratchpad methods and their performance.
- **New Data Files**: Included various JSON and PNG files for results and visualizations from recent evaluations.

These changes enhance the analytical capabilities and usability of the evaluation system, facilitating better understanding and comparison of different methods in visual language models.
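Significance analysis between two paired sets of per-sample scores is often done with a paired bootstrap; the sketch below is self-contained and illustrative, and the actual `analyze_significance.py` may use a different test entirely.

```python
import random

def paired_bootstrap(a, b, n_resamples=10000, seed=0):
    """Fraction of paired resamples in which method b's total beats method a's."""
    assert len(a) == len(b), "scores must be paired per example"
    rng = random.Random(seed)  # fixed seed for reproducible analysis
    n = len(a)
    b_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        if sum(b[i] for i in idx) > sum(a[i] for i in idx):
            b_better += 1
    return b_better / n_resamples
```

A result near 1.0 (or near 0.0) suggests the observed difference is robust to which examples happened to be sampled; values near 0.5 suggest noise.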
philosophercode and others added 9 commits December 1, 2025 02:12
This commit introduces support for external model API clients, allowing users to run evaluations using models such as GPT-5, Gemini, Claude, and Llama. Key changes include:

- **New Inference Methods**: Added options for external models in the evaluation workflow.
- **API Key Management**: Introduced command-line arguments for specifying API keys and model configurations.
- **Conditional Model Loading**: Updated the main evaluation logic to skip local model loading when using external models.
- **Checkpointing Improvements**: Enhanced checkpointing functionality to support overwriting the latest checkpoint file.

These updates significantly expand the evaluation options of the Cobra Evaluation System by enabling comparison against external AI models alongside local inference.
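The overwrite-latest checkpoint behavior can be sketched as a rolling JSON file; the function and file names here are illustrative assumptions, not the PR's actual code.

```python
import json
from pathlib import Path

def save_checkpoint(results, out_dir, overwrite_latest=True):
    """Persist partial results so a long evaluation can resume.

    With overwrite_latest, one rolling file replaces per-step files,
    so long runs do not accumulate checkpoints on disk.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    name = ("checkpoint_latest.json" if overwrite_latest
            else f"checkpoint_{len(results):05d}.json")
    path = out / name
    path.write_text(json.dumps({"n_done": len(results), "results": results}))
    return path
```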
…ing scripts as well as my install requirements script.
…r_it

MMStar and accuracy evaluation added (benchmarked on 1000 images from both COCO and MMStar)
@h-zhao1997
Collaborator

Hi @AndrasFerenczy, @philosophercode, and @MatteoPerona,

Thank you so much for submitting this PR to this project! The effort you put into integrating LoRA adapters into the training pipeline and experimenting on reasoning datasets demonstrates a promising direction for extending the model’s capabilities.

However, after carefully reviewing your contribution, I’ve decided not to merge this feature directly into the main branch at this time. The main reason is that our priority is to keep the code and experiments in the main repository as clear and simple as possible, ensuring we can reliably reproduce the core results from the paper.

That said, I genuinely believe your work is highly valuable and deserves visibility. Rather than merging this into the current repository, I recommend you maintain these excellent improvements as a standalone fork or a new, independent repository.

Once again, thank you sincerely for your insightful contribution. I look forward to seeing your work grow into an exciting project of its own!

All the best,
Han Zhao


4 participants