
Thinking Cobra with Scratchpad and CoT #32

Open

AndrasFerenczy wants to merge 42 commits into OpenHelix-Team:main from AndrasFerenczy:main

Conversation

@AndrasFerenczy

No description provided.

andrasferenczy and others added 30 commits November 8, 2025 00:12
…nd save BLEU scores with timestamps. Add new entries to .gitignore for output files.
…ctionality for clearing GPU memory and saving BLEU scores has been integrated into the main workflow.
Update notebook: Clear GPU RAM and save image of results
This commit introduces a comprehensive set of files and scripts for fine-tuning the Cobra VLM on the LLaVA-CoT-100k dataset. Key additions include:

- **Dataset Preparation**: A script to download, validate, and prepare the dataset for training.
- **Custom Dataset Loader**: A new loader that supports JSONL format and integrates with existing training infrastructure.
- **Fine-Tuning Script**: A dedicated script for fine-tuning the model using the prepared dataset.
- **Documentation**: Detailed guides and summaries for setup and usage.

These changes enhance the model's reasoning capabilities by leveraging structured reasoning annotations in the dataset.
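A custom JSONL loader along the lines described above can be sketched as follows. This is a minimal illustration, not the PR's actual loader: `load_jsonl_dataset` is a hypothetical name, and the real implementation additionally validates fields and plugs into the existing training infrastructure.

```python
import json
from pathlib import Path

def load_jsonl_dataset(path):
    """Load a JSONL file where each non-blank line is one training example."""
    examples = []
    with Path(path).open() as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                examples.append(json.loads(line))
            except json.JSONDecodeError as e:
                raise ValueError(f"Malformed JSON on line {line_no}: {e}") from e
    return examples
```

Failing loudly with the offending line number makes dataset-preparation bugs much easier to track down than silently skipping bad records.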
This commit introduces the foundational structure for the Cobra Evaluation System, including:

- **Main Module**: The entry point for running evaluations with command-line argument parsing.
- **Configuration Management**: A dedicated module for handling CLI arguments and settings.
- **Registry System**: A registry for managing generators and metrics, allowing for extensibility.
- **Generators**: Implementations for baseline, scratchpad, and external generation methods.
- **Metrics**: Initial implementations for BLEU and BERTScore metrics.
- **Utilities**: Functions for GPU management, JSON I/O, and visualization of results.

These changes establish a modular framework for evaluating visual language models with various inference strategies and metrics, enhancing the system's extensibility and usability.
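A minimal version of such a registry might look like the sketch below; the names `Registry`, `GENERATORS`, and `BaselineGenerator` are illustrative, not the actual identifiers in the PR.

```python
class Registry:
    """Name-to-class registry so new generators/metrics plug in by decoration."""

    def __init__(self):
        self._entries = {}

    def register(self, name):
        def decorator(cls):
            if name in self._entries:
                raise KeyError(f"'{name}' is already registered")
            self._entries[name] = cls
            return cls
        return decorator

    def build(self, name, **kwargs):
        if name not in self._entries:
            raise KeyError(f"Unknown entry '{name}'; choices: {sorted(self._entries)}")
        return self._entries[name](**kwargs)

GENERATORS = Registry()

@GENERATORS.register("baseline")
class BaselineGenerator:
    def __init__(self, max_tokens=128):
        self.max_tokens = max_tokens
```

With this pattern, adding a new inference method is a matter of defining a class and registering it under a CLI-selectable name.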
This commit introduces significant improvements to the evaluation process, including:

- **Method Comparison**: Added functionality to run and compare results from multiple methods (baseline and scratchpad) within the same evaluation session.
- **Visualization Enhancements**: Implemented a new comparison visualization that displays results side-by-side for easier analysis of generated captions and metrics.
- **BERTScore Metric Updates**: Enhanced the BERTScore metric to store per-sample scores, allowing for detailed performance analysis.
- **Code Refactoring**: Cleaned up the main evaluation logic for better readability and maintainability.

These changes improve the usability and analytical capabilities of the Cobra Evaluation System, facilitating more comprehensive evaluations of visual language models.
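Keeping per-sample scores alongside the aggregate, as the BERTScore update does, can be done with a small wrapper. This sketch substitutes a generic `score_fn` for the real BERTScore call, so the class name and interface are assumptions, not the PR's code.

```python
class PerSampleMetric:
    """Wraps a pairwise scoring function and retains every per-sample score."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.per_sample = []

    def update(self, prediction, reference):
        score = self.score_fn(prediction, reference)
        self.per_sample.append(score)
        return score

    def summary(self):
        n = len(self.per_sample)
        return {
            "mean": sum(self.per_sample) / n if n else 0.0,
            "n": n,
            "per_sample": list(self.per_sample),  # enables per-example analysis
        }
```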
This commit introduces several improvements to the evaluation process, including:

- **Dynamic Output Directories**: Results are now saved in timestamped directories for better organization, allowing users to easily manage multiple runs.
- **Comparison Statistics**: Added functionality to compute and save comparison statistics between baseline and scratchpad methods, including win rates and metric differences.
- **Visualization Updates**: Enhanced the comparison visualization to include detailed metrics and reasoning traces, improving the clarity of results.

These changes make individual runs easier to organize, track, and compare across methods.
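The comparison statistics described above (win rates and metric differences) reduce to a per-sample paired comparison. A minimal sketch, assuming two equal-length lists of per-sample scores:

```python
def compare_methods(baseline_scores, scratchpad_scores):
    """Paired per-sample comparison of two methods on the same examples."""
    assert len(baseline_scores) == len(scratchpad_scores), "need paired samples"
    diffs = [s - b for b, s in zip(baseline_scores, scratchpad_scores)]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)   # scratchpad strictly better
    ties = sum(d == 0 for d in diffs)
    return {
        "win_rate": wins / n,
        "tie_rate": ties / n,
        "mean_diff": sum(diffs) / n,   # average metric improvement
    }
```

Pairing on the same examples matters: it removes per-example difficulty as a confound when comparing the two methods.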
This commit modifies the .gitignore file to ensure that shell scripts are ignored and removes the output.png file, which is no longer needed. These changes help streamline the project by keeping unnecessary files out of version control.
This commit modifies the .gitignore file to include the __pycache__ directory, ensuring that Python bytecode files are not tracked in version control. This helps maintain a cleaner project structure by excluding unnecessary files.
This commit introduces several new files and enhancements to the Cobra Evaluation System, including:

- **New Scripts**: Added `analyze_significance.py`, `compare_scratchpad_passes.py`, and `visualize_scratchpad_passes.py` for analyzing and visualizing scratchpad performance across multiple passes.
- **Checkpointing Guide**: Introduced `CHECKPOINTING_GUIDE.md` to document the new automatic checkpointing feature for long-running evaluations.
- **Improved Documentation**: Added `SCRATCHPAD_COMPARE_MODE.md`, `SCRATCHPAD_DEGRADATION_ANALYSIS.md`, and `SCRATCHPAD_IMPROVEMENTS.md` to provide insights into scratchpad methods and their performance.
- **New Data Files**: Included various JSON and PNG files for results and visualizations from recent evaluations.

These changes enhance the analytical capabilities and usability of the evaluation system, facilitating better understanding and comparison of different methods in visual language models.
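Significance analysis between two paired sets of per-sample scores is often done with a paired bootstrap; the sketch below is self-contained and illustrative, and the actual `analyze_significance.py` may use a different test entirely.

```python
import random

def paired_bootstrap(a, b, n_resamples=10000, seed=0):
    """Fraction of paired resamples in which method b's total beats method a's."""
    assert len(a) == len(b), "scores must be paired per example"
    rng = random.Random(seed)  # fixed seed for reproducible analysis
    n = len(a)
    b_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        if sum(b[i] for i in idx) > sum(a[i] for i in idx):
            b_better += 1
    return b_better / n_resamples
```

A result near 1.0 (or near 0.0) suggests the observed difference is robust to which examples happened to be sampled; values near 0.5 suggest noise.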
philosophercode and others added 9 commits December 1, 2025 02:12
This commit introduces support for external model API clients, allowing users to run evaluations using models such as GPT-5, Gemini, Claude, and Llama. Key changes include:

- **New Inference Methods**: Added options for external models in the evaluation workflow.
- **API Key Management**: Introduced command-line arguments for specifying API keys and model configurations.
- **Conditional Model Loading**: Updated the main evaluation logic to skip local model loading when using external models.
- **Checkpointing Improvements**: Enhanced checkpointing functionality to support overwriting the latest checkpoint file.

These updates significantly expand the evaluation options of the Cobra Evaluation System by enabling comparison against external AI models alongside local inference.
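The overwrite-latest checkpoint behavior can be sketched as a rolling JSON file; the function and file names here are illustrative assumptions, not the PR's actual code.

```python
import json
from pathlib import Path

def save_checkpoint(results, out_dir, overwrite_latest=True):
    """Persist partial results so a long evaluation can resume.

    With overwrite_latest, one rolling file replaces per-step files,
    so long runs do not accumulate checkpoints on disk.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    name = ("checkpoint_latest.json" if overwrite_latest
            else f"checkpoint_{len(results):05d}.json")
    path = out / name
    path.write_text(json.dumps({"n_done": len(results), "results": results}))
    return path
```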
…ing scripts as well as my install requirements script.
…r_it

MMStar and accuracy evaluation added (benchmarked on 1000 images from both COCO and MMStar)
@h-zhao1997
Collaborator

Hi @AndrasFerenczy, @philosophercode, and @MatteoPerona,

Thank you so much for submitting this PR to this project! The effort you put into integrating LoRA adapters into the training pipeline and experimenting on reasoning datasets demonstrates a promising direction for extending the model’s capabilities.

However, after carefully reviewing your contribution, I’ve decided not to merge this feature directly into the main branch at this time. The main reason is that our priority is to keep the code and experiments in the main repository as clear and simple as possible, ensuring we can reliably reproduce the core results from the paper.

That said, I genuinely believe your work is highly valuable and deserves visibility. Rather than merging this into the current repository, I recommend you maintain these excellent improvements as a standalone fork or a new, independent repository.

Once again, thank you sincerely for your insightful contribution. I look forward to seeing your work grow into an exciting project of its own!

All the best,
Han Zhao


4 participants