Fix loading local datasets #108

Merged
tastelikefeet merged 2 commits into main from fix/local_files on Mar 6, 2026

@tastelikefeet
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where local datasets were not being loaded correctly. The changes refine the dataset loading mechanism to properly differentiate between individual local files and local directories, ensuring that the datasets library receives the appropriate parameters for each scenario. This enhancement improves the robustness of local data handling and is supported by updated documentation for user guidance and new tests for verification.

Highlights

  • Local Dataset Loading Fix: The core dataset loading logic has been updated to correctly distinguish between local file paths and local directory paths, ensuring proper data ingestion using the datasets library.
  • Documentation Updates: User documentation in both English and Chinese has been enhanced with clear instructions on how to load local datasets, specifying guidelines for single files and directories.
  • New Test Cases: New tests were added to validate the functionality of loading local Lance files and directories, improving coverage for local dataset handling.

Changelog
  • docs/source_en/Components/Dataset/Dataset.md
    • Added instructions for loading local dataset files and directories.
    • Included a note about using the datasets library and its supported extensions.
  • docs/source_zh/组件/数据集/Dataset.md
    • Added Chinese instructions for loading local dataset files and directories.
    • Included a note about using the datasets library and its supported extensions.
  • src/twinkle/dataset/base.py
    • Modified the _load_dataset function to check if dataset_id is a directory or a file.
    • Updated load_dataset calls to use data_dir for directories and data_files for single files.
    • Extracted file extension from the first file in a directory when loading from a folder.
  • tests/dataset/test_loading.py
    • Added test_load_local_lance to verify loading of a single local Lance file.
    • Added test_load_local_lance_dir to verify loading of local Lance files from a directory.
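
The dispatch described in the changelog for src/twinkle/dataset/base.py can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper name `build_load_args` and the suffix-to-builder mapping are hypothetical, and the sketch only assumes what the changelog states, namely that directories are passed as `data_dir`, single files as `data_files`, and the format is taken from the first file's extension when a folder is given.

```python
from pathlib import Path
from typing import Any, Dict, Tuple

# Hypothetical mapping from file suffix to a `datasets` builder name.
# Suffixes not listed here fall back to the suffix without its leading dot.
_SUFFIX_TO_BUILDER = {".jsonl": "json", ".txt": "text"}


def build_load_args(dataset_id: str) -> Tuple[str, Dict[str, Any]]:
    """Return (builder_name, kwargs) for a local file or a local directory."""
    path = Path(dataset_id)
    if path.is_dir():
        # Directory: infer the format from the first file and pass data_dir.
        files = sorted(p for p in path.iterdir() if p.is_file())
        if not files:
            raise ValueError(f"No files found in directory: {path}")
        suffix = files[0].suffix.lower()
        kwargs = {"data_dir": str(path)}
    elif path.is_file():
        # Single file: pass it through data_files.
        suffix = path.suffix.lower()
        kwargs = {"data_files": str(path)}
    else:
        raise FileNotFoundError(f"Not a local file or directory: {path}")
    builder = _SUFFIX_TO_BUILDER.get(suffix, suffix.lstrip("."))
    return builder, kwargs
```

The returned pair is meant to be forwarded as `datasets.load_dataset(builder, **kwargs)`. Keeping the path inspection separate from the actual load call makes the file-versus-directory decision testable without reading any data.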
Activity
  • The pull request was opened to address a bug fix.
  • The author provided a PR type and information section in the description.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@tastelikefeet merged commit 1a5e5d8 into main on Mar 6, 2026
2 of 4 checks passed
@tastelikefeet mentioned this pull request on Mar 6, 2026
1 task
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for loading local datasets from a directory, which is a great enhancement. However, it introduces a critical Path Traversal vulnerability due to improper validation of the dataset_id path, potentially allowing access to sensitive system files. Additionally, a bug in handling empty directories can cause a TypeError and potential denial of service. Other issues include a high-severity problem with kwargs handling that could lead to arguments being dropped, and minor suggestions for improving documentation clarity and test code readability.
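
The three concerns raised in this review (path validation, empty directories, and dropped kwargs) could be addressed along the following lines. This is only a sketch under stated assumptions: the review does not show the affected code, so the helper names `resolve_inside`, `first_suffix`, and `merge_kwargs` are hypothetical, and the idea of confining `dataset_id` to an allowed root directory is one possible mitigation rather than something the PR prescribes.

```python
from pathlib import Path
from typing import Any, Dict


def resolve_inside(root: str, dataset_id: str) -> Path:
    """Resolve dataset_id and reject any path that escapes root (assumed policy)."""
    base = Path(root).resolve()
    # An absolute dataset_id replaces base here, so it is rejected below too.
    candidate = (base / dataset_id).resolve()
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"{dataset_id!r} is outside {base}")
    return candidate


def first_suffix(directory: Path) -> str:
    """Return the suffix of the first file, failing clearly on an empty directory."""
    files = sorted(p for p in directory.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"Empty dataset directory: {directory}")
    return files[0].suffix


def merge_kwargs(caller_kwargs: Dict[str, Any], **path_kwargs: Any) -> Dict[str, Any]:
    """Keep every caller-supplied kwarg; only the path-related keys are overridden."""
    merged = dict(caller_kwargs)
    merged.update(path_kwargs)
    return merged
```

Raising an explicit `ValueError` for an empty directory replaces an incidental failure with a message the caller can act on, and copying the caller's kwargs before updating them avoids silently discarding arguments such as `split`.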

@tastelikefeet deleted the fix/local_files branch on March 6, 2026, 07:10