[Metadata Input]: <Documents parquet contains no metadata> #2416

Description

@00kfulton00

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

My 'documents.parquet' does not include a 'metadata' column.

The documentation mentions specifying "metadata" in the input section, and says this can be used for JSON and CSV.
https://microsoft.github.io/graphrag/index/inputs/#metadata

The graphrag-input config does not include a 'metadata' field:
https://github.com/microsoft/graphrag/blob/v3.1.0/packages/graphrag-input/graphrag_input/input_config.py

However, passing the list of metadata fields to the chunking config works "ok": the metadata fields are prepended to the text field.
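
To make the difference concrete, here is a minimal sketch of the two outcomes in plain Python. It does not call GraphRAG. The sample record, the helper names, and the `key: value` line format are assumptions for illustration only, and the format GraphRAG actually uses when prepending may differ.

```python
# Illustration only: plain Python, no GraphRAG calls.
# The record values, helper names, and "key: value" format are assumptions.


def extract_metadata(record, fields):
    """Collect the requested fields into a dict (what a 'metadata' column would hold)."""
    return {name: record[name] for name in fields if name in record}


def prepend_metadata(text, metadata):
    """Put the metadata in front of the chunk text, one 'key: value' line per field."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n{text}" if header else text


# Hypothetical input record shaped like the config in this report
# (title_column="subject", text_column="displayTo").
record = {"subject": "Weekly sync", "displayTo": "Team A", "parentFolderId": "folder-1"}
fields = ["parentFolderId", "displayTo"]

# Expected: metadata kept as its own value per document.
metadata = extract_metadata(record, fields)

# Observed: the fields only show up inside the chunk text.
chunk_text = prepend_metadata(record["displayTo"], metadata)
```

The first result is what the report expects to find in 'documents.parquet'; the second corresponds to what the chunking config produces today.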

Logs and screenshots

[Image: no metadata in the docs parquet]

[Image: metadata listed in the text column for the chunk]

Steps to reproduce

Using the config classes, specify metadata on the input config.
Run the graphrag pipeline, then check the documents parquet for a metadata column.
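
The check in the second step can be scripted. The sketch below assumes pandas with a parquet engine such as pyarrow is installed, and that the file is at `output/documents.parquet`. That path and the helper names are assumptions, not taken from the report.

```python
from pathlib import Path


def missing_columns(columns, expected=("metadata",)):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]


def check_documents_parquet(path):
    """Load a parquet file and report which expected columns are missing."""
    import pandas as pd  # needs pandas plus a parquet engine such as pyarrow

    df = pd.read_parquet(path)
    return missing_columns(df.columns)


if __name__ == "__main__":
    # Hypothetical output location; adjust to the configured output storage.
    path = Path("output/documents.parquet")
    if path.exists():
        print("missing columns:", check_documents_parquet(path))
    else:
        print(f"{path} not found")
```

An empty list means the 'metadata' column is present; `["metadata"]` matches the behaviour described in this issue.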

GraphRAG Config Used

    exchangeindexCalendarConfig = GraphRagConfig(
        completion_models={"default_completion_model": completion_model_config},
        embedding_models={"default_embedding_model": embedding_model_config},
        input_storage=StorageConfig(base_dir=f"{str(temp_input_path)}"),
        # The report also shows this call with a literal value:
        #   metadata={"parentFolderId": "str", "displayTo": "str"}
        input=InputConfig(type="json", title_column="subject", text_column="displayTo", metadata=config.exchangeIndexCalendarMetadata),
        chunking=ChunkingConfig(type="tokens", size=100, overlap=50, prepend_metadata=config.exchangeIndexCalendarMetadata),
        cache=cache_config,
        vector_store=vector_store_config,
        output_storage=output_storage_config,
        update_output_storage=StorageConfig(base_dir="/home/kevaughn/privateBranch/graphragData"),
    )

Additional Information

graphrag                       3.1.0
graphrag-cache                 3.1.0
graphrag-chunking              3.1.0
graphrag-common                3.1.0
graphrag-input                 3.1.0
graphrag-llm                   3.1.0
graphrag-storage               3.1.0
graphrag-vectors               3.1.0
