NativQA Framework is a powerful open-source toolkit designed to effortlessly build large-scale, culturally and regionally aligned question-answering (QA) datasets in native languages. By leveraging user-defined seed queries and real-time search engine results, NativQA captures location-specific, everyday information to generate natural, multilingual QA data. Ideal for evaluating and fine-tuning large language models (LLMs), the framework bridges the gap in region-specific QA resources, empowering inclusive AI development across diverse linguistic and cultural landscapes.
Developing the NativQA Framework is an ongoing effort, and the framework will continue to grow and improve over time. Currently, it offers the following features:
- Supports Google, Yahoo, and Bing search engines for collecting "People also ask" questions and answers.
- Accepts seed queries in CSV or TSV format.
- Open-source and community-driven.
- Supports image collection for visual question answering tasks.
- Multilingual support for diverse language coverage.
Here is a quick overview of NativQA Framework:
- Clone the repository:
    git clone https://gitlab.com/nativqa/nativqa-framework.git
- Navigate to the nativqa-framework directory:
    cd nativqa-framework
- Install the requirements:
    pip install -r requirements.txt
- Put your SerpAPI API key in envs/api_key.env
- Run the program!
For example, to run the program using example seed queries:
    python -m nativqa --engine google --search_type text --input_file data/test_query.csv --country_code qa --location "Doha, Qatar" --env envs/api_key.env --n_iter 3
which uses a sample seed query file
- --engine: Search engine to use for collecting QA. Currently supports only Google, Bing, and Yahoo.
- --search_type: Type of search, either text, image, or video. [Currently supports only Google and Bing for image search and Google for video/audio search.]
- --input_file: Seed query file; should be CSV/TSV.
- --country_code: Country to use for the Google search. Must be a country code supported by Google.
- --location: Location from which the search should originate.
- --multiple_countries: One or multiple countries to limit the search to. For example, countryQA|countryBD
- --env: API key file.
- --output: Output directory location. Defaults to ./results/
- --n_iter: Number of search iterations to perform.
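The parameters above can also be assembled programmatically, which helps when launching several runs with different locations. The following is a minimal sketch, not part of the framework: `build_nativqa_command` and `join_countries` are hypothetical helper names, and the sketch assumes every option takes exactly one value, as in the example command. The `country<CODE>` pattern is inferred only from the example value `countryQA|countryBD`.

```python
import shlex


def build_nativqa_command(options):
    """Turn a dict of option names and values into an argument list.

    Hypothetical helper: assumes each option takes exactly one value.
    """
    command = ["python", "-m", "nativqa"]
    for name, value in options.items():
        command.extend([f"--{name}", str(value)])
    return command


def join_countries(codes):
    """Format country codes like the example value countryQA|countryBD."""
    return "|".join(f"country{code.upper()}" for code in codes)


# Same options as the example command in this README.
options = {
    "engine": "google",
    "search_type": "text",
    "input_file": "data/test_query.csv",
    "country_code": "qa",
    "location": "Doha, Qatar",
    "env": "envs/api_key.env",
    "n_iter": 3,
}
command = build_nativqa_command(options)

# shlex.join quotes values that contain spaces, such as the location.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it without any shell quoting concerns, since each argument is already a separate list element.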
- The framework will create a directory using the input filename under the given output directory.
  - dataset directory contains the final QA pair file, with the same name as the input file.
  - iteration_{n} directory contains the output folder and the input queries of each iteration.
    - output contains the output of each iteration, consisting of related_search.tsv, original_response.json, summary.jsonl, related_question.json
  - completed_queries.txt: list of completed search queries.
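To locate results from a script, the layout described above can be expressed as paths. This is a rough sketch under several assumptions that the description does not confirm: that the run directory is named after the input file without its extension, that iterations are numbered from 1, and that completed_queries.txt sits directly in the run directory. `expected_layout` is a hypothetical helper, so check the paths against an actual run before relying on them.

```python
from pathlib import Path


def expected_layout(output_dir, input_file, n_iter):
    """Return the paths a run is expected to produce (assumed layout)."""
    input_path = Path(input_file)
    # Assumption: the run directory is the input filename without extension.
    run_dir = Path(output_dir) / input_path.stem
    return {
        "dataset": run_dir / "dataset" / input_path.name,
        "completed_queries": run_dir / "completed_queries.txt",
        # Assumption: iteration directories are numbered starting at 1.
        "iteration_outputs": [
            run_dir / f"iteration_{i}" / "output" for i in range(1, n_iter + 1)
        ],
    }


layout = expected_layout("results", "data/test_query.csv", 3)
for name, path in layout.items():
    print(name, path)
```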
To generate seed queries from a template file for a given location:
    python scripts/template2seeds.py --template_file templates/arabic_template.csv --output_file templates/test.csv --location "قطر"
To add
The manually verified domain list is located in domain/annotated_domains.csv.
To verify the reliability of answer sources (the input file is the dataset file generated by the NativQA Framework):
    python scripts/check_domain_reliability.py --input_file <dataset_directory>/input_filename.csv --output_file <output_directory>/output_filename.csv
Note that only CSV/TSV files are supported for the domain reliability task. We aim to support other file types in the future.
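The general idea of a domain reliability check can be illustrated as follows. This is a minimal sketch, not the implementation in scripts/check_domain_reliability.py: the label values, the example domains, and the helper names `extract_domain` and `label_sources` are all hypothetical, and the real column layout of domain/annotated_domains.csv is not assumed here.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def label_sources(urls, domain_labels, default="unknown"):
    """Pair each answer source URL with the label of its domain."""
    return [
        (url, domain_labels.get(extract_domain(url), default)) for url in urls
    ]


# Hypothetical annotations; real ones would be loaded from the domain list.
domain_labels = {"example.org": "reliable", "example.net": "unreliable"}
urls = [
    "https://www.example.org/article/1",
    "https://example.net/post",
    "https://unlisted.example.com/page",
]
labeled = label_sources(urls, domain_labels)
for url, label in labeled:
    print(label, url)
```

Sources whose domain is missing from the annotated list fall back to a default label, so they can be reviewed manually rather than silently treated as reliable.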
    python scripts/GPT_4o_labeling.py --input_file results/text/arabic_qa/dataset.json --env_path envs/gpt4-api-key.env --output_dir results/text/arabic_qa/GPT4o_labeling/ --location "Qatar"
The NativQA Framework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
You are free to share and adapt the framework for non-commercial purposes, provided that:
- You give appropriate credit.
- You indicate if changes were made.
- You distribute your contributions under the same license.
Please cite our papers when referring to this framework:
@article{Alam2025nativqa,
title={NativQA Framework: A Framework for Collecting Multilingual Culturally-Aligned Natural Queries},
author={Alam, Firoj and Hasan, Md Arid and Laskar, Sahinur Rahman and Kutlu, Mucahid and Chowdhury, Shammur Absar},
journal={arXiv},
year={2024}
}