The data folder stores every downloaded dataset specified in pipeline.yaml. Its structure is shown below:
.
└── data
    ├── dataset-A -----------> # type-1
    │   ├── test
    │   │   ├── images
    │   │   └── labels
    │   └── train
    │       ├── images
    │       └── labels
    ├── dataset-B -----------> # type-2
    │   ├── sub-dataset-1 ---> # variant-1 of dataset B
    │   │   ├── test
    │   │   │   ├── images
    │   │   │   └── labels
    │   │   └── train
    │   │       ├── images
    │   │       └── labels
    │   └── sub-dataset-2 ---> # variant-2 of dataset B
    │       ├── test
    │       │   ├── images
    │       │   └── labels
    │       └── train
    │           ├── images
    │           └── labels
    └── ....

To understand how to add new datasets, it is necessary to distinguish between two categories of datasets.
- type-1: this is the default type of a dataset.
- type-2: this is just a collection of type-1 datasets
- sub-dataset: in the example above, dataset-B has several variants; each variant is a sub-dataset
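The distinction above can be checked on disk. The following is a minimal sketch, not part of the project: the helper name dataset_kind is hypothetical, and it assumes that a type-1 dataset (or a sub-dataset) is recognizable by having train and test folders directly inside it.

```python
from pathlib import Path

SPLITS = ("train", "test")


def dataset_kind(dataset_dir: Path) -> str:
    """Classify a dataset folder as 'type-1', 'type-2', or 'unknown'.

    Hypothetical helper: the rule used here is an assumption drawn from
    the folder layout shown above.
    """

    def is_type_1(folder: Path) -> bool:
        # A type-1 dataset (or a sub-dataset) holds train/ and test/ directly.
        return all((folder / split).is_dir() for split in SPLITS)

    if is_type_1(dataset_dir):
        return "type-1"
    subfolders = [p for p in dataset_dir.iterdir() if p.is_dir()]
    # A type-2 dataset is a non-empty collection of type-1 sub-datasets.
    if subfolders and all(is_type_1(p) for p in subfolders):
        return "type-2"
    return "unknown"
```

For the layout above, dataset_kind(Path("data/dataset-A")) would report type-1 and dataset_kind(Path("data/dataset-B")) would report type-2.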
In the example above, a type-1 dataset and a sub-dataset share the same structure, shown below:
.
├── test # suppose images 100 to 150 are used for testing
│   ├── images
│   │   ├── img100.jpg
│   │   ├── img101.jpg
│   │   ├── ...
│   │   └── img150.jpg
│   └── labels
│       ├── img100.txt
│       ├── img101.txt
│       ├── ...
│       └── img150.txt
└── train # suppose images 1 to 99 are used for training
    ├── images
    │   ├── img1.jpg
    │   ├── img2.jpg
    │   └── ...
    └── labels
        ├── img1.txt
        ├── img2.txt
        └── ...

- the train and test folders have the same structure: each contains two folders, images and labels
- inside the labels folder there is one .txt file for each image in the images folder. Each .txt file must have the same name as its image, without the image extension, as shown in the example above (for example, img1.jpg pairs with img1.txt).
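The naming rule above can be verified with a short script. This is a minimal sketch under the stated layout; the function name unmatched_files is hypothetical and not part of the project.

```python
from pathlib import Path


def unmatched_files(split_dir: Path) -> tuple[set[str], set[str]]:
    """Return (images without a label, labels without an image).

    Files are compared by base name, i.e. the file name without extension.
    split_dir is a train or test folder containing images/ and labels/.
    """
    image_stems = {p.stem for p in (split_dir / "images").iterdir() if p.is_file()}
    label_stems = {p.stem for p in (split_dir / "labels").glob("*.txt")}
    return image_stems - label_stems, label_stems - image_stems
```

Both returned sets are empty when every image has exactly one matching label file.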
Each line of the .txt file contains one word and its bounding box, separated by a tab (\t). When generating your data you could write each line like this:

# example of how to create .txt files
# word is a string and bbox is a list
file.write(f"{word}\t{bbox}\n")

The .txt file should look something like this:
This [x1,y1,x2,y2]
is [x3,y3,x4,y4]
an [x5,y5,x6,y6]
example [x7,y7,x8,y8]
...

Where x1,y1,x2,y2 are absolute coordinates:
- x1,y1: top-left coordinates
- x2,y2: bottom-right coordinates
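The label format can be written and read back with a few lines of Python. This is a minimal sketch, not the project's own code: write_labels and read_labels are hypothetical names, and it assumes that every bounding box is a list of four numbers, so that its text form can be parsed with json.loads.

```python
import json
from pathlib import Path


def write_labels(path: Path, words: list[str], bboxes: list[list[int]]) -> None:
    """Write one 'word<TAB>bbox' line per word, as in the format above."""
    with path.open("w", encoding="utf-8") as file:
        for word, bbox in zip(words, bboxes):
            file.write(f"{word}\t{bbox}\n")


def read_labels(path: Path) -> list[tuple[str, list[int]]]:
    """Parse a label file back into (word, [x1, y1, x2, y2]) pairs."""
    entries = []
    with path.open(encoding="utf-8") as file:
        for line in file:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab: the bounding box never contains one.
            word, bbox_text = line.rsplit("\t", 1)
            # Assumption: the box is a plain list of numbers, so it is valid JSON.
            entries.append((word, json.loads(bbox_text)))
    return entries
```

Reading the file back right after writing it is a cheap way to confirm that the separator and the coordinate order survived the conversion.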
Every dataset has a corresponding .py file inside ./src/dataset/. A dataset can be local or online:
- local: the dataset you want to use for generating the training data is already on your PC; you only need to convert it to the specified format
- online: the annotations and images, or only the annotations, are available online; _download() automatically downloads the images and labels and stores them in the ./data folder in the specified format
This is an example of a .py file for a dataset:
from .dataset import Dataset

CONFIG = {...}


class DATASET(Dataset):
    def __init__(
        self,
        config: dict
    ) -> None:
        super().__init__(config)

    def _download(self) -> None:
        # download code
        pass

CONFIG specifies the download links for images and labels and also defines the structure inside the ./data/ folder:
# local dataset case
CONFIG = {}

# type-1 dataset case
# a single entry means a type-1 dataset
CONFIG = {
    "dataset-name": "..."
}

# type-2 dataset case
# more than one entry means a type-2 dataset
CONFIG = {
    "sub-1": "...",
    "sub-2": "...",
}

- DATASET class: takes only the CONFIG as input
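The rule stated in the CONFIG comments (no entries for a local dataset, one entry for type-1, several entries for type-2) can be sketched as a small helper. The function config_kind is hypothetical and only illustrates the rule; how the project actually interprets CONFIG is not shown here.

```python
def config_kind(config: dict) -> str:
    """Map a CONFIG dict to the dataset case it describes."""
    if not config:
        # Nothing to download: the data is already on disk.
        return "local"
    if len(config) == 1:
        # A single entry describes a type-1 dataset.
        return "type-1"
    # Several entries describe the sub-datasets of a type-2 dataset.
    return "type-2"
```

For example, config_kind({"sub-1": "...", "sub-2": "..."}) falls into the type-2 case, where each key would name one sub-dataset folder.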
More information can be found here