Description
Proposed process for classifying OHR Projects:
Pass 1: Given the project name and the file extensions, classify each project as "hardware" or "not hardware". Projects that contain only a README and/or a .ohwr.yaml file are classified as "ambiguous".
Pass 2: For projects labeled "ambiguous", pull down the project description, README, and/or Wiki page and search for hardware keywords. The more keywords found, the higher the score. A project with a high enough score is classified as hardware, and one with a low score as not hardware. A project with a moderate score stays labeled as ambiguous and will be manually reviewed.
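The "ambiguous" rule in Pass 1 could be sketched roughly as below. This is only an illustration, not the script used for the rounds that follow: the helper names `is_uninformative` and `is_ambiguous` are hypothetical, and it assumes README files are named `README` with any (or no) extension.

```python
from pathlib import PurePosixPath


def is_uninformative(path: str) -> bool:
    """Assumption: a README or .ohwr.yaml carries no classification signal."""
    name = PurePosixPath(path).name.lower()
    return name == '.ohwr.yaml' or name.split('.')[0] == 'readme'


def is_ambiguous(file_paths: list[str]) -> bool:
    """True when a project holds nothing but a README and/or a .ohwr.yaml file.

    An empty file list also counts as ambiguous here, since all() of an
    empty sequence is True.
    """
    return all(is_uninformative(p) for p in file_paths)


print(is_ambiguous(['README.md', '.ohwr.yaml']))   # README and metadata only
print(is_ambiguous(['README.md', 'board.sch']))    # has a real design file
```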
ROUND 1: In this round, I examined only the file extensions and the folder names, not the project name. The code block below lists the extensions and folder-name keywords that I found to be unique to hardware projects.
hardware_extensions = {'.pcb', '.sch', '.brd', '.gbr', '.drl', '.kicad_pcb', '.lib', 'SchDoc', '.PcbDoc', '.PcbLib',
                       '.PrjPCB', '.ipt', '.step', '.stl', '.dwg'}
hardware_folders = {'hardware', 'pcb', 'schematic', 'eagle', 'kicad', 'gerber'}
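One way sets like these might be applied to a project's file list is sketched below. The function name `has_hardware_signal`, the shortened sets, and the choice to compare case-insensitively are all assumptions made for illustration; the actual matching logic is not shown in this issue.

```python
from pathlib import PurePosixPath

# Shortened copies of the sets above, for illustration only.
hardware_extensions = {'.pcb', '.sch', '.brd', '.kicad_pcb', '.PcbDoc'}
hardware_folders = {'hardware', 'pcb', 'schematic', 'eagle', 'kicad', 'gerber'}

# Assumption: compare extensions case-insensitively so '.PcbDoc' matches '.pcbdoc'.
_extensions_lower = {ext.lower() for ext in hardware_extensions}


def has_hardware_signal(file_paths):
    """Return True if any path has a hardware extension or sits in a hardware folder."""
    for raw in file_paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() in _extensions_lower:
            return True
        # parts[:-1] drops the file name and keeps only the folder names.
        if any(part.lower() in hardware_folders for part in path.parts[:-1]):
            return True
    return False
```

Under these assumptions, `has_hardware_signal(['src/main.c', 'pcb/notes.txt'])` is true because of the `pcb` folder, even though neither file has a hardware extension.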
ROUND 2: I still examined only the file extensions and the folder names, but extended both lists. This method performed worse than the Round 1 method.
hardware_extensions = {
    '.pcb', '.sch', '.brd', '.gbr', '.drl', '.kicad_pcb', '.lib',
    '.SchDoc', '.PcbDoc', '.PcbLib', '.PrjPCB', '.ipt', '.step',
    '.stl', '.dwg', 'vhd', '.v', 'ucf'
}
hardware_folders = {'hardware', 'pcb', 'eagle', 'kicad', 'gerber',
                    'hw', 'layout', 'schematics', 'schematic', 'board',
                    'rtl', 'pcb_design'}
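Note that 'vhd' and 'ucf' in the Round 2 list have no leading dot. If extensions are collected in dotted form (an assumption here, for example via `os.path.splitext`), those two entries can never match. The small sketch below illustrates this; `ext_matches` and `round2_subset` are hypothetical names.

```python
import os

# A few entries copied from the Round 2 list, including the dotless ones.
round2_subset = {'.sch', 'vhd', '.v', 'ucf'}


def ext_matches(filename, extensions):
    """Compare the dotted extension returned by os.path.splitext against a set."""
    return os.path.splitext(filename)[1] in extensions


print(ext_matches('top.vhd', round2_subset))   # False: splitext yields '.vhd', the set has 'vhd'
print(ext_matches('core.v', round2_subset))    # True: '.v' is stored with its dot
```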
ROUND 3: I examined the file extensions, file names, and the project name. Projects whose names contained 'gateware', 'software', 'firmware', 'gw', 'sw', or 'fw' were classified as non-hardware.
import re

# Excerpt from the body of the classification function.
# Split the project name into lowercase tokens on whitespace, underscores, and hyphens.
words = re.split(r'[\s_-]+', project_name.lower())
project_name_exclusion = {'gateware', 'software', 'firmware', 'gw', 'sw', 'fw'}
# A whole-token match on any exclusion word marks the project as not hardware.
if any(exclusion in words for exclusion in project_name_exclusion):
    classification = 'not_hardware'
return {
    'project_id': project_id,
    'file_extensions': list(project_data['file_extensions'].keys()),
    'file_names': [file['name'] for file in project_data['files']],
    'classification': classification
}
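To make the name check concrete, here is a small self-contained sketch of the same tokenize-and-match step. The wrapper `name_is_excluded` and the sample project names are made up for illustration.

```python
import re

project_name_exclusion = {'gateware', 'software', 'firmware', 'gw', 'sw', 'fw'}


def name_is_excluded(project_name):
    """Split the name on whitespace, '_' and '-', then look for whole-token matches."""
    words = re.split(r'[\s_-]+', project_name.lower())
    return any(word in project_name_exclusion for word in words)


print(name_is_excluded('spec-sw'))            # True: tokens are ['spec', 'sw']
print(name_is_excluded('switch-board'))       # False: 'sw' inside 'switch' is not a token
print(name_is_excluded('FMC_ADC Gateware'))   # True: tokens are ['fmc', 'adc', 'gateware']
```

Because the match is on whole tokens rather than substrings, short abbreviations such as 'sw' do not fire on names like 'switch'.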
# Hardware file indicators (fixed missing dots)
hardware_extensions = {
    '.pcb', '.sch', '.brd', '.gbr', '.drl', '.kicad_pcb', '.lib',
    '.SchDoc', '.PcbDoc', '.PcbLib', '.PrjPCB', '.ipt', '.step',
    '.stl', '.dwg', '.vhd', '.v', '.ucf'  # Fixed: added dots to vhd, v, ucf
}
# Hardware folders
hardware_folders = {'hardware', 'pcb', 'eagle', 'kicad', 'gerber',
                    'hw', 'layout', 'schematics', 'schematic', 'board',
                    'rtl', 'pcb_design', 'cad'}
ROUND 3.5: In this second pass, I pulled down the project description, README, and Wiki pages for all of the projects labeled "ambiguous". This text is scanned for hardware keywords. I count all occurrences of each keyword and assign the project a weighted score (e.g., 'pcb' is weighted higher than 'board').
def evaluate_project_information(combined_text: str):
    # Keywords grouped by how strongly they suggest a hardware project.
    hardware_keywords = {
        'strong': ['schematics', 'schematic', 'pcb', 'circuit', 'breakout board', 'fpga mezzanine card',
                   'hardware design', 'sch', 'sch diagram', 'bom'],
        'medium': ['hardware', 'microcontroller', 'i/o', 'layout'],
        'weak': ['prototype', 'board', 'chip', 'design', 'device']
    }
    weights = {'strong': 3, 'medium': 2, 'weak': 1}
    hw_score = 0
    for strength, keywords in hardware_keywords.items():
        weight = weights[strength]
        # str.count matches substrings case-sensitively, so 'sch' is also counted inside 'schematic'.
        hw_score += sum(combined_text.count(keyword) * weight for keyword in keywords)
    # Thresholds: >= 20 is hardware, 15-19 goes to manual review, below 15 is not hardware.
    if hw_score >= 20:
        classification = 'hardware'
    elif hw_score >= 15:
        classification = 'still ambiguous'
    else:
        classification = 'not hardware'
    return {
        'hw_score': hw_score,
        'classification': classification,
    }
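Since the score relies on `str.count`, overlapping keywords (for example 'sch', 'schematic', and 'schematics') can all fire on a single word, and capitalized words are missed unless the text is lowercased first. The sketch below compares that behavior with a whole-word, case-insensitive variant. The variant is a suggested alternative, not what was run for Round 3.5, and the names `weights`, `substring_score`, and `whole_word_score` are hypothetical.

```python
import re

# A small subset of the keyword weights, for illustration only.
weights = {'schematics': 3, 'schematic': 3, 'sch': 3, 'board': 1}


def substring_score(text):
    """Same counting approach as above: case-sensitive substring matches."""
    return sum(text.count(kw) * w for kw, w in weights.items())


def whole_word_score(text):
    """Alternative: lowercase the text and count whole-word matches only."""
    text = text.lower()
    return sum(
        len(re.findall(rf'\b{re.escape(kw)}\b', text)) * w
        for kw, w in weights.items()
    )


sample = 'Schematics for the keyboard'
print(substring_score(sample))          # 1: only 'board' inside 'keyboard' matches
print(substring_score(sample.lower()))  # 10: 'schematics' also hits 'schematic' and 'sch'
print(whole_word_score(sample))         # 3: one whole-word hit on 'schematics'
```

A whole-word count would shift the scores, so the 20 and 15 thresholds would likely need to be re-tuned if this variant were adopted.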