OHR Project Classifier #4

@sarah114tran

Proposed process for classifying OHR Projects:

Pass 1: Given the project name and the file extensions, classify each project as "hardware" or "not hardware". Projects that have only a README and/or a .ohwr.yaml file are classified as "ambiguous".

Pass 2: For projects labeled "ambiguous", pull down the project description, README, and/or Wiki page and search for hardware keywords. The more keywords found, the higher the score. A project with a high enough score is classified as hardware, and one with a low score as not hardware. A project that receives a moderate score stays labeled as ambiguous and will be manually reviewed.
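
The "ambiguous" rule in Pass 1 could be checked roughly as follows. This is only a sketch: `is_metadata_only` and `METADATA_ONLY_NAMES` are hypothetical names, and treating any file called `README` or `README.*` as a README is an assumption. The rule above does not say what happens to a project with no files at all, so the sketch returns False for that case.

```python
from pathlib import PurePosixPath

# Hypothetical names, for illustration only (not taken from the project code).
METADATA_ONLY_NAMES = {'.ohwr.yaml'}


def is_metadata_only(file_paths):
    """Return True when every file is a README or a .ohwr.yaml file."""
    if not file_paths:
        # Assumption: an empty file list is not covered by the rule.
        return False
    for path in file_paths:
        name = PurePosixPath(path).name.lower()
        is_readme = name == 'readme' or name.startswith('readme.')
        if not (is_readme or name in METADATA_ONLY_NAMES):
            return False
    return True


print(is_metadata_only(['README.md', '.ohwr.yaml']))    # candidate for "ambiguous"
print(is_metadata_only(['README.md', 'hw/board.sch']))  # has other files
```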

ROUND 1: In this round, I examined only the file extensions and the folder names, not the project name. The code block below lists all of the extensions and folder-name keywords that I found to be unique to hardware projects.

# Note: 'SchDoc' has no leading dot in this round; Round 2 onward uses '.SchDoc'.
hardware_extensions = {'.pcb', '.sch', '.brd', '.gbr', '.drl', '.kicad_pcb', '.lib', 'SchDoc', '.PcbDoc', '.PcbLib',
                       '.PrjPCB', '.ipt', '.step', '.stl', '.dwg'}

hardware_folders = {'hardware', 'pcb', 'schematic', 'eagle', 'kicad', 'gerber'}
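
For illustration, here is one way sets like these could be applied to a project's file paths. It is a sketch, not the script used for this round: `looks_like_hardware` is a hypothetical helper, only a subset of the extensions is included, and lowercasing paths before matching is an assumption.

```python
from pathlib import PurePosixPath

# Subset of the Round 1 sets, for illustration; lowercase matching is an assumption.
HARDWARE_EXTENSIONS = {'.pcb', '.sch', '.brd', '.kicad_pcb', '.stl'}
HARDWARE_FOLDERS = {'hardware', 'pcb', 'schematic', 'eagle', 'kicad', 'gerber'}


def looks_like_hardware(file_paths):
    """Return True if any path has a hardware extension or sits in a hardware folder."""
    for path in file_paths:
        p = PurePosixPath(path.lower())
        if p.suffix in HARDWARE_EXTENSIONS:
            return True
        # p.parts[:-1] holds the folder components, excluding the file name.
        if any(part in HARDWARE_FOLDERS for part in p.parts[:-1]):
            return True
    return False


print(looks_like_hardware(['src/main.c', 'board/top.SCH']))
print(looks_like_hardware(['src/main.c', 'README.md']))
```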

ROUND 2: I still examined only the file extensions and the folder names, but extended both lists. This method performed worse than the Round 1 method.

hardware_extensions = {
    '.pcb', '.sch', '.brd', '.gbr', '.drl', '.kicad_pcb', '.lib',
    '.SchDoc', '.PcbDoc', '.PcbLib', '.PrjPCB', '.ipt', '.step',
    '.stl', '.dwg', 'vhd', '.v', 'ucf'
}

hardware_folders = {'hardware', 'pcb', 'eagle', 'kicad', 'gerber',
                    'hw', 'layout', 'schematics', 'schematic', 'board',
                    'rtl', 'pcb_design'}
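
Entries written without a leading dot ('vhd', 'ucf') can never match if extensions are extracted with their dot, as `os.path.splitext` does. The sketch below shows that effect on a subset of the set. How the actual script extracts extensions is not shown in this issue, so whether this influenced the Round 2 results is an open question; `has_hardware_extension` is a hypothetical helper.

```python
import os

# Subset of the Round 2 set, with the dots exactly as written above.
round2_extensions = {'.sch', '.v', 'vhd', 'ucf'}


def has_hardware_extension(filename, extensions):
    """Compare the extension as returned by os.path.splitext (leading dot included)."""
    _, ext = os.path.splitext(filename)
    return ext in extensions


print(has_hardware_extension('top.v', round2_extensions))    # '.v' is in the set
print(has_hardware_extension('top.vhd', round2_extensions))  # '.vhd' is not: the set holds 'vhd'
```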

ROUND 3: I examined the file extensions, file names, and the project name. Projects whose names contained 'gateware', 'software', 'firmware', 'gw', 'sw', or 'fw' were classified as non-hardware.

    # Excerpt from inside the classification function; requires `import re`.
    # Split the project name on whitespace, underscores, and hyphens.
    words = re.split(r'[\s_-]+', project_name.lower())
    project_name_exclusion = {'gateware', 'software', 'firmware', 'gw', 'sw', 'fw'}

    # A name containing any exclusion token is classified immediately.
    if any(exclusion in words for exclusion in project_name_exclusion):
        classification = 'not_hardware'
        return {
            'project_id': project_id,
            'file_extensions': list(project_data['file_extensions'].keys()),
            'file_names': [file['name'] for file in project_data['files']],
            'classification': classification
        }

    # Hardware file indicators
    hardware_extensions = {
        '.pcb', '.sch', '.brd', '.gbr', '.drl', '.kicad_pcb', '.lib',
        '.SchDoc', '.PcbDoc', '.PcbLib', '.PrjPCB', '.ipt', '.step',
        '.stl', '.dwg', '.vhd', '.v', '.ucf'  # Fixed: added the missing dots to vhd and ucf
    }

    # Hardware folders
    hardware_folders = {'hardware', 'pcb', 'eagle', 'kicad', 'gerber',
                        'hw', 'layout', 'schematics', 'schematic', 'board',
                        'rtl', 'pcb_design', 'cad'}
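
To make the project-name check concrete, the sketch below wraps the same split-and-match logic in a standalone function and runs it on a few made-up names. `name_is_excluded` and the example project names are hypothetical, and the first exclusion term is spelled 'gateware' here.

```python
import re

PROJECT_NAME_EXCLUSION = {'gateware', 'software', 'firmware', 'gw', 'sw', 'fw'}


def name_is_excluded(project_name):
    """Return True if any whole token of the name is an exclusion term."""
    # Tokens are separated by whitespace, underscores, or hyphens.
    words = re.split(r'[\s_-]+', project_name.lower())
    return any(word in PROJECT_NAME_EXCLUSION for word in words)


# Made-up project names, for illustration only.
for name in ['demo-board-sw', 'switch_board', 'Demo Board Firmware', 'BoardFirmware']:
    print(name, name_is_excluded(name))
```

Because matching is done on whole tokens, 'switch_board' is not excluded even though it contains the letters 'sw'. The flip side is that a name with no separators, such as 'BoardFirmware', stays a single token and is not excluded either.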

ROUND 3.5: In this second pass, I pulled down the project description, README, and Wiki pages for all of the projects labeled "ambiguous" and scanned the text for hardware keywords. I count every occurrence of each keyword and assign the project a weighted score (e.g., 'pcb' is weighted higher than 'board').

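def evaluate_project_information(combined_text: str):
    # Keywords grouped by how strongly they indicate a hardware project
    hardware_keywords = {
        'strong': ['schematics', 'schematic', 'pcb', 'circuit', 'breakout board', 'fpga mezzanine card',
                   'hardware design', 'sch', 'sch diagram', 'bom'],
        'medium': ['hardware', 'microcontroller', 'i/o', 'layout'],
        'weak': ['prototype', 'board', 'chip', 'design', 'device']
    }
    # Points added per occurrence of a keyword in each group
    weights = {'strong': 3, 'medium': 2, 'weak': 1}

    # Lowercase the text so that, for example, 'PCB' matches 'pcb'
    text = combined_text.lower()
    hw_score = 0

    # str.count matches substrings, so overlapping keywords
    # (e.g. 'sch' inside 'schematic') each add to the score
    for strength, keywords in hardware_keywords.items():
        hw_score += sum(text.count(keyword) * weights[strength] for keyword in keywords)

    # 20 or more: hardware; 15 to 19: needs manual review; below 15: not hardware
    if hw_score >= 20:
        classification = 'hardware'
    elif hw_score >= 15:
        classification = 'still ambiguous'
    else:
        classification = 'not hardware'

    return {
        'hw_score': hw_score,
        'classification': classification,
    }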
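
One possible refinement, not something tested in the rounds above: `str.count` counts substrings, so 'sch' is also counted inside 'schematic', and 'board' inside 'breakout board'. The sketch below counts whole-word matches instead, using a small subset of the keywords. `count_whole_words`, `weighted_score`, and `KEYWORD_WEIGHTS` are hypothetical names, and the 20 and 15 thresholds would probably need re-tuning if the counting changed.

```python
import re

# Subset of the keyword list with its weights, for illustration only.
KEYWORD_WEIGHTS = {'schematic': 3, 'sch': 3, 'pcb': 3, 'hardware': 2, 'board': 1}


def count_whole_words(text, keyword):
    """Count occurrences of keyword that are not part of a longer word."""
    pattern = r'(?<!\w)' + re.escape(keyword) + r'(?!\w)'
    return len(re.findall(pattern, text.lower()))


def weighted_score(text):
    """Sum weight * whole-word count over the keyword subset."""
    return sum(count_whole_words(text, kw) * w for kw, w in KEYWORD_WEIGHTS.items())


sample = 'The schematic and PCB are in the hardware folder.'
print(sample.lower().count('sch'))       # substring count: matches inside 'schematic'
print(count_whole_words(sample, 'sch'))  # whole-word count: no standalone 'sch'
print(weighted_score(sample))
```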
