Missing files for tfidf_vectorizer #6

@ShineHop

Hey,

I found that the path to the tfidf_vectorizer file is left as an empty string in kbguided_pretrain when generating raw_ptdata, and the file itself does not seem to be provided on GitHub. Could you clarify what this file is?

kbguided_pretrain/datagen/generate_raw_ptdata.py

tfidf_vectorizer = ''       
vectorizer = joblib.load(tfidf_vectorizer)

def generate_pair(y, mentions, select_scheme):
    if select_scheme == 'random':
        return random.choice(mentions)
    elif select_scheme == 'sample':
        similarity_estimate = cal_similarity_tfidf(mentions, y, vectorizer)
        print(similarity_estimate.shape) ##
        return np.random.choice(mentions, 1, p = similarity_estimate/np.sum(similarity_estimate))[0]
    elif select_scheme == 'most_sim':
        similarity_estimate = cal_similarity_tfidf(mentions, y, vectorizer)
        return mentions[similarity_estimate.argmax()]
    elif select_scheme == 'least_sim':
        similarity_estimate = cal_similarity_tfidf(mentions, y, vectorizer)
        return mentions[similarity_estimate.argmin()]
    else:
        print('Wrong mention selection scheme input!!!')

The same path is also missing in data_utils/ncbi/prepare_dataset.py.
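
In case it helps to pin down what is expected: since the file is read with joblib.load, my guess is that it is a fitted scikit-learn TfidfVectorizer saved with joblib. Below is a minimal sketch of how I would produce such a file. The corpus, the analyzer settings, and the file name are all my own assumptions and do not come from the repository.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: presumably the real one would be the KB entity names or mentions.
corpus = ["breast cancer", "breast carcinoma", "lung cancer", "type 2 diabetes"]

# Assumed settings: the repository does not say how the vectorizer was configured.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
vectorizer.fit(corpus)

# Hypothetical file name: this path would then be assigned to `tfidf_vectorizer`.
path = os.path.join(tempfile.gettempdir(), "tfidf_vectorizer.joblib")
joblib.dump(vectorizer, path)

# Round trip, mirroring the joblib.load call in generate_raw_ptdata.py.
restored = joblib.load(path)
features = restored.transform(["breast cancer"])
```

If this is roughly right, it would be enough to know which corpus and settings were used to fit the original vectorizer.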

Looking forward to your reply,
Best,
