GitHub

PRODUCT MATCHING Task Report on Applied Approaches Helina Asadi

Problem Analysis Overview Develop a Python algorithm to identify and group similar product names. Data 2 .xlsx Excel files each including a total of 7 product names listed in a column Challenges Variations in product naming, word order, and language differences.

Data Preprocessing

LOAD DATA Using pandas pd, read_data function, to load data as a list of Products.

Data Cleaning:

Stripping whitespace Converting to lowercase Digit mapping all Farsi digits to English Removing uninformative and general words that don’t specify the product by any means (based on the stopwords list)

Methodology and Possible Approaches Cosine Similarity & TF-IDF Vectorization Fuzzy Both methods are discussed in more detail further in the report. Cosine Similarity with TF-IDF Vectorization CosSimilarity.py

Why can Cosine Similarity be a good approach for this problem? Cosine similarity works well for short text comparisons, such as product names. It also handles variations in word order. Method The fuzzy approach applies token-set-ratio which tokenizes each product name into sets of words and compares them based on overlap. This technique is robust to word order and extra words, making it ideal for matching variations of product names. An adjustable threshold controls sensitivity of the comparison (threshold 70 resulted in the most cohesive groups). Initially, I tested pairwise matching which finds a match for each product using the threshold and reports pairs of matching products. However, since multiple similar product names appeared in each dataset, I changed the algorithm to perform cluster-based grouping, which groups all similar products into clusters instead of just pairs. Results of both methods get saved as Excel files.

Fuzzy fuzzy.py

Why can fuzzy work? Fuzzy matching allows for uncertainty in text and measures similarity even when there are variations in wording. Method The algorithm applies token-based matching using rapidfuzz. An adjustable threshold controls sensitivity of the comparison (threshold 70 resulted in the most cohesive groups). Initially, I tested pairwise matching which finds a match for each product using the threshold and reports pairs of matching products. However, since multiple similar product names appeared in each dataset, I changed the algorithm to perform cluster-based grouping, which groups all similar products into clusters instead of just pairs. Results of both methods get saved as Excel files.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.idea		.idea
dataFiles		dataFiles
CosSimilarity.py		CosSimilarity.py
README.md		README.md
fuzzy.py		fuzzy.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Languages

HelinaAsadi/ProductMatcherTask

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages