cbow

Continuous Bag of Words implementation and principal component analysis

This project explores the visual representation of embeddings and the validity of that representation. Principal Component Analysis (PCA) reduces the number of axes from a large number (say, over 1500) to 2 or 3. We can then plot the resulting points and recognize patterns such as clustering. It is useful to SEE text fragments on the same topic land next to each other, or to notice that some have strayed away so you can investigate why.
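As a rough illustration of the reduction step, here is a minimal sketch that assumes NumPy is available and that the embeddings are rows of a 2D array. The function name pca_reduce and the random sample data are hypothetical and are not taken from cbow.py or utils.py; the project's own PCA code may work differently.

```python
import numpy as np


def pca_reduce(embeddings, n_components=2):
    """Project each row of `embeddings` onto its top principal components."""
    X = np.asarray(embeddings, dtype=float)
    # PCA works on mean-centered data.
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Made-up data: 20 vectors with 50 dimensions each.
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 50))
points_2d = pca_reduce(sample, n_components=2)
print(points_2d.shape)
```

The two columns of points_2d can be passed directly to a scatter plot to look for clusters.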

But is this reduction faithful to the original data? The check used below compares point distances in both sets and validates that the relative distances between points in the source space closely match those in the PCA space, i.e. if dist(A,B,source) < dist(A,C,source) then dist(A,B,pca) < dist(A,C,pca) + slack, where slack is a small value to accommodate scaling.
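The ordering rule above can be sketched as follows. This is only an illustration under stated assumptions: the function name order_preserved_ratio, the choice to report a fraction of preserved triples, and the toy points are hypothetical and are not the tests shipped in run.py.

```python
from itertools import permutations
from math import dist


def order_preserved_ratio(source, reduced, slack=0.0):
    """Fraction of triples (A, B, C) whose distance order survives reduction.

    A triple counts when dist(A, B) < dist(A, C) in the source space; it is
    preserved when dist(A, B) < dist(A, C) + slack in the reduced space.
    """
    checked = kept = 0
    for a, b, c in permutations(range(len(source)), 3):
        if dist(source[a], source[b]) < dist(source[a], source[c]):
            checked += 1
            if dist(reduced[a], reduced[b]) < dist(reduced[a], reduced[c]) + slack:
                kept += 1
    return kept / checked if checked else 1.0


# Toy data: the third coordinate is always 0, so dropping it loses nothing.
source = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (0, 4, 0)]
faithful = [p[:2] for p in source]
# A deliberately bad "reduction" that moves points around.
distorted = [(0, 0), (5, 0), (1, 0), (0, 0.5)]

print(order_preserved_ratio(source, faithful))
print(order_preserved_ratio(source, distorted))
```

Checking every triple is cubic in the number of points, so for large embedding sets a random sample of triples would be a reasonable substitute.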

The embeddings can be generated by the CBoW implementation by running python run.py -t. This creates a new CBoW model from a training text (default shakespeare.txt) and runs the included tests. Once the model has been created, you can run python run.py -l -t to load the model generated by the previous run and re-run the tests with its embeddings.

If you are not convinced, you can bring your own embeddings and run python run.py -l -e embs_sample.txt to test the solution on far more than the 50 dimensions that CBoW currently uses.

