New Algorithm: DBSCAN clustering #16
Open
Labels
datascience: Probably involves knowledge of scikit-learn, LLMs, etc.
htmx: Probably involves mostly changes to HTML or HTMX
python: Probably involves a lot of changes to Python code
I like the clustering approach, but I don't like that k-means makes you say up front how many clusters there are going to be (I'm discovering too, it's a new day, I don't know yet, right?). I want to experiment with other clustering algorithms that make different assumptions and trade-offs about the data.
DBSCAN seems interesting because it finds clusters based on density. Instead of a cluster count, you say what the expected density should be: the threshold that defines a cluster (how close together points need to be, and how many of them, to count as one).
I expect it will take a lot of tweaking to get that threshold right for a given embedding model, but once it works it should be a lot more dynamic and robust than a fixed cluster count.
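To make the density threshold concrete, here is a rough sketch using scikit-learn's `DBSCAN`. It is only an illustration under assumptions: the embeddings are synthetic stand-ins for real post embeddings, and the `eps`, `min_samples`, and cosine metric choices are placeholder values that would almost certainly need tuning per embedding model.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-ins for post embeddings (assumption: real ones would come
# from whatever embedding model the project uses). Two tight groups of five
# points each, plus one point that is far from both.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.01, size=(5, 3))
group_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.01, size=(5, 3))
outlier = np.array([[0.0, 0.0, 1.0]])
embeddings = np.vstack([group_a, group_b, outlier])

# eps: the largest distance at which two points still count as neighbours.
# min_samples: how many neighbours (including the point itself) make a
# region dense enough to seed a cluster.
# Both values here are placeholders, not recommendations.
clusterer = DBSCAN(eps=0.05, min_samples=3, metric="cosine")
labels = clusterer.fit_predict(embeddings)

# DBSCAN labels points that belong to no cluster with -1.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, unclustered posts: {n_noise}")
```

Note that the number of clusters is an output here, not an input. The tuning effort moves into `eps` and `min_samples` instead.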
Note: DBSCAN doesn't assign all posts to a cluster, so you might not be able to use `toot_clusters.html` on its own. You'll probably need an offshoot of it. Feel free to skip this part on the first pass of the PR; we might even be able to get someone else to do it.
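One possible way to feed a template offshoot is to split the posts into real clusters and a separate "unclustered" bucket before rendering. This is a minimal sketch, assuming the labels follow scikit-learn's convention of `-1` for unassigned points. The helper name `group_posts_by_label` and the sample data are hypothetical and not part of the existing codebase.

```python
from collections import defaultdict

# Assumption: labels follow scikit-learn's DBSCAN convention, where -1
# means "not assigned to any cluster".
NOISE_LABEL = -1


def group_posts_by_label(posts, labels):
    """Return ({cluster_label: [posts]}, [unclustered posts]).

    Hypothetical helper for illustration only.
    """
    if len(posts) != len(labels):
        raise ValueError("posts and labels must have the same length")
    clusters = defaultdict(list)
    unclustered = []
    for post, label in zip(posts, labels):
        if label == NOISE_LABEL:
            unclustered.append(post)
        else:
            clusters[int(label)].append(post)
    return dict(clusters), unclustered


# Made-up posts and labels, just to show the shape of the result.
posts = ["post a", "post b", "post c", "post d"]
labels = [0, 0, -1, 1]
clusters, unclustered = group_posts_by_label(posts, labels)
```

A template could then loop over `clusters` the way the existing one presumably does, and render `unclustered` in its own section or leave it out.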