k modes: optimal k


I have categorical data and I'm trying to run k-modes using the GitHub package available here. My dataset is large, and I want to split it into clusters of roughly 5-7 records each, where the records in a cluster are as similar to one another as possible.

However, I currently have no way to select the optimal k, which ideally would be the one that maximizes the silhouette score. That seems like a natural fit, because k-modes uses a dissimilarity measure as its distance. I would therefore assume that the silhouette score measures how close together or far apart the clusters are under that same dissimilarity. I have not been able to find an implementation of this.

Can I perhaps use the elbow method here? If so, I don't see how to determine the elbow programmatically, without looking at a graph, since I have to repeat this process a large number of times. My current idea is: find the k where the cost drops substantially, then check whether the next few values of k reduce the cost only slightly. If so, choose this k. If not... then what? I'm a little confused at this point.

I was looking online and also found this, which I'm not able to interpret in terms of k-modes. I'm looking for any code or suggestions to start me off on the right path.
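The elbow idea described above (take the first k after which the next few values of k barely reduce the cost) can be sketched without a plot. This is only one heuristic among several, not a definitive rule. The function name `pick_elbow`, the `window` and `min_rel_drop` parameters, and the example costs are all hypothetical; the costs would be whatever the k-modes run reports for each k.

```python
def pick_elbow(ks, costs, window=2, min_rel_drop=0.10):
    """Return the first k after which the next `window` relative cost
    drops are all below `min_rel_drop`, or None if no such k exists.

    ks and costs are parallel lists: costs[i] is the clustering cost
    obtained with ks[i] clusters (ks sorted in increasing order).
    """
    if len(ks) != len(costs) or len(ks) < window + 1:
        raise ValueError("need parallel lists with at least window + 1 entries")

    # rel_drop[j] = relative cost reduction when going from ks[j] to ks[j + 1]
    rel_drop = []
    for j in range(len(costs) - 1):
        if costs[j] == 0:
            rel_drop.append(0.0)  # cost cannot improve any further
        else:
            rel_drop.append((costs[j] - costs[j + 1]) / costs[j])

    for i in range(len(ks) - window):
        if all(d < min_rel_drop for d in rel_drop[i:i + window]):
            return ks[i]
    return None  # the cost never flattens out over the tested range


# Made-up costs for k = 1..6: the curve flattens after k = 3.
best_k = pick_elbow([1, 2, 3, 4, 5, 6], [1000, 600, 350, 330, 320, 315])
```

Returning `None` is one possible answer to the "if not... then what?" case: the caller can then widen the range of k, loosen the threshold, or fall back to another criterion.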







machine-learning python clustering k-means






edited Mar 15 at 19:43







user2816215













asked Mar 15 at 19:17









user2816215

  • Please don't cross post duplicates: stackoverflow.com/q/55188965/1060350
    – Anony-Mousse
    2 days ago












2 Answers
From my understanding of the silhouette score from the Wikipedia page, here is an implementation:

    import numpy as np

    def matching_dissimilarity(a, b):
        # number of attributes on which two categorical records differ
        return np.sum(a != b)

    def average_silhouette_score(m_array, cluster):
        # m_array: (n_records, n_attributes) array of categorical values
        # cluster: predicted cluster label for each record (at least 2 clusters)
        m_array = np.asarray(m_array)
        cluster = np.asarray(cluster)
        distinct_cluster_label_predictions = np.unique(cluster)
        silhouette = np.zeros(len(m_array))

        for i in range(len(m_array)):
            # dissimilarity from record i to every record (0 to itself)
            dist_i = np.sum(m_array != m_array[i], axis=1)

            in_cluster = cluster == cluster[i]
            n_own = in_cluster.sum()
            if n_own == 1:
                # a record alone in its cluster gets a silhouette of 0
                continue

            # a(i): average distance to the other records in its own cluster
            a = dist_i[in_cluster].sum() / (n_own - 1)

            # b(i): smallest average distance to the records of any other cluster
            b = min(dist_i[cluster == c].mean()
                    for c in distinct_cluster_label_predictions
                    if c != cluster[i])

            if a < b:
                sil = 1 - (a / b)
            elif a == b:
                sil = 0
            else:
                sil = b / a - 1

            silhouette[i] = sil

        return silhouette.mean()












  • No, b is defined differently.
    – Anony-Mousse
    2 hours ago
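For reference, the standard silhouette definition (as given on the Wikipedia page the answer cites) can be written for an arbitrary dissimilarity $d$ as below. The notation is an assumption introduced here for illustration: $C_I$ is the cluster containing record $i$, and $C_J$ ranges over the other clusters.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)
\qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

s(i) =
\begin{cases}
  \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} & \text{if } |C_I| > 1 \\[1ex]
  0 & \text{if } |C_I| = 1
\end{cases}
```

So $b(i)$ is a minimum over the other clusters of the mean dissimilarity to all records in that cluster. It is not based on a single nearest record, and it does not involve the records of $i$'s own cluster.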










Instead of trying to find a place to download some source code, why don't you just implement, e.g., Silhouette yourself?

Plenty of the code you find online in blogs and repos is broken.

I've seen so many GitHub repositories with bad code, and people like you wondering why it doesn't work. Relying on anonymous others not to have made mistakes is a bad idea. At some point you are better off writing the code yourself!

Of course it is okay to rely on large open-source projects like sklearn, R, ELKI, and Weka. These have code reviews and discussions of pull requests, and dozens of people look at the code, use it, and try to find and fix bugs (though even there, the code has errors).







answered 2 days ago









Anony-Mousse

  • The point of asking this question was to have someone verify my logic before I start implementing the code, which is why I posted some of my thoughts on how I could start. Or, if someone was already working on this problem, I could have discussed it with them. Please see: 'So I would assume that silhouette distance (in k-modes) would then measure how close/far the clusters are based on the distance metric defined by this dissimilarity and thus, establish the silhouette score.'
    – user2816215
    9 hours ago

  • Yes, Silhouette "just" needs pairwise distances. That may be too expensive to compute, but on small data this will work. It took me like 30 seconds to verify that all of R, sklearn, and ELKI will allow you to specify an arbitrary distance matrix... Why did you not check yourself?
    – Anony-Mousse
    5 hours ago

  • I didn't search for it yet, tbh. I was just trying to have someone confirm that what I was saying was right before getting down to the implementation part. I didn't search for the specifics. The thing is that my dataset is large, so pairwise distances will be expensive.
    – user2816215
    4 hours ago

  • Silhouette is defined on pairwise distances. So then don't use Silhouette. Read the definitions and documentation, please!
    – Anony-Mousse
    4 hours ago

  • Say I were using pairwise distances. I was looking into the calculation as defined here (scikit-learn.org/stable/modules/generated/…). How would we define 'nearest' here? What kind of distance metric would be a good choice? Equality of the whole vector? Or something like Hamming/Jaccard distance over the individual values of the vector? By that I mean: say we have ['apple', 'cloudy'] and ['mango', 'cloudy']. Would a dissimilarity measure such as the number of differing items work, which is 1 in this case? Or Jaccard, giving the sum of similarity of items?
    – user2816215
    4 hours ago
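The "number of differing items" measure from the last comment is the simple matching (Hamming-style) dissimilarity. A minimal sketch of it, and of the pairwise distance matrix the earlier comments refer to, is shown below. The function names are hypothetical, and the quadratic loop is only practical for small data or a sample of a larger dataset.

```python
def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def pairwise_dissimilarity(records):
    """Build the symmetric n x n matrix of matching dissimilarities."""
    n = len(records)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = matching_dissimilarity(records[i], records[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist


records = [
    ["apple", "cloudy"],
    ["mango", "cloudy"],
    ["mango", "sunny"],
]
dist = pairwise_dissimilarity(records)
# dist[0][1] is 1: the two records differ only in the first attribute.
```

A matrix like this is the kind of input a library silhouette routine can take when it accepts precomputed distances. For example, scikit-learn's `silhouette_score` has a `metric="precomputed"` option, and its `sample_size` parameter may help when the full matrix is too expensive.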

















  • $begingroup$
    The idea to ask this question was to have someone verify my logic before I start implementing the code, which is why I posted some of the thoughts I had on the ways I could start off. Or say, if someone was already working on this problem, I could have discussed this with them. Please see -- 'So I would assume that silhouette distance (in k-modes) would then measure how close/far the clusters are based on the distance metric defined by this dissimilarity and thus, establish the silhouette score.'
    $endgroup$
    – user2816215
    9 hours ago











  • $begingroup$
    Yes, Silhouette "just" needs pairwise distances. That may be too expensive to compute, but on small data this will work. It took me like 30 seconds to verify that all of R, sklearn, ELKI will allow you to specify an arbitrary distance matrix... Why did you not check yourself?
    $endgroup$
    – Anony-Mousse
    5 hours ago










  • $begingroup$
    I didn't search for it yet, tbh. I was just trying to have someone confirm if what I was saying was right before getting down to the implementation part. I didn't search for the specifics. The thing is that my dataset is large. Pairwise distances will be expensive.
    $endgroup$
    – user2816215
    4 hours ago











  • $begingroup$
    Silhouette ist defined on paiwise distances. So then don't use Silhouette. Read the definitions and documentation, please!
    $endgroup$
    – Anony-Mousse
    4 hours ago










  • $begingroup$
    Say, I was using pairwise distances. I was looking into the calculation as defined here (scikit-learn.org/stable/modules/generated/…). How would we define 'nearest' here? What kind of distance metric would be a good choice? Equality in terms of the vector? Or something like hamming/jaccard distance for each of the values of the vector? By that, I mean - say a column has ['apple', 'cloudy] and ['mango', 'cloudy'], would a dissimilarity measure say the sum of dissimilar items work? Say, 1 in this case? Or jaccard giving sum of similarity of items?
    $endgroup$
    – user2816215
    4 hours ago
















$begingroup$
The idea to ask this question was to have someone verify my logic before I start implementing the code, which is why I posted some of the thoughts I had on the ways I could start off. Or say, if someone was already working on this problem, I could have discussed this with them. Please see -- 'So I would assume that silhouette distance (in k-modes) would then measure how close/far the clusters are based on the distance metric defined by this dissimilarity and thus, establish the silhouette score.'
$endgroup$
– user2816215
9 hours ago





$begingroup$
The idea to ask this question was to have someone verify my logic before I start implementing the code, which is why I posted some of the thoughts I had on the ways I could start off. Or say, if someone was already working on this problem, I could have discussed this with them. Please see -- 'So I would assume that silhouette distance (in k-modes) would then measure how close/far the clusters are based on the distance metric defined by this dissimilarity and thus, establish the silhouette score.'
$endgroup$
– user2816215
9 hours ago













$begingroup$
Yes, Silhouette "just" needs pairwise distances. That may be too expensive to compute, but on small data this will work. It took me like 30 seconds to verify that all of R, sklearn, ELKI will allow you to specify an arbitrary distance matrix... Why did you not check yourself?
$endgroup$
– Anony-Mousse
5 hours ago




$begingroup$
Yes, Silhouette "just" needs pairwise distances. That may be too expensive to compute, but on small data this will work. It took me like 30 seconds to verify that all of R, sklearn, ELKI will allow you to specify an arbitrary distance matrix... Why did you not check yourself?
$endgroup$
– Anony-Mousse
5 hours ago












$begingroup$
I didn't search for it yet, tbh. I was just trying to have someone confirm if what I was saying was right before getting down to the implementation part. I didn't search for the specifics. The thing is that my dataset is large. Pairwise distances will be expensive.
$endgroup$
– user2816215
4 hours ago





$begingroup$
I didn't search for it yet, tbh. I was just trying to have someone confirm if what I was saying was right before getting down to the implementation part. I didn't search for the specifics. The thing is that my dataset is large. Pairwise distances will be expensive.
$endgroup$
– user2816215
4 hours ago













$begingroup$
Silhouette ist defined on paiwise distances. So then don't use Silhouette. Read the definitions and documentation, please!
$endgroup$
– Anony-Mousse
4 hours ago
















$begingroup$
Say I was using pairwise distances. I was looking into the calculation as defined here (scikit-learn.org/stable/modules/generated/…). How would we define 'nearest' here? What kind of distance metric would be a good choice? Equality in terms of the whole vector? Or something like Hamming/Jaccard distance over the values of the vector? By that I mean: say two records are ['apple', 'cloudy'] and ['mango', 'cloudy']. Would a dissimilarity measure such as the number of mismatching items work (1 in this case)? Or Jaccard, based on the overlap of items?
$endgroup$
– user2816215
4 hours ago
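The two candidates from this comment can be compared directly on the example records. This is a small sketch, assuming the matching (Hamming-style) dissimilarity counts mismatching columns and the Jaccard distance treats each record as an unordered set of values; the function names are illustrative.

```python
def matching_dissimilarity(x, y):
    """Count the positions (columns) where two records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of columns")
    return sum(xi != yi for xi, yi in zip(x, y))


def jaccard_distance(x, y):
    """1 - |intersection| / |union|, treating each record as a set of values."""
    sx, sy = set(x), set(y)
    union = sx | sy
    if not union:
        return 0.0
    return 1.0 - len(sx & sy) / len(union)


r1 = ["apple", "cloudy"]
r2 = ["mango", "cloudy"]
print(matching_dissimilarity(r1, r2))            # 1 mismatching column
print(matching_dissimilarity(r1, r2) / len(r1))  # 0.5 as a fraction of columns
print(jaccard_distance(r1, r2))                  # 1 - 1/3
```

Note the design difference: the matching dissimilarity compares values column by column, which is the dissimilarity k-modes itself uses, whereas the set-based Jaccard distance ignores which column a value came from.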





























$begingroup$

From my understanding of the silhouette score from the Wikipedia page, here is an implementation:



import numpy as np

def matching_dissimilarity(a, b):
    # number of attributes on which the records differ; a may be a 2-D array
    return np.sum(a != b, axis=-1)

def silhouette_values(m_array, cluster):
    # m_array: (n_records, n_attributes) array of categorical values
    # cluster: (n_records,) array of predicted labels, at least two distinct
    m_array = np.asarray(m_array)
    cluster = np.asarray(cluster)
    distinct_cluster_label_predictions = np.unique(cluster)
    silhouette = np.zeros(len(m_array))

    for i in range(len(m_array)):
        # distances from record i to every record (0 to itself)
        dist_i = matching_dissimilarity(m_array, m_array[i])

        # a(i): average distance to the other records in i's own cluster
        in_cluster = cluster == cluster[i]
        if in_cluster.sum() == 1:
            continue  # singleton cluster: s(i) = 0
        a = dist_i[in_cluster].sum() / (in_cluster.sum() - 1)

        # b(i): smallest average distance from i to the records of another cluster;
        # ties between clusters do not matter because only the minimum is used
        b = min(dist_i[cluster == c].mean()
                for c in distinct_cluster_label_predictions if c != cluster[i])

        if a < b:
            silhouette[i] = 1 - a / b
        elif a > b:
            silhouette[i] = b / a - 1
        # a == b leaves s(i) = 0

    return silhouette

# m_array and cluster come from the k-modes run
average_silhouette_score = silhouette_values(m_array, cluster).mean()











$endgroup$












  • $begingroup$
    No, b is defined differently.
    $endgroup$
    – Anony-Mousse
    2 hours ago
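For reference, the standard silhouette definition can be written as follows, with $d$ any pairwise dissimilarity (for example the matching dissimilarity), $C_I$ the cluster containing record $i$, and $C_J$ ranging over the other clusters. The symbol names are chosen here for illustration.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\ j \neq i} d(i, j),
\qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

s(i) =
\begin{cases}
  \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} & \text{if } |C_I| > 1 \\[1ex]
  0 & \text{if } |C_I| = 1
\end{cases}
```

So $b(i)$ is the smallest of the per-cluster mean distances from $i$ to each other cluster, not a distance to a single nearest record, and it is computed from records outside $i$'s own cluster.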























edited 2 hours ago






























answered 3 hours ago









user2816215



