The effect of duplicating certain vertex #34
Replies: 11 comments 4 replies
-
As far as I know the Kronecker result is useless here - it is a pretty abstract result that has little bearing on real-life graphs. In interesting examples such as yours (reflecting e.g. a gene duplication event) it does not even remotely apply.
This is a pretty good description, I would agree.
Not that I know of, really. There is a quite interesting mathematical structure associated with MCL iterands, but the questions around that structure are really hard. I prefer to approach this with a more abstracted line of reasoning. But first of all, what is your goal/motivation? Is MCL doing the wrong thing in some cases, splitting up things that should not be split up, or including nodes in a cluster that should not be included? The main tool we have is manipulating the input graph to represent our expectations, e.g. best-reciprocal-hit filtering and the precise definition of the edge weights. Still, I expect that things like the above will always happen. In protein paralogy/orthology I've seen cases in the past where locally the graph would look quite bipartite, and MCL would occasionally group together nodes that are not connected in the input graph. I would say such cases are at a level of detail that MCL cannot resolve. It is hard to nail down when/how MCL clusters nodes together or not; if it were easy, there would be no need for MCL.
-
Thanks for your quick response! Considering the complexity of proteins, I know it is unrealistic to expect that the resulting clustering of a set of proteins will be 100% biologically correct. I asked this question mainly because I noticed an input graph I sent to mcl running into a "flip-flop" state. After several iterations of the MCL process it takes the following form: [matrix after inflation, made stochastic (before expansion)]; [matrix after expansion]. Then a cycle starts, provided that a large inflation value is used, say larger than 5. At first, I thought this matrix is obviously different from the special circulant matrices mentioned in Chapter 6. Because I am not completely sure whether the above arguments are valid, I inspected the original input graph. I found that the vertices in the graph essentially correspond to three proteins:
Because the A's and C are competing for B in two quite different ways, I concluded that the reason the above-mentioned "cycle" appears is these duplicated proteins. And I started daydreaming about whether there is anything I can do with such proteins, whether or not they were found in another "cycle". For example, if we reduce duplicated vertices to a single supervertex, then both the clustering process and its interpretation become simpler. Thanks for reading my rather trivial motivation. I guess I am curious about what changes to the input graph can influence the clustering, and I don't really have a clear goal in mind. Do you have any further suggestions? You also mentioned that …
While daydreaming, I thought of a related elementary question: analogous to the notion of edge connectivity in graph theory, can we remove a minimum number of edges from Jn such that the MCL process yields two clusters? If the input is weighted and the max/min ratio is sufficiently small, can we make a similar assertion? I plan to study this question before the end of the summer holiday :)
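As an aside for readers, the MCL process referred to above alternates expansion with inflation. A minimal dense-matrix toy version (my own sketch, with no pruning and none of the real mcl machinery) is enough to watch iterands for convergence or flip-flop behaviour:

```python
import numpy as np

def mcl_iterate(M, inflation=2.0, iterations=50):
    """Toy MCL loop: expansion (matrix squaring) alternated with
    inflation (entry-wise power, then rescaling each column to sum 1)."""
    M = M / M.sum(axis=0)            # make columns stochastic
    for _ in range(iterations):
        M = M @ M                    # expansion
        M = M ** inflation           # inflation
        M = M / M.sum(axis=0)        # renormalise columns
    return M

# A path 0--1--2 with self-loops and equal weights.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
result = mcl_iterate(A)
```

Comparing successive iterands from this loop (rather than looking only at the final matrix) is one way to detect a period-2 cycle of the kind described above.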
-
Hello, thanks for the extensive response. It is very informative and interesting. After some fiddling I managed to make a small test case (which is probably irrelevant for you) for where this happens. The immediate questions I have now are: …
Meanwhile I'll think about your other questions/ideas and daydreaming :)
-
Hello, I attached the raw data below. The numerical weights look clumsy because the person who wrote the program to calculate protein similarity (not me) believes that pre-inflation is important (e.g. 390625 = 5**8). I used the ABC format so that the input matrix is symmetric. I didn't specify … By the way, vertices 1, 2, ..., 9 are the duplicated protein A I was referring to in my previous post. Vertices 16 and 17 are protein B and protein C. Vertex 10 is also needed, but its role is not clear. Removing any of them leads to convergence after a few iterands. Do you think it is due to pruning?
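For context, ABC format is one whitespace-separated `source target weight` edge per line. A small illustrative reader (my own helper, not part of mcl) that mirrors each edge so the resulting weight map is symmetric:

```python
def read_abc(lines):
    """Parse ABC-format edges ('src dst weight' per line) into a
    symmetric dict-of-dicts weight map; a missing weight defaults to 1.0."""
    graph = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue                 # skip blank/malformed lines
        src, dst = parts[0], parts[1]
        weight = float(parts[2]) if len(parts) > 2 else 1.0
        graph.setdefault(src, {})[dst] = weight
        graph.setdefault(dst, {})[src] = weight   # mirror the edge
    return graph

# e.g. edges carrying the pre-inflated weight 5**8 = 390625
g = read_abc(["1 16 390625", "16 17 390625"])
```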
-
This is very cool, thanks for sharing. It's something new to me: a flip-flop state that I would describe as a moving trampoline, one where under certain conditions the inflation value does not matter, provided it exceeds some minimum. It does not have to do with pruning. One can actually change the inflation value intermittently and the flip-flop state will adapt. I've written some simple code to compute these things without using matrices, and I aim to share it here soon.
-
I've added https://github.com/micans/mcl/blob/main/scripts/elastiflop.py .
-
Wow, thanks a lot for your lucid explanation. It really IS a flip-flop state! I spent a week studying floating-point arithmetic and imagining that powers of x' become 0; however, it would perhaps take billions of operations to influence the result. Now I see that the main tool needed here is a quadratic function! It's also interesting to note that we cannot generalize on the number of y's: just two, no more, no less. Still, I do not fully understand your remarks on delta; I need to study it later. Concerning putting my username in the script, I am fine with that. All in all, I would never have found this test case if the default option in mcl were discard-loop=n!
-
You're right, I made an example with 3 y's; it generalises in that direction. It may also generalise towards more columns with y's. Taking stock of two aspects mentioned earlier: …
-
Sorry for my late response. I haven't been able to read anything about Fortunato's paper yet, because I was still studying the paper David Matula published in 1972; there are many things to learn as a beginner. My plan for the first question was to enumerate small cases, say Jn with 3 ≤ n ≤ 6. The result shows that if we remove … In general, to avoid a graph splitting into more than one cluster, we need to prevent a single equivalence class from attracting all other vertices. And those that form a single cluster may also be studied in a symbolic way: by showing that throughout the MCL process each value in certain rows is always the maximum value of its column. Then inflation makes them relatively larger, and finally all other values in the column become 0. However, …
-
The input proteins A/B/C are all highly similar to each other, belonging to a certain region in Arabidopsis. I don't really know why there are so many genes there. See https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=ara&c=gene&a=fiche&l=AT5G36500 The score itself has been processed with pre-inflation according to an article published several years ago (doi.org/10.1186/s12859-018-2362-4). I quote from that article: …
Following the above procedure, the pairwise similarity is measured on the scale [0..100], then a constant, say 95, is subtracted from all edge weights. If a weight is now negative, it is converted to zero. For example, 100 → 5, 95.1 → 0.1, 50 → 0, etc. Finally, the program attached to the article applies pre-inflation=8 before sending the data (in …) to mcl. I also read in several other places that one may discard the edges below a threshold and regard the rest as having equal weight; in this way, Matula's method can be utilized as well. I think this has gone too far, sorry for the digression. In addition to the scoring, I agree that some pre-processing needs to be done. Nowadays people usually combine all the proteins from several related species together, build a similarity matrix, then send the matrix to software like mcl. Maybe they could first cluster each species separately (in case there are already duplicated proteins present), then construct a larger clustering using the clusters of individual species.
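The scoring transformation just described can be sketched as follows (the function name and defaults are mine; the cutoff 95 and pre-inflation exponent 8 are the values given above, and the result for a perfect score of 100 matches the 390625 = 5**8 weights in the attached data):

```python
def transform_score(similarity, cutoff=95.0, pre_inflation=8):
    """Shift a similarity in [0, 100] down by `cutoff`, clamp negative
    values to zero, then apply pre-inflation (raise to a power)."""
    shifted = max(similarity - cutoff, 0.0)
    return shifted ** pre_inflation

# Before pre-inflation: 100 -> 5, 95.1 -> 0.1, 50 -> 0 (as in the text).
# After pre-inflation=8: 100 -> 5**8 = 390625.
assert transform_score(100) == 5 ** 8
assert transform_score(50) == 0.0
```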
-
Renamed
-
Dear all,
I opened a thread last November to ask whether self-loops should be added to an input graph. Recently, I revisited my previous experience with protein clustering, and several new questions came to mind. Here is one of them:
By reading Chapter 6, basic MCL theory, I learnt about the direct product (Kronecker product).
As a special case, given a graph of n vertices and its adjacency matrix M1, if we duplicate each of these n vertices m times, with each additional duplicated vertex renamed and with their neighbors remaining the same, then the corresponding adjacency matrix is M1 ⊗ Jm. According to Lemma 5, the way the n vertices are clustered will remain the same, regardless of duplication.
However, what if only certain vertices, instead of all of them, are duplicated? In the case of protein clustering, there may be a gene duplication event resulting in two similar proteins. As a more concrete example, for a "line graph" (a path of 7 vertices), if we duplicate vertex 3 once,
0--1; 1--2; 2--3; 3--4; 4--5; 5--6; becomes
0--1; 1--2; 2--3A; 2--3B; 3A--3B; 3A--4; 3B--4; 4--5; 5--6;
(I am using '--' to represent two arcs between vertices; all self-loops are present, and all arcs and loops may be assumed to have equal weight.)
Informally, I guess that duplicated vertices can trap their unduplicated neighbors via extra cycles. But is there a more mathematical way of analyzing such cases?
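The uniform-duplication case above is easy to check numerically: duplicating every vertex m times, with copies in each other's neighbourhoods and self-loops present, gives the adjacency matrix M1 ⊗ Jm. A small numpy sketch (my own illustration, not from Chapter 6):

```python
import numpy as np

def duplicate_all(M1, m):
    """Adjacency matrix after duplicating every vertex of M1 m times:
    M1 (x) Jm, where Jm is the all-ones m-by-m matrix, so each copy
    keeps the original neighbours and the copies are mutually connected."""
    return np.kron(M1, np.ones((m, m)))

# Path 0--1--2 with self-loops and equal weights, each vertex duplicated once.
M1 = np.array([[1.0, 1.0, 0.0],
               [1.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]])
M = duplicate_all(M1, 2)
# Rows 0 and 1 (the two copies of vertex 0) are identical,
# reflecting that the copies have the same neighbourhoods.
```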
Thanks in advance for taking your valuable time considering this question!