The effect of duplicating certain vertex #34
Replies: 11 comments 4 replies
-
As far as I know the Kronecker result is useless here - it is a pretty abstract result that has little bearing on real-life graphs. In interesting examples such as yours (reflecting e.g. a gene duplication event) it does not even remotely apply.
This is a pretty good description, I would agree.
Not that I know of, really. There is a quite interesting mathematical structure associated with MCL iterands, but the questions around that structure are really hard. I prefer to approach this with a more abstracted line of reasoning. But first of all, what is your goal/motivation? Is MCL doing the wrong thing in some cases, splitting up things that should not be split up, or including nodes in a cluster that should not be included? The main tool we have is manipulating the input graph to represent our expectations, e.g. best-reciprocal-hit filtering and the precise definition of the edge weights. Still, I expect that things like the above will always happen. In protein paralogy/orthology I've seen cases in the past where locally the graph would look quite bipartite, and MCL would occasionally group together nodes that are not connected in the input graph. I would say such cases are at a level of detail that MCL cannot resolve. It is hard to nail down when/how MCL clusters nodes together or not; if it were easy, there would be no need for MCL.
-
Thanks for your quick response! Considering the complexity of proteins, I know it is unrealistic to expect that the resulting clustering of a set of proteins will be 100% biologically correct. I asked this question mainly because I noticed an input graph I sent to mcl running into a "flip-flop" state. After several iterations of the MCL process it takes the following form: [matrix after inflation, made stochastic (before expansion)]; [matrix after expansion]. Then a cycle starts, provided that a large inflation value is used, say larger than 5. At first, I thought this matrix is obviously different from the special circulant matrices mentioned in Chapter 6. Because I am not completely sure whether the above arguments are valid, I inspected the original input graph. I found that the vertices in the graph essentially correspond to three proteins:
Because the A's and C are competing for B in two quite different ways, I concluded that the reason the above-mentioned "cycle" appears is these duplicated proteins. And I started daydreaming about whether there is anything I can do with such proteins, whether or not they were found in another "cycle". For example, if we reduce duplicated vertices to a single supervertex, then both the clustering process and its interpretation become simpler. Thanks for reading my rather trivial motivation. I guess I am curious about what changes to the input graph can influence the clustering, and I don't really have a clear goal in mind. Do you have any further suggestions? You also mentioned that …
While daydreaming, I thought of a related elementary question: analogous to the notion of edge connectivity in graph theory, can we remove a minimum number of edges from Jn such that the MCL process yields two clusters? If the input is weighted and the max/min ratio is sufficiently small, can we make a similar assertion? I plan to study this question before the end of the summer holiday :)
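As an aside for readers, the MCL process referred to above alternates expansion with inflation. A minimal dense-matrix toy version (my own sketch, with no pruning and none of the real mcl machinery) is enough to watch iterands for convergence or flip-flop behaviour:

```python
import numpy as np

def mcl_iterate(M, inflation=2.0, iterations=50):
    """Toy MCL loop: expansion (matrix squaring) alternated with
    inflation (entry-wise power, then rescaling each column to sum 1)."""
    M = M / M.sum(axis=0)            # make columns stochastic
    for _ in range(iterations):
        M = M @ M                    # expansion
        M = M ** inflation           # inflation
        M = M / M.sum(axis=0)        # renormalise columns
    return M

# A path 0--1--2 with self-loops and equal weights.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
result = mcl_iterate(A)
```

Comparing successive iterands from this loop (rather than looking only at the final matrix) is one way to detect a period-2 cycle of the kind described above.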
-
Hello, thanks for the extensive response. It is very informative and interesting. After some fiddling I managed to make a small test case (which is probably irrelevant for you) for where this happens. The immediate questions I have now are: …
Meanwhile I'll think about your other questions/ideas and daydreaming :)
-
Hello, I attached the raw data below. The numerical weights look clumsy because the person who wrote the program to calculate protein similarity (not me) believes that pre-inflation is important (e.g. 390625 = 5**8). I used the ABC format so that the input matrix is symmetric. I didn't specify … By the way, vertices 1, 2, ..., 9 are the duplicated protein A I was referring to in my previous post. Vertices 16 and 17 are protein B and protein C. Vertex 10 is also needed, but its role is not clear. Removing any of them leads to convergence after a few iterands. Do you think it is due to pruning?
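For context, ABC format is one whitespace-separated `source target weight` edge per line. A small illustrative reader (my own helper, not part of mcl) that mirrors each edge so the resulting weight map is symmetric:

```python
def read_abc(lines):
    """Parse ABC-format edges ('src dst weight' per line) into a
    symmetric dict-of-dicts weight map; a missing weight defaults to 1.0."""
    graph = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue                 # skip blank/malformed lines
        src, dst = parts[0], parts[1]
        weight = float(parts[2]) if len(parts) > 2 else 1.0
        graph.setdefault(src, {})[dst] = weight
        graph.setdefault(dst, {})[src] = weight   # mirror the edge
    return graph

# e.g. edges carrying the pre-inflated weight 5**8 = 390625
g = read_abc(["1 16 390625", "16 17 390625"])
```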
-
This is very cool, thanks for sharing. It's something new to me: a flip-flop state that I would describe as a moving trampoline, one where under certain conditions the inflation value does not matter, provided it exceeds some minimum. It does not have to do with pruning. One can actually change the inflation value intermittently and the flip-flop state will adapt. I've written some simple code to compute these things without using matrices, and I aim to share it here soon.
-
I've added https://github.com/micans/mcl/blob/main/scripts/elastiflop.py .
-
Wow, thanks a lot for your lucid explanation. It really IS a flip-flop state! I spent a week studying floating-point arithmetic and imagining that powers of x' become 0; however, it would perhaps take billions of operations to influence the result. Now I see that the main tool needed here is a quadratic function! It's also interesting to note that we cannot generalize on the number of y's: just two, no more, no less. Still, I do not fully understand your remarks on delta; I need to study it later. Concerning putting my username in the script, I am fine with that. All in all, I would never have found this test case if the default option in mcl were discard-loop=n!
-
You're right, I made an example with 3 y's; it generalises in that direction. It may also generalise towards more columns with y's. Taking stock of two aspects mentioned earlier: …
-
Sorry for my late response. I haven't been able to read anything about Fortunato's paper yet, because I was still studying the paper David Matula published in 1972; there are many things to learn as a beginner. My plan for the first question was to enumerate small cases, say Jn with 3 ≤ n ≤ 6. The result shows that if we remove … In general, to avoid a graph splitting into more than one cluster, we need to prevent a single equivalence class from attracting all other vertices. And those that form a single cluster may also be studied in a symbolic way: by showing that throughout the MCL process each value in certain rows is always the maximum value of its column. Then inflation makes them relatively larger, and finally all other values in the column become 0. However, …
-
The input proteins A/B/C are all highly similar to each other, belonging to a certain region in Arabidopsis. I don't really know why there are so many genes there. See https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=ara&c=gene&a=fiche&l=AT5G36500 The score itself has been processed with pre-inflation according to an article published several years ago (doi.org/10.1186/s12859-018-2362-4). I quote from that article: …
Following the above procedure, the pairwise similarity is measured on the scale [0..100], then a constant, say 95, is subtracted from all edge weights. If a weight is now negative, it is converted to zero. For example, 100 → 5, 95.1 → 0.1, 50 → 0, etc. Finally, the program attached to the article applies pre-inflation=8 before sending the data (in …) to mcl. I also read in several other places that one may discard the edges below a threshold and regard the rest as having equal weight; in this way, Matula's method can be utilized as well. I think this has gone too far, sorry for the digression. In addition to the scoring, I agree that some pre-processing needs to be done. Nowadays people usually combine all the proteins from several related species together, build a similarity matrix, then send the matrix to software like mcl. Maybe they could first cluster each species separately (in case there are already duplicated proteins present), then construct a larger clustering using the clusters of individual species.
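The scoring transformation just described can be sketched as follows (the function name and defaults are mine; the cutoff 95 and pre-inflation exponent 8 are the values given above, and the result for a perfect score of 100 matches the 390625 = 5**8 weights in the attached data):

```python
def transform_score(similarity, cutoff=95.0, pre_inflation=8):
    """Shift a similarity in [0, 100] down by `cutoff`, clamp negative
    values to zero, then apply pre-inflation (raise to a power)."""
    shifted = max(similarity - cutoff, 0.0)
    return shifted ** pre_inflation

# Before pre-inflation: 100 -> 5, 95.1 -> 0.1, 50 -> 0 (as in the text).
# After pre-inflation=8: 100 -> 5**8 = 390625.
assert transform_score(100) == 5 ** 8
assert transform_score(50) == 0.0
```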
-
Renamed
-
Dear all,
I opened a thread last November to ask whether self-loops should be added to an input graph. Recently, I revisited my previous experience with protein clustering, and several new questions came to mind. Here is one of them:
By reading Chapter 6, basic MCL theory, I learnt about the direct product (Kronecker product).
As a special case, given a graph of n vertices and its adjacency matrix M1, if we duplicate each of these n vertices m times, with each additional duplicated vertex renamed and with their neighbors remaining the same, then the corresponding adjacency matrix is M1 ⊗ Jm. According to Lemma 5, the way the n vertices are clustered will remain the same, regardless of duplication.
However, what if only certain vertices, instead of all of them, are duplicated? In the case of protein clustering, there may be a gene duplication event resulting in two similar proteins. As a more concrete example, for a "line graph" (a path of 7 vertices), if we duplicate vertex 3 once,
0--1; 1--2; 2--3; 3--4; 4--5; 5--6; becomes
0--1; 1--2; 2--3A; 2--3B; 3A--3B; 3A--4; 3B--4; 4--5; 5--6;
(I am using '--' to represent two arcs between vertices; all self-loops are present, and all arcs and loops may be assumed to have equal weight.)
Informally, I guess that duplicated vertices can trap their unduplicated neighbors via extra cycles. But is there a more mathematical way of analyzing such cases?
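The uniform-duplication case above is easy to check numerically: duplicating every vertex m times, with copies in each other's neighbourhoods and self-loops present, gives the adjacency matrix M1 ⊗ Jm. A small numpy sketch (my own illustration, not from Chapter 6):

```python
import numpy as np

def duplicate_all(M1, m):
    """Adjacency matrix after duplicating every vertex of M1 m times:
    M1 (x) Jm, where Jm is the all-ones m-by-m matrix, so each copy
    keeps the original neighbours and the copies are mutually connected."""
    return np.kron(M1, np.ones((m, m)))

# Path 0--1--2 with self-loops and equal weights, each vertex duplicated once.
M1 = np.array([[1.0, 1.0, 0.0],
               [1.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]])
M = duplicate_all(M1, 2)
# Rows 0 and 1 (the two copies of vertex 0) are identical,
# reflecting that the copies have the same neighbourhoods.
```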
Thanks in advance for taking your valuable time considering this question!