A Needle in the Connectome: Neural ‘Fingerprint’ Identifies Individuals with ~93% accuracy

Much like we picture ourselves, we tend to assume that each individual brain is a bit of a unique snowflake. When running a brain imaging experiment it is common for participants or students to excitedly ask what can be revealed specifically about them given their data. Usually, we have to give a disappointing answer – not all that much, as neuroscientists typically throw this information away to get at average activation profiles set in ‘standard’ space. Now a new study published today in Nature Neuroscience suggests that our brains do indeed contain a kind of person-specific fingerprint, hidden within the functional connectome. Perhaps even more interesting, the study suggests that particular neural networks (e.g. frontoparietal and default mode) contribute the greatest amount of unique information to your ‘neuro-profile’ and also predict individual differences in fluid intelligence.

To test this, lead author Emily Finn and colleagues at Yale University analysed repeated sets of functional magnetic resonance imaging (fMRI) data from 126 subjects over 6 different sessions (2 rest, 4 task), derived from the Human Connectome Project. After dividing each participant’s brain data into 268 nodes (a technique known as “parcellation”), Finn and colleagues constructed matrices of the pairwise correlation between all nodes. These correlation matrices (below, figure 1b), which encode the connectome or connectivity map for each participant, were then used in a permutation-based decoding procedure to determine how accurately a participant’s connectivity pattern could be identified from the rest. This involved taking a vector of edge values (connection strengths) from a participant in the training set and correlating it with a similar vector sampled randomly with replacement from the test set (i.e. testing whether one participant’s data correlated with another’s). Pairs with the highest correlation were then labelled “1” to indicate that the algorithm assigned a matching identity between a particular train-test pair. The results of this process were then compared to a similar one in which both pairs and subject identity were randomly permuted.

Finn et al’s method for identifying subjects from their connectomes.
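For readers who prefer code, here is a minimal sketch of the identification step as the lead author clarifies it in the comments below: keep only the unique edges in the upper triangle of each connectivity matrix, correlate every target vector with every database vector, and pick the best match. The function names and the toy data generator are my own assumptions for illustration, not the authors' code, and the toy matrices are just symmetric random patterns rather than real fMRI correlations.

```python
import numpy as np

def edge_vector(conn):
    """Return the unique edges (upper triangle, diagonal excluded) of a symmetric matrix."""
    rows, cols = np.triu_indices(conn.shape[0], k=1)
    return conn[rows, cols]

def identify(target_mats, database_mats):
    """For each target matrix, return the index of the best-correlated database matrix."""
    targets = np.array([edge_vector(m) for m in target_mats])
    database = np.array([edge_vector(m) for m in database_mats])
    n_targets = len(targets)
    # np.corrcoef treats each row as a variable; the top-right block of the
    # result holds the target-versus-database Pearson correlations.
    corr = np.corrcoef(targets, database)[:n_targets, n_targets:]
    # The same database subject may be chosen for more than one target.
    return corr.argmax(axis=1)

def toy_sessions(n_subjects=10, n_nodes=20, noise=0.5, seed=0):
    """Hypothetical toy data: a stable per-subject pattern plus independent session noise."""
    rng = np.random.default_rng(seed)

    def symmetric(shape):
        a = rng.standard_normal(shape)
        return (a + np.swapaxes(a, -1, -2)) / 2

    stable = symmetric((n_subjects, n_nodes, n_nodes))
    day1 = stable + noise * symmetric((n_subjects, n_nodes, n_nodes))
    day2 = stable + noise * symmetric((n_subjects, n_nodes, n_nodes))
    return day1, day2

day1, day2 = toy_sessions()
predicted = identify(day1, day2)
accuracy = np.mean(predicted == np.arange(len(day1)))
print(f"identification accuracy on toy data: {accuracy:.2f}")
```

With 268 nodes the same `edge_vector` call yields (268 × 267) / 2 = 35,778 values per subject, which matches the edge count quoted from the paper.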

At first glance, the results are impressive:

Identification was performed using the whole-brain connectivity matrix (268 nodes; 35,778 edges), with no a priori network definitions. The success rate was 117/126 (92.9%) and 119/126 (94.4%) based on a target-database of Rest1-Rest2 and the reverse Rest2-Rest1, respectively. The success rate ranged from 68/126 (54.0%) to 110/126 (87.3%) with other database and target pairs, including rest-to-task and task-to-task comparisons.

This is a striking result – not only could identity be decoded from one resting state scan to another, but the identification also worked when going from rest to a variety of tasks and vice versa. Although classification accuracy dropped when moving between different tasks, these results were still highly significant when compared to the random shuffle, which only achieved a 5% success rate. Overall this suggests that inter-individual patterns in connectivity are highly reproducible regardless of the context from which they are obtained.
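The shuffle comparison can be sketched as a label-permutation test, following the description the lead author gives in the comments: the edge vectors are left intact, the subject labels are shuffled, and a trial counts as “correct” only if the predicted identity happens to match the shuffled one. Everything below (function names, toy data, the 1,000 permutations on a single database-target pair) is an assumption for illustration rather than the authors' pipeline.

```python
import numpy as np

def predict_identities(targets, database):
    """Index of the best-correlated database row for every target row."""
    n_targets = len(targets)
    corr = np.corrcoef(targets, database)[:n_targets, n_targets:]
    return corr.argmax(axis=1)

def permutation_null(predicted, n_perm=1000, seed=0):
    """Number of 'correct' identifications under randomly shuffled subject labels."""
    rng = np.random.default_rng(seed)
    n = len(predicted)
    return np.array([np.sum(predicted == rng.permutation(n)) for _ in range(n_perm)])

# Toy edge vectors: 126 hypothetical subjects, 500 edges, two sessions.
rng = np.random.default_rng(1)
n_subjects, n_edges = 126, 500
stable = rng.standard_normal((n_subjects, n_edges))
session_a = stable + 0.5 * rng.standard_normal((n_subjects, n_edges))
session_b = stable + 0.5 * rng.standard_normal((n_subjects, n_edges))

predicted = predict_identities(session_a, session_b)
observed = int(np.sum(predicted == np.arange(n_subjects)))
null_counts = permutation_null(predicted)
# Add-one correction so the empirical p value is never exactly zero.
p_value = (1 + np.sum(null_counts >= observed)) / (1 + len(null_counts))

print(f"observed correct: {observed}/{n_subjects}")
print(f"null mean: {null_counts.mean():.2f}, null max: {null_counts.max()}")
print(f"empirical p value: {p_value:.4f}")
```

Under a pure label shuffle each prediction matches its shuffled label with probability 1/126, so the expected number of chance hits is about one per run. That is worth keeping in mind when reading the 5% figure above, which the comments clarify was the maximum rate seen across permutations rather than the typical one.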

The authors then go on to perform a variety of crucial control analyses. For example, one immediate worry is that the high identification might be driven by head motion, which strongly influences functional connectivity and is likely to show strong within-subject correlation. Another concern might be that the accuracy is driven primarily by anatomical rather than functional features. The authors test both of these alternative hypotheses, first by applying the same decoding approach to an expanded set of root-mean square motion parameters and second by testing if classification accuracy decreased as the data were increasingly smoothed (which should eliminate or reduce the contribution of anatomical features). Here the results were also encouraging: motion was totally unable to predict identity, resulting in less than 5% accuracy, and classification accuracy remained essentially the same across smoothing kernels. The authors further compared their parcellation scheme with the more common and coarse-grained Yeo 8-network solution. This revealed that the coarser network division seemed to decrease accuracy, particularly for the fronto-parietal network, a decrease that was seemingly driven by increased reliability of the diagonal elements of the inter-subject matrix (which encode the intra-subject correlation). The authors suggest this may reflect the need for higher spatial precision to delineate individual patterns of fronto-parietal connectivity. Although this interpretation seems sensible, I do have to wonder if it conflicts with their smoothing-based control analysis. The authors also looked at how well they could identify an individual based on the variability of the BOLD signal in each region and found that although this was also significant, it showed a systematic decrease in accuracy compared to the connectomic approach.
This suggests that although at least some of what makes an individual unique can be found in activity alone, connectivity data is needed for a more complete fingerprint. In a final control analysis (figure 2c below), training simultaneously on multiple data sets (for example a resting state and a task, to control inherent differences in signal length) further increased accuracy to as high as 100% in some cases.

Finn et al; networks showing most and least individuality and contributing factors. Interesting to note that sensory areas are highly common across subjects whereas fronto-parietal and mid-line show the greatest individuality!

Having established the robustness of their connectome fingerprints, Finn and colleagues then examined how much each individual cortical node contributed to the identification accuracy. This analysis revealed a particularly interesting result: fronto-parietal and midline (‘default mode’) networks showed the highest contribution (above, figure 2a), whereas sensory areas appeared to not contribute at all. This complements their finding that the more coarse-grained Yeo parcellation greatly reduced the contribution of these networks to classification accuracy. Further still, Finn and colleagues linked the contributions of these networks to behavior, examining how strongly each network fingerprint predicted an overall index of fluid intelligence (g-factor). Again they found that fronto-parietal and default mode nodes were the most predictive of inter-individual differences in behaviour (in opposite directions, although I’d hesitate to interpret the sign of this finding given the global signal regression).

So what does this all mean? For starters this is a powerful demonstration of the rich individual information that can be gleaned from combining connectome analyses with high-volume data collection. The authors not only showed that resting state networks are highly stable and individual within subjects, but that these signatures can be used to delineate the way the brain responds to tasks and even behaviour. Not only is the study well powered, but the authors clearly worked hard to generalize their results across a variety of datasets while controlling for quite a few important confounds. While previous studies have reported similar findings in structural and functional data, I’m not aware of any this generalisable or specific. The task-rest signature alone confirms that both measures reflect a common neural architecture, an important finding. I could be a little concerned about other vasculature or breath-related confounds; the authors do remove such nuisance variables though, so this may not be a serious concern (though I am not convinced their use of global signal regression to control these variables is adequate). These minor concerns notwithstanding, I found the network-specific results particularly interesting; although previous studies indicate that functional and structural heterogeneity greatly increases along the fronto-parietal axis, this study is the first demonstration to my knowledge of the extremely high predictive power embedded within those differences. It is interesting to wonder how much of this stability is important for the higher-order functions supported by these networks – indeed it seems intuitive that self-awareness, social cognition, and cognitive control depend upon acquired experiences that are highly individual.
The authors conclude by suggesting that future studies may evaluate classification accuracy within an individual over many time points, raising the interesting question: Can you identify who I am tomorrow by how my brain connects today? Or am I “here today, gone tomorrow”?

Only time (and connectomics) may tell…



thanks to Kate Mills for pointing out this interesting PLOS ONE paper from a year ago (cited by Finn et al), that used similar methods and also found high classification accuracy, albeit with a smaller sample and fewer controls:




It seems there was a slight mistake in my understanding of the methods – see this useful comment by lead author Emily Finn for clarification:


corrections? comments? want to yell at me for being dumb? Let me know in the comments or on twitter @neuroconscience!

44 thoughts on “A Needle in the Connectome: Neural ‘Fingerprint’ Identifies Individuals with ~93% accuracy”

  1. Nice breakdown. I pretty much agree and think this is a landmark study for fMRI. Regarding global signal regression, if anything, this likely hurt the accuracy. Alternative methods better account for motion and physiological noise and improve within-subject reproducibility (e.g. ICA-AROMA: http://www.ncbi.nlm.nih.gov/pubmed/25770990). It’ll be neat to see how much more accurate we can get with these connectome fingerprints.

    • Thanks Aaron! And I agree, I can’t imagine vasculature would have a strong biasing effect here since they do ‘remove it’, but GSR is just a poor technique to do so. I agree that a more advanced technique like ICA or nuisance variable regression would have been a better choice here. Caveats notwithstanding, this paper is destined to become an instant classic. So cheers to Emily and her colleagues!

  2. Interesting paper! I’m a little worried about the anatomical variance possibility.

    I’m not sure you would predict that anatomical differences would be smoothed out? I mean, eventually they would, but why would you expect them to be more vulnerable to smoothing than the local differences in functional connectivity itself that you were interested in?

    In my view the way to solve this would be to first find a metric for “anatomical similarity” and then see if the fMRI connectivity can discriminate between people who happen to have high anatomical similarity. You could match up participants with their “brain doppelgängers” and then try to discriminate the two on the basis of fMRI. If the accuracy were substantially reduced by doing that, it would suggest anatomy is driving the effects.


    • Something I thought would be interesting -and also useful to sort this out- is to disentangle the contribution of anatomy from the contribution of functional activity by attempting to do the same analysis while subjects are deep in sleep or under anaesthesia. If this is driven by features related to the contents of their consciousness (as suggested by the fact that fronto-parietal regions drive this result) then it should not be possible to fingerprint them while unconscious. If you can do it, then it’s either a component of FC related to underlying structural connectivity, or directly the anatomy of the cortex itself.

      • I think this is a great idea. Of course, it’s generally a good point that there is a discrepancy in how much visual and frontal cortex contribute. This probably does rule out some anatomical factors after all. Considering how large (and heritable etc) visual cortex seems to be, perhaps it should play more of a role if it were only down to anatomy?

      • Hi Enzo. Would love to see someone try this under anesthesia! Our lab (http://www.sciencedirect.com/science/article/pii/S1053811909008015) and many others have shown that light anesthesia preferentially alters activity and connectivity within and between higher-order cortical regions, though not via this exact identification-style framework — so we might hypothesize that ID rates would drop as we put people further under. (Although it’s worth pointing out that while we emphasized the frontoparietal stuff in the paper since it performed the best for identification and gF prediction, all of the networks, even the ones primarily composed of primary sensory regions, still identified people at rates well above chance, so we likely wouldn’t expect the result to disappear entirely.) In any case, great future direction.

    • I concur. To me this smells a lot like anatomy is playing a role, which would make such high identification accuracy something to be expected. But then maybe I am just bitter because this was the trivial explanation for some extremely beautiful functional connectivity results I had recently… 😛

      Always beware of results that look too beautiful…

      • Hi Sam, thanks for raising these points. Hopefully my long-winded response below helps address some of them. I suppose the real question is not whether anatomy is playing SOME role — it almost certainly is — but whether function is adding substantial, non-redundant information about individual variability (which we believe it is).

        The backstory to this paper is that what we were originally trying to do with this data was take a sliding-window, dynamic connectivity approach to try to derive “brain states” across subjects by clustering connectivity matrices calculated from different points in the scans (an effort spearheaded by my co-first author, Xilin Shen). But when we fed all the data to the clustering algorithm, we kept on getting out subject instead of state. Hence this really very simple analysis showing that people always look most similar to themselves. Funny how the data can nudge you in a different direction…in any case, thanks for your interest!

        • Thanks for this reply as well! You’re right that anatomy and function should be related so this seems entirely reasonable. I also thank you for your honesty about the serendipity of the finding. I agree this is often how science progresses! (let’s see how NS will respond to that point… 😉

          I remain unsure about how much we should or shouldn’t have expected this result. But in the end, sometimes results that seem obvious in hindsight are the most important ones. And hindsight is very biased.

          I mocked up some simulations of your analysis. It looks quite interesting although I have yet to check it over and it’s currently very basic. Either way, it certainly supports your conclusions – it mainly speaks about what kind of effect sizes we can expect from results like this. I may share this if I get the chance to finish it!

          Anyway, thanks for the interesting work sparking this discussion!

          • Totally agree — we were actually less surprised by the result than by the fact that no one had shown it in quite this way before. Would love to see your simulations, please do share if and when you get a chance!

          • Hi again, so I have fiddled more with the simulation and I think it is valid, albeit very simplistic. Unfortunately I currently don’t have time to write a blog post about it. Maybe later. In any case what this suggests to me is that for the full set of 35,778 edges the effect size (i.e. the correlation between day 1 and day 2 for a given subject) must be quite small in order to produce around 90% accuracy. As typically happens when I simulate something, this now seems completely obvious to me from a theoretical view but I didn’t think of this before 😛

            Essentially the maximum correlation I seem to get in the simulation (that is, the one between each subject’s connectivity matrix from day 1 and the winning matrix from day 2) is around r=0.02, or around 0.04% of variance explained (but of course, because these are correlations with 35,776 degrees of freedom these are nonetheless extremely significant). Is this what you see also for the actual data?

            To me this would suggest that these ‘fingerprints’ are in fact extremely weak compared to the general variability of connection strengths in the human brain. The reason you nonetheless can pick them up so reliably is because of the large data sets.

            Again, I don’t think this invalidates the conclusions in any way. It just means that these stable signatures are actually extremely subtle compared to all the other stuff that is going on. A main caveat to my simulation (assuming I didn’t screw something major up) is that this is just using random data. As your reviewer already suggested, individual observations are not independent in your data and there are probably edges that are completely unrelated between days – in my data they are all related. Also, I didn’t use actual fMRI time series. A more sophisticated version of the simulation could probably try that. Anyway, I may try to write about this soon.
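A simulation along the lines described in this comment might look like the sketch below. It is not the commenter's actual code, which was never posted: it assumes pure Gaussian random data, one shared weak component per subject across two days, and a hypothetical `simulate_accuracy` helper whose parameter `r` sets the expected same-subject correlation between days.

```python
import numpy as np

def simulate_accuracy(n_subjects=126, n_edges=35778, r=0.02, seed=0):
    """Identification accuracy when same-subject edge vectors correlate at about r."""
    rng = np.random.default_rng(seed)
    stable = rng.standard_normal((n_subjects, n_edges))
    # Mixing weights chosen so each session has unit variance and the
    # expected within-subject correlation between days equals r (r in [0, 1]).
    w_signal, w_noise = np.sqrt(r), np.sqrt(1.0 - r)
    day1 = w_signal * stable + w_noise * rng.standard_normal((n_subjects, n_edges))
    day2 = w_signal * stable + w_noise * rng.standard_normal((n_subjects, n_edges))
    # Top-right block: day1 subjects (rows) against day2 subjects (columns).
    corr = np.corrcoef(day1, day2)[:n_subjects, n_subjects:]
    return float(np.mean(corr.argmax(axis=1) == np.arange(n_subjects)))

for r in (0.0, 0.02, 0.1):
    print(f"r = {r:.2f}: accuracy = {simulate_accuracy(r=r):.2f}")
```

Because every edge here is independent and equally (weakly) informative, the sketch shares the caveat raised above: real connectomes have correlated edges and a mix of stable and unstable connections, so the accuracy it prints says little about the effect sizes in real data.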

          • Hi Sam, sorry, am unable to reply directly to your latest comment below (WordPress thinks we’ve prattled on too long, apparently). In the observed data, the correlations between edge values from the same subject were actually much higher than r = 0.02: for the rest-rest comparison, they were around r = 0.5 – 0.8 for most subjects (you can see this in Fig. 4 of the paper). They dropped a bit for rest-task, but not that much. Perhaps it’s to do with random data vs actual fMRI time series?

          • Yeah, kind of bummed that WordPress is doing such a bad job with the long comment threads here. It may be related to the style I’m using; I will explore some other options. You guys are great!

          • Actually, it did manage to put my comment in the right spot after all. Though now I’m just compounding the problem by continuing to comment 😛

          • Micah: You can change the number of levels that are allowed in comments. Not sure what the maximum is. The system isn’t ideal in any case.

            Emily: Thanks for this. I’ll have a look at the study again. I guess it’s very different then and the simulation is still far too limited. It could be because unlike the simplistic random data I used not all edges are actually correlated in real brain data – but those that are correlate much more strongly? Anyway, I’ll have a look – will let you know if I get further 🙂

    • Hi Neuroskeptic, thanks for your comment.

      As to your smoothing question, not sure exactly what you mean by “local differences” in functional connectivity. The node-based analysis is already inherently a lot more smooth than a voxel-based analysis — these nodes are on the order of hundreds of voxels each (at voxel size 2mm isotropic). We average the signals across all the voxels in a node at each time point, then correlate that time-varying average signal with the average signal from all other nodes in the brain. In my mind, “local connectivity” would be something more akin to coherence across all the voxels in a single node, rather than correlation between nodes covering the entire brain.

      Anatomic differences are most likely to play a role at the “local” level — i.e., at the level of the segmentation itself, where we apply a single 268-node parcellation (defined in standard space) to all subjects. This is where anatomic idiosyncrasies as well as small errors inherent to the registration process could result in differing numbers of gray-matter voxels in a node across subjects, for example. So increasing the smoothing kernel should in theory smooth out some of these local anatomic influences at the node boundaries, but shouldn’t have much effect on the node-to-node functional connections, which are both long-range and “short”-range (but never shorter than a few centimeters, in the “most local” case of two neighboring nodes). Let me know if I misunderstood your point.

      As for the idea of comparing anatomic vs functional similarity, that would be quite interesting. While we didn’t directly test this here, one very interesting feature of this HCP data set — especially these first 126 subjects — is that most subjects had a twin, many of them identical twins, also in the set, and there were many instances of a pair of twins and another sibling. Within this sample, there were only 56 unique mothers. The HCP did this by design, of course, to look at the influence of heritability on all the brain and behavioral variables they’re collecting.

      Other studies have shown the heritability of structural features like regional volumes, gray matter density, cortical folding patterns, etc (here’s a decent review: http://www.ncbi.nlm.nih.gov/pubmed/17415783). So in theory we were making the identification problem as hard as possible on ourselves, by including the one (or two) people in the world that should look most similar to the person whose identity we’re trying to predict on any given trial. (We pointed this out in the Online Methods section, but looking back it would have been nice to emphasize it in the main text.)

      In cases where the predicted identity was wrong, we actually did not observe a strong trend such that the model was more likely to confuse people with their twin or sibling, rather than an unrelated subject. (We didn’t include this in the paper, since it was essentially a null result and we were strapped for space, but again looking back it would have been a nice point to include.) So indirect evidence suggesting that maybe anatomy is not playing a huge role.

      One last thing I’ll point out is that if anatomy were responsible for most of these effects, we wouldn’t expect the identification rates to change much based on the various task/rest conditions (since anatomy is constant regardless of task). In fact, the rates did change a fair amount, and it was harder to identify people when they were doing different tasks rather than just resting, even taking into account the shorter timecourses of the task runs. Our intuition is that imposing the same external task demands on everybody served to homogenize function to some degree across subjects (as one might hope or expect, given the first ~25 years of fMRI research), dampening the individual variability and making identification harder. Of course, the variance in success rates doesn’t necessarily mean that anatomy isn’t playing some role, but it does suggest that a substantial part of what the model is picking up on are the functional features. (We point this out in the BOLD variance analyses under “Effects of anatomical differences,” but it applies equally to the original connectivity analyses.)

      So, to conclude: I don’t know that we could ever claim to have removed anatomical effects entirely, but there seemed to be a lot of converging evidence to suggest that the functional data was providing meaningful information above and beyond the anatomy. Would love to see future work helping to further tease apart these features.

      • Thanks for your work and insights Emily.

        From a study we are running with siblings, we are also not seeing any above-chance similarity of anatomy between sibling pairs (we used BrainPrint to compute similarity of anatomy http://www.sciencedirect.com/science/article/pii/S1053811915000476) while instead we see similarity in function.

        One thing that I have not seen taken in consideration here is that the shape of the HRF seems to be mildly inheritable (http://www.ncbi.nlm.nih.gov/pubmed/26375212) which means that the signal autocorrelation might not be the same across subjects i.e. connections might stay stable across same subject because they went through the same filter (=same HRF). This is just my morning-coffee speculation however and someone should test this 🙂

        • Hi Enrico, thanks for your comment. Interesting that you don’t see above-chance similarity in anatomy between siblings.

          Also thanks for pointing out that HRF paper (looks like it came out the day after ours was accepted!). It’s certainly possible that something like this is contributing to the ID results at some level. Would love to see someone test this more formally in the HCP data – perhaps we’ll give it a go at some point. But the fact that we were not significantly more likely to mistake someone for their identical twin is not what you would predict if (1) the HRF were strongly heritable and (2) the HRF were accounting for the majority of our results. Still, I think this is quite interesting and worthy of investigation.

  3. Since you asked me what I think and Twitter is too bite-sized to continue there: I had a brief look and I don’t see anything blatantly unusual with the chance level. Presumably the chance level is 1/126=0.8%ish. They used a permutation approach though and that suggests that the true chance level is actually around 5%, presumably because of some confounding factors that have nothing to do with the functional connectivity patterns.

    On that note, as far as I understand they did not use resampling with replacement (which most people would call bootstrapping) but a permutation test where labels are shuffled, that is, resampling without replacement. It wouldn’t really make sense to use bootstrapping in this case.

    They do say they used “prediction with replacement” meaning that the classifier was allowed to predict that the same subject appeared more than once in the test set. This is standard procedure – presumably you could force the classifier to predict each participant only once but I have never seen anyone do that before. This has nothing to do with the permutation analysis though but is a general statement about the prediction.

    One thing that isn’t clear to me is whether they used the whole functional connectivity matrix for their analysis. If so then there should be redundant values and all matrices would have contained 1s in the diagonal, which would show up in the subsequent correlation analysis used for prediction. This might explain why the permutation chance level is so much higher than nominal chance? It seems to be an odd choice because it inflates classification performance – but it also inflates the chance level so it probably doesn’t matter. It’s also not clear this is what they did. They only say they used a vector of the matrix but not which values were included.

    • Thanks a lot Sam, I appreciate your more seasoned eye here. I think I got confused by the ‘prediction with replacement’ terminology. It is still not clear to me whether they are randomly sampling any vector of the matrix from the training set and testing on any vector from the rest, or are iteratively going through each and every vector or what. Presumably whatever they did will allow all vectors of the ~200 x ~200 matrix to contribute. Overall seems they used an OK if slightly confusing method.

      • It’s not worded very clearly. I assume what they mean is they vectorised the whole matrix (or perhaps only the one half that isn’t redundant) and compared that to the same vector from day 2.

        • Hi all, thanks for your interest. Apologies if this was unclear in the paper. We did vectorize the connectivity matrices for purposes of both the original identification analyses and the permutation tests. Each 268×268 matrix is symmetric with a diagonal of ones, as you point out, so we vectorize only the elements from the upper triangle for a total of 35,778 unique edges ( (268^2-268)/2 ). We then compare these 1×35,778 vectors between days/sessions. This is what was described in the first section of Results (“Whole-brain identification”; in the “Network-based identification,” we used subsets of this vector containing only the edges within each of the eight networks). And yes, the algorithm was allowed to predict the same subject for two (or more) different target matrices.

          In the permutation tests, we preserved the internal structure of these vectors themselves, and permuted subject label on the vectors, such that a trial was called “correct” if the predicted identity matched the pre-determined shuffled identity (calling subject 1 subject 2, for example).

          We probably should have presented more of these numbers in the paper, but the results of the permutation tests didn’t necessarily suggest that “chance” is 5% — rather that was the highest rate we achieved across the 2,000 permutation tests* (6/126 subjects, or 4.7%). This rate was only achieved 3 times out of 2,000 iterations. The median rate was actually 1/126 (or ~0.8%), and the mean was 0.9/126, which is what you’d expect if you calculate chance as 1/126.

          *1,000 permuted identities, 2,000 total trials (since for each permutation we exchanged the roles of database and target session)

          We included this additional control at the suggestion of an anonymous reviewer, who pointed out the following:
          “The authors suggest that the probability of obtaining 27 correct identifications by chance is smaller than 1e-29. I believe the authors assume a binomial distribution with 126 independent trials with probability of success to be 1/126. This doesn’t make sense because the trials are not independent (if the connectivity matrix of a subset of subjects look very similar, then it’s highly likely for many subjects within this subset to be misidentified). I suggest the authors perform a permutation test (by permuting the subjects’ identities) to obtain an empirical distribution and p value.”

          Please let me know if anything is still unclear.

          • Thanks for your response Emily, I will try to update the post ASAP to clarify this. Sorry for any confusion caused on my part.

          • No worries at all — it strikes me now that presenting these rates as percents (when they were actually discrete numbers of trials) was perhaps more confusing than enlightening. Glad I was able to clarify on this forum! Planning to respond to the other points raised here shortly — grateful for the opportunity.

          • 🙂 always nice to see that post publication peer review can be a constructive force! For now I’ve added a footnote to the post linking to your comment to alleviate any confusion.

          • Emily, thanks for your thorough response clarifying this! This all makes complete sense to me (especially using only the triangular matrix).

            Also thanks for clarifying the chance issue. Your anonymous reviewer was wise to suggest a permutation test I think. Not sure about independence of data but when unsure then it’s probably good to err on the side of caution. Either way, it makes sense that the median of the permutation tests is at chance level.

            Don’t worry about lack of detail – we can probably blame NN’s word limits for that?

  4. First, thanks, Micah, for your positive and thoughtful commentary on the paper! Really enjoyed reading this, and it’s lovely to see it generate so much good discussion. Grateful to bloggers like you for providing a forum for these types of post-publication interactions.

    Just to clarify something related to one of your (very minor) points, I don’t believe that the results of the BOLD smoothing kernel analyses and the comparison to the FreeSurfer/Yeo node atlas should be taken as contradictory. The former is attempting to control for the registration step, which could confer a preference between the same brain on two different days when applying the 268-node segmentation (for reasons that have nothing to do with function). The 4, 6, and 8mm smoothing kernels in theory are smoothing out some anatomic effects at node boundaries (partial volume, number of gray-matter voxels per node which could influence SNR, etc), but we are still using 268 nodes across the whole brain, as opposed to only 68 in the FS atlas. So these different types of “smoothing” are operating at substantially different scales. Even with increased local smoothing, using the higher-resolution atlas lets us detect more subtle connectivity differences (though it should be noted that identification based on the FS matrices was still quite accurate, and it would have been worrisome if it were not).

    And yes, the GSR debate still looms large…wouldn’t it be great if we could all just agree on one preprocessing strategy? (she asked non-ironically). We didn’t want to open that can of worms here, but I would certainly be interested to know how much preprocessing choices can affect identification accuracy. Although at some point I suppose it’s an empirical question, since if we ever get to the point where we are using this for a truly practical application, we’d want to just go with whatever works best…

    In any case, thanks again!

  5. great to see this really useful post and postpub exchange with authors and multiple commenters. (how about adding ratings? — which will be needed for full open evaluation.) i have a quibble with your title: the 93% is a meaningless number if you don’t also state the number of distractor classes (which is very impressive here). a p value would be more informative, though it does not drive the imagination of a broad audience to the same extent. the pairwise classification accuracy is a better thing to remember — for being more comparable among studies.

    people mentioned anatomy as a potential confound. one possibility is that the brains are actually functionally equivalent (at least at the level visible to fMRI), but the same set of functional regions is slightly differently located in each individual. as a consequence, functional correspondencies across datasets would be accurate within but not between subjects. it would be fun to randomise functional alignment locally within and between subjects to reduce this confound. did they already do something like this? how would we go about this?

    • Great comment, Niko :). You have a good point about the accuracy. The authors also report p-values from the permutation tests:

      ‘Thus the P value associated with obtaining at least 68 correct identifications (the minimum rate we achieved) is 0.’

      This is of course unsurprising considering the almost ceiling accuracy and very low chance level. But yes it would probably be a good idea to say ‘almost perfect classification’. Which is pretty much why I was wondering about what this could mean. In many situations, effect sizes of this magnitude are trivial and unsurprising results (say when your correlation is 1, the most likely explanation is that you correlated inches with centimeters or something similar…). That doesn’t seem to be the case here, at least I can’t work out any a priori reason why I would have expected that result. The simplistic simulation I made would suggest results like this for very weakly correlated connectivity matrices, but that doesn’t appear to match the empirical results. So this result highlights something we didn’t know before – of course that’s the whole point of science really.

      Regarding ratings, I know you love the idea of ratings. I personally dislike ratings (I hate doing them when I review as well). It’s odd, because clearly in this case we have all looked at the paper and so clicking a few extra buttons wouldn’t be too much work. But it feels like a drain to me… Anyway, this is probably a discussion better to be had elsewhere!

    • Hi Nikolaus, thanks for your comment. It’s true that idiosyncrasies in functional alignment could be contributing some, though it’s a bit difficult to pinpoint exactly how much. In the course of peer review we did do an analysis where we randomly perturbed the center of mass of each node and then drew a sphere of a certain radius around each new center of mass, then recalculated the matrices based on these random atlases. All together we created 300 new atlases (100 for each radius size: 6, 8 and 10mm; not guaranteed to cover the whole brain). Identification rates were essentially unchanged, which assured us that the results weren’t somehow specific to this particular parcellation. But this isn’t directly addressing your question — to do that I suppose we would need to use a slightly different atlas for each subject for each scan session, but ensure correspondence across atlases so that matrices could be meaningfully compared. Though I’m still not sure how exactly we’d interpret the results if we did see a drop in ID rate, since it’s hard to tease apart local functional organization (node boundaries) from more global organization (edge strengths)…

  6. Fantastic discussion here on a really stimulating paper. I hate to play the role of pedantic statistician, but here I go… As I told Emily in a direct communication, it is impossible to obtain an exact zero P-value. In parametric tests the P-value can numerically underflow to zero, but it can only truly be zero if the test statistic is infinite (that’s why it’s best practice to write e.g. “<0.001” instead of “0”). In permutation tests the smallest it can be is one over the number of permutations (1/nPerm), as nicely reviewed in the paper “Permutation P-values should never be zero…”: http://www.ncbi.nlm.nih.gov/pubmed/21044043

    Another interesting thread is the connection between 'fingerprinting' and reliability measured with intraclass correlation (ICC). Under a simple Gaussian model for a *univariate* measurement, I think ICC tells you everything you need to know about how successful your prediction will be. If you decided the best match would be determined by the squared difference between two univariate measurements at time 1 and time 2, the expected squared difference for the same subject (i.e. the correct match) is

    (1-ICC) * sigma^2

    and the expected squared difference between time 1 and time 2 for two different subjects (i.e. a wrong match) is

    ICC * sigma^2

    where sigma^2 is the variance over subjects at one time point.

    Thus if ICC = 0.5 you're screwed, and will never be able to detect matches, while if ICC approaches 1 you should be able to reliably make the match. Of course what you'd really like is error rates as a function of ICC, but this requires the *distribution* of the difference of two squared differences, which even in the Gaussian univariate case looks nasty.

    Now of course Emily & co used correlation, but I think this still applies as there's a tight link between squared difference and correlation as a distance measure (E(SqDiff) = 2(1-Corr)sigma^2). The main leap is from the univariate setting to a 35,778-vector, but my intuition is that the edgewise ICC averaged over edges probably tells you about the fingerprint accuracy limit.
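The two numbers in the paragraph above are easy to sanity-check. The sketch below is only an illustration in Python with NumPy, not anything taken from the paper: the function names, sample size, seed and the choice of sigma = 2 are all assumptions made for the example. It counts the unique edges in a symmetric 268-node matrix, then simulates pairs of equal-variance Gaussian measurements and compares their average squared difference with 2(1-Corr)sigma^2.

```python
import numpy as np


def n_unique_edges(n_nodes):
    """Entries in the strict upper triangle of a symmetric n x n matrix."""
    return n_nodes * (n_nodes - 1) // 2


def mean_squared_difference(corr, sigma, n_samples=200_000, seed=0):
    """Simulated E[(X - Y)^2] for two zero-mean Gaussians that share
    variance sigma^2 and have correlation corr (hypothetical toy model)."""
    rng = np.random.default_rng(seed)
    cov = sigma**2 * np.array([[1.0, corr], [corr, 1.0]])
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    return float(np.mean((xy[:, 0] - xy[:, 1]) ** 2))


# 268 nodes give 268 * 267 / 2 unique node pairs (edges).
print("edges:", n_unique_edges(268))

# Compare the simulated value with the closed form 2 * (1 - Corr) * sigma^2.
sigma = 2.0
for corr in (0.0, 0.5, 0.9):
    simulated = mean_squared_difference(corr, sigma)
    predicted = 2 * (1 - corr) * sigma**2
    print(f"corr={corr}: simulated={simulated:.3f}, predicted={predicted:.3f}")
```

The simulation only covers the univariate, equal-variance case; it says nothing about how edgewise ICC would translate into identification accuracy across all edges at once.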


    • That’s a good point about permutation p-values. I usually report them as p<1/nperm as you suggest but I wasn't sure if that was common. In fact, it probably would make sense to report the number of permutations alongside it? If you only did 10 permutations you may very well get a p=0 but you certainly shouldn't trust it. Reporting it as p<0.1 would still look misleading.

      Of course, in this case I would guess we can accept that the p-value is going to be tiny. Even if the test had 1,000,000 permutations you would presumably get a p of zero for 68 correct identifications.

        • P-values are defined as the probability of getting a value as extreme as, or more extreme than, that actually observed. As you should always be counting “that actually observed”, you *cannot* get P=0 if you’re computing P-values correctly.

        Here’s a worked example with just 20 (!) permutations: http://www.econ.uzh.ch/dam/jcr:ffffffff-b116-70a8-0000-0000722d019d/Nichols_SPM2012_2.pdf starting on slide 4, showing how 1/20 is the smallest possible P-value.

        • I know what a p-value is… But I don’t get what you’re saying. I assume you mean that you would count the actually observed parameter estimate, no matter what?

          • Actually I looked at your slides and I think I see what you’re saying now. If you do a full permutation then the proportion you get must be at least 1/nperm. I suppose the same must apply if you use bootstrapping. But doing a full set of all possible permutations is presumably rare. So you would say that even if your proportion (let’s not call it a p-value :P) is 0 you should nevertheless call it 1/nperm?

          • Right, the non-exhaustive case is exactly the point of that “Permutation P-values should never be zero…” paper: When you do a random subset of permutations, you need to always include the actual (unpermuted) test statistic in the permutation distribution, which ensures the minimum P-value is 1/nPerm. It turns out that if you don’t do that, you don’t get a valid test (i.e. the computed P-value will be less than 0.05 more than 5% of the time).
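The counting rule described above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the procedure used in the paper: the function name `permutation_p_value`, the toy two-group data and the use of a difference in means as the test statistic are all assumptions made for the example.

```python
import numpy as np


def permutation_p_value(observed, permuted_stats):
    """One-sided permutation P-value that counts the observed statistic
    as one extra member of the permutation distribution."""
    permuted_stats = np.asarray(permuted_stats, dtype=float)
    n_as_extreme = int(np.sum(permuted_stats >= observed))
    return (n_as_extreme + 1) / (permuted_stats.size + 1)


# Hypothetical toy data: two groups with clearly different means.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=3.0, size=30)
group_b = rng.normal(loc=0.0, size=30)
observed = group_a.mean() - group_b.mean()

# Random (non-exhaustive) permutations of the group labels.
pooled = np.concatenate([group_a, group_b])
n_random = 999
permuted = np.empty(n_random)
for i in range(n_random):
    shuffled = rng.permutation(pooled)
    permuted[i] = shuffled[:30].mean() - shuffled[30:].mean()

# The naive proportion can be exactly zero; the corrected value cannot.
naive = float(np.mean(permuted >= observed))
corrected = permutation_p_value(observed, permuted)
print("naive proportion:", naive)
print("corrected P-value:", corrected)
```

Because the observed statistic is always counted, the smallest value this function can return with B random permutations is 1/(B+1). With 10,000 random permutations and no permuted value reaching the observed one, that is 1/10,001 rather than zero.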

    • Hi Tom, thanks for your kind comments, here and over email. Of course, you’re right that the correct way to report the p-value from a permutation test like this one would have been 1/nPerm. Sorry for the oversight — noted for next time.

      I’d be very interested to see an edgewise ICC analysis between scan sessions in the HCP data. One can certainly also think of doing identification based on some other metric rather than correlation (including multivariate ones). Wouldn’t be surprised if someone can get this up to a 100% accuracy rate, at least over these fairly short inter-scan intervals of a day or two. Though my hunch is there’s a lot more information to be gleaned about time-varying connectivity from the errors of identification… my co-authors and I are hoping this spurs some interesting new analyses.

  7. Ah thanks that is very interesting! Is that also true for bootstrapping? I need to try that with my code… 🙂

    • That’s right. Though most papers/books will cast it as P = (c+1)/(B+1), where c is the number of bootstrap samples that exceed (or equal) the original test statistic and B is the number of bootstrap samples; the “+1”s implement the idea of including the original test statistic in the bootstrap distribution.

      Note, too, that the intuitive bootstrap that is usually used (i.e. resample with replacement, refit your model) is intended only for use in computing SEs and CIs. Bootstrap P-values, on the other hand, require a restricted bootstrap procedure that respects the null hypothesis. It’s hard to explain concisely and so I’d just refer you to your favourite text on the bootstrap.

      • Thanks, yes that makes sense of course. Most bootstrapping is actually on the effect size to give CIs. But in my pet procedure we are actually doing both, bootstrap the alternative and the null hypothesis. Ged Ridgeway pulled out some paper making the case one should only bootstrap the null – I guess because in that case it was about calculating p-values. I am not calculating p-values though so I don’t know if this makes any difference. I will check! Anyway, this is very off-topic so will end this here. Would be interested to talk more at some point though…

  8. Emily your correct p-value should be 0.000099990001, so by p-star convention you should only have 8 little stars next to your p-value.
