User:Jadrian Miles/Streamline clustering

From VrlWiki
Jump to navigation Jump to search

Plan

Tubegen generates an easy-to-parse .nocr file specifying points on streamlines.

  1. Pick a good dataset (Diffusion_MRI#Collaboration_Table) -- $G/data/diffusion/brown3t/cohen_hiv_study_registered.2007.02.07/patient120
  2. Run tubegen on it with modified parameters so it doesn't cull anything---this will result in ~100k curves, with an average of ~70 points per curve.
  3. Write a python script to divide the computation of the curve-to-curve distance matrix among many computers.
    • Try max and mean minimum point-to-curve distance in overlapping region as inter-curve distance measure. Or exponentially weighted mean a la cad.
      • See also cad's /map/gfx0/tools/linux/src/embed/utils/fast_distance_computing/src/ICurveDist/test
    • The per-curve script should return the assigned matrix line as well as a list of curves sorted by distance and annotated with the distance, for fast clustering.
    • After computing the upper half of the matrix, create an ordered list of curve-to-curve distances annotated with the curve pairs. Distributed w:quicksort? [1]
  4. Build up clusters until some termination condition: satisfactory number of non-singleton clusters, satisfactory median size of non-singleton clusters, etc. Or just run until you get one huge cluster, but store the binary cluster tree. It may be really skewed but maybe a tree rebalancing algorithm could help in post-processing.
    • Initialization: each curve is a singleton cluster.
    • A curve's distance to a cluster is the minimum distance to any curve in that cluster.
    • In each iteration, with lowest minimum curve-to-curve distance to its closest cluster.

Code

Files

  • tube_12.nocr (757MB) --- the full list of all the curves generated by tubegen.
  • matfiles --- a directory full of Matlab workspace files with random subsets of the full set of curves from tube_12.nocr stored as the variable lines.
  • hostlist --- a list of hosts I can run remote jobs on.

Distance Measurements

  • Core functions
    • dcc.m computes a single, symmetric distance between two curves, using pdcc.m.
    • pdcc.m computes a set of point-to-point distances between two curves, with a few customizable options.
    • dpc.m computes the distance from a single point to a curve.
  • Helpers
    • followCurve.m gives the point at a specified fractional index along a curve.
    • distOnCurve.m gives the distance along a curve between two fractional indices.
  • Graphics
    • drawpdcc.m plots two curves and the point-to-point matches found on them by pdcc.m.
    • drawpdccset.m plots all four variations of the asymmetric distance for two curves.

Distributed Distance Matrix

Here "priority queue" actually means a list of distances and corresponding curve indices that is sorted by increasing distance.

  • distanceMatrixRows.m computes several lines of the matrix and returns the corresponding priority queue
  • mergeSorted.m merges two priority queues together

Notes

  • There are 322885 curves, with an average of 65.6 vertices on them. 322885^2 * 4bytes = 388.3GB.
  • We only need to store the upper half of the distance matrix. A manageable subset of curves is 30000 (9.3%). (30000^2)/2 * 4bytes = 1.7GB.
  • Brain is ~100mm wide (X=LR), ~145mm long (Y=AP), ~100mm high (Z=IS).
  • No two curves should be closer than 0.5, and anything more than 50 away is basically infinitely far away (on the other side of the brain). These are loose bounds on our clustering threshold. Near-infinite distances also don't need to be stored in the priority queue.
  • About 35% of the distances computed are infinite.

References

  • Corouge, et al. Towards a shape model of white matter fiber bundles using diffusion tensor MRI.
  • Moberts, et al. Evaluation of fiber clustering methods for diffusion tensor imaging.