User:Jadrian Miles/Streamline clustering

Plan

Tubegen generates an easy-to-parse .nocr file specifying points on streamlines.

Pick a good dataset (Diffusion_MRI#Collaboration_Table) -- $G/data/diffusion/brown3t/cohen_hiv_study_registered.2007.02.07/patient120
Run tubegen on it with modified parameters so it doesn't cull anything---this will result in ~100k curves, with an average of ~70 points per curve.
Write a python script to divide the computation of the curve-to-curve distance matrix among many computers.
- Try max and mean minimum point-to-curve distance in overlapping region as inter-curve distance measure. Or exponentially weighted mean a la cad.
  - See also cad's /map/gfx0/tools/linux/src/embed/utils/fast_distance_computing/src/ICurveDist/test
- The per-curve script should return the assigned matrix line as well as a list of curves sorted by distance and annotated with the distance, for fast clustering.
- After computing the upper half of the matrix, create an ordered list of curve-to-curve distances annotated with the curve pairs. Distributed w:quicksort? [1]
Build up clusters until some termination condition: satisfactory number of non-singleton clusters, satisfactory median size of non-singleton clusters, etc. Or just run until you get one huge cluster, but store the binary cluster tree. It may be really skewed but maybe a tree rebalancing algorithm could help in post-processing.
- Initialization: each curve is a singleton cluster.
- A curve's distance to a cluster is the minimum distance to any curve in that cluster.
- In each iteration, with lowest minimum curve-to-curve distance to its closest cluster.

Code

Files

tube_12.nocr (757MB) --- the full list of all the curves generated by tubegen.
matfiles --- a directory full of Matlab workspace files with random subsets of the full set of curves from tube_12.nocr stored as the variable lines.
hostlist --- a list of hosts I can run remote jobs on.

Distance Measurements

Core functions
- dcc.m computes a single, symmetric distance between two curves, using pdcc.m.
- pdcc.m computes a set of point-to-point distances between two curves, with a few customizable options.
- dpc.m computes the distance from a single point to a curve.
Helpers
- followCurve.m gives the point at a specified fractional index along a curve.
- distOnCurve.m gives the distance along a curve between two fractional indices.
Graphics
- drawpdcc.m plots two curves and the point-to-point matches found on them by pdcc.m.
- drawpdccset.m plots all four variations of the asymmetric distance for two curves.

Distributed Distance Matrix

Here "priority queue" actually means a list of distances and corresponding curve indices that is sorted by increasing distance.

distanceMatrixRows.m computes several lines of the matrix and returns the corresponding priority queue
mergeSorted.m merges two priority queues together

Notes

There are 322885 curves, with an average of 65.6 vertices on them. 322885^2 * 4bytes = 388.3GB.
We only need to store the upper half of the distance matrix. A manageable subset of curves is 30000 (9.3%). (30000^2)/2 * 4bytes = 1.7GB.
Brain is ~100mm wide (X=LR), ~145mm long (Y=AP), ~100mm high (Z=IS).
No two curves should be closer than 0.5, and anything more than 50 away is basically infinitely far away (on the other side of the brain). These are loose bounds on our clustering threshold. Near-infinite distances also don't need to be stored in the priority queue.
About 35% of the distances computed are infinite.

References

Corouge, et al. Towards a shape model of white matter fiber bundles using diffusion tensor MRI.
Moberts, et al. Evaluation of fiber clustering methods for diffusion tensor imaging.

User:Jadrian Miles/Streamline clustering

Contents

Plan

Code

Files

Distance Measurements

Distributed Distance Matrix

Notes

References

Navigation menu

User:Jadrian Miles/Streamline clustering

Plan

Code

Files

Distance Measurements

Distributed Distance Matrix

Notes

References

Navigation menu

Search