User:Jadrian Miles/Streamline clustering: Difference between revisions

From VrlWiki
Jump to navigation Jump to search
Jadrian Miles (talk | contribs)
Jadrian Miles (talk | contribs)
Line 16: Line 16:


=== Files ===
=== Files ===
* <tt>tube_12.nocr</tt> (757MB) --- the full list of all the curves generated by <tt>tubegen</tt>.
* <tt>tube_12.nocr</tt> (757MB) --- the full list of all the curves generated by <tt>tubegen</tt>.  Fwiw, there are 322885 curves, with an average of 65.6 vertices on them.
* <tt>lines.mat</tt> (261MB) --- a Matlab workspace file with the variable <tt>lines</tt> in it, which is just a parsing of <tt>tube_12.nocr</tt>.
* <tt>lines.mat</tt> (261MB) --- a Matlab workspace file with the variable <tt>lines</tt> in it, which is just a parsing of <tt>tube_12.nocr</tt>.
* <tt>hostlist</tt> --- a list of hosts I can run remote jobs on.
* <tt>hostlist</tt> --- a list of hosts I can run remote jobs on.

Revision as of 23:18, 26 March 2009

Tubegen generates an easy-to-parse .nocr file specifying points on streamlines.

  1. Pick a good dataset (Diffusion_MRI#Collaboration_Table) -- $G/data/diffusion/brown3t/cohen_hiv_study_registered.2007.02.07/patient120
  2. Run tubegen on it with modified parameters so it doesn't cull anything---this will result in ~100k curves, with an average of ~70 points per curve.
  3. Write a python script to divide the computation of the curve-to-curve distance matrix among many computers.
    • Try max and mean minimum point-to-curve distance in overlapping region as inter-curve distance measure. Or exponentially weighted mean a la cad.
      • See also cad's /map/gfx0/tools/linux/src/embed/utils/fast_distance_computing/src/ICurveDist/test
    • The per-curve script should return the assigned matrix line as well as a list of curves sorted by distance and annotated with the distance, for fast clustering.
    • After computing the upper half of the matrix, create an ordered list of curve-to-curve distances annotated with the curve pairs. Distributed w:quicksort? [1]
  4. Build up clusters until some termination condition: satisfactory number of non-singleton clusters, satisfactory median size of non-singleton clusters, etc. Or just run until you get one huge cluster, but store the binary cluster tree. It may be really skewed but maybe a tree rebalancing algorithm could help in post-processing.
    • Initialization: each curve is a singleton cluster.
    • A curve's distance to a cluster is the minimum distance to any curve in that cluster.
    • In each iteration, with lowest minimum curve-to-curve distance to its closest cluster.

Code

Files

  • tube_12.nocr (757MB) --- the full list of all the curves generated by tubegen. Fwiw, there are 322885 curves, with an average of 65.6 vertices on them.
  • lines.mat (261MB) --- a Matlab workspace file with the variable lines in it, which is just a parsing of tube_12.nocr.
  • hostlist --- a list of hosts I can run remote jobs on.

If we stored the full distance matrix as single-precision floats, it would be about 400GB. Obviously something else needs to happen.

Distance Measurements

  • Core functions
    • dcc.m computes a single, symmetric distance between two curves, using pdcc.m.
    • pdcc.m computes a set of point-to-point distances between two curves, with a few customizable options.
    • dpc.m computes the distance from a single point to a curve.
  • Helpers
    • followCurve.m gives the point at a specified fractional index along a curve.
    • distOnCurve.m gives the distance along a curve between two fractional indices.
  • Graphics
    • drawpdcc.m plots two curves and the point-to-point matches found on them by pdcc.m.
    • drawpdccset.m plots all four variations of the asymmetric distance for two curves.