SUBDUE version 3.3 released 3/1/94

Contributors:

  Diane J. Cook,      University of Texas at Arlington
  Lawrence B. Holder, University of Texas at Arlington
  Surjani Djoko,      University of Texas at Arlington
  Tom Lai,            University of Texas at Arlington

Brief Description:

  The SUBDUE substructure discovery system searches for repetitive subgraphs
  in the input graph.  The search is guided by a minimum description length
  heuristic combined with various background knowledge rules for measuring
  specific properties of a substructure.  SUBDUE uses a computationally-
  bounded, branch-and-bound, inexact graph match to compare two graphs.
  SUBDUE can also be run iteratively, each time considering the graph
  constructed by compressing the graph from the previous iteration with the
  best substructure found during the previous iteration.  Most dimensions of
  SUBDUE can be controlled with command-line arguments described below.

  For more details, see the JAIR article by Cook and Holder entitled
  "Substructure Discovery Using Minimum Description Length and Background
  Knowledge".  The source code is fairly well documented, so the curious
  and courageous user may want to look there.

Installation:

  Just type 'make' or 'make subdue'.  A sample graph is contained in the
  file sample.g, and the output produced by the call 'subdue -nsubs 1
  -iterations 2 sample.g' is in the file sample.out.  The sample.g file
  is actually the rubber example from Figure 7 of the JAIR paper.

Command-Line Arguments:

   subdue [-limit #] [-iterations #] [-threshold #] [-trace #]
          [-n nsubs] [-noprune] [-beam #] [-overlap] [-undirect]
          [-con connectivity] [-com compactness] [-cov coverage]
          graphfile

   [-limit #] The number of different substructures considered during
          the discovery process.  Default is zero.  If this default
          value is used, the limit will actually be set to half the
          number of edges in the input graph.

   [-iterations #] The number of iterations made over the input graph in
          which the best substructure from the previous iteration is used
          to compress the graph for use in the next iteration.  Default
          is one, which implies only one pass, no compression.

   [-threshold #] The fraction of the size of an instance by which the
          instance can be different (according to the distortion
          costs) from the substructure definition.  I.e., the graphs
          match if matchcost(sub,inst) <= size(inst)*threshold.
          Default is 0.0, which implies graphs must match exactly.

   [-trace #] Sets the level of output produced by SUBDUE.  Zero is the
          default, which produces minimal output; namely, the best nsubs
          discovered substructures.  Trace=1 adds all substructures as
          they are considered by the discovery procedure.  Trace=2 adds
          the instances to printed substructures.  Trace=3 adds true
          tracing of some procedures (used mostly for debugging).

   [-nsubs #] Retains no more than the top # substructures, which are
          returned.  Only the best substructure is used for replacement.
          Default is 3.  The value must be greater than zero.

   [-noprune] SUBDUE will discard substructures with less value than their
          parent substructure.  Specifying this argument will prevent this.

   [-beam #] Limits the number of substructures waiting for consideration
          during the discovery process.  Default is 4.

   [-overlap] SUBDUE normally will not allow overlap among the instances
          of a substructure.  Specifying this argument will allow overlap.
          Note that when allowing overlap, compressing the instances of
          a substructure requires extra information indicating where the
          overlap occurs.  This tends to hinder compression due to the
          additional bits needed to describe the overlap.  Also, if overlap
          is allowed, substructures will have more instances, which tends
          to inflate their heuristic value.

   [-undirect] SUBDUE assumes that edges in the input graph file defined
          using 'e' (and edges created to describe overlapping graphs) are
          directed edges.  Specifying this argument makes these edges
          undirected.  Note that graph file edges defined with 'u' are
          always undirected, and edges defined with 'd' are always directed.

   [-con connectivity] The exponent to the connectivity rule value of a
          substructure and its instances.  The value measures the average
          number of connections between the instances and the rest of the
          input graph.  Default is 0.0, which implies this rule has no
          effect on the evaluation of a substructure.  A positive value
          favors substructures with low connectivity, and a negative
          value favors substructures with high connectivity.

   [-com compactness] The exponent to the compactness rule value of a
          substructure and its instances.  The value measures the average
          compactness of the instances.  Default is 0.0, which implies this
          rule has no effect on the evaluation of a substructure.  A positive
          value favors compact (closed) substructures, and a negative value
          favors less dense substructures.

   [-cov coverage] The exponent to the coverage rule value of a substructure
          and its instances.  The value measures the fraction of the input
          graph covered by the instances.  Default is 0.0, which implies this
          rule has no effect on the evaluation of a substructure.  A positive
          value favors substructures with high coverage, and a negative value
          favors substructures with low coverage.

Graph File Format:

  Each line of a graph file defines either a vertex or an edge.  Vertices
  must be defined before they are used in edges.  Vertex lines have the
  form

      v <#> <label>

  where <#> is any unique positive integer, and <label> is any string
  (limit 79 characters).  Edge lines have one of three forms:

     e <label> <v1> <v2>
     d <label> <v1> <v2>
     u <label> <v1> <v2>

  where <v1> and <v2> are the <#>'s for the corresponding vertices, and
  <label> is any string (limit 79 characters).  Edges beginning with 'e'
  are assumed directed unless the [-undirect] option is given, in which
  case all 'e' edges become undirected.  Edges beginning with 'd' are
  always directed, and edges beginning with 'u' are always undirected.
  Anything following a % character on a line is ignored.

  A sample graph (the rubber example from Figure 7 of the JAIR paper) is
  contained in the file sample.g.

Bound on Graph Match:

  The number of alternative mappings considered by SUBDUE's graph match is
  bounded to avoid the possibly exponential number of alternative mappings.
  Currently, this bound is set to n^4, which seems to work pretty well for
  the numerous graphs we have considered.  However, if you would like to
  change this bound, you need to modify the max_node(n) function in the file
  fm.c and re-make subdue.

To Dos:

  1. SUBDUE only considers instances of a substructure that are the same
     size are smaller than the substructure definition.  Ideally, SUBDUE
     should also consider instances larger than the substructure
     definition, but this makes the process of determining instances
     underconstrained.

  2. When SUBDUE uses a substructure to compress a graph, the instances
     of the graph are replaced by single vertices in the compressed graph.
     However, SUBDUE does not record the possible differences (allowed when 
     threshold > 0.0) between the instances and the substructure
     definition.  An option should be added so that these differences are
     recorded during compression.  This added information should also
     be included in the graph encoding.

  3. Currently, the MDL code uses NumLabels as the number of unique labels
     in the original graph.  However, NumLabels changes when compression
     adds new overlap labels.  This would inflate the value used by MDL.
     Ideally, MDL should use the inflated value of NumLabels after
     compression as long as the labels are removed (and NumLabels
     decreased) after MDL is done.

  4. SUBDUE assumes that both sender and receiver have the original list of
     labels, so now each label is specified with (lg u) bits, where u is
     the number of unique labels in the original graph.  This gets a little
     fuzzy when new edges are added to describe overlap, but some method of
     adding more bits for these new labels should be possible.

Questions, Comments, Bug Reports:

  Please send all enquiries about SUBDUE to either Diane Cook
  (cook@cse.uta.edu) or Larry Holder (holder@cse.uta.edu).  We welcome your
  comments.

Disclaimer:

     THIS SOURCE CODE IS SUPPLIED  ``AS IS'' WITHOUT WARRANTY OF ANY KIND, 
     AND ITS AUTHOR AND THE JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH 
     (JAIR) AND JAIR'S PUBLISHERS AND DISTRIBUTORS, DISCLAIM ANY AND ALL 
     WARRANTIES, INCLUDING BUT NOT LIMITED TO ANY IMPLIED
     WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, AND
     ANY WARRANTIES OR NON INFRINGEMENT.  THE USER ASSUMES ALL LIABILITY AND
     RESPONSIBILITY FOR USE OF THIS SOURCE CODE, AND NEITHER THE AUTHOR NOR
     JAIR, NOR JAIR'S PUBLISHERS AND DISTRIBUTORS, WILL BE LIABLE FOR 
     DAMAGES OF ANY KIND RESULTING FROM ITS USE.  Without limiting the 
     generality of the foregoing, neither the author, nor JAIR, nor JAIR's
     publishers and distributors, warrant that the Source Code will be 
     error-free, will operate without interruption, or will meet the needs 
     of the user.
