Title: | Size-Constrained Clustering |
---|---|
Description: | Provides wrappers for 'scclust', a C library for computationally efficient size-constrained clustering with near-optimal performance. See <https://github.com/fsavje/scclust> for more information. |
Authors: | Fredrik Savje [aut, cre], Michael Higgins [aut], Jasjeet Sekhon [aut] |
Maintainer: | Fredrik Savje <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.4.9000 |
Built: | 2024-10-25 06:03:03 UTC |
Source: | https://github.com/fsavje/scclust-r |
The scclust
package is an R wrapper for the scclust
library.
The package provides functions to construct near-optimal size-constrained
clusterings. Subject to user-specified constraints on the size and composition
of the clusters, scclust
constructs a clustering so that within-cluster
pair-wise distances are minimized.
The main clustering function is sc_clustering
. Statistics about
clusters can be derived with the get_clustering_stats
function. To check if a clustering satisfies some set of
constraints, use check_clustering
. Use scclust
to
construct a scclust
object from an existing clustering.
Clusters can also be constructed with hierarchical_clustering
.
However, this function does not support type constraints and does not provide
optimality guarantees. Its main use is to refine clusterings constructed with
the sc_clustering
function.
scclust
was made with large data sets in mind, and it can cluster tens
of millions of data points within minutes on an ordinary desktop computer.
See the package's website for more information: https://github.com/fsavje/scclust-R.
More information about the scclust
library is found here:
https://github.com/fsavje/scclust.
Bug reports and suggestions are greatly appreciated. They are best reported here: https://github.com/fsavje/scclust-R/issues.
Higgins, Michael J., Fredrik Sävje and Jasjeet S. Sekhon (2016), ‘Improving massive experiments with threshold blocking’, Proceedings of the National Academy of Sciences, 113:27, 7369–7376.
Sävje, Fredrik and Michael J. Higgins and Jasjeet S. Sekhon (2017), ‘Generalized Full Matching’, arXiv 1703.03882. https://arxiv.org/abs/1703.03882
check_clustering
checks whether a clustering satisfies constraints on
the size and composition of the clusters.
check_clustering( clustering, size_constraint = NULL, type_labels = NULL, type_constraints = NULL, primary_data_points = NULL )
check_clustering( clustering, size_constraint = NULL, type_labels = NULL, type_constraints = NULL, primary_data_points = NULL )
clustering |
a |
size_constraint |
an integer with the required minimum cluster size. If |
type_labels |
a vector containing the type of each data point. May be |
type_constraints |
a named integer vector containing type-specific size constraints. If
|
primary_data_points |
a vector specifying primary data points, either by point indices or with
a logical vector of length equal to the number of points.
|
Returns TRUE
if clustering
satisfies the constraints, and
FALSE
if it does not. Throws an error if clustering
is an
invalid instance of the scclust
class.
See sc_clustering
for details on how to specify the
type_labels
and type_constraints
parameters.
# Example scclust clustering my_scclust <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) # Check so each cluster contains at least two data points check_clustering(my_scclust, 2) # > TRUE # Check so each cluster contains at least four data points check_clustering(my_scclust, 4) # > FALSE # Data point types my_types <- factor(c("x", "y", "y", "z", "z", "x", "y", "z", "x", "x")) # Check so each cluster contains at least one point of each type check_clustering(my_scclust, NULL, my_types, c("x" = 1, "y" = 1, "z" = 1)) # > TRUE # Check so each cluster contains one data point of both "x" and "z" # and at least three points in total check_clustering(my_scclust, 3, my_types, c("x" = 1, "z" = 1)) # > TRUE # Check so each cluster contains five data points of type "y" check_clustering(my_scclust, NULL, my_types, c("y" = 5)) # > FALSE
# Example scclust clustering my_scclust <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) # Check so each cluster contains at least two data points check_clustering(my_scclust, 2) # > TRUE # Check so each cluster contains at least four data points check_clustering(my_scclust, 4) # > FALSE # Data point types my_types <- factor(c("x", "y", "y", "z", "z", "x", "y", "z", "x", "x")) # Check so each cluster contains at least one point of each type check_clustering(my_scclust, NULL, my_types, c("x" = 1, "y" = 1, "z" = 1)) # > TRUE # Check so each cluster contains one data point of both "x" and "z" # and at least three points in total check_clustering(my_scclust, 3, my_types, c("x" = 1, "z" = 1)) # > TRUE # Check so each cluster contains five data points of type "y" check_clustering(my_scclust, NULL, my_types, c("y" = 5)) # > FALSE
cluster_count
returns the number of clusters in a clustering.
cluster_count(clustering)
cluster_count(clustering)
clustering |
a |
Returns an integer with the number of clusters in clustering
.
# Example scclust clustering my_scclust <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) cluster_count(my_scclust) # > 3
# Example scclust clustering my_scclust <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) cluster_count(my_scclust) # > 3
get_clustering_stats
calculates statistics of a clustering.
get_clustering_stats(distances, clustering)
get_clustering_stats(distances, clustering)
distances |
a |
clustering |
a |
The function reports the following measures:
num_data_points |
total number of data points |
num_assigned |
number of points assigned to a cluster |
num_clusters |
number of clusters |
min_cluster_size |
size of the smallest cluster |
max_cluster_size |
size of the largest cluster |
avg_cluster_size |
average cluster size |
sum_dists |
sum of all within-cluster distances |
min_dist |
smallest within-cluster distance |
max_dist |
largest within-cluster distance |
avg_min_dist |
average of the clusters' smallest distances |
avg_max_dist |
average of the clusters' largest distances |
avg_dist_weighted |
average of the clusters' average distances weighed by cluster size |
avg_dist_unweighted |
average of the clusters' average distances (unweighed) |
Let denote the distance between data points
and
. Let
be a cluster containing the indices of points
assigned to the cluster. Let
be a function returning all within-cluster distances in . Let
be a set containing all clusters.
sum_dists
is defined as:
min_dist
is defined as:
max_dist
is defined as:
avg_min_dist
is defined as:
avg_max_dist
is defined as:
Let:
be the average within-cluster distance in cluster .
avg_dist_weighted
is defined as:
where is the number of assigned data
points (see above).
avg_dist_unweighted
is defined as:
Returns a list of class clustering_stats
containing the statistics.
my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), y = c(10, 9, 8, 7, 6, 10, 9, 8, 7, 6)) my_distances <- distances(my_data_points) my_scclust <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) get_clustering_stats(my_distances, my_scclust) # > Value # > num_data_points 10.0000000 # > num_assigned 10.0000000 # > num_clusters 3.0000000 # > min_cluster_size 3.0000000 # > max_cluster_size 4.0000000 # > avg_cluster_size 3.3333333 # > sum_dists 18.2013097 # > min_dist 0.5000000 # > max_dist 3.0066593 # > avg_min_dist 0.8366584 # > avg_max_dist 2.4148611 # > avg_dist_weighted 1.5575594 # > avg_dist_unweighted 1.5847484
my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), y = c(10, 9, 8, 7, 6, 10, 9, 8, 7, 6)) my_distances <- distances(my_data_points) my_scclust <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) get_clustering_stats(my_distances, my_scclust) # > Value # > num_data_points 10.0000000 # > num_assigned 10.0000000 # > num_clusters 3.0000000 # > min_cluster_size 3.0000000 # > max_cluster_size 4.0000000 # > avg_cluster_size 3.3333333 # > sum_dists 18.2013097 # > min_dist 0.5000000 # > max_dist 3.0066593 # > avg_min_dist 0.8366584 # > avg_max_dist 2.4148611 # > avg_dist_weighted 1.5575594 # > avg_dist_unweighted 1.5847484
hierarchical_clustering
serves two purposes. Its primary use is to
refine existing clusters subject to clustering constraints. That is, given
a (non-optimal) clustering satisfying some constraints, the function splits
clusters so to decrease within-cluster distances without violating the
constraints. The function can also be used to derive size-constrained
clusterings from scratch. In both cases, it uses a hierarchical clustering
algorithm.
hierarchical_clustering( distances, size_constraint, batch_assign = TRUE, existing_clustering = NULL )
hierarchical_clustering( distances, size_constraint, batch_assign = TRUE, existing_clustering = NULL )
distances |
a |
size_constraint |
an integer with the required minimum cluster size. |
batch_assign |
a logical indicating whether data points should be assigned in batches when spliting clusters (see below for details). |
existing_clustering |
a |
While hierarchical_clustering
can be used to derive size-constrained
clusters from scratch, its main purpose is to be used together with
sc_clustering
. The clusters produced by the main clustering
function are guaranteed to be close to optimal (in particular, within a
constant factor of the optimal solution). However, it is occasionally
possible to refine the clustering. In particular,
sc_clustering
tends produce large clusters in regions of the
metric space with many data points. In some cases, it is beneficial to divide
these clusters into smaller groups. hierarchical_clustering
splits
these larger clusters so that all within-cluster distances weakly decrease
while respecting the overall the size constraint.
hierarchical_clustering
implements a divisive hierarchical clustering
algorithm that respect size constraints. Starting from any clustering
satisfying the size constraints (which may be a single cluster containing
all data points), the function searches for clusters that can be broken into
two or more new clusters without violating the constraints. When such a
cluster is found, it breaks the cluster into two new clusters. It continues
in this fashion until all remaining clusters are unbreakable.
Breakable clusters are broken in three stages. First, it tries to find two data points as far as possible from each other. The two points are called centers, and they are the starting points for the new clusters. The remaining data points in the old cluster will be assigned to one of the centers. In the second stage, each center picks the closest data points so that the new two clusters exactly satisfy the size constraint. In the last stage, data points that still are in the old cluster are assigned to the cluster containing the closest center. The final stage is done either for each point one-by-one or in batches (see below). When all data points are assigned to a cluster, the old cluster is removed and the two new are added to the clustering. If one or both of the new clusters are breakable, they will go through the same procedure again.
In some applications, it is desirable to avoid clusters that contain a number
of data points that are not multiples of size_constraint
. After the
second stage described in the previous paragraph, both partial clusters are
exact multiples of the size constraint. By assigning remaining data points
in the third stage in batches of size_constraint
, the function ensures,
to the greatest extent possible, that the size of the final clusters are
multiples of the size constraint.
Returns a scclust
object with the derived clustering.
sc_clustering
is the main clustering function in the
package.
# Make example data my_data <- data.frame(x1 = rnorm(10000), x2 = rnorm(10000), x3 = rnorm(10000)) # Construct distance metric my_dist <- distances(my_data) # Make clustering with `sc_clustering` my_clustering <- sc_clustering(my_dist, 3) # Refine clustering with `hierarchical_clustering` my_refined_clustering <- hierarchical_clustering(my_dist, size_constraint = 3, existing_clustering = my_clustering) # Make clustering from scratch with `hierarchical_clustering` my_other_clustering <- hierarchical_clustering(my_dist, 3)
# Make example data my_data <- data.frame(x1 = rnorm(10000), x2 = rnorm(10000), x3 = rnorm(10000)) # Construct distance metric my_dist <- distances(my_data) # Make clustering with `sc_clustering` my_clustering <- sc_clustering(my_dist, 3) # Refine clustering with `hierarchical_clustering` my_refined_clustering <- hierarchical_clustering(my_dist, size_constraint = 3, existing_clustering = my_clustering) # Make clustering from scratch with `hierarchical_clustering` my_other_clustering <- hierarchical_clustering(my_dist, 3)
is.scclust
checks whether the provided object is a valid instance of
the scclust
class.
is.scclust(x)
is.scclust(x)
x |
object to check. |
is.scclust
does not check whether the clustering itself is sensible
or whether the clustering satisfies some set of constraints. See
check_clustering
for that functionality.
Returns TRUE
if x
is a valid scclust
object,
otherwise FALSE
.
sc_clustering
constructs near-optimal size-constrained clusterings.
Subject to user-specified constraints on the size and composition of the
clusters, a clustering is constructed so that within-cluster pair-wise
distances are minimized. The function does not restrict the number of
clusters.
sc_clustering( distances, size_constraint = NULL, type_labels = NULL, type_constraints = NULL, seed_method = "inwards_updating", primary_data_points = NULL, primary_unassigned_method = "closest_seed", secondary_unassigned_method = "ignore", seed_radius = NULL, primary_radius = "seed_radius", secondary_radius = "estimated_radius", batch_size = 100L )
sc_clustering( distances, size_constraint = NULL, type_labels = NULL, type_constraints = NULL, seed_method = "inwards_updating", primary_data_points = NULL, primary_unassigned_method = "closest_seed", secondary_unassigned_method = "ignore", seed_radius = NULL, primary_radius = "seed_radius", secondary_radius = "estimated_radius", batch_size = 100L )
distances |
a |
size_constraint |
an integer with the required minimum cluster size. |
type_labels |
a vector containing the type of each data point. May be |
type_constraints |
a named integer vector containing type-specific size constraints. If
|
seed_method |
a character scalar indicating how seeds should be selected. |
primary_data_points |
a vector specifying primary data points, either with point indices or with
a logical vector of length equal to the number of points. |
primary_unassigned_method |
a character scalar indicating how unassigned (primary) points should be assigned to clusters. |
secondary_unassigned_method |
a character scalar indicating how unassigned secondary points should be assigned to clusters. |
seed_radius |
a positive numeric scalar restricting the maximum length of an edge in the
graph used to construct the clustering. |
primary_radius |
a positive numeric scalar, a character scalar or NULL restricting the match
distance for unassigned primary points. If numeric, the value is used to
restrict the distances. If character, it must be one of "no_radius",
"seed_radius" or "estimated_radius" (see below for details). |
secondary_radius |
a positive numeric scalar, a character scalar or NULL restricting the match
distance for unassigned secondary points. If numeric, the value is used to
restrict the distances. If character, it must be one of "no_radius",
"seed_radius" or "estimated_radius" (see below for details). |
batch_size |
an integer scalar specifying batch size when |
sc_clustering
constructs a clustering so to minimize within-cluster
dissimilarities while ensuring that constraints on the size and composition
of the clusters are satisfied. It is possible to impose an overall size
constraint so that each cluster must contain at least a certain number of
points in total. It is also possible to impose constraints on the composition
of the clusters so that each cluster must contain a certain number of points
of different types. For example, in a sample with "red" and "blue" data
points, one can constrain the clustering so that each cluster must contain
at least 10 points in total of which at least 3 must be "red" and at least
2 must be "blue".
The function implements an algorithm that first summaries the distances between data points in a sparse graph and then constructs the clustering based on the graph. This admits fast execution while ensuring near-optimal performance. In particular, the maximum within-cluster distance is garantueed to be at most four times the maximum distance in the optimal clustering. The average performance is much closer to optimal than the worst case bound.
In more detail, the clustering algorithm has four steps:
Construct sparse graph encoding clustering constraints.
Select a set of vertices in the graph to act as "seeds".
Construct initial clusters from the seeds' neighborhoods in the graph.
Assign remaining data points to clusters.
Each data point is represented by a vertex in the sparse graph, and arcs are
weighted by the distance between the data points they connect. The graph is
constructed so that each vertex's neighborhood satisfies the clustering
constraints supplied by the user. For example, if the clusters must contain
at least five points, each closed neighborhood in the graph will contain five
vertices. sc_clustering
constructs the graph that minimize the arc
weights subject to the neighborhood constraints; this ensures near-optimal
performance. The function selects "seeds" that have non-overlapping
neighborhood in the graph. By constructing clusters as supersets of the
neighborhoods, it ensures that the clustering will satisfy the constraints
imposed by the user.
The seed_method
option governs how the seeds are chosen. Any set of
seeds yields a near-optimal clustering, but, heuristically, performance can
be improved by picking the seeds more carefully. In most cases, smaller
clusters are desirable since they tend to minimize within-cluster distances.
As the number of data points is fixed, we minimize the cluster size by
maximizing the number of clusters (and, thus, the number of seeds). When
seed_method
is set to "lexical", seeds are chosen in lexical order.
This admits a fast solution, but the function might pick points that are
central in the graph. Central points tend to exclude many other points from
being seeds and, thus, lead to larger clusters.
The "exclusion_order" and "exclusion_updating" options calculate for each vertex how many other vertices are excluded if the vertex is picked as a seed. By picking seeds that exclude few other vertices, the function avoids central points and increases the number of seeds. "exclusion_updating" updates the count after each picked seed so that already excluded vertices are not counted twice; "exclusion_order" derives the count once. The former option is, thus, better-performing but slower.
Deriving the exclusion count when using the "exclusion_order" and "exclusion_updating"
options is an expensive operation, and it might not be feasible to do so in
large samples. This is particularly problematic when the data points have
equidistant nearest neighbors, which tends to happen when the dataset contains
only discrete variables. It also happens when the dataset contains many
identical data points. The exclusion count operation may in these cases take
several orders of magnitude longer to run than the rest of the operations
combined. To ensure sane behavior for the default options, the exclusion
count is not used by default. If the dataset contains at least one continuous
variable, it is generally safe to call sc_clustering
with either
"exclusion_order" or "exclusion_updating", and this will often improve performance
over the default.
The "inwards_order" and "inwards_updating" options count the number of inwards-pointing arcs in the graph, which approximates the exclusion count. "inwards_updating" updates the count after each picked seed, while "inwards_order" derives the count once. The inwards counting options work well with nearly all types of data, and "inwards_updating" is the default.
The "batches" option is identical to "lexical" but it derives the graph in
batches. This limits the use of memory to a value proportional to
batch_size
irrespectively of the size of the dataset. This can be
useful when imposing large size constraints, which consume a lot of memory
in large datasets. The "batches" option is still experimental and can currently
only be used when one does not impose type constraints.
Once the function has selected seeds so that no additional points can be
selected without creating overlap in the graph, it constructs the initial
clusters. A unique cluster label is assigned to each seed, and all points
in the seed's neighborhood is assigned the same label. Some
data points might not be in a neighborhood of a seed and will, therefore,
not be assigned to a cluster after the initial clusters have been formed.
primary_unassigned_method
specifies how the unassigned points are
assigned. With the "ignore" option, the points are left unassigned. With the
"any_neighbor", the function re-uses the sparse graph and assigns the
unassigned points to any adjacent cluster in the graph. The "closest_assigned"
option assigns the points to the clusters that contain their closest assigned
vertex, and the "closest_seed" option assigns to the cluster that contain
the closest seed. All these options ensure that the clustering is near-optimal.
Occasionally, some data points are allowed to be
left unassigned. Consider the following prediction problem as an example. We
have a set of data points with a known outcome value ("training points") and
another set of points for which the outcome is unknown ("prediction points").
We want to use the training points to predict the outcome for the prediction
points. By clustering the data, we can use the cluster mean of the training
points to predict the values of the prediction points in the same cluster. To
make this viable, we need to ensure that each cluster contains at least one
training point (the cluster mean would otherwise be undefined). We can impose
this constraint using the type_labels
and type_constraints
options. We also need to make sure that each prediction point is assigned to
a cluster. We do, however, not need that all training points are assigned a
cluster; some training points might not provide useful information (e.g.,
if they are very dissimilar to all prediction points). In this case, by
specifying only the prediction points in primary_data_points
, we
ensure that all those points are assigned to clusters. The points not
specified in primary_data_points
(i.e., the training points) will
be assigned to clustering only insofar that it is needed to satisfy the
clustering constraints. This can lead to large improvements in the clustering
if the types of points are unevenly distributed in the metric space. Points
not specified in primary_data_points
are called "secondary".
Generally, one does not want to discard all unassigned secondary points –
some of them will, occasionally, be close to a cluster and contain useful
information. Similar to primary_unassigned_method
, we can use the
secondary_unassigned_method
to specify how the leftover secondary
points should be assigned. The three possible options are "ignore",
"closest_assigned" and "closest_seed". In nearly all cases, it is beneficial
to impose a radius constraint when assigning secondary points (see below for
details).
sc_clustering
tries to minimize within-cluster distances subject to
that all (primary) data points are assigned to clusters. In some cases, it
might be beneficial to leave some points unassigned to avoid clusters with
very dissimilar points. The function has options that can be used to
indirectly constrain the maximum within-cluster distance. Specifically, by
restricting the maximum distance in the four steps of the function, we can
bound the maximum within-cluster distance. The bound is, however, a blunt
tool. A too small bound might lead to that only a few data points are
assigned to clusters. If a bound is needed, it is recommended to use a very
liberal bound. This will avoid the very dissimilar clusters, but let the
function optimize the remaining cluster assignments.
The seed_radius
option limits the maximum distance between adjacent
vertices in the sparse graph. If the distance to a neighbor in a point's
neighborhood is greater than seed_radius
, the neighborhood will be
deleted and the point is not allowed be a seed (it can, however, still be
in other points' neighborhoods). The primary_radius
option limits
the maximum distance when a primary point is assigned to in the fourth step.
When primary_radius
is set to a positive numeric value, this will be
used to as the radius restriction. "no_radius" and NULL
indicate no
restriction. "seed_radius" indicates that the same restriction as
seed_radius
should be used, and "estimated_radius" sets the
restriction to the estimated average distance between the seeds and their
neighbors. When primary_unassigned_method
is set to "any_neighbor",
primary_radius
must be set to "seed_radius".
The way the radius constraints restrict the maximum within-cluster distance
depends on the primary_unassigned_method
option. When
primary_unassigned_method
is "ignore", the maximum distance is bounded
by 2 * seed_radius
. When primary_unassigned_method
is
"any_neighbor", it is bounded by 4 * seed_radius
. When
primary_unassigned_method
is "closest_assigned", it is bounded by
2 * seed_radius
+ 2 * primary_radius
. When
primary_unassigned_method
is "closest_seed", it is bounded by the
maximum of 2 * seed_radius
and 2 * primary_radius
.
The secondary_radius
option restricts how secondary points are
assigned, but is otherwise identical to the primary_radius
option.
Returns a scclust
object with the derived clustering.
Higgins, Michael J., Fredrik Sävje and Jasjeet S. Sekhon (2016), ‘Improving massive experiments with threshold blocking’, Proceedings of the National Academy of Sciences, 113:27, 7369–7376.
Sävje, Fredrik and Michael J. Higgins and Jasjeet S. Sekhon (2017), ‘Generalized Full Matching’, arXiv 1703.03882. https://arxiv.org/abs/1703.03882
hierarchical_clustering
can be used to refine the clustering
constructed by sc_clustering
.
# Make example data my_data <- data.frame(id = 1:50000, type = factor(rbinom(50000, 3, 0.3), labels = c("A", "B", "C", "D")), x1 = rnorm(50000), x2 = rnorm(50000), x3 = rnorm(50000)) # Construct distance metric my_dist <- distances(my_data, id_variable = "id", dist_variables = c("x1", "x2", "x3")) # Make clustering with at least 3 data points in each cluster my_clustering <- sc_clustering(my_dist, 3) # Check so clustering satisfies constraints check_clustering(my_clustering, 3) # > TRUE # Get statistics about the clustering get_clustering_stats(my_dist, my_clustering) # > num_data_points 5.000000e+04 # > ... # Make clustering with at least one point of each type in each cluster my_clustering <- sc_clustering(my_dist, type_labels = my_data$type, type_constraints = c("A" = 1, "B" = 1, "C" = 1, "D" = 1)) # Check so clustering satisfies constraints check_clustering(my_clustering, type_labels = my_data$type, type_constraints = c("A" = 1, "B" = 1, "C" = 1, "D" = 1)) # > TRUE # Make clustering with at least 8 points in total of which at least # one must be "A", two must be "B" and five can be any type my_clustering <- sc_clustering(my_dist, size_constraint = 8, type_labels = my_data$type, type_constraints = c("A" = 1, "B" = 2))
# Make example data my_data <- data.frame(id = 1:50000, type = factor(rbinom(50000, 3, 0.3), labels = c("A", "B", "C", "D")), x1 = rnorm(50000), x2 = rnorm(50000), x3 = rnorm(50000)) # Construct distance metric my_dist <- distances(my_data, id_variable = "id", dist_variables = c("x1", "x2", "x3")) # Make clustering with at least 3 data points in each cluster my_clustering <- sc_clustering(my_dist, 3) # Check so clustering satisfies constraints check_clustering(my_clustering, 3) # > TRUE # Get statistics about the clustering get_clustering_stats(my_dist, my_clustering) # > num_data_points 5.000000e+04 # > ... # Make clustering with at least one point of each type in each cluster my_clustering <- sc_clustering(my_dist, type_labels = my_data$type, type_constraints = c("A" = 1, "B" = 1, "C" = 1, "D" = 1)) # Check so clustering satisfies constraints check_clustering(my_clustering, type_labels = my_data$type, type_constraints = c("A" = 1, "B" = 1, "C" = 1, "D" = 1)) # > TRUE # Make clustering with at least 8 points in total of which at least # one must be "A", two must be "B" and five can be any type my_clustering <- sc_clustering(my_dist, size_constraint = 8, type_labels = my_data$type, type_constraints = c("A" = 1, "B" = 2))
The scclust
function constructs a scclust
object from existing
cluster labels.
scclust(cluster_labels, unassigned_labels = NULL, ids = NULL)
scclust(cluster_labels, unassigned_labels = NULL, ids = NULL)
cluster_labels |
a vector containing each data point's cluster label. |
unassigned_labels |
labels that denote unassigned data points. If |
ids |
IDs of the data points. Should be a vector of the same length as
|
scclust
does not derive clusters from sets of data points; see
sc_clustering
and hierarchical_clustering
for
that functionality.
Returns a scclust
object with the clustering described by the
provided labels.
# 10 data points in 3 clusters my_scclust1 <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) # 8 data points in 3 clusters, 2 points unassigned my_scclust2 <- scclust(c(1, 1, 2, 3, 2, NA, 3, 1, NA, 2)) # Custom labels indicating unassiged points my_scclust3 <- scclust(c("A", "A", "B", "C", "NONE", "C", "C", "NONE", "B", "B"), unassigned_labels = "NONE") # Two different labels indicating unassiged points my_scclust4 <- scclust(c("A", "A", "B", "C", "NONE", "C", "C", "0", "B", "B"), unassigned_labels = c("NONE", "0")) # Custom data point IDs my_labels5 <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B"), ids = letters[1:10])
# 10 data points in 3 clusters my_scclust1 <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")) # 8 data points in 3 clusters, 2 points unassigned my_scclust2 <- scclust(c(1, 1, 2, 3, 2, NA, 3, 1, NA, 2)) # Custom labels indicating unassiged points my_scclust3 <- scclust(c("A", "A", "B", "C", "NONE", "C", "C", "NONE", "B", "B"), unassigned_labels = "NONE") # Two different labels indicating unassiged points my_scclust4 <- scclust(c("A", "A", "B", "C", "NONE", "C", "C", "0", "B", "B"), unassigned_labels = c("NONE", "0")) # Custom data point IDs my_labels5 <- scclust(c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B"), ids = letters[1:10])