Title: | Clustering Longitudinal Trajectories |
---|---|
Description: | Clusters longitudinal trajectories over time (can be unequally spaced, unequal length time series and/or partially overlapping series) on a common time axis. Performs k-means clustering on a single continuous variable measured over time, where each mean is defined by a thin plate spline fit to all points in a cluster. Distance is MSE across trajectory points to cluster spline. Provides graphs of derived cluster splines, silhouette plots, and Adjusted Rand Index evaluations of the number of clusters. Scales well to large data with multicore parallelism available to speed computation. |
Authors: | George Ostrouchov [aut, cre], David Gagnon [aut], Hanna Gerlovin [aut], Chen Wei-Chen [ctb], Schmidt Drew [ctb], Oak Ridge National Laboratory [cph], U.S. Department of Veteran's Affairs [fnd] (Project: Million Veteran Program Data Core) |
Maintainer: | George Ostrouchov <[email protected]> |
License: | BSD 2-clause License + file LICENSE |
Version: | 0.2.1 |
Built: | 2024-10-29 04:19:57 UTC |
Source: | https://github.com/go-ski/clustra |
Clusters trajectories (unequally spaced and unequal length time series) on a common time axis. Clustering proceeds by an EM algorithm that iterates switching between fitting a thin plate spline (TPS) to combined responses within each cluster (M-step) and reassigning cluster membership based on the nearest fitted TPS (E-step). Initial cluster assignments are random or distant trajectories. The fitting is done with the mgcv package function bam, which scales well to very large data sets. Additional parallelism available via multicore on unix and mac platforms.
This research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by award No.~MVP000. This research used resources from the Knowledge Discovery Infrastructure (KDI) at Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725.
This research used resources of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
George Ostrouchov, David Gagnon, Hanna Gerlovin
Runs RandIndex
for all pairs of cluster results in its
list input and produces a matrix for use by rand_plot
.
Understands replicates within k
values.
allpair_RandIndex(results)
allpair_RandIndex(results)
results |
A list with each element packed internally by the
|
A data frame with RandIndex
for all pairs from
trajectories results. The data frame names and its format is intended to be
the input for rand_plot
. Note that all pairs is the lower
triangle plus diagonal of an all-pairs symmetric matrix.
A sample of 10,000 individuals from the full 80,000 individuals in a dataset available on GitHub at https://github.com/MVP-CHAMPION/clustra-SAS/raw/main/bp_data/simulated_data_27June2023.csv.gz
bp10k
bp10k
bp10k
A "data.table" and "data.frame" with 167,277 rows and 4 columns:
An integer in 1:80000.
An integer in 1:5.
An integer between -365 and 730, giving observation day with reference to an intervention at time 0.
The systolic blood pressure on that day.
The full data set contains 80,000 individuals, each with an average of about 17 observations in 5 clusters with scatter. The individuals are generated from a 5-cluster thin spline model of actual blood pressures collected from roughly the same number of individuals at U.S. Department of Veterans Affairs facilities in connection with the MVP-CHAMPION project. Each cluster-mean generated individual has a random number of observations at random times with one observation at intervention time 0, and with added standard normal error. The resulting data has 1,353,910 rows and 4 columns.
Checks if non-empty groups have enough data for spline fit degrees of freedom.
check_df(group, loss, data, maxdf)
check_df(group, loss, data, maxdf)
group |
An integer vector of group membership for each id. |
loss |
A matrix with rows of computed loss values of each id across all models as columns. |
data |
A data.table with data. See |
maxdf |
Fitting parameters. See |
When a group has insufficient data for maxdf
, its nearest model loss
values are set to Inf
, and new nearest model is assigned. The check
repeats until all groups have sufficient data.
Returns the vector of group membership of id's either unchanged or changed to have sufficient data in non-zero groups.
The usual top level function for clustering longitudinal trajectories. After
initial setup, it calls trajectories
to perform k-means
clustering on continuous response
measured over time
, where each mean
is defined by a thin plate spline fit to all points in a cluster. See
clustra_vignette.Rmd
for examples of use.
clustra( data, k, starts = "random", maxdf = 30, conv = c(10, 0), mccores = 1, verbose = FALSE, ... )
clustra( data, k, starts = "random", maxdf = 30, conv = c(10, 0), mccores = 1, verbose = FALSE, ... )
data |
Data frame or, preferably, also a data.table with response measurements, one response per observation. Required variables are (id, time, response). Other variables are ignored. |
k |
Number of clusters |
starts |
One of c("random", "distant") or an integer vector with values 1:k corresponding to unique ids of starting cluster assignments. For "random", starting clusters are assigned at random. For "distant", a FastMap-like algorithm selects k distant ids to which TPS models are fit and used as starting cluster centers to which ids are classified. Only id with more than median number of time points are used. Distance from an id to a TPS model is median absolute difference at id time points. Starting with a random id, distant ids are selected sequentially as the id with the largest minimum absolute distance to previous selections (a maximin concept). The first random selection is discarded and the next k selected ids are kept. Their TPS fits become the first cluster centers to which all ids are classified. See comments in code and DOI: 10.1109/TPAMI.2005.164 for the FastMap analogy. |
maxdf |
Fitting parameters. See |
conv |
Fitting parameters. See |
mccores |
See |
verbose |
Logical to turn on more output during fit iterations. |
... |
Additional parameters of optional plotting under |
A list returned by trajectories
plus one more element ido
,
giving the original id numbers is invisibly returned. Invisible returns are
useful for repeated runs that explore verbose clustra output.
set.seed(13) data = gen_traj_data(n_id = c(50, 100), types = c(1, 2), intercepts = c(100, 80), m_obs = 20, s_range = c(-365, -14), e_range = c(0.5*365, 2*365)) cl = clustra(data, k = 2, maxdf = 20, conv = c(5, 0), verbose = TRUE) tabulate(data$group) tabulate(data$true_group)
set.seed(13) data = gen_traj_data(n_id = c(50, 100), types = c(1, 2), intercepts = c(100, 80), m_obs = 20, s_range = c(-365, -14), e_range = c(0.5*365, 2*365)) cl = clustra(data, k = 2, maxdf = 20, conv = c(5, 0), verbose = TRUE) tabulate(data$group) tabulate(data$true_group)
Performs trajectories
runs for several k and several random
start replicates within k. Returns a data frame with a Rand Index
comparison between all pairs of clusterings. This data frame is typically
input to rand_plot
to produce a heat map with the Adjusted
Rand Index results.
clustra_rand( data, k, starts = "random", mccores = 1, replicates = 10, maxdf = 30, conv = c(10, 0), save = FALSE, verbose = FALSE )
clustra_rand( data, k, starts = "random", mccores = 1, replicates = 10, maxdf = 30, conv = c(10, 0), save = FALSE, verbose = FALSE )
data |
The data (see |
k |
Vector of k values to try. |
starts |
See |
mccores |
Number of cores for replicate parallelism via mclapply. |
replicates |
Number of replicates for each k. |
maxdf |
Fitting parameters. See |
conv |
Fitting parameters. See |
save |
Logical. When TRUE, save all results as file |
verbose |
Logical. When TRUE, information about each run of clustra (but not iterations within) is printed. |
See allpair_RandIndex
.
Performs clustra
runs for several k and prepares silhouette
plot data. Computes a proxy silhouette index based on distances to cluster
centers rather than trajectory pairs. The cost is essentially that of
running clustra for several k as this information is available directly from
clustra. Can also reuse a previous clustra run and produce data for a single
silhouette plot.
clustra_sil( data, kv = NULL, starts = "random", mccores = 1, maxdf = 30, conv = c(10, 0), save = FALSE, verbose = FALSE )
clustra_sil( data, kv = NULL, starts = "random", mccores = 1, maxdf = 30, conv = c(10, 0), save = FALSE, verbose = FALSE )
data |
A data.frame (see the |
kv |
Vector of |
starts |
See |
mccores |
See |
maxdf |
Fitting parameters. See |
conv |
Fitting parameters. See |
save |
Logical. When TRUE, save all results as file |
verbose |
Logical. When TRUE, information about each run of clustra is printed. |
When given the raw data as the first parameter (input data
parameter of
trajectories
), kv
specifies a vector of k
parameters for
clustra
and produces data for silhouette plots of each of them.
Alternatively, the input can be the output from a single clustra
run, in
which case data for a single silhouette plot will be made without running
clustra
.
Invisibly returns a list of length length(kv)
, where each element is
a matrix with nrow(data)
rows and three columns cluster
, neighbor
,
silhouette
. The matrix in each element of this list can be used to draw a
silhouette plot. When the input was a completed clustra
run, the output is a
list with a single element for a single silhouette plot.
Timing function
deltime(ltime = proc.time()["elapsed"], text = NULL, units = FALSE, nl = FALSE) deltimeT(ltime, text)
deltime(ltime = proc.time()["elapsed"], text = NULL, units = FALSE, nl = FALSE) deltimeT(ltime, text)
ltime |
Result of last call to deltime. |
text |
Text to display along with elapsed time since |
units |
Logical. If TRUE, print units |
nl |
Logical. If TRUE, a newline is added at the end. |
"elapsed" component of current proc.time
.
deltimeT()
: A shortcut frequent use. Requires ltime and text
parameters, includes units, and adds a newline after message.
Generates a collection of longitudinal responses with possibly varying
lengths and varying numbers of observations. Support is
start
. . . 0 . . . end
, where
start
~uniform(s_range) and end
~uniform(e_range), so that
all trajectories are aligned at 0 but can start and end at different times.
Zero is the intervention time.
gen_traj_data( n_id, types, intercepts, m_obs, s_range, e_range, noise = c(0, abs(mean(intercepts)/20)), min_obs = 3 )
gen_traj_data( n_id, types, intercepts, m_obs, s_range, e_range, noise = c(0, abs(mean(intercepts)/20)), min_obs = 3 )
n_id |
Vector whose length is the number of clusters, giving the number of id's to generate in each cluster. |
types |
A vector of integers from |
intercepts |
A vector of first responses at minimum time for the curve base vectors of
same length as n_id. Each |
m_obs |
Mean number of observation per id. Provides |
s_range |
A vector of length 2, giving the min and max limits of uniformly generated start observation time. |
e_range |
A vector of length 2, giving the min and max limits of uniformly generated end observation time. |
noise |
Vector of length 2 giving the mean and sd of added N(mean, sd) noise. |
min_obs |
Minimum number of observations in addition to zero time observation. |
A data table with one response per row and four columns:
id
, time
, response
, and true_group
.
Generate longitudinal data for a response variable. Trajectories start at time uniformly distributed in s_range and end at time uniformly distributed in e_range. Number of observations in a trajectory is Poisson(m_obs). The result is a number of trajectories, all starting at time 0, with different time spans, and with independently different numbers of observations within the time spans. Each trajectory follows one of three possible response functions possibly with a different mean and with added N(mean, sd) error.
data = gen_traj_data(n_id = c(50, 100), types = c(1, 2), intercepts = c(100, 80), m_obs = 20, s_range = c(-365, -14), e_range = c(0.5*365, 2*365)) head(data) tail(data)
data = gen_traj_data(n_id = c(50, 100), types = c(1, 2), intercepts = c(100, 80), m_obs = 20, s_range = c(-365, -14), e_range = c(0.5*365, 2*365)) head(data) tail(data)
Generates data for up to three trajectory clusters
gendata(n_id, types, intercepts, m_obs, s_range, e_range, min_obs, noise)
gendata(n_id, types, intercepts, m_obs, s_range, e_range, min_obs, noise)
n_id |
See parameters of |
types |
See parameters of |
intercepts |
See parameters of |
m_obs |
See parameters of |
s_range |
See parameters of |
e_range |
See parameters of |
min_obs |
See parameters of |
noise |
See parameters of |
Time support of each id
is at least s . . . 0 . . . . e
, where s
is in
s_range
and e
is in e_range
.
A list of length sum(n_id)
, where each element is a matrix output by oneid
.
kchoose
.Function to test information criteria. Not exported and used by internal
function kchoose
.
ic_fun(cl, data, fn)
ic_fun(cl, data, fn)
cl |
Output from |
data |
A valid data set for |
fn |
Character file name for output. |
Numerical value of computed AIC. Also writes a line of computed information
criteria to fn
file for each k.
A test function to evaluate information criteria for several k values. Not exported and only for debugging internal use.
kchoose(K, var = 5, maxdf = 10, mc = 1, fn = "ic.txt")
kchoose(K, var = 5, maxdf = 10, mc = 1, fn = "ic.txt")
K |
Integer vector of k values to try. |
var |
A numerical value of noise variance in generated data. |
maxdf |
Fitting parameters. See |
mc |
Number of cores to use. Increase up to largest k, or number of cores available, whichever is less. (On hyperthreaded cores, up to 2x number of cores.) |
fn |
Character file name for output. |
Various Loss functions used internally by clustra
mse_g(pred, id, response) mae_g(pred, id, response) mme_g(pred, id, response) mxe_g(pred, id, response)
mse_g(pred, id, response) mae_g(pred, id, response) mme_g(pred, id, response) mxe_g(pred, id, response)
pred |
Vector of predicted values. |
id |
Integer vector of group assignments. |
response |
Vector of response values. |
A numeric value. For mse_g(), returns the mean-squared error. For mae_g(), returns mean absolute error. For mme_g(), returns median absolute error. For mxe_g(), returns the maximum absolute error.
id
Generates data for one id
oneid(id, n_obs, type, intercept, start, end, smin, emax, noise)
oneid(id, n_obs, type, intercept, start, end, smin, emax, noise)
id |
A unique integer. |
n_obs |
An integer number of observations to produce. |
type |
Response type, 1 is constant, 2 is a sin curve portion, and 3 is a sigmoid portion. |
intercept |
Used to set response at |
start |
Negative integer giving time of first observation. |
end |
Positive integer giving time of last observation. |
smin |
The smallest possible |
emax |
The largest possible |
noise |
Standard deviation of zero mean Gaussian noise added to response functions. |
An n_obs
by 4 matrix with columns id
, time
, response
, true_group
.
Plots a sample of ids in a small mutiples layout
plot_sample(dat, layout = c(3, 3), sample = prod(layout), group = NULL)
plot_sample(dat, layout = c(3, 3), sample = prod(layout), group = NULL)
dat |
A data frame with a few id trajectories to plot. |
layout |
The small multiples layout as c(rows, columns). |
sample |
If zero, all data in dat are displayed. If >0 a sample of that many data points from dat are displayed. |
group |
If not NULL, a character string giving the variable name in data that should color the data points. |
Invisibly returns the number of trajectories plotted.
clustra_sil
along
with the average silhouette value. Typically used via
lapply(list, plot_silhouette)
Plots a list item, a silhouette, from the result of clustra_sil
along
with the average silhouette value. Typically used via
lapply(list, plot_silhouette)
plot_silhouette(sil)
plot_silhouette(sil)
sil |
A data frame that is a list item returned by |
Returns invisibly the average silhouette value.
Plots data and smooths from clustra
output or internally from within
start_groups()
plot_smooths( data, fits = NULL, max.data = 1e+05, select.data = NULL, group = "group", ... )
plot_smooths( data, fits = NULL, max.data = 1e+05, select.data = NULL, group = "group", ... )
data |
The data. If after |
fits |
The |
max.data |
The maximum number of data points to plot. If zero,
no points are plotted (overrides select.data). Use |
select.data |
Either NULL or a list of length k, each element a data.frame (like data)
with time and response components. The select.data points will be
highlighted with cluster colors on the plot. This is used internally in the
|
group |
Character variable name in |
... |
Other parameters to |
Function to predict for new data based on fitted gam object.
pred_g(tps, newdata)
pred_g(tps, newdata)
tps |
Output structure of |
newdata |
See |
A numeric vector of predicted values corresponding to rows of newdata. If gam object is NULL, NULL is returned instead.
Matrix plot of Rand Index comparison of replicated clusters
rand_plot(rand_pairs, name = NULL)
rand_plot(rand_pairs, name = NULL)
rand_pairs |
A data frame result of |
name |
Character string file name for pdf plot. If omitted or NULL, plot will render to current graphics device. |
Invisible. Full path name of file with plot.
Wei-Chen Chen and George Ostrouchov
Wei-chen Chen, George Ostrouchov, David Pugmire, Prabhat, and Michael Wehner. 2013. A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data. Technometrics, 55:4, 513-523.
Sorts replicates within cluster K Assumes K starts from 2
Either a random assignment of k approximately equal size clusters or a FastMap-like algorithm that sequentially selects k distant ids from those that have more than the median number of observations. TPS fits to these ids are used as cluster centers for a starting group assignment. A user supplied starting assignment is also possible.
start_groups(k, data, starts, maxdf, conv, mccores = 1, verbose = FALSE)
start_groups(k, data, starts, maxdf, conv, mccores = 1, verbose = FALSE)
k |
Number of clusters (groups). |
data |
Data.table with response measurements, one per observation.
Column names are id, time, response, group. Note that |
starts |
Type of start groups generated. See |
maxdf |
Fitting parameters. See |
conv |
Fitting parameters. See |
mccores |
See |
verbose |
Turn on more output for debugging. Values 0, 1, 2, 3 add more output. 2 and 3 produce graphs during iterations - use carefully! |
An integer vector corresponding to unique id
s, giving group number
assignments.
For distant
, each sequential selection takes an id that has the largest
minimum distance from smooth TPS fits (<= 5 deg) of previous selections.
The distance of an id to a single TPS is the median absolute error across
the id time points. Distance of an id to a set of TPS is the minimum of
the individual distances. We pick the id that has the maximum of such
a minimum of medians.
bam
.Fits a thin plate spline to a single group (one list element) in data with
bam
. Uses data from only one group rather than a
zero weights approach. Zero weights would result in
incorrect crossvalidation sampling.
tps_g(g, data, maxdf, nthreads)
tps_g(g, data, maxdf, nthreads)
g |
Integer group number. |
data |
A list of group-separated data using lapply with
|
maxdf |
See |
nthreads |
Controls |
Returns an object of class "gam". See bam
value.
If group data has zero rows, NULL is returned instead.
Function to run trajectories inside mclapply with one core.
traj_rep(group, data, k, maxdf, conv)
traj_rep(group, data, k, maxdf, conv)
group |
Vector of starting group values for unique id's. |
data |
The data (see |
k |
Integer number of clusters. |
maxdf |
Fitting parameters. See |
conv |
Fitting parameters. See |
See return of trajectories
.
Performs k-means clustering on continuous response
measured over time
,
where each mean is defined by a thin plate spline fit to all points in a
cluster. Typically, this function is called by clustra
.
trajectories( data, k, group, maxdf, conv = c(10, 0), mccores = 1, verbose = FALSE, ... )
trajectories( data, k, group, maxdf, conv = c(10, 0), mccores = 1, verbose = FALSE, ... )
data |
Data table or data frame with response measurements, one per observation.
Column names are |
k |
Number of clusters (groups) |
group |
Vector of initial group numbers corresponding to |
maxdf |
Integer. Basis dimension of smooth term. See |
conv |
A vector of length two, |
mccores |
Integer number of cores to use by |
verbose |
Logical, whether to produce debug output. A value > 1 will plot tps fit lines in each iteration. |
... |
See |
A list with components
deviance
- The final deviance in each cluster added across clusters.
group
- Integer vector of group assignments corresponding to unique id
s.
loss
- Numeric matrix with rows corresponding to unique id
s and one
column for each cluster. Each entry is the mean squared loss for the data in
the id
relative to the cluster model.
k
- An integer giving the requested number of clusters.
k_cl
- An integer giving the converged number of clusters. Can be
smaller than k
when some clusters become too small for degrees of freedom
during convergence.
data_group
- An integer vector, giving group assignment as expanded into
all id
time points.
tps
- A list with k_cl
elements, each an object returned by the
mgcv::bam
fit of a cluster thin plate spline model.
iterations
- An integer giving the number of iterations taken.
counts
- An integer vector giving the number of id
s in each cluster.
counts_df
- An integer vector giving the total number of observations in
each cluster (sum of the number of observations for id
s belonging to the
cluster).
changes
- An integer, giving the number of id
s that changed clusters in
the last iteration. This is zero if converged.
George Ostrouchov and David Gagnon
Examines trajectories output to name what was concluded, such as convergence, maximum iterations reached, a zero cluster, etc. Multiple conclusions are possible as not all are mutually exclusive.
xit_report(cl, maxdf, conv)
xit_report(cl, maxdf, conv)
cl |
Output structure from |
maxdf |
Fitting parameters. See |
conv |
Fitting parameters. See |
NULL or a character vector of exit criteria satisfied.