API Reference¶
This page provides auto-generated API documentation extracted from Python docstrings. All classes and modules are documented with mkdocstrings, which pulls method signatures, parameters, return types, and detailed descriptions directly from the source code.
Synthesizers¶
The synthesizer module provides multiple approaches for generating synthetic omics data, all implementing a unified interface through the BaseSynthesizer class.
BaseSynthesizer¶
Base class providing a consistent pipeline: preprocess -> fit -> sample -> postprocess -> save.
Subclasses should override:
- preprocess (if they need data transformations before fitting, e.g., Gaussian Copula)
- fit (required)
- sample (required)
- postprocess (optional, to extend or override base behavior)
Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
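The override pattern above can be sketched with a minimal stand-in class. This is illustrative only, not the library's actual implementation; only the pipeline contract (preprocess -> fit -> sample) is mirrored here.

```python
import random

# Illustrative stand-in for BaseSynthesizer: preprocess is a pass-through by
# default, fit and sample must be overridden by subclasses.
class BaseSynthesizerSketch:
    def __init__(self, output_path, metadata=None):
        self.output_path = output_path
        self.metadata = metadata

    def preprocess(self, data):
        return data  # default: return input unchanged

    def fit(self, data, **kwargs):
        raise NotImplementedError

    def sample(self, n_samples=10, **kwargs):
        raise NotImplementedError


class ToySynthesizer(BaseSynthesizerSketch):
    """Toy subclass: fits a min/max range and samples uniformly from it."""

    def fit(self, data, **kwargs):
        self.low, self.high = min(data), max(data)

    def sample(self, n_samples=10, **kwargs):
        return [random.uniform(self.low, self.high) for _ in range(n_samples)]


syn = ToySynthesizer(output_path="out/")
syn.fit([1.0, 2.0, 3.0])
samples = syn.sample(n_samples=5)
```

A real subclass would additionally call the base postprocess and save steps, as the pipeline description above indicates.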
__init__(output_path, metadata=None)¶
Initialize the BaseSynthesizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | Path to save outputs and logs. | required |
| metadata | dict | Metadata dictionary containing column type information. | None |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
detect_discrete_columns(data)¶
Detect discrete columns from data and self.metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Input data. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| list | List[str] | List of discrete column names. |

Raises:

| Type | Description |
|---|---|
| ValueError | If metadata is not set. |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
detect_numerical_columns(data)¶
Detect numerical columns from data and self.metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Input data. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| list | List[str] | List of numerical column names. |

Raises:

| Type | Description |
|---|---|
| ValueError | If metadata is not set. |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
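Both detectors read the column types from the metadata dictionary. A minimal sketch of that lookup, assuming metadata maps column names to type strings like those documented for the Gaussian Copula synthesizer (the column names here are illustrative):

```python
# Hypothetical metadata dictionary; keys are column names, values are type
# strings ("*_categorical" marks discrete columns).
metadata = {
    "SEX": "dummy_categorical",
    "STAGE": "ordinal_categorical",
    "AGE_missing": "missing_categorical",
    "AGE": "numerical",
}

# Discrete columns: any categorical type; numerical columns: the rest.
discrete = [col for col, typ in metadata.items() if typ.endswith("categorical")]
numerical = [col for col, typ in metadata.items() if typ == "numerical"]
```

If no metadata has been set, both methods raise ValueError instead of guessing types from the data.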
postprocess(synthetic_data, original_data, data_ids=None, enforce_rounding=True, enforce_min_max=True, masking=False)¶
Postprocess synthetic data, including anonymization, rounding, and min-max scaling.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| synthetic_data | DataFrame | Generated synthetic data. | required |
| original_data | DataFrame | Original data for reference. | required |
| data_ids | list | IDs to anonymize. | None |
| enforce_rounding | bool | Apply rounding with digits inferred from original_data. | True |
| enforce_min_max | bool | Clip to min/max observed in original_data. | True |
| masking | bool | If True and mask_func is provided, apply it to introduce missingness. | False |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Postprocessed synthetic data. |

Raises:

| Type | Description |
|---|---|
| ValueError | If metadata is not set. |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
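The enforce_min_max and enforce_rounding steps can be sketched on a single column. This is a simplified illustration (rounding to 2 digits is an assumption here; the method infers the digit count from original_data):

```python
import pandas as pd

# Illustrative original and synthetic values for one numerical column.
original = pd.DataFrame({"AGE": [30.0, 45.5, 62.25]})
synthetic = pd.DataFrame({"AGE": [28.7312, 70.1234, 50.5]})

lo, hi = original["AGE"].min(), original["AGE"].max()
clipped = synthetic["AGE"].clip(lo, hi)  # enforce_min_max: clip to observed range
rounded = clipped.round(2)               # enforce_rounding: assumed 2 digits
```

Values outside the observed range (28.73 and 70.12 here) are pulled back to the training minimum and maximum.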
preprocess(data)¶
Preprocess input data for synthesizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Input data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Preprocessed data (default: returns input unchanged). |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
save_synthetic_data(synthetic_data, filename, index=False)¶
Save synthetic data to CSV. Supports a single DataFrame or a list (e.g., Synthpop m>1).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| synthetic_data | DataFrame or list[DataFrame] | Data to save. | required |
| filename | str | Base filename for saving. | required |
| index | bool | Whether to write DataFrame index. | False |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
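For the list case (Synthpop with m>1), one plausible approach is to derive one filename per dataset from the base filename. The naming scheme below is hypothetical, shown only to illustrate the single-vs-list dispatch; the library's actual filenames may differ.

```python
# Hypothetical filename scheme: one CSV per dataset when a list is passed.
def output_filenames(base, synthetic_data):
    if isinstance(synthetic_data, list):  # e.g., Synthpop with m > 1
        return [f"{base}_{i + 1}.csv" for i in range(len(synthetic_data))]
    return [f"{base}.csv"]
```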
set_metadata(metadata)¶
Set or update the metadata dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | dict | Metadata dictionary containing column type information. | required |

Returns:

| Type | Description |
|---|---|
| None | None |

Source code in src/synomicsbench/synthesizer/BaseSynthesizer.py
CTGANsynthesizer¶
Bases: BaseSynthesizer
CTGANsynthesizer for generating synthetic data using CTGAN.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | Path to save outputs. | required |
| metadata | dict | Metadata dictionary containing column type information. | None |

Methods:

| Name | Description |
|---|---|
| fit | Train the CTGAN model. |
| sample | Generate synthetic samples. |
| postprocess | Apply postprocessing (anonymize IDs, rounding, scaling). |
| generate | Orchestrate data synthesis pipeline. |

Source code in src/synomicsbench/synthesizer/CTGANsynthesizer.py
fit(data, seed=None, *, epochs=100, verbose=True, cuda=True, **kwargs)¶
Train the CTGAN model on provided data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame for training. | required |
| seed | int | Random seed used for reproducible training. | None |
| epochs | int | Number of training epochs. | 100 |
| verbose | bool | Verbosity flag. | True |
| cuda | bool | Use GPU if True. | True |
| **kwargs |  | Extra CTGAN parameters. | {} |

Returns:

| Type | Description |
|---|---|
|  | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If metadata is not set, or fitting fails. |

Source code in src/synomicsbench/synthesizer/CTGANsynthesizer.py
sample(n_samples=10, seed=None, **kwargs)¶
Generate synthetic samples from fitted CTGAN model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of samples to generate. | 10 |
| seed | int | Random seed used for sampling. | None |
| **kwargs |  | Extra parameters for model.sample(). | {} |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Synthetic samples. |

Raises:

| Type | Description |
|---|---|
| ValueError | If model not fitted or sampling fails. |

Source code in src/synomicsbench/synthesizer/CTGANsynthesizer.py
TVAEsynthesizer¶
Bases: BaseSynthesizer
TVAEsynthesizer for generating synthetic data using TVAE.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | Path to save outputs. | required |
| metadata | dict | Metadata dictionary containing column type information. | None |

Methods:

| Name | Description |
|---|---|
| fit | Train the TVAE model. |
| sample | Generate synthetic samples. |
| postprocess | Apply postprocessing (anonymize IDs, rounding, scaling). |
| generate | Orchestrate data synthesis pipeline. |

Source code in src/synomicsbench/synthesizer/TVAEsynthesizer.py
fit(data, seed=None, *, epochs=100, verbose=True, cuda=True, **kwargs)¶
Train the TVAE model on provided data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame for training. | required |
| seed | int | Random seed for reproducibility. | None |
| epochs | int | Number of training epochs. | 100 |
| verbose | bool | Verbosity flag. | True |
| cuda | bool | Use GPU if True. | True |
| **kwargs |  | Extra TVAE parameters. | {} |

Returns:

| Type | Description |
|---|---|
|  | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If metadata is not set, input is invalid, or fitting fails. |

Source code in src/synomicsbench/synthesizer/TVAEsynthesizer.py
sample(n_samples=10, seed=None, **kwargs)¶
Generate synthetic samples from fitted TVAE model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of samples to generate. | 10 |
| seed | int | Random seed for reproducibility during sampling. | None |
| **kwargs |  | Extra parameters for model.sample(). | {} |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Synthetic samples. |

Raises:

| Type | Description |
|---|---|
| ValueError | If model not fitted or sampling fails. |

Source code in src/synomicsbench/synthesizer/TVAEsynthesizer.py
GaussianCopulasynthesizer¶
Bases: BaseSynthesizer
GaussianCopulasynthesizer for generating synthetic data using a Gaussian Copula model.
This refactor aligns the class with BaseSynthesizer:
- preprocess: encodes categorical variables (dummy + ordinal) and keeps numerical/missing indicators
- fit: trains GaussianMultivariate_Parallel on preprocessed data
- sample: draws synthetic samples
- postprocess: decodes categorical variables, enforces constraints, anonymizes IDs
Notes:
- Metadata must be provided as a dictionary to the constructor or via set_metadata():
  { "col_name": "ordinal_categorical" | "dummy_categorical" | "missing_categorical" }
- Ordinal variables are encoded using sklearn's OrdinalEncoder (0..n_levels-1). During postprocess, synthetic ordinal columns are rounded and clipped to valid ranges derived from encoder categories, then inverse-transformed.
Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
encode_dummy_cat_features(data)¶
One-hot encode the provided categorical columns (no drop-first, no dummy for NaN).
Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
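A sketch of the described encoding ("no drop-first, no dummy for NaN") using pandas.get_dummies, which matches that behavior with its defaults plus dummy_na=False. The column name and values are illustrative, not from the library.

```python
import pandas as pd

# One-hot encode a categorical column: every level gets its own column,
# missing values produce all-zero rows (no NaN dummy column).
data = pd.DataFrame({"SEX": ["M", "F", None, "F"]})
encoded = pd.get_dummies(data, columns=["SEX"], dummy_na=False)
```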
fit(data, seed=None, n_jobs=8, chunk_size=20, **kwargs)¶
Train the Gaussian Copula model on preprocessed data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Preprocessed DataFrame (output of preprocess()). | required |
| n_jobs | int | Number of parallel worker threads. | 8 |
| chunk_size | int | Chunk size for parallel fitting. | 20 |
| **kwargs |  | Extra GaussianMultivariate_Parallel parameters. | {} |

Returns:

| Type | Description |
|---|---|
| None | None |

Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
inverse_dummy_cat_features(data, dummy_cat_columns)¶
Inverse-transform one-hot encoded dummy categorical columns back to single categorical columns.
Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
inverse_missing_indicators(data, missing_indicators)¶
Threshold missing indicator columns at 0.5 and cast to float.
Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
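The thresholding step is simple enough to show directly. A minimal sketch with illustrative synthetic indicator values:

```python
# Threshold continuous synthetic indicator values at 0.5 and cast to float,
# recovering binary missingness indicators.
synthetic_indicator = [0.13, 0.72, 0.49, 0.95]
recovered = [float(v > 0.5) for v in synthetic_indicator]
```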
inverse_ordinal_cat_features(data, ordinal_cat_columns)¶
Perform inverse ordinal encoding on categorical features.
Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
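Per the class notes above, inverse ordinal decoding rounds the synthetic code, clips it to the valid index range derived from the encoder categories, and maps it back to labels. A sketch with an illustrative category list (the real method uses the fitted OrdinalEncoder's categories):

```python
# Illustrative ordinal levels as an OrdinalEncoder would store them (0..2).
categories = ["low", "medium", "high"]

def decode(value, categories):
    idx = int(round(value))                       # round the synthetic code
    idx = max(0, min(idx, len(categories) - 1))   # clip to the valid range
    return categories[idx]

decoded = [decode(v, categories) for v in [-0.3, 0.6, 1.4, 2.7]]
```

Out-of-range codes like -0.3 and 2.7 are clipped to the nearest valid level rather than rejected.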
postprocess(synthetic_data, original_data, data_ids=None, enforce_rounding=True, enforce_min_max=True, masking=False)¶
Inverse the encodings back to original space, then delegate anonymization, rounding, min-max clipping, and optional masking to BaseSynthesizer.postprocess.
Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
preprocess(data)¶
Preprocess input data for Gaussian Copula synthesizer.
- Splits columns by type using self.metadata
- One-hot encodes dummy categorical columns
- Ordinal-encodes ordinal categorical columns
- Keeps numerical and missing indicator columns unchanged
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | The input DataFrame. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Preprocessed DataFrame. |

Raises:

| Type | Description |
|---|---|
| ValueError | If metadata is not set. |

Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
sample(n_samples=10, seed=None, **kwargs)¶
Generate synthetic samples from fitted Gaussian Copula model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of samples to generate. | 10 |
| seed | int | Random seed for reproducibility. | None |
| **kwargs |  | Extra parameters for model.sample(). | {} |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Synthetic samples. |

Raises:

| Type | Description |
|---|---|
| ValueError | If model is not fitted. |

Source code in src/synomicsbench/synthesizer/GaussianCopulasynthesizer.py
SynthpopSynthesizer¶
Bases: BaseSynthesizer
SynthpopSynthesizer for generating synthetic data using Synthpop (R) via rpy2.
This implementation aligns with BaseSynthesizer:
- preprocess: pass-through by default
- fit: stores the training DataFrame for the R call
- sample: calls R's synthpop::syn, returns a DataFrame or list of DataFrames (for m > 1)
- postprocess: anonymize IDs, rounding, min-max; works for single or multiple datasets via BaseSynthesizer.generate
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | Directory where outputs are saved. | required |
| metadata | dict | Column-type metadata. Keys are column names, values are one of 'dummy_categorical', 'ordinal_categorical', 'missing_categorical', or a numeric type string. | None |
| r_home | str | Path to R_HOME. If provided, sets os.environ['R_HOME'] at init. | None |
| r_terminal | str | R executable name or path used to configure library search paths. Default 'R'. | 'R' |

Source code in src/synomicsbench/synthesizer/Synthpopsynthesizer.py
fit(data, seed=42, **kwargs)¶
Store the training data for later use in sample().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Preprocessed training data. | required |
| **kwargs |  | Unused; for API compatibility. | {} |

Returns:

| Type | Description |
|---|---|
| None | None |

Source code in src/synomicsbench/synthesizer/Synthpopsynthesizer.py
preprocess(data)¶
Preprocess input data for Synthpop (default: pass-through).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Input data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Preprocessed data (unchanged). |

Source code in src/synomicsbench/synthesizer/Synthpopsynthesizer.py
sample(n_samples='auto', seed=42, *, discrete_columns=None, method='cart', minimumlevels=3, proper=False, n_datasets=1, visit_sequence=None, cont_na=None, verbose=True, predictor_matrix=None, **kwargs)¶
Call R's synthpop::syn to generate synthetic data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int or 'auto' | Number of rows to synthesize; 'auto' uses training size. | 'auto' |
| seed | int | Random seed for reproducibility on the R side. | 42 |
| discrete_columns | list[str] | Categorical columns. | None |
| method | str | Synthpop synthesis method (e.g., 'cart', 'parametric', ...). | 'cart' |
| minimumlevels | int | Minimum levels for categorical variables. | 3 |
| proper | bool | Proper synthesis flag. | False |
| n_datasets | int | Number of synthetic datasets (m). | 1 |
| visit_sequence | list[str] | Variable visit sequence. | None |
| cont_na | dict | Settings for NA handling of continuous variables. | None |
| verbose | bool | Verbosity for R synthpop. | True |
| predictor_matrix | DataFrame | Square 0/1 matrix restricting predictors. Must have the same index and columns as training data columns. | None |

Returns:

| Type | Description |
|---|---|
| Union[DataFrame, List[DataFrame]] | pd.DataFrame or list[pd.DataFrame]: Synthetic dataset(s). |

Raises:

| Type | Description |
|---|---|
| RuntimeError | If R synthesis fails. |
| ValueError | If fit() has not been called or inputs are invalid. |

Source code in src/synomicsbench/synthesizer/Synthpopsynthesizer.py
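Building a predictor_matrix can be sketched as follows. In R synthpop's convention, row i / column j = 1 means column j may be used as a predictor when synthesizing column i; the column names below are illustrative.

```python
import pandas as pd

# Square 0/1 matrix indexed by the training columns (illustrative names).
cols = ["AGE", "SEX", "STAGE"]
pm = pd.DataFrame(0, index=cols, columns=cols)

pm.loc["SEX", "AGE"] = 1               # SEX may be predicted from AGE
pm.loc["STAGE", ["AGE", "SEX"]] = 1    # STAGE may be predicted from AGE and SEX
```

Such a matrix would then be passed as `sample(predictor_matrix=pm, ...)`; it must share its index and columns with the training data columns, as the table above notes.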
Processing¶
The processing module handles data integration, preprocessing, metadata management, and gene-level queries for multi-omics datasets.
DataIntegrationPipeline¶
End-to-end pipeline for processing and integrating clinical and transcriptomics data.
This pipeline:
- Cleans data (removes undefined IDs, deduplicates, filters over-missing samples)
- Performs feature engineering (over-missing feature filtering, type classification, imputation)
- Optionally maps Ensembl gene IDs to HUGO symbols
- Integrates processed clinical and transcriptomics data on a common ID
- Exports a feature metadata JSON and logs detailed progress for easy monitoring
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | Directory where logs and outputs (e.g., feature_metadata.json) are saved. | required |
| logger | str | Suffix for the log filename (e.g., Preprocess_{logger}.log). | '' |

Returns:

| Type | Description |
|---|---|
|  | None |

Raises:

| Type | Description |
|---|---|
| OSError | If the output directory cannot be created. |

Source code in src/synomicsbench/processing/pipeline.py
__init__(output_dir, logger='')¶
Initialize the data integration pipeline and configure logging.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | Directory where logs and outputs are stored. | required |
| logger | str | Suffix for the log filename. | '' |

Returns:

| Type | Description |
|---|---|
|  | None |

Raises:

| Type | Description |
|---|---|
| OSError | If the output directory cannot be created. |

Source code in src/synomicsbench/processing/pipeline.py
integrate_data(processed_clinical, processed_transcriptomics, clinical_id_column, transcriptomics_id_column, integration_id_column, steps_config)¶
Integrate processed clinical and transcriptomics data on a common ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| processed_clinical | DataFrame | Processed clinical dataset. | required |
| processed_transcriptomics | DataFrame | Processed transcriptomics dataset. | required |
| clinical_id_column | str | ID column in processed_clinical. | required |
| transcriptomics_id_column | str | ID column in processed_transcriptomics. | required |
| integration_id_column | str | Name for the common ID column after renaming. | required |
| steps_config | dict | Flags controlling whether to integrate. | required |

Returns:

| Type | Description |
|---|---|
| Optional[DataFrame] | pd.DataFrame or None: Integrated dataset if integration is enabled; otherwise None. |

Raises:

| Type | Description |
|---|---|
| KeyError | If required ID columns are missing in the provided DataFrames. |
| TypeError | If inputs are not pandas DataFrames. |

Source code in src/synomicsbench/processing/pipeline.py
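Conceptually, the integration step amounts to a pandas merge on the common ID after renaming both ID columns to integration_id_column. A minimal sketch with illustrative data (the join type shown, inner, is an assumption; consult the source for the exact merge semantics):

```python
import pandas as pd

# Two processed tables sharing a patient ID, already renamed to the
# integration_id_column ("PATIENT_ID", the documented default).
clinical = pd.DataFrame({"PATIENT_ID": ["P1", "P2"], "AGE": [61, 47]})
transcriptomics = pd.DataFrame({"PATIENT_ID": ["P1", "P2"], "GENE_A": [0.3, 0.9]})

integrated = clinical.merge(transcriptomics, on="PATIENT_ID", how="inner")
```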
process_clinical_data(clinical_data, clinical_id_column, steps_config, overmissing_samples_threshold=50.0, overmissing_features_threshold=50.0, unique_threshold=10, scaler='minmax', imputer='knn', imputer_params=None, ordinal_cat_columns=None, add_indicators=True, verbose=True)¶
Process raw clinical data with configurable steps.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| clinical_data | DataFrame | Raw clinical data. | required |
| clinical_id_column | str | Column name for sample/patient IDs. | required |
| steps_config | dict | Flags controlling which steps to run. | required |
| overmissing_samples_threshold | float | Remove rows with missingness > threshold (0-100). | 50.0 |
| overmissing_features_threshold | float | Remove columns with missingness > threshold (0-100). | 50.0 |
| unique_threshold | int | Unique value threshold to classify categorical features. | 10 |
| scaler | str | Scaler type for numerical features ('minmax', 'standard', 'robust'). | 'minmax' |
| imputer | str | Imputation method to use ('knn' or 'mice'). | 'knn' |
| imputer_params | dict | Method-specific params. For 'knn': e.g., {'n_neighbors': 5}. For 'mice': e.g., {'iterations': 20, 'n_estimators': 300}. | None |
| ordinal_cat_columns | list | Known ordinal categorical columns. | None |
| add_indicators | bool | Whether to add missingness indicator columns during imputation. | True |
| verbose | bool | Print progress from lower-level utilities. | True |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Processed clinical data with imputed features and indicators. |

Raises:

| Type | Description |
|---|---|
| KeyError | If clinical_id_column is not found in clinical_data. |
| TypeError | If clinical_data is not a pandas DataFrame. |
| ValueError | On invalid thresholds or processing errors in underlying steps. |

Source code in src/synomicsbench/processing/pipeline.py
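The two over-missing filters can be sketched directly: rows and columns whose percentage of missing values exceeds the threshold are removed, everything at or below it is kept. Data and threshold here are illustrative (50.0 matches the documented defaults):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],   # 25% missing -> kept
    "B": [np.nan, np.nan, np.nan, 4.0],  # 75% missing -> dropped
})

threshold = 50.0
row_missing = df.isna().mean(axis=1) * 100   # per-row missingness (%)
col_missing = df.isna().mean(axis=0) * 100   # per-column missingness (%)

# Keep rows/columns with missingness <= threshold (strictly greater is removed).
filtered = df.loc[row_missing <= threshold,
                  col_missing[col_missing <= threshold].index]
```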
process_transcriptomics_data(transcriptomics_data, transcriptomics_id_column, steps_config, overmissing_samples_threshold=50.0, overmissing_features_threshold=50.0, unique_threshold=10, scaler='minmax', imputer='mice', imputer_params=None, low_expression_variance_threshold=0.0005, add_indicators=True, verbose=True)
¶
Process raw transcriptomics data with configurable steps.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `transcriptomics_data` | `DataFrame` | Raw transcriptomics data. | *required* |
| `transcriptomics_id_column` | `str` | Column name for sample/patient IDs. | *required* |
| `steps_config` | `dict` | Flags controlling which steps to run. | *required* |
| `overmissing_samples_threshold` | `float` | Remove rows with missingness > threshold (0-100). | `50.0` |
| `overmissing_features_threshold` | `float` | Remove columns with missingness > threshold (0-100). | `50.0` |
| `unique_threshold` | `int` | Threshold to classify categorical features (not used for transcriptomics). | `10` |
| `scaler` | `str` | Scaler type for numerical features (`'minmax'`, `'standard'`, `'robust'`). | `'minmax'` |
| `imputer` | `str` | Imputation method to use (`'knn'` or `'mice'`). | `'mice'` |
| `imputer_params` | `dict` | Method-specific params. For `'knn'`: e.g., `{'n_neighbors': 5}`. For `'mice'`: e.g., `{'iterations': 20, 'n_estimators': 300}`. | `None` |
| `low_expression_variance_threshold` | `float` | Variance cutoff for near-constant genes when low-expression filtering is enabled. | `0.0005` |
| `add_indicators` | `bool` | Whether to add missingness indicator columns during imputation. | `True` |
| `verbose` | `bool` | Print progress from lower-level utilities. | `True` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | Processed transcriptomics data with imputed features and optional indicators. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If `transcriptomics_id_column` is not found in the input DataFrame. |
| `TypeError` | If `transcriptomics_data` is not a pandas DataFrame. |
| `ValueError` | On invalid thresholds or processing errors in underlying steps. |
Source code in src/synomicsbench/processing/pipeline.py, lines 60–267
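The missingness and low-variance filters described above can be sketched with plain pandas. This is an illustrative approximation under made-up gene columns, not the library's implementation:

```python
import pandas as pd

# Hypothetical expression matrix: samples as rows, genes as columns.
df = pd.DataFrame({
    "GENE_A": [0.1, 0.2, None, 0.4],
    "GENE_B": [0.0, 0.0, 0.0, 0.0],      # constant -> near-zero variance
    "GENE_C": [None, None, None, 0.9],   # 75% missing
})

# Drop features whose missingness exceeds 50%,
# mirroring overmissing_features_threshold=50.0.
missing_pct = df.isna().mean() * 100
df = df.loc[:, missing_pct <= 50.0]

# Drop near-constant genes, mirroring low_expression_variance_threshold.
df = df.loc[:, df.var(skipna=True) > 0.0005]
```

After both filters only `GENE_A` survives: `GENE_C` fails the missingness cut and `GENE_B` fails the variance cut.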
run_pipeline(clinical_data, transcriptomics_data, clinical_id_column, transcriptomics_id_column, integration_id_column='PATIENT_ID', steps_config=None, overmissing_samples_threshold=50.0, overmissing_features_threshold=50.0, unique_threshold=10, scaler='minmax', ordinal_cat_columns=None, imputer='mice', imputer_params=None, low_expression_variance_threshold=0.0005, add_indicators=True, verbose=True)
¶
Run the full pipeline across transcriptomics and clinical data and optionally integrate them.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `clinical_data` | `DataFrame` | Raw clinical data. | *required* |
| `transcriptomics_data` | `DataFrame` | Raw transcriptomics data. | *required* |
| `clinical_id_column` | `str` | ID column name in `clinical_data`. | *required* |
| `transcriptomics_id_column` | `str` | ID column name in `transcriptomics_data`. | *required* |
| `integration_id_column` | `str` | Common ID column name for integration. | `'PATIENT_ID'` |
| `steps_config` | `dict` | Dict specifying which steps to run; defaults enable all steps. | `None` |
| `overmissing_samples_threshold` | `float` | Remove rows with missingness > threshold (0-100). | `50.0` |
| `overmissing_features_threshold` | `float` | Remove columns with missingness > threshold (0-100). | `50.0` |
| `unique_threshold` | `int` | Unique value threshold to classify categorical features (clinical). | `10` |
| `scaler` | `str` | Scaler type for numerical features (`'minmax'`, `'standard'`, `'robust'`). | `'minmax'` |
| `ordinal_cat_columns` | `list` | Ordinal categorical columns in clinical data. | `None` |
| `imputer` | `str` | Imputation method to use (`'knn'` or `'mice'`). | `'mice'` |
| `imputer_params` | `dict` | Method-specific params. | `None` |
| `low_expression_variance_threshold` | `float` | Variance cutoff for near-constant genes when low-expression filtering is enabled. | `0.0005` |
| `add_indicators` | `bool` | Whether to add missingness indicator columns during imputation. | `True` |
| `verbose` | `bool` | Print progress from lower-level utilities. | `True` |
Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `Dict[str, Optional[DataFrame]]` | Dictionary containing: `'processed_clinical'` (pd.DataFrame), the processed clinical data; `'processed_transcriptomics'` (pd.DataFrame), the processed transcriptomics data; `'integrated_data'` (pd.DataFrame or None), the integrated dataset if integration is enabled. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If inputs are not pandas DataFrames. |
| `KeyError` | If required ID columns are missing. |
| `ValueError` | On processing errors in underlying steps. |
Source code in src/synomicsbench/processing/pipeline.py, lines 501–683
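The integration step joins the two processed tables on the shared ID column. A minimal pandas sketch of the returned structure, with hypothetical data and assuming an inner join on `integration_id_column`:

```python
import pandas as pd

clinical = pd.DataFrame({"PATIENT_ID": ["p1", "p2"], "AGE": [61, 54]})
transcriptomics = pd.DataFrame({"PATIENT_ID": ["p1", "p2"], "GENE_A": [0.4, 0.7]})

# Join on the shared ID column, mirroring integration_id_column='PATIENT_ID'.
integrated = clinical.merge(transcriptomics, on="PATIENT_ID", how="inner")

result = {
    "processed_clinical": clinical,
    "processed_transcriptomics": transcriptomics,
    "integrated_data": integrated,
}
```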
DataProcessor
¶
DataProcessor provides static methods for preprocessing, encoding, imputing, and engineering features in omics or clinical datasets.
This class enables duplicate removal, missing value filtering, encoding of categorical features, normalization, KNN imputation, and feature engineering classification.
Methods are tailored for pandas DataFrame workflows in bioinformatics and genomics.
Source code in src/synomicsbench/processing/preprocessing.py, lines 21–1004
encode_dummy_features(data)
staticmethod
¶
Encode categorical features into dummy/one-hot columns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with categorical features. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Dummy-encoded DataFrame. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `ValueError` | If encoding fails. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 259–282
encode_ordinal_features(data)
staticmethod
¶
Encode ordinal categorical features using OrdinalEncoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with ordinal categorical features. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tuple[DataFrame, OrdinalEncoder]` | Encoded DataFrame and fitted encoder. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `ValueError` | If encoding fails. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 284–311
extract_missingindicator_columns(data)
staticmethod
¶
Extract columns indicating missingness from imputed DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame after imputation. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with only missing indicator columns. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `ValueError` | On extraction errors. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 583–606
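A sketch of selecting indicator columns with pandas. The `_missing` suffix here is an assumption for illustration; the actual naming convention used by the library may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "AGE": [61.0, 54.0],
    "AGE_missing": [0, 1],   # hypothetical indicator column
    "GENE_A": [0.4, 0.7],
})

# Keep only the columns that flag original missingness.
indicator_cols = [c for c in df.columns if c.endswith("_missing")]
indicators = df[indicator_cols]
```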
feature_engineering(data, data_type, overmissing_threshold=50, imputer='mice', ordinal_cat_columns=None, unique_threshold=10, scaler='minmax', imputer_params=None, add_indicators=True, verbose=True)
staticmethod
¶
Perform feature engineering and imputation using a chosen method with its specific params.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |
| `data_type` | `str` | `'transcriptomics'` or `'clinical'`. | *required* |
| `overmissing_threshold` | `float` | Drop features with missingness > threshold. | `50` |
| `imputer` | `str` | `'mice'` or `'knn'`. | `'mice'` |
| `ordinal_cat_columns` | `List[str]` | Known ordinal categorical columns. | `None` |
| `unique_threshold` | `int` | Threshold for classifying dummy features (clinical). | `10` |
| `scaler` | `str` | `'minmax'`, `'standard'`, or `'robust'`. | `'minmax'` |
| `imputer_params` | `dict` | Method-specific params. For `'knn'`: keys as in `knn_imputer` (e.g., `n_neighbors`, `dummy_cat_columns`, `ordinal_cat_columns`); for `'mice'`: keys as in `mice_imputation` (e.g., `iterations`, `n_estimators`, `random_state`, `verbose`, …). | `None` |
| `add_indicators` | `bool` | Append missing indicators. | `True` |
| `verbose` | `bool` | Print progress. | `True` |

Returns:

| Name | Type | Description |
|---|---|---|
| `tuple` | `tuple` | `(processed_df, numerical_features, dummy_categorical_features)` |
Source code in src/synomicsbench/processing/preprocessing.py, lines 881–1004
find_missing_percent(data)
staticmethod
¶
Calculate the percentage of missing values for each column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with columns `'ColumnName'` and `'PercentMissing'`. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `RuntimeError` | On error during calculation. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 118–142
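The per-column missingness percentage can be sketched in a few lines of pandas, producing the same two-column shape as the documented return value:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, 3.0, None], "B": [1.0, 2.0, 3.0, 4.0]})

# Fraction of NaNs per column, expressed as a percentage.
missing = pd.DataFrame({
    "ColumnName": df.columns,
    "PercentMissing": df.isna().mean().values * 100,
})
```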
impute_router(data, method='mice', *, add_indicators=True, knn_params=None, mice_params=None)
staticmethod
¶
Dispatch imputation to KNN or MICE with method-specific parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input frame (already preprocessed if needed). | *required* |
| `method` | `str` | `'knn'` or `'mice'`. | `'mice'` |
| `add_indicators` | `bool` | Whether to append missing indicators. | `True` |
| `knn_params` | `dict` | Passed to `knn_imputer` (`n_neighbors`, etc.). | `None` |
| `mice_params` | `dict` | Passed to `mice_imputation` (`iterations`, `n_estimators`, etc.). | `None` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Imputed DataFrame (with indicators if requested). |
Source code in src/synomicsbench/processing/preprocessing.py, lines 414–456
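A toy dispatcher mirroring the documented signature. The imputers and the `_was_missing` indicator naming are stand-ins for illustration; the real method delegates to `knn_imputer` and `mice_imputation`:

```python
import pandas as pd

def impute_router(data, method="mice", *, add_indicators=True,
                  knn_params=None, mice_params=None):
    if method == "knn":
        # stand-in for KNN imputation: column medians
        filled = data.fillna(data.median(numeric_only=True))
    elif method == "mice":
        # stand-in for MICE imputation: column means
        filled = data.fillna(data.mean(numeric_only=True))
    else:
        raise ValueError(f"method must be 'knn' or 'mice', got {method!r}")
    if add_indicators:
        for col in data.columns:
            filled[f"{col}_was_missing"] = data[col].isna().astype(int)
    return filled

df = pd.DataFrame({"A": [1.0, None, 3.0]})
out = impute_router(df, method="mice")
```

Routing through a single entry point keeps the per-method parameter dictionaries separate, so `knn_params` never leaks into the MICE call and vice versa.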
inverse_dummy_features(data, dummy_cat_columns)
staticmethod
¶
Decode dummy-encoded categorical features back to original categories.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with dummy-encoded features. | *required* |
| `dummy_cat_columns` | `list` | List of dummy categorical feature names. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with decoded categorical columns. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | On invalid input types or empty list. |
| `ValueError` | On decoding errors. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 609–648
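A sketch of the decoding idea with pandas. The `<feature>_<category>` naming scheme is an assumption for illustration; per row, the dummy column with the highest value is taken as the original category:

```python
import pandas as pd

df = pd.DataFrame({
    "STAGE_I": [1, 0, 1],
    "STAGE_II": [0, 1, 0],
})

# Recover the category by locating the active dummy column per row.
decoded = (
    df[["STAGE_I", "STAGE_II"]]
    .idxmax(axis=1)
    .str.replace("STAGE_", "", regex=False)
)
```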
inverse_ordinal_features(data, encoder)
staticmethod
¶
Decode ordinal categorical features from encoded values back to original categories.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with encoded ordinal features. | *required* |
| `encoder` | `OrdinalEncoder` | Fitted ordinal encoder. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Decoded ordinal categorical features. |

Raises:

| Type | Description |
|---|---|
| `AttributeError` | If encoder is not initialized. |
| `TypeError` | If input is not a DataFrame. |
| `ValueError` | On decoding errors. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 650–683
inverse_standardization(data, scaler)
staticmethod
¶
Undo normalization/standardization on numerical features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with normalized features. | *required* |
| `scaler` | `BaseEstimator` | Fitted scaler. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with original scale restored. |

Raises:

| Type | Description |
|---|---|
| `AttributeError` | If scaler is not initialized. |
| `TypeError` | If input is not a DataFrame. |
| `ValueError` | On inverse normalization errors. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 685–716
knn_imputation(data, dummy_cat_columns=None, ordinal_cat_columns=None, numerical_columns=None, scaler='minmax', n_neighbors=5, add_indicators=True, verbose=True, **kwargs)
staticmethod
¶
Perform full preprocessing and KNN imputation, including encoding, normalization, and postprocessing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |
| `dummy_cat_columns` | `List` | Dummy categorical columns. | `None` |
| `ordinal_cat_columns` | `List` | Ordinal categorical columns. | `None` |
| `numerical_columns` | `List` | Numerical columns. | `None` |
| `scaler` | `str` | Scaler type (`'minmax'`, `'standard'`, `'robust'`). | `'minmax'` |
| `n_neighbors` | `int` | KNN neighbors. | `5` |
| `add_indicators` | `bool` | Add missing indicators. | `True` |
| `verbose` | `bool` | Print progress. | `True` |
| `**kwargs` | | Additional KNNImputer arguments. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame after imputation and decoding. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | On invalid input. |
| `ValueError` | On missing columns or errors. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 718–879
knn_imputer(data, dummy_cat_columns, ordinal_cat_columns=None, n_neighbors=5, add_indicators=True, **kwargs)
staticmethod
¶
Impute missing values using KNNImputer, with special handling for dummy and ordinal features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Preprocessed DataFrame. | *required* |
| `dummy_cat_columns` | `list` | List of dummy categorical feature names. | *required* |
| `ordinal_cat_columns` | `List` | List of ordinal categorical feature names. | `None` |
| `n_neighbors` | `int` | Number of neighbors for KNN. | `5` |
| `add_indicators` | `bool` | Whether to add missing indicators. | `True` |
| `**kwargs` | | Additional KNNImputer arguments. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with imputed values and indicators. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | On invalid input types. |
| `ValueError` | On imputation errors. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 458–581
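The core imputation step builds on scikit-learn's `KNNImputer` (the docstring mentions forwarding additional `KNNImputer` arguments). A minimal sketch without the dummy/ordinal handling, using a toy frame where the missing value in `A` is filled from its two nearest rows:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"A": [1.0, 2.0, None, 4.0], "B": [1.0, 2.0, 3.0, 4.0]})

# Each NaN is replaced by the mean of that feature over the
# n_neighbors rows closest in the remaining features.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```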
mice_imputation(data, *, random_state=42, iterations=20, n_estimators=300, add_indicators=True, verbose=True, **mice_kwargs)
staticmethod
¶
Impute with miceforest and optionally append missing indicators. Additional miceforest arguments can be passed via **mice_kwargs (e.g., variable_schema).
Source code in src/synomicsbench/processing/preprocessing.py
366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 | |
remove_duplications(data, axis)
staticmethod
¶
Remove duplicate rows or columns from the DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |
| `axis` | `int` | 0 to remove duplicate rows, 1 to remove duplicate columns. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with duplicates removed. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `ValueError` | If axis is not 0 or 1. |
| `RuntimeError` | On error during duplicate removal. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 36–63
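A sketch of the documented behavior with pandas (not the library source); the `axis` convention matches the table above:

```python
import pandas as pd

def remove_duplications(data, axis):
    # axis=0 drops repeated rows; axis=1 drops repeated columns
    # by deduplicating the transpose.
    if axis == 0:
        return data.drop_duplicates()
    if axis == 1:
        return data.T.drop_duplicates().T
    raise ValueError("axis must be 0 or 1")

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4], "C": [3, 3, 4]})
rows_deduped = remove_duplications(df, axis=0)   # drops the repeated row
cols_deduped = remove_duplications(df, axis=1)   # drops column C (== B)
```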
remove_low_expression_genes(data, gene_id_column='gene_id', variance_threshold=0.0005)
staticmethod
¶
Remove low-expressed genes from transcriptomics expression data.
This utility supports two common transcriptomics orientations:
- Genes as rows and samples as columns (optionally with a gene_id column).
- Samples as rows and genes as columns (pipeline convention in synomicsbench).
A gene is removed if either:
- Total expression equals 0 across all samples, OR
- Variance is <= variance_threshold across all samples.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Expression DataFrame. | *required* |
| `gene_id_column` | `str` | Column containing gene IDs when genes are rows. | `'gene_id'` |
| `variance_threshold` | `float` | Near-zero variance threshold. | `0.0005` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Filtered DataFrame in the same orientation as the input. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If data is not a pandas DataFrame. |
| `ValueError` | If `variance_threshold` is negative or data contains non-numeric columns. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 179–256
remove_overmissing_entities(data, threshold)
staticmethod
¶
Remove rows (entities) with missing value percentage above a threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |
| `threshold` | `float` | Percentage threshold (0-100). | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with over-missing entities removed. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `ValueError` | If threshold is not between 0 and 100. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 87–116
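Row-wise missingness filtering can be sketched with a boolean mask in pandas; rows strictly above the threshold are dropped:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, None, None],
    "B": [2.0, None, 3.0],
})
threshold = 50.0  # percent

# Percentage of missing values per row; keep rows at or below the threshold.
row_missing_pct = df.isna().mean(axis=1) * 100
filtered = df[row_missing_pct <= threshold]
```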
remove_overmissing_features(data, threshold)
staticmethod
¶
Remove columns (features) with missing value percentage above a threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |
| `threshold` | `float` | Percentage threshold (0-100). | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with over-missing features removed. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `ValueError` | If threshold is not between 0 and 100. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 144–177
remove_unknown_entities(data, id_column)
staticmethod
¶
Remove rows where the identifier column contains missing values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input DataFrame. | *required* |
| `id_column` | `str` | Name of the identifier column. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Filtered DataFrame. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If input is not a pandas DataFrame. |
| `KeyError` | If `id_column` is not in DataFrame. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 65–85
standardization(data, scaler)
staticmethod
¶
Standardize numerical features using specified scaler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with numerical features. | *required* |
| `scaler` | `str` | Scaler type (`'standard'`, `'minmax'`, or `'robust'`). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tuple[DataFrame, BaseEstimator]` | Scaled DataFrame and scaler object. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If scaler type is invalid. |
| `TypeError` | If input is not a pandas DataFrame. |
| `RuntimeError` | If scaling fails. |
Source code in src/synomicsbench/processing/preprocessing.py, lines 313–348
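The scaler selection can be sketched as a small dispatch over scikit-learn scalers, returning both the scaled frame and the fitted scaler (so `inverse_standardization` can undo it later). The dispatch table itself is an illustrative assumption:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

SCALERS = {"minmax": MinMaxScaler, "standard": StandardScaler, "robust": RobustScaler}

def standardization(data, scaler):
    if scaler not in SCALERS:
        raise ValueError(f"Invalid scaler type: {scaler!r}")
    fitted = SCALERS[scaler]()
    scaled = pd.DataFrame(
        fitted.fit_transform(data), columns=data.columns, index=data.index
    )
    return scaled, fitted

df = pd.DataFrame({"AGE": [20.0, 40.0, 60.0]})
scaled, scaler_obj = standardization(df, "minmax")
```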
postprocessing
¶
anonymize_ids(ids, synthetic_data, output_path, mapping_file='anonymized_ids.json')
¶
Anonymize a list of IDs using random UUIDs and save the mapping.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ids` | `list or Series` | List of original IDs to anonymize. | *required* |
| `synthetic_data` | `DataFrame` | Synthetic data to anonymize. | *required* |
| `output_path` | `str` | Directory to save the ID mapping. | *required* |
| `mapping_file` | `str` | Filename for the mapping JSON. | `'anonymized_ids.json'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Synthetic data with anonymized IDs. |
Source code in src/synomicsbench/processing/postprocessing.py, lines 153–191
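The core idea is a one-to-one map from original IDs to random UUIDs, applied to the ID column. A minimal sketch (the real function also persists the mapping as JSON under `output_path`; the column name here is hypothetical):

```python
import uuid

import pandas as pd

ids = ["p1", "p2", "p3"]
synthetic = pd.DataFrame({"PATIENT_ID": ["p1", "p3"], "AGE": [61, 54]})

# One random UUID per original ID.
mapping = {original: str(uuid.uuid4()) for original in ids}
synthetic["PATIENT_ID"] = synthetic["PATIENT_ID"].map(mapping)
```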
apply_min_max(synthetic_data, numerical_columns, min_values, max_values)
¶
Apply minimum and maximum constraints to numerical columns in synthetic data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `synthetic_data` | `DataFrame` | Synthetic data DataFrame. | *required* |
| `numerical_columns` | `list` | List of numerical column names. | *required* |
| `min_values` | `dict` | Dictionary mapping each column to its minimum value. | *required* |
| `max_values` | `dict` | Dictionary mapping each column to its maximum value. | *required* |
Source code in src/synomicsbench/processing/postprocessing.py, lines 122–136
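Clamping synthetic values back into the observed range corresponds to a per-column clip; a sketch with a hypothetical `AGE` column:

```python
import pandas as pd

synthetic = pd.DataFrame({"AGE": [-4.2, 37.0, 131.5]})
numerical_columns = ["AGE"]
min_values = {"AGE": 0.0}
max_values = {"AGE": 120.0}

# Clip each numerical column to its stored [min, max] range.
for col in numerical_columns:
    synthetic[col] = synthetic[col].clip(lower=min_values[col], upper=max_values[col])
```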
apply_rounding(synthetic_data, numerical_columns, rounding_digits)
¶
Round numerical columns to the stored number of decimal places.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `synthetic_data` | `DataFrame` | Synthetic data DataFrame. | *required* |
| `numerical_columns` | `list` | List of numerical column names. | *required* |
| `rounding_digits` | `dict` | Dictionary containing the number of rounding digits for each numerical column. | *required* |
Source code in src/synomicsbench/processing/postprocessing.py, lines 138–151
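Per-column rounding with a digits dictionary can be sketched directly in pandas:

```python
import pandas as pd

synthetic = pd.DataFrame({"AGE": [61.237, 54.661]})
rounding_digits = {"AGE": 1}

# Round each column to its stored number of decimal places.
for col, digits in rounding_digits.items():
    synthetic[col] = synthetic[col].round(digits)
```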
load_metadata(metadata_path)
¶
Load metadata from a JSON file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata_path` | `str` | Path to the metadata JSON file. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Loaded metadata. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the metadata file does not exist. |
| `JSONDecodeError` | If the metadata file is not valid JSON. |
Source code in src/synomicsbench/processing/postprocessing.py, lines 7–28
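Loading metadata amounts to reading a JSON file with the standard library; a self-contained sketch that writes a hypothetical metadata file and reads it back:

```python
import json
import os
import tempfile

# Hypothetical metadata written only for this example.
meta = {"AGE": "numerical", "SEX": "dummy_categorical"}
metadata_path = os.path.join(tempfile.mkdtemp(), "metadata.json")
with open(metadata_path, "w") as fh:
    json.dump(meta, fh)

# json.load raises FileNotFoundError / json.JSONDecodeError on bad input,
# matching the documented exceptions.
with open(metadata_path) as fh:
    loaded = json.load(fh)
```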
MetaData
¶
Source code in src/synomicsbench/processing/metadata.py, lines 8–256
classify_features_types(data, threshold_unique_values, ordinal_features=None, binary_values={0, 1})
staticmethod
¶
Classify columns into dummy categorical and numerical features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input dataframe. | *required* |
| `threshold_unique_values` | `int` | Threshold for unique values to consider as categorical. | *required* |
| `ordinal_features` | `Optional[List]` | List of columns to treat as ordinal (will be excluded from both outputs). | `None` |
| `binary_values` | `set` | Set of values to consider as binary for dummy categorical. | `{0, 1}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `tuple` | `(List[str], List[str])` | `(dummy_categorical_columns, numerical_columns)` |
Source code in src/synomicsbench/processing/metadata.py, lines 12–56
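The classification rule can be sketched as: a column is dummy categorical if its values fall inside `binary_values` and its unique-value count is at or below the threshold; otherwise it is numerical. The exact rule in the library may differ in edge cases:

```python
import pandas as pd

df = pd.DataFrame({
    "SEX": [0, 1, 0, 1],      # binary -> dummy categorical
    "AGE": [61, 54, 47, 72],  # many distinct values -> numerical
})
threshold_unique_values = 3
binary_values = {0, 1}

dummy_cols, numerical_cols = [], []
for col in df.columns:
    uniques = set(df[col].dropna().unique())
    if uniques <= binary_values and df[col].nunique() <= threshold_unique_values:
        dummy_cols.append(col)
    else:
        numerical_cols.append(col)
```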
get_column_indices(data, column_list)
staticmethod
¶
Get the indices of the specified columns in a DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | The DataFrame whose column indices are extracted. | *required* |
| `column_list` | `list` | List of columns. | *required* |

Returns:

| Type | Description |
|---|---|
| `array` | np.array containing the column indices. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If a specified column does not exist in the DataFrame. |
Source code in src/synomicsbench/processing/metadata.py, lines 202–226
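The lookup can be sketched with `Index.get_loc`, raising `ValueError` when a requested column is absent, as documented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["PATIENT_ID", "AGE", "GENE_A"])
column_list = ["AGE", "GENE_A"]

# Validate up front so the error names every missing column.
missing = [c for c in column_list if c not in df.columns]
if missing:
    raise ValueError(f"Columns not found in DataFrame: {missing}")

indices = np.array([df.columns.get_loc(c) for c in column_list])
```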
get_metadata(data, threshold_unique_values, id_columns=None, ordinal_features=None, transcriptomic_cols=None, binary_values={0, 1})
staticmethod
¶
Extract column type metadata from a DataFrame and assign columns to corresponding types.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with features. | *required* |
| `threshold_unique_values` | `int` | Threshold for unique values to consider as categorical. | *required* |
| `id_columns` | `Optional[List]` | List of columns to ignore (e.g., sample IDs). | `None` |
| `ordinal_features` | `Optional[List]` | List of columns to treat as ordinal. | `None` |
| `binary_values` | `set` | Set of values to consider as binary. | `{0, 1}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict[str, str]` | Dictionary mapping each column to its feature type: `'numerical'`, `'ordinal_categorical'`, `'dummy_categorical'`, `'missing_categorical'`, or `'unclassified'`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If classification fails. |
Source code in src/synomicsbench/processing/metadata.py
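The per-column type assignment can be sketched end to end. This standalone example mirrors the labels listed above but is an assumption-laden simplification (plain dicts instead of DataFrames; the precedence of `missing_categorical` over the other labels is my guess, not the library's documented rule).

```python
from typing import Dict, List, Optional


def build_metadata(
    columns: Dict[str, list],
    threshold_unique_values: int,
    id_columns: Optional[List[str]] = None,
    ordinal_features: Optional[List[str]] = None,
    binary_values=frozenset({0, 1}),
) -> Dict[str, str]:
    """Map each column name to a feature-type label (sketch)."""
    ids = set(id_columns or [])
    ordinals = set(ordinal_features or [])
    metadata = {}
    for name, values in columns.items():
        if name in ids:
            continue  # sample identifiers are ignored
        non_missing = [v for v in values if v is not None]
        unique = set(non_missing)
        if len(non_missing) < len(values):
            metadata[name] = "missing_categorical"  # assumed precedence
        elif name in ordinals:
            metadata[name] = "ordinal_categorical"
        elif unique <= set(binary_values):
            metadata[name] = "dummy_categorical"
        elif len(unique) > threshold_unique_values:
            metadata[name] = "numerical"
        else:
            metadata[name] = "unclassified"
    return metadata


cols = {"sid": [1, 2, 3], "resp": [0, 1, 0], "tpm": [0.1, 2.3, 5.5], "stage": [1, None, 2]}
build_metadata(cols, threshold_unique_values=2, id_columns=["sid"])
```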
grouping_features_astype(data, metadata)
staticmethod
¶
Assign columns to their corresponding feature types based on a metadata dictionary.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame used to filter available columns. | *required* |
| `metadata` | `dict` | Metadata dictionary. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Dictionary with keys 'numerical', 'ordinal_categorical', 'dummy_categorical', and 'missing_categorical', mapping to lists of column names. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the metadata file is not found. |
| `JSONDecodeError` | If the metadata file is not a valid JSON. |
Source code in src/synomicsbench/processing/metadata.py
load(metadata_dir)
staticmethod
¶
Load metadata from a JSON file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata_dir` | `str` | Path to the metadata JSON file. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Loaded metadata. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the metadata file does not exist. |
| `JSONDecodeError` | If the metadata file is not a valid JSON. |
Source code in src/synomicsbench/processing/metadata.py
metadata_as_SDV(data, metadata)
staticmethod
¶
Assign columns to their corresponding feature types, following the SDV library metadata format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame used to filter available columns. | *required* |
| `metadata` | `dict` | Metadata dictionary. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Dictionary following the SDV format. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the metadata file is not found. |
| `JSONDecodeError` | If the metadata file is not a valid JSON. |
Source code in src/synomicsbench/processing/metadata.py
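A conversion of this kind can be sketched as a small mapping step. This is not the library's implementation; the internal-label-to-sdtype mapping below and the exact SDV output shape are assumptions based on SDV's common `{"columns": {name: {"sdtype": ...}}}` metadata layout.

```python
from typing import Dict, Iterable


def metadata_to_sdv(columns_present: Iterable[str], metadata: Dict[str, str]) -> dict:
    """Convert internal feature-type labels to an SDV-style metadata dict (sketch)."""
    present = set(columns_present)
    sdtype_map = {  # assumed mapping, not taken from the source
        "numerical": "numerical",
        "ordinal_categorical": "categorical",
        "dummy_categorical": "categorical",
        "missing_categorical": "categorical",
    }
    return {
        "columns": {
            col: {"sdtype": sdtype_map.get(ftype, "unknown")}
            for col, ftype in metadata.items()
            if col in present  # keep only columns available in the data
        }
    }
```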
GeneQuery
¶
A class to query gene information using MyGeneInfo and convert Ensembl IDs to HUGO symbols.
This class facilitates querying gene data, checking for duplicates, and mapping Ensembl gene IDs to HUGO symbols. Results are saved as JSON/CSV files, and operations are logged for debugging.
Attributes:

| Name | Type | Description |
|---|---|---|
| `fields` | `List[str]` | Fields to retrieve from MyGeneInfo (e.g., ["symbol"]). |
| `scopes` | `List[str]` | Scopes for gene queries (e.g., ["ensembl.gene"]). |
| `species` | `List[str]` | Species to query (e.g., ["human"]). |
| `output_dir` | `str` | Directory to save results and logs. |
| `logger` | `Logger` | Logger instance for debugging and error tracking. |
Source code in src/synomicsbench/processing/gene_query.py
__init__(fields, scopes, species, output_dir)
¶
Initialize GeneQuery with query parameters and logging setup.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `List[str]` | Fields to retrieve from MyGeneInfo queries. | *required* |
| `scopes` | `List[str]` | Scopes defining the gene identifier types. | *required* |
| `species` | `List[str]` | Species to include in the query. | *required* |
| `output_dir` | `str` | Directory to store output files and logs. | *required* |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If the log file cannot be created due to permissions or path issues. |
Source code in src/synomicsbench/processing/gene_query.py
check_duplicates(data, **kwargs)
¶
Identify and group duplicate genes based on expression values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with samples as rows and genes as columns. | *required* |
| `**kwargs` | `Any` | Additional arguments for gene_query. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Tuple[Dict[str, List[str]], Dict[str, List[str]]]` | Tuple of two dictionaries. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If duplicate checking fails due to invalid data or query errors. |
Source code in src/synomicsbench/processing/gene_query.py
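The duplicate-grouping idea (genes sharing identical expression profiles) can be sketched without MyGeneInfo. This is an illustrative simplification using a dict of gene-to-values, not the library's actual logic, which also involves query results.

```python
from collections import defaultdict
from typing import Dict, List


def group_duplicate_genes(expression: Dict[str, list]) -> Dict[str, List[str]]:
    """Group gene columns that share identical expression profiles (sketch)."""
    by_profile = defaultdict(list)
    for gene, values in expression.items():
        by_profile[tuple(values)].append(gene)  # hashable profile as key
    # keep only profiles shared by more than one gene; first gene is representative
    return {genes[0]: genes for genes in by_profile.values() if len(genes) > 1}


group_duplicate_genes({"ENSG1": [1, 2], "ENSG2": [1, 2], "ENSG3": [0, 5]})
# -> {"ENSG1": ["ENSG1", "ENSG2"]}
```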
convert_genes(data, **kwargs)
¶
Convert Ensembl gene IDs to HUGO symbols.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | DataFrame with Ensembl gene IDs as columns. | *required* |
| `**kwargs` | `Any` | Additional arguments for gene_query. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing gene mapping information. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If conversion fails due to invalid input or query errors. |
Source code in src/synomicsbench/processing/gene_query.py
gene_query(genes_list, **kwargs)
¶
Query gene information using MyGeneInfo.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `genes_list` | `List[str]` | List of gene identifiers to query. | *required* |
| `**kwargs` | `Any` | Additional arguments for MyGeneInfo.querymany. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[Dict[str, Any]]` | List of dictionaries containing gene mapping information. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the query fails due to network issues or invalid parameters. |
Source code in src/synomicsbench/processing/gene_query.py
Metrics: Fidelity¶
The metrics.fidelity module provides assessment tools for evaluating the distribution quality of synthetic data against real data.
UnivariateSimilarity
¶
Compute and validate univariate similarity between original and synthetic data, including score computation, logging, result saving, and visualization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `str` | Directory to save outputs and logs. | *required* |
| `logger_name` | `str` | Logger name. | `'UnivariateSimilarity'` |

Attributes:

| Name | Type | Description |
|---|---|---|
| `output_dir` | `str` | Output directory path. |
| `column_shapes` | `ColumnShapes` | SDMetrics ColumnShapes property. |
| `logger` | `logging.Logger` | Logger for this class. |
Source code in src/synomicsbench/metrics/fidelity/UnivariateSimilarity.py
get_detail_df()
¶
Get the DataFrame with column-level univariate similarity details.
Returns:

| Type | Description |
|---|---|
| | pd.DataFrame: Details DataFrame. |

Raises:

| Type | Description |
|---|---|
| `AttributeError` | If column_shapes is not initialized. |
Source code in src/synomicsbench/metrics/fidelity/UnivariateSimilarity.py
get_univariate_score(original_data, synthetic_data, metadata, save=True)
¶
Compute univariate similarity score between original and synthetic data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `original_data` | `DataFrame` | Original data. | *required* |
| `synthetic_data` | `DataFrame` | Synthetic data. | *required* |
| `metadata` | `dict` | SDMetrics metadata. | *required* |
| `save` | `bool` | Save score dataframe and score distribution. | `True` |

Returns:

| Name | Type | Description |
|---|---|---|
| `float` | `float` | Overall univariate similarity score. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If score computation fails. |
Source code in src/synomicsbench/metrics/fidelity/UnivariateSimilarity.py
get_visualization(plotly=False, data_name='', bins=50)
¶
Get a visualization of the column shape scores.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `plotly` | `bool` | Whether to use Plotly for visualization. | `False` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Figure` | | Matplotlib or Plotly figure. |
Source code in src/synomicsbench/metrics/fidelity/UnivariateSimilarity.py
plot_column_score_histogram(details, data_name='', bins=50)
staticmethod
¶
Plot a decorated histogram of column shape scores for a set of features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `details` | `DataFrame` | DataFrame containing at least a 'Score' column with numerical values. | *required* |
| `data_name` | `str` | Optional label for the data (for title). | `''` |
| `bins` | `int` | Number of bins for the histogram. | `50` |

Returns:

| Type | Description |
|---|---|
| | matplotlib.figure.Figure: Figure object for saving or further manipulation. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If the 'Score' column is not present in the DataFrame. |
| `TypeError` | If details is not a pandas DataFrame. |
Source code in src/synomicsbench/metrics/fidelity/UnivariateSimilarity.py
summarize()
¶
Summarize the scores by metric type.
Returns:

| Type | Description |
|---|---|
| | pd.DataFrame: Grouped summary statistics by metric. |
Source code in src/synomicsbench/metrics/fidelity/UnivariateSimilarity.py
PairwiseSimilarity
¶
Compute and analyze pairwise similarity metrics (Pearson and contingency) between columns of original and synthetic datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `original_data` | `DataFrame` | Input original dataset. | *required* |
| `synthetic_data` | `DataFrame` | Input synthetic dataset. | *required* |
| `metadata` | `dict` | Column type metadata. | *required* |
| `output_dir` | `str` | Directory for logs and outputs. | `''` |

Returns:

| Name | Type | Description |
|---|---|---|
| `PairwiseSimilarity` | | An initialized instance for similarity analysis. |

Raises:

| Type | Description |
|---|---|
| `OSError` | If the output directory cannot be created. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
get_cross_group_associations(group_1_features, group_2_features, score_matrix=None, original_matrix=None, synthetic_matrix=None, condensed=True)
¶
Calculate bivariate scores ONLY between two groups of features (e.g., clinical vs transcriptomics).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `group_1_features` | `List[str]` | List of feature names in first group (e.g., clinical). | *required* |
| `group_2_features` | `List[str]` | List of feature names in second group (e.g., transcriptomics). | *required* |
| `score_matrix` | `Optional[ndarray]` | Score matrix (condensed or square). | `None` |
| `original_matrix` | `Optional[ndarray]` | Original correlation matrix. | `None` |
| `synthetic_matrix` | `Optional[ndarray]` | Synthetic correlation matrix. | `None` |
| `condensed` | `bool` | Whether matrices are in condensed form. | `True` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pd.DataFrame: DataFrame with cross-group associations only. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If any feature name is not found in the data. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
get_multiple_associations(feature_pairs, score_matrix=None, original_matrix=None, synthetic_matrix=None, condensed=True)
¶
Extract association metrics for multiple feature pairs using vectorized indexing.
This implementation minimizes Python-loop overhead by computing all indices at once and gathering values from correlation/score matrices via NumPy advanced indexing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `feature_pairs` | `List[Tuple[Union[str, int], Union[str, int]]]` | List of (feature_1, feature_2) pairs specified by name or index. | *required* |
| `score_matrix` | `Optional[ndarray]` | Score matrix. | `None` |
| `original_matrix` | `Optional[ndarray]` | Original correlation matrix (condensed or square, depending on `condensed`). | `None` |
| `synthetic_matrix` | `Optional[ndarray]` | Synthetic correlation matrix (condensed or square, depending on `condensed`). | `None` |
| `condensed` | `bool` | If True, matrices are interpreted as condensed upper-triangular vectors (no diagonal). If False, matrices are interpreted as square (n x n) arrays. | `True` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pd.DataFrame: DataFrame with columns 'Feature_1', 'Feature_2', 'Metrics' ("CramersV_Correlation" if either feature is categorical, else "Spearman_Correlation"), 'Original_Correlation', 'Synthetic_Correlation', and 'Score'. |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | |
| `ValueError` | |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
get_pairwise_scores(method, n_bins=10)
¶
Compute the pairwise similarity matrices and print summary statistics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `str` | Correlation method for MixedCorrelation. | *required* |
| `n_bins` | `int` | Number of bins for discretization (default 10). | `10` |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | | Dictionary containing pairwise score matrix, original condensed correlation, and synthetic condensed correlation. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If any computation or extraction fails. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
get_single_associations(feature_1, feature_2, score_matrix=None, original_matrix=None, synthetic_matrix=None)
¶
Retrieve detailed correlation and score between two features from the condensed correlation matrices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `feature_1` | `Union[str, int]` | First feature name or column index. | *required* |
| `feature_2` | `Union[str, int]` | Second feature name or column index. | *required* |
| `score_matrix` | `Optional[array]` | Condensed score matrix. | `None` |
| `original_matrix` | `Optional[array]` | Condensed correlation matrix for original data. | `None` |
| `synthetic_matrix` | `Optional[array]` | Condensed correlation matrix for synthetic data. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | | Dictionary containing feature names, metric type, original correlation, synthetic correlation, and score. |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If features are not in the data, or indices are out-of-bounds. |
| `ValueError` | If correlation matrices are not provided. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
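The "condensed" indexing used throughout these methods (a 1D vector storing only the upper triangle of a symmetric n x n matrix, no diagonal) reduces to a closed-form index. A small stdlib sketch, using the standard squareform-style convention; the exact convention used by the library is assumed:

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Index of pair (i, j) in a condensed upper-triangular vector of n items (sketch)."""
    if i == j:
        raise ValueError("Diagonal entries are not stored in condensed form.")
    i, j = min(i, j), max(i, j)  # symmetric: order does not matter
    # rows 0..i-1 contribute (n-1) + (n-2) + ... entries before row i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# For n=4 the condensed order is (0,1),(0,2),(0,3),(1,2),(1,3),(2,3)
condensed_index(1, 2, 4)  # -> 3
```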
plot_column_score_histogram(results, data_name='', bins=50)
staticmethod
¶
Plot a histogram of column shape scores for a set of features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `array` | Numpy array containing at least a 'Score' column with numerical values. | *required* |
| `data_name` | `str` | Optional label for the data (for title). | `''` |
| `bins` | `int` | Number of bins for the histogram. | `50` |

Returns:

| Type | Description |
|---|---|
| | matplotlib.figure.Figure: Figure object for saving or further manipulation. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If the 'Score' column is not present in the DataFrame. |
| `TypeError` | If results is not a pandas DataFrame. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
print_condensed_matrix_summary(condensed_matrix, name='Condensed Matrix')
¶
Print statistical summary of a condensed matrix (1D array containing the upper triangle of a square matrix).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `condensed_matrix` | `ndarray` | 1D numpy array containing condensed matrix values. | *required* |
| `name` | `str` | Name for display. | `'Condensed Matrix'` |

Returns:

| Type | Description |
|---|---|
| | None |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If condensed_matrix is not a 1D numpy array. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
summarize(results)
staticmethod
¶
Summarize pairwise similarity scores by metric type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `DataFrame` | DataFrame containing pairwise association results. | *required* |

Returns:

| Type | Description |
|---|---|
| | pd.DataFrame: Summary statistics (count, mean, std, min, max, etc.) grouped by metric type. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If summary calculation fails. |
Source code in src/synomicsbench/metrics/fidelity/PairwiseSimilarity.py
MissingValueSimilarity
¶
MissingValue_Similarity(origin_data, synthetic_data, missing_indicators)
¶
Compute missing value similarity scores for each target column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `origin_data` | `DataFrame` | Original data. | *required* |
| `synthetic_data` | `DataFrame` | Synthetic data. | *required* |
| `missing_indicators` | `list` | List of column names with missing indicators. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | | Mapping from column name to similarity score. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If computation fails for any column. |
Source code in src/synomicsbench/metrics/fidelity/MissingValueSimilarity.py
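One simple way to score missingness similarity per column is to compare missingness rates between the original and synthetic data. The sketch below illustrates that idea with plain dicts; the actual metric in MissingValueSimilarity.py may differ, and the `1 - |rate difference|` formula is an assumption.

```python
from typing import Dict, List


def missing_value_similarity(
    original: Dict[str, list],
    synthetic: Dict[str, list],
    missing_indicators: List[str],
) -> Dict[str, float]:
    """Score how closely synthetic data reproduces per-column missingness (sketch)."""
    scores = {}
    for col in missing_indicators:
        rate_orig = sum(v is None for v in original[col]) / len(original[col])
        rate_syn = sum(v is None for v in synthetic[col]) / len(synthetic[col])
        scores[col] = 1.0 - abs(rate_orig - rate_syn)  # 1.0 = identical rates
    return scores
```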
BayesianComparison
¶
BaycompStyle
dataclass
¶
Store manuscript-style visualization settings for baycomp heatmaps.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fontsize` | `int` | Base font size used in matplotlib rcParams. | `11` |
| `nature_font` | `Dict[str, Sequence[str]]` | Font family configuration. | `(lambda: dict(_FONT))()` |
| `plot_colors` | `Dict[str, str]` | Common background/grid colors. | `(lambda: dict(STABILITY_PLOT_COLORS))()` |
| `pbetter_cmap` | `LinearSegmentedColormap` | Colormap for P(Better) heatmaps. | `(lambda: PBETTER_FOCUS_CMAP)()` |
| `cancer_colors` | `Dict[str, str]` | Color strip mapping for each cancer panel. | `(lambda: dict(CANCER_COLORS))()` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If fontsize is not positive. |
Source code in src/synomicsbench/metrics/fidelity/BayesianComparison.py
BayesianComparison
dataclass
¶
Compute Bayesian pairwise comparisons using the external baycomp package.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rope` | `float` | ROPE threshold for practical equivalence. | `0.01` |
| `seed` | `int` | Random seed for reproducibility in baycomp. | `0` |
| `style` | `BaycompStyle` | Plot styling settings for consistent manuscript figures. | `BaycompStyle()` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If rope is not positive. |
Source code in src/synomicsbench/metrics/fidelity/BayesianComparison.py
compare_methods(method_to_scores, methods_order=None)
¶
Compute ordered-pair Bayesian comparison probabilities for a set of methods.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method_to_scores` | `Mapping[str, Sequence[float]]` | Mapping method -> scores across seeds/folds. | *required* |
| `methods_order` | `Optional[Sequence[str]]` | Optional ordering for methods. | `None` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pd.DataFrame: Table with columns: ["Method 1", "Method 2", "Better Prob", "Worse Prob", "Equivalent Prob"]. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If fewer than 2 methods are provided. |
| `ValueError` | If any method has fewer than 2 finite scores. |
Source code in src/synomicsbench/metrics/fidelity/BayesianComparison.py
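The shape of the comparison table can be illustrated without baycomp. The sketch below replaces the Bayesian posterior with a simple count of paired score differences against the ROPE threshold; it reproduces the output columns above but NOT the actual baycomp statistics.

```python
from itertools import permutations
from typing import Dict, List, Sequence


def compare_methods_sketch(
    method_to_scores: Dict[str, Sequence[float]], rope: float = 0.01
) -> List[dict]:
    """Ordered-pair comparison rows from per-seed score differences (sketch)."""
    if len(method_to_scores) < 2:
        raise ValueError("Need at least two methods to compare.")
    rows = []
    for m1, m2 in permutations(method_to_scores, 2):
        diffs = [a - b for a, b in zip(method_to_scores[m1], method_to_scores[m2])]
        n = len(diffs)
        better = sum(d > rope for d in diffs) / n   # m1 practically better
        worse = sum(d < -rope for d in diffs) / n   # m1 practically worse
        rows.append({"Method 1": m1, "Method 2": m2,
                     "Better Prob": better, "Worse Prob": worse,
                     "Equivalent Prob": 1.0 - better - worse})
    return rows
```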
comparison_to_matrix(comparison_df, methods_order, value_col='Better Prob', nan_diagonal=True)
staticmethod
¶
Convert a comparison table into a square matrix suitable for heatmap plotting.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `comparison_df` | `DataFrame` | Output of compare_methods. | *required* |
| `methods_order` | `Sequence[str]` | Method ordering for rows and columns. | *required* |
| `value_col` | `str` | Which column to visualize. | `'Better Prob'` |
| `nan_diagonal` | `bool` | If True, set diagonal values to NaN. | `True` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | pd.DataFrame: Square matrix with index=Method 1 and columns=Method 2. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If value_col is not present in comparison_df. |
Source code in src/synomicsbench/metrics/fidelity/BayesianComparison.py
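The pivot from a long comparison table to a square matrix is a small bookkeeping step. A stdlib sketch using nested dicts in place of a pandas DataFrame (the input rows mimic compare_methods output; `None` stands in for the NaN diagonal):

```python
from typing import Dict, List, Sequence


def comparison_to_matrix_sketch(
    rows: List[dict], methods_order: Sequence[str], value_col: str = "Better Prob"
) -> Dict[str, Dict[str, float]]:
    """Pivot comparison rows into a square nested dict keyed [Method 1][Method 2] (sketch)."""
    matrix = {m1: {m2: None for m2 in methods_order} for m1 in methods_order}
    for row in rows:
        if value_col not in row:
            raise ValueError(f"{value_col!r} not present in comparison rows")
        matrix[row["Method 1"]][row["Method 2"]] = row[value_col]
    return matrix  # diagonal stays None, mirroring nan_diagonal=True
```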
plot_pbetter_heatmap_grid(cancer_to_method_scores, cancers_order=('ccRCC', 'Melanoma', 'NSCLC'), methods_order=None, value_col='Better Prob', figsize=(18, 5.5), annot=True, fmt='.2f', show=True)
¶
Plot a 1xN grid of Bayesian comparison probability heatmaps (one per cancer cohort).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cancer_to_method_scores` | `Dict[str, Dict[str, Sequence[float]]]` | cancer -> {method -> scores across seeds/folds}. | *required* |
| `cancers_order` | `Sequence[str]` | Order of cohorts in the grid. | `('ccRCC', 'Melanoma', 'NSCLC')` |
| `methods_order` | `Optional[Sequence[str]]` | Global method ordering. If None, uses union over cancers. | `None` |
| `value_col` | `str` | One of ["Better Prob", "Worse Prob", "Equivalent Prob"] to visualize. | `'Better Prob'` |
| `figsize` | `Tuple[int, int]` | Figure size in inches. | `(18, 5.5)` |
| `annot` | `bool` | If True, annotate each cell with numeric values. | `True` |
| `fmt` | `str` | Annotation format passed to seaborn. | `'.2f'` |
| `show` | `bool` | If True, calls plt.show(). | `True` |

Returns:

| Type | Description |
|---|---|
| `Tuple[Figure, ndarray, Dict[str, DataFrame], Dict[str, DataFrame]]` | Tuple[plt.Figure, np.ndarray, Dict[str, pd.DataFrame], Dict[str, pd.DataFrame]]: fig, axes, matrices_by_cancer, comparisons_by_cancer |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If value_col is invalid. |
| `ValueError` | If cancers_order contains a cohort not present in cancer_to_method_scores. |
Source code in src/synomicsbench/metrics/fidelity/BayesianComparison.py
visualization
¶
plot_violin_grid_by_cancer(cancer_to_method_scores, value_name='Score', methods_order=None, palette=DATASET_COLORS, figsize=(18, 5), fontsize=11, annotate_mean=True, mean_fmt='{:.3f}', mean_marker='D', mean_marker_size=70.0, mean_text_offset_frac=0.03, sharey=True, show=True)
¶
Plot a 1xN grid of violin plots, one per cancer type, using a shared method color palette. The mean of each method is shown as a white diamond and optionally annotated as text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cancer_to_method_scores` | `Dict[str, Dict[str, List[float]]]` | Mapping cancer -> {method -> list of scores}. | *required* |
| `value_name` | `str` | Y-axis label for the score metric. | `'Score'` |
| `methods_order` | `Optional[Sequence[str]]` | Global ordering of methods across all panels. If None, uses the union of methods in insertion order. | `None` |
| `palette` | `Optional[Dict[str, str]]` | Mapping method -> color. If None, uses DATASET_COLORS. | `DATASET_COLORS` |
| `figsize` | `Tuple[int, int]` | Figure size. | `(18, 5)` |
| `fontsize` | `int` | Base font size. | `11` |
| `annotate_mean` | `bool` | If True, write the mean value above each violin. | `True` |
| `mean_fmt` | `str` | Format string for the mean annotation (e.g., "{:.3f}"). | `'{:.3f}'` |
| `mean_marker` | `str` | Marker for mean point. | `'D'` |
| `mean_marker_size` | `float` | Marker size for mean point. | `70.0` |
| `mean_text_offset_frac` | `float` | Vertical offset for mean text as a fraction of y-span. | `0.03` |
| `sharey` | `bool` | If True, share y-axis across panels. | `True` |
| `show` | `bool` | If True, calls plt.show(). | `True` |

Returns:

| Type | Description |
|---|---|
| `Tuple[Figure, ndarray, Dict[str, Series]]` | Tuple[plt.Figure, np.ndarray, Dict[str, pd.Series]]: (figure, axes, mean_by_cancer). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If cancer_to_method_scores is empty. |
| `ValueError` | If any cancer has no methods or no numeric scores after cleaning. |
| `ValueError` | If methods_order contains methods missing from all cancers. |
Source code in src/synomicsbench/metrics/fidelity/visualization.py
122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 | |
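The nested mapping expected by `cancer_to_method_scores` is the least obvious argument, so here is a minimal sketch of its shape; the method names and score values are illustrative only, not from the library:

```python
# Hypothetical input for plot_violin_grid_by_cancer:
# outer keys = cancer cohorts (one panel each),
# inner keys = methods (one violin each), values = scores across runs.
cancer_to_method_scores = {
    "ccRCC": {
        "CTGAN": [0.71, 0.74, 0.69],
        "TVAE": [0.66, 0.68, 0.64],
    },
    "NSCLC": {
        "CTGAN": [0.58, 0.61, 0.60],
        "TVAE": [0.55, 0.57, 0.52],
    },
}

# A global ordering keeps methods aligned (and colored consistently)
# across all panels.
methods_order = ["CTGAN", "TVAE"]
```
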
Metrics: Biological Utility¶
The metrics.narrow_utility module evaluates how well synthetic data supports downstream bioinformatics tasks.
DGE
¶
GCSAnalyzer
¶
Refactor of the original GCS computation code into a class, preserving identical logic.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `term_col` | `str` | Column name for pathway / gene-set names. | `'Gene'` |
| `nes_col` | `str` | Column name for the effect size (e.g., Log2FC). | `'Log2FC'` |
| `q_col` | `str` | Column name for the Q-value. | `'Q_value'` |
| `seed` | `int` | Seed passed to `np.random.seed` to generate jitter. | `42` |
| `q_thr` | `float` | Q-value threshold used to define the significance boundary. | `0.05` |
| `w` | `float` | Weight for non-significant concordance in the GCS. | `0.5` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `q_thr` is not in (0, 1]. |
| `ValueError` | If `w` is negative. |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 62–399
align_rank_scores(rnk_ori, rnk_syn)
staticmethod
¶
Inner join on pathway / gene names.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rnk_ori` | `DataFrame` | Original rank table indexed by term. | required |
| `rnk_syn` | `DataFrame` | Synthetic rank table indexed by term. | required |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | Aligned rank table. |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 138–156
compute_rank_score(df)
¶
Compute rank score = sign(Log2FC) * -log10(Q-value) with tiny jitter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DGE results table. | required |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | Ranked table indexed by `term_col` with columns `rank_score` and `qval`. |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 107–136
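The rank-score formula above can be sketched in a few lines. This is a hedged reimplementation using the analyzer's default column names (`Gene`, `Log2FC`, `Q_value`); the jitter magnitude is an assumption, and the library code may differ:

```python
import numpy as np
import pandas as pd

def rank_score(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Sketch: rank score = sign(Log2FC) * -log10(Q-value) + tiny jitter."""
    rng = np.random.default_rng(seed)
    q = df["Q_value"].to_numpy(dtype=float)
    score = np.sign(df["Log2FC"].to_numpy()) * -np.log10(q)
    # Tiny jitter breaks ties without visibly moving points (assumed scale).
    score = score + rng.normal(0.0, 1e-9, size=len(q))
    return pd.DataFrame({"rank_score": score, "qval": q}, index=df["Gene"])

df = pd.DataFrame({"Gene": ["A", "B"], "Log2FC": [2.0, -1.5], "Q_value": [0.01, 0.5]})
scores = rank_score(df)
```

Upregulated gene A (Q = 0.01) gets a rank score near +2; downregulated gene B stays negative.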
gene_set_concordance_score(df_rank, ori_rank_size)
¶
Compute GCS using original pathway count to avoid bias due to pathway loss after alignment.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df_rank` | `DataFrame` | Aligned rank table with `rank_ori`, `rank_syn`, `q_ori`. | required |
| `ori_rank_size` | `int` | Number of pathways in the original ranking. | required |

Returns:

| Type | Description |
|---|---|
| `Tuple[float, int, int, float]` | `(GCS, N_sign, N_non_sign, M)`. |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 158–193
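As a sketch of the zone bookkeeping described above, assuming concordance means matching signs of the original and synthetic rank scores, significance means `q_ori < q_thr`, and `M` is the original pathway count; the exact formula in `DGE.py` may differ in detail:

```python
import numpy as np

def gcs_sketch(x, y, q_ori, m, q_thr=0.05, w=0.5):
    """Sketch of GCS = (N_sign + w * N_non_sign) / M (assumed formula)."""
    x, y, q_ori = map(np.asarray, (x, y, q_ori))
    concordant = np.sign(x) == np.sign(y)   # lower-left or upper-right zone
    sig = q_ori < q_thr
    n_sign = int(np.sum(concordant & sig))       # zones 3 + 4
    n_non_sign = int(np.sum(concordant & ~sig))  # zones 1 + 2
    return (n_sign + w * n_non_sign) / m, n_sign, n_non_sign

score, n_sig, n_non = gcs_sketch(
    x=[2.0, -1.2, 0.8, -0.5],      # original rank scores
    y=[1.5, -0.9, -0.3, -0.4],     # synthetic rank scores
    q_ori=[0.01, 0.02, 0.2, 0.4],  # original q-values
    m=4,                           # original pathway count
)
```

Normalizing by the original count `m` (rather than the aligned count) is what prevents a synthesizer from inflating its score by dropping pathways.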
plot_gcs_datasets(ori_data, dataset_dict, figsize=(18, 10))
¶
Plot all datasets in a grid.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ori_data` | `str | PathLike | DataFrame` | Path to the original CSV or a DataFrame. | required |
| `dataset_dict` | `Mapping[str, str | PathLike | DataFrame]` | Tool name -> path or DataFrame. | required |
| `figsize` | `tuple[int, int]` | Figure size. | `(18, 10)` |

Returns:

| Type | Description |
|---|---|
| `tuple[matplotlib.figure.Figure, dict[str, float]]` | `(figure, gcs_dict)`. |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 335–399
plot_single_gcs_panel(ax, x, y, gcs, n_zone1, n_zone2, n_zone3, n_zone4, tool_name, m)
¶
Plot a single GCS comparison panel (identical logic and styling to the original code).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ax` | `Axes` | Target axis. | required |
| `x` | `ndarray` | Rank scores (original). | required |
| `y` | `ndarray` | Rank scores (synthetic). | required |
| `gcs` | `float` | GCS value. | required |
| `n_zone1` | `int` | Count in zone 1. | required |
| `n_zone2` | `int` | Count in zone 2. | required |
| `n_zone3` | `int` | Count in zone 3. | required |
| `n_zone4` | `int` | Count in zone 4. | required |
| `tool_name` | `str` | Tool/dataset name for the title. | required |
| `m` | `float` | M normalization used in the GCS. | required |

Returns:

| Type | Description |
|---|---|
| `None` | None |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 249–333
process_single_dge_result(dge_ori, dge_syn)
¶
Process original and synthetic DGE results (paths or DataFrames).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dge_ori` | `str | PathLike | DataFrame` | Original DGE table or path. | required |
| `dge_syn` | `str | PathLike | DataFrame` | Synthetic DGE table or path. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `tuple` | `Tuple[ndarray, ndarray, float, int, int, int, int, float, int, int, int]` | `(x, y, GCS, n_zone1, n_zone2, n_zone3, n_zone4, M, ori_rank_size, aligned_size, seed_used)`. |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 195–247
GCSResult
dataclass
¶
Container for GCS computation outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `gcs` | `float` | Gene-set Concordance Score (same formula as PCS, renamed). | required |
| `n_sign` | `int` | Number of concordant significant pathways (zones 3 + 4). | required |
| `n_non_sign` | `int` | Number of concordant non-significant pathways (zones 1 + 2). | required |
| `m` | `float` | Normalization constant M used in the GCS formula. | required |
| `n_zone1` | `int` | Count in the non-significant lower-left zone. | required |
| `n_zone2` | `int` | Count in the non-significant upper-right zone. | required |
| `n_zone3` | `int` | Count in the significant lower-left zone. | required |
| `n_zone4` | `int` | Count in the significant upper-right zone. | required |
| `ori_rank_size` | `int` | Number of ranked pathways in the original ranking. | required |
| `aligned_size` | `int` | Number of pathways after alignment (inner join). | required |

Source code in `src/synomicsbench/metrics/narrow_utility/DGE.py`, lines 33–59
GSEA
¶
PCSAnalyzer
¶
Compute PCS from GSEA pathway enrichment tables and generate manuscript-style panels.
This class preserves the original logic provided in gsea_pathway_analysis.py,
including jitter behavior and normalization by aligned pathways (len(x)).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `term_col` | `str` | Pathway term column name. | `'Term'` |
| `nes_col` | `str` | Normalized enrichment score column name. | `'NES'` |
| `q_col` | `str` | FDR q-value column name. | `'FDR q-val'` |
| `seed` | `int` | Seed passed to `np.random.seed` for jitter reproducibility. | `42` |
| `q_thr` | `float` | Q-value threshold for the significance zones. | `0.05` |
| `w` | `float` | Weight for non-significant concordance. | `0.5` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `q_thr` is not in (0, 1]. |
| `ValueError` | If `w` is negative. |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 55–367
align_rank_scores(rnk_ori, rnk_syn)
staticmethod
¶
Inner join on pathway names.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rnk_ori` | `DataFrame` | Original rank table. | required |
| `rnk_syn` | `DataFrame` | Synthetic rank table. | required |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | Aligned rank table. |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 136–154
gsea_rank_score(df)
¶
Compute pathway rank score = sign(NES) * -log10(FDR-q) with tiny jitter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | GSEA results table. | required |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | Ranked table indexed by `term_col` with columns `rank_score` and `qval`. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If required columns are missing. |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 103–134
pathway_concordance_score(df_rank)
¶
Compute PCS for aligned rank scores.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df_rank` | `DataFrame` | Aligned rank table with `rank_ori`, `rank_syn`, `q_ori`. | required |

Returns:

| Type | Description |
|---|---|
| `Tuple[float, int, int, float]` | `(PCS, N_sign, N_non_sign, M)`. |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 156–186
plot_gsea_datasets(ori_data, dataset_dict, figsize=(9, 10))
¶
Plot multiple synthetic datasets against one original GSEA file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ori_data` | `str | PathLike | DataFrame` | Original GSEA data or path. | required |
| `dataset_dict` | `Mapping[str, str | PathLike | DataFrame]` | Tool name -> synthetic data or path. | required |
| `figsize` | `tuple[float, float]` | Figure size. | `(9, 10)` |

Returns:

| Type | Description |
|---|---|
| `tuple[matplotlib.figure.Figure, dict[str, float]]` | `(fig, pcs_dict)`. |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 317–367
plot_single_gsea_panel(ax, x, y, result, tool_name, show_xlabel=False)
¶
Plot one PCS scatter panel (same geometry/logic as original script).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ax` | `Axes` | Matplotlib axis. | required |
| `x` | `ndarray` | Original rank scores. | required |
| `y` | `ndarray` | Synthetic rank scores. | required |
| `result` | `PCSResult` | PCS result container. | required |
| `tool_name` | `str` | Tool name for the title. | required |
| `show_xlabel` | `bool` | Whether to show the x-axis label/ticks. | `False` |

Returns:

| Type | Description |
|---|---|
| `None` | None |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 236–315
process_single_gsea_result(gsea_ori, gsea_syn)
¶
Process original and synthetic GSEA results (paths or DataFrames).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `gsea_ori` | `str | PathLike | DataFrame` | Original GSEA table. | required |
| `gsea_syn` | `str | PathLike | DataFrame` | Synthetic GSEA table. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[np.ndarray, np.ndarray, PCSResult]` | `(x, y, result)`. |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 188–234
PCSResult
dataclass
¶
Container for PCS computation outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pcs` | `float` | Pathway Concordance Score. | required |
| `n_sign` | `int` | Number of concordant significant pathways (zones 3 + 4). | required |
| `n_non_sign` | `int` | Number of concordant non-significant pathways (zones 1 + 2). | required |
| `m` | `float` | Normalization constant M used in the PCS formula. | required |
| `n_zone1` | `int` | Count in the non-significant lower-left zone. | required |
| `n_zone2` | `int` | Count in the non-significant upper-right zone. | required |
| `n_zone3` | `int` | Count in the significant lower-left zone. | required |
| `n_zone4` | `int` | Count in the significant upper-right zone. | required |
| `aligned_size` | `int` | Number of pathways after alignment (inner join). | required |

Source code in `src/synomicsbench/metrics/narrow_utility/GSEA.py`, lines 28–52
cell_deconvolution
¶
Aitchison distance for compositional cell-type deconvolution data.
Provides functions to compute the Aitchison distance between immune cell composition profiles estimated by CIBERSORTx (or similar tools) on original and synthetic datasets.
The Aitchison distance is computed in Centered Log-Ratio (CLR) space after multiplicative replacement of zeros and geometric mean centering.
aitchison_distance(df_orig, df_syn, cell_types)
¶
Compute the Aitchison distance between two cell-type composition datasets.
The distance is defined as the Euclidean distance between the CLR-transformed compositional centers of the original and synthetic datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df_orig` | `DataFrame` | Original CIBERSORTx results (rows = samples, columns include the cell-type columns). | required |
| `df_syn` | `DataFrame` | Synthetic CIBERSORTx results with the same cell-type columns. | required |
| `cell_types` | `List[str]` | List of column names for the cell types to compare. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Aitchison distance (non-negative float). |

Source code in `src/synomicsbench/metrics/narrow_utility/cell_deconvolution.py`, lines 36–71
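The documented pipeline (multiplicative zero replacement, CLR transform, Euclidean distance between compositional centers) can be sketched as follows; the zero-replacement constant and the exact centering order are assumptions, not the library's implementation:

```python
import numpy as np

def clr(v, eps=1e-6):
    """Centered log-ratio of one composition (zeros replaced by eps)."""
    v = np.where(v <= 0, eps, np.asarray(v, dtype=float))
    v = v / v.sum()                    # re-close to a composition
    g = np.exp(np.mean(np.log(v)))     # geometric mean of the parts
    return np.log(v / g)

def aitchison_distance_sketch(X_orig, X_syn):
    """Euclidean distance between CLR-transformed compositional centers."""
    def center(X):
        # compositional center = per-part geometric mean, renormalized
        g = np.exp(np.log(np.asarray(X, dtype=float)).mean(axis=0))
        return g / g.sum()
    return float(np.linalg.norm(clr(center(X_orig)) - clr(center(X_syn))))

X = np.array([[0.2, 0.3, 0.5], [0.25, 0.25, 0.5]])
d_same = aitchison_distance_sketch(X, X)  # identical data -> distance 0
score = np.exp(-d_same)                   # aitchison_score: exp(-d) in (0, 1]
```
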
aitchison_score(distance)
¶
Convert Aitchison distance to a similarity score via exp(-d).
Higher values indicate better agreement. A distance of 0 yields a score of 1.0.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `distance` | `float` | Non-negative Aitchison distance. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Score in (0, 1]. |

Source code in `src/synomicsbench/metrics/narrow_utility/cell_deconvolution.py`, lines 74–86
geometric_center(X)
¶
Compute the compositional center (geometric mean) for each component.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | 2-D array of shape `(n_samples, n_parts)`. | required |

Returns:

| Type | Description |
|---|---|
| `ndarray` | 1-D array of length `n_parts` representing the geometric mean per part. |

Source code in `src/synomicsbench/metrics/narrow_utility/cell_deconvolution.py`, lines 20–29
survival_analysis
¶
SurvivalEvaluator
¶
Perform survival analysis and visualization across multiple datasets comparing two phenotype groups.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets_dict` | `Dict[str, DataFrame]` | Mapping from dataset names to DataFrames. | required |
| `phenotype` | `Dict[str, List[Any]]` | `{column_name: [value_A, value_B]}`, specifying the phenotype column and comparison values. | required |
| `time_target` | `str` | Name of the survival time column. | `'OS'` |
| `event_target` | `str` | Name of the event indicator column. | `'OS_CNSR'` |
| `dataset_order` | `Optional[List[str]]` | Custom plotting order for datasets. If None, use input order. | `None` |
| `dataset_colors` | `Optional[Dict[str, str]]` | Colors for dataset annotation strips. | `None` |
| `group_colors` | `Optional[Dict[str, str]]` | Colors for phenotype groups (A, B). | `None` |
| `font_scale` | `float` | Global scaling for plot text sizes. | `1.0` |

Returns:

| Type | Description |
|---|---|
| | None |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `datasets_dict` is empty or `phenotype` is not a `{col: [A, B]}` dict. |

Source code in `src/synomicsbench/metrics/narrow_utility/survival_analysis.py`, lines 31–418
__init__(datasets_dict, phenotype, time_target='OS', event_target='OS_CNSR', dataset_order=None, dataset_colors=None, group_colors=None, font_scale=1.0, is_pdf=False, original_name='Origin')
¶
Initialize the SurvivalEvaluator for grid-based survival comparison.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets_dict` | `Dict[str, DataFrame]` | Input datasets. | required |
| `phenotype` | `Dict[str, List[Any]]` | `{col: [value_A, value_B]}`. | required |
| `time_target` | `str` | Survival duration column. | `'OS'` |
| `event_target` | `str` | Event indicator column. | `'OS_CNSR'` |
| `dataset_order` | `List[str]` | Custom dataset plotting order. | `None` |
| `dataset_colors` | `Dict[str, str]` | Strip colors per dataset. | `None` |
| `group_colors` | `Dict[str, str]` | Colors for phenotype groups. | `None` |
| `font_scale` | `float` | Plot font scaling. | `1.0` |
| `is_pdf` | `bool` | If True, save figures as PDF; otherwise PNG. | `False` |
| `original_name` | `str` | Name of the reference dataset (default: "Origin"). | `'Origin'` |

Returns:

| Type | Description |
|---|---|
| `None` | None |

Raises:

| Type | Description |
|---|---|
| `ValueError` | On empty datasets or an invalid phenotype specification. |

Source code in `src/synomicsbench/metrics/narrow_utility/survival_analysis.py`, lines 53–106
compute_cindex_scores(original_name=None)
¶
Compute C-index similarity scores between the reference and synthetic datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `original_name` | `Optional[str]` | Reference dataset for score calculation. If None, uses `self.original_name`. | `None` |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | DataFrame with the additional column `'C-index_score'`. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If required columns or the reference row are missing. |
| `RuntimeError` | If `compute_survival_metrics()` was not called first. |

Source code in `src/synomicsbench/metrics/narrow_utility/survival_analysis.py`, lines 202–247
compute_survival_metrics()
¶
Compute log-rank test p-values and C-index for each dataset grid panel.
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | DataFrame with columns `['Dataset', 'pvalue', 'C-index', 'n_A', 'n_B']`. |

Source code in `src/synomicsbench/metrics/narrow_utility/survival_analysis.py`, lines 128–200
plot_grid(figsize=(18, 6), show_censors=True, ci_show=False, title_prefix='Survival', dataset_strip_height=0.02, dataset_strip_y=1.01, save_dir=None)
¶
Plot a manuscript-style grid of Kaplan–Meier survival curves for all datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `figsize` | `Tuple[float, float]` | Figure dimensions (W, H). | `(18, 6)` |
| `show_censors` | `bool` | Whether to display censor marks. | `True` |
| `ci_show` | `bool` | Whether to render confidence intervals for the KM curves. | `False` |
| `title_prefix` | `str` | Prefix for subplot titles. | `'Survival'` |
| `dataset_strip_height` | `float` | Height of the colored dataset strip. | `0.02` |
| `dataset_strip_y` | `float` | Y-position of the top dataset strip. | `1.01` |
| `save_dir` | `str` | If set, save individual dataset KM curves to this folder. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tuple[plt.Figure, pd.DataFrame]` | The matplotlib Figure and the summary dataframe. |

Source code in `src/synomicsbench/metrics/narrow_utility/survival_analysis.py`, lines 249–418
predictive_model_comp
¶
BayesianComparison
¶
BaycompStyle
dataclass
¶
Store manuscript-style visualization settings for baycomp heatmaps.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fontsize` | `int` | Base font size used in matplotlib rcParams. | `11` |
| `nature_font` | `Dict[str, Sequence[str]]` | Font family configuration. | `None` |
| `plot_colors` | `Dict[str, str]` | Common background/grid colors. | `None` |
| `pbetter_cmap` | `LinearSegmentedColormap` | Colormap for P(Better) heatmaps. | `(lambda: PBETTER_FOCUS_CMAP)()` |
| `cancer_colors` | `Dict[str, str]` | Color strip mapping for each cancer panel. | `None` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `fontsize` is not positive. |

Source code in `src/synomicsbench/metrics/narrow_utility/BayesianComparison.py`, lines 205–234
BayesianComparison
dataclass
¶
Compute Bayesian pairwise comparisons using the external baycomp package and plot heatmaps.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rope` | `float` | ROPE threshold for practical equivalence. | `0.01` |
| `seed` | `int` | Random seed for reproducibility in baycomp. | `0` |
| `style` | `BaycompStyle` | Plot styling settings for consistent manuscript figures. | `BaycompStyle()` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `rope` is not positive. |

Source code in `src/synomicsbench/metrics/narrow_utility/BayesianComparison.py`, lines 237–547
compare_methods(method_to_scores, methods_order=None)
¶
Compute ordered-pair Bayesian comparison probabilities for a set of methods.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method_to_scores` | `Mapping[str, Sequence[float]]` | Mapping method -> scores across seeds/folds. | required |
| `methods_order` | `Optional[Sequence[str]]` | Optional ordering for methods. | `None` |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | Table with columns `["Method 1", "Method 2", "Better Prob", "Worse Prob", "Equivalent Prob"]`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If fewer than 2 methods are provided. |
| `ValueError` | If any method has fewer than 2 finite scores. |

Source code in `src/synomicsbench/metrics/narrow_utility/BayesianComparison.py`, lines 310–356
comparison_to_matrix(comparison_df, methods_order, value_col='Better Prob', nan_diagonal=True)
staticmethod
¶
Convert a comparison table into a square matrix suitable for heatmap plotting.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `comparison_df` | `DataFrame` | Output of `compare_methods`. | required |
| `methods_order` | `Sequence[str]` | Method ordering for rows and columns. | required |
| `value_col` | `str` | Which column to visualize. | `'Better Prob'` |
| `nan_diagonal` | `bool` | If True, set diagonal values to NaN. | `True` |

Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | Square matrix with index = Method 1 and columns = Method 2. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `value_col` is not present in `comparison_df`. |

Source code in `src/synomicsbench/metrics/narrow_utility/BayesianComparison.py`, lines 358–391
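The pivot behind `comparison_to_matrix` can be sketched with plain pandas; the methods and probabilities below are invented for illustration, and the library may handle missing pairs differently:

```python
import numpy as np
import pandas as pd

# A comparison table in the documented long format: one row per ordered pair.
comparison_df = pd.DataFrame({
    "Method 1": ["A", "A", "B", "B"],
    "Method 2": ["B", "C", "A", "C"],
    "Better Prob": [0.9, 0.7, 0.1, 0.6],
})
methods = ["A", "B", "C"]

# Pivot to a square matrix (rows = Method 1, columns = Method 2); pairs not
# present in the table become NaN via reindex.
mat = (comparison_df
       .pivot(index="Method 1", columns="Method 2", values="Better Prob")
       .reindex(index=methods, columns=methods))
np.fill_diagonal(mat.values, np.nan)   # nan_diagonal=True: blank the diagonal
```

A NaN cell then renders as a blank (or hatched) cell in the downstream heatmap.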
plot_pbetter_heatmap_grid(cancer_to_method_scores, cancers_order=('ccRCC', 'Melanoma', 'NSCLC'), methods_order=None, value_col='Better Prob', figsize=(18, 5), annot=True, fmt='.2f', missing_hatch='///', show=True)
¶
Plot a 1xN grid of baycomp probability heatmaps (one per cancer).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cancer_to_method_scores` | `Mapping[str, Mapping[str, Sequence[float]]]` | Cancer -> method/tool -> score list. | *required* |
| `cancers_order` | `Sequence[str]` | Order of cancers in the grid. | `('ccRCC', 'Melanoma', 'NSCLC')` |
| `methods_order` | `Optional[Sequence[str]]` | Global method ordering. If None, uses the union across cancers. | `None` |
| `value_col` | `str` | Which probability to visualize: "Better Prob", "Worse Prob", or "Equivalent Prob". | `'Better Prob'` |
| `figsize` | `Tuple[float, float]` | Figure size. | `(18, 5)` |
| `annot` | `bool` | If True, annotate cells (NaN cells are blank). | `True` |
| `fmt` | `str` | Annotation format. | `'.2f'` |
| `missing_hatch` | `str` | Hatch pattern for missing (NaN) cells. | `'///'` |
| `show` | `bool` | If True, calls plt.show(). | `True` |

Returns:

| Type | Description |
|---|---|
| `Tuple[Figure, ndarray, Dict[str, DataFrame], Dict[str, DataFrame]]` | Tuple containing: `fig`, the Matplotlib Figure; `axes`, the array of Axes; `matrices_by_cancer`, the probability matrices per cancer; and `comparison_dfs_by_cancer`, the comparison DataFrames per cancer. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `value_col` is invalid. |
| `ValueError` | If any requested cancer is missing from `cancer_to_method_scores`. |
| `ValueError` | If the resolved `methods_order` is empty. |
Source code in src/synomicsbench/metrics/narrow_utility/BayesianComparison.py
build_baycomp_comparison_df(method_to_scores, methods_order=None, rope=0.01)
¶
Compute pairwise Bayesian comparison probabilities (better/worse/equivalent) using baycomp.two_on_single for all ordered pairs of methods.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method_to_scores` | `Mapping[str, Sequence[float]]` | Mapping tool/method -> list/array of scores. | *required* |
| `methods_order` | `Optional[Sequence[str]]` | Optional method ordering. If None, uses dict keys. | `None` |
| `rope` | `float` | ROPE threshold for practical equivalence. | `0.01` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | `pd.DataFrame`: Comparison table with columns `["Method 1", "Method 2", "Better Prob", "Worse Prob", "Equivalent Prob"]`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `rope` is not positive. |
| `ValueError` | If fewer than 2 methods are provided. |
| `ValueError` | If any method has fewer than 2 finite scores. |
Source code in src/synomicsbench/metrics/narrow_utility/BayesianComparison.py
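A sketch of the pairwise comparison loop described above. To stay self-contained it substitutes a crude, clearly labeled stand-in for `baycomp.two_on_single` (which returns the left/rope/right probabilities); only the ordered-pair bookkeeping mirrors the documented behavior:

```python
from itertools import permutations

import pandas as pd

def fake_two_on_single(scores_a, scores_b, rope=0.01):
    """Stand-in for baycomp.two_on_single, for illustration only.
    Returns placeholder (p_better, p_equivalent, p_worse) based on mean diff."""
    diff = sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b)
    if abs(diff) <= rope:
        return 0.2, 0.6, 0.2
    return (0.7, 0.2, 0.1) if diff > 0 else (0.1, 0.2, 0.7)

method_to_scores = {"A": [0.80, 0.82, 0.81], "B": [0.70, 0.72, 0.71]}

rows = []
for m1, m2 in permutations(method_to_scores, 2):  # all ordered pairs
    p_better, p_equiv, p_worse = fake_two_on_single(
        method_to_scores[m1], method_to_scores[m2]
    )
    rows.append({"Method 1": m1, "Method 2": m2,
                 "Better Prob": p_better, "Worse Prob": p_worse,
                 "Equivalent Prob": p_equiv})

comparison = pd.DataFrame(rows)
```

The real function runs the Bayesian correlated t-test per pair; the resulting frame feeds directly into `comparison_to_matrix` above.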
Metrics: Privacy¶
The metrics.privacy module provides tools to assess the likelihood of privacy attacks using synthetic data.
singling_out
¶
Singling-out risk evaluation for synthetic data.
Provides helper functions that wrap the Anonymeter SinglingOutEvaluator to run univariate and multivariate singling-out attacks across multiple feature-sampling proportions.
eval_singling_out_multivariate(ori, syns, n_cols_list=(2, 3, 5, 7, 10, 20, 50), n_attacks=10000, max_attempts=1000000, seed=42)
¶
Run multivariate singling-out attacks at varying attribute-combination sizes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ori` | `DataFrame` | Original dataset. | *required* |
| `syns` | `Dict[str, DataFrame]` | Mapping from synthetic dataset name to its DataFrame. | *required* |
| `n_cols_list` | `Sequence[int]` | Number of columns the attacker uses per predicate. | `(2, 3, 5, 7, 10, 20, 50)` |
| `n_attacks` | `int` | Number of attack predicates per evaluation. | `10000` |
| `max_attempts` | `int` | Maximum predicate generation attempts. | `1000000` |
| `seed` | `int` | Random seed. | `42` |

Returns:

| Type | Description |
|---|---|
| `Dict[str, List[SinglingOutEvaluator]]` | Mapping from synthetic dataset name to a list of evaluated `SinglingOutEvaluator` objects. |
Source code in src/synomicsbench/metrics/privacy/singling_out.py
eval_singling_out_univariate(ori, syns, n_attacks=10000, max_attempts=1000000, proportions=(0.25, 0.5, 0.75, 1.0), seed=42)
¶
Run univariate singling-out attacks at varying feature proportions.
For each synthetic dataset and each proportion `p`, a random subset of `p × n_features` columns is selected and the `SinglingOutEvaluator` is run in univariate mode.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ori` | `DataFrame` | Original (real) dataset after post-processing. | *required* |
| `syns` | `Dict[str, DataFrame]` | Mapping from synthetic dataset name to its DataFrame. | *required* |
| `n_attacks` | `int` | Number of attack predicates to generate per evaluation. | `10000` |
| `max_attempts` | `int` | Maximum predicate generation attempts. | `1000000` |
| `proportions` | `Sequence[float]` | Fractions of columns to sample (e.g. 25 %, 50 %, …). | `(0.25, 0.5, 0.75, 1.0)` |
| `seed` | `int` | Random seed for column sampling and the evaluator. | `42` |

Returns:

| Type | Description |
|---|---|
| `Dict[str, List[SinglingOutEvaluator]]` | Mapping from synthetic dataset name to a list of evaluated `SinglingOutEvaluator` objects, one per proportion. |
Source code in src/synomicsbench/metrics/privacy/singling_out.py
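The column-subsampling step can be sketched as follows. This is an illustration of the documented `p × n_features` selection (whether the library rounds, floors, or ceils the count is an assumption here), with each sampled subset standing in for what would be handed to a `SinglingOutEvaluator`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # mirrors the seed parameter

# Toy stand-in for the original dataset: 20 rows, 8 feature columns.
ori = pd.DataFrame(rng.normal(size=(20, 8)),
                   columns=[f"gene_{i}" for i in range(8)])

# For each proportion p, sample round(p * n_features) distinct columns;
# the real function would run the evaluator on each subset.
subsets = {}
for p in (0.25, 0.5, 0.75, 1.0):
    k = max(1, round(p * ori.shape[1]))
    subsets[p] = list(rng.choice(ori.columns, size=k, replace=False))
```

Seeding the generator makes the sampled subsets, and hence the attack evaluations, reproducible across runs.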
linkability
¶
Linkability risk evaluation for synthetic data.
Evaluates whether an attacker can use synthetic data to link molecular (gene expression) profiles to clinical attributes of the same individual.
eval_linkability_genes_clinical(ori, syns, clinical_cols, transcriptomic_cols=None, n_neighbors=1, proportions=(0.25, 0.5, 0.75, 1.0), seed=42)
¶
Evaluate linkability risk between gene expression and clinical features.
The attack attempts to link two disjoint attribute sets of the same patient — clinical variables and a randomly sampled subset of gene expression features — using the synthetic dataset as a bridge.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ori` | `DataFrame` | Original dataset. | *required* |
| `syns` | `Dict[str, DataFrame]` | Mapping from synthetic dataset name to its DataFrame. | *required* |
| `clinical_cols` | `Union[List[int], List[str]]` | List of column names or integer indices for clinical features. If indices are provided, they refer to positions in `ori.columns`. | *required* |
| `transcriptomic_cols` | `Union[List[int], List[str]]` | List of column names or integer indices for transcriptomic (gene) features. If None, the remaining (non-clinical) columns are used. | `None` |
| `n_neighbors` | `int` | Number of nearest neighbors for the linkability attack. | `1` |
| `proportions` | `Sequence[float]` | Fractions of gene columns to sample for the attack. | `(0.25, 0.5, 0.75, 1.0)` |
| `seed` | `int` | Random seed for gene column sampling. | `42` |

Returns:

| Type | Description |
|---|---|
| `Dict[str, List[LinkabilityEvaluator]]` | Mapping from dataset name to a list of evaluated `LinkabilityEvaluator` objects, one per proportion. |
Source code in src/synomicsbench/metrics/privacy/linkability.py
inference
¶
Attribute inference risk evaluation for synthetic data. Evaluates whether an adversary can infer sensitive clinical attributes (secrets) from auxiliary gene expression features using the synthetic dataset.
eval_inference_genes_clinical(ori, syns, clinical_cols, transcriptomic_cols=None, save_path=None)
¶
Evaluate attribute inference risk for each clinical variable.
For every synthetic dataset and each clinical feature (treated as a
secret), the InferenceEvaluator from Anonymeter uses all gene
expression columns as auxiliary information to predict the secret
attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ori` | `DataFrame` | Original dataset. | *required* |
| `syns` | `Dict[str, DataFrame]` | Mapping from synthetic dataset name to its DataFrame. | *required* |
| `clinical_cols` | `Union[List[int], List[str]]` | List of column names or integer indices for clinical features (secrets). If indices are provided, they refer to positions in `ori.columns`. | *required* |
| `transcriptomic_cols` | `Union[List[int], List[str]]` | List of column names or integer indices for transcriptomic (gene) features (auxiliary). If None, the remaining (non-clinical) columns are used. | `None` |
| `save_path` | `Optional[str]` | Optional path to incrementally save results as pickle. | `None` |

Returns:

| Type | Description |
|---|---|
| `Dict[str, List[Tuple[str, object]]]` | Mapping from dataset name to a list of `(secret, result)` tuples, where each result is the evaluated object or an error dict if the evaluation failed. |
Source code in src/synomicsbench/metrics/privacy/inference.py
Utilities¶
Utility modules provide monitoring capabilities, evaluation utilities, and correlation analysis tools used throughout the framework.
monitoring
¶
monitor_resources(func)
¶
Decorator to monitor CPU, RAM, and GPU (NVIDIA via nvidia-ml-py) resources during function execution.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `callable` | The function to be monitored. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `callable` | | The wrapped function with resource monitoring. |

Raises:

| Type | Description |
|---|---|
| `Exception` | Re-raises any exception from the wrapped function after reporting. |
Source code in src/synomicsbench/utils/monitoring.py
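A minimal, stdlib-only sketch of the decorator pattern this utility follows: wrap the function, measure around the call, and re-raise on failure. The real decorator additionally samples CPU/RAM/GPU via nvidia-ml-py; here only wall time and peak Python allocations (via `tracemalloc`) stand in for that:

```python
import functools
import time
import tracemalloc

def monitor_resources_sketch(func):
    """Illustrative stand-in: report wall time and peak traced memory."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            # Exceptions propagate after the finally-block reports.
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{func.__name__}: {elapsed:.3f}s, peak {peak / 1e6:.2f} MB")
    return wrapper

@monitor_resources_sketch
def work(n):
    return sum(i * i for i in range(n))

result = work(100_000)
```

Using `functools.wraps` preserves the wrapped function's name and docstring, which keeps logging output readable.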
set_logger(logger_name, output_path, log_file_name='activity.log')
¶
Configures a logger with both file and console handlers and returns it.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `logger_name` | `str` | The name for the logger (e.g., `__name__` or 'myscript'). | *required* |
| `output_path` | `str` | The directory where the log file should be created. | *required* |
| `log_file_name` | `str` | The name of the log file. | `'activity.log'` |

Returns:

| Type | Description |
|---|---|
| `Logger` | `logging.Logger`: The configured logger instance. |
Source code in src/synomicsbench/utils/monitoring.py
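The documented behavior (one file handler plus one console handler on a named logger) can be sketched with the stdlib alone. Formatter string and log level are assumptions, not the library's exact choices:

```python
import logging
import os
import tempfile

def set_logger_sketch(logger_name, output_path, log_file_name="activity.log"):
    """Illustrative configuration: file handler + console handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    file_handler = logging.FileHandler(os.path.join(output_path, log_file_name))
    console_handler = logging.StreamHandler()
    for handler in (file_handler, console_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger

out_dir = tempfile.mkdtemp()
log = set_logger_sketch("myscript", out_dir)
log.info("pipeline started")
```

Clearing existing handlers before attaching new ones is a common guard against duplicated log lines when the setup function runs more than once in the same process.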
correlations
¶
MixedCorrelation
¶
Compute mixed-type correlation matrices combining BLAS/vectorized numeric-numeric and Numba-accelerated categorical interactions.
Missing handling
- Numerical: pairwise deletion (drop rows with NaN in either column) for Pearson/Spearman.
- Categorical (including discretized numeric for Cramér's V): missing is its own category.
Parameters¶

categorical_indices : array-like of int
    Column indices treated as categorical.
numerical_indices : array-like of int
    Column indices treated as numerical.
method : {'pearson', 'spearman'}, default 'pearson'
    Correlation for numeric-numeric pairs.
n_bins : int, default 10
    Number of bins for discretizing numerical columns when paired with categorical.
engine : {'auto', 'numba', 'blas'}, default 'auto'
    - 'blas': use block GEMM path for numeric-numeric (fast for many features).
    - 'numba': compute everything in Numba loops.
    - 'auto': choose 'blas' when many numeric features, else 'numba'.
block_cols : int, default 512
    Block width for BLAS engine.
Source code in src/synomicsbench/utils/correlations.py
compute(data)
¶
Compute the mixed-type correlation matrix on the provided data.
Returns¶
corr : np.ndarray of shape (n_features, n_features)
Source code in src/synomicsbench/utils/correlations.py
cramers_v_bincount_numba(x, y, nx, ny)
¶
Fast Cramér's V using bincount on combined indices. x in [0..nx-1], y in [0..ny-1].
Source code in src/synomicsbench/utils/correlations.py
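The bincount trick can be sketched in plain NumPy: encode each `(x, y)` pair as a single index `x * ny + y`, bincount it into a contingency table, then compute Cramér's V from the chi-squared statistic. This mirrors the idea (the library version is Numba-compiled and its exact formula, e.g. bias correction, is not shown here):

```python
import numpy as np

def cramers_v_bincount(x, y, nx, ny):
    """NumPy sketch: contingency table via bincount on combined indices,
    then uncorrected Cramér's V. x in [0..nx-1], y in [0..ny-1]."""
    table = (np.bincount(x * ny + y, minlength=nx * ny)
               .reshape(nx, ny)
               .astype(float))
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(nx, ny) - 1)))

# Perfectly associated categories should give V == 1.
x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 2, 2])
v = cramers_v_bincount(x, y, 3, 3)
```

The single `bincount` over combined indices avoids a Python-level double loop over categories, which is what makes this formulation fast enough to vectorize or JIT-compile.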
pearson_correlation_matrix_numba(X)
¶
Pearson correlation matrix (dropping NaN pairs), Numba-parallel.
Source code in src/synomicsbench/utils/correlations.py
spearman_correlation_matrix_numba(X)
¶
Spearman correlation matrix (dropping NaN pairs), Numba-parallel.
Source code in src/synomicsbench/utils/correlations.py
utils
¶
DataProcessValidation
¶
Bases: DataProcessor
Preprocessing pipeline for preparing tabular data for validation, including encoding, scaling, and imputation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input data to preprocess. | *required* |
| `target_col` | `str` | Name of the target column. | *required* |
| `output_dir` | `str` | Directory for outputs. Defaults to ".". | *required* |
| `ordinal_cat_columns` | `Optional[List[str]]` | List of ordinal categorical column names. | *required* |
| `dummy_cat_columns` | `Optional[List[str]]` | List of dummy categorical column names. | *required* |
| `numerical_columns` | `Optional[List[str]]` | List of numerical column names. | *required* |
| `scaler` | `str` | Which scaler to use for numerical columns. Defaults to "minmax". | `'minmax'` |
| `n_neighbors` | `int` | Number of neighbors for KNN imputation. Defaults to 5. | `5` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the target column is not found in the input data. |
Source code in src/synomicsbench/metrics/fidelity/utils.py
__init__(data, metadata=None, scaler='minmax', n_neighbors=5, **kwargs)
¶
Initialize the DataProcessValidation class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | Input data to preprocess. | *required* |
| `metadata` | `dict` | Metadata. | `None` |
| `scaler` | `str` | Which scaler to use for numerical columns. Defaults to "minmax". | `'minmax'` |
| `n_neighbors` | `int` | Number of neighbors for KNN imputation. Defaults to 5. | `5` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the target column is not found in the input data. |
Source code in src/synomicsbench/metrics/fidelity/utils.py
fit()
¶
Preprocess the data by encoding categorical features, normalizing numerical features, imputing missing values, and encoding the target variable.
Returns:

| Type | Description |
|---|---|
| `DataFrame` | `pd.DataFrame`: The preprocessed data with transformed features and target column. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If required columns are missing or preprocessing fails. |
| `RuntimeError` | If an unexpected error occurs during preprocessing. |
Source code in src/synomicsbench/metrics/fidelity/utils.py
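The three documented steps (encode categoricals, normalize numericals, impute missing values, encode the target) can be sketched in plain pandas on a toy frame. This mirrors the pipeline's shape only; the class itself uses KNN imputation and a configurable scaler, while the sketch uses median fill and min-max scaling:

```python
import pandas as pd

# Toy input: a numerical column with a gap, a categorical column, a target.
df = pd.DataFrame({
    "age": [20.0, 40.0, None, 60.0],
    "stage": ["I", "II", "II", "III"],
    "target": ["A", "B", "A", "B"],
})

# 1. Impute missing numericals (median here; the class uses KNN imputation).
df["age"] = df["age"].fillna(df["age"].median())
# 2. Min-max scale numericals to [0, 1].
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
# 3. Dummy-encode categoricals and label-encode the target.
df = pd.get_dummies(df, columns=["stage"])
df["target"] = df["target"].astype("category").cat.codes
```

After these steps every column is numeric, which is what downstream validation classifiers require.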
check_column_consistency(origin_data, synthetic_data)
¶
Check if columns and their data types match between the original and synthetic DataFrames.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `origin_data` | `DataFrame` | The original DataFrame. | *required* |
| `synthetic_data` | `DataFrame` | The synthetic DataFrame to compare. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if both column names and data types match, False otherwise. |
Source code in src/synomicsbench/metrics/fidelity/utils.py
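The documented check can be sketched in a few lines of pandas. Whether the library treats column *order* as significant is an assumption here; the sketch requires matching order as well as matching dtypes:

```python
import pandas as pd

def columns_consistent(origin_data, synthetic_data):
    """Sketch of the check: same column names (and order) and same dtypes."""
    if list(origin_data.columns) != list(synthetic_data.columns):
        return False
    return bool((origin_data.dtypes == synthetic_data.dtypes).all())

ori = pd.DataFrame({"a": [1, 2], "b": [0.1, 0.2]})
syn_ok = pd.DataFrame({"a": [3, 4], "b": [0.3, 0.4]})
syn_bad = pd.DataFrame({"a": [0.5, 0.6], "b": [0.3, 0.4]})  # 'a' became float
```

A dtype mismatch (e.g. an integer column synthesized as floats) is a common silent failure mode for tabular synthesizers, which is why the check compares types and not just names.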
np_encoder(obj)
¶
Convert NumPy data types to native Python types for JSON serialization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `obj` | | The object to be encoded, potentially a NumPy scalar or array. | *required* |

Returns:

| Type | Description |
|---|---|
| `int`, `float`, `bool`, `list` | The object converted to a native Python type suitable for JSON serialization. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If the object type is not supported for conversion. |
Source code in src/synomicsbench/metrics/fidelity/utils.py
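An encoder of this kind is typically passed as the `default=` hook of `json.dumps`. The sketch below illustrates the pattern with the conversions the docstring names (int, float, bool, list); it is not the library's exact implementation:

```python
import json

import numpy as np

def np_encoder_sketch(obj):
    """Illustrative encoder: map NumPy types to native Python, else raise."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.bool_):
        return bool(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj)} is not JSON serializable")

payload = {"score": np.float64(0.95), "counts": np.array([1, 2, 3])}
encoded = json.dumps(payload, default=np_encoder_sketch)
```

`json.dumps` only calls the `default` hook for objects it cannot serialize itself, so native-compatible values pass through untouched while NumPy arrays and scalars are converted on demand.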