Here are some example projects that may be offered in summer 2025.
Actual projects will be chosen based on student and faculty mentor interest.
Biological Network Inference for Discovering Disease-Associated Genes (Cowen, Computer Science): Experimentally determined protein-protein interaction data, represented as a network (where genes or proteins are the nodes, and an edge connects two genes if they are experimentally determined to interact in the cell), have proved to be a rich high-throughput resource for discovering genes and genetic modules that drive human disease. The disease-gene prioritization problem takes as input a set of genes known to be associated with a particular disease and outputs a ranked list of additional genes whose proximity to the disease gene set makes them most likely to also be involved with the disease in question.
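To make the problem setup concrete, the following is a minimal sketch (in Python, using the networkx library) of the prioritization interface, with a naive average shortest-path proximity score standing in as an illustrative baseline; the gene names and the scoring rule are placeholders, not the project's methods.

```python
# Minimal sketch of the disease-gene prioritization interface, assuming an
# undirected networkx graph whose nodes are gene identifiers. The shortest-path
# proximity score here is only an illustrative baseline.
import networkx as nx

def rank_candidate_genes(ppi_network, disease_genes):
    """Rank genes outside the seed set by average shortest-path distance
    to the known disease genes (smaller distance = higher rank)."""
    scores = {}
    for seed in disease_genes:
        # distances from this seed gene to every reachable gene
        for gene, dist in nx.shortest_path_length(ppi_network, source=seed).items():
            if gene in disease_genes:
                continue
            scores.setdefault(gene, []).append(dist)
    # average over the seeds that reach each candidate gene
    avg = {g: sum(d) / len(d) for g, d in scores.items()}
    return sorted(avg, key=avg.get)

# Example usage with a toy network (gene names are arbitrary placeholders)
G = nx.Graph([("BRCA1", "TP53"), ("TP53", "MDM2"), ("MDM2", "EGFR")])
print(rank_candidate_genes(G, {"BRCA1", "TP53"}))
```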
Graph diffusion-based methods have proved to be a very successful way to discover new genes that might be involved in a disease of interest. The goal of this project is to devise ranking algorithms that are aware of the clustering structure of the known disease genes in the network. In particular, students will propose and test different ways to generalize gene-gene distances to gene-geneset distances in a fashion that better respects the cluster structure of the known disease genes within the network, and will compare their methods to existing methods in cross-validation experiments. All the necessary networks and disease gene data are freely available in public data repositories. Once the new methods demonstrate their strength in cross-validation, the students will apply them to predict new disease genes for both Parkinson’s disease and Crohn’s disease.
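As an illustration of the kind of diffusion-based ranking the project builds on, here is a minimal sketch of random walk with restart on a protein-protein interaction network; the restart probability, iteration count, and seed handling are assumptions chosen for illustration, and the project's cluster-aware gene-geneset distances would modify or replace this scoring.

```python
# A minimal sketch of graph diffusion via random walk with restart (RWR),
# one standard diffusion-based ranking approach. Parameters are illustrative.
import numpy as np
import networkx as nx

def rwr_ranking(ppi_network, disease_genes, restart_prob=0.5, n_iter=100):
    nodes = list(ppi_network.nodes())
    idx = {g: i for i, g in enumerate(nodes)}
    A = nx.to_numpy_array(ppi_network, nodelist=nodes)
    col_sums = A.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0                 # guard against isolated nodes
    W = A / col_sums                              # column-normalized transition matrix

    p0 = np.zeros(len(nodes))
    for g in disease_genes:                       # restart distribution on the seed genes
        p0[idx[g]] = 1.0 / len(disease_genes)

    p = p0.copy()
    for _ in range(n_iter):                       # power iteration toward the stationary vector
        p = (1 - restart_prob) * (W @ p) + restart_prob * p0

    # rank non-seed genes by their diffusion score, highest first
    return sorted((g for g in nodes if g not in disease_genes),
                  key=lambda g: -p[idx[g]])
```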
Knowledge-guided Machine-Learning Models for Predicting the Promiscuity of Enzymes on Substrates (Hassoun, Computer Science): One focus of Soha Hassoun’s research group is predicting the activities of enzymes on substrates for biological engineering applications. Although traditionally assumed to be specific (transforming a single substrate), many enzymes are promiscuous and act on substrates other than those they evolved to act on. Despite advances in characterizing enzymes through sequencing, annotation, and homology, the current state of enzyme function characterization limits our understanding of metabolism and our ability to transform microorganisms into efficient cellular factories. One approach is to develop novel deep-learning models that can maximally utilize the heterologous data unique to enzymatic interactions, available from databases such as BRENDA and KEGG.
Undergraduate students will develop and evaluate machine-learning models to characterize the promiscuity of enzymes on substrates. The students will tackle the problem of predicting the site of metabolism, the atom position within a molecule that is most likely to undergo a biochemical transformation due to a query enzyme. The students will explore graph neural networks to classify nodes within the molecular graph. Students will also tackle the problem of predicting the interaction between a molecule and an enzyme. The students will use data from the KEGG and BRENDA databases to train and evaluate the models. The students will explore different partitions of the datasets (random and realistic splits) as well as different ways of generating negative examples and of including inhibitors.
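As a concrete starting point, the sketch below frames site-of-metabolism prediction as per-atom node classification with a small graph neural network; PyTorch Geometric is one possible library choice (not specified in the project description), and the atom featurization, enzyme conditioning, and KEGG/BRENDA data loading are all assumed or omitted.

```python
# A minimal sketch of site-of-metabolism prediction as node classification on a
# molecular graph, using PyTorch Geometric as one possible GNN library.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SiteOfMetabolismGNN(torch.nn.Module):
    def __init__(self, num_atom_features, hidden_dim=64):
        super().__init__()
        self.conv1 = GCNConv(num_atom_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.out = torch.nn.Linear(hidden_dim, 1)   # per-atom "site of metabolism" logit

    def forward(self, x, edge_index):
        # x: [num_atoms, num_atom_features]; edge_index: [2, num_bond_endpoints]
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.out(h).squeeze(-1)              # one logit per atom (node)

# Training would minimize a per-atom binary cross-entropy loss, e.g.:
# loss = F.binary_cross_entropy_with_logits(model(x, edge_index), atom_labels)
```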
Assessing and Repairing Miscalibration in Cross-Institution Predictive Models for Clinical Time Series (Hughes, Computer Science): In Mike Hughes’ research group, undergraduate students will learn to develop modern machine learning methods that could be effectively deployed to improve patient outcomes in critical care hospital settings. Specifically, students will develop models that predict personalized risk scores for outcomes such as mortality or sepsis onset given a history of the patient’s vital signs and laboratory measurements over time. Several electronic health record (EHR) datasets have been fully de-identified and released to the public for research purposes, including MIMIC-III from a hospital in Boston and eICU from hospitals across the USA. Because these are open-access, de-identified datasets containing no identifiable health information, undergraduates can gain access easily. Students can further build upon MIMIC-Extract, a multi-institution open-source software collaboration that PI Hughes has helped lead, to easily train and evaluate models on the multivariate time-series prediction problems that arise in EHR settings. Student researchers can get access to the Tufts High Performance Computing cluster, where they can use 8 RTX6000 GPUs and 40 A100 GPUs to train deep neural network models efficiently on these large datasets.
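To illustrate the shape of these prediction problems, here is a minimal sketch of a recurrent risk-score model over multivariate clinical time series; the tensor layout assumes features have already been extracted (for example, via a MIMIC-Extract-style pipeline), and the architecture and feature count are placeholders rather than the project's models.

```python
# A minimal sketch of a risk-prediction baseline for multivariate clinical time
# series, assuming inputs of shape [batch, time_steps, num_features].
import torch

class RiskScoreGRU(torch.nn.Module):
    def __init__(self, num_features, hidden_dim=128):
        super().__init__()
        self.gru = torch.nn.GRU(num_features, hidden_dim, batch_first=True)
        self.head = torch.nn.Linear(hidden_dim, 1)

    def forward(self, vitals_and_labs):
        # vitals_and_labs: [batch, time_steps, num_features]
        _, last_hidden = self.gru(vitals_and_labs)
        return self.head(last_hidden[-1]).squeeze(-1)   # one risk logit per patient stay

model = RiskScoreGRU(num_features=104)                   # 104 is a placeholder feature count
logits = model(torch.randn(32, 24, 104))                 # e.g., 24 hourly measurements per stay
risk = torch.sigmoid(logits)                             # predicted probability of the outcome
```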
Given the recent availability of authentic EHR datasets from multiple sites across two continents, two natural research questions arise: (1) how well do models trained on data from one institution perform when transferred to another institution? And (2) if out-of-the-box performance suffers after transfer (as might be expected), what remedies are available to improve performance? In typical applications, we are concerned with several kinds of performance, including discrimination quality (such as area under the ROC curve, sensitivity, specificity, and precision) and calibration (when the model predicts that a patient has an X% risk of some outcome, do around X% of such patients really have the outcome?). While discriminative performance is an obvious goal of most ML research in this area, model calibration is rarely assessed in the ML literature, yet it is critical for effective deployment when the probabilities produced by the model are used in decision-making.
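The sketch below shows how both kinds of performance might be measured on a held-out set from the target institution using scikit-learn; the labels and predicted risks are synthetic placeholders.

```python
# A minimal sketch of evaluating both discrimination and calibration for a
# transferred model. y_true / y_prob stand in for held-out outcome labels and
# the model's predicted risks at the target institution.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve

y_true = np.random.binomial(1, 0.1, size=1000)                       # placeholder outcomes
y_prob = np.clip(y_true * 0.3 + np.random.rand(1000) * 0.5, 0, 1)    # placeholder risks

print("AUROC:", roc_auc_score(y_true, y_prob))        # discrimination
print("Brier score:", brier_score_loss(y_true, y_prob))

# Reliability-diagram data: in each bin, compare predicted vs. observed event rate.
obs_rate, pred_rate = calibration_curve(y_true, y_prob, n_bins=10)
for p, o in zip(pred_rate, obs_rate):
    print(f"predicted ~{p:.2f}, observed {o:.2f}")
```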
In the proposed summer research project, students will assess several baseline and recent methods for post-hoc calibration when applied to clinical time series problems (e.g., predict risk of deterioration given all past vitals and labs). Calibration methods to consider will include Platt scaling, isotonic regression, smooth isotonic regression, I-Spline smoothing, and Bayesian Binning.
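As a concrete illustration, two of the named baselines, Platt scaling and isotonic regression, can be fit on a held-out calibration split as sketched below; the scores and labels are synthetic placeholders, and the remaining methods (smooth isotonic regression, I-Spline smoothing, Bayesian binning) are not in scikit-learn and would be implemented separately.

```python
# A minimal sketch of two post-hoc calibration baselines fit on a calibration split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# scores_cal / y_cal: uncalibrated model scores and labels on a calibration set (synthetic here)
scores_cal = np.random.rand(500)
y_cal = np.random.binomial(1, scores_cal * 0.6)

# Platt scaling: a logistic regression mapping the raw score to a probability
platt = LogisticRegression().fit(scores_cal.reshape(-1, 1), y_cal)
calibrated_platt = platt.predict_proba(scores_cal.reshape(-1, 1))[:, 1]

# Isotonic regression: a monotone, piecewise-constant mapping from score to probability
iso = IsotonicRegression(out_of_bounds="clip").fit(scores_cal, y_cal)
calibrated_iso = iso.predict(scores_cal)
```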