Example Projects

Here are some example projects that may be offered in summer 2025.
Actual projects will be chosen based on student and faculty mentor interest.

Biological Network Inference for Discovering Disease-Associated Genes (Cowen, Computer Science): Network representations of gene-gene or protein-protein interactions are powerful aids to biological inference, whether predicting the function of unknown genes or proteins or identifying new disease genes or pathways, and that power is now supported by increasingly strong mathematical theory. Most of this theory has been designed to analyze the classical physical protein-protein interaction network, where the vertices represent proteins and an edge is placed between two proteins if there is physical evidence that they bind in the cell, but other types of pairwise association edges, such as co-expression or genetic interaction, are also common. In all these biological networks, the power of inference comes from homophily: nearby vertices are likely to be involved in similar roles, provided an appropriate graph distance is defined so that proximity correlates well with similarity. The disease-gene prioritization problem takes as input a set of genes known to be associated with a particular disease and outputs a ranked list of additional genes whose proximity to the disease gene set makes them most likely to also be involved in the disease. Graph diffusion-based methods have proved very effective at ranking new genes by their likelihood of disease involvement. The goal of this project is to extend Adagio, a disease-prioritization algorithm created with the help of students who participated in DIAMONDS in Summer 2021, by devising algorithms that are aware of the clustering structure of the known disease genes in the network while ranking. In particular, students will propose and test different ways to generalize gene-gene distances to gene-geneset distances, in a fashion that is more aware of the cluster structure of the known disease genes within the network.
They will compare their methods to existing methods in cross-validation experiments. All the necessary networks and disease gene data are freely available in public data repositories. Once we demonstrate the strength of the new methods in cross-validation, the students will apply their methods to predict new disease genes involved in IBS and Crohn’s disease. We will seek to present and publish our results at a computational biology conference such as ACM-BCB or GLBIO, and in a computational biology journal.
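To make the idea of diffusion-based ranking concrete, here is a minimal sketch of a random walk with restart on a toy network. It is a generic illustration, not Adagio or any method from the project; the gene names, the `restart` value, and the function names are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=200):
    """Diffuse probability mass from a seed set over an undirected graph.

    adjacency: dict mapping each node to the set of its neighbors.
    seeds: known disease genes; the walk restarts uniformly among them.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        # Each step: restart at a seed with probability `restart`,
        # otherwise move to a uniformly chosen neighbor.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            if adjacency[n]:
                share = (1.0 - restart) * score[n] / len(adjacency[n])
                for m in adjacency[n]:
                    nxt[m] += share
        score = nxt
    return score


def rank_candidates(adjacency, seeds, **kwargs):
    """Rank non-seed nodes by diffusion score, highest first."""
    score = random_walk_with_restart(adjacency, seeds, **kwargs)
    candidates = [n for n in adjacency if n not in seeds]
    return sorted(candidates, key=lambda n: score[n], reverse=True)


# Toy network with hypothetical gene names: A and B are the known disease genes.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
toy_scores = random_walk_with_restart(toy_network, seeds={"A", "B"})
toy_ranking = rank_candidates(toy_network, seeds={"A", "B"})
```

In this toy graph, candidates closer to the seed set receive more diffused mass, which is the homophily assumption in action. A cluster-aware variant would change how the seed set is weighted or partitioned before diffusion.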

Large Language Models to streamline access to biological databases (Hassoun, Computer Science): Recent advances in AI-based natural language processing (NLP) have revolutionized AI copilots, which are conversational agents that can interact with users through natural language. Copilots can provide context-aware assistance, automate mundane tasks, analyze data, enable seamless communication, and unify disparate systems. The underlying technology is large language models (LLMs), which are built using transformer models. These models tokenize their inputs and learn parameters (numerical values that define the behavior of the model). Some LLMs are indeed large: they are trained on vast amounts of data and require billions of parameters to specify model behavior. Importantly, trained LLMs can be used for generative AI: to produce content (text and code) based on input prompts in natural language. This extraordinary generative capability made the ChatGPT (Generative Pre-trained Transformer) model an internet sensation. LLMs are disruptive and transformative, creating novel capabilities that allow for NLP interfaces between humans and machines.

Hassoun’s group is focused on developing LLM-based techniques to streamline access to biological databases. These databases are essential for biological and biomedical research, yet users spend hours navigating through links from one database entry to the next, with a large cognitive load in tracking the relationships between the entries. Creating NLP interfaces to databases and using LLM technology to provide coherent analytical summaries grounded in database knowledge will revolutionize database access and create new research modalities.

Undergraduate students will develop and evaluate the use of pretrained LLMs for data retrieval from the PubChem Database. PubChem aggregates chemical information from over 995 sources, including chemical vendors, research and governmental organizations, journal publishers, and other sources. The students will explore the use of LLMs that allow for real-time internet search, e.g., OpenAI’s GPT-4 and GPT-4o and Google’s Gemini, to retrieve the PubChem context relevant to the user query. The students will examine the results of LLM retrieval and compare them against results obtained via search protocols or programmatic access. The students will perform this evaluation on multi-step, commonly requested PubChem protocols. These use cases encompass a variety of retrieval tasks, including finding genes and proteins that interact with a specific compound, identifying drug-like compounds based on structural similarity and bioactivity data for a query compound, and others. Through these case studies, students will investigate best practices in prompt engineering for retrieving data from PubChem, and outline the value and limitations of GPT-4 and Gemini for database retrieval.
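One simple way to compare LLM retrieval against programmatic access is to treat the programmatic result as a reference set and score the LLM answer against it. The sketch below assumes both results have already been reduced to sets of identifiers; the function name and the identifiers are placeholders, not real PubChem records.

```python
def retrieval_scores(llm_ids, reference_ids):
    """Compare identifiers returned by an LLM against a reference set."""
    llm_set, ref_set = set(llm_ids), set(reference_ids)
    hits = llm_set & ref_set
    return {
        # Fraction of LLM answers that the reference confirms.
        "precision": len(hits) / len(llm_set) if llm_set else 0.0,
        # Fraction of reference answers that the LLM found.
        "recall": len(hits) / len(ref_set) if ref_set else 0.0,
        # Reference entries the LLM missed.
        "missing": sorted(ref_set - llm_set),
        # LLM entries with no support in the reference (possible hallucinations).
        "unsupported": sorted(llm_set - ref_set),
    }


# Placeholder identifiers for illustration only.
example_scores = retrieval_scores(
    llm_ids=["ID-1", "ID-2", "ID-9"],
    reference_ids=["ID-1", "ID-2", "ID-3", "ID-4"],
)
```

Tracking the "unsupported" entries separately is useful because an identifier that the database does not contain is a different kind of error from one that was simply overlooked.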

Assessing and Repairing Miscalibration in Cross-Institution Predictive Models for Clinical Time Series (Hughes, Computer Science): Mike Hughes’ research group works on machine learning methods that could be deployed to improve patient outcomes in critical care hospital settings. DIAMONDS undergraduates can help develop models that predict personalized risk scores for outcomes such as mortality or sepsis onset given a history of the patient’s vital signs and laboratory measurements over time. Several electronic health record (EHR) datasets have been released to the public for research purposes: MIMIC-III from a hospital in Boston, eICU from hospitals across the USA, and the HiRID dataset from Switzerland. These are open-access de-identified datasets that undergraduates can access easily; the data contain no identifiable health information. Given the availability of authentic EHR datasets from multiple sites across two continents, two natural research questions arise: (1) how well do models trained on data from one institution perform when transferred to another institution? and (2) if out-of-the-box performance suffers after transfer (as might be expected), what remedies are available to improve performance? Students can build upon MIMIC-Extract, a multi-institution open-source software collaboration that Hughes has co-led, to train and evaluate prediction models for EHR time series data.

We are particularly interested in calibration: when the model predicts that a patient has X% risk of some outcome, can we measure empirically that around X% of such patients really do have the outcome? While discriminative metrics such as sensitivity and specificity are common for evaluation, calibration is rarely assessed yet is critical for effective deployment when the probabilities produced by the model need to be used in decision making. Some work raises awareness of calibration issues in deep supervised learning in general. However, remedies to improve calibration remain underexplored, especially when models transfer across domains. Early efforts assess post-hoc calibration across hospital sites, but do not pursue clinical time-series applications.
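The calibration question above can be checked empirically by grouping predictions into bins and comparing the mean predicted risk in each bin with the observed outcome rate. The following is a minimal sketch of that check; the equal-width binning, the bin count, and the function names are illustrative choices, not the project's evaluation protocol.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per-bin (count, mean predicted risk, observed outcome rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((len(members), mean_pred, observed))
    return rows


def expected_calibration_error(probs, labels, n_bins=5):
    """Size-weighted average gap between predicted and observed risk."""
    total = len(probs)
    return sum(
        count / total * abs(mean_pred - observed)
        for count, mean_pred, observed in reliability_table(probs, labels, n_bins)
    )


# Well calibrated: 10% predicted risk with 1 of 10 outcomes, 90% with 9 of 10.
calibrated_ece = expected_calibration_error(
    [0.1] * 10 + [0.9] * 10,
    [1] + [0] * 9 + [1] * 9 + [0],
)
# Overconfident: 90% predicted risk, but only half the patients have the outcome.
overconfident_ece = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A model can have strong sensitivity and specificity and still show a large gap here, which is why calibration needs its own measurement.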

In the proposed summer research project, students will assess several baseline and recent methods for post-hoc calibration applied to clinical time series problems. Methods will include Platt scaling, isotonic regression, smooth isotonic regression, I-Spline smoothing, and Bayesian Binning. Students will extend the MIMIC-Extract toolbox to include calibration methods and assess cross-institution model transfer. Students will be well-positioned to submit a paper to a workshop co-located with ML conferences such as ICML or NeurIPS.
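As an illustration of the simplest method in that list, Platt scaling fits a two-parameter logistic map from a model's raw scores to probabilities using held-out labels. The sketch below is a bare-bones version fit by plain gradient descent on the log loss; it omits refinements such as target smoothing, and the learning rate, step count, and toy data are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(scores, labels):
            err = sigmoid(a * z + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)


# Toy held-out set: the model emits confident scores of +4 or -4,
# but is right only 75% of the time.
toy_scores = [4.0] * 20 + [-4.0] * 20
toy_labels = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15
slope, intercept = fit_platt(toy_scores, toy_labels)
```

On this toy data, the uncalibrated probability for a score of +4 is above 0.98, while the fitted map pulls it down toward the observed 75% rate. In a cross-institution setting, the fit would use a held-out sample from the target site.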

Investigating Changes Over Time in the Software Vulnerability Information Ecosystem (Votipka, Computer Science): System administrators are responsible for maintaining secure, working systems for organizations of all sizes. This means they must stay aware of and resolve any vulnerabilities identified in their systems. While mitigating these vulnerabilities often simply means updating the software to a new version without the bug, organizational factors make this process challenging. First, the number of vulnerabilities and vulnerable systems that must be managed is often large, while system administration teams are small and have many other responsibilities. Additionally, software updates often require the system to be rebooted. This creates operational costs that have to be weighed against the potential costs of an intrusion. Because of these factors, vulnerabilities can remain unmitigated for years, leading to high-cost intrusions that could have been easily prevented. To avoid these negative outcomes, system administrators must quickly gather relevant and accurate information about new vulnerabilities so that they can triage their efforts and prioritize mitigating the vulnerabilities most likely to lead to an intrusion. Fortunately, there are several sources of public information about vulnerabilities that system administrators can draw on to support their decision-making. However, in practice, system administrators report relying on only a few sources, the sources vary widely by individual, and it is not clear whether these sources actually provide the information system administrators need. Students who participated in DIAMONDS in Summer 2023, and in continuing research with Tufts researchers, developed a dataset of vulnerability information from the sources commonly reported by system administrators for all vulnerabilities reported over a one-year period.
This data collection and its subsequent analysis provided a measurement of the information available in the overall information ecosystem and a prioritization of the sources most likely to answer specific system administrator questions. However, this data was intentionally collected on historical vulnerabilities to assess a stable state of the ecosystem. This is not necessarily representative of the information ecosystem system administrators would encounter as they search in practice. The information ecosystem likely changes over time, and some sources may progress at different rates. The goal of this project is to extend prior data collection and analysis to support daily capture of the information ecosystem to produce a more realistic measure of information available to system administrators. In particular, students will develop efficient website scraping and data storage protocols to capture the higher volume of data. Additionally, they will propose and implement new analyses to measure the rate of daily change. We will use the results of this work to refine our existing suggested list of information sources for system administrators to be sensitive to real-time updates. We will seek to present and publish our results at a computer security conference such as the USENIX Security Symposium or the IEEE Symposium on Security and Privacy.
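One possible starting point for measuring the rate of daily change is to fingerprint each source's captured content per day and count how many sources differ from the previous day. The sketch below assumes snapshots are already stored as text keyed by ISO date and source name; the data layout, source names, and function names are hypothetical, and real pages would need normalization (for example, stripping timestamps and ads) before hashing.

```python
import hashlib


def fingerprint(text):
    """Stable content hash used to detect any change in a captured page."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_sources(previous, current):
    """Sources whose content differs between two daily snapshots.

    A source absent from a snapshot is treated as having empty content.
    """
    return [
        source
        for source in sorted(set(previous) | set(current))
        if fingerprint(previous.get(source, "")) != fingerprint(current.get(source, ""))
    ]


def daily_change_rate(snapshots):
    """Fraction of sources that changed on each day relative to the day before.

    snapshots: dict mapping ISO date strings to {source name: page text}.
    """
    days = sorted(snapshots)  # ISO dates sort chronologically as strings
    rates = {}
    for before, after in zip(days, days[1:]):
        sources = set(snapshots[before]) | set(snapshots[after])
        changed = changed_sources(snapshots[before], snapshots[after])
        rates[after] = len(changed) / len(sources) if sources else 0.0
    return rates


# Hypothetical snapshots for one vulnerability across three days.
example_snapshots = {
    "2025-06-01": {"source_a": "initial advisory", "source_b": "no details yet"},
    "2025-06-02": {"source_a": "initial advisory", "source_b": "patch available"},
    "2025-06-03": {"source_a": "advisory updated", "source_b": "patch available",
                   "source_c": "exploit reported"},
}
example_rates = daily_change_rate(example_snapshots)
```

Storing only fingerprints for unchanged pages is also one way to keep storage manageable as the capture volume grows.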

DIAMONDS: Summer Research in Data Science for Undergraduates
