
Data Science at Khoury College of Computer Sciences
Understanding how we collect, organize, and make sense of data to create knowledge and support human decision-making
Big data is all around us. Extremely large and complex datasets have become a backbone of our digital life — underpinning transportation and navigation systems, detecting financial fraud, and advancing scientific research in biology, among many other spheres.
Data science is the interdisciplinary study of all aspects of data: collecting it, managing it, storing and retrieving it, analyzing it, and building systems that can mine valuable patterns and knowledge, leading to applications that help run our world. Khoury College researchers are breaking new ground in both theory and systems in data science. Together, these areas ensure that data can be efficiently managed and utilized, supporting advancements in artificial intelligence, industry, scientific research, health care, and the humanities.


Changing how we think of data
Data science’s impact on society is all around us, and its role is growing rapidly. It has revolutionized decision-making in health care and drug discovery. Businesses depend on data science to optimize operations, manage supply chains, and keep transactions secure. Scientific research has also benefited, with data science enabling faster discoveries and deeper understanding in fields like genomics, astronomy, and environmental science.
Data science research has also changed how we think of data: It is now a valuable asset in its own right, with possibilities that go beyond any one purpose or application. The ability to collect, analyze, and interpret vast amounts of information has opened new avenues for innovation and problem-solving generally, crossing disciplines and bridging boundaries.
Current research areas
- Business and predictive analytics
- Computational epidemiology
- Computational molecular biology and bioinformatics
- Computational social science
- Computer vision
- Data mining
- Database systems
- Database theory
- Digital humanities
- Game analytics
- Health informatics
- Information retrieval
- Information visualization
- Knowledge representation
- Machine learning
- Natural language processing
- Parallel and distributed data analysis
- Statistics
Domains of interest
- Developing asymptotically optimal algorithms for query evaluation and reverse data management
- Developing asymptotically optimal algorithms for compressed knowledge representation
- Developing visual representations of relational queries


Current project highlights
Any-k: Optimal ranked enumeration for dynamic programs
Any-k research aims to help computers find all possible solutions to a problem and return them in ranked order. This matters in data science because working with results this way could make searches and data analysis much more efficient.
May Institute on Computation and Statistics for Spectrometry and Proteomics
Northeastern’s Barnett Institute for Chemical and Biological Analysis sponsors the national May Institute in computation and statistics for mass spectrometry and proteomics.
Machine Learning Approaches Towards Risk Assessment and Prediction of Adverse Pregnancy Outcomes
This research explores which molecular, clinical, and genetic factors increase the risk of adverse pregnancy outcomes. By applying machine learning to large datasets from pregnant women, this research has the potential to make a direct impact on maternal health.
Recent research publications
A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations
Authors: Neha Makhija, Wolfgang Gatterbauer
This research introduces a new method using Integer Linear Programming to solve the problem of finding the smallest set of data to remove from a database to eliminate specific query results. This method can be applied to a broader range of database queries than previous methods, and in some cases it works faster.
On the Reasonable Effectiveness of Relational Diagrams: Explaining Relational Query Patterns and the Pattern Expressiveness of Relational Languages
Authors: Wolfgang Gatterbauer, Cody Dunne
This research introduces a new way to define and compare query patterns across different relational query languages, leading to the development of Relational Diagrams, a visual tool that helps users understand and write database queries faster and more accurately.
Related labs and groups
Faculty members
- Wolfgang Gatterbauer
Wolfgang Gatterbauer is an associate professor at Khoury College. He works on the theory of scalable data management, with the goal of expanding data management systems and enabling them to support novel functionalities.
- Prashant Pandey
Prashant Pandey is an assistant professor at Khoury College. He researches scalable data systems with robust theoretical foundations for efficient data management, and tackles every level of that challenge, from the theoretical aspects of data structures to the practical issues of scaling data systems.
- Mirek Riedewald
Mirek Riedewald is a professor at Khoury College. His research emphasizes the design of novel, scalable data management and analysis techniques, with applications in ornithology, physics, astronomy, and mechanical and aerospace engineering, among other fields.
- Cheng Tan
Cheng Tan is an assistant professor at Khoury College. His systems and security research focuses on building verifiable outsourced services and certified neural networks.