Advancing the Field of Data Science & Analytics

Our data science research lab is an applied research program intended to develop novel, innovative data science solutions whose practical applications have the potential to make a significant impact across multiple industries.

These projects aim to extend what is currently technologically possible by using data science tools and techniques to rapidly explore novel methods on underutilized datasets. In addition to being more technologically advanced, our research lab projects typically require a higher level of expertise than our other projects.

The goal of the research lab is the application of machine learning and data-theoretic techniques to novel domains. To this end, we focus on the construction of data products, that is, products that derive their value from data and generate more data in return. Ancillary projects, such as tools for distributed or clustered computing, data ingestion or wrangling, new modeling techniques, or visualization projects, are also of interest because they contribute to the successful construction of data products.

Some of our topics of interest are:

  • Application of Machine Learning techniques to new data sets and domains.

  • Natural Language analytics and understanding.

  • Distributed storage systems and computation (Big Data/NoSQL).

  • Architecture and deployment of data or statistical systems.

  • Data pipeline tools and methodologies.

  • Position or vision papers concerning data techniques.

Our research lab combines applied development with publication and demonstration to create an innovative environment for its participants. 


News Flash is the first information product to come out of our Research Lab. It leverages the power of Natural Language Processing (NLP) to obtain the latest news articles on a specific topic, evaluate the quality of each article, summarize the content, and deliver the best stories, in newsletter format, straight to your email inbox every week.

No matter what your areas of interest are, there are always new developments to keep up with (and a constant stream of news articles covering them). We developed News Flash to help you sift through the noise and stay informed about the most important developments in the areas you care about.


The machine learning workflow can be described as the construction and evaluation of model selection triples, each combining a feature engineering step, an algorithm, and a set of hyperparameters. Recently, much of this workflow has been automated through the use of standardized APIs and search. In practice, however, human intuition and guidance are often more effective than an exhaustive enumeration and comparison of models. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid common pitfalls.
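To make the idea of a model selection triple concrete, here is a minimal, dependency-free sketch of the exhaustive search described above. The toy data, the two feature transforms, the two tiny models, and all function names are assumptions made up for illustration. It is not how Yellowbrick or Scikit-Learn implement search.

```python
from itertools import product

# Toy 1-D regression data (invented): y is roughly x squared.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 4.2, 8.8, 16.1, 24.9]

# 1. Feature engineering options.
FEATURES = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
}


# 2. Algorithm options. Each fit function returns a predict function.
def fit_ridge(xs, ys, alpha):
    """1-D ridge regression through the origin; alpha is the penalty."""
    w = sum(a * b for a, b in zip(xs, ys)) / (sum(a * a for a in xs) + alpha)
    return lambda x: w * x


def fit_knn(xs, ys, k):
    """k-nearest-neighbor regression; k is the neighborhood size."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(target for _, target in nearest) / len(nearest)
    return predict


# 3. Hyperparameter grids, one per algorithm.
ALGORITHMS = {
    "ridge": (fit_ridge, [0.1, 1.0, 10.0]),
    "knn": (fit_knn, [1, 3]),
}


def loo_mse(feature, fit, hyper):
    """Leave-one-out mean squared error for one triple."""
    zs = [feature(x) for x in X]
    errors = []
    for i in range(len(zs)):
        model = fit(zs[:i] + zs[i + 1:], y[:i] + y[i + 1:], hyper)
        errors.append((model(zs[i]) - y[i]) ** 2)
    return sum(errors) / len(errors)


def rank_triples():
    """Score every (feature, algorithm, hyperparameter) triple, best first."""
    results = []
    for (feat_name, feature), (algo_name, (fit, grid)) in product(
        FEATURES.items(), ALGORITHMS.items()
    ):
        for hyper in grid:
            results.append((loo_mse(feature, fit, hyper), feat_name, algo_name, hyper))
    return sorted(results)


ranked = rank_triples()
for mse, feat_name, algo_name, hyper in ranked[:3]:
    print(f"{feat_name:>8} + {algo_name:<5} ({hyper}): LOO MSE = {mse:.3f}")
```

Even on this tiny problem the search evaluates ten triples, and the count grows multiplicatively with every option added. That growth is the reason visual steering can be more practical than enumerating everything.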

The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data. Yellowbrick is written in Python, uses matplotlib for drawing, and is already in a beta stage of development.

In this research project, we focus on extending Yellowbrick with new features and functionality, from adding text visualizations to optimizing parallel coordinates. Because Yellowbrick builds on the Scikit-Learn API, this essentially means developing new model visualization techniques and writing custom Visualizer objects that implement them.
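The shape of a custom Visualizer object can be sketched without any dependencies. The class below is hypothetical: its name, its parameters, and the text-based draw method are assumptions for illustration, and it does not subclass or reproduce Yellowbrick's actual base classes (which draw with matplotlib). It only mirrors the fit and transform pattern described above.

```python
from statistics import pvariance


class VarianceVisualizer:
    """Hypothetical visualizer: learns per-feature variance, draws text bars."""

    def __init__(self, feature_names, width=20):
        self.feature_names = feature_names
        self.width = width
        self.variances_ = None  # learned in fit(), estimator-style

    def fit(self, X, y=None):
        # Compute the population variance of each column of X.
        self.variances_ = [pvariance(column) for column in zip(*X)]
        return self

    def transform(self, X):
        # Pass the data through unchanged so later pipeline steps still see it.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def draw(self):
        # Render one bar per feature, scaled to the largest variance.
        if self.variances_ is None:
            raise RuntimeError("call fit() before draw()")
        top = max(self.variances_) or 1.0
        lines = []
        for name, var in zip(self.feature_names, self.variances_):
            bar = "#" * round(self.width * var / top)
            lines.append(f"{name:<10} {bar} {var:.2f}")
        return "\n".join(lines)


data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
viz = VarianceVisualizer(["height", "constant"])
passed_through = viz.fit_transform(data)
print(viz.draw())
```

Because transform returns its input untouched, an object like this can sit between other steps and report on the data at that stage without altering the model being built.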


Dynamic Graph Visualization & Navigation


Graph data is all around us, powering cultural, consumer, and corporate applications that touch every part of our lives. From social networks to scientific taxonomies, and from information standards for a wide range of business practices to web search engines and archives, graph structures are a core tool in the data science toolbox.

While we are fortunate to have a variety of software libraries for manipulating graph data, visualizing and navigating changes in graph structures over time remains a major challenge. With many aspects of network analysis well understood, and at a time when processing backends and front-end application frameworks are better than ever, we have an opportunity to devise new paradigms for visualizing how large graphs change over time, communicating their complexity without sacrificing coherence. 
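One simple starting point for reasoning about graph change is to compare successive snapshots. The sketch below assumes an undirected graph stored as a set of edge pairs per time step. The function names and the sample timeline are invented for illustration and do not come from any particular graph library.

```python
def normalize(edges):
    """Treat (u, v) and (v, u) as the same undirected edge."""
    return {frozenset(edge) for edge in edges}


def nodes_of(edges):
    """Collect every node that appears in at least one edge."""
    return set().union(*edges)


def diff_snapshots(before, after):
    """Report what was added and removed between two snapshots."""
    old, new = normalize(before), normalize(after)
    return {
        "edges_added": new - old,
        "edges_removed": old - new,
        "nodes_added": nodes_of(new) - nodes_of(old),
        "nodes_removed": nodes_of(old) - nodes_of(new),
    }


# Three snapshots of a small graph over time (made-up data).
timeline = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "b"), ("c", "d")},
    {("c", "d")},
]

changes = [diff_snapshots(prev, cur) for prev, cur in zip(timeline, timeline[1:])]
for step, change in enumerate(changes, start=1):
    print(step, {key: len(value) for key, value in change.items()})
```

A visualization layer could then animate only the added and removed elements at each step instead of redrawing the whole graph, which is one way to keep large, changing graphs readable.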

This symposium will consider the state of the art in dynamic graph visualization and tools for analyzing large-scale graph data sources, with a focus on preparing novel tools for browsing and understanding graph dynamics.


Natural Language Understanding & Text Summarization

Natural Language Processing and Text Classification

Unstructured text, often in paragraph form, is an important means of sharing concepts, ideas, and information throughout the world. As the volume of unstructured text grows, it contains an increasing amount of valuable information that could be organized into a structured format and used by both computers and humans to make decisions and discover insights.

In computational natural language understanding, statistical and non-statistical approaches have traditionally been kept separate. As vector-based (statistical) approaches improve, there is an opportunity to combine both kinds of approach to improve accuracy in analyzing, classifying, and summarizing text to generate facts.
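As a small illustration of the statistical side only, the sketch below performs extractive summarization by scoring each sentence on the average corpus frequency of its words. The stopword list, the function names, and the sample text are assumptions for illustration. It is a textbook baseline, not the method used in our projects.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


sample = "Graphs model relationships. Large graphs change over time. The weather was nice."
summary = summarize(sample, n_sentences=1)
print(summary)
```

A non-statistical component, such as hand-written rules that recognize particular sentence patterns, could be layered on top of a score like this. Combining the two is the opportunity described above.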


Semantic Networks for Information Extraction


Natural language understanding techniques like OpenWIMs and FrameNet have recently gained popularity for parsing documents into connected semantic networks. These networks (represented as entities and their relationships) can then be used to create meaning-driven representations of text documents and corpora. Networks can be used to derive the topics contained within documents, determine content similarity between documents, cluster similar and complementary documents together, and classify them at various levels for fast, efficient information retrieval.
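The sketch below shows how a network of entities and relationships can support topic derivation and document similarity. It assumes the parsing step has already happened: the triples are hand-written stand-ins, not output from OpenWIMs or FrameNet, and the function names are hypothetical.

```python
from collections import Counter

# Each document is a set of (subject, relation, object) triples (made-up data).
docs = {
    "doc_a": {("python", "is_a", "language"), ("yellowbrick", "written_in", "python")},
    "doc_b": {("python", "is_a", "language"), ("pandas", "written_in", "python")},
    "doc_c": {("graph", "has_part", "node"), ("graph", "has_part", "edge")},
}


def entities(triples):
    """All entities that appear as a subject or an object."""
    return {entity for subj, _, obj in triples for entity in (subj, obj)}


def jaccard(a, b):
    """Size of the overlap divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(name1, name2):
    """Content similarity of two documents based on shared entities."""
    return jaccard(entities(docs[name1]), entities(docs[name2]))


def main_topic(triples):
    """The most connected entity, used here as a crude topic label."""
    degree = Counter()
    for subj, _, obj in triples:
        degree[subj] += 1
        degree[obj] += 1
    return degree.most_common(1)[0][0]


print(similarity("doc_a", "doc_b"), similarity("doc_a", "doc_c"))
print({name: main_topic(triples) for name, triples in docs.items()})
```

Pairwise similarities computed this way could feed a standard clustering step to group similar and complementary documents for retrieval.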

Some practical applications include education and research solutions that compile well-rounded groups of related documents for more productive study or research, and commercial solutions that identify complementary or substitute products and services based on their descriptions and reviews.


Other Projects

We also have additional projects in the pipeline involving high-dimensional visualization, complex systems, optimization, streaming data, evolutionary algorithms, and genetic computing.