About Our Research Lab
The District Data Labs Research Lab is an applied research program intended to develop novel, innovative data science solutions whose practical applications have the potential to make a significant impact across multiple industries.
These projects aim to push the envelope of current technological possibility by using data science tools and techniques to rapidly explore novel methods applied to underutilized datasets. In addition to being more technologically advanced, our research lab projects also typically require a higher level of expertise than our other projects.
The goal of the Research Lab is to apply machine learning and data-theoretic techniques to novel domains. To this end, we focus on the construction of data products: products that derive their value from data and generate more data in return. Ancillary projects such as tools for distributed or clustered computing, data ingestion or wrangling, new modeling techniques, or visualization projects are also of interest, as they contribute to the successful construction of data products.
Broadly, some topics of interest are as follows:
- Application of Machine Learning techniques to new data sets
- Distributed storage systems and computation (Big Data/NoSQL)
- Architecture and deployment of data or statistical systems
- Data pipeline tools and methodologies
- Natural Language analytics and understanding
- Position or vision papers concerning data techniques
Our Research Lab combines applied development with publication and demonstration to create an innovative environment for its participants.
Yellowbrick: Visual Diagnostics for Machine Learning
The machine learning workflow can be described as the construction and evaluation of model selection triples: feature engineering, algorithm selection, and hyperparameter tuning. Recently, much of this workflow has been automated through the use of standardized APIs and search. In practice, however, human intuition and guidance are more effective than an exhaustive enumeration and comparison of models. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid pitfalls and traps.
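To make the idea of a model selection triple concrete, here is a minimal sketch of what exhaustive enumeration looks like. The feature sets, algorithm names, and hyperparameter values are hypothetical placeholders chosen for illustration; no models are actually trained.

```python
from itertools import product

# Hypothetical candidates for each element of the triple (illustration only).
feature_sets = ["raw", "scaled", "scaled+pca"]
algorithms = ["logistic_regression", "random_forest"]
hyperparameters = {
    "logistic_regression": [{"C": 0.1}, {"C": 1.0}],
    "random_forest": [{"n_estimators": 50}, {"n_estimators": 200}],
}


def enumerate_triples(feature_sets, algorithms, hyperparameters):
    """Yield every (features, algorithm, params) combination."""
    for features, algorithm in product(feature_sets, algorithms):
        for params in hyperparameters[algorithm]:
            yield features, algorithm, params


triples = list(enumerate_triples(feature_sets, algorithms, hyperparameters))
print(len(triples))  # 3 feature sets * 2 algorithms * 2 settings each = 12
```

The number of candidates grows multiplicatively with each added option, which is one reason steering the search by eye can be more practical than comparing every triple.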
The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process, providing visual diagnostics throughout the transformation of high dimensional data. Yellowbrick is written in Python, utilizes matplotlib for drawing, and is already in a beta stage of development.
In this research lab, we will focus on extending Yellowbrick with new features and functionality, from adding text visualizations to optimizing parallel coordinates. Because of the API, this essentially means developing different kinds of model-visualizing techniques and writing custom Visualizer objects that implement them. The lab will select a core group of 6-8 developers to work in an agile fashion on Python code, tutorials in blog posts, and documentation. Sprints will be two weeks long, and at our research meetings we will review progress from the previous sprint and plan the next one. The implementation itself will be done in pairs throughout each sprint.
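As a rough illustration of what a custom visualizer might involve, the sketch below mirrors the fit/transform convention described above. It is not Yellowbrick's actual API: the class name, the draw method, and the text-based output (a stand-in for matplotlib drawing) are all assumptions made for this example.

```python
class TextBarVisualizer:
    """Hypothetical visualizer: counts class labels in fit, draws text bars.

    Only the fit/transform convention is borrowed from Scikit-Learn;
    everything else here is an illustrative assumption.
    """

    def fit(self, X, y):
        # Learn what to draw: the number of samples per class label.
        self.counts_ = {}
        for label in y:
            self.counts_[label] = self.counts_.get(label, 0) + 1
        return self  # returning self allows call chaining

    def transform(self, X):
        # Pass the data through unchanged so later steps are unaffected.
        return X

    def draw(self):
        # Stand-in for a plotting call: one row of '#' per class label.
        return "\n".join(
            f"{label}: {'#' * count}"
            for label, count in sorted(self.counts_.items())
        )


X = [[0.1], [0.4], [0.35], [0.8]]
y = ["spam", "ham", "ham", "ham"]
viz = TextBarVisualizer().fit(X, y)
chart = viz.draw()
print(chart)
```

Because transform returns its input untouched, an object like this could sit between processing steps and report on the data without altering it.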
Dynamic Graph Visualization & Navigation
Graph data is all around us, powering cultural, consumer, and corporate applications that touch every part of our lives. From social networks to scientific taxonomies, and from information standards for a wide range of business practices to web search engines and archives, graph structures are a core tool in the data science toolbox.
While we are fortunate to have a variety of software libraries for manipulating graph data, visualizing and navigating changes in graph structures over time remains a major challenge. With many aspects of network analysis well understood, and at a time when processing backends and front-end application frameworks are better than ever, we have an opportunity to devise new paradigms for visualizing how large graphs change over time, communicating their complexity without sacrificing coherence.
This symposium will consider the state of the art in dynamic graph visualization and tools for analyzing large-scale graph data sources, with a focus on preparing novel tools for browsing and understanding graph dynamics.
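One simple way to think about change in a graph over time is to compare two snapshots and report what was added or removed. The sketch below assumes each snapshot is a set of undirected edges; the function names and sample data are hypothetical, and isolated nodes are not represented in this simplified model.

```python
def edge(a, b):
    """Undirected edge: the order of the endpoints does not matter."""
    return frozenset((a, b))


def nodes(snapshot):
    """All nodes that appear in at least one edge of the snapshot."""
    return {n for e in snapshot for n in e}


def diff_snapshots(before, after):
    """Return the edges and nodes added or removed between two snapshots."""
    return {
        "added_edges": after - before,
        "removed_edges": before - after,
        "added_nodes": nodes(after) - nodes(before),
        "removed_nodes": nodes(before) - nodes(after),
    }


# Two hypothetical snapshots of the same small graph.
t0 = {edge("a", "b"), edge("b", "c")}
t1 = {edge("a", "b"), edge("c", "d")}
changes = diff_snapshots(t0, t1)
print(sorted(changes["added_nodes"]))
```

A diff like this is only the raw material: deciding how to display thousands of such changes coherently is the harder visualization problem.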
Patient Matching & Entity Resolution in Healthcare
Patient matching (also called entity resolution, record linkage, and de-duplication) is the process of linking patients to their data across the healthcare system. Without effective patient matching, medical providers get only a partial view of a patient’s health. Patient matching issues are often compounded by the lack of unique identifiers, which means that records must be matched on variables that are often different across healthcare providers. Examples of these common variables include first name, last name, and date of birth.
However, even with this information, it may not be possible to uniquely identify a patient. The Maria Garcia problem has been noted in a report published by the Bipartisan Policy Center. The report points out that in Harris County, Texas, there are 2,488 real patients named Maria Garcia and 231 of them have the same birth date. The aforementioned case assumes that there aren’t any data quality issues such as misspellings, transposed values, additional or missing characters, and empty fields. The addition of these types of errors makes the challenge of linking patient records a very computationally complex one.
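The effect of data quality issues on matching can be sketched with Python's standard difflib module. The helper names and sample values below are hypothetical; the point is only that exact comparison fails on a transposed character, while a normalized fuzzy comparison still scores the pair as close, and an empty field gives no evidence either way.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, trim, and collapse internal whitespace; None becomes ''."""
    return " ".join((value or "").lower().split())


def field_similarity(a, b):
    """Similarity in [0, 1], or None when either field is empty."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return None  # a missing value is no evidence for or against a match
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical values: a transposed character plus stray spacing and case.
exact = "Maria Garcia" == "Maira  garcia"
fuzzy = field_similarity("Maria Garcia", "Maira  garcia")
missing = field_similarity("Maria Garcia", "")
print(exact, round(fuzzy, 2), missing)
```

A high similarity score is not proof of identity, as the Maria Garcia example shows: many distinct patients can score perfectly on name and birth date alone.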
In this semester-long lab, we will build on the previous semester's entity resolution lab and focus on applying these technologies in healthcare. We will review the literature, explore several challenges surrounding entity resolution in healthcare, and cover topics such as data cleaning and normalization, blocking, comparison, and evaluation of patient matching systems. We will also explore new approaches involving machine learning and how they can help solve real-world problems in healthcare.
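The stages named above (blocking, comparison, and evaluation) can be sketched end to end with the standard library. Everything here is an assumption made for illustration: the records are fabricated, the blocking key and the 0.85 threshold are arbitrary choices, and the ground truth is invented so that precision and recall can be computed.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Fabricated records for illustration only.
records = [
    {"id": 1, "first": "Maria", "last": "Garcia", "dob": "1980-02-14"},
    {"id": 2, "first": "Maira", "last": "Garcia", "dob": "1980-02-14"},
    {"id": 3, "first": "Maria", "last": "Garcia", "dob": "1975-07-01"},
    {"id": 4, "first": "John", "last": "Smith", "dob": "1980-02-14"},
]
true_matches = {(1, 2)}  # hypothetical ground truth


def block_key(record):
    # Blocking: only records sharing birth date and last initial are compared.
    return (record["dob"], record["last"][:1].lower())


def similarity(a, b):
    # Comparison: fuzzy similarity of the lowercased full names.
    name_a = f"{a['first']} {a['last']}".lower()
    name_b = f"{b['first']} {b['last']}".lower()
    return SequenceMatcher(None, name_a, name_b).ratio()


def match(records, threshold=0.85):
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    pairs = set()
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if similarity(a, b) >= threshold:
                pairs.add((a["id"], b["id"]))
    return pairs


# Evaluation: compare predicted pairs against the ground truth.
predicted = match(records)
true_positives = len(predicted & true_matches)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(true_matches) if true_matches else 0.0
print(predicted, precision, recall)
```

Blocking trades recall for speed: it avoids comparing every pair of records, but a true match whose blocking fields disagree (for example, a mistyped birth date) will never be compared at all.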
Natural Language Processing Classification
Unstructured text, often in paragraph form, is an important means of sharing concepts, ideas, and information throughout the world. As the volume of unstructured text grows, it contains increasing amounts of valuable information that could be organized in a structured format and used by computers and humans to make decisions and discover insights.
In the computational understanding of natural language, there has traditionally been a separation between statistical and non-statistical approaches. With the improvement of vector-based (statistical) approaches, there is an opportunity to combine both approaches to improve accuracy in classifying text to generate facts.
In this ongoing project, we are adding two new members to the team and will continue to develop approaches for improving vector-based training. We will also continue, with several extensions to the current work, our effort to combine human-developed ontological models with source corpora so that a trained model is better able to identify synonyms and word relationships.
Semantic Networks for Information Extraction
Natural language understanding techniques like OpenWIMs and FrameNet have been gaining popularity recently by parsing documents into connected semantic networks. These networks (represented as entities and their relationships) can then be used to create meaning-driven representations of text documents and corpora. Networks can be used to derive topics contained within documents, determine content similarity between documents, cluster similar and complementary documents together, and classify them at various levels for fast, efficient information retrieval.
Some practical applications include education and research solutions that compile well-rounded groups of related documents for more productive study or research, and commercial solutions that identify complementary or substitute products and services based on their descriptions and reviews.
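As a toy illustration of using such networks for topic derivation and document similarity, the sketch below represents each document as a set of (subject, relation, object) triples. The triples are hand-written rather than produced by a parser, and the function names are hypothetical; Jaccard overlap of entities is just one simple similarity measure among many.

```python
from collections import Counter

# Hand-written semantic networks (illustration only, no parser involved).
doc_a = {("aspirin", "treats", "headache"), ("aspirin", "is_a", "drug")}
doc_b = {("ibuprofen", "treats", "headache"), ("ibuprofen", "is_a", "drug")}
doc_c = {("python", "is_a", "language"), ("python", "runs_on", "linux")}


def entities(network):
    """All entities that appear as a subject or object in the network."""
    return {e for s, _, o in network for e in (s, o)}


def jaccard(a, b):
    """Share of elements common to both sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_entities(network, n=1):
    """Most connected entities, a crude proxy for the document's topic."""
    counts = Counter(e for s, _, o in network for e in (s, o))
    return [e for e, _ in counts.most_common(n)]


sim_ab = jaccard(entities(doc_a), entities(doc_b))
sim_ac = jaccard(entities(doc_a), entities(doc_c))
print(sim_ab, sim_ac, top_entities(doc_a))
```

Here the two medical documents share half of their entities while the unrelated document shares none, which is the kind of signal that clustering or retrieval could build on.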
We also have additional projects in the pipeline involving high-dimensional visualization, complex systems, optimization, streaming data, evolutionary algorithms, and genetic computing.
Join the Research Lab
We are looking to put together small, dedicated teams of people to help us bring these technologies to fruition. As a member of the DDL Research Lab, you’ll have the opportunity to:
- Conduct novel research and work on innovative data science projects.
- Make a significant impact pushing the envelope of technological possibility.
- Collaborate with other highly skilled practitioners in data science.
Candidates should have several years of data science and programming experience and be able to demonstrate a high level of competency in those areas. You should also have at least 10 hours per week of available time for a period of 4 months to dedicate to project work.
Sponsor Our Research
If you are interested in supporting advancements in any of our research topics, please consider sponsoring our research in those areas. The benefits of sponsorship include:
- Gaining insight into the state-of-the-art.
- Advancing an area of data science that is important to you.
- Being listed as a sponsor of the program.
- Receiving early access to our findings and developments in your area of interest.
Sponsorship can be financial or in-kind, and will help with meeting space, computing resources, and other research expenses. If you'd like to sponsor us, please get in touch via the form below.