CS Colloquium: Ozlem Uzuner (GMU)
De-identification of Clinical Narratives
De-identification is the task of finding and removing patient-identifying private health information from electronic health records. This is a sensitive and complicated task: failure to remove all private health information from records results in violations of privacy, whereas overzealous de-identification can remove medically salient text from the records and render the data unusable for future research. Ambiguities between private health information and medical concepts can exacerbate this problem. Misspellings and foreign names render dictionaries inadequate. The short, telegraphic style of clinical narratives and the density of jargon therein limit the transfer of methods from the open domain.
In this talk, we will present two natural language processing approaches to de-identification. First, we will define and use "local context". We will show that local context goes a long way towards effective de-identification, even when the concepts are presented in ungrammatical and fragmented narrative text. Next, we will take a deep learning approach to learning long-distance contextual features for de-identification. We will show that featureless recurrent neural networks (RNNs) can adequately capture long-distance contextual information, giving de-identification performance comparable to that of methods with manually designed features. Adding manually designed features to featureless RNNs gives them the boost they need to outperform the state of the art.
Bio: Dr. Ozlem Uzuner is an associate professor at the Information Sciences and Technology Department of George Mason University. She also holds a visiting associate professor position at Harvard Medical School and is a research affiliate at the Computer Science and Artificial Intelligence Laboratory of MIT. Dr. Uzuner specializes in Natural Language Processing and its applications to real-world problems, including healthcare and policy. Her current research interests include information extraction from fragmented and ungrammatical narratives for capturing meaning, studies of consumer-generated text such as social media and electronic petitions, and semantic representation development for phenotype prediction, fraud detection, and topic modeling. Her research has been funded by the National Institutes of Health, the National Library of Medicine, the National Institute of Mental Health, the Office of the National Coordinator, and by industry.
Friday, October 5 at 1:30pm