With the exploding growth of online video and audio content, there's an increasing need for audio that is indexable and searchable. Over a three-week project, I built a tool that automatically identifies speakers in a recorded conversation: given a corpus of audio recordings, it determines when each speaker is talking. In my presentation at Strata-Hadoop 2017, I step through how I approached this problem, the algorithms used, and the steps taken to validate the results. I also share some of the challenges and pitfalls encountered along the way and describe potential applications and extensions of the tool.
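The abstract doesn't name the specific algorithms the talk covers, but the underlying task, deciding who speaks when, is commonly framed as: split the audio into frames, extract a feature vector per frame, cluster frames by speaker, and merge consecutive same-speaker frames into turns. The stdlib-only sketch below illustrates that pipeline on synthetic 2-D features; in a real system the features would be acoustic (e.g. MFCCs) and the clustering more sophisticated, so treat every name and number here as illustrative.

```python
import random

def nearest(pt, centroids):
    """Index of the centroid closest to pt (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))

def kmeans(points, k=2, iters=20, seed=0):
    """Tiny k-means: cluster per-frame feature vectors into k speakers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(pt, centroids) for pt in points]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def labels_to_turns(labels, frame_sec=1.0):
    """Collapse per-frame speaker labels into (start_sec, end_sec, speaker) turns."""
    turns, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            turns.append((start * frame_sec, i * frame_sec, labels[start]))
            start = i
    return turns

# Synthetic 2-D "acoustic" features: one speaker's frames sit near (0, 0),
# the other's near (5, 5). Real systems would use e.g. MFCC vectors here.
frames = [(0.1, 0.2), (0.0, 0.1), (5.1, 4.9), (5.0, 5.2), (0.2, 0.0), (0.1, 0.1)]
turns = labels_to_turns(kmeans(frames, k=2))
print(turns)  # three speaker turns: A, B, A (cluster ids depend on init)
```

The turn-merging step is what converts raw per-frame labels into the "who spoke when" output the abstract describes.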
In this talk, presented at SciPy 2016, we shared an analysis of 59,000 OkCupid user profiles that examined online self-presentation by combining natural language processing (NLP) with machine learning. We analyzed word usage patterns by self-reported sex and drug usage status. Along the way, we reviewed standard NLP techniques, covered several ways to represent text data, and explained topic modeling. We found that individuals in particular demographic groups self-present in consistent ways. Our results suggest that users may unintentionally reveal demographic attributes in their online profiles.
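The talk's word-usage comparison rests on a basic idea: tokenize each profile, count word frequencies per demographic group, and rank words by how much their relative frequency differs between groups. A minimal stdlib-only sketch of that idea follows; the toy profile strings and the simple frequency-gap score are illustrative stand-ins, not the talk's actual data or statistic.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a stand-in for the NLP preprocessing the talk reviews."""
    return re.findall(r"[a-z']+", text.lower())

def top_distinctive(group_a, group_b, n=2):
    """Words with the largest relative-frequency gap favoring group_a,
    a crude stand-in for a proper distinctive-word statistic."""
    ca, cb = Counter(), Counter()
    for doc in group_a:
        ca.update(tokenize(doc))
    for doc in group_b:
        cb.update(tokenize(doc))
    total_a, total_b = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    score = {w: ca[w] / total_a - cb[w] / total_b for w in vocab}
    return sorted(vocab, key=lambda w: score[w], reverse=True)[:n]

# Hypothetical toy "profiles" for two groups (not OkCupid data).
a = ["I love hiking and craft beer", "hiking trips and beer on weekends"]
b = ["yoga brunch and travel", "travel photos and yoga every morning"]
print(top_distinctive(a, b))  # the words most overrepresented in group a
```

Shared function words like "and" score near zero because they appear at similar rates in both groups, which is the intuition behind the group-level word-usage patterns the analysis surfaces.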