As people live more of their lives online, there is a growing need for high-quality natural language processing of social media posts, chat logs, and forum replies. Unfortunately, many common preprocessing routines discard information that is useful in conversational data. This talk will describe tools and techniques for handling this distinctive type of language.
- Common stop word lists should be revised so that they retain vocabulary that is extremely useful for conversational data
- Text normalization techniques (such as lowercasing) often smooth over people’s ways of expressing emotion or tone of voice online
- Document vectors don’t need to be composed solely of word embeddings
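As a rough illustration of how these three points could fit together, here is a minimal sketch using only the Python standard library. Everything named in it is an assumption made for the example rather than the talk's actual method: the tiny `GENERIC_STOP_WORDS` list, the `KEEP_FOR_CONVERSATION` set, the four style features, and the toy two-dimensional embeddings are all hypothetical stand-ins.

```python
import re

# Hypothetical stand-in for a generic stop word list from an NLP library.
GENERIC_STOP_WORDS = {"a", "am", "the", "is", "to", "it", "this",
                      "i", "you", "not", "no", "so", "very"}
# Assumption: pronouns, negations, and intensifiers carry signal in conversation.
KEEP_FOR_CONVERSATION = {"i", "you", "not", "no", "so", "very"}
STOP_WORDS = GENERIC_STOP_WORDS - KEEP_FOR_CONVERSATION

TOKEN_RE = re.compile(r"[A-Za-z']+")
# Matches any character repeated three or more times in a row ("happyyyy").
ELONGATION_RE = re.compile(r"(\w)\1{2,}")


def style_features(text):
    """Measure tone cues on the raw text, before any lowercasing."""
    tokens = TOKEN_RE.findall(text)
    n = max(len(tokens), 1)
    shouted = sum(1 for t in tokens if len(t) > 1 and t.isupper())
    elongated = sum(1 for t in tokens if ELONGATION_RE.search(t))
    return [shouted / n, elongated / n,
            float(text.count("!")), float(text.count("?"))]


def content_tokens(text):
    """Lowercase the tokens and drop only the revised stop words."""
    lowered = (t.lower() for t in TOKEN_RE.findall(text))
    return [t for t in lowered if t not in STOP_WORDS]


def document_vector(text, embeddings, dim):
    """Average the word embeddings, then append the style features."""
    vecs = [embeddings[t] for t in content_tokens(text) if t in embeddings]
    if vecs:
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    else:
        mean = [0.0] * dim
    return mean + style_features(text)


# Toy embeddings, invented for this example only.
toy_embeddings = {"so": [1.0, 0.0], "not": [0.0, 1.0]}
print(document_vector("I am SO not happyyyy!!!", toy_embeddings, dim=2))
```

The style features are computed from the original text, so lowercasing the tokens for the embedding lookup does not lose the capitalization, elongation, or punctuation cues. In practice the embeddings would come from a trained model, and the feature set would be chosen to suit the data.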