This presentation will review common problems encountered with web-sourced data, such as content cleaning, duplicate detection, clustering, and classification, and describe the algorithms that work best as the volume of data increases, along with hacks for getting high-quality results as quickly as possible.
Hilary Mason is the Chief Scientist at bit.ly, where she finds sense in vast data sets. Her work involves both pure research and development of product-focused features.
She’s also a co-founder of HackNY, a non-profit organization that connects talented student hackers from around the world with startups in NYC.
She has discovered two new species, loves to bake cookies, and asks way too many questions.