Website | @hmason
Hilary Mason is the Chief Scientist at bit.ly, where she finds sense in vast data sets. Her work involves both pure research and development of product-focused features.
She’s also a co-founder of HackNY, a non-profit organization that connects talented student hackers from around the world with startups in NYC.
Hilary recently started the data science blog Dataists and is a member of the hacker collective NYC Resistor.
She has discovered two new species, loves to bake cookies, and asks way too many questions.
This presentation will review common problems encountered with web-sourced data, such as content cleaning, duplicate detection, clustering, and classification. It will also describe the algorithms that work best as the volume of data increases, along with hacks for getting high-quality results as quickly as possible.