This presentation will review common problems encountered with web-sourced data, such as content cleaning, duplicate detection, clustering, and classification. It will describe the algorithms that work best as the volume of data increases, along with hacks for getting high-quality results as quickly as possible.
Hilary Mason is the founder and CEO of Fast Forward Labs, a machine intelligence research company, and data scientist in residence at Accel Partners. Previously, Hilary was chief scientist at Bitly. She co-hosts DataGotham, a conference for New York’s home-grown data community, and co-founded HackNY, a non-profit that helps engineering students find opportunities in New York’s creative technical economy. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of the Brooklyn hacker collective NYC Resistor.