Web Text-based Network Industry Classifications: Preliminary Results


Studies of market structure and product market competition are important in many disciplines, such as economics, finance, accounting, and management. Reliable data for such studies is readily available for public firms (e.g., from 10-K filings), but no reliable data exists for private firms. In this work we propose to mine the Internet Archive Wayback Machine, a digital archive of the World Wide Web, to build a database of 300,000 companies to support analyses of market structure, product market competition, and innovation. The goal of the Web Text-based Network Industry Classifications (WTNIC) project is to download pages from the archive to build a profile for each company, and to use machine learning techniques to define the similarity between companies based on the similarity of their product and service offerings. This paper describes the challenges that must be overcome, our approach to overcoming them, and some preliminary results.

SIGMOD Workshop on Data Science for Macro-Modeling with Financial and Economic Datasets (DSMM)