Common Crawl is the non-profit organization that builds and maintains the single largest publicly accessible dataset of the world's knowledge, encompassing petabytes of web crawl data. Anyone can download and use the data for free, and it has been used for a wide variety of purposes.
As the crawl engineer, you'll run a crawl that spans hundreds of millions of domains and billions of pages each month. You'll command a fleet of machines on AWS that use Nutch to capture the web data and then Hadoop to turn it into a better-structured dataset for others to use. This is a rewarding role, as you're really giving back to the open data community :)
Requirements:
- Fluent in Java (Nutch and Hadoop are core to our mission)
- Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
- Knowledge of the Amazon Web Services (AWS) ecosystem
- Experience with Python
- Basic command line Unix knowledge
- BS in Computer Science or equivalent work experience
Position: Crawl Engineer / Data Scientist
Location: SF or Remote
Email: jobs@commoncrawl.org
Full details: http://commoncrawl.org/jobs/