Web Scraping for Journalists Fall 2018

Students will learn how to use JavaScript/jQuery to pull formatted data off public-facing websites in an ethical and sensible way. Being able to pull data from websites allows journalists to store the information in a way that's suitable for analysis, interactivity or simply tracking changes over time.

TC McCarthy | [email protected] | 5:30 - 8:20 Tuesday | Room 432


Schedule


Class 1 - Intro

Welcome! What is scraping? Why do we do it? NodeJS opens scraping up to more journalists because everything is written in JS. Learning JS helps you build data visualizations, enhance the user experience on a web page and now scrape data.

Install NodeJS
Install MAMP

Code Academy: Intro to JS course

Code Academy has a great intro course that covers much of what you'll need to know about JavaScript in order to write scrapers in NodeJS. Complete the training before next class, where we will dive deep into JS syntax.

Class 2 - Intro to JS

Code Academy: Intro to JS course

Strings, objects, arrays and functions -- what do they all mean? A deep dive into the skills you learned from your Code Academy lesson.
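As a minimal sketch of those four building blocks (the names and values below are made up for illustration and are not taken from the Code Academy course):

```javascript
// A string: text wrapped in quotes.
const headline = "City Council Approves Budget";

// An array: an ordered list of values.
const bylines = ["A. Reporter", "B. Editor"];

// An object: named properties grouped together.
const story = {
  headline: headline,
  bylines: bylines,
  wordCount: 850,
};

// A function: reusable code that takes input and returns output.
function summarize(item) {
  return item.headline + " (" + item.bylines.length + " bylines, " + item.wordCount + " words)";
}

console.log(summarize(story));
```

Scraped data usually ends up in this shape: an array of objects, one object per item on the page.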

Code Academy: Selector intro

Being able to target specific HTML elements is essential for applying styles, adding effects and many other tasks. Targeting HTML elements is called "selecting".

Code Academy's intro lesson should get you started on understanding the nuances of element selectors.

NO CLASS

NO CLASS

Class 3 - Intro to the DOM

Code Academy: Selector intro

HTML is written for people; the DOM is what it becomes when your browser interprets it. Being able to communicate with the DOM is critical for scraping, as it allows you to select the elements on a page.
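One way to picture this is a toy model of the tree a browser builds. The sketch below is not the real DOM API: the node shape (`tag`, `text`, `children`) and the `findByTag` helper are hypothetical, chosen only to show that selecting means walking a tree.

```javascript
// A toy stand-in for the DOM: each node has a tag, optional text, and children.
// This models <body><h1>Budget</h1><p>First</p><div><p>Second</p></div></body>.
const page = {
  tag: "body",
  children: [
    { tag: "h1", text: "Budget", children: [] },
    { tag: "p", text: "First", children: [] },
    { tag: "div", children: [{ tag: "p", text: "Second", children: [] }] },
  ],
};

// Walk the tree depth-first and collect every node with a matching tag.
function findByTag(node, tag) {
  const matches = node.tag === tag ? [node] : [];
  for (const child of node.children) {
    matches.push(...findByTag(child, tag));
  }
  return matches;
}

console.log(findByTag(page, "p").map((node) => node.text));
```

Note that the second paragraph is found even though it sits inside a div. A selector describes what you want, not where it is nested.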

Code Academy: Intro to jQuery

jQuery is a JS library widely adopted by news developers because of its ease of use, familiar selection process and wide-ranging methods for DOM manipulation.

Code Academy's course introduces you to jQuery and its power.

Class 4 - Intro to jQuery

Code Academy: Intro to jQuery

jQuery is a wonderful JS library that has been ported to NodeJS. Knowing jQuery will make scraping much easier, as you can use conventional DOM selectors to find your elements and standard jQuery methods to capture text, numbers and images from a page.

Brainstorm ideas for a scraper

Begin compiling ideas for a scraping project. Your ideas should:

  • Have a journalistic purpose
  • Not already exist as a data feed
  • Apply only to sites that don't require login
  • Not violate basic journalistic ethics

Please write down three ideas and have them ready for discussion. Preliminary research is necessary.

Class 5 - Starting with NodeJS

Brainstorm ideas for a scraper

Working with Node for scraping will draw on all of your previous lessons. Learn about setting up a Node script and run your first hello world! Then we'll get started by scraping a Wikipedia page and discuss your scraping ideas.
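A first script might look something like the sketch below. The filename `hello.js`, the sample markup and the `extractHeadings` helper are all made up for illustration. It works on a string rather than a live page, and it uses a regular expression only because the sample is tidy; the jQuery-style selectors covered earlier are the sturdier approach on real pages.

```javascript
// First script: confirm Node runs your file (save as hello.js, run `node hello.js`).
console.log("Hello, world!");

// Made-up sample markup standing in for a downloaded page.
const sampleHtml = `
  <h1>Example article</h1>
  <h2>History</h2><p>Founded long ago.</p>
  <h2>Geography</h2><p>Mostly flat.</p>
`;

// Collect the text of every <h2>. A regex is enough for this tidy sample,
// but it is brittle on real pages with attributes or nested tags.
function extractHeadings(html) {
  const pattern = /<h2>(.*?)<\/h2>/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

console.log(extractHeadings(sampleHtml));
```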

Scrape Wikipedia

Write a scraper that pulls all of the section headings and the first paragraph of each section from a Wikipedia article of your choosing and outputs them to the console. Have the script ready for code review and demonstration in class.

Class 6 - Store that data!

Scrape Wikipedia

Demo and code review of your Wikipedia scrapers. Then we talk about databases -- storing data in a database is fairly simple once you know the syntax and how to plan a proper table. This is arguably the whole point of scraping.
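To give a rough sense of what planning a table and storing a row can involve, here is a sketch that only builds the SQL text and never connects to a database. The `headlines` table, its columns and the `buildInsert` helper are hypothetical, and the syntax assumes MySQL (which MAMP bundles); the `?` placeholder style is an assumption about whichever Node database driver you end up using.

```javascript
// Hypothetical table plan: one row per scraped item, plus when it was scraped.
const createTableSql = `
CREATE TABLE IF NOT EXISTS headlines (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  url VARCHAR(2048) NOT NULL,
  scraped_at DATETIME NOT NULL
)`;

// Turn a scraped record into a parameterized INSERT: the SQL text carries
// only `?` placeholders, and the values travel separately so the driver
// can escape them. The table and column names are NOT escaped here, so
// they should come from your own code, never from a scraped page.
function buildInsert(table, record) {
  const columns = Object.keys(record);
  const placeholders = columns.map(() => "?").join(", ");
  return {
    sql: `INSERT INTO ${table} (${columns.join(", ")}) VALUES (${placeholders})`,
    values: columns.map((column) => record[column]),
  };
}

const insert = buildInsert("headlines", {
  title: "Example headline",
  url: "https://example.com/story",
  scraped_at: "2018-10-02 17:30:00",
});
console.log(insert.sql);
console.log(insert.values);
```

Recording a timestamp with every row is what makes it possible to track changes over time.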

Begin building your scraper

Take the next two weeks to begin working on your scraper. Push yourselves to get it pulling the data you want and storing it in a database.

Class 7 - Troubleshooting your scraper

Begin building your scraper

In-class coding time for troubleshooting, strategy, etc.

Continue working on your scraper

Class 8 - Workshop

Continue working on your scraper

In-class time will be provided to work on your scraper.

Continue working on your scraper

Class 9 - Finishing touches

Continue working on your scraper

Use your in-class time to solve remaining bugs and logistical issues. If your scraper needs to collect data repeatedly, ask about scheduling a cron job so that you have a decent amount of data to show in the final class.

Continue working on your scraper

Class 10 - Final presentations

We'll do a code review and demonstration of your scraper. Then you'll show your data table so we can see the data structure and the results of a successful scrape.

Policies

Attendance

If circumstances prevent your attending class, the instructor must be informed by phone or email on or before the day of class or within 24 hours afterward.

Deadlines in Journalism Matter

You must meet deadlines by filing your assignments no later than the due date and time. Missing a deadline results in an automatic half-grade reduction. Your grade will continue to drop by half a grade for each subsequent day after the deadline until you file your assignment. Plan ahead and remember that in journalism, done is better than perfect. It will always be better to hand in something than nothing. If you are having trouble with your assignment, let me know immediately; do not wait until it is too late.

Plagiarism

It is a serious ethical violation to take any material created by another person and represent it as your own original work. Any such plagiarism will result in serious disciplinary action, including possible dismissal from the CUNY J-School. Plagiarism may involve copying text from a book or magazine without attributing the source, or lifting words, photographs, videos, or other materials from the Internet and attempting to pass them off as your own. Student work may be analyzed electronically for plagiarized content. Please use comments in your HTML/CSS/JS to attribute code snippets you have found on support forums or elsewhere. For example:

Sample markup happens here
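A minimal sketch of what such an attribution comment might look like in JS. The forum URL, the author handle and the function itself are placeholders, not real sources:

```javascript
// Adapted from a support-forum answer by "exampleUser" (placeholder name):
// https://example.com/questions/12345 (placeholder URL)
// Changes from the original: renamed variables, added the empty-string case.
function capitalize(word) {
  if (word.length === 0) return "";
  return word[0].toUpperCase() + word.slice(1);
}

console.log(capitalize("scraper"));
```

The comment says where the code came from and what you changed, so a reviewer can tell your work from the borrowed part.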

Communication

You can email me at [email protected]. The class also has a Slack team, which you must sign up for using your journalism.cuny.edu email address.

Grading

Final scraper: 50%

Homework: 25%

Participation/Attendance: 25%