Schedule
Class 1 - Intro
Welcome! What is scraping? Why do we do it? NodeJS opens scraping up to more journalists because it’s all JS. Learning JS helps you build data visualizations, enhance the user experience on a web page, and now, scrape data. Install NodeJS. Install MAMP.
Codecademy: Intro to JS course
Codecademy has a great intro course that covers much of what you'll need to know about JavaScript in order to write scrapers in NodeJS. Complete the training before next class, where we will dive deep into JS syntax.
Class 2 - Intro to JS
Codecademy: Intro to JS course
Strings, objects, arrays and functions -- what do they all mean? A deep dive into the skills you learned from your Codecademy lesson.
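As a rough illustration of the four building blocks named above, here is a minimal sketch. The headline, tags and word count are made-up sample values, and `summarize` is a hypothetical helper, not something from the Codecademy course.

```javascript
// A string: text wrapped in quotes.
const headline = 'City council approves budget';

// An array: an ordered list of values, indexed from 0.
const tags = ['politics', 'budget', 'local'];

// An object: named properties grouped together.
const story = {
  headline: headline,
  tags: tags,
  wordCount: 540
};

// A function: reusable code that takes input and returns output.
function summarize(item) {
  return item.headline.toUpperCase() + ' (' + item.tags.length + ' tags)';
}

console.log(summarize(story));
```

Scraped data usually ends up in this shape: an array of objects, one object per record, with strings and numbers as the property values.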
Codecademy: Selector intro
Being able to target your various HTML elements is very important for applying styles, adding effects and much more. Targeting HTML elements is called "selecting".
Codecademy's intro lesson should get you started on the nuances of working out element selectors.
NO CLASS
NO CLASS
Class 3 - Intro to the DOM
Codecademy: Selector intro
HTML is for people; the DOM is what it becomes when your browser interprets it. Being able to communicate with the DOM is critical for scraping, as it allows you to select the elements on a page.
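To make the idea of a tree of elements concrete, here is a toy sketch that models a page as nested plain objects and walks it to find elements by tag. The node shape (`tag`, `text`, `children`) and the `findByTag` helper are assumptions made for illustration; the real DOM that a browser builds is far richer.

```javascript
// Toy model of a DOM tree: each node has a tag, optional text, and children.
// This shape is an assumption for illustration, not the real browser API.
const page = {
  tag: 'html',
  children: [
    {
      tag: 'body',
      children: [
        { tag: 'h1', text: 'Budget report', children: [] },
        { tag: 'p', text: 'First paragraph.', children: [] },
        { tag: 'p', text: 'Second paragraph.', children: [] }
      ]
    }
  ]
};

// Walk the tree and collect the text of every matching node that has text.
function findByTag(node, tag) {
  const matches = node.tag === tag && node.text ? [node.text] : [];
  for (const child of node.children) {
    matches.push(...findByTag(child, tag));
  }
  return matches;
}

console.log(findByTag(page, 'p'));
```

Selecting elements on a real page follows the same idea: start at the top of the tree and narrow down to the nodes you want.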
Codecademy: Intro to jQuery
jQuery is a JS library widely adopted by news developers because of its ease of use, familiar selection process and wide-ranging methods for DOM manipulation.
Codecademy's course introduces you to jQuery and its power.
Class 4 - Intro to jQuery
Codecademy: Intro to jQuery
jQuery is a wonderful JS library that has been ported to NodeJS. Knowing jQuery will make scraping much easier, as you can use conventional DOM selectors to find your elements and standard jQuery methods to capture text, numbers and images from a page.
Brainstorm ideas for a scraper
Begin compiling ideas for a scraping project. Your ideas should:
- Have a journalistic purpose
- Not already exist as a data feed
- Apply only to sites that don't require login
- Not violate basic journalistic ethics
Class 5 - Starting with NodeJS
Brainstorm ideas for a scraper
Working with Node for scraping will bring together all of your previous lessons. Learn about setting up a Node script and write your first hello world! Then we'll get started by scraping a Wikipedia page and discuss your scraping ideas.
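A first Node script can be as small as the sketch below. The filename `hello.js` and the `greet` function are hypothetical; any name will do.

```javascript
// hello.js (hypothetical filename): run from a terminal with `node hello.js`
function greet(name) {
  return 'Hello, ' + name + '!';
}

// console.log writes to the terminal, the same way it writes to the
// browser console on a web page.
console.log(greet('world'));
```

If the greeting appears in your terminal, Node is installed correctly and you are ready to write a scraper.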
Scrape Wikipedia
Write a scraper that pulls all of the topic headlines and the first paragraph for each section from a Wikipedia article of your choosing and outputs them to the console. Have the script ready for code review and demonstration in class.
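As a starting point for thinking about the output, here is a dependency-free sketch that pairs each heading with the first paragraph after it. It runs on a hard-coded sample string rather than a live page, and it uses a regular expression instead of a proper HTML parser; both are simplifying assumptions. The markup in `sampleHtml` is invented and is much simpler than what a real Wikipedia article contains.

```javascript
// Invented sample markup; a real article is far messier than this.
const sampleHtml = `
<h2>History</h2>
<p>The town was founded in 1850.</p>
<p>It grew quickly.</p>
<h2>Geography</h2>
<p>The town sits on a river.</p>
`;

// Pair each <h2> heading with the first <p> that follows it.
// Limitation: a heading with no paragraph before the next heading
// would be paired with a later section's paragraph.
function extractSections(html) {
  const sections = [];
  const pattern = /<h2[^>]*>(.*?)<\/h2>[\s\S]*?<p[^>]*>(.*?)<\/p>/g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    sections.push({ heading: match[1], firstParagraph: match[2] });
  }
  return sections;
}

for (const section of extractSections(sampleHtml)) {
  console.log(section.heading + ': ' + section.firstParagraph);
}
```

A real scraper would fetch the page over the network and select elements with jQuery-style selectors, which cope with nested tags and attributes far better than a regular expression.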
Class 6 - Store that data!
Scrape Wikipedia
Demo and code review of your Wikipedia scrapers. Then we talk about databasing -- storing data in a database is fairly simple once you know the syntax and how to plan a proper table. Storing the data is arguably the whole point of scraping.
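To give a feel for what planning a table and writing an insert might involve, here is a sketch that only builds SQL strings and does not connect to a database. It assumes MySQL-style syntax (MAMP bundles MySQL) and `?` placeholders, which several Node MySQL drivers accept. The `sections` table, its columns and the `buildInsert` helper are all hypothetical.

```javascript
// Hypothetical table plan for scraped article sections (MySQL-style syntax).
const createTableSql =
  'CREATE TABLE sections (' +
  'id INT AUTO_INCREMENT PRIMARY KEY, ' +
  'heading VARCHAR(255) NOT NULL, ' +
  'first_paragraph TEXT, ' +
  'scraped_at DATETIME NOT NULL)';

// Build a parameterized INSERT so the values travel separately from the SQL.
// Table and column names must come from your own code, never from scraped text.
function buildInsert(table, row) {
  const columns = Object.keys(row);
  const placeholders = columns.map(() => '?').join(', ');
  return {
    sql: 'INSERT INTO ' + table + ' (' + columns.join(', ') + ') VALUES (' + placeholders + ')',
    values: columns.map((column) => row[column])
  };
}

const insert = buildInsert('sections', {
  heading: 'History',
  first_paragraph: 'The town was founded in 1850.',
  scraped_at: '2024-01-01 00:00:00'
});

console.log(createTableSql);
console.log(insert.sql, insert.values);
```

Keeping values out of the SQL string matters for scrapers in particular, because scraped text often contains quotes and other characters that would break a hand-assembled query.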
Begin building your scraper
Take the next two weeks to begin working on your scraper. Push yourselves to get it pulling the data you want and storing it in a database.
Class 7 - Troubleshooting your scraper
Begin building your scraper
In-class coding time for troubleshooting, strategy, etc.
Continue working on your scraper
Workshop
Continue working on your scraper
In-class time will be provided to work on your scraper.
Continue working on your scraper
Class 9 - Finishing touches
Continue working on your scraper
Use your in-class time to solve remaining bugs and logistical issues. If your scraper needs to collect data repeatedly, ask about scheduling a cron job so that you have a decent amount of data to show in the final class.
Continue working on your scraper
Class 10 - Final presentations
We'll do a code review and demonstration of your scraper. Then you'll show your data table so we can see the data structure and the results of a successful scrape.
Policies
Attendance
If circumstances prevent your attending class, the instructor must be informed by phone or email on or before the day of class or within 24 hours afterward.
Deadlines in Journalism Matter
You must meet deadlines by filing your assignments no later than the due date and time. Missing a deadline results in an automatic half-grade reduction. Your grade will continue to drop by half a grade for each subsequent day until you file your assignment. Plan ahead and remember that in journalism, done is better than perfect. It will always be better to hand in something than nothing. If you are having trouble with your assignment, let me know immediately; do not wait until it is too late.
Plagiarism
It is a serious ethical violation to take any material created by another person and represent it as your own original work. Any such plagiarism will result in serious disciplinary action, including possible dismissal from the CUNY J-School. Plagiarism may involve copying text from a book or magazine without attributing the source, or lifting words, photographs, videos, or other materials from the Internet and attempting to pass them off as your own. Student work may be analyzed electronically for plagiarized content. Please use comments in your HTML/CSS/JS to attribute code snippets you have found on support forums or elsewhere. For example:
Sample markup happens here
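One way such an attribution might look in a JS file is sketched below. The author and URL are placeholders to fill in with the real source, and `capitalize` is a hypothetical snippet used only to show where the comment goes.

```javascript
// Adapted from a support-forum answer by <author name> at <URL of the post>.
// Changes: renamed variables and added the empty-string check.
function capitalize(word) {
  if (word.length === 0) {
    return word;
  }
  return word.charAt(0).toUpperCase() + word.slice(1);
}

console.log(capitalize('scraper'));
```

The same pattern works in HTML and CSS using their own comment syntax: say where the code came from, who wrote it, and what you changed.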
Communication
You can email me at [email protected]. The class also has a Slack team, which you must sign up for using your journalism.cuny.edu email address.