Web Scraping Only a Specific Domain
I am trying to make a web scraper that, for this example, scrapes
news articles from Reuters.com. I want to get the title and date. I know I
will ultimately just have to pull the source code from each address and
then parse the HTML using something like JSoup.
My question is: How do I ensure I do this for each news article on
Reuters.com? How do I know I have hit all the reuters.com addresses? Are
there any APIs that can help me with this?
Thank you very much, Rich
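One common way to approach the "only this domain" part is a breadth-first crawl that starts from a seed page, keeps a set of visited URLs, and only follows links whose host belongs to the target domain. Below is a minimal sketch of that idea, not a definitive implementation. The names DomainCrawler, inDomain, extractLinks, crawl, fetcher and maxPages are all made up for illustration, and the regex-based link extraction is only a stand-in so the example needs nothing beyond the standard library; in practice a real HTML parser such as JSoup would do that step. The page download is passed in as a function, so any HTTP client can be plugged in.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: a sketch of a domain-restricted crawl.
class DomainCrawler {
    // Crude href matcher (assumption: good enough for a sketch).
    // The character class stops at '#', so fragments are dropped.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // True if the URL's host is the domain itself or one of its subdomains.
    public static boolean inDomain(String url, String domain) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) return false; // e.g. mailto: links have no host
            host = host.toLowerCase();
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
    }

    // Collects href targets from the HTML and resolves them against the page URL.
    public static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // skip links that are not valid URIs
            }
        }
        return links;
    }

    // Breadth-first crawl from the seed. Only in-domain URLs are fetched,
    // each at most once, and the crawl stops after maxPages pages.
    public static Set<String> crawl(String seed, String domain,
                                    Function<String, String> fetcher, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!inDomain(url, domain) || !visited.add(url)) continue;
            String html = fetcher.apply(url); // caller decides how to download
            if (html == null) continue;       // treat null as "could not fetch"
            for (String link : extractLinks(html, url)) {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return visited;
    }
}
```

A call might look like `DomainCrawler.crawl("https://www.reuters.com/", "reuters.com", myFetcher, 1000)`, where `myFetcher` is whatever function downloads a page and returns its HTML. Note that a crawl like this only finds pages reachable through links from the seed, so it cannot prove that every address on the domain was hit. If the site publishes a sitemap, that is usually a more complete list of addresses to start from, and its robots.txt should be respected either way.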