I’m trying to scrape the ads from Ask, which are generated in an iframe by a JS hosted by Google.
When I manually navigate my way through, and view source, there they are (I’m specifically looking for a div with the id “adBlock”, which is in an iframe).
But when I try using Firefox, Chromedriver or FirefoxPortable, the source returned to me is missing all of the elements I’m looking for.
I tried scraping with urllib2 and had the same results, even when adding in the necessary headers. I thought for sure that a physical browser instance like Webdriver creates would have fixed that problem.
Here’s the code I’m working off of, which had to be cobbled together from a few different sources:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint

# Create a new instance of the Chrome driver
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
driver.get("http://www.ask.com")
print(driver.title)

inputElement = driver.find_element_by_name("q")
# type in the search
inputElement.send_keys("baseball hats")
# submit the form (although google automatically searches now without submitting)
inputElement.submit()

try:
    # wait until the results page has loaded, then dump its source
    WebDriverWait(driver, 10).until(EC.title_contains("baseball"))
    print(driver.title)
    output = driver.page_source
    print(output)
finally:
    driver.quit()
I know I cycle through a few different attempts at viewing the source, but that’s not what I’m concerned about.
Any thoughts as to why I’m getting one result from this script (ads omitted) and a totally different result (ads present) from the browser it opened in? I’ve tried Scrapy, Selenium, Urllib2, etc. No joy.
Selenium only returns the contents of the current frame, so page_source for the top-level page does not include what is rendered inside an iframe. You’ll have to switch into each iframe, using something along these lines:
iframes = driver.find_elements_by_tag_name("iframe")
for iframe in iframes:
    # go back to the top-level page before entering the next iframe
    driver.switch_to_default_content()
    driver.switch_to_frame(iframe)
    output = driver.page_source
    print(output)
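If you only want the frame that actually holds the ad div, the same loop can stop at the first iframe containing it. The following is a minimal sketch, not a tested scraper for Ask: the helper name source_of_frame_containing is made up for illustration, it assumes a Selenium version that has driver.switch_to.frame and the generic driver.find_elements(by, value) call, and it only looks at top-level iframes (ads nested in a frame inside a frame would need a recursive version). The strings "tag name" and "id" are the values of By.TAG_NAME and By.ID.

```python
def source_of_frame_containing(driver, element_id):
    """Return the page source of the first top-level iframe that contains
    an element with the given id, or None if no iframe contains it."""
    driver.switch_to.default_content()
    # "tag name" is the value of selenium's By.TAG_NAME
    frames = driver.find_elements("tag name", "iframe")
    for frame in frames:
        # always start from the top-level page before entering a frame
        driver.switch_to.default_content()
        driver.switch_to.frame(frame)
        # "id" is the value of selenium's By.ID; an empty list means no match
        if driver.find_elements("id", element_id):
            source = driver.page_source
            driver.switch_to.default_content()
            return source
    # nothing found: leave the driver on the top-level page
    driver.switch_to.default_content()
    return None
```

With the driver from the question, it would be called as source_of_frame_containing(driver, "adBlock") once the results page has loaded. Because the iframe is filled in by JavaScript, it may still come back as None if the call runs before the ad script has finished.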