I’m trying to scrape the ads from Ask, which are generated in an iframe by a JS hosted by Google.
When I manually navigate my way through, and view source, there they are (I’m specifically looking for a div with the id “adBlock”, which is in an iframe).
But when I try using Firefox, Chromedriver or FirefoxPortable, the source returned to me is missing all of the elements I’m looking for.
I tried scraping with urllib2 and got the same results, even after adding the necessary headers. I thought for sure that a real browser instance, like the one WebDriver creates, would have fixed that problem.
Here’s the code I’m working off of, which had to be cobbled together from a few different sources:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint

# Create a new instance of the Chrome driver
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
driver.get("http://www.ask.com")
print(driver.title)
inputElement = driver.find_element_by_name("q")
# type in the search
inputElement.send_keys("baseball hats")
# submit the search form
inputElement.submit()
try:
    # wait until the results page has loaded
    WebDriverWait(driver, 10).until(EC.title_contains("baseball"))
    print(driver.title)
    output = driver.page_source
    print(output)
finally:
    driver.quit()
I know I cycle through a few different attempts at viewing the source; that’s not what I’m concerned about.
Any thoughts as to why I’m getting one result from this script (ads omitted) and a totally different result (ads present) from the browser it opened? I’ve tried Scrapy, Selenium, urllib2, etc. No joy.
Selenium only returns the contents of the current frame or iframe, so page_source on the top-level document will not include anything rendered inside an iframe. You’ll have to switch into each iframe, using something along these lines:
iframes = driver.find_elements_by_tag_name("iframe")
for iframe in iframes:
    # return to the top-level document before entering the next iframe
    driver.switch_to_default_content()
    driver.switch_to_frame(iframe)
    output = driver.page_source
    print(output)
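If you only care about the frame that holds the ad container, the same loop can be wrapped in a small helper. This is a minimal sketch, not a tested solution for Ask specifically: the function name find_frame_source_with_id is made up for illustration, it only looks at top-level iframes (not nested ones), and it uses the same older Selenium method names as the snippets above. Newer Selenium releases spell these driver.switch_to.frame(...), driver.switch_to.default_content() and driver.find_elements(By.ID, ...).

```python
def find_frame_source_with_id(driver, element_id):
    """Return the page source of the first top-level iframe that contains
    an element with the given id, or None if no iframe contains it.

    Hypothetical helper; `driver` is assumed to be a Selenium WebDriver
    exposing the older find_elements_by_* / switch_to_* method names.
    """
    iframes = driver.find_elements_by_tag_name("iframe")
    for iframe in iframes:
        # Always go back to the top-level document first, so the iframe
        # element found above can still be used to switch frames.
        driver.switch_to_default_content()
        driver.switch_to_frame(iframe)
        # find_elements_* returns an empty list instead of raising
        # when nothing matches, so it is safe to use as a test.
        if driver.find_elements_by_id(element_id):
            source = driver.page_source
            driver.switch_to_default_content()
            return source
    # Leave the driver on the top-level document when nothing matched.
    driver.switch_to_default_content()
    return None
```

With the driver from the question, this would be called as find_frame_source_with_id(driver, "adBlock") after the results page has loaded. Since the ads are injected by JavaScript, the iframe may not be populated yet at that point, so an extra wait before calling it may be needed.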