Help with my Scrapy scraper, please

lemos.ema · June 27, 2019, 02:48:17 PM

How is everyone doing? I hope you're all doing great.
For a few days now I've been struggling with a scraper I want to build for the propertyfinder.ae search page (the URL used as start_urls in the code below).
My idea is to open each of the houses and extract information from them, scraping both vertically and horizontally (every house on the page, up to page 10; so if a page has 30 houses, that would be 300 houses to scrape in total).
I started out with Selenium, but when I realized it takes forever to open a page, pull the data, close the page, and repeat that for every link, I decided to migrate to Scrapy. It runs fast; the problem is that I don't know why it behaves however it wants, hehe. The code isn't picking up the houses in order (from the top of the page down to the bottom), and it doesn't pick up all the houses on the page either. The good thing about Selenium was that I could get data such as the email and the phone number. With Scrapy that data is also in the response, in a "</script>,<script> window.propertyfinder.settings.moreProperties" block at the very bottom, but I don't know how to get it out of there. Another thing I'd really like is to be able to extract the data from the chart.
I'm posting the code so you can check it. I think I'm doing something wrong in the rules.
Thanks a lot, everyone.

Code: python

from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class PropietiItem(Item):
    titulo = Field()
    tipo = Field()
    reference = Field()
    agente = Field()
    company = Field()
    orn = Field()
    brn = Field()
    precio = Field()
    tipodepropiedad = Field()
    trakheesi = Field()
    bedrooms = Field()
    bathrooms = Field()
    furnishings = Field()
    area = Field()
    amenities = Field()
    fecha = Field()
    descripcion = Field()
    trendsandprice = Field()
    averagerent = Field()
    averagesize = Field()
    script = Field()
    grafico = Field()
    telefono = Field()
    mail = Field()


class PropietiCrawler(CrawlSpider):
    name = "MiPrimerCrawler"
    start_urls = ['https://www.propertyfinder.ae/en/search?c=2&l=1&ob=nd&page=1']
    allowed_domains = ['propertyfinder.ae']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//a [@class="pagination__link pagination__link--next"]')), follow=True),
        Rule(LinkExtractor(restrict_xpaths=('//div [@class="card-list__item"]//a [@class="card card--clickable"]')), follow=True, callback='parse_items'),
    )

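    def parse_items(self, response):
        # Pass the response by keyword: ItemLoader's second positional argument is `selector`.
        item = ItemLoader(item=PropietiItem(), response=response)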
        item.add_xpath('titulo','/html/body/main/div[1]/div/div[2]/div[2]/div[1]/div/h1/text()')
        item.add_xpath('tipo','/html/body/main/div[1]/div/div[2]/div[2]/div[1]/div/div/h2/text()')
        item.add_xpath('reference','/html/body/main/div[1]/div/div[2]/div[2]/div[1]/div/div/div/strong/text()')
        item.add_xpath('agente','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[2]/div[1]/div[1]/div[2]/div[1]/div[2]/div/text()')
        item.add_xpath('company','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[2]/div[1]/div[1]/div[2]/div[2]/div[2]/text()')
        item.add_xpath('orn','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[2]/div[1]/div[1]/div[2]/div[3]/div/text()')
        item.add_xpath('brn','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[2]/div[1]/div[1]/div[2]/div[4]/div/text()')
        item.add_xpath('precio','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[1]/div[2]/div/div/span[1]/text()')
        item.add_xpath('tipodepropiedad','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[2]/div[2]/text()')
        item.add_xpath('trakheesi','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[4]/div[2]/text()')
        item.add_xpath('bedrooms','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[5]/div[2]/text()')
        item.add_xpath('bathrooms','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[6]/div[2]/text()')
        item.add_xpath('furnishings','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[7]/div[2]/text()')
        item.add_xpath('area','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[2]/div[1]/div/div/div[8]/div[2]/text()')
        item.add_xpath('amenities','//div [@class="amenities__content"]/text()')
        item.add_xpath('fecha','//div [@class="last-update"]/text()')
        item.add_xpath('descripcion','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[4]/text()')
        item.add_xpath('trendsandprice','//span [@class="market-trends__sub-heading-value"]/text()')
        item.add_xpath('averagerent','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[8]/div/div[3]/div[1]/div[1]/div[2]/strong/text()')
        item.add_xpath('averagesize','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[8]/div/div[3]/div[1]/div[2]/div[2]/strong/text()')
        item.add_xpath('script','/html/body/script[5]/text()')
        item.add_xpath('grafico','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[1]/div[8]/div/div[2]/div/div/div/svg/g[12]/g/text()')
        item.add_xpath('telefono','/html/body/main/div[1]/div/div[2]/div[2]/div[3]/div[2]/div[1]/div[2]/div[2]/div[1]/div/div/div/span/span[2]/text()')
        yield item.load_item()
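
For the phone and email that only show up inside the window.propertyfinder.settings.moreProperties script, one option is to take the script text and parse the assigned value as JSON, rather than addressing the tag by position. What follows is a minimal sketch, not a verified solution: it assumes the script contains an assignment of the form `window.propertyfinder.settings.moreProperties = {...};` with valid JSON on the right-hand side, which has not been checked against the live page. The helper name `extract_assigned_json` and the sample keys `phone` and `email` are hypothetical.

```python
import json
import re


def extract_assigned_json(script_text, variable):
    """Return the JSON value assigned to `variable` in a JavaScript snippet.

    Assumes the right-hand side is valid JSON. Returns None if the
    variable is not found or the value cannot be decoded.
    """
    match = re.search(re.escape(variable) + r"\s*=\s*", script_text)
    if match is None:
        return None
    try:
        # raw_decode parses a single JSON value starting at the given index
        # and ignores whatever follows it (the ";" and the rest of the script).
        value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return value


# Made-up sample standing in for the text of the <script> tag.
sample = (
    'window.propertyfinder.settings.moreProperties = '
    '{"phone": "+000 0000", "email": "agent@example.com"}; var x = 1;'
)
data = extract_assigned_json(sample, "window.propertyfinder.settings.moreProperties")
print(data["email"])  # agent@example.com
```

Inside the spider, the script text could be selected by content instead of by index, for example with `response.xpath('//script[contains(., "moreProperties")]/text()').get()`, so the lookup does not break if the number of script tags changes.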

??????? · June 27, 2019, 09:57:46 PM

Hi, you could use the BeautifulSoup module for this if you like; you'll find it easier. Here is a tutorial you can look over:

Scraping con BeautifulSoup

I hope it helps. Cheers!

lemos.ema · June 28, 2019, 02:30:55 AM

Thanks, but I think doing it with BeautifulSoup would take a long time for a full scrape of the whole site. That page also doesn't explain how to open each link and scrape each individual listing.
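
On the speed concern: in a crawl like this most of the time goes to waiting on the network rather than to parsing, so issuing requests concurrently tends to matter more than the choice of parser. A minimal sketch using a thread pool from the standard library follows. It assumes a hypothetical `fetch` callable that downloads one URL (for example a thin wrapper around `requests.get`); `fetch_all` and the stand-in `fake_fetch` are names invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Download every URL concurrently and return the results in input order.

    `fetch` is any callable that takes a URL and returns its content.
    Executor.map yields results in the order of `urls`, regardless of
    which download finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline.
    return "<html>%s</html>" % url


pages = fetch_all(["page-1", "page-2", "page-3"], fake_fetch)
print(pages)
```

Because the results come back in input order, the listings can be kept in the same order as on the page, something Scrapy's concurrent scheduling does not guarantee. A small `max_workers` value keeps the load on the site modest.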

K A I L · June 28, 2019, 01:38:09 PM

Hi @lemos.ema, I tried the tutorial you were pointed to (Scraping con BeautifulSoup) with a couple of modifications.

It took me 2 seconds to get 25 results (I only pulled 3 fields; you'll need to make your own changes depending on what you want to extract).

Here is the code in case you have questions; I only changed the tutorial's example slightly.

Code: python

from bs4 import BeautifulSoup
import requests

URL = "https://www.propertyfinder.ae/en/search?c=2&q=&t=&rp=y&pf=&pt=&bf=&bt=&af=&at=&fu=0&kw="

# Request the search page
req = requests.get(URL)

# Check that the request returned status code 200
status_code = req.status_code
if status_code == 200:

    # Load the page's HTML into a BeautifulSoup object
    html = BeautifulSoup(req.text, "html.parser")

    # Get every div that holds a listing
    entradas = html.find_all('div', {'class': 'card-list__item'})

    # Loop over the listings to extract the title, location and price
    for i, entrada in enumerate(entradas):
        # getText() returns the text content without the HTML tags
        titulo = entrada.find('h2', {'class': 'card__title card__title-link'}).getText()
        ubicacion = entrada.find('p', {'class': 'card__location'}).getText()
        precio = entrada.find('span', {'class': 'card__price-value'}).getText()

        # Print the title, location and price of each listing
        print("%d - %s  |  %s  |  %s" % (i + 1, titulo, ubicacion, precio))

else:
    print("Status Code %d" % status_code)
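
The script above only reads the search results page. To open each listing, one option is to collect the card links first and then request them one by one, following the "next" link for pagination. A minimal sketch of the link-collection step follows. It assumes the class names used in the XPaths of the first post (`card card--clickable` and `pagination__link pagination__link--next`) and uses made-up sample HTML and hrefs; `CardLinkParser` is a hypothetical name. To stay dependency-free it relies on the standard library's `html.parser`; the same selection could be done with BeautifulSoup's `find_all`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CardLinkParser(HTMLParser):
    """Collect listing links and the next-page link from a search page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.listing_urls = []
        self.next_page_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        classes = (attributes.get("class") or "").split()
        if not href:
            return
        if "card--clickable" in classes:
            # Relative hrefs are resolved against the search page URL.
            self.listing_urls.append(urljoin(self.base_url, href))
        elif "pagination__link--next" in classes:
            self.next_page_url = urljoin(self.base_url, href)


# Made-up HTML standing in for one downloaded search page.
sample_html = """
<div class="card-list__item"><a class="card card--clickable" href="/en/rent/villa-1.html">A</a></div>
<div class="card-list__item"><a class="card card--clickable" href="/en/rent/villa-2.html">B</a></div>
<a class="pagination__link pagination__link--next" href="/en/search?c=2&amp;page=2">Next</a>
"""

parser = CardLinkParser("https://www.propertyfinder.ae/en/search?c=2&page=1")
parser.feed(sample_html)
print(parser.listing_urls)
print(parser.next_page_url)
```

Each URL in `listing_urls` would then be requested and parsed the same way as the search page, and `next_page_url` followed until it is `None` or page 10 has been reached.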

I recommend reading the Beautiful Soup documentation.
I hope this helps.

Cheers!
K A I L

First post last edited: June 27, 2019, 06:15:13 PM by Gabriela