How do I follow the next-page link using Scrapy and Python?

rommel360 · September 03, 2020, 06:13:27 PM

Code: python


import scrapy

class WitsiSpider(scrapy.Spider):
    name = 'witsi'
    allowed_domains = ['www.quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        citas = response.xpath('//*[@class="quote"]')
        for cita in citas:
            texto = cita.xpath('.//*[@class="text"]/text()').extract_first()
            autor = cita.xpath('.//*[@class="author"]/text()').extract_first()
            palabras_claves = cita.xpath('.//*[@itemprop="keywords"]/@content').extract_first()

            yield {'Texto': texto,
                   'Autor': autor,
                   'Palabras Claves': palabras_claves}

        url_a_continuar = response.xpath('//ul[@class="pager"]/li[@class="next"]/a/@href').extract()
        url_siguiente = response.urljoin(url_a_continuar)
        yield scrapy.Request(url_siguiente, callback=self.parse)

I'm still learning this, and my problem is that I can't get my spider to follow the "next" link; all it does is give me the same data again.
What do I need to change so that the spider follows the link and keeps extracting the information?
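
A minimal sketch of one likely culprit in the spider above, assuming `response.urljoin` behaves like the standard library's `urllib.parse.urljoin`: `.extract()` returns a list of strings, while joining expects a single href, and a page without a "next" link yields an empty list. The helper name `next_page_url` is hypothetical and not part of Scrapy.

```python
from urllib.parse import urljoin


def next_page_url(current_url, hrefs):
    """Return the absolute URL of the next page, or None when there is none.

    `hrefs` stands in for the list returned by `.extract()`; taking only the
    first element mirrors what `.extract_first()` does.
    """
    if not hrefs:
        return None  # no "next" link on this page: nothing left to follow
    return urljoin(current_url, hrefs[0])


print(next_page_url('http://quotes.toscrape.com/', ['/page/2/']))
print(next_page_url('http://quotes.toscrape.com/page/10/', []))
```

In the spider this would correspond to reading the href with `.extract_first()` and yielding the `scrapy.Request` only when the result is not `None`. It may also be worth checking `allowed_domains`: the code lists `www.quotes.toscrape.com`, while `start_urls` points at `quotes.toscrape.com`.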

I'm practicing with the following website:

[link hidden by the forum]

Well, after some research I edited my spider and it now follows the links, but the problem is that it gives me repeated items. What am I doing wrong?

New version

Code: python


import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WitsiSpider(CrawlSpider):
    name = 'witsi'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    rules = (
        Rule(LinkExtractor(allow=r'page/'), callback='parse', follow=True),
    )

    def parse(self, response):
        citas = response.xpath('//*[@class="quote"]')
        for cita in citas:
            texto = cita.xpath('.//*[@class="text"]/text()').extract_first()
            autor = cita.xpath('.//*[@class="author"]/text()').extract_first()
            palabras_claves = cita.xpath('.//*[@itemprop="keywords"]/@content').extract_first()

            yield {'Texto': texto,
                   'Autor': autor,
                   'Palabras Claves': palabras_claves}

            yield

This is part of my spider's output. As you can see, some quotes come out repeated, and the same happens with others.

Code: text


{"Texto": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "Autor": "Eleanor Roosevelt", "Palabras Claves": "misattributed-eleanor-roosevelt"},
{"Texto": "\u201cA day without sunshine is like, you know, night.\u201d", "Autor": "Steve Martin", "Palabras Claves": "humor,obvious,simile"},
{"Texto": "\u201cLife is what happens to us while we are making other plans.\u201d", "Autor": "Allen Saunders", "Palabras Claves": "fate,life,misattributed-john-lennon,planning,plans"},
{"Texto": "\u201cLife is what happens to us while we are making other plans.\u201d", "Autor": "Allen Saunders", "Palabras Claves": "fate,life,misattributed-john-lennon,planning,plans"},
{"Texto": "\u201cLife is what happens to us while we are making other plans.\u201d", "Autor": "Allen Saunders", "Palabras Claves": "fate,life,misattributed-john-lennon,planning,plans"},
{"Texto": "\u201c... a mind needs books as a sword needs a whetstone, if it is to keep its edge.\u201d", "Autor": "George R.R. Martin", "Palabras Claves": "books,mind"},
{"Texto": "\u201cYou have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.\u201d", "Autor": "Madeleine L'Engle", "Palabras Claves": "books,children,difficult,grown-ups,write,writers,writing"},
{"Texto": "\u201cYou have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.\u201d", "Autor": "Madeleine L'Engle", "Palabras Claves": "books,children,difficult,grown-ups,write,writers,writing"},
{"Texto": "\u201cYou have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.\u201d", "Autor": "Madeleine L'Engle", "Palabras Claves": "books,children,difficult,grown-ups,write,writers,writing"},
{"Texto": "\u201cYou have to write the book that wants to be written. And i
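
One way to confirm how often items repeat, sketched with the standard library only. The sample lines are copied from the output above; the function name `count_repeats` is hypothetical, and the sketch assumes one JSON object per line with an optional trailing comma.

```python
import json
from collections import Counter

# Four lines copied from the spider output (raw string keeps the \u escapes
# intact so that json.loads can decode them).
sample_output = r'''
{"Texto": "\u201cA day without sunshine is like, you know, night.\u201d", "Autor": "Steve Martin", "Palabras Claves": "humor,obvious,simile"},
{"Texto": "\u201cLife is what happens to us while we are making other plans.\u201d", "Autor": "Allen Saunders", "Palabras Claves": "fate,life,misattributed-john-lennon,planning,plans"},
{"Texto": "\u201cLife is what happens to us while we are making other plans.\u201d", "Autor": "Allen Saunders", "Palabras Claves": "fate,life,misattributed-john-lennon,planning,plans"},
{"Texto": "\u201cLife is what happens to us while we are making other plans.\u201d", "Autor": "Allen Saunders", "Palabras Claves": "fate,life,misattributed-john-lennon,planning,plans"},
'''


def count_repeats(lines):
    """Return {(author, text): count} for items that appear more than once."""
    counts = Counter()
    for line in lines.splitlines():
        line = line.strip().rstrip(',')  # drop the trailing comma between items
        if not line:
            continue
        item = json.loads(line)
        counts[(item['Autor'], item['Texto'])] += 1
    return {key: n for key, n in counts.items() if n > 1}


repeats = count_repeats(sample_output)
for (author, text), n in repeats.items():
    print(f'{n}x {author}: {text[:40]}')
```

Repeated items like these usually mean the same quote was scraped from more than one URL, which points at the link-following rule rather than at the XPath expressions.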

rommel360 · September 10, 2020, 02:44:51 PM

Well, I solved it. The problem was that my rule was wrong: it was too broad.

Code: python

 Rule(LinkExtractor(allow=r'page/'),callback = 'parse', follow=True ),

It matched every link containing page/, but I only wanted URLs like this one: [link hidden by the forum]

and not ones like [link hidden by the forum], because that form is domain + subfolder + page/, which is no use to me. The right form for my example is domain/ + page/N, that is, page/1, page/2, page/3 and so on up to page/10.

So I changed my rule from allow=r'page/' to allow=r'.com/page/'. That way it only picks up [link hidden by the forum], because .com + page/1, page/2, page/3, etc. matches, while .com/subfolder/subfolder/page/1 no longer does.
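
The difference between the two patterns can be sketched with the standard `re` module, assuming `LinkExtractor` applies `allow` as a regular-expression search over the absolute URL. The two URLs are stand-ins chosen for illustration, since the links in the post are hidden, and the `strict` pattern is an optional tightening that is not part of the original fix.

```python
import re

# Illustrative URLs (assumed): one top-level page and one page under a subfolder.
urls = [
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/tag/love/page/1/',
]

broad = re.compile(r'page/')             # the original rule
narrow = re.compile(r'.com/page/')       # the rule from the fix
strict = re.compile(r'\.com/page/\d+/')  # optional: escaped dot, numeric page


def allowed(pattern, candidates):
    """Return the URLs in which `pattern` is found anywhere (re.search)."""
    return [url for url in candidates if pattern.search(url)]


for name, pattern in [('broad', broad), ('narrow', narrow), ('strict', strict)]:
    print(name, allowed(pattern, urls))
```

The broad pattern accepts both URLs, so the same quotes get scraped again from the nested pages; the narrower patterns accept only the top-level pagination URL. In `.com/page/` the unescaped dot matches any character, which is harmless here but looser than it looks.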

So in the end it looks like this:

Code: python


import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WitsiSpider(CrawlSpider):
    name = 'witsi'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']
    base_url = 'http://quotes.toscrape.com/page/1/'  # not used by the spider itself

    rules = (
        # Follow only top-level pagination links (.com/page/N), not pages nested
        # under a subfolder. The callback is named parse_item because the Scrapy
        # docs advise against using `parse` as a CrawlSpider rule callback.
        Rule(LinkExtractor(allow=r'.com/page/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each quote sits in an element with class="quote".
        citas = response.xpath('//*[@class="quote"]')
        for cita in citas:
            texto = cita.xpath('.//span[@class="text"]/text()').extract_first()
            autor = cita.xpath('.//*[@class="author"]/text()').extract_first()
            palabras_claves = cita.xpath('.//*[@itemprop="keywords"]/@content').extract_first()

            yield {'Texto': texto,
                   'Autor': autor,
                   'Palabras Claves': palabras_claves}

Thanks, everyone, for your help.


September 03, 2020, 06:13:27 PM · Last edited: September 10, 2020, 02:46:32 PM by rommel360
