Review and help with Python spiders running on Scrapinghub

Closed · Posted 5 years ago · Paid on delivery

Hello,

I have a list of about 10k URLs that I need to validate. All I have to do is scrape the home page of each website that responds, and discard the ones that don't.
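For reference, this is roughly what the spider looks like (a simplified sketch, not the exact code from the attached zip; the spider name and the example URL are placeholders):

    import scrapy

    class UrlValidatorSpider(scrapy.Spider):
        name = "url_validator"  # placeholder name
        # Stand-in for the real list of ~10k URLs, loaded from a file.
        start_urls = ["http://example.com"]

        def start_requests(self):
            for url in self.start_urls:
                # errback lets me see which URLs failed instead of
                # silently losing them
                yield scrapy.Request(url, callback=self.parse,
                                     errback=self.on_error)

        def parse(self, response):
            # The site responded: keep it.
            yield {"url": response.url, "status": response.status}

        def on_error(self, failure):
            # Timed out / refused: discard it (just log it for now).
            self.logger.info("Failed: %s", failure.request.url)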

However, I've noticed that when I run the spider on Scrapinghub several times in a row, I get inconsistent results, i.e. not the same number of scraped items each time. The difference usually comes from the number of timed-out URLs.

I have raised DOWNLOAD_TIMEOUT to 300 (with RETRY_ENABLED set to False), but I still get a bunch of errors like "[login to view URL] [login to view URL]: User timeout caused connection failure: Getting [login to view URL] took longer than 300.0 seconds."
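The relevant lines from my settings file are simply (excerpt):

    DOWNLOAD_TIMEOUT = 300   # raised from the 180-second default
    RETRY_ENABLED = False    # so a timed-out URL is not retried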

I have tried some of the 'slowest' websites (those with a request duration > 50 seconds) in the browser and they work fine. Even when I run the scraping on a single website from my local machine, it works fine and loads quickly (in less than 2–3 seconds).

Looking at the request logs, I found 300 URLs with a request duration of more than 50 seconds; yet whenever I browse those websites, or launch a spider on just one of those URLs, it responds quickly.

So I isolated the 100 slowest requests (50 seconds or more) and created a new spider with just those URLs.

When I look at this spider's request logs, the request durations are not the same at all: they follow a pattern, climbing from about 200 ms for the first request to around 2000 ms for the last one.
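One thing I plan to verify is whether the logged duration includes time the request spends queued inside Scrapy rather than pure network time, since that would explain the climbing pattern. A rough way to measure it myself (a sketch on top of the spider above; the meta key name is arbitrary):

    import time

    def start_requests(self):
        for url in self.start_urls:
            # Timestamp each request at creation, so the elapsed time
            # below includes any queueing inside Scrapy.
            yield scrapy.Request(url, meta={"t_created": time.time()},
                                 callback=self.parse, errback=self.on_error)

    def parse(self, response):
        elapsed = time.time() - response.meta["t_created"]
        self.logger.info("%s: %.1f s from creation to response",
                         response.url, elapsed)
        yield {"url": response.url, "status": response.status}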

So my final question is: how can I avoid this 'instability'? I need to run these spiders regularly in order to maintain a list of working URLs, and I can't afford to have missing items.
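For what it's worth, these are the settings I'm considering for the next runs, in case the instability comes from too many slow requests being in flight at once (the values are guesses on my part, not tested):

    CONCURRENT_REQUESTS = 16            # fewer requests in flight at once
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
    AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt delays to latency
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 30
    DOWNLOAD_TIMEOUT = 60               # fail fast, then re-run the failures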

I have attached a zip file ([login to view URL]) with all the files supporting my investigation:

- [login to view URL]: 4 identical spiders, giving different results

- [login to view URL]: the [login to view URL] file

- [login to view URL]: an overview of the spider

- [login to view URL]: the stats of the 4 spiders, showing a big difference in the timeouts and HTTP statuses

- [login to view URL]: the request logs of 8830 URLs (see how the request duration goes gradually higher, then cycles back)

- [login to view URL]: an extract of the 100 slowest requests from [login to view URL]

- [login to view URL]: the same spider, running on the 100 'slowest' URLs taken from [login to view URL] (see how the request duration goes gradually higher)

Python Scrapy

Project ID: #17676525

About the project

11 proposals · Remote project · Active 5 years ago

11 freelancers are bidding an average of $32/hour for this job

chirgeo

Hi. OK, I can investigate this issue and see what could be wrong. My guess is that this could also be related to the location the connections are made from. To solve this we may need to use different prox…

$40 USD / hora
(85 reviews)
7.1
polarjin2017

Hi, how are you? I read your project description carefully. Owing to my rich experience in Python and Scrapinghub, I can say I can do this perfectly. I have many top skills like Python, Scrapy, CSS, HTML, PHP, …

$41 USD / hora
(8 reviews)
4.6
roshanasim

I have worked in Python for 5 years. I have developed a mental-health expression-recognition project in Python integrated with Android, natural language processing in Python, regular-expression handling, development…

$42 USD / hora
(14 reviews)
4.4
divkis

Hi, this is a very interesting task and I would like to take it up, the reason being that I am an expert at scraping and have written spiders for Amazon, e-commerce sites, real-estate sites, and social networking site…

$25 USD / hora
(0 reviews)
0.0