I'm trying to create a little script that will grab photos from a specific site ("ilike-photo.com") and save them to Google Drive. But this site uses a technique called lazy loading: more images are dynamically loaded as the user scrolls down. One way to work around this is to call window.scroll() repeatedly until no more content is loaded, and only then grab the images, but this technique is slow and ugly because the user actually sees the page being scrolled. Is there a way to force the dynamic content to load while keeping the scrollbar (and the page) at the top?

All I can think of is faking a scroll event somehow, but I'm not sure that's possible.
Maybe there is a way to find the function that listens for the scroll event and call it manually?
The last option I can think of is running a headless browser on a server and having it serve my requests, but the problem is that I don't have a server. :)
Any other suggestions?
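For what it's worth, the "fake a scroll event" idea above is not impossible in principle: in the browser, window is an EventTarget, so a synthetic "scroll" event can be dispatched programmatically without moving the scrollbar. A minimal sketch (a plain EventTarget stands in for window so it runs outside a browser; the lazy-loading handler is hypothetical, and note that real lazy loaders often also check the actual scroll position inside the handler, so a bare synthetic event may not be enough):

```javascript
// A stand-in for window; in a real page you would use window itself.
const target = new EventTarget();
let loadedBatches = 0;

// Hypothetical lazy-loading handler registered by the page.
target.addEventListener("scroll", () => { loadedBatches += 1; });

// Fire the event programmatically -- no visible scrolling occurs.
target.dispatchEvent(new Event("scroll"));
target.dispatchEvent(new Event("scroll"));

console.log(loadedBatches); // 2 -- the handler ran twice
```

If the page's handler reads window.scrollY (or similar) to decide whether to load more, the synthetic event alone would do nothing; that is why inspecting the actual HTTP traffic, as the answer below suggests, is usually the more robust route.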

1 solution

The problem is not so simple, and I don't think it can be solved without looking at the site you are trying to scrape. I have no idea how exactly the lazy-loading technique you describe is implemented, but I'm sure it can be implemented in several different ways, and those differences would require different scraping approaches. Only one aspect of the difference matters: in all cases, scrolling causes additional HTTP requests, and the data related to the scroll event (say, the scroll position, the page number, or something like that) can be passed in the HTTP request in different ways: HTTP parameters, URL parameters, etc.

So, you need to study this and act accordingly. How? Here is the approach I would use:

Use some existing HTTP spy software and then try to reach the full content manually, by loading the page and scrolling. Such HTTP spying tools are often available as plug-ins for Web browsers. I, for example, use HttpFox, a plug-in for Mozilla browsers. With tracking turned on, it lists all the HTTP requests and responses passing through the browser, with all the detail needed to understand how to do the scraping.
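Once the spy tool reveals the endpoint the page calls when it scrolls, the same requests can be issued directly, batch by batch, with no visible scrolling at all. A sketch of the idea, with an entirely hypothetical endpoint and parameter name (substitute whatever the spy tool actually shows for this site, and note the response format is assumed to be JSON):

```javascript
// Hypothetical lazy-load endpoint discovered with an HTTP spy tool.
function lazyLoadUrl(page) {
  const url = new URL("http://ilike-photo.com/ajax/photos"); // made up
  url.searchParams.set("page", String(page));                // made up
  return url.toString();
}

// Fetch each batch until the server returns an empty one.
// (fetch() is available in modern browsers and in Node 18+.)
async function fetchAllPhotos() {
  const photos = [];
  for (let page = 1; ; page++) {
    const resp = await fetch(lazyLoadUrl(page));
    const batch = await resp.json(); // assumed JSON array of photo records
    if (batch.length === 0) break;
    photos.push(...batch);
  }
  return photos;
}

console.log(lazyLoadUrl(3)); // http://ilike-photo.com/ajax/photos?page=3
```

The actual request may instead carry an offset, a timestamp, or a token copied from the page; the spy log is the only reliable way to find out which.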

—SA
Comments
yno1 14-Sep-14 16:39pm    
Thank you very much!
I will try HttpFox, play with it, see what I get, and post an update soon!
Sergey Alexandrovich Kryukov 14-Sep-14 19:07pm    
You can do this research on a particular site using any spy software, whichever is convenient for you.
This approach helped me in nearly all cases.

Please try it and consider accepting the answer formally (green "Accept" button).
—SA
yno1 16-Sep-14 11:08am    
I was playing around with HttpFox and I was able to find the GET request the browser uses to load the images, but how can I find out what prepared the request?
It seems like the request is dynamically prepared... Is there a tool for finding out things like that?
Sergey Alexandrovich Kryukov 16-Sep-14 11:15am    
I don't know such tools, but you may not need them...

Now, theoretically speaking, an explicitly calculated HTTP request could be anything, even random and hence unpredictable in principle. In connection with that, I have already explained in some of my previous answers that not only may the problem of "scraping the whole Web site" be unsolvable, it may not even make sense. Example: a game implemented on the server side...

—SA
yno1 16-Sep-14 11:46am    
How does HttpFox intercept the GET request? Don't you see how absurd it is? HttpFox can give me exactly the string I need, with all the image IDs... If HttpFox can do this, why can't I?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)