LiMo Posted May 29, 2018 at 09:41 AM

I'm not sure if this really counts as Chinese computing, but it's loosely related and I need a little help. I've been doing a project on Chinese internet communities and I want to try scraping Baidu's forums. Obviously these forums are huge, so I want to restrict my scrape to particular subforums and threads. However, Baidu's forums seem to have a crazy layout, or rather an overly simple one. Judging from the URLs, the first page of a subforum looks like this:

https://tieba.baidu.com/f?ie=utf-8&kw=人生
https://tieba.baidu.com/f?ie=utf-8&kw=凡人修仙传

But the threads of all subforums are held in the same place; these are from the respective forums above:

https://tieba.baidu.com/p/5722174354
https://tieba.baidu.com/p/5721673461

I fear that if I attempt to scrape the forums like this I'll end up scraping the entire site. I'm no expert, and the software I'm trying to use, Rcrawler, relies on the URL having clear sections (how it works is documented here: https://github.com/salimk/Rcrawler#installation). Here's the URL of this thread, for example:

https://www.chinese-forums.com/forums/topic/56586-web-scraping-baidu-tieba

Nice and clear, with an obvious title that can be matched using keywords. Any ideas on how I can fix or deal with this (apparent) problem?
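For what it's worth, the Rcrawler README linked above does describe URL-filter arguments that might make this workable. A minimal sketch, assuming the crawlUrlfilter, dataUrlfilter, MaxDepth and RequestsDelay arguments behave as that README documents (the patterns and values here are illustrative, not tested against Tieba):

```r
library(Rcrawler)

# Start from the subforum page and only follow / extract links that
# look like thread pages (tieba.baidu.com/p/<digits>), so the crawler
# doesn't wander off into the rest of Baidu.
Rcrawler(
  Website        = "https://tieba.baidu.com/f?ie=utf-8&kw=人生",
  crawlUrlfilter = "/p/",   # only follow thread-style links
  dataUrlfilter  = "/p/",   # only extract data from thread pages
  MaxDepth       = 2,       # subforum page -> threads, no deeper
  RequestsDelay  = 5        # seconds between request rounds
)
```

One caveat: with crawlUrlfilter set to "/p/", the crawler won't follow the subforum's own pagination links, so those would need adding to the filter, and whether Rcrawler accepts a query-string start page like this is worth verifying.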
roddy Posted May 29, 2018 at 11:05 AM

I would say what you need to do is first scrape the https://tieba.baidu.com/f?ie=utf-8&kw=人生 page for all URLs in the format https://tieba.baidu.com/p/1234567890, then go and scrape that list of URLs. That's two different jobs, but I can't see how else you could do it. Many moons ago I'd have done this in PHP, basically by telling it to look for https://tieba.baidu.com/p/ followed by ten digits, then to go to each of those pages and grab everything between whatever tags indicated the start and end of the content. I'm sure there are easier ways now. There were probably easier ways then.

I suspect Baidu will be very used to people scraping their content for the simple purpose of copying it, and that's what it'll look like you're doing. I wouldn't let any crawlers loose without being sure they're going to fly under the radar.
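A rough base-R illustration of that two-step approach (the regexes assume thread links appear in the static HTML as /p/<digits>; if Tieba builds the thread list with JavaScript, this will come back empty):

```r
# Step 1: fetch the subforum page and pull out thread URLs.
url  <- "https://tieba.baidu.com/f?ie=utf-8&kw=人生"
page <- paste(readLines(url, warn = FALSE, encoding = "UTF-8"),
              collapse = "\n")

# Absolute links: https://tieba.baidu.com/p/5722174354
abs_urls <- regmatches(page,
                       gregexpr("https?://tieba\\.baidu\\.com/p/\\d+", page))[[1]]

# Relative links: href="/p/5722174354"
rel <- regmatches(page, gregexpr('href="/p/\\d+"', page))[[1]]
rel <- paste0("https://tieba.baidu.com",
              sub('"$', "", sub('^href="', "", rel)))

thread_urls <- unique(c(abs_urls, rel))

# Step 2: visit each thread, keeping the raw HTML for later parsing.
threads <- lapply(thread_urls, function(u) {
  Sys.sleep(5)  # be polite between requests
  paste(readLines(u, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
})
```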
LiMo (Author) Posted May 29, 2018 at 11:45 AM

I'll give it a go. What happens if I'm "detected", though? I'm operating at the edge of my competence here, so I have no idea what the difference is between all these things. Can they selectively ban my IP address, or do something like that which would stop me even visiting casually?
imron Posted May 29, 2018 at 01:54 PM

2 hours ago, roddy said: "without being sure they're going to fly under the radar."

Your crawler's not going to fly under the radar. The best you can do is make sure you put reasonable delays between each request so that it's not crawling abusively. Also respect their robots.txt file; it'll give you a clue as to which places you shouldn't be scraping.

2 hours ago, LiMo said: "Can they selectively ban my IP address or something like that which would stop me even visiting casually?"

Yes.
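On the delays and robots.txt point: Rcrawler's README lists arguments for both, so something like the sketch below might do. The argument names are taken from that README and the values are placeholders; both are worth double-checking against the installed version.

```r
library(Rcrawler)

# Look at what Baidu disallows before crawling anything.
writeLines(readLines("https://tieba.baidu.com/robots.txt", warn = FALSE))

# Obeyrobots honours robots.txt; RequestsDelay spaces out the request
# rounds; a single connection on a single core keeps the load low.
Rcrawler(
  Website       = "https://tieba.baidu.com/",
  Obeyrobots    = TRUE,
  RequestsDelay = 10,
  no_conn       = 1,
  no_cores      = 1
)
```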