Chinese-Forums

Web scraping Baidu Tieba


LiMo


I'm not sure if this really counts as Chinese computing, but it's loosely related and I need a little help.

 

I've been doing a project on Chinese internet communities and I want to try scraping Baidu forums. Obviously these forums are huge, so I want to restrict my crawl to particular subforums and threads. However, Baidu Tieba seems to have a crazy layout, or rather an overly simple one. From the URL, it seems the first page of a subforum looks like one of these:

 

https://tieba.baidu.com/f?ie=utf-8&kw=人生

https://tieba.baidu.com/f?ie=utf-8&kw=凡人修仙传

 

But the threads from all subforums are held in the same place; these are from the respective forums above:

 

https://tieba.baidu.com/p/5722174354

 

https://tieba.baidu.com/p/5721673461
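(A side note for anyone following along: the kw value in those subforum URLs is just the forum name, percent-encoded as UTF-8. A small Python sketch of the pattern, using only the two forum names from this post; subforum_url is a made-up helper name:)

```python
from urllib.parse import urlencode

# Build a subforum URL from a forum name, following the pattern of the
# two example URLs above. kw ends up percent-encoded as UTF-8.
def subforum_url(name: str) -> str:
    return "https://tieba.baidu.com/f?" + urlencode({"ie": "utf-8", "kw": name})

print(subforum_url("人生"))
# e.g. https://tieba.baidu.com/f?ie=utf-8&kw=%E4%BA%BA%E7%94%9F
```

A browser shows the decoded characters in the address bar, but the encoded and decoded forms point at the same page.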

 

I fear that if I attempt to scrape the forums like this I'll end up scraping the entire thing. I'm no expert, and the software I'm trying to use, Rcrawler, works on the assumption that the URL has clear sections. (The info on how it works is here: https://github.com/salimk/Rcrawler#installation)

Here's the URL to this thread for example:

https://www.chinese-forums.com/forums/topic/56586-web-scraping-baidu-tieba

 

Nice and clear, with an obvious title in it that can be found using keywords.

 

Any ideas on how I can fix/deal with this (apparent) problem?

 

 


I would say what you'd need to do is first scrape the https://tieba.baidu.com/f?ie=utf-8&kw=人生 page for all URLs in the format https://tieba.baidu.com/p/1234567890, then go and scrape that list of URLs. That would be two different jobs, but I can't see how else you could do it. Many moons ago I'd have done this in PHP by basically telling it to look for https://tieba.baidu.com/p/ followed by ten digits, then to go to that page and look for all content between whatever tags indicated the start / end of content. I'm sure there are easier ways now. There were probably easier ways then.
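(That two-step approach can be sketched roughly like this in Python. The HTML snippet below is invented for illustration; real Tieba markup will look different, but thread links follow the /p/ plus digits pattern described above:)

```python
import re

# Invented sample of a subforum listing page, for illustration only.
html = '''
<a href="/p/5722174354">thread one</a>
<a href="https://tieba.baidu.com/p/5721673461">thread two</a>
<a href="/f?kw=%E4%BA%BA%E7%94%9F">subforum link, should be ignored</a>
'''

# Step 1: collect thread IDs from relative or absolute /p/ links.
thread_ids = re.findall(r'href="(?:https?://tieba\.baidu\.com)?/p/(\d+)"', html)
thread_urls = ["https://tieba.baidu.com/p/" + tid for tid in thread_ids]

# Step 2 would then fetch each URL in thread_urls and pull out the
# content between whatever tags delimit the posts.
print(thread_urls)
```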

 

I suspect Baidu will be very used to people scraping their content for the simple purposes of copying it, and this is what it'll look like you're doing. I wouldn't let any crawlers loose without being sure they're going to fly under the radar.


I'll give it a go. What happens if I'm "detected"? I'm operating at the edge of my competence here, so I have no idea what the difference is between all these things. Can they selectively ban my IP address or something like that, which would stop me even visiting casually?


2 hours ago, roddy said:

without being sure they're going to fly under the radar.

Your crawler's not going to fly under the radar. The best you can do is make sure you put reasonable delays between each request so that it's not crawling abusively. Also respect their robots.txt file. It'll give you a clue as to which places you shouldn't be scraping.
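(A minimal sketch of both points in Python. The robots.txt content here is invented for illustration; fetch the real one from the site before crawling:)

```python
import time
import urllib.robotparser

# Invented robots.txt for illustration; the real rules will differ.
robots_txt = """\
User-agent: *
Disallow: /f/search
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

def polite_fetch(url: str, delay: float = 2.0) -> bool:
    """Return whether the URL may be fetched, pausing first to stay polite."""
    if not rp.can_fetch("*", url):
        return False  # disallowed by robots.txt, skip it
    time.sleep(delay)  # pause between requests so the crawl isn't abusive
    # ... the actual HTTP request would go here ...
    return True
```

Two seconds between requests is an arbitrary choice; slower is safer.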

2 hours ago, LiMo said:

Can they selectively ban my IP address or something like that which would stop me even visiting casually?

Yes.

