Chinese-Forums

Help scraping character animation images off wikipedia



Posted

Yes, I know this is not a programming forum, but I think people will find this useful, so I'm asking here.

I'm trying to get the images showing stroke order off of Wikipedia, scraping from this page. The project is far from complete, but at least I can freely use the images that are available. It seems I'm not the first to try this: someone already wrote a script over at Wikipedia, but it doesn't save the files with a filename matching the character in the image. For my use, that's not good enough.

With my limited programming experience, I've already spent two days trying to fix this, but I've reached a dead end. I also tried on a Linux machine, and there too the character in the filename doesn't match the image; sometimes I get a ? in the filename. The problem is on line 123.

<?php

/*
-------------------------------------------------------------------------
get-stroke-orders.php
-------------------------------------------------------------------------

Version 1.0

Contact: http://en.wikipedia.org/wiki/User_talk:WikiLaurent

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

-------------------------------------------------------------------------

Dependency:

Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/)

-------------------------------------------------------------------------

Usage:

php get-stroke-orders.php n=<page number> t=<animation type>

<page number> - The page number (1 to 7) at http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress
<animation type> - "bw", "red" or "order"

-------------------------------------------------------------------------

Example:

Get all the gif animations:

php get-stroke-orders.php n=1 t=order
php get-stroke-orders.php n=2 t=order
php get-stroke-orders.php n=3 t=order
php get-stroke-orders.php n=4 t=order
php get-stroke-orders.php n=5 t=order
php get-stroke-orders.php n=6 t=order
php get-stroke-orders.php n=7 t=order

-------------------------------------------------------------------------

*/

require_once "simple_html_dom.php";
set_time_limit(3600 * 10);

function curl($url){
       $ch = curl_init();
       curl_setopt($ch, CURLOPT_URL,$url);
       curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
       curl_setopt($ch, CURLOPT_USERAGENT, "StrokeOrderAnimScrapper/1.0");
       $output = curl_exec($ch);
       curl_close ($ch);
       return $output;
}


function downloadAnimations($pageNumber, $type = "bw") {
       $listBaseUrl = "http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress";
       //$listBaseUrl = "./Simplified_Chinese_progress.htm";
       $pageUrl = $listBaseUrl;
       if ($pageNumber > 1) $pageUrl .= "/" . $pageNumber;
       $filecount = 0;

       echo "Parsing " . $pageUrl . "\n";
       $hmlString = curl($pageUrl);
       $html = new simple_html_dom();
       $html->load($hmlString);

       $filecount = 0;
       foreach ($html->find('tr') as $tr) {
               $tdIndex = 3;
               if ($type == "red") $tdIndex = 4;
               if ($type == "order") $tdIndex = 5;

               $tdimg = $tr->find("td", $tdIndex);
               $tdchar = $tr->find("td", 1);
               if (!$tdimg) continue;
               $img = $tdimg->find("img", 0);
               $char = $tdchar->plaintext;
               $char = substr($char,1);
               echo "Got this char:: " . $char . "\n";
               if (!$img)
               {
                       echo "no file for char::" . $char . "\n";
                       continue;
               }
               $src = $img->getAttribute("src");
               if ($type == "bw" && strpos($src, "-bw.png") === false) continue;
               if ($type == "red" && strpos($src, "-red.png") === false) continue;
               if ($type == "order" && strpos($src, "-order.gif") === false) continue;

               $lastSlashIndex = strrpos($src, "/");
               $src = substr($src, 0, $lastSlashIndex);
               $src = str_replace("/thumb", "", $src);

               //$alt = $img->getAttribute("alt");
               //if ($type == "bw") $filename = substr($alt, 1) . "-bw" . ".png";
               //if ($type == "red") $filename = substr($alt, 1) . "-red" . ".png";
               //if ($type == "order") $filename = substr($alt, 1) . "-order" . ".gif";

               echo "Downloading " . $src . "\n";
               $pngData = file_get_contents($src);
               $filecount = ++$filecount;
               $filename = $char . substr($src,-7);
               if(empty($pngData)) echo "failed to download " . $src . "\n";
               echo "filename now:: " . $filename . "\n";
               if (file_put_contents(utf8_encode($filename), $pngData)) echo $filecount . ")) " . $char . " from src: " . $src . " downloaded" . "\n\n";
               else echo $filecount . ")) " . $char . " from src: " . $src . " failed" . "\n\n";
       }
       echo "got" . $filecount . "\n";
}


function getParam($name) {
       if (isset($_GET[$name])) return $_GET[$name];
       global $argv;
       foreach ($argv as $value) {
               $pair = explode("=", $value);
               if (count($pair) < 2) continue;
               if (trim($pair[0]) != $name) continue;
               $equalPos = strpos($value, "=");
               return trim(substr($value, $equalPos + 1, strlen($value)));
       }
       return null;
}

downloadAnimations(getParam("n"), getParam("t"));

Any help appreciated,

Thanks
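
A possible explanation, assuming "line 123" means the `file_put_contents(utf8_encode($filename), ...)` call: `utf8_encode()` treats its input as ISO-8859-1, so a `$filename` that is already UTF-8 gets encoded a second time and the character is mangled. The sketch below shows the effect on the three bytes of 的. The helper name `latin1ToUtf8` is made up for illustration, and `iconv()` stands in for `utf8_encode()` because the latter is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: performs the ISO-8859-1 -> UTF-8 conversion that
// utf8_encode() does, using iconv() as a stand-in.
function latin1ToUtf8(string $bytes): string
{
    $out = iconv('ISO-8859-1', 'UTF-8', $bytes);
    return $out === false ? '' : $out;
}

$char = "\xE7\x9A\x84";          // the three UTF-8 bytes of 的
$mangled = latin1ToUtf8($char);  // every byte is re-encoded as its own character

echo strlen($char), " bytes before, ", strlen($mangled), " bytes after\n";
echo bin2hex($mangled), "\n";
```

If that is what happens here, passing `$filename` to `file_put_contents()` unchanged would be the thing to try on a system whose filenames are UTF-8.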

Posted

It might be easier if you just show us exactly what you want.

A link to 1 or 2 of the images you want...

And what you want it to be called when you're done.

Honestly, that code looks messy, and I'd be better off with a description of what you actually want than looking at someone else's broken script and guessing what exactly you want from it.

Posted

Thanks for the nudge, I didn't think anyone would be interested in solving this here.

Well, I narrowed the problem down to a few lines. The problem seems to be in decoding the URL, the part with the filename to be exact.

Here's a cleaned-up bit of code:

<?php
   $src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
   $pngData = file_get_contents($src);
   $fileName = basename(urldecode($src));
   file_put_contents($fileName, $pngData);
?>

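If you put
http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png
in your browser, it would be transformed to
http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png
and then when you save the file it would be called
的-bw.png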

Simple enough, but how do I do this in PHP?
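
One way to get from the percent-encoded URL to that file name, as a minimal sketch: cut the last path segment off while it is still plain ASCII, then decode it. The PHP manual notes that `basename()` is locale aware, so calling it on an already decoded multibyte name may not behave the same on every machine; cutting first sidesteps that. The helper name `fileNameFromUrl` is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): derive a local file name from an image URL.
function fileNameFromUrl(string $url): string
{
    // Use only the path component, so a query string cannot end up in the name.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    // Cut at the last slash while the name is still percent-encoded ASCII ...
    $slash = strrpos($path, '/');
    $encoded = $slash === false ? $path : substr($path, $slash + 1);
    // ... then turn %E7%9A%84 into the raw UTF-8 bytes of the character.
    return rawurldecode($encoded);
}

$src = "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
echo fileNameFromUrl($src), "\n";
```

The returned string is UTF-8. Whether the file system shows it as the intended character depends on the platform, which is not something this sketch addresses.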

Posted

Wait, ignore me.

Hmmm, can't get php to create a Chinese filename. And should really be doing other things . . .

Posted

I don't know about theirs, but....

Just use this...

getChar('你');

<?php
/*
Author: Matty of http://www.chinese-forums.com
Release: 2011-08-10
*/
ini_set('default_charset', 'utf-8'); // <=== Probably not needed!
getChar('你');
getChar('好');


function getChar($char)
{
    $folder = '';  // You could do this .... $folder = "d:/z/";
    $formattedname = iconv('utf-8','gbk', $char);
    if (file_exists($folder . $formattedname . '.png')) {echo "$char  - $formattedname - already downloaded.<br />\r\n";return;}
    $x = file_get_contents('http://commons.wikimedia.org/wiki/File:'.$char.'-bw.png');
    $full_res_pattern = '<a href="(http:\/\/upload.wikimedia.org\/wikipedia\/commons\/.*?-bw.png)" class="internal" title=".*-bw.png">Full resolution';
    $file = getSingle($full_res_pattern, $x);
    $image_data = file_get_contents($file);
    file_put_contents($folder . $formattedname . '.png',$image_data);
}
function getSingle($pattern, $txt,$after='')
{
    preg_match("/$pattern/$after", $txt, $matches);
    if (isset($matches[1]))
        return $matches[1];
    return false;
}
?>

Posted

Sorry for the late reply, I've been away. Thanks for taking an interest in this.

I tried your script, Matty, but this is what I get:

. - . - already downloaded.<br />

. - . - already downloaded.<br />

By the way, this is on a Linux machine running PHP version 5.3.5.

I'm not sure I follow your logic: is $char supposed to come from an external list? How would I do this for all files? Directory indexing is surely disabled on Wikipedia. I'm past getting all the images; the only problem is that they get the wrong file names.

Have you tried this on your computer? I guess results will vary from case to case.

Posted

Actually I did try it, but now I'm getting slightly different results. This should work.

<?php
/*
Author: Matty of http://www.chinese-forums.com
Release: 2011-08-10
Updated: 2011-08-15
*/
ini_set('default_charset', 'utf-8'); // <=== Probably not needed!
getChar('你');
getChar('好');


function getChar($char)
{
    $folder = '';  // You could do this .... $folder = "d:/z/";
    $formattedname = iconv('utf-8','gbk', $char);
    if (file_exists($folder . $formattedname . '.png')) {echo "$char  - $formattedname - already downloaded.<br />\r\n";return;}
    $x = request('http://commons.wikimedia.org/wiki/File:'.$char.'-bw.png');
    $full_res_pattern = '<a href="(http:\/\/upload.wikimedia.org\/wikipedia\/commons\/.*?-bw.png)" class="internal" title=".*-bw.png">Full resolution';
    $file = getSingle($full_res_pattern, $x);
    // No match means there is no image URL to fetch, so skip this character.
    if ($file === false) {echo "$char  - no full resolution link found.<br />\r\n";return;}
    $image_data = file_get_contents($file);
    file_put_contents($folder . $formattedname . '.png',$image_data);
}
// Returns the first capture group of $pattern in $txt, or false if there is no match.
function getSingle($pattern, $txt,$after='')
{
    preg_match("/$pattern/$after", $txt, $matches);
    if (isset($matches[1]))
        return $matches[1];
    return false;
}
// Fetches $url with cURL and returns the response body as a string.
function request($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $output = curl_exec($ch);
    curl_close ($ch);
    return $output;
}
?>

Posted

AHHH!! The script file has to be saved with UTF-8 encoding. That's why I kept getting:

Notice: iconv(): Detected an illegal character in input string in D:\simplehtmldom\matty.php on line 15

OK, thanks, I'm getting files with the correct file name now. But how do I use this to get files for all characters? List the characters in an array?

$chars = array('的','是','不','我'); // etc.. 

foreach($chars as $xchar)
{
   getChar($xchar);
}

Well, I tried that, but somehow I'm not able to statically define the array.
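
As a sketch of one alternative to typing out the array by hand, a single UTF-8 string can be split into characters with a Unicode-aware regex. The helper name `splitChars` is made up for illustration. The characters are written as `\u{...}` escapes (PHP 7 or later), which keeps the script file plain ASCII, so this part at least does not depend on the encoding the file is saved in.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of single characters.
function splitChars(string $text): array
{
    // The /u modifier makes PCRE treat the subject as UTF-8. On invalid UTF-8
    // preg_split() returns false, which is reported here as an empty list.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

$chars = splitChars("\u{7684}\u{662F}\u{4E0D}\u{6211}"); // 的 是 不 我
foreach ($chars as $xchar) {
    echo $xchar, "\n"; // a call such as getChar($xchar) would go here
}
```

An empty result for a non-empty input is a hint that the input is not valid UTF-8, which is the same condition behind the iconv() notice quoted above.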

Posted

Err, here's a smarter way to use your script, but I'm too sleepy to figure out what's wrong. Almost there. Sleep now. Thanks.

require_once "simple_html_dom.php";

$listBaseUrl = "http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress";
$hmlString = request($listBaseUrl);
$html = new simple_html_dom();
$html->load($hmlString);

foreach ($html->find('tr') as $tr)
{
    $tdchar = $tr->find("td", 1);
    if (!$tdchar) continue; // was !$td, a variable that is never set, so every row was skipped
    $char = $tdchar->plaintext;
    $char = substr($char,1);
    echo $char . "\n";
    getChar($char);
}
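
The loops in this thread take the character from the second table cell and drop its first byte with `substr($char, 1)`. That only works if the cell text starts with exactly one single-byte character, such as a space. If it does not, the cut lands inside the multibyte character, which may be one source of the `?` in file names. A minimal sketch of a more forgiving approach, picking the first Han character with a Unicode-aware regex; the helper name `firstHanChar` is made up for illustration, and the sample cell text is an assumption about what the page returns.

```php
<?php
// Hypothetical helper: return the first Han character in a string, or null if there is none.
function firstHanChar(string $text): ?string
{
    // \p{Han} matches one character of the Han script; /u makes the match UTF-8 aware.
    if (preg_match('/\p{Han}/u', $text, $m) === 1) {
        return $m[0];
    }
    return null;
}

$cell = " \u{7684}";               // assumed cell text: a space followed by 的
$char = firstHanChar($cell);
echo $char === null ? "no character found" : $char, "\n";
```

Rows without a character (header rows, for example) come back as `null` and can be skipped with a `continue`.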
