slabo Posted August 8, 2011 at 04:38 PM

Yes, I know this is not a programming forum, but I think people will find this useful, so I'm asking here. I'm trying to get the stroke-order images from Wikipedia. Scraping from this page. The project is far from complete, but at least I can freely use the images available.

It seems I'm not the first to try this: someone already wrote a script over at Wikipedia, but it doesn't save the files with a filename matching the image. For my use, that's not good enough. With my limited programming experience, I have already spent two days trying to fix this, but I've reached a dead end. I also tried on a Linux machine, and the character still doesn't match the image; sometimes I get a ? in the filename. The problem is on line 123.

<?php
/*
-------------------------------------------------------------------------
get-stroke-orders.php
-------------------------------------------------------------------------
Version 1.0
Contact: http://en.wikipedia.org/wiki/User_talk:WikiLaurent

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-------------------------------------------------------------------------
Dependency: Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/)
-------------------------------------------------------------------------
Usage: php get-stroke-orders.php n=<page number> t=<animation type>

<page number> - The page number (1 to 7) at http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress
<animation type> - "bw", "red" or "order"
-------------------------------------------------------------------------
Example: Get all the gif animations:

php get-stroke-orders.php n=1 t=order
php get-stroke-orders.php n=2 t=order
php get-stroke-orders.php n=3 t=order
php get-stroke-orders.php n=4 t=order
php get-stroke-orders.php n=5 t=order
php get-stroke-orders.php n=6 t=order
php get-stroke-orders.php n=7 t=order
-------------------------------------------------------------------------
*/

require_once "simple_html_dom.php";

set_time_limit(3600 * 10);

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, "StrokeOrderAnimScrapper/1.0");
    $output = curl_exec($ch);
    curl_close ($ch);
    return $output;
}

function downloadAnimations($pageNumber, $type = "bw") {
    $listBaseUrl = "http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress";
    //$listBaseUrl = "./Simplified_Chinese_progress.htm";
    $pageUrl = $listBaseUrl;
    if ($pageNumber > 1) $pageUrl .= "/" . $pageNumber;
    $filecount = 0;

    echo "Parsing " . $pageUrl . "\n";
    $hmlString = curl($pageUrl);
    $html = new simple_html_dom();
    $html->load($hmlString);
    $filecount = 0;

    foreach ($html->find('tr') as $tr) {
        $tdIndex = 3;
        if ($type == "red") $tdIndex = 4;
        if ($type == "order") $tdIndex = 5;

        $tdimg = $tr->find("td", $tdIndex);
        $tdchar = $tr->find("td", 1);
        if (!$tdimg) continue;

        $img = $tdimg->find("img", 0);
        $char = $tdchar->plaintext;
        $char = substr($char,1);
        echo "Got this char:: " . $char . "\n";
        if (!$img) {
            echo "no file for char::" . $char . "\n";
            continue;
        }

        $src = $img->getAttribute("src");
        if ($type == "bw" && strpos($src, "-bw.png") === false) continue;
        if ($type == "red" && strpos($src, "-red.png") === false) continue;
        if ($type == "order" && strpos($src, "-order.gif") === false) continue;

        $lastSlashIndex = strrpos($src, "/");
        $src = substr($src, 0, $lastSlashIndex);
        $src = str_replace("/thumb", "", $src);

        //$alt = $img->getAttribute("alt");
        //if ($type == "bw") $filename = substr($alt, 1) . "-bw" . ".png";
        //if ($type == "red") $filename = substr($alt, 1) . "-red" . ".png";
        //if ($type == "order") $filename = substr($alt, 1) . "-order" . ".gif";

        echo "Downloading " . $src . "\n";
        $pngData = file_get_contents($src);
        $filecount = ++$filecount;
        $filename = $char . substr($src,-7);
        if(empty($pngData)) echo "failed to download " . $src . "\n";
        echo "filename now:: " . $filename . "\n";
        if (file_put_contents(utf8_encode($filename), $pngData))
            echo $filecount . ")) " . $char . " from src: " . $src . " downloaded" . "\n\n";
        else
            echo $filecount . ")) " . $char . " from src: " . $src . " failed" . "\n\n";
    }
    echo "got" . $filecount . "\n";
}

function getParam($name) {
    if (isset($_GET[$name])) return $_GET[$name];
    global $argv;
    foreach ($argv as $value) {
        $pair = explode("=", $value);
        if (count($pair) < 2) continue;
        if (trim($pair[0]) != $name) continue;
        $equalPos = strpos($value, "=");
        return trim(substr($value, $equalPos + 1, strlen($value)));
    }
    return null;
}

downloadAnimations(getParam("n"), getParam("t"));

Any help appreciated. Thanks.
Matty Posted August 10, 2011 at 03:59 AM

It might be easier if you just show us exactly what you want: a link to one or two of the images you want, and what you want each file to be called when you're done. Honestly, that code looks messy, and I'd be better off with a description of what you actually want than looking at someone else's broken script and guessing what exactly you want from it.
slabo Posted August 10, 2011 at 07:03 AM

Thanks for the nudge, I didn't think anyone would be interested in solving this here. Well, I narrowed the problem down to a few lines. The problem seems to be in decoding the URL, the part with the filename to be exact. Here's a cleaned-up bit of code:

<?php
$src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
$pngData = file_get_contents($src);
$fileName = basename(urldecode($src));
file_put_contents($fileName, $pngData);
?>

If you put http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png in your browser, it is transformed to http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png, and when you save the file it is called 的-bw.png. Simple enough, so how do I do this in PHP?
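One way to approach the question above, as a rough sketch rather than a confirmed fix: take the last path segment of the URL by hand and percent-decode it. The PHP manual notes that basename() is locale-aware, so with multibyte names it can misbehave unless a matching locale is set; splitting on the last "/" sidesteps that. The helper name filenameFromUrl is made up for this example, and the sketch assumes the target filesystem accepts UTF-8 names.

```php
<?php
// Hypothetical helper: derive a local file name from an image URL.
// Assumes the URL path is percent-encoded UTF-8, as in the example above.
function filenameFromUrl($url)
{
    // parse_url() returns the path still percent-encoded.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return null;
    }
    // Take everything after the last slash instead of calling basename(),
    // which depends on the current locale for multibyte names.
    $slash = strrpos($path, '/');
    $last = ($slash === false) ? $path : substr($path, $slash + 1);
    // Turn %E7%9A%84 back into the raw UTF-8 bytes of the character.
    return rawurldecode($last);
}

$src = "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
$name = filenameFromUrl($src);
echo bin2hex($name), "\n"; // hex dump of the UTF-8 bytes of the file name

// Saving would then look like this (left commented out, no download here):
// file_put_contents($name, file_get_contents($src));
```

Printing the name as hex makes it easy to check that the bytes are valid UTF-8 even when the terminal cannot display Chinese.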
roddy Posted August 10, 2011 at 07:59 AM

Wait, ignore me. Hmmm, I can't get PHP to create a Chinese filename. And I should really be doing other things . . .
Matty Posted August 10, 2011 at 10:25 AM

I don't know about theirs, but just use this: getChar('你');

<?php
/*
Author: Matty of http://www.chinese-forums.com
Release: 2011-08-10
*/

ini_set('default_charset', 'utf-8'); // <=== Probably not needed!

getChar('你');
getChar('好');

function getChar($char) {
    $folder = ''; // You could do this .... $folder = "d:/z/";
    $formattedname = iconv('utf-8','gbk', $char);
    if (file_exists($folder . $formattedname . '.png')) {echo "$char - $formattedname - already downloaded.<br />\r\n";return;}
    $x = file_get_contents('http://commons.wikimedia.org/wiki/File:'.$char.'-bw.png');
    $full_res_pattern = '<a href="(http:\/\/upload.wikimedia.org\/wikipedia\/commons\/.*?-bw.png)" class="internal" title=".*-bw.png">Full resolution';
    $file = getSingle($full_res_pattern, $x);
    $image_data = file_get_contents($file);
    file_put_contents($folder . $formattedname . '.png',$image_data);
}

function getSingle($pattern, $txt,$after='') {
    preg_match("/$pattern/$after", $txt, $matches);
    if (isset($matches[1])) return $matches[1];
    return false;
}
?>
slabo Posted August 14, 2011 at 08:12 PM

Sorry for the late reply, I've been away. Thanks for taking an interest in this. I tried your script, Matty, but this is what I get:

. - . - already downloaded.<br />
. - . - already downloaded.<br />

By the way, this is on a Linux machine running PHP version 5.3.5. I'm not sure I follow your logic: is $char supposed to come from an external list? How do I do this for all files? Directory indexing is surely disabled on Wikipedia. I'm past getting all the images; the remaining problem is just the wrong file names. Have you tried this on your computer? I get results will vary with each case's severity ..
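A note on the script in the first post, which may bear on the wrong file names: it passes $filename through utf8_encode() before writing the file. utf8_encode() treats its input as ISO-8859-1, so a name that is already UTF-8 (as text scraped from a UTF-8 page would be) gets encoded a second time. The sketch below reproduces that effect with iconv(), which performs the same ISO-8859-1 to UTF-8 conversion. latin1ToUtf8 is a hypothetical helper name, and whether this is the whole cause here is an assumption that the thread does not confirm.

```php
<?php
// Hypothetical helper: the same conversion utf8_encode() performs,
// written with iconv() (utf8_encode() is deprecated as of PHP 8.2).
function latin1ToUtf8($bytes)
{
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}

// The character 的 (U+7684) as UTF-8 bytes, written with escapes so the
// example does not depend on the encoding of the script file itself.
$utf8 = "\xE7\x9A\x84";

// Each of the three bytes is read as a separate Latin-1 character and
// expanded to two bytes, so the result is no longer the original name.
$doubleEncoded = latin1ToUtf8($utf8);

echo bin2hex($utf8), "\n";          // e79a84
echo bin2hex($doubleEncoded), "\n"; // c3a7c29ac284
```

If this is what happens, dropping the utf8_encode() call and passing the UTF-8 name straight to file_put_contents() would be the thing to try on a system whose filesystem uses UTF-8 names.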
Matty Posted August 15, 2011 at 11:14 AM

Actually I did try it, but now I'm getting slightly different results. This should work.

<?php
/*
Author: Matty of http://www.chinese-forums.com
Release: 2011-08-10
Updated: 2011-08-15
*/

ini_set('default_charset', 'utf-8'); // <=== Probably not needed!

getChar('你');
getChar('好');

function getChar($char) {
    $folder = ''; // You could do this .... $folder = "d:/z/";
    // Convert the UTF-8 character to GBK for use as the local file name.
    $formattedname = iconv('utf-8','gbk', $char);
    if (file_exists($folder . $formattedname . '.png')) {echo "$char - $formattedname - already downloaded.<br />\r\n";return;}
    // Fetch the file description page and pull out the full-resolution link.
    $x = request('http://commons.wikimedia.org/wiki/File:'.$char.'-bw.png');
    $full_res_pattern = '<a href="(http:\/\/upload.wikimedia.org\/wikipedia\/commons\/.*?-bw.png)" class="internal" title=".*-bw.png">Full resolution';
    $file = getSingle($full_res_pattern, $x);
    // Skip characters whose page has no matching link instead of
    // passing false to file_get_contents().
    if ($file === false) {echo "$char - no full resolution link found.<br />\r\n";return;}
    $image_data = file_get_contents($file);
    file_put_contents($folder . $formattedname . '.png',$image_data);
}

// Return the first capture group of $pattern in $txt, or false if no match.
function getSingle($pattern, $txt,$after='') {
    preg_match("/$pattern/$after", $txt, $matches);
    if (isset($matches[1])) return $matches[1];
    return false;
}

// Fetch a URL with cURL and return the response body as a string.
function request($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $output = curl_exec($ch);
    curl_close ($ch);
    return $output;
}
?>
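The script above always converts the file name to GBK, which matches the "d:/z/" Windows-style folder in its comment. On a Linux machine whose filesystem uses UTF-8 names, that conversion is probably not wanted. A minimal sketch of making the target encoding a parameter follows; localFilename and its $fsEncoding argument are hypothetical, and the right encoding for a given machine is something the caller has to know.

```php
<?php
// Hypothetical helper: convert a UTF-8 name to the encoding the local
// filesystem expects. $fsEncoding is an assumption supplied by the caller,
// for example 'GBK' on a Chinese-locale Windows setup like the one the
// script above seems to target, or 'UTF-8' on a typical Linux system.
function localFilename($utf8Name, $fsEncoding = 'UTF-8')
{
    if (strcasecmp($fsEncoding, 'UTF-8') === 0) {
        return $utf8Name; // already in the right encoding, nothing to do
    }
    $converted = iconv('UTF-8', $fsEncoding, $utf8Name);
    if ($converted === false) {
        // iconv() returns false when the input cannot be converted.
        throw new RuntimeException("Cannot convert file name to $fsEncoding");
    }
    return $converted;
}

$char = "\xE7\x9A\x84"; // 的 as UTF-8 bytes
echo bin2hex(localFilename($char)), "\n";        // e79a84
echo bin2hex(localFilename($char, 'GBK')), "\n"; // b5c4
```

In getChar() this would stand in for the direct iconv('utf-8','gbk', $char) call, leaving the rest of the function unchanged.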
slabo Posted August 15, 2011 at 05:39 PM

AHHH !! The script file has to be saved with UTF-8 encoding. That's why I kept getting:

Notice: iconv(): Detected an illegal character in input string in D:\simplehtmldom\matty.php on line 15

OK, thanks, I'm getting files with the correct file name now. But how do I use this to get the files for all characters? List the characters in an array?

$chars = array('的','是','不','我'); // etc..
foreach($chars as $xchar) {
    getChar($xchar);
}

Well, I tried that.. somehow I'm not able to statically define the array..
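For the character list, one option is to keep the characters in a single UTF-8 string and split it into one-character pieces, instead of typing each array element by hand. This is only a sketch: splitChars is a made-up helper, and it relies on the string being valid UTF-8, which for literals typed into the script means the file must be saved as UTF-8, the same requirement found above. The example writes the characters as byte escapes so it runs regardless of how the file is saved.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function splitChars($utf8String)
{
    // The empty pattern with the /u modifier splits between UTF-8
    // characters rather than between bytes; NO_EMPTY drops the empty
    // pieces at both ends.
    $chars = preg_split('//u', $utf8String, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false if the input is not valid UTF-8.
    return ($chars === false) ? array() : $chars;
}

// 的是不我 written as UTF-8 byte escapes.
$list = "\xE7\x9A\x84\xE6\x98\xAF\xE4\xB8\x8D\xE6\x88\x91";

foreach (splitChars($list) as $char) {
    echo bin2hex($char), "\n"; // each piece is one 3-byte character
    // getChar($char);         // call the download function from the thread here
}
```

An empty result from splitChars() is a hint that the input was not valid UTF-8, which points back at the encoding of the script file or of whatever source the list was read from.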
slabo Posted August 15, 2011 at 06:08 PM

Err, here's a smarter way to use your script, but I'm too sleepy to figure out what's wrong .. almost there.. sleep now. Thanks.

require_once "simple_html_dom.php";

$listBaseUrl = "http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress";
$hmlString = request($listBaseUrl);
$html = new simple_html_dom();
$html->load($hmlString);

foreach ($html->find('tr') as $tr) {
    $tdchar = $tr->find("td", 1);
    if (!$td) continue;
    $char = $tdchar->plaintext;
    $char = substr($char,1);
    echo $char . "\n";
    getChar($char);
}
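Two things stand out in the loop above. First, it tests $td, which is never assigned, so !$td is always true and every row is skipped; the check was presumably meant for $tdchar. Second, substr($char, 1) drops one byte, not one character, so it only works if the cell text happens to begin with a single-byte character such as a space. A sketch of a more forgiving way to pull the character out of the cell follows; firstHanChar is a hypothetical helper, not part of simple_html_dom, and it assumes the cell text is UTF-8.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): return the first
// Chinese character found in a table cell's text, or null if there is none.
function firstHanChar($text)
{
    // \p{Han} matches one character of the Han script; the /u modifier
    // makes PCRE read the subject as UTF-8, so the match is a whole
    // character rather than a single byte.
    if (preg_match('/\p{Han}/u', $text, $m) === 1) {
        return $m[0];
    }
    return null;
}

// Inside the loop this would replace the substr() call, for example:
//   if (!$tdchar) continue;
//   $char = firstHanChar($tdchar->plaintext);
//   if ($char === null) continue;

echo bin2hex(firstHanChar(" \xE7\x9A\x84 ")), "\n"; // e79a84
```

Header rows and rows without a character then fall out naturally through the null check instead of producing empty or truncated file names.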