tag:www.stefanwienert.net,2008:/bot Bot - Stefan Wienert's Blog 2010-02-28T20:06:14Z Enki Stefan Wienert stwienert@gmail.com tag:www.stefanwienert.net,2008:Post/35 2010-02-28T19:06:00Z 2010-02-28T20:06:14Z Minibot for generating an iCal and an RSS feed from a Web 1.0 site
<p>Dresden is home to the Hochschule für Musik &#8220;Carl Maria von Weber&#8221;, which publishes its current programme <a href="http://www.hfmdd.de/index.php?id=4">on its website</a>. For anyone with some interest in classical music, these concerts are a chance to hear very good pianists very cheaply (for free&#8230;).</p>
<p>Unfortunately, the site offers neither a feed nor a calendar, so I figured this would be another good job for the hpricot gem. Here is a quick sketch of the process.</p>
<h3>Rien ne va plus &#8230; without gems!</h3><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="s"><span class="dl">%w[</span><span class="k">rubygems hpricot curl iconv active_support icalendar</span><span class="dl">]</span></span>.each { |x| require x }<tt> </tt></pre></td> </tr></table>
<p>On the first attempt I ran into problems with the <strong>umlauts</strong>, so the first step is to convert everything to <span class="caps">UTF</span>-8:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 
'visible' : 'auto' }">cal = <span class="co">Icalendar</span>::<span class="co">Calendar</span>.new <span class="c"># by the way: we will turn this into an iCal right away, see the icalendar gem.</span><tt> </tt><tt> </tt>curl_object = <span class="co">Curl</span>::<span class="co">Easy</span>.perform(<span class="s"><span class="dl">&quot;</span><span class="k">http://www.hfmdd.de/veranstaltungen/</span><span class="dl">&quot;</span></span>)<tt> </tt>body = <span class="co">Iconv</span>.conv(<span class="s"><span class="dl">&quot;</span><span class="k">UTF-8//IGNORE</span><span class="dl">&quot;</span></span>,<span class="s"><span class="dl">&quot;</span><span class="k">ISO-8859-1</span><span class="dl">&quot;</span></span>, curl_object.body_str)<tt> </tt>doc = Hpricot(body)<tt> </tt></pre></td> </tr></table>
<p>Next, use <a href="http://www.selectorgadget.com/">Selector Gadget</a> and Firebug to find out which <span class="caps">CSS</span> selector points to the content areas. Here that is, for example, &#8220;#contentbereich div&#8221;, restricted to the divs whose margin is 35em&#8230; Unfortunately, the web designers of the music college's site made little use of classes or IDs.</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }">elements = doc.search(<span class="s"><span class="dl">&quot;</span><span class="k">#contentbereich div</span><span class="dl">&quot;</span></span>).select{ <tt> </tt> |it| it[<span class="sy">:style</span>].include? 
<span class="s"><span class="dl">&quot;</span><span class="k">35em</span><span class="dl">&quot;</span></span> <span class="r">if</span> it[<span class="sy">:style</span>].present?<tt> </tt>}<tt> </tt><span class="c"># present? is the opposite of blank?</span><tt> </tt><tt> </tt><span class="c"># Now we traverse the remaining elements</span><tt> </tt>elements.each <span class="r">do</span> |item|<tt> </tt> [...]<tt> </tt><span class="r">end</span><tt> </tt></pre></td> </tr></table>
<p>Now for the more tedious [&#8230;] part: extracting the data from each element&#8230;<br /> Unfortunately this gets a bit messy because, as you can see on the page, the dates are written in fairly arbitrary formats. Here is my best attempt:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 
'visible' : 'auto' }"> text = item.search(<span class="s"><span class="dl">&quot;</span><span class="k">div[2]</span><span class="dl">&quot;</span></span>).inner_html <span class="r">rescue</span> text = <span class="s"><span class="dl">&quot;</span><span class="k">???</span><span class="dl">&quot;</span></span><tt> </tt> date = item.search(<span class="s"><span class="dl">&quot;</span><span class="k">&gt;div&gt;b</span><span class="dl">&quot;</span></span>).first.inner_text.split(<span class="s"><span class="dl">&quot;</span><span class="k">.</span><span class="dl">&quot;</span></span>)<tt> </tt> time = item.search(<span class="s"><span class="dl">&quot;</span><span class="k">&gt;div[1]&gt;*</span><span class="dl">&quot;</span></span>)[<span class="i">4</span>].to_s[<span class="i">0</span>..<span class="i">4</span>].split(<span class="s"><span class="dl">&quot;</span><span class="k">:</span><span class="dl">&quot;</span></span>) <span class="r">rescue</span> time = <span class="s"><span class="dl">%w[</span><span class="k">00 00</span><span class="dl">]</span></span><tt> </tt> <span class="r">begin</span><tt> </tt> new_date = <span class="co">DateTime</span>.strptime(<span class="s"><span class="dl">&quot;</span><span class="k">20</span><span class="il"><span class="idl">#{</span>date[<span class="i">2</span>]<span class="idl">}</span></span><span class="k">-</span><span class="il"><span class="idl">#{</span>date[<span class="i">1</span>]<span class="idl">}</span></span><span class="k">-</span><span class="il"><span class="idl">#{</span>date[<span class="i">0</span>]<span class="idl">}</span></span><span class="k">T</span><span class="il"><span class="idl">#{</span>time[<span class="i">0</span>]<span class="idl">}</span></span><span class="k">:</span><span class="il"><span class="idl">#{</span>time[<span class="i">1</span>]<span class="idl">}</span></span><span class="k">:00+0100</span><span class="dl">&quot;</span></span>) <tt> </tt> <span 
class="r">next</span> <span class="r">if</span> new_date &lt; <span class="co">Date</span>.today<tt> </tt> event = cal.event<tt> </tt> event.start = new_date<tt> </tt> event.summary = text.split(<span class="s"><span class="dl">&quot;</span><span class="k">&lt;br</span><span class="dl">&quot;</span></span>).first[<span class="i">0</span>..<span class="i">150</span>]<tt> </tt> event.description = text <tt> </tt> <span class="r">rescue</span><tt> </tt> puts <span class="s"><span class="dl">&quot;</span><span class="k">INVALID: 20</span><span class="il"><span class="idl">#{</span>date[<span class="i">2</span>]<span class="idl">}</span></span><span class="k">-</span><span class="il"><span class="idl">#{</span>date[<span class="i">1</span>]<span class="idl">}</span></span><span class="k">-</span><span class="il"><span class="idl">#{</span>date[<span class="i">0</span>]<span class="idl">}</span></span><span class="k">T</span><span class="il"><span class="idl">#{</span>time[<span class="i">0</span>]<span class="idl">}</span></span><span class="k">:</span><span class="il"><span class="idl">#{</span>time[<span class="i">1</span>]<span class="idl">}</span></span><span class="k">:00+0100</span><span class="dl">&quot;</span></span><tt> </tt> <span class="c"># later: use a logger here</span><tt> </tt> <span class="r">end</span><tt> </tt></pre></td> </tr></table>
<p>Now we can output the calendar directly, or do whatever else we like with it:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 
'visible' : 'auto' }">cal.to_ical<tt> </tt></pre></td> </tr></table>
<p>The whole thing was more of a proof of concept because, as always with screen scraping, you should have some form of consent from the owner of the scraped site (or the site offers an <span class="caps">API</span>, as Google and Twitter do).</p>
tag:www.stefanwienert.net,2008:Post/6 2009-08-14T08:49:00Z 2009-10-18T10:49:01Z Grade overview bot in PHP
<p>Two weeks ago, while checking my latest exam results, it occurred to me to automate the whole thing and expose it as a feed, so that I could add it to my feed reader and always stay up to date ;).</p>
<p>For simplicity I went with <span class="caps">PHP</span>/cURL: I had briefly looked into Ruby, but its <span class="caps">HTTP</span> library did not seem fit for the purpose.</p>
<h3>Part 1: Fetching the page source</h3>
<p>To do this, use Firefox with the LiveHTTP Headers add-on to see what gets sent during login. In our case, one more click on “Notenübersicht” (grade overview) is needed afterwards.</p>
<p>Then pour all of that into cURL and provide a writable cookie file, here cookie.txt:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 
'visible' : 'auto' }"> <span class="lv">$username</span>=<span class="s"><span class="dl">&quot;</span><span class="k">My student ID number</span><span class="dl">&quot;</span></span>;<tt> </tt> <span class="lv">$password</span>=<span class="s"><span class="dl">&quot;</span><span class="k">My password ;)</span><span class="dl">&quot;</span></span>;<tt> </tt> <span class="lv">$ch</span> = curl_init();<tt> </tt> <span class="c">//Set variables</span><tt> </tt> <span class="lv">$url</span>=<span class="s"><span class="dl">&quot;</span><span class="k">https://wwwqis.htw-dresden.de/qisserver/rds?state=user&amp;type=1&amp;category=auth.login&amp;startpage=portal.vm</span><span class="dl">&quot;</span></span>;<tt> </tt> <span class="lv">$arrSubmit</span>=<span class="s"><span class="dl">&quot;</span><span class="k">username=</span><span class="lv">$username</span><span class="k">&amp;submit=%C2%A0Ok%C2%A0&amp;password=</span><span class="lv">$password</span><span class="dl">&quot;</span></span>;<tt> </tt> <span class="lv">$cookies</span>=<span class="s"><span class="dl">&quot;</span><span class="k">cookie.txt</span><span class="dl">&quot;</span></span>;<tt> </tt><span class="c">//Set session options</span><tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_URL</span>,<span class="lv">$url</span>);<tt> </tt> curl_setopt (<span class="lv">$ch</span>, <span class="co">CURLOPT_POST</span>, <span class="i">1</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_POSTFIELDS</span>, <span class="lv">$arrSubmit</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_HEADER</span>, <span class="i">0</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_COOKIEJAR</span>, <span 
class="lv">$cookies</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_COOKIEFILE</span>, <span class="lv">$cookies</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_FOLLOWLOCATION</span>, <span class="pc">true</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_RETURNTRANSFER</span>, <span class="pc">true</span>);<tt> </tt><span class="c">//Execute the request (note: certificate verification is switched off here)</span><tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_SSL_VERIFYPEER</span>, <span class="pc">FALSE</span>);<tt> </tt> <span class="lv">$result</span>=curl_exec(<span class="lv">$ch</span>);<tt> </tt> curl_close(<span class="lv">$ch</span>);<tt> </tt></pre></td> </tr></table>
<p>In our HIS-QIS example there is also a kind of second session ID, which has to be read from the response and passed along with the next request:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"> <span class="pd">preg_match</span>(<span class="s"><span class="dl">&quot;</span><span class="k">/asi=([^</span><span class="ch">\&quot;</span><span class="k">]*)</span><span class="ch">\&quot;</span><span class="k">/</span><span class="dl">&quot;</span></span>,<span class="lv">$result</span>,<span class="lv">$treffer</span>);<tt> </tt> <span class="lv">$asi</span>=<span class="lv">$treffer</span>[<span class="i">1</span>];<tt> </tt></pre></td> </tr></table>
<p>Then comes the second pass, this time with the asi:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"> <span class="lv">$url</span>=<span class="s"><span class="dl">&quot;</span><span class="k">https://wwwqis.htw-dresden.de/qisserver/rds?state=htmlbesch&amp;moduleParameter=Student&amp;menuid=notenspiegel&amp;asi=</span><span class="lv">$asi</span><span class="dl">&quot;</span></span>;<tt> </tt> <span class="lv">$ch</span> = curl_init();<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_URL</span>,<span class="lv">$url</span>);<tt> </tt> curl_setopt (<span class="lv">$ch</span>, <span class="co">CURLOPT_POST</span>, <span class="i">0</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_HEADER</span>, <span class="i">0</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_COOKIEJAR</span>, <span class="lv">$cookies</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_COOKIEFILE</span>, <span class="lv">$cookies</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_FOLLOWLOCATION</span>, <span class="pc">true</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_RETURNTRANSFER</span>, <span class="pc">true</span>);<tt> </tt> curl_setopt(<span class="lv">$ch</span>, <span class="co">CURLOPT_SSL_VERIFYPEER</span>, <span class="pc">FALSE</span>);<tt> </tt> <span class="lv">$result</span>=curl_exec(<span class="lv">$ch</span>);<tt> </tt> <span class="c">//echo curl_error($ch);</span><tt> </tt> <span class="c">//Close the session</span><tt> </tt> curl_close(<span class="lv">$ch</span>);<tt> </tt></pre></td> </tr></table>
<p>The page source is now in the variable 
$result<br /> Extracting the relevant rows with XPath</p>
<p>Use Firebug to see where the exam results are located and copy or analyse the XPaths. This gives us a NodeList that we can print or store:</p><table class="CodeRay"><tr> <td class="code"><pre ondblclick="with (this.style) { overflow = (overflow == 'auto' || overflow == '') ? 'visible' : 'auto' }"><span class="lv">$Doc</span> = <span class="r">new</span> <span class="co">DOMDocument</span>();<tt> </tt><span class="lv">$Doc</span>-&gt;loadHTML(<span class="lv">$result</span>);<tt> </tt><span class="lv">$Doc</span>-&gt;preserveWhiteSpace = <span class="pc">false</span>;<tt> </tt><span class="lv">$Doc</span>-&gt;normalizeDocument();<tt> </tt><span class="lv">$XPath</span> = <span class="r">new</span> <span class="co">DOMXPath</span>(<span class="lv">$Doc</span>);<tt> </tt><span class="lv">$NodeList</span> = <span class="lv">$XPath</span>-&gt;query(<span class="s"><span class="dl">&quot;</span><span class="k">//tr[@bgcolor='#EFEFEF']</span><span class="dl">&quot;</span></span>);<tt> </tt><span class="r">foreach</span> (<span class="lv">$NodeList</span> <span class="r">as</span> <span class="lv">$node</span>)<tt> </tt>{ <tt> </tt> <span class="pd">echo</span> <span class="lv">$node</span>-&gt;nodeValue;<tt> </tt> ...<tt> </tt>}<tt> </tt></pre></td> </tr></table>
<p>Almost done. What is still missing?</p>
<ul> <li>Read out the dates and convert them to seconds, so that the grades can then be sorted by date.</li> </ul>
<ul> <li>A simple caching algorithm along the lines of “if our cache file is older than 30 minutes, rebuild it [run the algorithm] and write the result to the cache file; otherwise just output the cache file”</li> </ul>
<ul> <li>Output as an <span class="caps">RSS</span> feed; just google the specification ;)</li> </ul>
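<p>The caching rule from the list above can be sketched roughly as follows. This is only a minimal illustration, written in Ruby (the language of the newer post) rather than PHP. The helper <code>cached_feed</code>, the constant <code>CACHE_MAX_AGE</code> and the demo file name are assumptions made for this example and do not come from the original bot; the block passed to the helper stands in for the real login, scraping and feed-building steps.</p>

```ruby
require "tmpdir"

# The 30 minutes named in the caching rule, expressed in seconds.
CACHE_MAX_AGE = 30 * 60

# Hypothetical helper (not part of the original bot): returns the cached
# content while the cache file is younger than max_age; otherwise runs the
# given block, writes its result to the cache file, and returns that result.
def cached_feed(cache_path, max_age = CACHE_MAX_AGE, now = Time.now)
  fresh = File.exist?(cache_path) && (now - File.mtime(cache_path)) < max_age
  return File.read(cache_path) if fresh

  content = yield # the expensive part: log in, scrape, build the feed
  File.write(cache_path, content)
  content
end

# Demo with a stand-in for the real scraper (the file name is made up).
path = File.join(Dir.tmpdir, "grades_feed_demo.xml")
File.delete(path) if File.exist?(path)

first  = cached_feed(path) { "<rss>fresh</rss>" }            # builds and stores
second = cached_feed(path) { "<rss>not built again</rss>" }  # served from cache

puts first
puts second
```

<p>The <code>now</code> parameter exists only so that the age check can be exercised without waiting half an hour; in normal use the default of <code>Time.now</code> should be enough.</p>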