<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" /> <title>Advanced topics — mechanize 0.4.7 documentation</title> <link rel="stylesheet" type="text/css" href="_static/pygments.css" /> <link rel="stylesheet" type="text/css" href="_static/alabaster.css" /> <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/doctools.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="prev" title="HTML Forms API" href="forms_api.html" /> <link rel="stylesheet" href="_static/custom.css" type="text/css" /> <meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" /> </head><body> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body" role="main"> <section id="advanced-topics"> <h1>Advanced topics<a class="headerlink" href="#advanced-topics" title="Permalink to this headline">¶</a></h1> <section id="thread-safety"> <span id="threading"></span><h2>Thread safety<a class="headerlink" href="#thread-safety" title="Permalink to this headline">¶</a></h2> <p>The global <code class="xref py py-func docutils literal notranslate"><span class="pre">mechanize.urlopen()</span></code> and <code class="xref py py-func docutils literal notranslate"><span class="pre">mechanize.urlretrieve()</span></code> functions are thread safe. However, mechanize browser instances <strong>are not</strong> thread safe. If you want to use a mechanize Browser instance in multiple threads, clone it with <cite>copy.copy(browser_object)</cite>.
The clone shares the same thread-safe cookie jar and has the same settings and handlers as the original, but no other state is shared, making the clone safe to use in a different thread.</p> </section> <section id="using-custom-ca-certificates"> <h2>Using custom CA certificates<a class="headerlink" href="#using-custom-ca-certificates" title="Permalink to this headline">¶</a></h2> <p>mechanize supports the same mechanism for using custom CA certificates as Python >= 2.7.9. To change the certificates a mechanize browser instance uses, call the <a class="reference internal" href="browser_api.html#mechanize.Browser.set_ca_data" title="mechanize.Browser.set_ca_data"><code class="xref py py-meth docutils literal notranslate"><span class="pre">mechanize.Browser.set_ca_data()</span></code></a> method on it.</p> </section> <section id="debugging"> <span id="id1"></span><h2>Debugging<a class="headerlink" href="#debugging" title="Permalink to this headline">¶</a></h2> <p>Hints for debugging programs that use mechanize.</p> <section id="cookies"> <span id="id2"></span><h3>Cookies<a class="headerlink" href="#cookies" title="Permalink to this headline">¶</a></h3> <p>A common mistake is to use <code class="xref py py-func docutils literal notranslate"><span class="pre">mechanize.urlopen()</span></code> <em>and</em> also call the <cite>.extract_cookies()</cite> and <cite>.add_cookie_header()</cite> methods on a cookie jar yourself. If you use <cite>mechanize.urlopen()</cite> (or <cite>OpenerDirector.open()</cite>), the module handles extracting and adding cookies by itself, so you should not call <cite>.extract_cookies()</cite> or <cite>.add_cookie_header()</cite>.</p> <p>Are you sure the server is sending you any cookies in the first place? Maybe the server is keeping track of state in some other way (<cite>HIDDEN</cite> HTML form entries (possibly in a separate page referenced by a frame), URL-encoded session keys, IP address, HTTP <cite>Referer</cite> headers)?
Perhaps some embedded script in the HTML is setting cookies (see below)? Turn on <a class="reference internal" href="#logging"><span class="std std-ref">Logging</span></a>.</p> <p>When you <cite>.save()</cite> to or <cite>.load()</cite>/<cite>.revert()</cite> from a file, single-session cookies will expire unless you explicitly request otherwise with the <cite>ignore_discard</cite> argument. This may be your problem if you find cookies are going away after saving and loading.</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">mechanize</span>
<span class="n">cj</span> <span class="o">=</span> <span class="n">mechanize</span><span class="o">.</span><span class="n">LWPCookieJar</span><span class="p">()</span>
<span class="n">opener</span> <span class="o">=</span> <span class="n">mechanize</span><span class="o">.</span><span class="n">build_opener</span><span class="p">(</span><span class="n">mechanize</span><span class="o">.</span><span class="n">HTTPCookieProcessor</span><span class="p">(</span><span class="n">cj</span><span class="p">))</span>
<span class="n">mechanize</span><span class="o">.</span><span class="n">install_opener</span><span class="p">(</span><span class="n">opener</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">mechanize</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="s2">"http://foobar.com/"</span><span class="p">)</span>
<span class="n">cj</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">"/some/file"</span><span class="p">,</span> <span class="n">ignore_discard</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">ignore_expires</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div> </div> <p>JavaScript code can set cookies; mechanize
does not support this. See <a class="reference internal" href="faq.html#jsfaq"><span class="std std-ref">JavaScript is messing up my web-scraping. What do I do?</span></a>.</p> </section> <section id="general"> <h3>General<a class="headerlink" href="#general" title="Permalink to this headline">¶</a></h3> <p>Enable <a class="reference internal" href="#logging"><span class="std std-ref">Logging</span></a>.</p> <p>Sometimes, a server wants particular HTTP headers set to the values it expects. For example, the <cite>User-Agent</cite> header may need to be set (<a class="reference internal" href="browser_api.html#mechanize.Browser.set_header" title="mechanize.Browser.set_header"><code class="xref py py-meth docutils literal notranslate"><span class="pre">mechanize.Browser.set_header()</span></code></a>) to a value like that of a popular browser.</p> <p>Check that the browser is able to do manually what you’re trying to achieve programmatically. Make sure that what you do manually is <em>exactly</em> the same as what you’re trying to do from Python – you may simply be hitting a server bug that only gets revealed if you view pages in a particular order, for example.</p> <p>Try comparing the headers and data that your program sends with those that a browser sends. Often this will give you the clue you need. You can use the developer tools in any browser to see exactly what the browser sends and receives.</p> <p>If nothing is obviously wrong with the requests your program is sending and you’re out of ideas, you can reliably locate the problem by copying the headers that a browser sends, and then changing headers until your program stops working again. Temporarily switch to explicitly sending individual HTTP headers (by calling <cite>.add_header()</cite>, or by using <cite>http.client</cite>, named <cite>httplib</cite> on Python 2, directly). Start by sending exactly the headers that Firefox or Chrome send.
You may need to make sure that a valid session ID is sent – the one you got from your browser may no longer be valid. If that works, you can begin the tedious process of changing your headers and data until they match what your original code was sending. You should end up with a minimal set of changes. If you think that reveals a bug in mechanize, please report it.</p> </section> <section id="logging"> <span id="id3"></span><h3>Logging<a class="headerlink" href="#logging" title="Permalink to this headline">¶</a></h3> <p>To enable logging to stdout:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sys</span><span class="o">,</span> <span class="nn">logging</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s2">"mechanize"</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">addHandler</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">StreamHandler</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)</span>
</pre></div> </div> <p>You can reduce the amount of information shown by setting the level to <cite>logging.INFO</cite> instead of <cite>logging.DEBUG</cite>, or by only enabling logging for one of the following logger names instead of <cite>"mechanize"</cite>:</p> <blockquote> <div><ul class="simple"> <li><p><cite>"mechanize"</cite>: Everything.</p></li> <li><p><cite>"mechanize.cookies"</cite>: Why particular cookies are accepted or rejected and why they are or are
not returned. Requires logging enabled at the <cite>DEBUG</cite> level.</p></li> <li><p><cite>"mechanize.http_responses"</cite>: HTTP response body data.</p></li> <li><p><cite>"mechanize.http_redirects"</cite>: HTTP redirect information.</p></li> </ul> </div></blockquote> </section> <section id="http-headers"> <h3>HTTP headers<a class="headerlink" href="#http-headers" title="Permalink to this headline">¶</a></h3> <p>An example showing how to enable printing of HTTP headers to stdout, logging of HTTP response bodies, and logging of information about redirections:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sys</span><span class="o">,</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">mechanize</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s2">"mechanize"</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">addHandler</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">StreamHandler</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)</span>

<span class="n">browser</span> <span class="o">=</span> <span class="n">mechanize</span><span class="o">.</span><span class="n">Browser</span><span class="p">()</span>
<span class="n">browser</span><span class="o">.</span><span class="n">set_debug_http</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>
<span class="n">browser</span><span class="o">.</span><span class="n">set_debug_responses</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>
<span class="n">browser</span><span class="o">.</span><span class="n">set_debug_redirects</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">browser</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s2">"http://python.org/"</span><span class="p">)</span>
</pre></div> </div> <p>Alternatively, you can examine request and response objects to see what’s going on. Note that requests may involve “sub-requests” in cases such as redirection, in which case you will not see everything that’s going on just by examining the original request and final response.</p> </section> </section> </section> </div> </div> </div> <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> <div class="sphinxsidebarwrapper"> <h1 class="logo"><a href="index.html">mechanize</a></h1> <h3>Navigation</h3> <p class="caption" role="heading"><span class="caption-text">Table of Contents:</span></p> <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="faq.html">Frequently Asked Questions</a></li> <li class="toctree-l1"><a class="reference internal" href="browser_api.html">Browser API</a></li> <li class="toctree-l1"><a class="reference internal" href="forms_api.html">HTML Forms API</a></li> <li class="toctree-l1 current"><a class="current reference internal" href="#">Advanced topics</a><ul> <li class="toctree-l2"><a class="reference internal" href="#thread-safety">Thread safety</a></li> <li class="toctree-l2"><a class="reference internal" href="#using-custom-ca-certificates">Using custom CA certificates</a></li> <li class="toctree-l2"><a class="reference internal" href="#debugging">Debugging</a></li> </ul> </li> </ul> <div class="relations"> <h3>Related Topics</h3> <ul> <li><a
href="index.html">Documentation overview</a><ul> <li>Previous: <a href="forms_api.html" title="previous chapter">HTML Forms API</a></li> </ul></li> </ul> </div> <div id="searchbox" style="display: none" role="search"> <h3 id="searchlabel">Quick search</h3> <div class="searchformwrapper"> <form class="search" action="search.html" method="get"> <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> <input type="submit" value="Go" /> </form> </div> </div> <script>$('#searchbox').show(0);</script> </div> </div> <div class="clearer"></div> </div> <div class="footer"> ©2021, Kovid Goyal. | Powered by <a href="http://sphinx-doc.org/">Sphinx 4.3.2</a> & <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a> | <a href="_sources/advanced.rst.txt" rel="nofollow">Page source</a> </div> </body> </html>