<p>Notes of life & code (feed updated 2014-01-01T14:20:39+08:00)</p>
<h2>Python + Requests encoding-detection bug</h2>
<p>Li Guangming. Published 2014-01-01T11:23:41+08:00, updated 2014-01-01T14:20:39+08:00. /python-requests-ge-encoding-from-headers.html</p>
<p><img alt="Requests" src="/static/images/requests-logo.png"></p>
<p>Requests is an HTTP library released under the Apache2 license. It is written in Python and aims to be friendlier and easier to use.</p>
<p>Requests is built on urllib3 and inherits all of its features. It supports HTTP keep-alive and connection pooling, sessions with cookie persistence, file uploads, automatic detection of the response content's encoding, internationalized URLs, and automatic encoding of POST data. Modern, international, and human-friendly.</p>
<p>While using Requests recently I ran into a problem: when scraping certain Chinese web pages the text came out garbled, and printing <code>encoding</code> showed ISO-8859-1. Why? Reading the source, I found that the default encoding detection is quite simple. It looks only at the Content-Type response header: if a charset is present, the encoding is recognized correctly; if there is no charset but the type contains <code>text</code>, it is assumed to be ISO-8859-1. See utils.py.</p>
<div class="highlight"><pre><span></span>def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.
    :param headers: dictionary to extract encoding from.
    """
    content_type = headers.get('content-type')
    if not content_type:
        return None
    content_type, params = cgi.parse_header(content_type)
    if 'charset' in params:
        return params['charset'].strip("'\"")
    if 'text' in content_type:
        return 'ISO-8859-1'
</pre></div>
<p>Requests does in fact provide a way to obtain the encoding from the content; it is simply not used by default. See utils.py:</p>
<div class="highlight"><pre><span></span>def get_encodings_from_content(content):
    """Returns encodings from given content string.
    :param content: bytestring to extract encodings from.
    """
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
</pre></div>
<p>It also provides chardet-based encoding detection; see models.py:</p>
<div class="highlight"><pre><span></span>@property
def apparent_encoding(self):
    """The apparent encoding, provided by the lovely Charade library
    (Thanks, Ian!)."""
    return chardet.detect(self.content)['encoding']
</pre></div>
<p>How can this be fixed? First, an example:</p>
<div class="highlight"><pre><span></span>>>> r = requests.get('http://cn.python-requests.org/en/latest/')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> requests.utils.get_encodings_from_content(r.content)
['utf-8']
>>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'gb2312'
>>> requests.utils.get_encodings_from_content(r.content)
['gb2312']
</pre></div>
<p>With that understood, the problem can be worked around with a monkey patch:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">requests</span>
<span class="k">def</span> <span class="nf">monkey_patch</span><span class="p">():</span>
    <span class="n">prop</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Response</span><span class="o">.</span><span class="n">content</span>
    <span class="k">def</span> <span class="nf">content</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">_content</span> <span class="o">=</span> <span class="n">prop</span><span class="o">.</span><span class="n">fget</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">encoding</span> <span class="o">==</span> <span class="s1">'ISO-8859-1'</span><span class="p">:</span>
            <span class="n">encodings</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">get_encodings_from_content</span><span class="p">(</span><span class="n">_content</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">encodings</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">encoding</span> <span class="o">=</span> <span class="n">encodings</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">encoding</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">apparent_encoding</span>
            <span class="n">_content</span> <span class="o">=</span> <span class="n">_content</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">encoding</span><span class="p">,</span> <span class="s1">'replace'</span><span class="p">)</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf8'</span><span class="p">,</span> <span class="s1">'replace'</span><span class="p">)</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">_content</span> <span class="o">=</span> <span class="n">_content</span>
        <span class="k">return</span> <span class="n">_content</span>
    <span class="n">requests</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Response</span><span class="o">.</span><span class="n">content</span> <span class="o">=</span> <span class="nb">property</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
<span class="n">monkey_patch</span><span class="p">()</span>
</pre></div>
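<p>A less invasive alternative to patching <code>Response.content</code> is to choose the encoding explicitly before decoding. The following is only a minimal sketch, not Requests' own logic: <code>pick_encoding</code> and <code>decode_body</code> are hypothetical helpers that take the raw Content-Type value and the body bytes, prefer a charset from the header, then one from a <code>&lt;meta&gt;</code> tag, and finally fall back to an assumed default of UTF-8.</p>

```python
import re

# Hypothetical helpers for illustration; they are not part of Requests.
_HEADER_CHARSET_RE = re.compile(r'charset=["\']?([\w.:-]+)', re.I)
_META_CHARSET_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w.:-]+)', re.I)


def pick_encoding(content_type, body, fallback="utf-8"):
    """Return the header charset, else a <meta> charset, else the fallback."""
    header_match = _HEADER_CHARSET_RE.search(content_type or "")
    if header_match:
        return header_match.group(1)
    # Only scan the start of the document; <meta> tags live in <head>.
    meta_match = _META_CHARSET_RE.search(body[:4096])
    if meta_match:
        return meta_match.group(1).decode("ascii")
    return fallback


def decode_body(content_type, body):
    """Decode body bytes to text, replacing bytes that cannot be decoded."""
    return body.decode(pick_encoding(content_type, body), errors="replace")
```

<p>With Requests, one would presumably pass <code>r.headers.get('content-type')</code> and <code>r.content</code> to <code>decode_body</code>, or assign the result of <code>pick_encoding</code> to <code>r.encoding</code> before reading <code>r.text</code>.</p>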
<p>Related articles:</p>
<ul>
<li><a href="http://docs.python-requests.org/en/latest/">Requests: HTTP for Humans</a></li>
<li><a href="http://www.au92.com/archives/python-requests-chinese-improve-random-code.html">An improved fix for garbled Chinese when scraping with Python + Requests</a></li>
</ul>
<h2>An unconventional way to encrypt PHP code</h2>
<p>Li Guangming. Published 2013-12-25T18:01:20+08:00, updated 2013-12-25T18:11:53+08:00. /other-methods-protecting-your-php-code.html</p>
<p>These tools run the code file through functions such as gzcompress and base64_encode several times and mix in large numbers of Chinese and other hard-to-read characters, in order to encrypt and obfuscate it. PHP 5 and later accept <a href="http://www.php.net/manual/en/functions.user-defined.php">function, variable, and class names</a> that match the rule <code>[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*</code>.</p>
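<p>The layering described above can be sketched in Python. This is an illustrative stand-in, not the code any of these products actually use: <code>obfuscate</code> and <code>restore</code> are hypothetical names, the number of rounds is an assumption, and Python's <code>zlib</code> module is used because PHP's gzcompress also produces zlib-format data. The point of the sketch is that every layer is a reversible encoding.</p>

```python
import base64
import zlib


def obfuscate(source: bytes, rounds: int = 3) -> bytes:
    """Apply compress-then-base64 `rounds` times (hypothetical stand-in)."""
    data = source
    for _ in range(rounds):
        data = base64.b64encode(zlib.compress(data))
    return data


def restore(blob: bytes, rounds: int = 3) -> bytes:
    """Undo obfuscate() by peeling off each layer in reverse order."""
    data = blob
    for _ in range(rounds):
        data = zlib.decompress(base64.b64decode(data))
    return data
```

<p>Because no secret key is involved, anyone who knows the sequence of transformations can reverse it, which is why such schemes obfuscate rather than encrypt.</p>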
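<p>Three products of this kind are common online:</p>
<ul>
<li>
<p><a href="http://www.phpdp.org">PHP神盾</a></p>
<p>PHP神盾 is a PHP encryption tool whose output needs no additional extension to run. It describes its protection strength as among the best of such products.</p>
</li>
<li>
<p><a href="http://www.phpjm.net">PHP加密</a></p>
<p>The online PHP encryption platform (phpjm.net) is a free service for encrypting and protecting PHP source code. The encrypted code needs no additional extension to run and no third-party component on the server, so it runs in any ordinary PHP environment.</p>
</li>
<li>
<p><a href="http://www.hcache.com/">易盾PHP加密</a></p>
<p>易盾PHP加密 claims to protect your PHP source code from being cracked: once encrypted, the real source cannot be recovered from the program, whether it was sold through legitimate channels or obtained illegally, so your intellectual property is protected.</p>
</li>
</ul>
<p>PHP神盾 and PHP加密 need no third-party components. 易盾PHP加密 requires installing its own component: the Windows version can be downloaded, while the Linux version is provided only after purchase. Since I have no Windows environment, I set it aside for now. Using the online encryption offered by the sites above, I uploaded a simple PHP script:</p>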
<div class="highlight"><pre><span></span><span class="cp"><?php</span>
<span class="k">function</span> <span class="nf">test</span><span class="p">(){</span>
<span class="k">echo</span> <span class="s1">'hello world.'</span><span class="p">;</span>
<span class="p">}</span>
<span class="nx">test</span><span class="p">();</span>
<span class="cp">?></span>
</pre></div>
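<p>I ran the encrypted script locally and found that the original, pre-encryption source can be printed. This confirmed my guess: the encrypted code ultimately has to pass through the <code>eval</code> language construct, which cannot be overridden from within PHP code, so a small modification to the PHP interpreter itself is enough to recover the source as it was before encryption. Screenshots follow.</p>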
<p>PHP加密</p>
<p><img alt="PHP加密" src="/static/images/phpjm.png"></p>
<p>PHP神盾</p>
<p><img alt="PHP神盾" src="/static/images/phpdp.png"></p>
<p>PHP神盾 also embeds a piece of JavaScript code in the output:</p>
<div class="highlight"><pre><span></span>http://www.phpdp.org/index.php?mod=decode&code_key=xxx&sign=xxx
</pre></div>
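<p>Opening it shows an infringement warning (roughly: "Warning: your actions have violated this program's terms of use. Please stop!"):</p>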
<div class="highlight"><pre><span></span>警告:您的行为已侵犯了本程式的使用条约,请停止您的脚步!
</pre></div>
<p>A quick disclaimer: this post is for study and research only. I do not offer source-cracking or any related service.</p>
<h2>OpenCC Python binding</h2>
<p>Li Guangming. Published 2013-05-09T12:00:43+08:00, updated 2013-05-09T16:21:55+08:00. /opencc-python-binding.html</p>
<ol>
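<li>
<p>Why pyOpenCC exists</p>
<p>For the <a href="http://readcola.com/">readcola</a> project I needed to convert some Traditional Chinese e-books into Simplified Chinese. Testing showed that OpenCC gives very good results, and because it is open source it is easy to integrate with existing tools.
On pip I found <a href="https://pypi.python.org/pypi/opencc-python/">opencc-python</a>, but testing showed that it only calls the <a href="http://code.google.com/p/opencc/">OpenCC</a> command line and also limits the length of the text it can convert.
So, just to try it, I looked up how to write a Python C extension and called the <a href="http://code.google.com/p/opencc/">OpenCC</a> API directly. After many rounds of debugging, this project was the result. It is the first Python C extension I have written.</p>
</li>
<li>
<p>OpenCC</p>
<p>Open Chinese Convert (<a href="http://code.google.com/p/opencc/">OpenCC</a>) is an open-source project for converting between Simplified and Traditional Chinese, dedicated to building high-quality conversion dictionaries based on statistical corpora. It also provides a library (libopencc), a command-line conversion tool, a manual proofreading tool, a dictionary generator, an online conversion service, and a graphical user interface.</p>
</li>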
<li>
<p>What is pyOpenCC?</p>
<p>pyOpenCC is a Python wrapper for <a href="http://code.google.com/p/opencc/">Open Chinese Converter</a></p>
</li>
<li>
<p>Installation</p>
<p>You need to install the OpenCC development package first. To install OpenCC:</p>
<p>Debian:</p>
<div class="highlight"><pre><span></span>apt-get install libopencc-dev -y
</pre></div>
<p>FreeBSD:</p>
<div class="highlight"><pre><span></span>cd /usr/ports/chinese/opencc
make install clean
</pre></div>
<p>To install pyopencc:</p>
<div class="highlight"><pre><span></span>git clone https://github.com/cute/pyopencc.git
cd pyopencc
python setup.py build_ext -I /usr/local/include/opencc/
python setup.py install
</pre></div>
</li>
<li>
<p>How to use it?</p>
<p>Following is a simple example:</p>
<div class="highlight"><pre><span></span><span class="c1"># -*- coding: utf8 -*-</span>
<span class="kn">import</span> <span class="nn">opencc</span>
<span class="n">cc</span> <span class="o">=</span> <span class="n">opencc</span><span class="o">.</span><span class="n">OpenCC</span><span class="p">(</span><span class="s1">'zht2zhs.ini'</span><span class="p">)</span>
<span class="k">print</span> <span class="n">cc</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">u'Open Chinese Convert(OpenCC)「開放中文轉換」,是一個致力於中文簡繁轉換的項目,提供高質量詞庫和函數庫(libopencc)。'</span><span class="p">)</span>
</pre></div>
<p>And the output should be:</p>
<div class="highlight"><pre><span></span>Open Chinese Convert(OpenCC)「开放中文转换」,是一个致力于中文简繁转换的项目,提供高质量词库和函数库(libopencc)。
</pre></div>
<p>OpenCC ships four conversion configurations:</p>
<ul>
<li>zht2zhs.ini - Traditional Chinese to Simplified Chinese</li>
<li>zhs2zht.ini - Simplified Chinese to Traditional Chinese</li>
<li>mix2zht.ini - Mixed to Traditional Chinese</li>
<li>mix2zhs.ini - Mixed to Simplified Chinese</li>
</ul>
</li>
</ol>
<h2>How to tell whether a PHP command-line script has piped input</h2>
<p>Li Guangming. Published 2013-02-22T11:56:22+08:00, updated 2013-02-22T12:02:30+08:00. /set-php-stdin-nonblocking.html</p>
<p>Reading from STDIN in PHP is a blocking operation, so reading its contents directly blocks the script. The following code keeps waiting until data arrives:</p>
<div class="highlight"><pre><span></span><span class="p">$</span><span class="nv">data</span><span class="x"> = stream_get_contents(STDIN);</span>
</pre></div>
<p>In theory, calling stream_set_blocking(STDIN, FALSE) should take care of this:</p>
<div class="highlight"><pre><span></span><span class="x">stream_set_blocking(STDIN, FALSE);</span>
<span class="p">$</span><span class="nv">data</span><span class="x"> = stream_get_contents(STDIN);</span>
</pre></div>
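<p>In practice it still blocks. This is actually a bug, described at <a href="https://bugs.php.net/bug.php?id=34972">https://bugs.php.net/bug.php?id=34972</a></p>
<p>Testing shows that the ftell function, which returns the current read/write position of the STDIN file handle, can be used to decide whether piped input is present.</p>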
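<p>For comparison, a similar check can be sketched in Python. Note that this swaps the ftell test for a different technique: it assumes that "STDIN is not an interactive terminal" means "input was piped or redirected". <code>has_piped_input</code> and <code>read_piped_input</code> are hypothetical helper names. The author's PHP version follows.</p>

```python
import sys


def has_piped_input(stream=None):
    """Return True when the stream is not an interactive terminal.

    Assumption: "not a TTY" is treated as "input was piped or redirected".
    """
    stream = sys.stdin if stream is None else stream
    return not stream.isatty()


def read_piped_input(stream=None):
    """Read the whole stream if it is piped; otherwise return None without reading."""
    stream = sys.stdin if stream is None else stream
    return stream.read() if has_piped_input(stream) else None
```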
<div class="highlight"><pre><span></span><span class="x">if(ftell(STDIN)===0)</span><span class="err">{</span><span class="x"></span>
<span class="x"> </span><span class="p">$</span><span class="nv">data</span><span class="x"> = stream_get_contents(STDIN);</span>
<span class="x">}</span>
</pre></div>
<h2>How can you stop the browser from sending a Referer when it follows a link?</h2>
<p>Li Guangming. Published 2012-09-30T16:48:54+08:00, updated 2012-09-30T17:08:53+08:00. /link-without-referer.html</p>
<hr>
<p>When we click a link on one site to reach another page, the browser adds a Referer value to the request headers to identify the page the visit came from. This can leak the user's privacy, and sometimes I do not want others to know where I clicked through from. Is there a way to keep the browser from sending the Referer?</p>
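<ul>
<li>Use the solution added in HTML5: set rel="noreferrer" to declare the link type as <a href="http://www.whatwg.org/specs/web-apps/current-work/multipage/links.html#link-type-noreferrer" title="noreferrer">noreferrer</a>. At the time of writing, only Chrome 4+ supported it.</li>
<li>Use an intermediate page, although a referrer is in fact still sent; for example Google's link redirect, or <a href="https://github.com/knu/noreferrer" title="noreferrer.js">noreferrer.js</a>.</li>
<li>Route the navigation through a javascript: protocol link, as described below.</li>
</ul>
<h4>Open a new window, equivalent to target="_blank":</h4>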
<div class="highlight"><pre><span></span><span class="nt">function</span> <span class="nt">open_window</span><span class="o">(</span><span class="nt">link</span><span class="o">)</span><span class="p">{</span>
<span class="n">var</span> <span class="n">arg</span> <span class="o">=</span> <span class="s1">'\u003cscript\u003elocation.replace("'</span><span class="o">+</span><span class="n">link</span><span class="o">+</span><span class="s1">'")\u003c/script\u003e'</span><span class="p">;</span>
<span class="n">window</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'javascript:window.name;'</span><span class="o">,</span> <span class="n">arg</span><span class="p">);</span>
<span class="p">}</span>
</pre></div>
<h4>Redirect to a link, equivalent to target="_self":</h4>
<div class="highlight"><pre><span></span><span class="nt">function</span> <span class="nt">redirect</span><span class="o">(</span><span class="nt">link</span><span class="o">)</span><span class="p">{</span>
<span class="n">var</span> <span class="n">arg</span> <span class="o">=</span><span class="s1">'\u003cscript\u003etop.location.replace("'</span><span class="o">+</span><span class="n">link</span><span class="o">+</span><span class="s1">'")\u003c/script\u003e'</span><span class="p">;</span>
<span class="n">var</span> <span class="n">iframe</span> <span class="o">=</span> <span class="n">document</span><span class="o">.</span><span class="n">createElement</span><span class="p">(</span><span class="s1">'iframe'</span><span class="p">);</span>
<span class="n">iframe</span><span class="o">.</span><span class="n">src</span><span class="o">=</span><span class="s1">'javascript:window.name;'</span><span class="p">;</span>
<span class="n">iframe</span><span class="o">.</span><span class="n">name</span><span class="o">=</span><span class="n">arg</span><span class="p">;</span>
<span class="n">document</span><span class="o">.</span><span class="n">body</span><span class="o">.</span><span class="n">appendChild</span><span class="p">(</span><span class="n">iframe</span><span class="p">);</span>
<span class="p">}</span>
</pre></div>
<h4>Other links:</h4>
<ul>
<li>An answer I wrote to a question on <a href="http://segmentfault.com/q/1010000000123441#a-1020000000123513" title="SegmentFault">SegmentFault</a>.</li>
</ul>
<h2>Kindle touch me</h2>
<p>Li Guangming. Published 2012-08-15T16:17:34+08:00, updated 2014-01-01T12:18:29+08:00. /kindle-touch-me.html</p>
<p>The Kindle has changed my reading habits. Since buying a Kindle touch, I have used my commute on buses and the subway to read quite a few books, including books and material I never had time for before. It really feels like a bargain.</p>
<p><img alt="readcola.com" src="/static/images/screenshot-2.png"></p>
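<p>I worked out how to turn the material I want to read into books and push them to the Kindle through Amazon's delivery service.
<img alt="readcola.com" src="/static/images/screenshot-1.png"></p>
<p>To produce good-looking e-magazines, I looked up the MobiPocket document structure online and implemented Mobi e-book generation entirely in Python.</p>
<p><img alt="readcola.com" src="/static/images/screenshot-3.png"></p>
<p>Amazon's kindlegen and calibre's ebook-convert could basically meet my needs, but testing revealed some shortcomings:</p>
<ul>
<li>They are slow: for the same book, ebook-convert takes 9 seconds and kindlegen 5 seconds, while my script needs only about 0.5 seconds.</li>
<li>kindlegen does not support FreeBSD; it currently runs only on Linux, Windows, and Mac.</li>
<li>calibre is undeniably a powerful desktop e-book manager, but it is too bloated for my purposes.</li>
<li>mobiperl, a mobi e-book generator written in Perl, is rather dated. In my trial it could generate mobi4 e-books but could not resolve relative paths.</li>
</ul>
<p><img alt="readcola.com" src="/static/images/screenshot-4.png"></p>
<p>I have recently been reading Hackers &amp; Painters (《画家与黑客》), a collection of essays by Paul Graham, the "father of Silicon Valley startups". In the words of a Douban review: it suits every programmer and Internet entrepreneur, and every reader interested in the computer industry.</p>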
<p><img alt="画家与黑客" src="http://img1.douban.com/lpic/s4669554.jpg"></p>
<h2>Tips for Nginx's return directive</h2>
<p>Li Guangming. Published 2012-05-10T18:32:01+08:00, updated 2012-07-20T15:33:01+08:00. /nginx-return-tips.html</p>
<p><img alt="关于Nginx的return关键字小技巧" src="/static/images/nginx-logo.png"></p>
<p>Nginx's <a href="http://wiki.nginx.org/HttpRewriteModule#return">return</a> directive belongs to the HttpRewriteModule:</p>
<div class="highlight"><pre><span></span>Syntax: return http-status-code
Default: none
Context: server, location, if
This directive stops processing and returns the HTTP status code to the client.
Supported HTTP status codes: 200, 204, 400, 402-406, 408, 410, 411, 413, 416, 500-504, plus the non-standard 444.
</pre></div>
<h3>Usage:</h3>
<div class="highlight"><pre><span></span># Requests that do not match the rule get 403 Forbidden
location /download/ {
    rewrite ^(/download/.*)/media/(.*)\..*$ $1/mp3/$2.mp3 break;
    return 403;
}
</pre></div>
<h3>Tips</h3>
<div class="highlight"><pre><span></span>None of these tips are described in the wiki, yet nginx supports them.
</pre></div>
<p>Take the following configuration:</p>
<div class="highlight"><pre><span></span><span class="nt">server</span> <span class="p">{</span>
    <span class="n">server_name</span> <span class="n">test</span><span class="o">.</span><span class="n">liguangming</span><span class="o">.</span><span class="n">com</span><span class="p">;</span>
    <span class="n">listen</span> <span class="m">80</span><span class="p">;</span>
    <span class="n">location</span> <span class="o">/</span> <span class="err">{</span>
        <span class="n">add_header</span> <span class="n">Content</span><span class="o">-</span><span class="n">Type</span> <span class="s2">"text/plain;charset=utf-8"</span><span class="p">;</span>
        <span class="n">return</span> <span class="m">200</span> <span class="s2">"Your IP Address:$remote_addr"</span><span class="p">;</span>
    <span class="p">}</span>
<span class="err">}</span>
</pre></div>
<p>Send a request:</p>
<div class="highlight"><pre><span></span>curl -i http://test.liguangming.com
</pre></div>
<p>The response is:</p>
<div class="highlight"><pre><span></span><span class="nt">HTTP</span><span class="o">/</span><span class="nt">1</span><span class="nc">.1</span> <span class="nt">200</span> <span class="nt">OK</span>
<span class="nt">Server</span><span class="o">:</span> <span class="nt">nginx</span><span class="o">/</span><span class="nt">1</span><span class="nc">.0.13</span>
<span class="nt">Date</span><span class="o">:</span> <span class="nt">Thu</span><span class="o">,</span> <span class="nt">10</span> <span class="nt">May</span> <span class="nt">2012</span> <span class="nt">10</span><span class="nd">:01:15</span> <span class="nt">GMT</span>
<span class="nt">Content-Type</span><span class="o">:</span> <span class="nt">application</span><span class="o">/</span><span class="nt">octet-stream</span>
<span class="nt">Content-Length</span><span class="o">:</span> <span class="nt">30</span>
<span class="nt">Connection</span><span class="o">:</span> <span class="nt">keep-alive</span>
<span class="nt">Content-Type</span><span class="o">:</span> <span class="nt">text</span><span class="o">/</span><span class="nt">plain</span><span class="o">;</span><span class="nt">charset</span><span class="o">=</span><span class="nt">utf-8</span>
<span class="nt">Your</span> <span class="nt">IP</span> <span class="nt">Address</span><span class="nd">:123</span><span class="nc">.128.217.19</span>
</pre></div>
<p>Fun, right? There is more. Take this configuration:</p>
<div class="highlight"><pre><span></span><span class="nt">server</span> <span class="p">{</span>
    <span class="n">server_name</span> <span class="n">test</span><span class="o">.</span><span class="n">liguangming</span><span class="o">.</span><span class="n">com</span><span class="p">;</span>
    <span class="n">listen</span> <span class="m">80</span><span class="p">;</span>
    <span class="n">location</span> <span class="o">/</span> <span class="err">{</span>
        <span class="n">return</span> <span class="n">http</span><span class="o">://</span><span class="n">liguangming</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="p">;</span>
    <span class="p">}</span>
<span class="err">}</span>
</pre></div>
<p>Send a request:</p>
<div class="highlight"><pre><span></span>curl -i http://test.liguangming.com
</pre></div>
<p>The response is:</p>
<div class="highlight"><pre><span></span>HTTP/1.1 302 Moved Temporarily
Server: nginx/1.0.13
Date: Thu, 10 May 2012 10:06:58 GMT
Content-Type: text/html
Content-Length: 161
Connection: keep-alive
Location: http://liguangming.com/
Content-Type: text/plain;charset=utf-8
<span class="nt"><html></span>
<span class="nt"><head><title></span>302 Found<span class="nt"></title></head></span>
<span class="nt"><body</span> <span class="na">bgcolor=</span><span class="s">"white"</span><span class="nt">></span>
<span class="nt"><center><h1></span>302 Found<span class="nt"></h1></center></span>
<span class="nt"><hr><center></span>nginx/1.0.13<span class="nt"></center></span>
<span class="nt"></body></span>
<span class="nt"></html></span>
</pre></div>
<p>It is a 302 redirect. Why? In the nginx source, open src/http/modules/ngx_http_rewrite_module.c
and find the configuration entry that parses the return directive:</p>
<div class="highlight"><pre><span></span>{ ngx_string("return"),
NGX_HTTP_SRV_CONF|NGX_HTTP_SIF_CONF|NGX_HTTP_LOC_CONF|NGX_HTTP_LIF_CONF
|NGX_CONF_TAKE12,
ngx_http_rewrite_return,
NGX_HTTP_LOC_CONF_OFFSET,
0,
NULL },
</pre></div>
<p>Note NGX_CONF_TAKE12: it turns out that return accepts either one or two arguments.</p>
<p>Reading further down, we find the ngx_http_rewrite_return function:</p>
<div class="highlight"><pre><span></span><span class="x">static char *</span>
<span class="x">ngx_http_rewrite_return(ngx_conf_t *cf, ngx_command_t *cmd, void *conf)</span>
<span class="err">{</span><span class="x"></span>
<span class="x"> ngx_http_rewrite_loc_conf_t *lcf = conf;</span>
<span class="x"> u_char *p;</span>
<span class="x"> ngx_str_t *value, *v;</span>
<span class="x"> ngx_http_script_return_code_t *ret;</span>
<span class="x"> ngx_http_compile_complex_value_t ccv;</span>
<span class="x"> ret = ngx_http_script_start_code(cf->pool, &lcf->codes,</span>
<span class="x"> sizeof(ngx_http_script_return_code_t));</span>
<span class="x"> if (ret == NULL) </span><span class="err">{</span><span class="x"></span>
<span class="x"> return NGX_CONF_ERROR;</span>
<span class="x"> }</span>
<span class="x"> value = cf->args->elts;</span>
<span class="x"> ngx_memzero(ret, sizeof(ngx_http_script_return_code_t));</span>
<span class="x"> ret->code = ngx_http_script_return_code;</span>
<span class="x"> p = value[1].data;</span>
<span class="x"> ret->status = ngx_atoi(p, value[1].len);</span>
<span class="x"> if (ret->status == (uintptr_t) NGX_ERROR) </span><span class="err">{</span><span class="x"></span>
<span class="x"> if (cf->args->nelts == 2</span>
<span class="x"> && (ngx_strncmp(p, "http://", sizeof("http://") - 1) == 0</span>
<span class="x"> || ngx_strncmp(p, "https://", sizeof("https://") - 1) == 0</span>
<span class="x"> || ngx_strncmp(p, "</span><span class="p">$</span><span class="nv">scheme</span><span class="x">", sizeof("</span><span class="p">$</span><span class="nv">scheme</span><span class="x">") - 1) == 0))</span>
<span class="x"> </span><span class="err">{</span><span class="x"></span>
<span class="x"> ret->status = NGX_HTTP_MOVED_TEMPORARILY;</span>
<span class="x"> v = &value[1];</span>
<span class="x"> } else </span><span class="err">{</span><span class="x"></span>
<span class="x"> ngx_conf_log_error(NGX_LOG_EMERG, cf, 0,</span>
<span class="x"> "invalid return code \"%V\"", &value[1]);</span>
<span class="x"> return NGX_CONF_ERROR;</span>
<span class="x"> }</span>
<span class="x"> } else </span><span class="err">{</span><span class="x"></span>
<span class="x"> if (cf->args->nelts == 2) </span><span class="err">{</span><span class="x"></span>
<span class="x"> return NGX_CONF_OK;</span>
<span class="x"> }</span>
<span class="x"> v = &value[2];</span>
<span class="x"> }</span>
<span class="x"> ngx_memzero(&ccv, sizeof(ngx_http_compile_complex_value_t));</span>
<span class="x"> ccv.cf = cf;</span>
<span class="x"> ccv.value = v;</span>
<span class="x"> ccv.complex_value = &ret->text;</span>
<span class="x"> if (ngx_http_compile_complex_value(&ccv) != NGX_OK) </span><span class="err">{</span><span class="x"></span>
<span class="x"> return NGX_CONF_ERROR;</span>
<span class="x"> }</span>
<span class="x"> return NGX_CONF_OK;</span>
<span class="x">}</span>
</pre></div>
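<p>With a single argument, the value does not have to be a status code. If it is a URL that starts with http://, https://, or $scheme (the same scheme as the request), nginx returns a 302 redirect.
This is equivalent to:</p>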
<div class="highlight"><pre><span></span><span class="nt">return</span> <span class="nt">302</span> <span class="nt">http</span><span class="o">://</span><span class="nt">liguangming</span><span class="nc">.com</span><span class="o">/;</span>
</pre></div>
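<p>The branching in ngx_http_rewrite_return can be paraphrased in Python. This is a simplified sketch for reading along, not nginx code: <code>parse_return_args</code> is a hypothetical name, the arguments are the words that follow <code>return</code>, and a digits-only check stands in for ngx_atoi.</p>

```python
import re

# Prefixes that make a single non-numeric argument a redirect target.
_URL_PREFIXES = ("http://", "https://", "$scheme")


def parse_return_args(args):
    """Hypothetical paraphrase of the argument handling in ngx_http_rewrite_return.

    Returns (status, text); text is None when only a status code is given.
    """
    if not 1 <= len(args) <= 2:
        raise ValueError("return takes one or two arguments")
    first = args[0]
    if re.fullmatch(r"[0-9]+", first):
        # Numeric first argument: optional second argument is the text or URL.
        return int(first), (args[1] if len(args) == 2 else None)
    if len(args) == 1 and first.startswith(_URL_PREFIXES):
        # A lone URL is treated as a temporary redirect (302).
        return 302, first
    raise ValueError('invalid return code "%s"' % first)
```

<p>Under these assumptions, <code>parse_return_args(["http://liguangming.com/"])</code> yields the same pair as <code>parse_return_args(["302", "http://liguangming.com/"])</code>, which matches the equivalence stated above.</p>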
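<p>The second argument is returned as the response body. Isn't that just a simple, native echo module?
I wrote this in a hurry on my way to dinner, so there are bound to be omissions; I will revise it when I have time.</p>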
<h3>References</h3>
<ul>
<li><a href="http://wiki.nginx.org/HttpRewriteModule#return">HttpRewriteModule</a></li>
<li><a href="http://wiki.nginx.org/HttpEchoModule">Nginx third-party echo module</a></li>
</ul>
<h2>How to use UTF-8 encoding in Python</h2>
<p>Li Guangming. Published 2012-04-11T17:06:40+08:00, updated 2012-07-18T17:57:50+08:00. /how-to-use-utf-8-with-python.html</p>
<p><img alt="怎么在Python里使用UTF-8编码?" src="/static/images/python-logo.png"></p>
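<h3>Basic Concepts</h3>
<p>Python 2 has two string types: byte strings and Unicode strings. A byte string is simply a sequence of bytes.
When needed, Python converts bytes to characters using the computer's default locale settings.
On Mac OS X the default encoding is UTF-8, but on most other systems it is ASCII.</p>
<p>For example, to create a byte string:</p>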
<div class="highlight"><pre><span></span>byteString = "hello world! (in my default locale)"
</pre></div>
<p>To create a Unicode string:</p>
<div class="highlight"><pre><span></span>unicodeString = u"hello Unicode world!"
</pre></div>
<p>To convert a byte string to a Unicode string and back again:</p>
<div class="highlight"><pre><span></span>s = "hello byte string"
u = s.decode()
backToBytes = u.encode()
</pre></div>
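<p>The code above converts using the system default character set.
Relying on the locale's character set is a bad idea, however; your program might crash on a Thai user's computer.
The better approach is to specify an encoding explicitly:</p>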
<div class="highlight"><pre><span></span>s = "hello normal string"
u = s.decode("UTF-8")
backToBytes = u.encode("UTF-8")
</pre></div>
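<p>Now the byte string <strong>s</strong> is treated as a sequence of UTF-8 bytes to create the Unicode string <strong>u</strong>.
The next line encodes <strong>u</strong> as UTF-8 to produce the byte string <strong>backToBytes</strong>.</p>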
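<p>The snippets in this post are Python 2. As a rough sketch of the same round trip on Python 3, where <code>bytes</code> and <code>str</code> are separate types and a byte string has no <code>decode()</code> default tied to the locale, something like the following should work. The helper names <code>decode_or_replace</code> and <code>try_ascii</code> are made up for illustration and are not part of the original article.</p>

```python
# -*- coding: UTF-8 -*-
# Python 3 sketch: bytes and str are separate types, so the codec is always explicit.

raw = "héllo wörld".encode("UTF-8")    # text -> bytes
text = raw.decode("UTF-8")             # bytes -> text, codec named explicitly
back_to_bytes = text.encode("UTF-8")   # text -> bytes again


def decode_or_replace(data, encoding="UTF-8"):
    """Decode bytes, replacing undecodable bytes with U+FFFD instead of raising."""
    return data.decode(encoding, errors="replace")


def try_ascii(data):
    """Return the decoded text, or None if the bytes are not pure ASCII."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None
```

<p>Decoding the same bytes as ASCII fails, which is the kind of crash the explicit encoding is meant to avoid.</p>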
<h3>How to Check Whether an Object Is a String</h3>
<p>You might be tempted to check like this:</p>
<div class="highlight"><pre><span></span>if isinstance(s, str):
    pass
</pre></div>
<p>This is wrong, because the test is false for Unicode strings.
Use the generic string base class instead, <a href="http://www.python.org/doc/current/lib/built-in-funcs.html#l2h-9"><code>basestring</code></a>:</p>
<div class="highlight"><pre><span></span>if isinstance(s, basestring):  # True for both Unicode and byte strings
    pass
</pre></div>
<p>To check specifically for a Unicode string:</p>
<div class="highlight"><pre><span></span>if isinstance(s, unicode):
    pass
</pre></div>
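<p><code>basestring</code> and <code>unicode</code> exist only in Python 2. A minimal sketch of checks that run on both Python 2 and Python 3 might look like this. The names <code>string_types</code>, <code>text_type</code>, <code>is_string</code> and <code>is_text</code> are chosen here for illustration, and treating <code>bytes</code> as a "string" on Python 3 is an assumption made to mirror what <code>basestring</code> covers.</p>

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3: text is str, byte strings are bytes
    text_type = str
    string_types = (str, bytes)
else:
    # Python 2: text is unicode, and basestring covers both str and unicode
    text_type = unicode          # noqa: F821 (name exists only on Python 2)
    string_types = (basestring,)  # noqa: F821


def is_string(obj):
    """True for both text strings and byte strings."""
    return isinstance(obj, string_types)


def is_text(obj):
    """True only for Unicode text."""
    return isinstance(obj, text_type)
```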
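<h3>Reading UTF-8 Encoded Files</h3>
<p>You could convert the strings you read from a file by hand, but it is simpler to let the codecs module do it:</p>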
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">codecs</span>
<span class="n">fileObj</span> <span class="o">=</span> <span class="n">codecs</span><span class="o">.</span><span class="n">open</span><span class="p">(</span> <span class="s2">"someFile"</span><span class="p">,</span> <span class="s2">"r"</span><span class="p">,</span> <span class="s2">"UTF-8"</span> <span class="p">)</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">fileObj</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="c1"># Returns a Unicode string from the UTF-8 bytes in the file</span>
</pre></div>
<p>The <a href="http://www.python.org/doc/current/lib/module-codecs.html"><code>codecs</code> module</a> can handle all of these encoding conversions.</p>
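<p>As an alternative sketch, <code>io.open</code> accepts an <code>encoding</code> argument on Python 2.6+ and is the built-in <code>open</code> on Python 3. The helper names and the temporary file below are only there to make the example self-contained.</p>

```python
import io
import os
import tempfile


def write_utf8(path, text):
    """Write Unicode text to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="UTF-8") as f:
        f.write(text)


def read_utf8(path):
    """Read a UTF-8 file and return its contents as Unicode text."""
    with io.open(path, "r", encoding="UTF-8") as f:
        return f.read()


# Round trip through a temporary file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "someFile")
write_utf8(path, u"你好, world")
content = read_utf8(path)
```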
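<h3>Source Code Encoding Declaration</h3>
<p>Python source code is ASCII by default. You can declare a different encoding on the first or second line of the source file:</p>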
<div class="highlight"><pre><span></span># coding=UTF-8
</pre></div>
<p>or (using formats recognized by popular editors):</p>
<div class="highlight"><pre><span></span><span class="ch">#!/usr/bin/python</span>
<span class="c1"># -*- coding: UTF-8 -*-</span>
</pre></div>
<p>or:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/usr/bin/python</span>
<span class="c1"># vim: set fileencoding=UTF-8 :</span>
</pre></div>
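<h3>System Encoding</h3>
<p>As mentioned above, Python converts bytes to characters according to the computer's default locale settings. To find out the system default encoding:</p>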
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sys</span>
<span class="k">print</span> <span class="n">sys</span><span class="o">.</span><span class="n">getdefaultencoding</span><span class="p">()</span>
</pre></div>
<p>To change the system default encoding:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sys</span>
<span class="nb">reload</span><span class="p">(</span><span class="n">sys</span><span class="p">)</span>
<span class="n">sys</span><span class="o">.</span><span class="n">setdefaultencoding</span><span class="p">(</span><span class="s1">'UTF-8'</span><span class="p">)</span>
</pre></div>
<p>Why reload the sys module? First, look at how Python loads modules at startup:</p>
<div class="highlight"><pre><span></span><span class="c1"># python -v</span>
<span class="c1"># installing zipimport hook</span>
<span class="kn">import</span> <span class="nn">zipimport</span> <span class="c1"># builtin</span>
<span class="c1"># installed zipimport hook</span>
<span class="c1"># /usr/local/lib/python2.6/site.pyc matches /usr/local/lib/python2.6/site.py</span>
<span class="kn">import</span> <span class="nn">site</span> <span class="c1"># precompiled from /usr/local/lib/python2.6/site.pyc</span>
<span class="o">....</span>
</pre></div>
<p>When Python starts, it first loads site.py, which contains this piece of code:</p>
<div class="highlight"><pre><span></span>if hasattr(sys, "setdefaultencoding"):
    del sys.setdefaultencoding
</pre></div>
<p>After sys is loaded, site.py deletes the setdefaultencoding function, so we have to reload sys to get it back before we can set the system encoding.</p>
<h3>References</h3>
<ul>
<li><a href="http://www.python.org/dev/peps/pep-0263/">Defining Python Source Code Encodings</a></li>
<li><a href="http://www.evanjones.ca/python-utf8.html">How to Use UTF-8 with Python</a></li>
<li><a href="http://code.activestate.com/recipes/466341-guaranteed-conversion-to-unicode-or-byte-string/">Guaranteed conversion to unicode or byte string (Python recipe)</a></li>
</ul>Kindle2012-04-01T15:19:07+08:002012-07-18T17:57:50+08:00Li Guangmingtag:None,2012-04-01:/kindle.html<hr>
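<p>I recently bought a <a href="http://www.amazon.com/Kindle-Touch-e-Reader-Touch-Screen-Wi-Fi-Special-Offers/dp/B005890G8Y?tag=l-g-m-20">Kindle Touch</a>, the Wi-Fi model with special offers. It was a good deal, I now use it every day, and it is a great tool for reading.</p>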
<p><img alt="Kindle Touch" src="/static/images/kindle-touch.jpg" title="Kindle Touch"></p>
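<p>I also turned the Nginx and Redis source code into mobi e-books so that I can browse them on the <a href="http://www.amazon.com/Kindle-Touch-e-Reader-Touch-Screen-Wi-Fi-Special-Offers/dp/B005890G8Y?tag=l-g-m-20">Kindle Touch</a> at any time. For this I wrote a Python script named Src2Html.py that converts the source code into HTML files, and then used <a href="http://calibre-ebook.com/download">Calibre</a> to generate the mobi e-books. They display quite well on the device. Unfortunately the screen orientation cannot be changed; there is a patch for this online, but installing it requires jailbreaking.</p>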
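<p>Src2Html.py itself is not shown in this post. As a purely hypothetical sketch of the source-to-HTML step (Python 3, standard library only), one could escape the source text and wrap it in a <code>&lt;pre&gt;</code> block. The function name <code>source_to_html</code> and the page layout are assumptions, not the actual script.</p>

```python
import html


def source_to_html(title, source_text):
    """Wrap source code in a minimal HTML page, escaping &, < and >."""
    escaped = html.escape(source_text)
    template = (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"UTF-8\"><title>{0}</title></head>\n"
        "<body><h1>{0}</h1>\n<pre>{1}</pre>\n</body></html>\n"
    )
    return template.format(html.escape(title), escaped)


# Hypothetical input: a one-line C fragment
page = source_to_html("ngx_example.c", "if (a < b && c) { return &x; }")
```

<p>The resulting HTML files could then be handed to Calibre for conversion to mobi.</p>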
<h4><a href="http://cdn.liguangming.com/books/nginx-1.0.12.mobi">Nginx 1.0 Source</a></h4>
<p><img alt="Nginx 1.0 Source" src="http://cdn.liguangming.com/books/nginx-1.0.12.jpg" title="Nginx 1.0 Source"></p>
<h4><a href="http://cdn.liguangming.com/books/redis-2.4.8.mobi">Redis 2.4.8 Source</a></h4>
<p><img alt="Redis 2.4.8 Source" src="http://cdn.liguangming.com/books/redis-2.4.8.jpg" title="Redis 2.4.8 Source"></p>How to Configure Sphinx as a Caching Server2012-03-30T16:56:29+08:002012-07-18T17:57:50+08:00Li Guangmingtag:None,2012-03-30:/how-to-set-up-sphinx-as-a-caching-server.html<p>Everyone knows that Sphinx is a full-text indexing program, and its fast query performance is plain to see. Beyond that, can we get more out of it? How about using it as a simple cache server?</p>
<p><img alt="Sphinx" src="/static/images/sphinx-logo.jpg" title="Sphinx"></p>
<p>First, a look at the files Sphinx uses: .sph, .spa, .spi, .spd, .spp, .spm, .spk, and .spl.</p>
<ul>
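<li>sph: the header file; it stores the configuration.</li>
<li>spi: stores each WordId together with a pointer to that WordId's document information in the spd file.
The spi file is loaded entirely into memory when the search daemon starts.
It is divided into blocks, sorted both within each block and across blocks. The blocking is presumably there to locate a WordId quickly:
because the WordIds in spi are variable-length compressed, a lookup first binary-searches at the block level, then decompresses and scans within the block.</li>
<li>spa: stores DocInfo. This file is loaded into memory when the search daemon starts. Sphinx lets you choose how DocInfo is stored:<ul>
<li>inline: stored in the spd file.</li>
<li>extern: stored separately, which produces the spa file.</li>
</ul>
</li>
<li>spd: the document list.</li>
<li>spp: the list of keyword positions.</li>
<li>spm: DocInfo supports a special kind of attribute called MVA (multi-valued attribute).
Sphinx handles this attribute specially and stores its values in the spm file.
This file is loaded into memory when the search daemon starts.
The attribute's slot in DocInfo holds the byte offset of its values in this file.</li>
<li>spk: the kill list.</li>
<li>spl: the index lock.</li>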
</ul>
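<p>The two-level lookup described for the spi file (binary search across blocks, then a sequential decode inside one block) can be sketched roughly as follows. This is an illustration of the idea only, not Sphinx's on-disk format: the block size, the delta encoding, and the names <code>build_blocks</code> and <code>contains</code> are all assumptions, and a real lookup would return a pointer into the spd file rather than a boolean.</p>

```python
from bisect import bisect_right

BLOCK_SIZE = 4  # hypothetical block size, chosen only for the example


def build_blocks(word_ids, block_size=BLOCK_SIZE):
    """Split sorted WordIds into blocks: keep each block's first id in full
    and store the remaining ids as deltas from their predecessor."""
    word_ids = sorted(word_ids)
    firsts, blocks = [], []
    for i in range(0, len(word_ids), block_size):
        chunk = word_ids[i:i + block_size]
        firsts.append(chunk[0])
        blocks.append([b - a for a, b in zip(chunk, chunk[1:])])
    return firsts, blocks


def contains(firsts, blocks, word_id):
    """Return True if word_id is present."""
    # Step 1: binary search over block starts to pick the candidate block
    idx = bisect_right(firsts, word_id) - 1
    if idx < 0:
        return False
    # Step 2: decode the deltas sequentially inside that block
    current = firsts[idx]
    if current == word_id:
        return True
    for delta in blocks[idx]:
        current += delta
        if current == word_id:
            return True
        if current > word_id:
            return False
    return False
```

<p>Only the block starts need random access, so the ids inside a block can stay compressed until that block is actually scanned.</p>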
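<p>From the overview above we can see that Sphinx stores document attributes (versions before 0.98 did not). Could we use this data as a cache and fetch a document's information by its DocID?</p>
<p>By hacking the search daemon to add a SEARCHD_COMMAND_DOCINFO command, and adding a GetDocinfo function to the client API, we get the desired effect.</p>
<p>PHP example:</p>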
<div class="highlight"><pre><span></span><span class="x">require 'sphinxapi.php';</span>
<span class="p">$</span><span class="nv">cl</span><span class="x"> = new SphinxClient ();</span>
<span class="p">$</span><span class="nv">cl</span><span class="x">->SetServer();</span>
<span class="p">$</span><span class="nv">res</span><span class="x"> = </span><span class="p">$</span><span class="nv">cl</span><span class="x">->GetDocinfo(1, 'singer');</span>
<span class="x">print_r(</span><span class="p">$</span><span class="nv">res</span><span class="x">);</span>
</pre></div>
<p>The result:</p>
<div class="highlight"><pre><span></span>Array
(
    [singer_id] => 1
    [singer_name] => 阿牛
    [cate_id] => 1
    [tag_ids] => Array
        (
            [0] => 110
            [1] => 114
            [2] => 127
        )
    [song_number] => 137
    [album_number] => 14
)
</pre></div>
<p>Patch file: <a href="https://gist.github.com/2251422" title="Sphinx 2.0.4 GetDocinfo Patch">https://gist.github.com/2251422</a></p>
<p><strong>References</strong></p>
<ul>
<li><a href="http://jasonyu.cn/post/168/">A brief analysis of Sphinx</a></li>
<li><a href="http://blog.csdn.net/uestc_huan/article/details/6333711">Sphinx's spx file formats</a></li>
</ul>Tips for Sphinx's wordforms Setting2012-03-30T13:13:10+08:002012-07-18T17:57:50+08:00Li Guangmingtag:None,2012-03-30:/sphinx-wordforms-small-tips.html<p>The <strong>Sphinx</strong> index configuration has a wordforms option. It points to a simple plain-text dictionary file that Sphinx uses to replace words at both indexing and search time.</p>
<p><img alt="Sphinx" src="/static/images/sphinx-logo.jpg" title="Sphinx"></p>
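<h3>Purpose</h3>
<p>In essence, it replaces one word with another. This is typically used to map different word forms to a single normal form (for example, turning “walks”, “walked”, and “walking” into “walk”).</p>
<p>For example:</p>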
<div class="highlight"><pre><span></span>walks>walk
walked>walk
walking>walk
</pre></div>
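<p>It can also be used to implement stemming exceptions, <strong>because words found in the wordforms dictionary are not processed by the stemmer</strong>.
Incoming words are normalized with the dictionary at both indexing and search time. <strong>So for changes to the wordforms file to take effect, you need to reindex and restart searchd.</strong></p>
<h3>Impact</h3>
<p>Sphinx's wordforms support is designed to handle very large dictionaries well: <strong>it has only a small effect on indexing speed and none at all on search speed</strong>. For example, a dictionary with one million entries slows indexing down about 1.5 times.</p>
<p>The extra memory use is roughly equal to the size of the dictionary file, and dictionaries are shared across indexes: if a 50 MB wordforms file is used by 10 different indexes, the additional searchd memory use is still about 50 MB.</p>
<h3>Format</h3>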
<ul>
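<li>Each line contains a source word and a destination word, separated by a greater-than sign.</li>
<li>Case is ignored.</li>
<li>The rules set by the charset_table option are applied.</li>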
</ul>
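<p>To make the format rules concrete, here is a small sketch of how such a file could be parsed and applied to a list of tokens. It is not Sphinx's implementation: the function names are invented for the example, lowercasing stands in for "case is ignored", and charset_table handling is left out entirely.</p>

```python
def parse_wordforms(lines):
    """Build a source -> destination mapping from 'source>destination' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or ">" not in line:
            continue  # skip blank or malformed lines
        source, _, target = line.partition(">")
        source, target = source.strip().lower(), target.strip().lower()
        if source and target:
            mapping[source] = target
    return mapping


def normalize(tokens, mapping):
    """Lowercase each token and replace it if the dictionary has an entry."""
    return [mapping.get(tok.lower(), tok.lower()) for tok in tokens]


forms = parse_wordforms(["walks>walk", "Walked > walk", "walking>walk", ""])
```

<p>Applying the same <code>normalize</code> step to both the indexed text and the query is what makes the different forms match each other.</p>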
<h3>Tips</h3>
<ul>
<li>Simplified/Traditional Chinese conversion</li>
</ul>
<p>For example:</p>
<div class="highlight"><pre><span></span>張>张
學>学
</pre></div>
<p>Searching for “张学友”, “張學友”, or “張学友” then returns the same results.</p>
<ul>
<li>Pinyin correction</li>
</ul>
<p>For example:</p>
<div class="highlight"><pre><span></span>张>zhang
学>xue
友>you
</pre></div>
<p>Searching for “张学友” or “zhang xue you” then returns the same results.</p>