Posts Tagged ‘python’

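Sogou Maps: Offline Download

My previous post, "Sogou Maps Research: Coordinate System Mapping", described how GPS coordinates relate to Sogou Maps' own coordinate system. This post focuses on how to download Sogou map tiles for offline use.

Downloading Sogou tiles is fairly simple: all you need is the mapping between map coordinates and tile URLs.

One way to find that mapping is to analyze Sogou's JavaScript. But Sogou obfuscates its JS, so even the de-obfuscated output is hard to follow. On top of that, the JS captured in Firebug was incomplete, which made the analysis harder still.

Eventually I noticed that Sogou's Roadbook feature (搜狗路书) is built in Flash. Experienced readers will know what comes next: download the Flash file, decompile it, and read the ActionScript. Conveniently, Sogou did not obfuscate any of it, so the code comes out perfectly readable. I won't walk through the decompiled code here; you can study it yourself.

The code shows that at level 18 each pixel represents 0.488281 meters, and that each step down in level doubles that distance. So at level n: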

BASE_MPP = 0.488281   # meters per pixel at level 18
def getMetersPerPixel(level):
    return BASE_MPP * (1 << (18 - level))

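Every tile image is 256×256 pixels, so the ground distance covered by one side of a tile (the code calls this the tile's "area") is: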

def getPicArea(level):
    return 256*getMetersPerPixel(level)
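
As a quick sanity check on the two formulas above, here is a minimal sketch in Python 3. The snake_case function names are my own, and it assumes, as the decompiled code suggests, that level 18 is the most detailed level.

```python
BASE_MPP = 0.488281   # meters per pixel at level 18, from the decompiled code
TILE_PIXELS = 256     # tiles are 256 x 256 pixels


def meters_per_pixel(level):
    """Each level below 18 doubles the ground distance per pixel."""
    return BASE_MPP * (1 << (18 - level))


def tile_span_meters(level):
    """Ground distance covered by one side of a tile at this level."""
    return TILE_PIXELS * meters_per_pixel(level)


if __name__ == "__main__":
    for level in (18, 17, 10):
        print(level, meters_per_pixel(level), tile_span_meters(level))
```

At level 18 one tile side comes out at just under 125 meters, and every level below that doubles it.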

With the tile span in hand, the rest is a direct translation of the ActionScript into Python:

import math

def formatWithM(param1):
    # negative indices are written as "M" + absolute value, e.g. -3 -> "M3"
    return ("M" + str(-param1)) if (param1 < 0) else str(param1)

LEVELCODE = ["728", "727", "726", "725", "724", "723", "722", "721", "720", "719", "718", "717", "716", "715", "714", "713", "712", "711", "792"]

def getMapURL(sogouLat, sogouLon, level):
    mapArea = getPicArea(level)                     # meters spanned by one tile
    picsPerFolder = 200                             # tiles are grouped 200 per folder
    _loc_9 = int(math.floor(sogouLon / mapArea))    # tile column
    _loc_10 = _loc_9 // picsPerFolder               # folder column
    _loc_11 = int(math.floor(sogouLat / mapArea))   # tile row
    _loc_12 = _loc_11 // picsPerFolder              # folder row
    _loc_13 = formatWithM(_loc_9)
    _loc_14 = formatWithM(_loc_11)
    _loc_15 = formatWithM(_loc_10)
    _loc_16 = formatWithM(_loc_12)
    return "/" + LEVELCODE[level] + "/" + _loc_15 + "/" + _loc_16 + "/" + _loc_13 + "_" + _loc_14

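This gives the path part of the URL.

Sogou Maps serves three kinds of tiles: the ordinary map, satellite imagery, and a simplified map overlay drawn on top of the satellite imagery.

Each kind has its own base URL:

  • Ordinary map: http://pic1.go2map.com/seamless/0/174, image suffix .GIF
  • Satellite imagery: http://hbpic1.go2map.com/seamless/0/180, image suffix .JPG
  • Simplified overlay: http://hbpic2.go2map.com/seamless/0/179, image suffix .PNG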
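
To make the path layout concrete, here is a self-contained Python 3 sketch of the same calculation. The name tile_path and the sample coordinates are illustrative assumptions, not values taken from Sogou; the inputs are Sogou map coordinates in meters, not GPS degrees.

```python
import math

BASE_MPP = 0.488281
LEVELCODE = ["728", "727", "726", "725", "724", "723", "722", "721", "720",
             "719", "718", "717", "716", "715", "714", "713", "712", "711", "792"]


def format_with_m(n):
    """Negative indices use an 'M' prefix instead of a minus sign."""
    return "M" + str(-n) if n < 0 else str(n)


def tile_path(sogou_x, sogou_y, level, tiles_per_folder=200):
    """Return '/<level code>/<folder col>/<folder row>/<col>_<row>'."""
    span = 256 * BASE_MPP * (1 << (18 - level))   # meters per tile side
    col = math.floor(sogou_x / span)
    row = math.floor(sogou_y / span)
    # Floor division keeps negative indices in the right folder.
    folder_col = col // tiles_per_folder
    folder_row = row // tiles_per_folder
    return "/" + "/".join([
        LEVELCODE[level],
        format_with_m(folder_col),
        format_with_m(folder_row),
        format_with_m(col) + "_" + format_with_m(row),
    ])


if __name__ == "__main__":
    print(tile_path(12958000, 4825000, 18))
    print(tile_path(-100, -100, 18))
```

With these made-up inputs, the first call yields "/792/518/193/103664_38600" and the second "/792/M1/M1/M1_M1", which shows how the M prefix replaces the minus sign.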

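    Append the result of getMapURL and the image suffix to the base URL of the tile type you want, and you have the full tile URL.

    Enough talk. Here is the complete code; working out how to use it is left to you.

    One piece of bad news: don't try to download the level-18 tiles for the whole country. That would take terabytes of disk space, and many of them. Downloading a small region to play with is fine.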
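
A minimal Python 3 sketch of that last step, using the three base URLs and suffixes listed above. The dictionary keys and the function name are my own labels, and the sample path is only an illustrative value in the shape that getMapURL returns.

```python
# (base URL, image suffix) for each tile type listed above
TILE_SOURCES = {
    "map": ("http://pic1.go2map.com/seamless/0/174", ".GIF"),
    "satellite": ("http://hbpic1.go2map.com/seamless/0/180", ".JPG"),
    "overlay": ("http://hbpic2.go2map.com/seamless/0/179", ".PNG"),
}


def full_tile_url(kind, tile_path):
    """Join base URL, the path from getMapURL, and the image suffix."""
    base, suffix = TILE_SOURCES[kind]
    return base + tile_path + suffix


if __name__ == "__main__":
    print(full_tile_url("satellite", "/792/518/193/103664_38600"))
```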
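
Before the full script, here is a rough way to see why the tile count explodes. This Python 3 sketch counts the tiles needed to cover a rectangular extent at a given level. The 100 km square is a hypothetical extent, and the function name is my own. To estimate disk usage, multiply the count by an average tile size measured from your own downloads.

```python
import math

BASE_MPP = 0.488281   # meters per pixel at level 18


def tiles_for_extent(width_m, height_m, level):
    """Number of 256-pixel tiles needed to cover width_m x height_m."""
    span = 256 * BASE_MPP * (1 << (18 - level))   # meters per tile side
    return math.ceil(width_m / span) * math.ceil(height_m / span)


if __name__ == "__main__":
    # Hypothetical 100 km x 100 km region
    for level in (14, 16, 18):
        print(level, tiles_for_extent(100000, 100000, level))
```

Each extra level roughly quadruples the tile count, which is why level 18 dominates the total.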

    #!/usr/bin/env python
    #Author: Derek
    #Homepage: http://www.april1985.com
    
    import math
    import threading
    import os
    import urllib
    import time
    
    BASE_MPP = 0.488281
    LEVELCODE = ["728", "727", "726", "725", "724", "723", "722", "721", "720", "719", "718", "717", "716", "715", "714", "713", "712", "711", "792"]
    downType=["Hd","Lw","Map"]
    rootPath={"Hd":"http://hbpic1.go2map.com/seamless/0/180",
              "Lw":"http://hbpic2.go2map.com/seamless/0/179",
              "Map":"http://pic1.go2map.com/seamless/0/174"}
    suffix={"Hd":".JPG",
            "Lw":".PNG",
            "Map":".GIF"}
    
    def getMetersPerPixel(level):
        return BASE_MPP * (1 << 18 - level)
    
    def getPicArea(level):
        return 256*getMetersPerPixel(level)
    
    def formatWithM(param1):
        # negative indices are written as "M" + absolute value, e.g. -3 -> "M3"
        return ("M" + str(-param1)) if (param1 < 0) else str(param1)
    
    def getMapURL(sogouLat, sogouLon, level):
        mapArea=getPicArea(level)
        picsPerFolder = 200;
        _loc_9 = int(math.floor(sogouLon / mapArea))
        _loc_10 = int(math.floor(_loc_9 / picsPerFolder))
        _loc_11 = int(math.floor(sogouLat / mapArea))
        _loc_12 = int(math.floor(_loc_11 / picsPerFolder))
        _loc_13 = formatWithM(_loc_9)
        _loc_14 = formatWithM(_loc_11)
        _loc_15 = formatWithM(_loc_10)
        _loc_16 = formatWithM(_loc_12)
        return "/" + LEVELCODE[level] + "/" + _loc_15 + "/" + _loc_16 + "/" + _loc_13 + "_" + _loc_14
    
    threadCount=0
    
    class downloadThread(threading.Thread):
        def __init__(self, url, file):
            threading.Thread.__init__(self)
            self.url=url
            self.file=file
    
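        def run(self):
            global threadCount
            try:
                folder=os.path.split(self.file)[0]
                try:
                    if not os.path.isdir(folder):
                        os.makedirs(folder)
                except OSError:
                    pass    # another thread may have created the folder first
                if not os.path.isfile(self.file):
                    print "Getting %s" % self.url
                    urllib.urlretrieve(self.url, self.file)
                    # files under 500 bytes are presumably error pages, not tiles
                    if(os.path.getsize(self.file)<500):
                        os.remove(self.file)
            finally:
                # always release the slot, even if the download failed
                threadCount-=1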
    
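    # Region to fetch, in the units getMapURL expects (Sogou map coordinates)
    latFrom=40
    latEnd=60
    lonFrom=80
    lonEnd=100
    picCount=0
    localStorage="D:/map/storage"
    for level in range(1,18):
        mapArea=getPicArea(level)
        lon=lonFrom
        picCount=0
        # NOTE: the published post lost the text between "while(lon" and "20):".
        # Everything from here down to the throttle check is a reconstruction;
        # the tile type ("Map") and the local file layout are assumptions.
        while(lon<lonEnd):
            lat=latFrom
            while(lat<latEnd):
                path=getMapURL(lat, lon, level)
                url=rootPath["Map"]+path+suffix["Map"]
                file=localStorage+"/Map"+path+suffix["Map"]
                threadCount+=1
                downloadThread(url, file).start()
                # throttle: wait while more than 20 downloads are in flight
                while(threadCount>20):
                    time.sleep(0.5)
                picCount+=1
                lat+=mapArea
            lon+=mapArea
        print level,picCount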
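
The script above throttles itself with a global thread counter. As an alternative sketch (Python 3, not the author's code), a fixed-size thread pool gives the same cap on concurrent downloads without shared state. The fetch argument is injected so the logic can be exercised without touching the network. The 500-byte cutoff mirrors the script above, and the other names are my own.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MIN_VALID_BYTES = 500   # same cutoff as the script above


def download_all(jobs, fetch, max_workers=20):
    """jobs: iterable of (url, local_path); fetch(url, path) saves one file.

    Returns the list of local paths that exist after the run.
    """
    def one(job):
        url, path = job
        if os.path.isfile(path):
            return path                       # already downloaded
        folder = os.path.dirname(path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        fetch(url, path)
        if os.path.getsize(path) < MIN_VALID_BYTES:
            os.remove(path)                   # too small to be a real tile
            return None
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [p for p in pool.map(one, jobs) if p]
```

For real downloads, urllib.request.urlretrieve takes (url, filename) and can be passed as fetch.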
    
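    The Great Mailbox Migration

    Since Google was "driven out" of China, I have grown more and more worried about the fate of Gmail. For the past few days even google.com.hk has intermittently failed to load. If it is blocked completely one day, everything will be gone. My mailbox is currently a Google-hosted domain mailbox holding several thousand messages, including mail from my 163 account and several earlier Gmail accounts. Moving a mailbox is not that simple, though: POP3 and SMTP cannot do the job. Fortunately IMAP supports both downloading and uploading messages, which covers what I need. The full IMAP specification is at: http://www.faqs.org/rfcs/rfc3501.html

    Among the available libraries, I found Python's IMAP support to be the best. Python's imaplib supports IMAP well and is easy to use (http://docs.python.org/library/imaplib.html). For downloading over IMAP, this Python script is a good example: http://the.taoofmac.com/media/Projects/imapbackup/imapbackup.py.txt

    My needs are simple: move all the data from several mailboxes into a QQ Mail account, keeping the inbox and sent mail separate.

    The procedure: first set every source mailbox to forward to the destination mailbox (so that mail arriving during the download is not left behind), then download all the data from each mailbox over IMAP (bearing in mind that there may be duplicates previously pulled in via POP3), and finally upload everything to QQ Mail.

    In the implementation, each downloaded message is saved to a folder named receive, using the MD5 hash of its BODY as the file name. Messages from the designated sent folder go into a folder named sent instead. That way receive holds every message exactly once.

    Then upload to QQ Mail, deleting each local file once it has been uploaded successfully so that nothing is uploaded twice.

    Based on imapbackup.py, I wrote an IMAPServer class that handles both downloading and uploading: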
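
The deduplication idea can be sketched on its own, without any IMAP connection. This is a Python 3 illustration with a function name of my own: the file name is the MD5 of the body, so a second copy of the same message lands on the same path and is skipped.

```python
import hashlib
import os


def save_unique(body, raw_message, folder):
    """Store raw_message (bytes) under md5(body); return (path, was_new)."""
    name = hashlib.md5(body.strip()).hexdigest()
    path = os.path.join(folder, name)
    if os.path.exists(path):
        return path, False          # same body already stored, skip it
    os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as f:     # binary mode keeps CRLF line endings intact
        f.write(raw_message)
    return path, True
```

Two copies of one message that differ only in their headers, for example one fetched earlier over POP3, map to the same file name, so only the first is written.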
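
A minimal Python 3 sketch of that upload-then-delete loop. The upload callable is a stand-in for whatever actually sends the message (an assumption for illustration), and a file is removed only when the upload reports success, so an interrupted run can simply be restarted.

```python
import os


def upload_folder(folder, upload):
    """Upload every file in folder; delete each one only after success.

    upload(content_bytes) must return True when the server accepted it.
    Returns the number of messages uploaded.
    """
    done = 0
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        with open(path, "rb") as f:
            content = f.read()
        if upload(content):
            os.remove(path)     # never uploaded twice on a rerun
            done += 1
    return done
```

With imaplib, upload could wrap IMAP4.append(mailbox, flags, date_time, message) and return True when the response type is "OK".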

    import os, imaplib, re, hashlib, shutil, getopt
    
    class IMAPServer:
        def __init__(self, server, port,  ssl, usr, psd):
            if ssl:
                self.__server=imaplib.IMAP4_SSL(server, port)
            else:
                self.__server=imaplib.IMAP4(server, port)
            self.__server.debug=4
            self.__server.login(usr, psd)
            print usr , " logged in"
    
        def __parse_paren_list(self, row):
            """Parses the nested list of attributes at the start of a LIST response"""
            # eat starting paren
            assert(row[0] == '(' )
            row = row[1:]
    
            result = []
    
            # NOTE: RFC3501 doesn't fully define the format of name attributes
            name_attrib_re = re.compile("^\s*(\\\\[a-zA-Z0-9_]+)\s*")
    
            # eat name attributes until ending paren
            while row[0] != ')':
                # recurse
                if row[0] == '(':
                    paren_list, row = self.__parse_paren_list(row)
                    result.append(paren_list)
                # consume name attribute
                else:
                    match = name_attrib_re.search(row)
                    assert(match != None)
                    name_attrib = row[match.start():match.end()]
                    row = row[match.end():]
                    #print "MATCHED '%s' '%s'" % (name_attrib, row)
                    name_attrib = name_attrib.strip()
                    result.append(name_attrib)
    
            # eat ending paren
            assert(')' == row[0])
            row = row[1:]
    
            # done!
            return result, row
    
        def __parse_string_list(self, row):
            """Parses the quoted and unquoted strings at the end of a LIST response"""
            slist = re.compile('\s*(?:"([^"]+)")\s*|\s*(\S+)\s*').split(row)
            return [s for s in slist if s]
    
        def __parse_list(self, row):
            """Parses response of LIST command into a list"""
            row = row.strip()
            paren_list, row = self.__parse_paren_list(row)
            string_list = self.__parse_string_list(row)
            assert(len(string_list) == 2)
            return [paren_list] + string_list
    
        def __get_hierarchy_delimiter(self):
            """Queries the imapd for the hierarchy delimiter, eg. '.' in INBOX.Sent"""
            # see RFC 3501 page 39 paragraph 4
            typ, data = self.__server.list('', '')
            assert(typ == 'OK')
            assert(len(data) == 1)
            lst = self.__parse_list(data[0]) # [attribs, hierarchy delimiter, root name]
            hierarchy_delim = lst[1]
            # NIL if there is no hierarchy
            if 'NIL' == hierarchy_delim:
                hierarchy_delim = '.'
    
            return hierarchy_delim
    
        def __convert_utf7(self, str):
            p=str.split("&")
            rst=p[0]+("+"+p[1]).decode("utf-7")
            return rst
    
        def get_mailbox_names(self):
            delim = self.__get_hierarchy_delimiter()
    
            # Get LIST of all folders
            typ, data = self.__server.list()
            assert(typ == 'OK')
    
            names=[]
            # parse each LIST, find folder name
            for row in data:
                lst = self.__parse_list(row)
                foldername = lst[2]
    
                unicodename=foldername
                if foldername.find('&')>=0:
                    unicodename=self.__convert_utf7(foldername)
    
                names.append((foldername,unicodename))
            return names
    
        def get_mail(self, mailbox, dir, fromid=1):
            type, num= self.__server.select(mailbox)
            #type, data =self.__server.search(None,"ALL")
    
            for mail_id in range(fromid,int(num[0])+1):
                type=None
                try:
                    type, msg_data=self.__server.fetch(mail_id, "(RFC822 BODY[TEXT])")
                except:
                    print "Get ", mail_id, " failed"
    
                if type!="OK":
                    continue
    
                content=msg_data[0][1]
                bodyMD5=msg_data[1][1].strip()
    
                print "Saving ",mail_id, " total ", num[0]
    
                filename=hashlib.md5(bodyMD5).hexdigest()
                file=open(os.path.join(dir,filename), "wb")
                file.write(content)
                file.flush()
                file.close()
    
        def upload_mail(self, mailbox, content):
            print self.__server.append(mailbox,0,0,content)
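
The __convert_utf7 helper above handles a folder name with a single encoded run. IMAP folder names use the modified UTF-7 encoding described in RFC 3501 section 5.1.3, where "&" starts an encoded run, "-" ends it, "," replaces "/" in the base64 alphabet, and "&-" stands for a literal ampersand. The following Python 3 sketch decodes names with any number of runs. The function name is my own, and it expects the name as str rather than bytes.

```python
import base64
import re

_ENCODED_RUN = re.compile(r"&([^-]*)-")


def decode_imap_utf7(name):
    """Decode an IMAP modified UTF-7 mailbox name (given as str)."""
    def repl(match):
        chunk = match.group(1)
        if chunk == "":
            return "&"                          # "&-" is a literal ampersand
        b64 = chunk.replace(",", "/")           # undo the IMAP alphabet tweak
        b64 += "=" * (-len(b64) % 4)            # restore stripped padding
        return base64.b64decode(b64).decode("utf-16-be")
    return _ENCODED_RUN.sub(repl, name)


if __name__ == "__main__":
    print(decode_imap_utf7("[Gmail]/&XfJT0ZCuTvY-"))
```

Applied to the Gmail folder name used in this post, it prints "[Gmail]/已发邮件" (Sent Mail).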

    First, download the mail:

    def getAllMails(receive_folder,sent_folder):
        src_svr=IMAPServer("imap.gmail.com",993, True, "gmail@gmail.com", "gmail")
        for mailbox, unicode in src_svr.get_mailbox_names():
            print mailbox, "\t" , unicode
    
        src_svr.get_mail('INBOX', receive_folder)
        src_svr.get_mail('[Gmail]/&XfJT0ZCuTvY-', sent_folder)
    
        src_svr=IMAPServer("imap.163.com",993, True, "163", "163")
        for mailbox, unicode in src_svr.get_mailbox_names():
            print mailbox, "\t" , unicode
    
        src_svr.get_mail('INBOX', receive_folder)
        src_svr.get_mail('Sent Items', sent_folder)
    
        print "Done"

    Once all the mail has been downloaded, upload it:

    def copyto(receive_folder, sent_folder):
        to_receive_folder="D:\\mail\\to_receive"
        to_sent_folder="D:\\mail\\to_sent"
    
        to_svr=IMAPServer("imap.qq.com", 993, True, "QQ number", 'password')
    
        print "Upload"
        finished=0
        total=len(os.listdir(receive_folder))
        for received in os.listdir(receive_folder):
            finished+=1
            print finished, "\t", total
    
            path=os.path.join(receive_folder, received)
            file=open(path, "rb")    # binary mode: the files were written with "wb"
            content=file.read()
            file.close()
            to_svr.upload_mail('INBOX',content)
            os.remove(path)
    
        finished=0
        total=len(os.listdir(sent_folder))
        for sent in os.listdir(sent_folder):
            finished+=1
            print finished, "\t", total
    
            path=os.path.join(sent_folder, sent)
            file=open(path, "rb")    # binary mode: the files were written with "wb"
            content=file.read()
            file.close()
            to_svr.upload_mail('&XfJT0ZABkK5O9g-',content)
            os.remove(path)

    The main program:

    def main():
        """Main entry point"""
    
        receive_folder="D:\\mail\\received\\"
        sent_folder="D:\\mail\\sent"
    
        getAllMails(receive_folder,sent_folder)  #download all the mail first, then switch to copyto
        #copyto(receive_folder,sent_folder)    
    
    if __name__ == '__main__':
      main()

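    And that's the move done. In my test, all 2,300-odd messages were transferred to QQ Mail, with sent and received mail stored separately.
    The script still has plenty of shortcomings. Comparing content by MD5, for example, is not fully reliable: if you download a Gmail message, upload it to QQ Mail, and download it again, you will find a few extra spaces at the end of the BODY, so the MD5 comparison fails to recognize the two copies as the same message.
    A console script is obviously a bit clumsy; a GUI for the synchronization would be better. I just don't have the time to build one, so interested readers are welcome to put their talents to work.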
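
One possible way around the trailing-whitespace problem mentioned above is to normalize the body before hashing. This Python 3 sketch is an untested idea rather than part of the original script. It assumes that differences in line endings and trailing whitespace are the only changes a server makes, which may not hold for every provider.

```python
import hashlib


def normalized_digest(body):
    """MD5 of a message body (bytes), ignoring line-ending style,
    trailing whitespace on each line, and trailing blank lines."""
    text = body.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    lines = [line.rstrip() for line in text.split(b"\n")]
    while lines and lines[-1] == b"":
        lines.pop()                     # drop trailing blank lines
    return hashlib.md5(b"\n".join(lines)).hexdigest()
```

Two copies that differ only in padding now hash to the same value, while any change to the visible text still produces a different digest.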

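    Problems Syncing Blogs with metaWeblog

    I recently had a pressing need to synchronize several blogs. I have three: this site, hesicong.cnblogs.com, and www.csdn.net/hesicong. The latter two have gone without updates for a long time for various reasons. What I want is a program that syncs my WordPress posts to cnblogs and CSDN (which become secondary blogs): for a post that exists on both the primary and a secondary blog, overwrite the secondary copy with the primary's content; for a post missing from the secondary blog, create it with its original date.

    Technically, metaWeblog can do all of this, and Python's xmlrpclib makes XML-RPC calls easy. A small console program is enough for my purposes.

    Then I started experimenting, and found:

    1. WordPress supports metaWeblog well, and every function works. metaWeblog.getRecentPosts returns all the posts from WordPress.
    2. cnblogs also supports metaWeblog, and supports it well, including my syntax highlighting. Unfortunately there are two problems. First, metaWeblog.getRecentPosts returns at most 100 posts, while my cnblogs blog currently has 230, so cnblogs clearly caps the count. Second, even when the Post struct passed to metaWeblog.newPost contains dateCreated, the cnblogs front page still uses the current time, so post dates do not match and the ordering is scrambled.
    3. CSDN is junk: it supports metaWeblog on the surface but breaks underneath. metaWeblog.getRecentPosts and metaWeblog.editPost both fail with "User not exist". Only metaWeblog.newPost works, but syntax highlighting does not work on CSDN blogs, so the pages look ugly.

    So achieving my goal through metaWeblog looks hopeless, unless cnblogs lifts its post-count limit and CSDN turns itself from junk into something that works.

    Here is the test code I wrote. It is incomplete and meant only as a reference: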
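
The sync rule described above can be sketched as a pure planning step, separate from any network calls. This Python 3 illustration assumes that posts are matched by title and that each post is a dict with postid, title, and dateCreated keys. The function name is my own.

```python
def plan_sync(primary_posts, secondary_posts):
    """Decide what to do on the secondary blog.

    Returns (updates, creates):
      updates: list of (secondary_postid, primary_post) to overwrite
      creates: primary posts missing on the secondary, oldest first
    """
    by_title = {p["title"]: p["postid"] for p in secondary_posts}
    updates, creates = [], []
    for post in primary_posts:
        if post["title"] in by_title:
            updates.append((by_title[post["title"]], post))
        else:
            creates.append(post)
    # Create missing posts in chronological order.
    creates.sort(key=lambda p: p["dateCreated"])
    return updates, creates
```

Keeping the plan separate from the XML-RPC calls makes it possible to print and review what would change before touching either blog.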
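
For reference, this is roughly what a metaWeblog.newPost request carrying dateCreated looks like. The Python 3 sketch uses xmlrpc.client (the successor of xmlrpclib) and only serializes the call, so no server is contacted. The helper name and sample values are illustrative. Whether a server honors dateCreated is up to the server, as the cnblogs result above shows.

```python
import datetime
import xmlrpc.client


def build_new_post_request(username, password, title, html, created):
    """Return the XML body of a metaWeblog.newPost call."""
    post = {
        "title": title,
        "description": html,                       # post body
        "dateCreated": xmlrpc.client.DateTime(created),
    }
    # Parameter order: blogid, username, password, post struct, publish flag
    params = ("", username, password, post, True)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


if __name__ == "__main__":
    print(build_new_post_request(
        "user", "secret", "Hello", "<p>Hi</p>",
        datetime.datetime(2010, 4, 1, 12, 30)))
```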

    import xmlrpclib
    
    class Metablog:
        def __init__(self, url, username, password):
            self.username=username
            self.password=password
            self.url=url
            self.server=xmlrpclib.ServerProxy(url)
            self.posts=None
    
        def getAllPosts(self):
            print "Getting all posts from "+self.url
            self.posts=self.server.metaWeblog.getRecentPosts('', self.username, self.password, 9999999)
            print "found "+ str(len(self.posts)) +" posts"
            return self.posts;
    
        def getAllPostTitle(self):
            if self.posts==None:
                self.getAllPosts()
    
            ret=dict()
            for post in self.posts:
                ret[post["postid"]]=post["title"]
    
            return ret
    
        def getPost(self, id):
            for post in self.posts:
                if post["postid"]==id:
                    return post
    
            return None
    
        def newPost(self, post):
            self.server.metaWeblog.newPost('', self.username, self.password, post, True)
    
        def editPost(self,postid, post):
            self.server.metaWeblog.editPost(postid, self.username, self.password, post, True)
    
        def delPost(self, postid):
            self.server.metaWeblog.deletePost('',postid, self.username, self.password, True)
    
    def syncBlog(b1, b2):
        b1Titles=b1.getAllPostTitle()
        b2Titles=b2.getAllPostTitle()
        for key1,value1 in b1Titles.iteritems():
            print "Blog1 title: "+value1
            for key2,value2 in b2Titles.iteritems():
                print "\tBlog2 title: "+value2
                if value1==value2:
                    print "Syncing, blog 2 postid="+str(key2)
                    b2.editPost(key2, b1.getPost(key1))
                    break
            else:
                # no break: blog 2 has no post with this title, so create it
                print "Blog2 has no article equal to title :"+value1
                print "Add new "
                b2.newPost(b1.getPost(key1))
    
        print "Done sync"
    
    wpBlog=Metablog("primary blog XML-RPC URL", "username", "password")
    cnBlog=Metablog("secondary blog XML-RPC URL", "username", "password")
    syncBlog(wpBlog, cnBlog)
    

    Grammar Girl podcast downloader

    Like the ESL podcast downloader, this is a simple tool that fetches all the podcasts and descriptions from Grammar Girl. Use at your own risk!


      GGDownloader.py (1.6 KiB, 487 hits)

    ESL podcast downloader

    When I first came across ESL Podcast, I found it very useful for listening practice, and it expands your vocabulary and communication skills. On the website you can download all of the podcasts and read the scripts online for free. There are now 670-odd podcasts on the site, so it is a huge treasure for English learners.

    I wondered whether I could write a tool to download them all at once, including the audio index and the scripts, and convert the scripts to plain text so that I could listen to the podcasts and read the scripts on my phone. To that end, I wrote a small program in Python under Linux.

    Download it:

      eslDownloader.py (2.8 KiB, 626 hits)

    How to use it:

    1. Make sure you are on Linux, Unix, or Cygwin.
    2. Make sure Python 2.x is installed.
    3. Copy the script to any folder you like.
    4. Make sure you have permission to create a subdirectory in that folder. The script creates a folder called "eslpod" to store the content.
    5. Run the script.

    You will see the script start fetching pages and downloading podcasts. Downloaded files are stored under the "eslpod" folder, grouped by tag. Under each tag you will find the scripts with an .html suffix and the podcasts with an .mp3 suffix.

    You can interrupt the download by killing the process.

    DISCLAIMER:

    • Use this script at your own risk.
    • This program may fail if the site changes.
    • Respect the authors' work and do not distribute without their permission.

    And finally, be happy and confident in studying English!