Riedayme

Scrape WordPress posts into Blogger XML output using Python 3

Update 16 October 2019: the code now uses two selector sets :)
In this tutorial I will share how to scrape a WordPress sitemap using the Python programming language with the BeautifulSoup library and a few supporting libraries.

My original motivation for building this scraping tool was frustration with how slow my PHP version was. I tried Python, but it was still slow :v then again, the server of the site I was scraping is probably just slow itself.

Let's get straight to the experiment:
  1. First, make sure Python is installed
  2. Next, install the required libraries
    1. pip install beautifulsoup4
    2. pip install requests
    3. pip install htmlmin
  3. Then create a file with the .py extension and
    fill it with the Python code below:
    
    import requests
    import io
    import htmlmin
    import html
    from bs4 import BeautifulSoup
    
    # define sitemap url
    sitemap_post_url = 'https://apkhome.net/post-sitemap64.xml'
    
    # derive the local output file name from the url's last path segment
    # (the segment already ends in ".xml", so don't append another extension)
    xml_name = sitemap_post_url.rstrip("/").split("/")[-1]
    
    # selectors for the 2nd template layout
    selector2_title = 'div.short-detail > a > h2'
    selector2_date = 'meta'
    selector2_date_property = 'article:published_time'
    selector2_category = 'div.short-detail > p > a'
    selector2_content = 'div.post-content > div.post-main-content'
    
    # selectors for the 1st template layout
    selector_title = 'div.details > div.p10 > dl > dd > div.p1 > h2'
    selector_date = 'meta'
    selector_date_property = 'article:published_time'
    selector_category = 'div.details-title > a:nth-child(3)'
    selector_content = 'div.my-container > div.details'
    
    # define function ===============================================================
    # ===============================================================================
    def get_title(result):
        # take the element's text and escape it for use inside XML
        post_title = result.get_text(strip=True)
        return html.escape(post_title)
    
    def get_category(result):
        post_category_arr = []
        for caq in result:
            category_name = "<category scheme='http://www.blogger.com/atom/ns#' term='" + \
                html.escape(caq.get_text()) + "' />"
            post_category_arr.append(category_name)
        # end for
        post_category_remove_same_value = list(set(post_category_arr))
        post_category = ''.join(post_category_remove_same_value)
        return post_category
    
    def get_content(result, selector):

        if selector == 1:
            # extra cleanup only needed for the 2nd template layout
            for div in result.find_all("div", {'class': 'p10'}):
                div.decompose()
            for div in result.find_all("div", {'class': 'below-com-widget'}):
                div.decompose()
            for div in result.find_all("div", {'class': 'additional'}):
                div.decompose()

        # strip tags that must never end up inside the post body
        for tag in result.find_all(['script', 'noscript', 'ins', 'style', 'link', 'meta']):
            tag.extract()

        # undo lazy loading: promote data-src to src
        for img in result.find_all('img', {'data-src': True}):
            img['src'] = img['data-src']
            del img['data-src']
            # data-lazyloaded is not always present, so remove it safely
            img.attrs.pop('data-lazyloaded', None)
        # end for

        # convert the tag object to a string, minify it, then escape for XML
        result = str(result)
        result = htmlmin.minify(result, remove_empty_space=True, remove_optional_attribute_quotes=False)
        result = html.escape(result)

        return result
    
    # start scraping !!! ============================================================
    # ===============================================================================
    # ===============================================================================
    
    # request url sitemap post
    req_post = requests.get(sitemap_post_url)
    # take the response body as text
    req_post_text = req_post.text

    # parse it with BeautifulSoup (html.parser)
    soup_sitemap_post = BeautifulSoup(req_post_text, "html.parser")
    # select url tag
    sitemap_post_tags = soup_sitemap_post.find_all("url")
    
    # report how many post urls the sitemap contains
    print("Number of post URLs in sitemap: {0}".format(len(sitemap_post_tags)))
    
    # truncate any previous output file =====================
    # =======================================================
    open(xml_name, 'w').close()
    
    # write xml =============================================
    # =======================================================
    with open(xml_name, "a", encoding="utf-8") as f:
        f.write("<?xml version='1.0' encoding='UTF-8'?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0' xmlns:georss='http://www.georss.org/georss'><generator version='7.00' uri='https://www.blogger.com'>Blogger</generator>")
    
    # number for post id
    post_number = 1
    
    # flag: 0 = use the 1st selector set, 1 = use the 2nd
    use_selector2 = 0
    
    for sitemap_post in sitemap_post_tags:
    
        # get the post url from the <loc> tag
        post_url = sitemap_post.find("loc").text
    
        # request the post url
        req_detail = requests.get(post_url)

        # parse the post page with BeautifulSoup (html.parser)
        soup_detail = BeautifulSoup(req_detail.content, "html.parser")
    
        # if selector 2 true ==========================================
        # =============================================================
        if use_selector2 == 1:
    
            # use selector 2 ===============================================================
            # ==============================================================================
    
            post_title = soup_detail.select_one(selector2_title)
            post_date = soup_detail.find(selector2_date, property=selector2_date_property)
            post_category_select = soup_detail.select(selector2_category)
            post_content_select = soup_detail.select_one(selector2_content)
    
        else:
    
            # use selector 1 ===============================================================
            # ==============================================================================
    
            # get title (selector 1, falling back to selector 2)
            post_title = soup_detail.select_one(selector_title)
            if post_title is None:
                print("Title selector not found, trying the 2nd selector")

                # try again with the 2nd selector :)
                post_title = soup_detail.select_one(selector2_title)
                if post_title is None:
                    print("2nd title selector not found, check the code")
                    break
                else:
                    # switch to selector set 2 for the remaining fields
                    use_selector2 = 1
                # end if
            # end if
    
            # get date (selector 1, falling back to selector 2)
            post_date = soup_detail.find(selector_date, property=selector_date_property)
            if post_date is None:
                print("Date selector not found, trying the 2nd selector")

                # try again with the 2nd selector :)
                post_date = soup_detail.find(selector2_date, property=selector2_date_property)
                if post_date is None:
                    print("2nd date selector not found, check the code")
                    break
                # end if
            # end if
    
            # get category (selector 1, falling back to selector 2)
            post_category_select = soup_detail.select(selector_category)
            if len(post_category_select) == 0:
                print("Category selector not found, trying the 2nd selector")

                # try again with the 2nd selector :)
                post_category_select = soup_detail.select(selector2_category)
                if len(post_category_select) == 0:
                    print("2nd category selector not found, check the code")
                    break
                # end if
            # end if
    
            # get content (selector 1, falling back to selector 2)
            post_content_select = soup_detail.select_one(selector_content)
            if post_content_select is None:
                print("Content selector not found, trying the 2nd selector")

                # try again with the 2nd selector :)
                post_content_select = soup_detail.select_one(selector2_content)
                if post_content_select is None:
                    print("2nd content selector not found, check the code")
                    break
                # end if
            # end if
    
        #end if
    
        # with all fields found, build the final values
        post_title = get_title(post_title)
        post_date = str(post_date['content'])
        post_category = get_category(post_category_select)
        post_content = get_content(post_content_select,use_selector2)
        #print(post_title)
        #print(post_date)
        #print(post_category)
        #print(post_content)
        #exit()
    
        # write xml =============================================
        # =======================================================
        with open(xml_name, "a", encoding="utf-8") as f:
            f.write("<entry><id>post-" + str(post_number) + "</id><published>" + str(post_date) + "</published><updated>" + str(post_date) + "</updated><category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#post' />" +
                    post_category + "<title type='text'>" + str(post_title) + "</title><content type='html'>" + post_content + "</content><author><name>Scraper</name></author></entry>")
        # end with
    
        # log the post number and url so progress is visible
        print("%d__%s" % (post_number, post_url))
    
        # increment the post id
        post_number += 1
    
    # end for
    
    # write xml =============================================
    # =======================================================
    with open(xml_name, "a", encoding="utf-8") as f:
        f.write("</feed>")        
        
    

  4. Next, save the file and open it with IDLE, the editor bundled with Python, by right-clicking the Python file and choosing Edit with IDLE, as in the image below:

  5. To run the program, click the Run menu and then Run Module, or just press F5 on your keyboard
  6. If it works, you will see output like this:

  7. You will find the scrape result in the same folder as the Python file you ran

The resulting XML file can be imported directly into Blogger.
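Before importing, it's worth checking that the generated file is well-formed XML, since a malformed feed is the usual reason Blogger rejects an import. A quick stand-alone check using only the standard library (the file name and the miniature feed below are just stand-ins for the scraper's real output):

```python
import xml.etree.ElementTree as ET

def count_entries(feed_path):
    # parse the feed and count <entry> elements; ET.parse raises if the XML is malformed
    ns = {'atom': 'http://www.w3.org/2005/Atom'}
    return len(ET.parse(feed_path).getroot().findall('atom:entry', ns))

# tiny stand-in feed, written the same single-line way the scraper writes its output
with open('check-feed.xml', 'w', encoding='utf-8') as f:
    f.write("<?xml version='1.0' encoding='UTF-8'?>"
            "<feed xmlns='http://www.w3.org/2005/Atom'>"
            "<entry><id>post-1</id></entry>"
            "<entry><id>post-2</id></entry>"
            "</feed>")

n = count_entries('check-feed.xml')
```

If the parse raises no error and the entry count matches the number of posts logged by the scraper, the file is safe to upload.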

An explanation of the code may come another time, or you can tinker with it yourself.

If the code stops working, the target site has probably updated its template so the selectors no longer match; the fix is to update the selectors.
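A quick way to verify a candidate replacement selector, before wiring it into the script, is to run it against a saved copy of the page's markup: `select_one` returns None when nothing matches. The markup and selector below are hypothetical examples, not taken from the actual target site:

```python
from bs4 import BeautifulSoup

# a stripped-down copy of the new template's markup (hypothetical example)
html_snippet = """
<div class="short-detail">
  <a href="#"><h2>Some App 1.2.3</h2></a>
</div>
"""

soup = BeautifulSoup(html_snippet, "html.parser")

# try the candidate selector; select_one returns None when it does not match
hit = soup.select_one('div.short-detail > a > h2')
title = hit.get_text(strip=True) if hit else None
```

Copy the real CSS path from the browser's inspector (right-click the element, Copy selector), trim it down, and only paste it into the script once this kind of check returns the expected text.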

References:
https://stackoverflow.com/questions/14587728/what-does-this-error-in-beautiful-soup-means


Comments

  1. Anonymous
    Anonymous 7 October 2019 02.29
    Is there a video version? I'm still a beginner.
    1. Riedayme
      Riedayme 7 October 2019 03.35
      Coming soon, along with an explanation of the code
  2. Diyana Harmitha
    Diyana Harmitha 7 October 2019 09.42
    How would you scrape more than one sitemap? For example, reading https://apkhome.net/sitemap_index.xml and then scraping each sitemap in it one by one. Thanks
    1. Riedayme
      Riedayme 7 October 2019 09.51
      It could be built that way, but it would take very long on a target with 70,000 posts; even 1,000 already takes a while, let alone 70,000. That's why I scrape one post sitemap at a time instead of the sitemap index
    2. Riedayme
      Riedayme 7 October 2019 10.36
      example code: https://pastebin.com/raw/g8kwzYvt
  3. Diyana Harmitha
    Diyana Harmitha 7 October 2019 10.51
    Thanks a lot, this is really cool, valuable knowledge.
  4. Jessica Haryanti Blog
    Jessica Haryanti Blog 14 October 2019 13.53
    I copy-pasted it but there's an error, it says line 57
    1. Riedayme
      Riedayme 14 October 2019 21.43
      Thanks for the report, I will fix it
  5. Anonymous
    Anonymous 15 October 2019 21.35
    It errors out
  6. Riedayme
    Riedayme 16 October 2019 08.51
    I've fixed it. The problem was the target's template kept changing; maybe they know the site is being scraped, so they keep changing it :)
  7. bosnangka
    bosnangka 18 October 2019 21.07
    Is it limited to a thousand per sitemap?
    1. Riedayme
      Riedayme 20 October 2019 12.28
      It depends on the sitemap; if it lists 10,000 you get that many. Yoast sitemaps are indeed split per 1,000
  8. alam spy
    alam spy 18 October 2019 22.51
    I get this message: "Title selector not found, trying the 2nd selector
    2nd title selector not found, check the code". How do I fix it?
    1. Riedayme
      Riedayme 20 October 2019 12.27
      That means the selectors are wrong; I will post a tutorial on how to change them
  9. masboon
    masboon 9 November 2019 15.24
    It keeps erroring.. selector not found every time
    1. Riedayme
      Riedayme 10 November 2019 05.55
      I haven't fixed it yet, I'll check it later
    2. masboon
      masboon 16 November 2019 07.17
      OK, looking forward to it
  10. Diyana Harmitha
    Diyana Harmitha 12 November 2019 22.50
    Please teach us how to change the selectors so it can be used on other sites. Thanks
    1. Riedayme
      Riedayme 13 November 2019 06.32
      I've been meaning to write that but haven't had time
  11. Si Admin
    Si Admin 12 February 2020 04.05
    No updates yet?
    1. Riedayme
      Riedayme 13 May 2020 05.44
      I probably won't be updating it anymore.
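As a postscript to the question in the comments about scraping every sitemap from a sitemap index: the linked paste isn't reproduced here, but the general idea can be sketched with the standard library alone (tag names follow the sitemaps.org index format; the sample document is made up):

```python
import xml.etree.ElementTree as ET

def get_post_sitemaps(index_xml):
    # collect every sub-sitemap <loc> from a sitemap index document
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(index_xml)
    return [loc.text for loc in root.findall('sm:sitemap/sm:loc', ns)]

# made-up sample in the standard sitemap-index format
sample = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://apkhome.net/post-sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://apkhome.net/post-sitemap2.xml</loc></sitemap>
</sitemapindex>"""

sub_sitemaps = get_post_sitemaps(sample)
# each of these urls could then be fed to the post scraper, one at a time
```

In practice you would fetch the index with requests first, then loop over the returned URLs, bearing in mind the warning above that a target with tens of thousands of posts will take a very long time.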