[Big Data Analysis] Crawling - (2)

으녜 2021. 4. 12. 21:20

Crawling static web pages

Use the BeautifulSoup package (install it with pip install beautifulsoup4).

'''

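from bs4 import BeautifulSoup

# Sample HTML document: a heading and a list of search engine links
html = """
<h1 id="title">Search engines</h1>
<div class="top">
<ul class="search">
    <li><a href="https://www.naver.com">Naver</a></li>
    <li><a href="https://www.google.co.kr">Google</a></li>
    <li><a href="https://www.daum.net">Daum</a></li>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')  # parse with Python's built-in HTML parser
print(soup.prettify())  # print the parse tree with one tag per line, indented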

'''

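[Opening the HTML file on a Mac]

In the text editor, change the format from rich text (the default) to plain text, then save the output of the code above as an .html file.

Open the HTML document.

Clicking a search engine name takes you to that site.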
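The manual save step above can also be done from Python. This is a minimal sketch, not part of the original walkthrough: the helper name save_html and the file name search.html are made up for illustration, and the markup is a shortened copy of the sample.

```python
from pathlib import Path


# Hypothetical helper: the name and the default file name are illustrative only.
def save_html(markup: str, path: str = "search.html") -> Path:
    """Write markup to path as UTF-8 and return the absolute path."""
    target = Path(path)
    target.write_text(markup, encoding="utf-8")
    return target.resolve()


markup = """<h1 id="title">Search engines</h1>
<ul class="search">
    <li><a href="https://www.naver.com">Naver</a></li>
</ul>
"""

saved = save_html(markup)
print(saved)  # absolute path of the file that was written
```

To open the saved file in the default browser, the standard library call webbrowser.open(saved.as_uri()) should work on most desktops.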

Searching for tags

Search examples (attrs, re, .string)
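As a rough sketch of the three search styles named above, the following reuses the sample search engine HTML. The variable names are my own, and the regular expression is just one possible pattern.

```python
import re
from bs4 import BeautifulSoup

html = """
<h1 id="title">Search engines</h1>
<div class="top">
<ul class="search">
    <li><a href="https://www.naver.com">Naver</a></li>
    <li><a href="https://www.google.co.kr">Google</a></li>
    <li><a href="https://www.daum.net">Daum</a></li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs: find a tag by its attribute values
title = soup.find("h1", attrs={"id": "title"})
title_text = title.string  # text of a tag that has a single text child

# re: match an attribute value against a regular expression
kr_links = soup.find_all("a", href=re.compile(r"\.co\.kr$"))
kr_names = [a.string for a in kr_links]

# .string on every link inside the <ul class="search"> list
search_list = soup.find("ul", attrs={"class": "search"})
all_names = [a.string for a in search_list.find_all("a")]

print(title_text)
print(kr_names)
print(all_names)
```

Note that .string only returns text when the tag has exactly one child; for tags with mixed content it returns None, and get_text() is the usual alternative.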

* Reference: Beautiful Soup Documentation (Beautiful Soup 4.9.0 documentation), www.crummy.com

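[Practice]: Crawling Hollys Coffee franchise store information

Check the crawling policy first. A site publishes it in robots.txt at the root of its host, for example www.hanbit.co.kr/robots.txt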
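The policy check can also be scripted with the standard library's urllib.robotparser. This is a sketch under assumptions: the rules and the example.com URLs below are invented for illustration and are not taken from any real site's robots.txt, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real robots.txt will differ.
sample_rules = """\
User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()


def is_allowed(rules, url, user_agent="*"):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


store_ok = is_allowed(sample_rules, "https://www.example.com/store/list")
admin_ok = is_allowed(sample_rules, "https://www.example.com/admin/login")
print(store_ok, admin_ok)
```

Against a live site, the usual pattern is to call set_url() with the robots.txt address and then read() instead of parse().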

'''

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd


def hollys_store(result):
    """Append [name, sido-gu, address, phone] for every store to result."""
    for page in range(1, 59):  # pages 1 to 58 (the page count is hard-coded)
        Hollys_url = 'https://www.hollys.co.kr/store/korea/korStore.do?pageNo=%d&sido=&gugun=&store=' % page
        print(Hollys_url)
        html = urllib.request.urlopen(Hollys_url)  # open the URL
        soupHollys = BeautifulSoup(html, 'html.parser')  # parse the response
        tag_tbody = soupHollys.find('tbody')  # find the table body
        if tag_tbody is None:  # no table on this page, so stop paging
            break
        for store in tag_tbody.find_all('tr'):  # one <tr> per store
            if len(store) <= 3:  # a row with this few children has no store data
                break
            store_td = store.find_all('td')  # cells of the row, used to build a list
            store_name = store_td[1].string
            store_sido = store_td[0].string
            store_address = store_td[3].string
            store_phone = store_td[5].string
            result.append([store_name, store_sido, store_address, store_phone])


def main():
    result = []
    print('Hollys store crawling >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>')
    hollys_store(result)  # fills result in place
    hollys_tbl = pd.DataFrame(result, columns=('store', 'sido-gu', 'address', 'phone'))  # build the DataFrame
    hollys_tbl.to_csv('hollys.csv', encoding='cp949', mode='w', index=True)  # save as CSV (cp949 is a Korean Windows code page)
    del result[:]  # clear the list once the file is written


if __name__ == '__main__':
    main()

'''
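One thing to watch in the crawler above: .string returns None when a cell contains more than one child, for example text split by a <br> tag. The sketch below shows a more defensive way to read the rows with get_text(). The table markup here is invented for illustration, so the live page may be laid out differently, and parse_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the live page's table may differ.
sample = """
<table><tbody>
<tr>
  <td>Seoul Gangnam-gu</td>
  <td><a href="#">Sample Store</a></td>
  <td>Open</td>
  <td>1 Sample-ro<br>2F</td>
  <td></td>
  <td>02-000-0000</td>
</tr>
<tr><td>No stores found</td></tr>
</tbody></table>
"""


def parse_rows(markup):
    """Return [name, sido-gu, address, phone] for each complete table row."""
    soup = BeautifulSoup(markup, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:  # page has no table at all
        return []
    rows = []
    for tr in tbody.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 6:  # skip rows without the expected columns
            continue
        # get_text() joins all text in the cell, even across nested tags
        text = [td.get_text(" ", strip=True) for td in cells]
        rows.append([text[1], text[0], text[3], text[5]])
    return rows


rows = parse_rows(sample)
print(rows)
```

Each returned row has the same column order as the DataFrame in the crawler, so the result can be passed to pd.DataFrame in the same way.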
