Learning Basic Spider Libraries
This article studies basic web scraping libraries, including Python urllib and requests. It introduces HTTP request construction, exception handling, URL parsing, regular expression usage, and how to extract information from the Maoyan movie ranking page. It also emphasizes advanced usage such as request headers, cookies, proxy settings, and session persistence.
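As a rough illustration of the request construction, URL parsing, and exception handling mentioned above, here is a minimal sketch using only the standard library. The URL `https://example.com/board/4`, the `offset` parameter, the User-Agent string, and the helper names `build_request` and `fetch` are assumptions made for this example; they are not taken from the article.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode, urlparse
from urllib.request import Request, urlopen

# Hypothetical target; replace with a page you are allowed to crawl.
BASE_URL = "https://example.com/board/4"


def build_request(base_url, params, user_agent="Mozilla/5.0 (compatible; demo-crawler)"):
    """Build a GET request with a query string and a custom User-Agent header."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={"User-Agent": user_agent}, method="GET")


def fetch(req, timeout=10):
    """Send the request and return the decoded body, or None on failure."""
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except HTTPError as exc:  # server answered with an error status code
        print("HTTP error:", exc.code, exc.reason)
    except URLError as exc:  # DNS failure, refused connection, timeout, ...
        print("URL error:", exc.reason)
    return None


req = build_request(BASE_URL, {"offset": 10})
parts = urlparse(req.full_url)
print(parts.scheme, parts.netloc, parts.path, parts.query)
```

`HTTPError` is caught before `URLError` because it is a subclass of it. Calling `fetch(req)` would perform the actual network request, so it is left out of the snippet.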
Web Crawling Basics
A web crawler is an automated program that collects information from web pages. Its basic workflow is to send HTTP requests to retrieve the page source, extract the required data, and save it. Because web pages are built from HTML, CSS, and JavaScript, crawlers need to handle both static pages and pages rendered dynamically by JavaScript. Sessions and cookies maintain user state, while proxy servers can hide the real IP address. Common request methods include GET and POST, and the response status code indicates the result of a request. Crawlers should respect a site's anti-scraping rules; proxies and appropriate request headers help requests succeed more reliably.
Getting Started with Regular Expressions
Regular expressions are powerful text pattern-matching tools used to describe and match specific string patterns. They include literal characters, special characters, character classes, and metacharacters, and are widely used in many programming languages and text processing tools. Regular expressions can be used for data validation, text replacement, and substring extraction, offering strong flexibility and functionality. Common metacharacters and features include character matching, quantifiers, boundary matching, and grouping, which help users process text efficiently.
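The three uses named above (substring extraction, grouping, and text replacement) can be sketched with Python's `re` module. The sample text and both patterns are made up for illustration, and the e-mail pattern is deliberately simplified rather than a complete validator.

```python
import re

# Sample text invented for this example.
text = "Contact: alice@example.com, bob@example.org; released 2023-07-15."

# Character classes and quantifiers: a simplified e-mail pattern.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

# Named groups: capture the parts of an ISO-style date.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Substring extraction: every non-overlapping match.
emails = EMAIL.findall(text)

# Grouping: pull the individual fields out of the first date found.
m = DATE.search(text)
year, month, day = m.group("year"), m.group("month"), m.group("day")

# Text replacement: rewrite the date as day/month/year.
swapped = DATE.sub(r"\g<day>/\g<month>/\g<year>", text)

print(emails)
print(year, month, day)
print(swapped)
```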
Learning Basic SciPy Usage
SciPy is an open-source Python library built on NumPy and widely used in mathematics, science, and engineering. It provides routines for optimization, linear algebra, integration, and interpolation. It can be installed with pip, and modules such as scipy.optimize and scipy.sparse handle optimization and sparse matrices. SciPy also supports graph structures and spatial data processing, provides multiple distance calculation methods, can exchange data with MATLAB, and can perform significance testing and statistical analysis.
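A small sketch of three of the modules mentioned above, assuming SciPy and NumPy are installed. The objective function, the matrix, and the points are arbitrary values chosen for this example.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean

# scipy.optimize: minimize a one-dimensional function whose minimum is at x = 2.
result = minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# scipy.sparse: store only the non-zero entries of a mostly empty matrix.
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
sparse = csr_matrix(dense)

# scipy.spatial.distance: Euclidean distance between two points.
dist = euclidean([0, 0], [3, 4])

print(result.x, result.fun)
print(sparse.nnz)
print(dist)
```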
NumPy Study Notes 2
This article introduces many NumPy features, including bitwise operations, string operations, mathematical functions, statistical functions, sorting and conditional filtering, byte swapping, array copies and views, the matrix library, linear algebra, file input/output, and integration with Matplotlib. It provides detailed function descriptions and sample code to help readers understand and apply various NumPy capabilities.
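To make a few of these topics concrete, here is a short sketch of views versus copies, sorting with conditional filtering, and a linear solve, assuming NumPy is installed. The arrays are invented for the example and do not come from the article.

```python
import numpy as np

# Views share memory with the original array; copies do not.
a = np.array([3, 1, 2])
view = a.view()
copy = a.copy()
view[0] = 99  # also changes a[0], but not copy[0]

# Sorting and conditional filtering.
values = np.array([5, 2, 8, 1, 9])
ordered = np.sort(values)
large_idx = np.where(values > 4)[0]  # indices of elements greater than 4

# Linear algebra: solve 2x + y = 3 and x + 3y = 5.
coeffs = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
solution = np.linalg.solve(coeffs, rhs)

print(a, copy)
print(ordered, large_idx)
print(solution)
```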
NumPy Study Notes 1
NumPy is a Python extension library that supports multidimensional arrays and matrix operations and provides rich mathematical functions. Its main features include the powerful ndarray object, broadcasting, and integration with C/C++/Fortran. NumPy is often used together with SciPy and Matplotlib to form a strong scientific computing stack. It can be installed with pip, supports multiple data types, and provides rich array creation and manipulation features, including slicing, indexing, and broadcasting.
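Slicing, boolean indexing, and broadcasting can be illustrated in a few lines, assuming NumPy is installed. The array contents are arbitrary.

```python
import numpy as np

# A 3x4 ndarray holding 0..11.
a = np.arange(12).reshape(3, 4)

# Broadcasting: the length-4 row is added to every row of the 3x4 array.
row = np.array([10, 20, 30, 40])
b = a + row

# Slicing: take the second column of every row.
second_col = a[:, 1]

# Boolean indexing: keep only the even elements.
evens = a[a % 2 == 0]

print(b)
print(second_col)
print(evens)
```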
Running pyspider on Windows 11 with Docker
If installation problems occur when using pyspider on Windows 11, Docker can be used as an alternative installation method. This post provides examples using Docker commands and docker-compose. After startup, you can verify whether pyspider is running correctly by visiting http://localhost:5000/.
Pandas Basics
Pandas is an open-source data analysis library for Python that provides two main data structures, DataFrame and Series, for handling structured data. It supports data cleaning, transformation, analysis, and visualization. After installing Pandas, you can create and operate on Series and DataFrame with simple code, including basic operations, data filtering, and attribute access. Pandas also supports reading and processing CSV and JSON files and provides data cleaning features such as handling missing values and duplicate data.
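The cleaning steps described above (reading CSV data, dropping duplicates, filling missing values, and filtering) might look like the following, assuming Pandas is installed. The CSV content and column names are invented for the example, and an in-memory string stands in for a real file.

```python
import io

import pandas as pd

# In-memory CSV standing in for a file on disk; the data is made up.
csv_text = """name,score
Ann,90
Bob,82
Bob,82
Cid,
Dee,75
"""

df = pd.read_csv(io.StringIO(csv_text))

# Remove the exact duplicate row for Bob.
deduped = df.drop_duplicates()

# Fill the missing score with the mean of the remaining scores.
filled = deduped.fillna({"score": deduped["score"].mean()})

# Data filtering: keep rows with a score above 80.
high = filled[filled["score"] > 80]

# A single column of a DataFrame is a Series.
scores = filled["score"]

print(filled)
print(high["name"].tolist())
```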





