23 Essential Python Libraries for Data Analysis

万里潮涌, published 2024-02-16, Zhejiang


Hello everyone, I'm Xiaohan.

1. NumPy

NumPy is a powerful numerical computing library for Python. It supports large multidimensional arrays and matrices and provides mathematical functions for operating on them.

import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Calculating mean and standard deviation
mean_value = np.mean(arr)
std_dev = np.std(arr)

print(f'Mean: {mean_value}, Standard Deviation: {std_dev}')
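The description mentions multidimensional arrays and matrices, which the one-dimensional example does not exercise. The following is a minimal sketch of that side of NumPy; the array values and variable names are made up for illustration.

```python
import numpy as np

# A 2x3 array (illustrative values, not from the article)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Column means: reduce along axis 0 (down the rows)
col_means = matrix.mean(axis=0)

# Broadcasting: the 1-D column means are subtracted from every row
centered = matrix - col_means

# Matrix product of the 2x3 array with its 3x2 transpose
gram = matrix @ matrix.T

print(matrix.shape)
print(col_means)
print(centered)
print(gram)
```

Broadcasting is what lets the shape-(3,) vector of means line up against the shape-(2, 3) array without an explicit loop.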

2. Pandas

Pandas is a Python library for data manipulation and analysis. It provides data structures such as the DataFrame for efficient data processing, cleaning, and exploration.

import pandas as pd

# Creating a Pandas DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 22]}
df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)
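Creating a DataFrame is only the starting point; the cleaning and exploration mentioned in the description might look like the sketch below. The extra `Team` column, the fourth row, and the choice of median imputation are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing age (values are made up)
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Carol'],
    'Team': ['A', 'B', 'A', 'B'],
    'Age': [25, 28, np.nan, 30],
})

# Cleaning: fill the missing age with the median of the known ages
df['Age'] = df['Age'].fillna(df['Age'].median())

# Exploration: average age per team
mean_age_by_team = df.groupby('Team')['Age'].mean()

print(df)
print(mean_age_by_team)
```

Whether to fill, drop, or flag missing values depends on the dataset; median filling is just one common option.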

3. Matplotlib

Matplotlib is a versatile 2D plotting library for Python, widely used to create static, interactive, and animated visualizations.

import matplotlib.pyplot as plt

# Creating a simple line plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()

4. Seaborn

Seaborn is a data visualization library built on Matplotlib, designed for attractive and informative statistical graphics.

import matplotlib.pyplot as plt
import seaborn as sns

# Using Seaborn to create a scatter plot
tips = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Scatter Plot using Seaborn')
plt.show()

5. SciPy

SciPy is an open-source library for mathematics, science, and engineering. It extends NumPy with additional functionality such as optimization, integration, and interpolation.

import scipy.stats

# Performing a t-test
data1 = [1, 2, 3, 4, 5]
data2 = [2, 4, 6, 8, 10]

t_stat, p_value = scipy.stats.ttest_ind(data1, data2)
print(f'T-statistic: {t_stat}, p-value: {p_value}')
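The t-test covers only the statistics module. As a rough sketch of the optimization, integration, and interpolation features named in the description, the snippet below uses `scipy.integrate.quad`, `scipy.optimize.minimize_scalar`, and `scipy.interpolate.interp1d`; the functions and sample points are invented for illustration.

```python
import numpy as np
from scipy import integrate, interpolate, optimize

# Integration: area under x^2 between 0 and 3 (the exact value is 9)
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)

# Optimization: minimum of (x - 2)^2 + 1, which lies at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)

# Interpolation: linear interpolation between three known points
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 10.0, 20.0])
f = interpolate.interp1d(xs, ys)
value = float(f(1.5))

print(f'Area: {area}')
print(f'Minimum at x = {result.x}')
print(f'Interpolated value at 1.5: {value}')
```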

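6. Scikit-learn

Scikit-learn is a Python machine learning library that provides simple, efficient tools for data analysis and modeling, including a range of classification, regression, clustering, and dimensionality reduction algorithms.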

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
predictions = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f'Accuracy: {accuracy}')
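The example above is a classification task. Clustering and dimensionality reduction, also named in the description, could be sketched on the same iris data as follows; choosing two components and three clusters is an assumption for illustration, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load and standardize the four iris features
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: project 4 features onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Clustering: group the projected samples into 3 clusters (3 is assumed)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
print(labels[:10])
```

Unlike the classifier, neither step looks at `iris.target`, so the cluster numbers are arbitrary labels rather than species predictions.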

7. Statsmodels

Statsmodels is a Python library for estimating and testing statistical models, offering a comprehensive set of statistical models and hypothesis tests.

import statsmodels.api as sm

# Performing linear regression
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x = sm.add_constant(x)
model = sm.OLS(y, x).fit()

# Displaying regression summary
print(model.summary())

8. Dask

Dask is a parallel computing library for Python. It uses parallel processing and task scheduling to handle larger-than-memory computations.

import dask.dataframe as dd
import pandas as pd

# Creating a Dask DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 22]}
df = dd.from_pandas(pd.DataFrame(data), npartitions=2)

# Performing a simple computation
result = df['Age'].mean().compute()
print(f'Mean Age: {result}')

9. Bokeh

Bokeh is an interactive visualization library for Python that targets modern web browsers for presentation, providing elegant interactive visualizations for data exploration.

from bokeh.plotting import figure, show
from bokeh.io import output_notebook

# Creating a simple Bokeh plot
output_notebook()

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

p = figure(title='Simple Bokeh Plot', x_axis_label='X-axis', y_axis_label='Y-axis')
p.line(x, y)

show(p)

10. NLTK

NLTK (Natural Language Toolkit) is a powerful library for working with human language data. It provides tools for tasks such as tokenization, stemming, tagging, and parsing.

import nltk
from nltk.tokenize import word_tokenize

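# Download the tokenizer data ('punkt_tab' is the resource newer NLTK releases look for)
nltk.download('punkt')
nltk.download('punkt_tab')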

# Tokenizing a sentence
sentence = 'Natural Language Processing is fascinating.'
tokens = word_tokenize(sentence)

print(tokens)
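Stemming, another task listed in the description, needs no downloaded data. A small sketch using NLTK's `PorterStemmer` (the word list is made up for illustration):

```python
from nltk.stem import PorterStemmer

# The Porter stemmer strips common English suffixes by rule
stemmer = PorterStemmer()

words = ['running', 'jumps', 'cats']
stems = [stemmer.stem(word) for word in words]

print(stems)
```

Stems are not guaranteed to be dictionary words; when real base forms are needed, a lemmatizer is usually the better fit.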

11. Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files, offering a convenient way to scrape web data.

from bs4 import BeautifulSoup

# Parsing HTML content
html_content = '<html><body><p>This is a paragraph.</p></body></html>'
soup = BeautifulSoup(html_content, 'html.parser')

# Extracting text from the paragraph tag
paragraph_text = soup.find('p').text
print(paragraph_text)
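Scraping usually means collecting many elements rather than one. The sketch below pulls every link out of a slightly larger, made-up HTML snippet with `find_all`; the markup and URLs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this example
html_content = '''
<html><body>
  <h1>Example Page</h1>
  <ul>
    <li><a href="/first">First link</a></li>
    <li><a href="/second">Second link</a></li>
  </ul>
</body></html>
'''
soup = BeautifulSoup(html_content, 'html.parser')

# Collect (text, href) pairs for every anchor tag
links = [(a.get_text(strip=True), a.get('href')) for a in soup.find_all('a')]

# Read the page heading
heading = soup.find('h1').get_text()

print(heading)
print(links)
```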

12. Plotly

Plotly is a Python graphing library for interactive visualization, well suited to building interactive plots and dashboards.

import plotly.express as px

# Creating a Plotly scatter plot
df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species', title='Iris Dataset')
fig.show()

13. Altair

Altair is a declarative statistical visualization library for Python that lets users create a wide range of interactive visualizations with a concise, intuitive syntax.

import altair as alt
import pandas as pd

# Creating a simple Altair chart
data = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
chart = alt.Chart(data).mark_point().encode(x='x', y='y').properties(title='Altair Chart')

chart

14. Vaex

Vaex is a Python library for lazy, out-of-core DataFrames. It can process large datasets efficiently without loading them fully into memory.

import vaex

# Creating a Vaex DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 22]}
df = vaex.from_dict(data)

# Displaying the Vaex DataFrame
print(df)

15. Geopandas

Geopandas is an extension of Pandas tailored to geospatial data, making it possible to manipulate and analyze geographic datasets efficiently.

import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# Creating a GeoDataFrame with points
geometry = [Point(-74.0059, 40.7128), Point(-73.9862, 40.7306)]
gdf = gpd.GeoDataFrame(geometry=geometry, crs='EPSG:4326')

# Plotting the GeoDataFrame
gdf.plot()
plt.show()

16. Folium

Folium is a Python library that simplifies the creation of interactive maps, making it easy to visualize geospatial data on them.

import folium

# Creating a Folium map (named m to avoid shadowing the built-in map)
m = folium.Map(location=[37.7749, -122.4194], zoom_start=10)

# Adding a marker
folium.Marker(location=[37.7749, -122.4194], popup='San Francisco').add_to(m)

# Displaying the map (renders inline in a Jupyter notebook)
m

17. Xarray

Xarray is a Python library for working with labeled multidimensional arrays, providing powerful and flexible data structures for complex datasets.

import xarray as xr

# Creating a simple xarray DataArray with labeled dimensions
data = xr.DataArray([[1, 2], [3, 4]], dims=('x', 'y'), coords={'x': [0, 1], 'y': [0, 1]})

# Displaying the DataArray
print(data)
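The point of the labels is that data can be selected and reduced by name rather than by position. A minimal sketch, with coordinate values chosen purely for illustration:

```python
import xarray as xr

# Coordinates are deliberately not 0-based to show label-based access
data = xr.DataArray(
    [[1, 2], [3, 4]],
    dims=('x', 'y'),
    coords={'x': [10, 20], 'y': ['a', 'b']},
)

# Select by coordinate label instead of integer position
row = data.sel(x=20)
cell = data.sel(x=10, y='b').item()

# Reduce over a named dimension
mean_over_x = data.mean(dim='x')

print(row.values)
print(cell)
print(mean_over_x.values)
```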

18. LightGBM

LightGBM is a gradient boosting framework for Python, optimized for large datasets and able to train fast, accurate models.

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading Iris dataset and splitting it
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

# Training a LightGBM classifier
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)

# Making predictions and calculating accuracy
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f'Accuracy: {accuracy}')

19. Keras

Keras is a high-level neural network API for Python that makes it easier to develop and experiment with deep learning models through a user-friendly interface.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Creating a simple Keras model that takes 10 input features
model = Sequential()
model.add(Input(shape=(10,)))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

# Displaying the model summary
model.summary()

20. Arrow

Arrow is a Python library for working with dates, times, and timestamps, offering a more intuitive, human-friendly API for temporal data.

import arrow

# Getting the current time in the local timezone and in UTC
local_time = arrow.now()
utc_time = arrow.utcnow()

print(f'Local Time: {local_time}')
print(f'UTC Time: {utc_time}')
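Converting to a specific timezone and doing date arithmetic are where Arrow's API tends to be most convenient. A short sketch using a fixed timestamp so the output is reproducible; the timestamp and the `Asia/Tokyo` zone are arbitrary choices for illustration.

```python
import arrow

# Parse a fixed ISO 8601 timestamp (UTC)
moment = arrow.get('2024-02-16T12:00:00+00:00')

# Convert to a specific timezone
tokyo = moment.to('Asia/Tokyo')

# Date arithmetic: one week later
next_week = moment.shift(weeks=1)

print(tokyo.format('YYYY-MM-DD HH:mm'))
print(next_week.format('YYYY-MM-DD'))
```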

21. NetworkX

NetworkX is a Python library for creating, analyzing, and visualizing complex networks, suited to tasks in graph theory and network analysis.

import networkx as nx
import matplotlib.pyplot as plt

# Creating a simple graph
G = nx.Graph()
G.add_nodes_from([1, 2, 3])
G.add_edges_from([(1, 2), (2, 3)])

# Visualizing the graph
nx.draw(G, with_labels=True, font_weight='bold')
plt.show()
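Beyond drawing, the analysis side of NetworkX can be sketched with a shortest-path query and a simple centrality measure. The small graph below is made up for illustration.

```python
import networkx as nx

# A small undirected graph (edges are illustrative)
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (1, 3)])

# Shortest path between two nodes
path = nx.shortest_path(G, source=1, target=4)

# Number of neighbors per node
degrees = dict(G.degree())

# Degree centrality: degree divided by (number of nodes - 1)
centrality = nx.degree_centrality(G)

print(path)
print(degrees)
print(centrality)
```

Node 3 touches every other node, so it has the highest degree centrality in this graph.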

22. Dash

Dash is a Python web framework for building interactive web applications, and is particularly useful for creating data visualizations and dashboards.

import dash
from dash import dcc, html

# Creating a simple Dash app
app = dash.Dash(__name__)

app.layout = html.Div(children=[
    html.H1(children='Dash Example'),
    dcc.Graph(
        id='example-graph',
        figure={
            'data': [{'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'Bar Chart'}],
            'layout': {'title': 'Dash Bar Chart'}
        }
    )
])

if __name__ == '__main__':
    # app.run replaces the older app.run_server in recent Dash releases
    app.run(debug=True)

23. PyCaret

PyCaret is an open-source, low-code machine learning library written in Python. It aims to simplify machine learning workflows by automating tasks such as feature engineering, model selection, and hyperparameter tuning.

from pycaret.datasets import get_data
from pycaret.classification import *

# Loading a classification dataset
data = get_data('diabetes')

# Setting up PyCaret environment
exp1 = setup(data, target='Class variable')

# Comparing different models
compare_models()

Note: I have also compiled, free of charge, a list of 100 Python libraries related to data analysis (a nicely formatted PDF version is available too).
