Regression Analysis of Instagram Audience Reach

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveRegressor

Loading the dataset

In [56]:
data = pd.read_csv("Instagram.csv", encoding = 'latin1')
data.head()
Out[56]:
Impressions From Home From Hashtags From Explore From Other Saves Comments Shares Likes Profile Visits Follows Caption Hashtags
0 3920.0 2586.0 1028.0 619.0 56.0 98.0 9.0 5.0 162.0 35.0 2.0 Here are some of the most important data visua... #finance #money #business #investing #investme...
1 5394.0 2727.0 1838.0 1174.0 78.0 194.0 7.0 14.0 224.0 48.0 10.0 Here are some of the best data science project... #healthcare #health #covid #data #datascience ...
2 4021.0 2085.0 1188.0 0.0 533.0 41.0 11.0 1.0 131.0 62.0 12.0 Learn how to train a machine learning model an... #data #datascience #dataanalysis #dataanalytic...
3 4528.0 2700.0 621.0 932.0 73.0 172.0 10.0 7.0 213.0 23.0 8.0 Here’s how you can write a Python program to d... #python #pythonprogramming #pythonprojects #py...
4 2518.0 1704.0 255.0 279.0 37.0 96.0 5.0 4.0 123.0 8.0 0.0 Plotting annotations while visualizing your da... #datavisualization #datascience #data #dataana...

Checking for and dropping missing values

In [57]:
data.isnull().sum()
Out[57]:
Impressions       1
From Home         1
From Hashtags     1
From Explore      1
From Other        1
Saves             1
Comments          1
Shares            1
Likes             1
Profile Visits    1
Follows           1
Caption           1
Hashtags          1
dtype: int64
In [58]:
data = data.dropna()
In [59]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 0 to 98
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Impressions     99 non-null     float64
 1   From Home       99 non-null     float64
 2   From Hashtags   99 non-null     float64
 3   From Explore    99 non-null     float64
 4   From Other      99 non-null     float64
 5   Saves           99 non-null     float64
 6   Comments        99 non-null     float64
 7   Shares          99 non-null     float64
 8   Likes           99 non-null     float64
 9   Profile Visits  99 non-null     float64
 10  Follows         99 non-null     float64
 11  Caption         99 non-null     object 
 12  Hashtags        99 non-null     object 
dtypes: float64(11), object(2)
memory usage: 10.8+ KB
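As a hedged illustration of what `dropna()` does with its default arguments, the sketch below uses a small made-up frame (the captions and the empty third row are hypothetical, not taken from Instagram.csv). Any row that contains at least one missing value is removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row is empty in every column.
toy = pd.DataFrame({
    "Impressions": [3920.0, 5394.0, np.nan],
    "Likes": [162.0, 224.0, np.nan],
    "Caption": ["first post", "second post", None],
})

missing_per_column = toy.isnull().sum()  # count of missing values per column
cleaned = toy.dropna()                   # default how="any": drop rows with any NaN

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

If only some columns matter for the analysis, `dropna(subset=[...])` restricts the check to those columns.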

Instagram profile analysis

Audience reach analysis

In [60]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10, 8))
plt.title("Distribution of Impressions From Home")
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(data['From Home'], kde=True, stat="density")
plt.show()

In [61]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10, 8))
plt.title("Distribution of Impressions From Hashtags")
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(data['From Hashtags'], kde=True, stat="density")
plt.show()

In [62]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10, 8))
plt.title("Distribution of Impressions From Explore")
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(data['From Explore'], kde=True, stat="density")
plt.show()

In [63]:
home = data["From Home"].sum()
hashtags = data["From Hashtags"].sum()
explore = data["From Explore"].sum()
other = data["From Other"].sum()

labels = ['From Home', 'From Hashtags', 'From Explore', 'From Other']
values = [home, hashtags, explore, other]

# values and labels are plain lists of four totals, so no data frame is passed
fig = px.pie(values=values, names=labels,
             title='Impressions on Instagram Posts From Various Sources')
fig.show()
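The pie chart shows each source's share of total impressions. As a rough sketch, the same shares can be computed as percentages. The frame below contains only the five rows printed by `data.head()` above, so the numbers illustrate the calculation and are not the shares for the full dataset.

```python
import pandas as pd

# First five rows of the four source columns, copied from data.head() above.
sample = pd.DataFrame({
    "From Home": [2586.0, 2727.0, 2085.0, 2700.0, 1704.0],
    "From Hashtags": [1028.0, 1838.0, 1188.0, 621.0, 255.0],
    "From Explore": [619.0, 1174.0, 0.0, 932.0, 279.0],
    "From Other": [56.0, 78.0, 533.0, 73.0, 37.0],
})

totals = sample.sum()                            # impressions per source
shares = (totals / totals.sum() * 100).round(1)  # percentage of all impressions

print(shares.sort_values(ascending=False))
```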

Content analysis

In [64]:
text = " ".join(element for element in data.Caption)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.style.use('classic')
plt.figure( figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [65]:
text = " ".join(element for element in data.Hashtags)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.figure( figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
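A word cloud sizes each word by how often it appears in the text. To inspect the counts behind such a picture, a minimal sketch with `collections.Counter` is shown below. The hashtag strings are hypothetical stand-ins for `data.Hashtags`, and WordCloud applies its own tokenization and stopword filtering, so its counts may differ.

```python
from collections import Counter

# Hypothetical hashtag strings standing in for the rows of data.Hashtags.
hashtag_rows = [
    "#data #datascience #python",
    "#python #machinelearning #data",
    "#data #dataanalysis",
]

tokens = " ".join(hashtag_rows).split()  # whitespace-separated hashtags
counts = Counter(tokens)                 # frequency of each hashtag

print(counts.most_common(3))
```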

Relationship analysis

In [69]:
# trendline="ols" requires statsmodels: pip install statsmodels
figure = px.scatter(data_frame=data, x="Impressions",
                    y="Likes", size="Likes", trendline="ols",
                    title="Relationship Between Likes and Impressions")
figure.show()
In [70]:
figure = px.scatter(data_frame=data, x="Impressions",
                    y="Comments", size="Comments", trendline="ols",
                    title="Relationship Between Comments and Total Impressions")
figure.show()
In [71]:
figure = px.scatter(data_frame=data, x="Impressions",
                    y="Shares", size="Shares", trendline="ols",
                    title="Relationship Between Shares and Impressions")
figure.show()
In [72]:
figure = px.scatter(data_frame=data, x="Impressions",
                    y="Saves", size="Saves", trendline="ols",
                    title="Relationship Between Post Saves and Impressions")
figure.show()
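The `trendline="ols"` option draws an ordinary least squares line through the points. A minimal sketch of the same kind of fit with `np.polyfit` is shown below, using only the five rows printed by `data.head()` above. Five points are far too few for any conclusion; the aim is only to show what the trendline computes.

```python
import numpy as np

# Impressions and Likes for the first five posts, copied from data.head() above.
impressions = np.array([3920.0, 5394.0, 4021.0, 4528.0, 2518.0])
likes = np.array([162.0, 224.0, 131.0, 213.0, 123.0])

# Degree-1 least squares fit: likes ≈ slope * impressions + intercept
slope, intercept = np.polyfit(impressions, likes, deg=1)
predicted_likes = slope * impressions + intercept

print(f"likes ≈ {slope:.4f} * impressions + {intercept:.1f}")
```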
In [73]:
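# Caption and Hashtags are text columns, so correlate the numeric columns only
correlation = data.select_dtypes(include="number").corr()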
correlation["Impressions"].sort_values(ascending=False)
Out[73]:
Impressions       1.000000
Likes             0.896277
From Hashtags     0.892682
Follows           0.804064
Profile Visits    0.774393
Saves             0.625600
From Home         0.603378
From Explore      0.498389
Shares            0.476617
From Other        0.429227
Comments          0.247201
Name: Impressions, dtype: float64
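By default, `corr()` returns Pearson correlation coefficients. As a sketch of what each number in the table means, the coefficient can be computed by hand and compared with pandas. The sample uses only the five rows from `data.head()`, so the value differs from the 0.896 reported for the full dataset.

```python
import numpy as np
import pandas as pd

# Impressions and Likes for the first five posts, copied from data.head() above.
sample = pd.DataFrame({
    "Impressions": [3920.0, 5394.0, 4021.0, 4528.0, 2518.0],
    "Likes": [162.0, 224.0, 131.0, 213.0, 123.0],
})

x = sample["Impressions"].to_numpy()
y = sample["Likes"].to_numpy()
dx, dy = x - x.mean(), y - y.mean()  # deviations from the means

# Pearson r: covariance divided by the product of the standard deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = sample["Impressions"].corr(sample["Likes"])

print(r_manual, r_pandas)
```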

Conversion analysis

Conversion rate = (Follows / Profile Visits) * 100

In [74]:
conversion_rate = (data["Follows"].sum() / data["Profile Visits"].sum()) * 100
conversion_rate
Out[74]:
31.17770767613039
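The rate above is an aggregate: total follows divided by total profile visits, so posts with more visits weigh more. A hedged sketch of the difference between that aggregate and the mean of per-post rates, using only the five rows from `data.head()` (so neither number matches the full-dataset figure):

```python
import pandas as pd

# Profile Visits and Follows for the first five posts, copied from data.head() above.
sample = pd.DataFrame({
    "Profile Visits": [35.0, 48.0, 62.0, 23.0, 8.0],
    "Follows": [2.0, 10.0, 12.0, 8.0, 0.0],
})

# Aggregate rate: each post is weighted by its number of profile visits.
aggregate_rate = sample["Follows"].sum() / sample["Profile Visits"].sum() * 100

# Per-post rates averaged with equal weight for every post.
per_post_rate = sample["Follows"] / sample["Profile Visits"] * 100
mean_of_rates = per_post_rate.mean()

print(round(aggregate_rate, 2), round(mean_of_rates, 2))
```

The two numbers generally differ, so it helps to state which one is being reported.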
In [75]:
figure = px.scatter(data_frame=data, x="Profile Visits",
                    y="Follows", size="Follows", trendline="ols",
                    title="Relationship Between Profile Visits and Follows")
figure.show()

ML model for predicting Instagram audience reach

We fit a regression model (PassiveAggressiveRegressor) that predicts Impressions from the engagement metrics.

In [78]:
x = np.array(data[['Likes', 'Saves', 'Comments', 'Shares', 'Profile Visits', 'Follows']])
y = np.array(data['Impressions'])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
In [79]:
model = PassiveAggressiveRegressor()
model.fit(xtrain, ytrain)
model.score(xtest, ytest)
Out[79]:
0.9275754559204714
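For scikit-learn regressors, `score` returns the coefficient of determination R² on the data passed to it. A minimal numpy sketch of that quantity follows; `r_squared` is a hypothetical helper written for illustration, and the target values are the five Impressions from `data.head()`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Impressions for the first five posts, copied from data.head() above.
y_true = np.array([3920.0, 5394.0, 4021.0, 4528.0, 2518.0])

y_perfect = y_true.copy()                          # perfect predictions
y_mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean

print(r_squared(y_true, y_perfect), r_squared(y_true, y_mean_only))
```

A score of 1.0 means perfect predictions, 0.0 means no better than always predicting the mean, and negative values are possible for worse models.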
In [80]:
# Feature order must match training: [[Likes, Saves, Comments, Shares, Profile Visits, Follows]]
features = np.array([[294.9, 233.9, 1.0, 9.9, 126.0, 54.0]])
model.predict(features)
Out[80]:
array([9255.41845053])