Data Analysis Hacks
Hacks
- make your own data using your brain, Google, or ChatGPT; it should look different from mine.
- modify my code or write your own
- output your data as something other than a bar graph.
- write an 850+ word essay on how pandas (Python or IRL) affected your life. If the AI score is below 85%, then -1 grading point.
- answer the questions below, the more explained the better.
Questions
- What are the two primary data structures in pandas and how do they differ? A Series is a one-dimensional labeled array, and a DataFrame is a two-dimensional labeled table whose columns are Series.
- How do you read a CSV file into a pandas DataFrame? Use the pd.read_csv() function, for example df = pd.read_csv('file_name.csv').
- How do you select a single column from a pandas DataFrame? df['column-name']
- How do you filter rows in a pandas DataFrame based on a condition? df[df['column-name'] == value]
- How do you group rows in a pandas DataFrame by a particular column? df.groupby('column-name')
- How do you aggregate data in a pandas DataFrame using functions like sum and mean? df.groupby('column-name')['other-column'].mean(), or .sum() in place of .mean()
- How do you handle missing values in a pandas DataFrame? df.dropna() drops rows with missing values; df.fillna(value) replaces them instead.
- How do you merge two pandas DataFrames together? pd.merge(left=dfname1, right=dfname2, left_on='column-name1', right_on='column-name2')
- How do you export a pandas DataFrame to a CSV file? df.to_csv('file_name.csv')
- What is the difference between a Series and a DataFrame in Pandas? A Series is one-dimensional (a single labeled column of values), while a DataFrame is two-dimensional (rows and columns).
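The answers above can be tried out together in one short sketch. The data here is a hypothetical in-memory CSV (the titles, column names, and values are made up for illustration), so nothing needs to exist on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV so the sketch runs without any file on disk.
csv_text = """title,genre,rating,pages
Dune,scifi,4.2,612
Emma,classic,4.0,
Neuromancer,scifi,3.9,271
Persuasion,classic,4.1,249
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# A single column is a one-dimensional Series; the table is a two-dimensional DataFrame.
ratings = df["rating"]

# Boolean filtering: the comparison uses ==, not =.
scifi = df[df["genre"] == "scifi"]

# Group by one column, then aggregate another.
mean_rating = df.groupby("genre")["rating"].mean()

# Missing values can be dropped or filled.
dropped = df.dropna()
filled = df.fillna({"pages": df["pages"].median()})

# Merge on differently named key columns.
labels = pd.DataFrame({"code": ["scifi", "classic"], "label": ["Science fiction", "Classic"]})
merged = pd.merge(left=df, right=labels, left_on="genre", right_on="code")

# to_csv returns a string when no path is given; index=False omits the row index.
csv_out = df.to_csv(index=False)
```

Passing a file name to to_csv, as in the answer above, writes the same text to disk instead of returning it.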
import pandas as pd
# read the CSV file
df = pd.read_csv('datasets/books.csv')
df = df.drop(columns=['bookID', 'isbn', 'isbn13', 'language_code', 'publication_date', 'publisher'])
print(df.head())
import matplotlib.pyplot as plt
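# bin ratings into five one-point groups (ratings range from 0 to 5)
rating_groups = ['0-1', '1-2', '2-3', '3-4', '4-5']
# sort_index keeps the bars in rating order instead of ordering them by count
rating_counts = pd.cut(df['average_rating'], bins=[0, 1, 2, 3, 4, 5], labels=rating_groups, include_lowest=True).value_counts().sort_index()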
plt.bar(rating_counts.index, rating_counts.values)
plt.title('Number of books in each rating group')
plt.xlabel('Rating group')
plt.ylabel('Number of books')
plt.show()
# create a scatter plot of number of ratings vs. rating
plt.scatter(df['ratings_count'], df['average_rating'])
plt.title('Ratings count VS Average rating')
plt.xlabel('Ratings count')
plt.ylabel('Average rating')
plt.show()
Data Analysis/Predictive Analysis Hacks
- How can Numpy and Pandas be used to preprocess data for predictive analysis?
- Pandas is built for labeled, tabular data and handles a wide variety of data types, while NumPy provides fast numerical arrays and the mathematical operations on them. Both can be used to clean, standardize, and transform data to prepare it for predictive analysis.
- What machine learning algorithms can be used for predictive analysis, and how do they differ?
- Linear regression predicts a continuous outcome from a linear relationship between the independent and dependent variables. Decision trees model a prediction as a sequence of branching decisions on feature values. Random forests combine many decision trees trained on random subsets of the data and average or vote on their predictions, which usually reduces overfitting. Neural networks are layered models loosely inspired by the human brain that can learn complex nonlinear patterns. Support vector machines find the boundary that separates the classes in a dataset with the widest margin.
- Can you discuss some real-world applications of predictive analysis in different industries?
- There are various scenarios in which predictions may be required, such as predicting the temperature or weather forecast, forecasting which team will emerge victorious in a sports game, predicting whether an individual has a medical condition or not based on an image or scan, determining whether a user will find a video appealing or not, and making predictions about stock market trends.
- Can you explain the role of feature engineering in predictive analysis, and how it can improve model accuracy?
- Feature engineering is the process of selecting, transforming, and creating input variables (features) from raw data for a predictive model. It can improve accuracy because well-chosen features expose patterns in the data that the model would otherwise struggle to find.
- How can machine learning models be deployed in real-time applications for predictive analysis?
- Real-time applications, such as TikTok, can utilize machine learning models to suggest videos that align with a user's interests. For instance, as a user interacts with the app by liking certain videos, a machine learning model can be employed to analyze their preferences and make recommendations based on that data.
- Can you discuss some limitations of Numpy and Pandas, and when it might be necessary to use other data analysis tools?
- Pandas generally uses more memory than NumPy, which stores data in compact, single-type arrays. Pandas is often said to perform comparatively well as the number of rows grows, but it works in memory, so extremely large datasets may call for alternative tools. Its syntax can be complex and takes time to learn. Pandas also has no built-in structure for data with three or more dimensions, whereas NumPy arrays can have any number of dimensions.
- How can predictive analysis be used to improve decision-making and optimize business processes?
- By enabling businesses to make more informed decisions, predictive analysis can facilitate the optimization of various business processes. This can be accomplished by forecasting the most probable outcomes for a business, which can, in turn, aid in the tailoring of customer experiences to increase profits.
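The preprocessing, feature engineering, and linear regression answers above can be sketched with only Pandas and NumPy. This is a minimal illustration and not a production workflow: the housing data, the column names, and the 90 m2 query are all hypothetical, and the regression is fitted with NumPy's least-squares solver, not with a machine learning library.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data; every value here is made up for illustration.
raw = pd.DataFrame({
    "area_m2": [50.0, 80.0, np.nan, 120.0, 100.0],
    "rooms": [2, 3, 3, 5, 4],
    "price": [150.0, 240.0, 210.0, 360.0, 300.0],
})

# Cleaning: fill the missing area with the column median.
clean = raw.fillna({"area_m2": raw["area_m2"].median()})

# Feature engineering: derive a new column from existing ones.
clean["area_per_room"] = clean["area_m2"] / clean["rooms"]

# Standardizing: rescale a feature to zero mean and unit variance.
area = clean["area_m2"].to_numpy()
area_std = (area - area.mean()) / area.std()

# Linear regression by ordinary least squares: price ~ slope * area_std + intercept.
X = np.column_stack([area_std, np.ones_like(area_std)])
coef, *_ = np.linalg.lstsq(X, clean["price"].to_numpy(), rcond=None)
slope, intercept = coef

# Predict the price for a new, hypothetical 90 m2 home using the same scaling.
new_area_std = (90.0 - area.mean()) / area.std()
predicted = slope * new_area_std + intercept
```

Because the standardized feature has zero mean, the fitted intercept equals the average price, which makes the coefficients easy to sanity-check.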
import matplotlib.pyplot as plt
from skimage import io
# load the image as a NumPy array
photo = io.imread('../images/waldo.jpg')
type(photo)
plt.imshow(photo)
photo.shape
# crop to rows 210-349 and columns 425-499
plt.imshow(photo[210:350, 425:500])
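The crop above relies on NumPy slicing, where rows are indexed before columns. Here is a small sketch of the same slice on a synthetic array; the 400 x 600 size is an assumption made for illustration, not the real photo's dimensions.

```python
import numpy as np

# Synthetic stand-in for an RGB photo: 400 rows, 600 columns, 3 color channels.
# (Hypothetical dimensions; the real photo's shape comes from photo.shape.)
image = np.zeros((400, 600, 3), dtype=np.uint8)

# Rows come first, then columns: image[row_start:row_stop, col_start:col_stop].
crop = image[210:350, 425:500]

print(image.shape)  # (400, 600, 3)
print(crop.shape)   # (140, 75, 3)

# Basic slicing returns a view, so writing to the crop changes the original array.
crop[:] = 255
print(image[210, 425])  # [255 255 255]
```

If the crop needs to be edited without touching the original, calling .copy() on the slice gives an independent array.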
Numpy functions
In the example below, np.arange creates an array of the even numbers from 0 up to, but not including, 10, using a step of 2. The resulting array has values [0, 2, 4, 6, 8]. The function is particularly useful when you need an array with a sequence of regularly spaced values. It can create arrays of integers or floats, and the resulting array can be used for mathematical operations like addition, subtraction, multiplication, and division. It is also used in creating plots, animations, and simulations.
import numpy as np
# create an array of even numbers between 0 and 10 (excluding 10)
arr = np.arange(0, 10, 2)
print(arr)
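As a further sketch of the points above, np.arange also accepts a float step, and arithmetic on the result is applied element by element. The np.linspace call is an addition that is not part of the original example; it is a related NumPy function that takes a number of points in place of a step.

```python
import numpy as np

evens = np.arange(0, 10, 2)      # integers: [0 2 4 6 8]
halves = np.arange(0, 2, 0.5)    # floats: [0.  0.5 1.  1.5]

# Arithmetic is applied element by element.
doubled = evens * 2              # [ 0  4  8 12 16]
shifted = evens + 1              # [1 3 5 7 9]

# linspace takes the number of points instead of a step and includes the endpoint.
points = np.linspace(0, 1, 5)    # [0.   0.25 0.5  0.75 1.  ]
```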
The example below computes the sin, cos, tan, natural log, and log10 of a 1D array. The trigonometric functions interpret the values as radians.
array = np.array([1, 2, 3])
# calculate sin
sin_x = np.sin(array)
print(sin_x)
# calculate cos
cos_x = np.cos(array)
print(cos_x)
# calculate tan
tan_x = np.tan(array)
print(tan_x)
# calculate natural log
ln_x = np.log(array)
print(ln_x)
# calculate log10
log10_x = np.log10(array)
print(log10_x)
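Because NumPy's trigonometric functions work in radians, angles given in degrees need converting first. A short sketch, where the degree values are chosen only for illustration:

```python
import numpy as np

# NumPy's trigonometric functions expect angles in radians.
degrees = np.array([0.0, 30.0, 90.0])
radians = np.deg2rad(degrees)

sines = np.sin(radians)
print(np.round(sines, 3))  # [0.  0.5 1. ]

# np.log is the natural logarithm (base e); np.log10 is base 10.
natural = np.log(np.e)
base10 = np.log10(1000.0)
```

The results are rounded for display because floating-point arithmetic gives values such as sin(30 degrees) only approximately.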