How to Reduce Python Runtime for Demanding Tasks

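Original: towardsdatascience.com/how-to-reduce-python-runtime-for-demanding-tasks-2857efad0cec

One of the biggest challenges data scientists face is the long runtime of Python code when working with very large datasets or highly complex machine learning and deep learning models. Many methods have proven effective at improving code efficiency, such as dimensionality reduction, model optimization, and feature selection. These are all algorithm-based solutions. Another way to tackle the problem is, in some cases, to use a different programming language.

In this article I will not focus on algorithm-based ways to improve efficiency. Instead, I will discuss practical techniques that are both convenient and easy to pick up.

To illustrate, I will use the Online Retail dataset, a public dataset released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can download the original Online Retail dataset from the UCI Machine Learning Repository. It contains all transactions made during a specific period by a UK-based, registered, non-store online retailer. The goal is to train a model that predicts whether a customer will make a repeat purchase. The following Python code is used for that.

```python
from itertools import product

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load dataset from Excel file
data = pd.read_excel("Online Retail.xlsx", engine="openpyxl")

# Data preprocessing
data = data.dropna(subset=["CustomerID"])
data["InvoiceYearMonth"] = data["InvoiceDate"].astype("datetime64[ns]").dt.to_period("M")

# Feature engineering
data["TotalPrice"] = data["Quantity"] * data["UnitPrice"]
customer_features = data.groupby("CustomerID").agg({
    "TotalPrice": "sum",
    "InvoiceYearMonth": "nunique",  # Count of unique purchase months
    "Quantity": "sum",
}).rename(columns={
    "TotalPrice": "TotalSpend",
    "InvoiceYearMonth": "PurchaseMonths",
    "Quantity": "TotalQuantity",
})

# Create the target variable
customer_features["Repurchase"] = (customer_features["PurchaseMonths"] > 1).astype(int)

# Train-test split
X = customer_features.drop("Repurchase", axis=1)
y = customer_features["Repurchase"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Define different values for parameters
n_estimators_options = [50, 100, 200]
max_depth_options = [None, 10, 20]
class_weight_options = [None, "balanced"]

# Train the RandomForestClassifier with different combinations of parameters
results = []
for n_estimators, max_depth, class_weight in product(
    n_estimators_options, max_depth_options, class_weight_options
):
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        class_weight=class_weight,
        random_state=42,
    )
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    results.append((n_estimators, max_depth, class_weight, accuracy))
```

Because it processes 541,909 rows, the code takes a while to run. In industries such as e-commerce or social media, data scientists often work with much larger datasets, sometimes billions or even trillions of rows with many more features. There are also mixes of structured and unstructured data (text, images, or video), and these varied data types add to the workload. Applying techniques that make the code more efficient is therefore essential. I will stick with the Online Retail data to keep the explanation simple.

Before introducing the techniques, I measured the time needed to run the whole Python script, to read the Online Retail data, and to train the machine learning model.

```python
import time

# Function to measure and return elapsed time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    elapsed_time = time.time() - start_time
    return result, elapsed_time

# 1. Full Python code execution timing
def complete_process():
    # Load dataset from Excel file
    data = pd.read_excel("Online Retail.xlsx", engine="openpyxl")

    # Data preprocessing
    data = data.dropna(subset=["CustomerID"])
    data["InvoiceYearMonth"] = data["InvoiceDate"].astype("datetime64[ns]").dt.to_period("M")

    # Feature engineering
    data["TotalPrice"] = data["Quantity"] * data["UnitPrice"]
    customer_features = data.groupby("CustomerID").agg({
        "TotalPrice": "sum",
        "InvoiceYearMonth": "nunique",
        "Quantity": "sum",
    }).rename(columns={
        "TotalPrice": "TotalSpend",
        "InvoiceYearMonth": "PurchaseMonths",
        "Quantity": "TotalQuantity",
    })
    customer_features["Repurchase"] = (customer_features["PurchaseMonths"] > 1).astype(int)

    # Train-test split
    X = customer_features.drop("Repurchase", axis=1)
    y = customer_features["Repurchase"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Model training with parameter combinations
    results = []
    for n_estimators, max_depth, class_weight in product(
        n_estimators_options, max_depth_options, class_weight_options
    ):
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            class_weight=class_weight,
            random_state=42,
        )
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))
    return results

# Measure total execution time
results, total_time = time_execution(complete_process)
print(f"Total execution time for the entire process: {total_time} seconds")

# 2. Timing the Excel file reading
def read_excel():
    return pd.read_excel("Online Retail.xlsx", engine="openpyxl")

# Measure time taken to read the Excel file
_, read_time = time_execution(read_excel)
print(f"Time taken to read the Excel file: {read_time} seconds")

# 3. Timing the model training
# (X_test and y_test are the globals created by the split in the first script)
def train_model(X_train, y_train):
    results = []
    for n_estimators, max_depth, class_weight in product(
        n_estimators_options, max_depth_options, class_weight_options
    ):
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            class_weight=class_weight,
            random_state=42,
        )
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))
    return results

# Measure time taken to train the model
_, train_time = time_execution(train_model, X_train, y_train)
print(f"Time taken to train the model: {train_time} seconds")
```

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/6680c96a3ff5a703c1e00b62aa6c96fd.png
Screenshot by the author

The whole process takes nearly 20 seconds, of which about 18 seconds are spent reading the data file.

Solution 1: Enable the GPU and set memory growth

Compared with CPUs, GPUs are well suited to large datasets and complex models such as deep learning, because they support parallel processing. Developers sometimes forget to set memory growth, which causes the GPU to try to allocate all of its memory for the model at startup.

So what is memory growth, and why does it matter so much when using a GPU? Memory growth is a mechanism that lets the GPU allocate memory incrementally as needed instead of reserving one large block up front. If memory growth is not set and the model is large, there may not be enough memory available, which can lead to an "out of memory" error. When several models run at the same time, one model can consume all of the GPU memory and block the others from using the GPU.

In short, setting memory growth correctly enables efficient GPU use, adds flexibility, and makes training on large datasets and complex models more robust. With the GPU enabled and memory growth set, the code looks like this:

```python
import time
from itertools import product

import pandas as pd
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier  # missing in the original listing
from sklearn.model_selection import train_test_split

# Enable GPU and set memory growth
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized
        print(e)

# Function to measure and return elapsed time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    elapsed_time = time.time() - start_time
    return result, elapsed_time

# Read Excel file
def read_excel():
    return pd.read_excel("Online Retail.xlsx", engine="openpyxl")

# Complete process function
def complete_process():
    # Load dataset from Excel file
    data = read_excel()

    # Data preprocessing
    data = data.dropna(subset=["CustomerID"])
    data["InvoiceYearMonth"] = data["InvoiceDate"].astype("datetime64[ns]").dt.to_period("M")

    # Feature engineering
    data["TotalPrice"] = data["Quantity"] * data["UnitPrice"]
    customer_features = data.groupby("CustomerID").agg({
        "TotalPrice": "sum",
        "InvoiceYearMonth": "nunique",
        "Quantity": "sum",
    }).rename(columns={
        "TotalPrice": "TotalSpend",
        "InvoiceYearMonth": "PurchaseMonths",
        "Quantity": "TotalQuantity",
    })
    customer_features["Repurchase"] = (customer_features["PurchaseMonths"] > 1).astype(int)

    # Train-test split
    X = customer_features.drop("Repurchase", axis=1)
    y = customer_features["Repurchase"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Model training with parameter combinations
    results = []
    n_estimators_options = [50, 100]
    max_depth_options = [None, 10]
    class_weight_options = [None, "balanced"]
    for n_estimators, max_depth, class_weight in product(
        n_estimators_options, max_depth_options, class_weight_options
    ):
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            class_weight=class_weight,
            random_state=42,
        )
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))
    return results

# Measure total execution time
results, total_time = time_execution(complete_process)
print(f"Total execution time for the entire process: {total_time} seconds")

# Measure time taken to read the Excel file
_, read_time = time_execution(read_excel)
print(f"Time taken to read the Excel file: {read_time} seconds")

# Measure time taken to train the model
# (X_train, y_train, X_test and y_test are assumed to exist from the earlier split)
def train_model(X_train, y_train):
    results = []
    n_estimators_options = [50, 100]
    max_depth_options = [None, 10]
    class_weight_options = [None, "balanced"]
    for n_estimators, max_depth, class_weight in product(
        n_estimators_options, max_depth_options, class_weight_options
    ):
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            class_weight=class_weight,
            random_state=42,
        )
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))
    return results

_, train_time = time_execution(train_model, X_train, y_train)
print(f"Time taken to train the model: {train_time} seconds")
```

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/41779e68936315a1f6d72dd9b63bb9d8.png
Screenshot by the author

The time needed to train the model drops from 1.9 seconds to 0.6 seconds. Keep in mind when reading this number that scikit-learn's RandomForestClassifier runs on the CPU, and that this run searches a smaller grid (8 parameter combinations instead of 18), which by itself shortens training. The time to read the Excel file, however, does not drop much. Another technique is therefore needed to load and process the data more efficiently: disk I/O optimization with data pipeline prefetching.

Solution 2: Disk I/O optimization with data pipeline prefetching

When reading very large datasets, disk input/output can become the bottleneck. TensorFlow's tf.data API optimizes the input pipeline by allowing asynchronous operations and parallel processing, which makes loading and processing data more efficient. It reduces loading and processing time because it minimizes the latency of reading large datasets and overlaps reading with data processing, creating a continuous, optimized flow of data from disk to the processing pipeline.

The updated code that loads the Online Retail.xlsx data with tf.data is as follows:

```python
import time

import pandas as pd
import tensorflow as tf

# Numeric columns to feed into the pipeline. The original listing referenced
# data.columns outside the generator (undefined there) and declared every
# column as float32, which cannot hold the dataset's text columns.
NUMERIC_COLUMNS = ["Quantity", "UnitPrice", "CustomerID"]

# Function to measure and return elapsed time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    elapsed_time = time.time() - start_time
    return result, elapsed_time

# Function to load and preprocess dataset using tf.data
def load_data_with_tfdata(file_path, batch_size):
    # Define a generator function to yield rows from the Excel file
    def data_generator():
        data = pd.read_excel(file_path, engine="openpyxl", usecols=NUMERIC_COLUMNS)
        for _, row in data.iterrows():
            yield {col: float(row[col]) for col in NUMERIC_COLUMNS}

    # Create a tf.data.Dataset from the generator
    dataset = tf.data.Dataset.from_generator(
        data_generator,
        output_signature={
            col: tf.TensorSpec(shape=(), dtype=tf.float32) for col in NUMERIC_COLUMNS
        },
    )

    # Apply shuffle, batch, and prefetch transformations
    dataset = dataset.shuffle(buffer_size=1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset

# Load and preprocess dataset using tf.data.Dataset
file_path = "Online Retail.xlsx"
batch_size = 32
dataset, data_load_time = time_execution(load_data_with_tfdata, file_path, batch_size)
print(f"Time taken to load and preprocess data with tf.data: {data_load_time} seconds")
```

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/6cfd76ad7335de0046907d4faec4bb5b.png
Screenshot by the author

The measured time drops from 18 seconds to 0.05 seconds. This figure covers building the pipeline: a dataset created with from_generator is lazy, so the Excel file is actually read once the dataset is iterated, and prefetching then overlaps that reading with the work done on each batch.

Defining batch_size properly is necessary, because the amount of data processed at each step affects memory consumption and computational efficiency. If no batch size is set, elements are processed one at a time, which makes data loading, processing, and model training very inefficient. A batch size that is too large or too small can lead to inefficient training, memory errors, slower convergence, or poor model performance.

Conclusion

GPUs are well suited to extremely large datasets and highly complex models, but without the right parameter settings their advantages are hard to realize. Enabling GPU memory growth optimizes GPU usage and prevents memory errors. Disk I/O optimization with data pipeline prefetching reduces the time spent waiting on data loading and processing. Together, these techniques offer practical, high-impact solutions to challenges met in day-to-day work.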
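To make the idea of batching and prefetching concrete, here is a small conceptual sketch in plain Python. It is not how tf.data is implemented; the helpers `batched`, `prefetch`, and `slow_rows` are hypothetical names introduced only for this illustration, and `time.sleep` stands in for disk I/O. A background thread fills a bounded buffer with batches while the consumer works on earlier ones, which is the overlap that prefetching is meant to provide.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the stream


def batched(items, batch_size):
    """Group an iterable into lists of at most `batch_size` elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def prefetch(iterable, buffer_size=2):
    """Yield elements of `iterable` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        # Runs in the background; blocks when the buffer is full.
        for element in iterable:
            buffer.put(element)
        buffer.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        element = buffer.get()
        if element is _SENTINEL:
            return
        yield element


def slow_rows(n, delay=0.001):
    """Hypothetical slow data source; the sleep stands in for disk reads."""
    for i in range(n):
        time.sleep(delay)
        yield i


batches = list(prefetch(batched(slow_rows(10), batch_size=4)))
print(batches)
```

The bounded queue is the design choice that matters here: a larger `buffer_size` lets the reader run further ahead at the cost of more memory, which mirrors the trade-off described for the batch size. This sketch does not propagate exceptions raised in the producer thread, which a production pipeline would need to handle.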
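When comparing the training times reported in the article, it helps to check how many models each run actually trains. The sketch below counts the parameter combinations in the two grids that appear in the listings (the baseline grid and the smaller grid used in the GPU listing). The dictionary layout and the `count_combinations` helper are assumptions made for illustration; only the option values come from the article's code.

```python
from itertools import product

# Grid used in the baseline script
baseline_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "class_weight": [None, "balanced"],
}

# Grid used in the GPU / memory-growth script
gpu_script_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "class_weight": [None, "balanced"],
}


def count_combinations(grid):
    """Number of models an exhaustive search over `grid` trains."""
    return sum(1 for _ in product(*grid.values()))


baseline_count = count_combinations(baseline_grid)
gpu_script_count = count_combinations(gpu_script_grid)
print(f"baseline: {baseline_count} models, GPU script: {gpu_script_count} models")
```

For a like-for-like timing comparison, both runs would need to use the same grid.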
