I am using R to cbind ~11000 files using:
dat <- do.call('bind_cols',lapply(lfiles,read.delim))
which is unbelievably slow. I am using R because the downstream processing (creating plots, etc.) is in R. Are there fast alternatives for concatenating thousands of files by columns?
I have 3 types of files that I want this done for. They look like this:
[centos@ip data]$ head c021_0011_001786_tumor_rnaseq.abundance.tsv
target_id length eff_length est_counts tpm
enst00000619216.1 68 26.6432 10.9074 5.69241
enst00000473358.1 712 525.473 0 0
enst00000469289.1 535 348.721 0 0
enst00000607096.1 138 15.8599 0 0
enst00000417324.1 1187 1000.44 0.0673096 0.000935515
enst00000461467.1 590 403.565 3.22654 0.11117
enst00000335137.3 918 731.448 0 0
enst00000466430.5 2748 2561.44 162.535 0.882322
enst00000495576.1 1319 1132.44 0 0

[centos@ip data]$ head c021_0011_001786_tumor_rnaseq.rsem.genes.norm_counts.hugo.tab
gene_id c021_0011_001786_tumor_rnaseq
tspan6 1979.7185
tnmd 1.321
dpm1 1878.8831
scyl3 452.0372
c1orf112 203.6125
fgr 494.049
cfh 509.8964
fuca2 1821.6096
gclc 1557.4431

[centos@ip data]$ head cpbt_0009_1_tumor_rnaseq.rsem.genes.norm_counts.tab
gene_id cpbt_0009_1_tumor_rnaseq
ensg00000000003.14 2005.0934
ensg00000000005.5 5.0934
ensg00000000419.12 1100.1698
ensg00000000457.13 2376.9100
ensg00000000460.16 1536.5025
ensg00000000938.12 443.1239
ensg00000000971.15 1186.5365
ensg00000001036.13 1091.6808
ensg00000001084.10 1602.7165

Thanks!
For fast reading of the files, you can use fread from data.table, and then rbind the resulting list of data.tables using rbindlist, specifying idcol=TRUE to provide a grouping variable that identifies each of the datasets:
library(data.table)
dt <- rbindlist(lapply(lfiles, fread), idcol=TRUE)
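Since the question asks for a column-wise bind, here is a minimal sketch of one way to do that with fread as well. It reads only the identifier column from the first file and a single value column from every file, then puts them side by side. The helper name bind_value_columns and its arguments are hypothetical, not part of data.table, and the sketch assumes every file lists the same rows in the same order (this is not checked).

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the id column from the
# first file and one value column from each file, bound side by side.
# ASSUMPTION: all files contain the same rows in the same order.
bind_value_columns <- function(files, id_col, value_col) {
  # Read only the identifier column, once
  ids <- fread(files[1], select = id_col)
  # Read only the value column from each file; [[1]] extracts it as a vector
  vals <- lapply(files, function(f) fread(f, select = value_col)[[1]])
  # Name each column after the file it came from
  names(vals) <- basename(files)
  # Combine the id column and all value columns into one data.table
  as.data.table(c(as.list(ids), vals))
}
```

For the abundance files this could be called as bind_value_columns(lfiles, id_col = "target_id", value_col = "tpm"). For the two-column count files the value column is named after the sample, so selecting by position (id_col = 1L, value_col = 2L) is probably easier. Reading just the needed columns with select avoids parsing the columns that would be duplicated across all files.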